## Case Study Project

“Case Study Project” is an initiative that aims to promote and facilitate the use of the R programming language for data manipulation, data analysis, programming, and the computational aspects of statistical topics in authentic research applications. Specifically, it seeks to build an extensive resource database of R packages that help solve mathematical and analytical problems. Furthermore, it aims to give faculty, researchers, and students exposure to the entire thought process of approaching the computations of a complete data analysis project. This is entirely different from teaching a programming language: it illustrates how an analysis in a high-level programming language evolves.

## Case studies

Gautier Paux and Alex Dmitrienko, Introduction.

Several case studies have been created to facilitate the implementation of simulation-based Clinical Scenario Evaluation (CSE) approaches in multiple settings and help the user understand individual features of the Mediana package. Case studies are arranged in terms of increasing complexity of the underlying clinical trial setting (i.e., trial design and analysis methodology). For example, Case study 1 deals with a number of basic settings, and increasingly complex settings are considered in the subsequent case studies.

## Case study 1

This case study serves as a good starting point for users who are new to the Mediana package. It focuses on clinical trials with simple designs and analysis strategies where power and sample size calculations can be performed using analytical methods.

- Trial with two treatment arms and single endpoint (normally distributed endpoint).
- Trial with two treatment arms and single endpoint (binary endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint).
- Trial with two treatment arms and single endpoint (survival-type endpoint with censoring).
- Trial with two treatment arms and single endpoint (count-type endpoint).

## Case study 2

This case study is based on a clinical trial with three or more treatment arms. A multiplicity adjustment is required in this setting, and no analytical methods are available to support power calculations.

This example also illustrates a key feature of the Mediana package, namely the option to define custom functions; for example, it shows how the user can define a new criterion in the Evaluation Model.

Clinical trial in patients with schizophrenia

## Case study 3

This case study introduces a clinical trial with several patient populations (marker-positive and marker-negative patients). It demonstrates how the user can define independent samples in a data model and then specify statistical tests in an analysis model based on merging several samples, i.e., merging samples of marker-positive and marker-negative patients to carry out a test that evaluates the treatment effect in the overall population.

Clinical trial in patients with asthma

## Case study 4

This case study illustrates CSE simulations in a clinical trial with several endpoints and helps showcase the package’s ability to model multivariate outcomes in clinical trials.

Clinical trial in patients with metastatic colorectal cancer

## Case study 5

This case study is based on a clinical trial with several endpoints and multiple treatment arms and illustrates the process of performing complex multiplicity adjustments in trials with several clinical objectives.

Clinical trial in patients with rheumatoid arthritis

## Case study 6

This case study is an extension of Case study 2 and illustrates how the package can be used to assess the performance of several multiplicity adjustments. The case study also walks the reader through the process of defining customized simulation reports.

Case study 1 deals with a simple setting, namely, a clinical trial with two treatment arms (experimental treatment versus placebo) and a single endpoint. Power calculations can be performed analytically in this setting. Specifically, closed-form expressions for the power function can be derived using the central limit theorem or other approximations.

Several distributions will be illustrated in this case study:

- Normally distributed endpoint
- Binary endpoint
- Survival-type endpoint
- Survival-type endpoint with censoring
- Count-type endpoint

Suppose that a sponsor is designing a Phase III clinical trial in patients with pulmonary arterial hypertension (PAH). The efficacy of experimental treatments for PAH is commonly evaluated using a six-minute walk test and the primary endpoint is defined as the change from baseline to the end of the 16-week treatment period in the six-minute walk distance.

## Define a Data Model

The first step is to initialize the data model:
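A minimal sketch of this step, following the conventions of the Mediana vignettes (the object name `case.study1.data.model` is illustrative):

```r
library(Mediana)

# Initialize an empty data model; components are then added with the "+" operator
case.study1.data.model <- DataModel()
```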

After the initialization, components of the data model can be added to the DataModel object incrementally using the + operator.

The change from baseline in the six-minute walk distance is assumed to follow a normal distribution. The distribution of the primary endpoint is defined in the OutcomeDist object:
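A sketch of the corresponding component, assuming the data model object defined above:

```r
# Normally distributed primary endpoint
case.study1.data.model <- case.study1.data.model +
  OutcomeDist(outcome.dist = "NormalDist")
```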

The sponsor would like to perform power evaluation over a broad range of sample sizes in each treatment arm:

As a side note, the seq function can be used to compactly define sample sizes in a data model:
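A sketch of the sample size specification (the grid of 50 to 70 patients per arm is illustrative; the two forms shown in the comment are equivalent):

```r
# Per-arm sample sizes, defined compactly with seq
case.study1.data.model <- case.study1.data.model +
  SampleSize(seq(50, 70, 5))   # same as SampleSize(c(50, 55, 60, 65, 70))
```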

The sponsor is interested in performing power calculations under two treatment effect scenarios (standard and optimistic scenarios). Under these scenarios, the experimental treatment is expected to improve the six-minute walk distance by 40 or 50 meters compared to placebo, respectively, with the common standard deviation of 70 meters.

Therefore, the mean change in the placebo arm is set to μ = 0 and the mean changes in the six-minute walk distance in the experimental arm are set to μ = 40 (standard scenario) or μ = 50 (optimistic scenario). The common standard deviation is σ = 70.

Note that the mean and standard deviation are explicitly identified in each list. This is done mainly for the user’s convenience.

After having defined the outcome parameters for each sample, two Sample objects that define the two treatment arms in this trial can be created and added to the DataModel object:
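The outcome parameters and the two Sample objects might be sketched as follows, with one parameter set per treatment effect scenario (helper object names are illustrative):

```r
# Outcome parameters for the two treatment effect scenarios
outcome1.placebo   <- parameters(mean = 0,  sd = 70)
outcome1.treatment <- parameters(mean = 40, sd = 70)  # standard scenario
outcome2.placebo   <- parameters(mean = 0,  sd = 70)
outcome2.treatment <- parameters(mean = 50, sd = 70)  # optimistic scenario

case.study1.data.model <- case.study1.data.model +
  Sample(id = "Placebo",
         outcome.par = parameters(outcome1.placebo, outcome2.placebo)) +
  Sample(id = "Treatment",
         outcome.par = parameters(outcome1.treatment, outcome2.treatment))
```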

## Define an Analysis Model

Just like the data model, the analysis model needs to be initialized as follows:

Only one significance test is planned to be carried out in the PAH clinical trial (treatment versus placebo). The treatment effect will be assessed using the one-sided two-sample t-test:
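A sketch of the analysis model initialization and the Test component, in the style of the Mediana vignettes:

```r
case.study1.analysis.model <- AnalysisModel() +
  Test(id = "Placebo vs treatment",
       samples = samples("Placebo", "Treatment"),
       method = "TTest")
```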

According to the specifications, the two-sample t-test will be applied to Sample 1 (Placebo) and Sample 2 (Treatment). These sample IDs come from the data model defined earlier. As explained in the manual (see Analysis Model), the sample order is determined by the expected direction of the treatment effect. In this case, an increase in the six-minute walk distance indicates a beneficial effect, and a numerically larger value of the primary endpoint is expected in Sample 2 (Treatment) than in Sample 1 (Placebo). This implies that the list of samples passed to the t-test should include Sample 1 followed by Sample 2. Note that, from version 1.0.6, an option is available to indicate whether a larger numeric value is expected in Sample 2 (larger = TRUE) or in Sample 1 (larger = FALSE). By default, this argument is set to TRUE.

To illustrate the use of the Statistic object, the mean change in the six-minute walk distance in the treatment arm can be computed using the MeanStat statistic:
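A sketch of the Statistic component, assuming the analysis model object defined above:

```r
case.study1.analysis.model <- case.study1.analysis.model +
  Statistic(id = "Mean Treatment",
            method = "MeanStat",
            samples = samples("Treatment"))
```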

## Define an Evaluation Model

The data and analysis models specified above collectively define the Clinical Scenarios to be examined in the PAH clinical trial. The scenarios are evaluated using success criteria or metrics that are aligned with the clinical objectives of the trial. In this case it is most appropriate to use regular power or, more formally, marginal power . This success criterion is specified in the evaluation model.

First of all, the evaluation model must be initialized:

Secondly, the success criterion of interest (marginal power) is defined using the Criterion object:
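The initialization and the marginal power criterion together might look like the following sketch (the one-sided alpha of 0.025 is a conventional Phase III choice, assumed here):

```r
case.study1.evaluation.model <- EvaluationModel() +
  Criterion(id = "Marginal power",
            method = "MarginalPower",
            tests = tests("Placebo vs treatment"),
            labels = c("Placebo vs treatment"),
            par = parameters(alpha = 0.025))
```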

The tests argument lists the IDs of the tests (defined in the analysis model) to which the criterion is applied (note that more than one test can be specified). The test IDs link the evaluation model with the corresponding analysis model. In this particular case, marginal power will be computed for the t-test that compares the mean change in the six-minute walk distance in the placebo and treatment arms (Placebo vs treatment).

In order to compute the average value of the mean statistic specified in the analysis model (i.e., the mean change in the six-minute walk distance in the treatment arm) over the simulation runs, another Criterion object needs to be added:
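A sketch of this second Criterion, assuming the evaluation model object defined above:

```r
case.study1.evaluation.model <- case.study1.evaluation.model +
  Criterion(id = "Average Mean",
            method = "MeanSumm",
            statistics = statistics("Mean Treatment"),
            labels = c("Average Mean Treatment"))
```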

The statistics argument of this Criterion object lists the ID of the statistic (defined in the analysis model) to which this metric is applied (e.g., Mean Treatment ).

## Perform Clinical Scenario Evaluation

After the clinical scenarios (data and analysis models) and evaluation model have been defined, the user is ready to evaluate the success criteria specified in the evaluation model by calling the CSE function.

To accomplish this, the simulation parameters need to be defined in a SimParameters object:
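A sketch of the simulation parameters (the number of runs and the seed are illustrative values):

```r
# 1000 simulation runs, full processor load, fixed seed for reproducibility
case.study1.sim.parameters <- SimParameters(n.sims = 1000,
                                            proc.load = "full",
                                            seed = 42938001)
```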

The function call for CSE specifies the individual components of Clinical Scenario Evaluation in this case study as well as the simulation parameters:
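Assuming the data, analysis, and evaluation model objects sketched above, the call might look like:

```r
case.study1.results <- CSE(case.study1.data.model,
                           case.study1.analysis.model,
                           case.study1.evaluation.model,
                           case.study1.sim.parameters)
```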

The simulation results are saved in a CSE object (case.study1.results). This object contains complete information about this particular evaluation, including the data, analysis, and evaluation models specified by the user. The most important component of this object is the data frame contained in the list named simulation.results (case.study1.results$simulation.results). This data frame includes the values of the success criteria and metrics defined in the evaluation model.

## Summarize the Simulation Results


To facilitate the review of the simulation results produced by the CSE function, the user can invoke the summary function. This function displays the data frame containing the simulation results in the R console:

If the user is interested in generating graphical summaries of the simulation results (using the ggplot2 package or other packages), this data frame can also be saved to an object:
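Both uses might be sketched as follows, assuming the CSE results object from the previous step:

```r
# Display the simulation results data frame in the R console
summary(case.study1.results)

# ...or save it to an object for further processing (e.g., plotting)
case.study1.simulation.results <- summary(case.study1.results)
```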

## Generate a Simulation Report


A very useful feature of the Mediana package is the generation of a Microsoft Word-based report that summarizes the Clinical Scenario Evaluation.

To generate a simulation report, the user needs to define a presentation model by creating a PresentationModel object. This object must be initialized as follows:

Project information can be added to the presentation model using the Project object:

The user can easily customize the simulation report by defining report sections and specifying properties of summary tables in the report. The code shown below creates a separate section within the report for each set of outcome parameters (using the Section object) and sets the sorting option for the summary tables (using the Table object). The tables will be sorted by the sample size. Further, in order to define descriptive labels for the outcome parameter scenarios and sample size scenarios, the CustomLabel object needs to be used:
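Pulling the presentation-model steps together, a sketch along the lines of the Mediana vignettes (object names, the user name, and the report file name are illustrative):

```r
case.study1.presentation.model <- PresentationModel() +
  Project(username = "[Mediana's User]",
          title = "Case study 1",
          description = "Clinical trial in patients with PAH") +
  Section(by = "outcome.parameter") +
  Table(by = "sample.size") +
  CustomLabel(param = "sample.size",
              label = paste0("N = ", seq(50, 70, 5))) +
  CustomLabel(param = "outcome.parameter",
              label = c("Standard", "Optimistic"))

# Generate the Word-based report
GenerateReport(presentation.model = case.study1.presentation.model,
               cse.results = case.study1.results,
               report.filename = "Case study 1.docx")
```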

## Generate a Simulation Report

This case study will also illustrate the process of customizing a Word-based simulation report. This can be accomplished by defining custom sections and subsections to provide a structured summary of the complex set of simulation results.

## Create a Customized Simulation Report


Several presentation models will be used to produce customized simulation reports:

- A report without subsections.
- A report with subsections.
- A report with combined sections.

First of all, a default PresentationModel object (case.study6.presentation.model.default) will be created. This object will include the common components of the report that are shared across the presentation models. The project information (Project object), sorting options in summary tables (Table object) and specification of custom labels (CustomLabel objects) are included in this object:
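A sketch of this default object; the user name, description, and sample size labels are illustrative placeholders, since the actual Case study 6 parameters are not shown in this excerpt:

```r
case.study6.presentation.model.default <- PresentationModel() +
  Project(username = "[Mediana's User]",
          title = "Case study 6",
          description = "Clinical trial with several multiplicity adjustments") +
  Table(by = "sample.size") +
  CustomLabel(param = "sample.size",
              label = paste0("N = ", c(100, 120)))  # illustrative sample sizes
```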

## Report without subsections

The first simulation report will include a section for each outcome parameter set. To accomplish this, a Section object is added to the default PresentationModel object and the report is generated:
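A sketch of this step, assuming the default presentation model described above and a CSE results object for this case study (here called case.study6.results):

```r
case.study6.presentation.model1 <- case.study6.presentation.model.default +
  Section(by = "outcome.parameter")

GenerateReport(presentation.model = case.study6.presentation.model1,
               cse.results = case.study6.results,
               report.filename = "Case study 6 (no subsections).docx")
```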

## Report with subsections

The second report will include a section for each outcome parameter set and, in addition, a subsection will be created for each multiplicity adjustment procedure. The Section and Subsection objects are added to the default PresentationModel object as shown below and the report is generated:

## Report with combined sections

Finally, the third report will include a section for each combination of outcome parameter set and each multiplicity adjustment procedure. This is accomplished by adding a Section object to the default PresentationModel object and specifying the outcome parameter and multiplicity adjustment in the section’s by argument.



## A Data Science Case Study in R

Posted on March 13, 2017 by Robert Grünwald.

This article was first published on R-Programming – Statistik Service, and kindly contributed to R-bloggers.

Demanding data science projects are becoming more and more relevant, and conventional evaluation procedures are often no longer sufficient. For this reason, there is a growing need for tailor-made solutions adapted to each project's goal, often implemented in R. To provide our readers with support for their own R programming, we have carried out an example evaluation that demonstrates several applications of R.

## Data Science Projects

Approaching your first data science project can be a daunting task. Luckily, there are rough step-by-step outlines and heuristics that can help you on your way to becoming a data ninja. In this article, we review some of these guidelines and apply them to an example project in R.

For our analysis and the R programming, we will make use of the following R packages:
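The original package list is not preserved in this excerpt; the following is a reconstruction based on the functions used later in the post:

```r
library(dplyr)    # data manipulation and the %>% pipe
library(ggplot2)  # plotting
library(mgcv)     # generalized additive models (GAMs)
```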

## Anatomy of a Data Science project

A basic data science project consists of the following six steps:

- State the problem you are trying to solve. It has to be an unambiguous question that can be answered with data and a statistical or machine learning model. At least, specify: What is being observed? What has to be predicted?
- Collect the data, then clean and prepare it. This is commonly the most time-consuming task, but it has to be done in order to fit a prediction model with the data.
- Explore the data. Get to know its properties and quirks. Check numerical summaries of your metric variables, tables of the categorical data, and plot univariate and multivariate representations of your variables. By this, you also get an overview of the quality of the data and can find outliers.
- Check if any variables may need to be transformed. Most commonly, this is a logarithmic transformation of skewed measurements such as concentrations or times. Also, some variables might have to be split up into two or more variables.
- Choose a model and train it on the data. If you have more than one candidate model, apply each and evaluate their goodness-of-fit using independent data that was not used for training the model.
- Use the best model to make your final predictions.

We apply the principles to an example data set that was used in the ASA's 2009 Data Expo. The given data comprise around 120 million commercial and domestic flights within the USA between 1987 and 2008. Measured variables include departure and arrival airport, airline, and scheduled and actual departure and arrival times.

We will focus on the 2008 subset of this data. Because even this subset is 600MB, it makes sense to start with a first analysis on a random sample of the data, so that you can quickly explore and develop your code, and then periodically verify on the full data set that your results still hold.

The following commands read in our subset data and display the first three observations:
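A sketch of this step; the file name for the sampled subset is an assumption:

```r
# Read the sampled 2008 subset (assumed file name)
flights <- read.csv("flights_2008_sample.csv")
head(flights, 3)
```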

Fortunately, the ASA provides a code book with descriptions of each variable here. For example, we now know that for the variable DayOfWeek, a 1 denotes Monday, a 2 is Tuesday, and so on.

## The problem

With this data, it is possible to answer many interesting questions. Examples include:

- **Do planes with a delayed departure fly with a faster average speed to make up for the delay?**
- **How does the delay of arriving flights vary during the day? Are planes more delayed on weekends?**
- How has the market share of different airlines shifted over these 20 years?
- Are there specific planes that tend to have longer delays? What characterizes them? Maybe the age, or the manufacturer?

Additionally to these concrete questions, the possibilities for explorative, sandbox-style data analysis are nearly limitless.

Here, we will focus on the first two of these questions.

## Data cleaning

You should always check out the amount of missing values in your data. For this, we write an sapply-loop over each column in the flights data and report the percentage of missing values:
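A sketch of such a loop, assuming the data frame `flights` from above:

```r
# Percentage of missing values in each column
sapply(flights, function(x) round(100 * mean(is.na(x)), 1))
```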

We see that most variables have at most a negligible amount of missing values. However, the last five variables, starting at the CarrierDelay, show almost 80% missing values. This is usually an alarmingly high amount of missing data that would suggest dropping this variable from the analysis altogether, since not even a sophisticated imputing procedure can help here. But, as further inspection shows, these variables only apply for delayed flights, i.e. a positive value in the ArrDelay column.

When selecting only the arrival delay and the five sub-categories of delays, we see that they add up to the total arrival delay. For our analysis here, we are not interested in the delay reason, but view only the total ArrDelay as our outcome of interest.

The pipe operator %>%, by the way, is a nice feature of the magrittr package (also implemented in dplyr) that resembles the UNIX-style pipe. The following two lines mean and do exactly the same thing, but the second version is much easier to read:
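For example, the two calls below are equivalent (the threshold of 120 minutes is arbitrary, chosen only for illustration); the piped version reads left to right:

```r
library(dplyr)

# Nested call, read inside-out
summary(filter(flights, ArrDelay > 120))

# Piped call, read left to right
flights %>% filter(ArrDelay > 120) %>% summary()
```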

The pipe operator thus takes the output of the left expression, and makes it the first argument of the right expression.

We have surprisingly clean data where not much has to be done before proceeding to feature engineering.

## Explorative analyses

Our main variables of interest are:

- The date, which conveniently is already split up in the columns Year, Month, and DayOfMonth, and even contains the weekday in DayOfWeek. This is rarely the case; you mostly get a single column with a name like date and entries such as "2016-06-24". In that case, the R package lubridate provides helpful functions to efficiently work with and manipulate these dates.
- CRSDepTime, the scheduled departure time. This will indicate the time of day for our analysis of when flights tend to have higher delays.
- ArrDelay, the delay in minutes at arrival. We use this variable (rather than the delay at departure) for the outcome in our first analysis, since the arrival delay is what has the impact on our day.
- For our second question of whether planes with delayed departure fly faster, we need DepDelay, the delay in minutes at departure, as well as a measure of average speed while flying. This variable is not available, but we can compute it from the available variables Distance and AirTime. We will do that in the next section, „Feature Engineering“.

Let’s have an exploratory look at all our variables of interest.

## Flight date

Since these are exploratory analyses that you usually won’t show anyone else, spending time on pretty graphics does not make sense here. For quick overviews, I mostly use the standard graphics functions from R, without much decoration in terms of titles, colors, and such.

Since we subsetted the data beforehand, it makes sense that all our flights are from 2008. We also see no big changes between the months. There is a slight drop after August, but the remaining changes can be explained by the number of days in a month.

The day of the month shows no influence on the amount of flights, as expected. The fact that the 31st has around half the flights of all other days is also obvious.

When plotting flights per weekday, however, we see that Saturday is the most quiet day of the week, with Sunday being the second most relaxed day. Between the remaining weekdays, there is little variation.

## Departure Time

A histogram of the departure time shows that the number of flights is relatively constant from 6am to around 8pm and dropping off heavily before and after that.

## Arrival and departure delay

Both arrival and departure delay show a very asymmetric, right-skewed distribution. We should keep this in mind and think about a logarithmic transformation or some other method of acknowledging this fact later.

The structure of the third plot of departure vs. arrival delay suggests that flights that start with a delay usually don’t compensate that delay during the flight. The arrival delay is almost always at least as large as the departure delay.

To get a first overview for our question of how the departure time influences the average delay, we can also plot the departure time against the arrival delay:

Aha! Something looks weird here. There seem to be periods of times with no flights at all. To see what is going on here, look at how the departure time is coded in the data:

A departure of 2:55pm is written as an integer 1455. This explains why the values from 1460 to 1499 are impossible. In the feature engineering step, we will have to recode this variable in a meaningful way to be able to model it correctly.

## Distance and AirTime

Plotting the distance against the time needed, we see a linear relationship as expected, with one large outlier. This one point denotes a flight of 2762 miles and an air time of 823 minutes, suggesting an average speed of 201mph. I doubt planes can fly at this speed, so we should maybe remove this observation.

## Feature Engineering

Feature engineering describes the manipulation of your data set to create variables that a learning algorithm can work with. Often, this consists of transforming a variable (through e.g. a logarithm), extracting specific information from a variable (e.g. the year from a date string), or converting something like a ZIP code into a geographic region.

For our data, we have the following tasks:

- Convert the weekday into a factor variable so it doesn’t get interpreted linearly.
- Create a log-transformed version of the arrival and departure delay.
- Transform the departure time so that it can be used in a model.
- Create the average speed from the distance and air time variables.

Converting the weekday into a factor is important because otherwise, it would be interpreted as a metric variable, which would result in a linear effect. We want the weekdays to be categories, however, and so we create a factor with nice labels:
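A sketch of this conversion, using the Monday-through-Sunday coding from the code book:

```r
# Weekday as a labeled factor rather than a metric variable
flights$DayOfWeek <- factor(flights$DayOfWeek, levels = 1:7,
                            labels = c("Monday", "Tuesday", "Wednesday",
                                       "Thursday", "Friday", "Saturday", "Sunday"))
```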

## log-transform delay times

When looking at the delays, we note that there are a lot of negative values in the data. These denote flights that left or arrived earlier than scheduled. To allow a log-transformation, we set all negative values to zero, which we interpret as "on time":

Now, since there are zeros in these variables, we create the variables log(1+ArrDelay) and log(1+DepDelay):
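Both steps might be sketched as follows (the log-variable names are assumptions):

```r
# Treat early flights as "on time"
flights$ArrDelay <- pmax(flights$ArrDelay, 0)
flights$DepDelay <- pmax(flights$DepDelay, 0)

# log(1 + delay) handles the zeros
flights$LogArrDelay <- log1p(flights$ArrDelay)
flights$LogDepDelay <- log1p(flights$DepDelay)
```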

## Transform the departure time

The departure time is coded in the format hhmm, which is not helpful for modelling, since we need equal distances between equal durations of time. This way, the distance between 10:10pm and 10:20pm would be 10, but the distance between 10:50pm and 11:00pm, the same 10 minutes, would be 50.

For the departure time, we therefore need to convert the time format. We will use a decimal format, so that 11:00am becomes 11, 11:15am becomes 11.25, and 11:45 becomes 11.75.

The mathematical rule to transform the "old" time in hhmm format into a decimal format is: decimal time = floor(hhmm / 100) + (hhmm mod 100) / 60.

Here, the first part of the sum generates the hours, and the second part takes the remainder when dividing by 100 (i.e., the last two digits), and divides them by 60 to transform the minutes into a fraction of one hour.

Let’s implement that in R:
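One way to implement it, with the helper function name and the new column name as assumptions:

```r
# Convert hhmm-coded times to decimal hours
hhmm_to_decimal <- function(t) {
  t %/% 100 + (t %% 100) / 60
}

hhmm_to_decimal(c(1100, 1115, 1145))  # 11.00 11.25 11.75

# Applied to the scheduled departure time:
# flights$DecimalDepTime <- hhmm_to_decimal(flights$CRSDepTime)
```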

Of course, you should always verify that your code did what you intended by checking the results.

## Create average speed

The average flight speed is not available in the data – we have to compute it from the distance and the air time:

We have a few outliers with very high, implausible average speeds. Domain knowledge or a quick Google search can tell us that speeds of more than 800mph are not maintainable with current passenger planes. Thus, we will remove these flights from the data:
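Both steps might be sketched as follows (the column name `Speed` is an assumption):

```r
# Average speed in mph (Distance is in miles, AirTime in minutes)
flights$Speed <- flights$Distance / (flights$AirTime / 60)

# Drop implausible speeds above 800 mph
flights <- subset(flights, Speed <= 800)
```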

## Choosing an appropriate Method

For building an actual model with your data, you have the choice between two worlds, statistical modelling and machine learning.

Broadly speaking, statistical models focus more on quantifying and interpreting the relationships between input variables and the outcome. This is the case in situations such as clinical studies, where the main goal is to describe the effect of a new medication.

Machine learning methods on the other hand focus on achieving a high accuracy in prediction, sometimes sacrificing interpretability. This results in what is called "black box" algorithms, which are good at predicting the outcome, but it's hard to see how a model computes a prediction for the outcome. A classic example for a question where machine learning is the appropriate answer is the product recommendation algorithm on online shopping websites.

For our questions, we are interested in the effects of certain input variables (speed and time of day / week). Thus we will make use of statistical models, namely a linear model and a generalized additive model.

To answer our first question, we first plot the variables of interest to get a first impression of the relationship. Since these plots will likely make it to the final report or at least a presentation to your supervisors, it now makes sense to spend a little time on generating a pretty image. We will use the ggplot2 package to do this:

It seems like there is a slight increase in average speed for planes that leave with a larger delay. Let’s fit a linear model to quantify the effect:
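A sketch of the linear model fit (the model object name is an assumption):

```r
# Linear model: average speed as a function of departure delay
speed.model <- lm(Speed ~ DepDelay, data = flights)
summary(speed.model)
# The DepDelay coefficient (about 0.034 in the post) is the increase in
# average speed (mph) per minute of departure delay
```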

There is a highly significant effect of 0.034 for the departure delay. This represents the increase in average speed for each minute of delay. So, a plane with 60 minutes of delay will fly 2.04mph faster on average.

Even though the effect is highly significant with a p value of less than 0.0001, its actual effect is negligibly small.

For the second question of interest, we need a slightly more sophisticated model. Since we want to know the effect of the time of day on the arrival delay, we cannot assume a linear effect of the time on the delay. Let’s plot the data:

We plot both the actual delay and our transformed log-delay variable. The smoothing line of the second plot gives a better image of the shape of the delay. It seems that delays are highest at around 8pm, and lowest at 5am. This emphasizes the fact that a linear model would not be appropriate here.

We fit a generalized additive model (GAM) to this data. Since the response variable is right skewed, a Gamma distribution seems appropriate for the model family. To be able to use it, we have to transform the delay into a strictly positive variable, so we compute the maximum of 1 and the arrival delay for each observation first.

We again see the trend of lower delays in the morning before 6am, and high delays around 8pm. To differentiate between weekdays, we now include this variable in the model:

With this model, we can create an artificial data frame x_new, which we use to plot one prediction line per weekday:
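The weekday-adjusted fit and the per-weekday prediction grid might look like the following sketch (variable and object names are assumptions; `DecimalDepTime` is the decimal departure time derived earlier):

```r
library(mgcv)
library(ggplot2)

# Strictly positive response for the Gamma family
flights$PosArrDelay <- pmax(flights$ArrDelay, 1)

# GAM: smooth effect of departure time, adjusted for weekday
delay.gam <- gam(PosArrDelay ~ s(DecimalDepTime) + DayOfWeek,
                 family = Gamma(link = "log"), data = flights)

# One prediction line per weekday across the day
x_new <- expand.grid(DecimalDepTime = seq(5, 23, by = 0.25),
                     DayOfWeek = levels(flights$DayOfWeek))
x_new$pred <- predict(delay.gam, newdata = x_new, type = "response")

ggplot(x_new, aes(DecimalDepTime, pred, color = DayOfWeek)) +
  geom_line() +
  labs(x = "Scheduled departure time (decimal hours)",
       y = "Expected arrival delay (minutes)")
```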

We now see several things:

- The nonlinear trend over the day is the same shape on every day of the week
- Fridays are the worst days to fly by far, with Sunday being a close second. Expected delays are around 20 minutes during rush-hour (8pm)
- Wednesdays and Saturdays are the quietest days
- If you can manage it, fly on a Wednesday morning to minimize expected delays.

## Closing remarks

As noted in the beginning of this post, this analysis is only one of many questions that can be tackled with this enormous data set. Feel free to browse the data expo website and especially the "Posters & results" section for many other interesting analyses.

The post A Data Science Case Study in R first appeared on Statistik Service.


## Business Case Analysis with R

Simulation Tutorials to Support Complex Business Decisions

Author: Robert D. Brown III (Cumming, USA)

Teaches how to use the R programming language for business case analysis

Extends the analytic tool kit of financial analysts

Establishes both a quality standard and a clearly auditable structure for ensuring that avoidable modeling errors are minimized



## About this book




## Table of contents (15 chapters)

- Front Matter
- A Relief from Spreadsheet Misery
- Setting Up the Analysis
- Include Uncertainty in the Financial Analysis
- Interpreting and Communicating Insights
- It's Your Move
- "What Should I Do?"
- Use a Decision Hierarchy to Categorize Decision Types
- Tame Decision Complexity by Creating a Strategy Table
- Clearly Communicate the Intentions of Decision Strategies
- What Comes Next
- Subject Matter Expert Elicitation Guide
- "What's Your Number, Pardner?"
- Conducting SME Elicitations
- Kinds of Biases
- Information Espresso
- Setting a Budget for Making Decisions Clearly
- A More Refined Explanation of VOI
- Building the Simulation in R

This tutorial teaches you how to use the statistical programming language R to develop a business case simulation and analysis. It presents a methodology for conducting business case analysis that minimizes decision delay by focusing stakeholders on what matters most and suggests pathways for minimizing the risk in strategic and capital allocation decisions. Business case analysis, often conducted in spreadsheets, exposes decision makers to additional risks that arise just from the use of the spreadsheet environment.

R has become one of the most widely used tools for reproducible quantitative analysis, and analysts fluent in this language are in high demand. The R language, traditionally used for statistical analysis, provides a more explicit, flexible, and extensible environment than spreadsheets for conducting business case analysis.

What You’ll Learn

- Set up a business case abstraction in an influence diagram to communicate the essence of the problem to other stakeholders
- Model the inherent uncertainties in the problem with Monte Carlo simulation using the R language
- Communicate the results graphically
- Draw appropriate insights from the results
- Develop creative decision strategies for thorough opportunity cost analysis
- Calculate the value of information on critical uncertainties between competing decision strategies to set the budget for deeper data analysis
- Construct appropriate information to satisfy the parameters for the Monte Carlo simulation when little or no empirical data are available

Who This Book Is For

Financial analysts, data practitioners, and risk/business professionals; also appropriate for graduate level finance, business, or data science students

- Business case analysis
- Risk analysis
- Decision analysis
- Engineering economics
- Quantitative analysis
- R language
- Business case simulation
- Monte Carlo simulation

Book Title: Business Case Analysis with R

Book Subtitle: Simulation Tutorials to Support Complex Business Decisions

Authors: Robert D. Brown III

DOI: https://doi.org/10.1007/978-1-4842-3495-2

Publisher: Apress Berkeley, CA

eBook Packages: Professional and Applied Computing, Professional and Applied Computing (R0), Apress Access Books

Copyright Information: Robert D. Brown III 2018

Softcover ISBN: 978-1-4842-3494-5, Published: 03 March 2018

eBook ISBN: 978-1-4842-3495-2, Published: 01 March 2018

Edition Number: 1

Number of Pages: XVIII, 282

Number of Illustrations: 70 b/w illustrations

Topics: Big Data, Big Data/Analytics


## R case studies

Practice on your own, or with others

## Our case studies collection

Here we have collected R case studies and walk-throughs so that you can practice and expand your R skills. You can access them by simply creating a free Applied Epi account and clicking the appropriate link below. You will need to have R installed to follow the case studies.

Target audience

Public health practitioners, epidemiologists, clinicians, and researchers who already have basic competency in R and want additional practice or exposure to new uses of R in public health. All of our training materials focus on challenges and solutions for frontline practitioners and are built or curated by our team with extensive ground-level experience. Read more about our educational approach.

These case studies are all either built by our team, or are open-access tutorials that have been translated by our team into R from another language (e.g. from Stata or SAS). Individual credits are provided in the case studies.

Our partners

To curate this collection of case studies, we have partnered with the EPIET Alumni Network, Médecins Sans Frontières (MSF) / Doctors without Borders and TEPHINET.

https://github.com/appliedepi/emory_training

## COVID-19 - Fulton County, Georgia, USA

This case study walks the reader through analyzing COVID-19 data from Fulton County (near Atlanta, Georgia, USA). The result is an R Markdown report with data cleaning and analyses of demographics, temporal trends, spatial (GIS) mapping, etc.

Click here to access the website to download the RStudio project, data files, and to access the accompanying slides and instructions.

Please note that all these training materials use fake example data in which no person is identifiable and the actual values have been scrambled/jittered.


## Foodborne outbreak investigation - Stengen, Germany

This case study is coming soon

## GIS mapping case study - Am Timan, Chad

## Time series case study - Scotland, UK

## Getting Started with the R Programming Language

Authors: Leah A. Wasser - Adapted from Software Carpentry

Last Updated: Apr 8, 2021

R is a versatile, open source programming language that was specifically designed for data analysis. R is extremely useful for data management, statistics and analyzing data.

This tutorial should be seen more as a reference on the basics of R than as a tutorial for learning R. Here we define many of the basics, which can get overwhelming if you are brand new to R.

## Learning Objectives

After completing this tutorial, you will be able to:

- Use basic R syntax
- Explain the concepts of objects and assignment
- Explain the concepts of vector and data types
- Describe why you would or would not use factors
- Use a few basic functions

## Things You’ll Need To Complete This Tutorial

You will need the most current version of R and, preferably, RStudio loaded on your computer to complete this tutorial.

Set Working Directory: This lesson assumes that you have set your working directory to the location of the downloaded and unzipped data subsets.

An overview of setting the working directory in R can be found here.

R Script & Challenge Code: NEON data lessons often contain challenges that reinforce learned skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

## The Very Basics of R

- Open source software under a GNU General Public License (GPL) .
- A good alternative to commercial analysis tools. R has over 5,000 user contributed packages (as of 2014) and is widely used both in academia and industry.
- Available on all platforms.
- Not just for statistics, but also general purpose programming.
- Supported by a large and growing community of peers.

## Introduction to R

You can use R alone or with a user interface like RStudio to write your code. Some people prefer RStudio, as it provides a graphical interface where you can see which objects have been created, and you can also set variables like your working directory using menu options.

Learn more about RStudio with their online learning materials .

We want to use R to create code so that our workflow is more reproducible. We can document everything that we do. Our end goal is not just to "do stuff" but to do it in a way that anyone can easily and exactly replicate our workflow and results -- this includes ourselves in 3 months when the paper reviews come back!

## Code & Comments in R

Everything you type into an R script is code, unless you mark it otherwise.

Anything to the right of a # is ignored by R. Use these comments within the code to describe what your code is doing. Comment liberally in your R scripts. This will help you when you return to a script and will also help others understand your scripts and analyses.
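A minimal sketch of commented code (the object name is illustrative):

```r
# This entire line is a comment and is ignored by R
area <- 3 * 4  # everything to the right of the # is ignored too
area           # returns 12
```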

## Basic Operations in R

Let's take a few moments to play with R. You can get output from R simply by typing in math

or by typing words, with the command writeLines() . Words that you want to be recognized as text (as opposed to a field name or other text that signifies an object) must be enclosed within quotes.

We can assign our results to an object and name the object. Object names cannot contain spaces.

We can then return the value of an object we created.

Or create a new object with existing ones.

The result of the operation on the right hand side of <- is assigned to an object with the name specified on the left hand side of <- . The result could be any type of R object, including your own functions (see the Build & Work With Functions in R tutorial ).
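The operations just described can be sketched as follows (the object names weight_kg and weight_lb are illustrative, not from the original lesson):

```r
# Get output from R simply by typing in math
1 + 2               # returns 3

# Print words with writeLines(); text must be enclosed in quotes
writeLines("hello")

# Assign a result to a named object (object names cannot contain spaces)
weight_kg <- 55

# Return the value of an object by typing its name
weight_kg           # returns 55

# Create a new object from an existing one
weight_lb <- 2.2 * weight_kg
weight_lb           # returns 121
```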

## Assignment Operator: Drop the Equals Sign

The assignment operator is <- . It assigns values on the right to objects on the left. It is similar to = but there are some subtle differences. Learn to use <- as it is good programming practice. Using = in place of <- can lead to issues down the line.

## List All Objects in the Environment

Some functions are the same as in other languages. These might be familiar from the command line.

- ls() : to list objects in your current environment.
- rm() : remove objects from your current environment.

Now try them in the console.

Using rm(list = ls()) , you combine several functions to remove all objects. If you typed x in the console now, you would get Error: object 'x' not found .
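For example, assuming two throwaway objects x and y:

```r
x <- 10
y <- 20

ls()             # lists objects in the environment: "x" "y"
rm(y)            # removes y
rm(list = ls())  # removes every remaining object
# Typing x now would give: Error: object 'x' not found
```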

## Data Types and Structures

To make the best of the R language, you'll need a strong understanding of the basic data types and data structures and how to operate on those. These are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversions is one of the most common sources of frustration for beginners.

First, everything in R is an object, but there are different types of objects. One of the basic differences is in the data structures, which are the different ways data are stored.

R has many different data structures . These include:

- atomic vector
- list
- matrix
- data frame
- factors

These data structures vary by the dimensionality of the data and whether they can handle data elements of a single type ( homogeneous ) or multiple types ( heterogeneous ).

A vector is the most common and basic data structure in R and is the workhorse of R. Technically, vectors can be one of two types:

- atomic vectors
- lists

although the term "vector" most commonly refers to the atomic types, not to lists.

## Atomic Vectors

R has 6 atomic vector types:

- character
- numeric (real or decimal)
- integer
- logical
- complex
- raw (not discussed in this tutorial)

By atomic , we mean the vector only holds data of a single type.

- character : "a" , "swc"
- numeric : 2 , 15.5
- integer : 2L (the L tells R to store this as an integer)
- logical : TRUE , FALSE
- complex : 1+4i (complex numbers with real and imaginary parts)

R provides many functions to examine features of vectors and other objects, for example

- typeof() - what is it?
- length() - how long is it? What about two dimensional objects?
- attributes() - does it have any metadata?

Let's look at some examples:

A vector is a collection of elements that are most commonly character , logical , integer or numeric .

You can create an empty vector with vector() . (By default the mode is logical . You can be more explicit as shown in the examples below.) It is more common to use direct constructors such as character() , numeric() , etc.

x is a numeric vector. These are the most common kind. They are numeric objects and are treated as double precision real numbers (they can store decimal points). To explicitly create integers (no decimal points), add an L to each value (or coerce to the integer type using as.integer() ).

You can also have logical vectors.

Finally, you can have character vectors.

You can also add to a list or vector
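A sketch of creating and examining vectors (the values are illustrative):

```r
x <- c(0.5, 0.7, 1.1)   # numeric vector (double precision)
y <- c(TRUE, FALSE)     # logical vector
z <- c("a", "swc")      # character vector
i <- c(2L, 15L)         # integer vector (note the L suffix)

typeof(x)       # "double"
length(z)       # 2
attributes(x)   # NULL -- no metadata attached yet

# Add to an existing vector by combining it with new elements
z <- c(z, "another element")
length(z)       # 3
```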

More examples of how to create vectors

- x <- c(0.5, 0.7)
- x <- c(TRUE, FALSE)
- x <- c("a", "b", "c", "d", "e")
- x <- 9:100
- x <- c(1+0i, 2+4i)

You can also create vectors as a sequence of numbers.

You can also get non-numeric outputs.

- Inf is infinity. You can have either positive or negative infinity.
- NaN means Not a Number. It's an undefined value.

Try it out in the console.

Vectors have positions; these positions are ordered and can be accessed using object[index] .
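For example:

```r
x <- c(10, 20, 30)
x[1]     # first element: 10
x[2:3]   # elements 2 through 3: 20 30
```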

Objects can have attributes . Attributes are part of the object. These include:

- names : the field or variable names within the object
- attributes : these contain metadata

You can also glean other attribute-like information such as length() (works on vectors and lists) or number of characters nchar() (for character strings).

## Heterogeneous Data - Mixing Types?

When you mix types, R will create a resulting vector that is the least common denominator. The coercion will move towards the one that's easiest to coerce to.

Guess what the following do:

- m <- c(1.7, "a")
- n <- c(TRUE, 2)
- o <- c("a", TRUE)

Were you correct?

This is called implicit coercion. You can also coerce vectors explicitly using the as.<class_name> functions.
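A sketch of both kinds of coercion, using the objects from the quiz above:

```r
# Implicit coercion: mixed types collapse to a common type
m <- c(1.7, "a")   # both elements become character
n <- c(TRUE, 2)    # TRUE becomes the number 1
o <- c("a", TRUE)  # TRUE becomes the string "TRUE"

class(m)  # "character"
class(n)  # "numeric"

# Explicit coercion with the as.<class_name> functions
as.numeric("3.14")  # 3.14
as.character(99)    # "99"
as.logical(0)       # FALSE
```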

In R, matrices are an extension of the numeric or character vectors. They are not a separate type of object but simply an atomic vector with dimensions; the number of rows and columns.

Matrices in R are by default filled column-wise. You can also use the byrow argument to specify how the matrix is filled.

Assigning dim() transforms a vector into a matrix, in this case one with 2 rows and 5 columns. Another way to shape your matrix is to bind columns with cbind() or rows with rbind() .

## Matrix Indexing

We can call elements of a matrix with square brackets just like a vector, except now we must specify a row and a column.
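A sketch of the matrix operations described above:

```r
# matrix() fills column-wise by default; byrow = TRUE fills row-wise
m  <- matrix(1:6, nrow = 2, ncol = 3)
m2 <- matrix(1:6, nrow = 2, byrow = TRUE)

# Assigning dim() turns a plain vector into a matrix
x <- 1:10
dim(x) <- c(2, 5)   # now 2 rows, 5 columns

# Bind rows or columns to build a matrix
rbind(1:3, 4:6)
cbind(1:3, 4:6)

# Index with [row, column]
x[1, 3]   # element in row 1, column 3: 5
x[2, ]    # all of row 2
```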

In R, lists act as containers. Unlike atomic vectors, the contents of a list are not restricted to a single mode and can encompass any mixture of data types. Lists are sometimes called generic vectors, because the elements of a list can be of any type of R object, even lists containing further lists. This property makes them fundamentally different from atomic vectors.

A list is different from an atomic vector because each element can be a different type -- it can contain heterogeneous data types.

Create lists using list() or coerce other objects using as.list() . An empty list of the required length can be created using vector()

- What is the class of x[1] ?
- What about x[[1]] ?

Try it out.

We can also give the elements of our list names, then call those elements with the $ operator.

- What is the length of this object? What about its structure?
- Lists can be extremely useful inside functions. You can “staple” together lots of different kinds of results into a single object that a function can return.
- A list does not print to the console like a vector. Instead, each element of the list starts on a new line.
- Elements are indexed by double brackets. Single brackets will still return a(nother) list.
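A sketch of the list behaviors listed above (the element values are illustrative):

```r
# A list can mix types, including another list
x <- list(1, "a", TRUE, list(2, 3))
length(x)      # 4

# Single brackets return a (smaller) list; double brackets return the element
class(x[1])    # "list"
class(x[[1]])  # "numeric"

# Name elements and call them with $
person <- list(name = "Ada", age = 36)
person$name    # "Ada"
str(person)    # shows the structure of each element
```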

Factors are special vectors that represent categorical data. Factors can be ordered or unordered and are important for modelling functions such as lm() and glm() and also in plot() methods. Once created, factors can only contain a pre-defined set of values, known as levels .

Factors are stored as integers that have labels associated with the unique integers. While factors look (and often behave) like character vectors, they are actually integers under the hood, so you need to be careful when treating them like strings. Some string methods will coerce factors to strings, while others will throw an error.

- Sometimes factors can be left unordered. Example: male, female.
- Other times you might want factors to be ordered (or ranked). Example: low, medium, high.
- Under the hood, a factor is represented by the integers 1, 2, 3.
- Factors are better than simple integer labels because they are self-describing: male and female is more descriptive than 1s and 2s, which is helpful when there is no additional metadata.

Which is male? 1 or 2? You wouldn't be able to tell with just integer data. Factors have this information built in.

Factors can be created with factor() . Input is often a character vector.

table(x) will return a frequency table counting the number of elements in each level.

If you need to convert a factor to a character vector, simply use as.character() .

To see the integer version of the factor levels, use as.numeric() .

To convert a factor to a numeric vector, go via a character vector: compare as.numeric(x) with as.numeric(as.character(x)) .

In modeling functions, it is important to know what the 'baseline' level is. This is the first factor level, but by default the ordering is determined by the alphanumerical order of the elements. You can change this by specifying the levels (another option is to use the function relevel() ).
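A sketch of the factor operations described above (the level names are illustrative):

```r
x <- factor(c("low", "high", "medium", "high", "low"))
table(x)             # frequency count of each level

as.character(x)      # back to a character vector
as.numeric(x)        # the underlying integer codes

# When the labels themselves are numbers, go via character
f <- factor(c("10", "20", "20"))
as.numeric(as.character(f))  # 10 20 20 (not the codes 1 2 2)

# Control the baseline and ordering by specifying levels
x <- factor(x, levels = c("low", "medium", "high"))
levels(x)            # "low" "medium" "high"
```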

## Data Frames

A data frame is a very important data type in R. It's pretty much the de facto data structure for most tabular data and what we use for statistics.

- A data frame is a special type of list where every element of the list has the same length.
- Data frames can have additional attributes such as rownames() , which can be useful for annotating data, like subject_id or sample_id . But most of the time they are not used.

Some additional information on data frames:

- Usually created by read.csv() and read.table() .
- Can convert to matrix with data.matrix() (preferred) or as.matrix()
- Coercion will be forced and not always what you expect.
- Can also create with data.frame() function.
- Find the number of rows and columns with nrow(dat) and ncol(dat) , respectively.
- Rownames are usually 1, 2, ..., n.

## Manually Create Data Frames

You can manually create a data frame using data.frame .
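A sketch (the column names and values are made up):

```r
dat <- data.frame(id = letters[1:4],
                  x  = 1:4,
                  y  = c(2.5, 3.1, 4.8, 6.0))

nrow(dat)     # 4
ncol(dat)     # 3

# A data frame really is a special kind of list
is.list(dat)  # TRUE
class(dat)    # "data.frame"
```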

## Useful Data Frame Functions

- head() - shows first 6 rows
- tail() - show last 6 rows
- dim() - returns the dimensions
- nrow() - number of rows
- ncol() - number of columns
- str() - structure of each column
- names() - shows the names attribute for a data frame, which gives the column names.

See that it is actually a special type of list:

Instead of a list of single items, a data frame is a list of vectors!

A recap of the different data types

A function is an R object that takes inputs to perform a task. Functions take in information and may return desired outputs.

output <- name_of_function(inputs)

All functions come with a help screen. It is critical that you learn to read the help screens, since they provide important information on what the function does, how it works, and usually sample examples at the very bottom. You can use help(function) or, more simply, ?function ; to search across all help files, use ??function .
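For instance, with a hypothetical conversion function (not from the original lesson):

```r
# Define a function: inputs go in, an output is returned
fahr_to_celsius <- function(temp_f) {
  (temp_f - 32) * 5 / 9
}

output <- fahr_to_celsius(212)
output       # 100

# Open the help screen for a built-in function
help(mean)   # equivalently: ?mean
```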

You can't ever learn all of R as it is ever changing with new packages and new tools, but once you have the basics and know how to find help to do the things that you want to do, you'll be able to use R in your science.

## Sample Data

R comes with sample datasets. You will often find these as the data sets used in documentation files or in responses to inquiries on public forums like StackOverflow . To see all available sample datasets, type data() into the console.

## Packages in R

R comes with a set of functions or commands that perform particular sets of calculations. For example, in the equation 1+2 , R knows that the "+" means to add the two numbers, 1 and 2 together. However, you can expand the capability of R by installing packages that contain suites of functions and compiled code that you can also use in your code.


## How to Write a Case Statement in R (With Example)

A case statement is a type of statement that goes through conditions and returns a value when the first condition is met.

The easiest way to implement a case statement in R is by using the case_when() function from the dplyr package:
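The call looks like this ( df and col1 are placeholder names for your data frame and column):

```r
library(dplyr)

df %>%
  mutate(new_column = case_when(
    col1 < 9  ~ "value1",
    col1 < 12 ~ "value2",
    col1 < 15 ~ "value3",
    TRUE      ~ "value4"   # TRUE acts as the "else" branch
  ))
```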

This particular function looks at the value in the column called col1 and returns:

- “ value1 ” if the value in col1 is less than 9
- “ value2 ” if the value in col1 is less than 12
- “ value3 ” if the value in col1 is less than 15
- “ value4 ” if none of the previous conditions are true

Note that TRUE is equivalent to an “else” statement.

The following example shows how to use this function in practice.

## Example: Case Statement in R

Suppose we have the following data frame in R:

We can use the following syntax to write a case statement that creates a new column called class whose values are determined by the values in the points column:
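A sketch with made-up data (the original article's values are not shown here):

```r
library(dplyr)

df <- data.frame(player = c("A", "B", "C", "D"),
                 points = c(7, 10, 14, 20))

df <- df %>%
  mutate(class = case_when(
    points < 9  ~ "Bad",
    points < 12 ~ "OK",
    points < 15 ~ "Good",
    TRUE        ~ "Great"
  ))

df
#   player points class
# 1      A      7   Bad
# 2      B     10    OK
# 3      C     14  Good
# 4      D     20 Great
```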

The case statement looked at the value in the points column and returned:

- “ Bad ” if the value in the points column was less than 9
- “ OK ” if the value in the points column was less than 12
- “ Good ” if the value in the points column was less than 15
- “ Great ” if none of the previous conditions are true

The new column is called class , since this is the name we specified in the mutate() function.

## Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Use If Statement with Multiple Conditions in R How to Write a Nested If Else Statement in R How to Write Your First tryCatch() Function in R

