Chapter 10 Dplyr - Data Manipulation | Tabular Data Analysis With R ...

  • Tabular data analysis with R and Tidyverse: NHANES Environmental Health Datasets
  • Preamble
    • Learning goals
    • Software used during this tutorial
  • 1 Introduction
    • 1.1 Software installation
    • 1.2 Installing R packages
    • 1.3 Datasets: NHANES
      • 1.3.1 NHANES 2015-2016
    • 1.4 Datasets: included in R
  • 2 How R works
    • 2.1 R is a software
    • 2.2 R is a language
      • 2.2.1 Classic R vs Tidyverse
    • 2.3 Working with R: objects and workspace
  • 3 Getting started
    • 3.1 Launch RStudio
    • 3.2 Organize with an RStudio project
    • 3.3 Creating an R script
      • 3.3.1 Script Editor
      • 3.3.2 Comments
      • 3.3.3 Executing commands
    • 3.4 Working directory
  • 4 Working with R
    • 4.1 Creating R objects
    • 4.2 Functions and their arguments
    • 4.3 Built-in functions
      • 4.3.1 list: ls()
      • 4.3.2 class()
      • 4.3.3 combine: c()
      • 4.3.4 length()
      • 4.3.5 Working directory: getwd() and setwd()
    • 4.4 Getting help
    • 4.5 Vectorisation
    • 4.6 More complex data
      • 4.6.1 Vectors
      • 4.6.2 Matrix
      • 4.6.3 Combining vectors to create a matrix
    • 4.7 Dataframes
      • 4.7.1 Dataframe manipulation
    • 4.8 Generating data
      • 4.8.1 Regular sequences
      • 4.8.2 Repeat and sequence functions:
      • 4.8.3 Levels: gl() and expand.grid()
      • 4.8.4 Random numbers
    • 4.9 Conditional statements
      • 4.9.1 Function ifelse()
    • 4.10 Simple graphics with plot()
  • 5 Working with tabular data in R
    • 5.1 Airquality dataset
    • 5.2 Exploring airquality
    • 5.3 Subsetting
    • 5.4 Base R Graphics exploration
    • 5.5 Boxplots
    • 5.6 Scatter plots
    • 5.7 Simple linear regression
    • 5.8 Fancier Graphics exploration
      • 5.8.1 Boxplots
      • 5.8.2 Scatter plots
  • 6 Importing data
    • 6.1 Importing from local files
    • 6.2 Downloading Nhanes data
      • 6.2.1 PFAS_I
    • 6.3 Exploring PFAS_I data
      • 6.3.1 PFAS_I boxplot
      • 6.3.2 PFAS_I histogram
      • 6.3.3 Fancier boxplot with qplot
    • 6.4 Merging data files
      • 6.4.1 Merge() function
      • 6.4.2 Merging demographics data
  • 7 Creatinine adjustment
    • 7.1 Creatinine data
      • 7.1.1 Downloading, merging PFAS and creatinine
    • 7.2 Analyte measurement units
    • 7.3 Reduced set
    • 7.4 Computing Analyte / Creatinine ratio
    • 7.5 Exposure - Outcome
      • 7.5.1 Illusions
      • 7.5.2 qplot version
    • 7.6 Creating a master data file
  • 8 Tidyverse: another R Universe
    • 8.1 Magrittr - pipe and pipelines
    • 8.2 Tibble
    • 8.3 dplyr - overview
      • 8.3.1 Demo 1: all together pipeline
  • 9 Intermission: data wrangling
    • 9.1 Part 3 here
  • 10 dplyr - data manipulation
    • 10.1 selecting columns
    • 10.2 Filtering rows
    • 10.3 Arrange data
    • 10.4 mutating data
      • 10.4.1 mutate with conditional statement
    • 10.5 Summarising and grouping data
    • 10.6 Recoding: string replacement
    • 10.7 Getting it all together
      • 10.7.1 Example 1: by gender
      • 10.7.2 Example 2: by gender and age
      • 10.7.3 Base R Bar plot
      • 10.7.4 ggplot2 versions
  • 11 ggplot2
    • 11.1 Tutorials
    • 11.2 ggplot2 using dplyr chapter results
      • 11.2.1 Barplot with qplot / ggplot
      • 11.2.2 Error bars and meanTChol
  • 12 Using NHANES weights
    • 12.1 Header comments and packages
    • 12.2 Acquiring NHANES data
    • 12.3 Data wrangling: renaming and selecting data
      • 12.3.1 Renaming columns
      • 12.3.2 Selecting columns
      • 12.3.3 Changing variable status to a factor
      • 12.3.4 Adding the weight information
      • 12.3.5 Statistics
  • 13 Markdown and Reproducible research
    • 13.1 Markdown
      • 13.1.1 Markdown syntax
    • 13.2 R markdown magic
      • 13.2.1 Before your start
      • 13.2.2 How to create an R markdown file
      • 13.2.3 Adding R code
      • 13.2.4 Very tiny Rmd file: Inline code
    • 13.3 Other formats
    • 13.4 A word on YAML
      • 13.4.1 Limits
      • 13.4.2 Indentation and White space
      • 13.4.3 Automatic modifications
      • 13.4.4 Quotes
      • 13.4.5 Date
      • 13.4.6 YAML resources
  • 14 Report-template
    • 14.1 Overall template format
    • 14.2 YAML example
    • 14.3 General chunk options
    • 14.4 Preamble, Preface and Introduction
    • 14.5 Activating packages
    • 14.6 Live web links
    • 14.7 Embedding graphs
    • 14.8 Inline code
    • 14.9 Math formula
    • 14.10 Addendum
  • 15 Report resources
    • 15.1 Illustrations
      • 15.1.1 Adding and sizing images
    • 15.2 Markdown tables
  • Appendix
  • A The story of R
  • B Simple math
    • B.1 Arithmetic operators
    • B.2 Boolean values
    • B.3 Rational operators
    • B.4 Logical operators
  • C Import NHANES sample code
  • D Merge Downloads into a Master file
    • D.1 Download into R object with NHANES code
    • D.2 Combine files into Master
    • D.3 Save/Write Master file to disk
    • D.4 Alternate download to R object with haven
    • D.5 Download, save XPT files to hard drive
  • E PFAS_I codes
  • F Perfluoroalkyl and polyfluoroalkyl
  • G ggplot2 tutorials online
  • H Rmarkdown resources
  • I The Story of Vector V: an R markdown example
  • About the authors
  • Acknowledgments
    • I.1 R packages used for the book
    • I.2 Extra Icons used:
      • I.2.1 Exercise / Homework
      • I.2.2 Study at home:
  • References
  • Published with bookdown

Tabular data analysis with R and Tidyverse: Environmental Health

Chapter 10 dplyr - data manipulation

While base R has many tools that can do the job, dplyr and other Tidyverse packages can easily work together and allow the easy creation of pipelines to accomplish a task as was demonstration in the previous chapter 8.3.1.

In this chapter we’ll explore a few dplyr verbs

  • select to choose columns
  • filter to check data on rows
  • arrange to order data
  • mutate to compute new values
  • summarise to create condensed data
  • group_by to select specific data

In addition we’ll learn about some conditional selection within the functions described by these verbs.

We’ll use the Master4 large data file that was created previously. (See all download and merge code in Appendix D if you need to recreated it.) Master4 is a merge, in order of the following datasets (links are to the documentation):

DEMO_I, BMI_I, PFAS_I, TCHOL_I, ALB_CR_I

10.1 selecting columns

To limit the output length most examples will be piped in the head() function to only print the column headers followed by 6 data lines. %>% head() may terminate the command but the pipeline can be extended further.

The selection of columns is easy and just requires the name of the data frame containing the data and specifying the column names that we want to keep for further use in an analysis. The total length (number of rows) would be printed, here limited by piping into head().

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% head() SEQN LBXMFOS URXUCR BMXBMI 1 83732 NA 41 27.8 2 83733 NA 181 30.8 3 83734 NA 70 28.8 4 83735 NA 102 42.4 5 83736 0.6 315 20.3 6 83737 NA 64 28.6

We can note that in column LBXMFOS there are 5 NA, in the next section we’ll see how to get rid of them.

Note that the square bracket subsetting also works, for example to select the first 7 columns we could write select(Master4, 1:7) which would be equivallent to Master4[, 1:7] in this case.

10.2 Filtering rows

The filtering of rows depends on the desired outcome. One step that is useful and often necessary is to remove the NA values. If we continue with the selected columns we can remove those with:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% head() %>% filter(!is.na(LBXMFOS)) SEQN LBXMFOS URXUCR BMXBMI 1 83736 0.6 315 20.3

Since the head() function was first, we are only filtering the first 6 rows, and 5 are therefore eliminated. If we move the head() function after, we’ll select the first 6 rows that do not have any NA.

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% filter(!is.na(LBXMFOS)) %>% head() SEQN LBXMFOS URXUCR BMXBMI 1 83736 0.6 315 20.3 2 83745 0.8 178 25.0 3 83750 1.9 81 24.1 4 83754 5.4 148 43.7 5 83762 0.4 317 38.0 6 83767 1.0 65 26.3

We can use other operator (Appendix B) to filter the rows with conditional statements. For example we could ask to keep the BMI values below or equal to 25.0 and more, and at the same time removing NA values from chosen columns:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% filter(BMXBMI <= 25.0) %>% head() SEQN LBXMFOS URXUCR BMXBMI 1 83736 0.6 315 20.3 2 83738 NA 100 18.1 3 83739 NA 25 15.7 4 83745 0.8 178 25.0 5 83746 NA 34 16.1 6 83748 NA 14 16.1

We could have the filter() function more than once within the pipeline, but we can also have multiple statements at the same time:

select(Master4, SEQN, LBXMFOS, URXUCR, BMXBMI) %>% filter(!is.na(BMXBMI), BMXBMI <= 25.0, !is.na(LBXMFOS), URXUCR == 25.0) %>% head() SEQN LBXMFOS URXUCR BMXBMI 1 85253 2.8 25 24.2 2 86086 1.1 25 20.3 3 88829 0.5 25 20.1 4 89326 0.7 25 22.9 5 89399 1.1 25 24.0 6 90178 0.6 25 20.4

Note on style: writing each conditional statement on a separate line makes it easier to read and understand the code.

The dyplr function drop_na() would also remove NA from all rows, or specified rows.

10.3 Arrange data

For dplyr to arrange data is to sort data in order. The following example accomplishes many things at the same time and the “seed” makes the results always the same. To see just 6 lines of ordered data based on age, the data is first randomly sampled to keep only 100 rows with a dyplr function sample_n(). All rows in the age column RIDAGEYR without value are removed with drop_na(). Four columns are selected, and the age is filtered to keep only ages above 12. The data is then arranged (ordered) by the first (age) and then second column (creatinine) as designated. The data is first ordered by age and then by the second parameter.

set.seed(18) ; Master4 %>% sample_n(100) %>% drop_na(RIDAGEYR) %>% select(SEQN, URXUCR, BMXBMI, RIDAGEYR) %>% filter(RIDAGEYR >= 12) %>% arrange(RIDAGEYR, URXUCR) %>% head() SEQN URXUCR BMXBMI RIDAGEYR 1 89664 114 16.1 12 2 91777 148 20.3 13 3 90183 330 26.2 13 4 92077 116 17.8 14 5 84253 144 20.6 14 6 92721 161 25.2 14

Note that the order in the pipe may be important, but the modularity of the code makes it easy to try if there is an effect. For example, since we are only looking at 6 values here, it makes no difference if the sampling of 100 rows occurs before or after the dropping of NA in the age column.

10.4 mutating data

As we explored in the previous demonstration (8.3.1) we can create a new column when computing data with the mutate() function. As an example we can compute a ratio of \(\frac{height}{weight}\) as it is often a useful measure. In the BMX_I data we can find the relevant column names:

  • BMXWT - Weight (kg)
  • BMXHT - Standing Height (cm)
  • BMXBMI - Body Mass Index (kg/m**2)

We can then make a selection of columns and compute the ratio which will be saved in a new column with the name that we define for example HWRATIO. As an example the data is then rounded to just 1 decimal point as most of the rest of the data:

Master4 %>% select(SEQN, BMXWT, BMXHT, BMXBMI) %>% mutate(HWRATIO = BMXHT / BMXWT, HWRATIO2 = round(HWRATIO, digit=1)) %>% head() SEQN BMXWT BMXHT BMXBMI HWRATIO HWRATIO2 1 83732 94.8 184.5 27.8 1.946203 1.9 2 83733 90.4 171.4 30.8 1.896018 1.9 3 83734 83.4 170.1 28.8 2.039568 2.0 4 83735 109.8 160.9 42.4 1.465392 1.5 5 83736 55.2 164.9 20.3 2.987319 3.0 6 83737 64.4 150.0 28.6 2.329193 2.3

10.4.1 mutate with conditional statement

When analyzing tabular data we may have the choice will be made for each row. Here is an example found online: Creating New Variables in R with mutate() and ifelse()32.

# Create 3 vectors section <- c("MATH111", "MATH111", "ENG111") grade <- c(78, 93, 56) student <- c("David", "Kristina", "Mycroft") # Combine vectors into a dataframe gradebook <- data.frame(section, grade, student) # Test of grade level and assign an outcome mutate(gradebook, Pass.Fail = ifelse(grade > 60, "Pass", "Fail")) section grade student Pass.Fail 1 MATH111 78 David Pass 2 MATH111 93 Kristina Pass 3 ENG111 56 Mycroft Fail # Results are printed out

By “nesting” multiple ifelse() statements we can manage to provide multiple choices: the second argument is replaced by another ifelse() statements in succession. The final iteration has a single option that is not an ifelse() statements. In the example below, from the same source, the gradebook is read one row at a time, and then we need to provide a grade based on the letter system: A through D or F.

mutate(gradebook, letter = ifelse(grade %in% 60:69, "D", ifelse(grade %in% 70:79, "C", ifelse(grade %in% 80:89, "B", ifelse(grade %in% 90:99, "A", "F"))))) section grade student letter 1 MATH111 78 David C 2 MATH111 93 Kristina A 3 ENG111 56 Mycroft F

This makes use of the %in% operator that can easily be understood as being able to test if a value is within the proposed range. On the first line we ask if the test of column grade for each row is between 60 and 69. If the answer is “yes” which means the test is TRUE, then the value will be “D”. If the test is FALSE we’ll go to the next question and so on until the last line where the final choice is F. (Note that for readability the code is indented, and “F” is the alternate option for the test grade %in% 90:99.)

We’ll use this statement is a very useful way in a future section (10.7.)

10.5 Summarising and grouping data

The following data can be found in the DEMO_I demographic data file:

  • RIAGENDR: 1-male, 2-female (none missing)
  • DMDMARTL: 1 Married, 2 Widowed, etc.
  • DMDEDUC2 - Education level - Adults 20+
  • INDHHIN2 - Annual household income

We want to get some information by group, in this case we’ll group by marital status and then count the total number of observations for each case and we’ll store this in column Counts. For each marital status code, we’ll then count how many women and men are in each category that we’ll report in columns Mnum and Wnum (the sum of these 2 on each line should be equal to the reported Counts column.)

For simplicity with income and education we’ll simply compute the mean of the codes, which should still give us an indication of the level of income and education for each category of marital status.

We’ll place the results in a user-defined R object so that we can reuse it in the next section without re-writing the complete pipeline. Let’s call it Xsum for example

Xsum <- Master4 %>% select(SEQN, RIAGENDR, DMDMARTL, DMDEDUC2, INDHHIN2) %>% drop_na() %>% group_by(DMDMARTL) %>% summarise(Counts = n() , Mnum = sum(RIAGENDR == 1), Wnum = sum(RIAGENDR == 2), MeanIncCode = mean(INDHHIN2), MeanEducCode = mean(DMDEDUC2)) Xsum # print output # A tibble: 8 × 6 DMDMARTL Counts Mnum Wnum MeanIncCode MeanEducCode <dbl> <int> <int> <int> <dbl> <dbl> 1 1 2792 1481 1311 12.5 3.52 2 2 401 106 295 11.1 3.00 3 3 597 239 358 10.1 3.49 4 4 184 66 118 9.48 2.80 5 5 999 475 524 11.6 3.60 6 6 533 270 263 10.7 3.25 7 77 2 1 1 42 4 8 99 1 0 1 99 9

10.6 Recoding: string replacement

The above results are OK but it would be nice to be able to change some of the code numbers to actual English word such as “married” or “single” in text. This can be accomplished in many ways in dplyr but one of the simplest is to use recode() as a special case within mutate() to overwrite a column.

The DMDMARTL codes from NHANES DEMO_I
Code or Value Value Description Count Cumulative
1 Married 2886 2886
2 Widowed 421 3307
3 Divorced 614 3921
4 Separated 192 4113
5 Never married 1048 5161
6 Living with partner 555 5716
77 Refused 2 5718
99 Don’t Know 1 5719
. Missing 4252 9971

We can then recode the DMDMARTL column of Xsum with the following pipeline:

Xsum2 <- Xsum %>% mutate(DMDMARTL = recode(DMDMARTL, `1` = "Married", `2` = "Widowed", `3` = "Divorced", `4` = "Separated", `5` = "Nevermarried", `6` = "Living with partner", `77` = "Refused", `99` = "Don' Know") ) Xsum2 # print out # A tibble: 8 × 6 DMDMARTL Counts Mnum Wnum MeanIncCode MeanEducCode <chr> <int> <int> <int> <dbl> <dbl> 1 Married 2792 1481 1311 12.5 3.52 2 Widowed 401 106 295 11.1 3.00 3 Divorced 597 239 358 10.1 3.49 4 Separated 184 66 118 9.48 2.80 5 Nevermarried 999 475 524 11.6 3.60 6 Living with partner 533 270 263 10.7 3.25 7 Refused 2 1 1 42 4 8 Don' Know 1 0 1 99 9

(Credit: this example inspired by How to Recode a Column with dplyr in R33)

In fact 2 of the columns are not well named and we should rename Mnum and Wnum to simply Men and Women in the table. This is done with the dplyr rename() function:

Xsum2 %>% rename(Men = Mnum, Women = Wnum) # A tibble: 8 × 6 DMDMARTL Counts Men Women MeanIncCode MeanEducCode <chr> <int> <int> <int> <dbl> <dbl> 1 Married 2792 1481 1311 12.5 3.52 2 Widowed 401 106 295 11.1 3.00 3 Divorced 597 239 358 10.1 3.49 4 Separated 184 66 118 9.48 2.80 5 Nevermarried 999 475 524 11.6 3.60 6 Living with partner 533 270 263 10.7 3.25 7 Refused 2 1 1 42 4 8 Don' Know 1 0 1 99 9

10.7 Getting it all together

We now know enough that we should be able to get it all together in an annotated pipeline starting with Master4.

10.7.1 Example 1: by gender

Let’s try to answer the question: “What is the average level of total cholesterol in men and women from the Master4 dataset.?

To answer the question we’ll need the following:

  1. create an object to contain the results and redirect to it
  2. Start with Master4 (inject data in the pipeline)
  3. Select relevant columns
  4. Remove NA rows
  5. Group by gender
  6. Summarize.

We now know how to do this in a data stream with a pipeline. Let’s write it and add comment lines so we can remember later the purpose of the code:

TcholGender <- Master4 %>% # select columns select(SEQN, RIAGENDR, LBXTC ) %>% # fitler all rows to remove NAs drop_na() %>% # Group by gender group_by(RIAGENDR) %>% summarise( Men = sum(RIAGENDR == 1), Women = sum(RIAGENDR == 2), MeanTChol = mean(LBXTC)) # Print results: TcholGender # A tibble: 2 × 4 RIAGENDR Men Women MeanTChol <dbl> <int> <int> <dbl> 1 1 3545 0 178. 2 2 0 3711 183.

The final table will be short but that is exactly what a summary should be.

The value for men is 177.75 based on a total of 3545 observations.

For women the we see is 182.65 based on a total of 3711 observations.

(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)

10.7.2 Example 2: by gender and age

Let’s make things a bit more interesting with the question:

Base on gender and age group, what is the mean and standard deviation of PFAS compounds, as well as the mean, standard deviation, minimum, and maximum values of total cholesterol in men and women?

One of the key word is “age group” as in the NHANES data age is a single integer number with a range from 1 to 80 in RIDAGEYR column and therefore it is “up to us” to create the age groups! This can be done rather easily within a combination of mutate() and with nested ifelse() statements within. Nesting the ifelse() function within itself allows making multiple choices. (For a refresher on ifelse() see section 10.4.1.)

We’ll have to use the appropriate columns of data, compute the creatinine adjustment and summarize these values by age group for men and women.

A new base R function is introduced here formatC() which will force the numbers to be printed in scientific exponent for better clarity. (Try without and see the difference! - this option was suggested on Stack Overflow.)

The pipeline below is annotated to specify the function of each line and the results are saved within a user-defined R object Example2.

Example2 <- Master4 %>% # select columns select(SEQN, RIAGENDR, RIDAGEYR, LBXMFOS, URXUCR, LBXTC ) %>% # fitler all rows to remove NAs drop_na() %>% # Creatinine adjustment mutate (RATIO = (LBXMFOS/URXUCR)*10^-4) %>% # categorize ages in 5 groups: # Children: G0TO18, younger adults: G19TO35, # and older adults: G36TO65, seniors: G66TO79, # and 80 and older: G80. mutate(AGEGROUP = ifelse(RIDAGEYR %in% 0:18, "G0TO18", ifelse(RIDAGEYR %in% 19:35, "G19TO35", ifelse(RIDAGEYR %in% 36:65, "G36TO65", ifelse(RIDAGEYR %in% 66:79, "G66TO79", "G80"))))) %>% # # NOTE: this would be a place to split the pipeline in 2 sections # # Group by gender first, age second group_by(RIAGENDR, AGEGROUP) %>% # Summarize: count totals, summarise( Men = sum(RIAGENDR == 1), Women = sum(RIAGENDR == 2), MeanPFASR = formatC(mean(RATIO), format = "e", digits = 2), sdPFASR = formatC(sd(RATIO), format = "e", digits = 2), MeanTChol = mean(LBXTC), sdTChol = sd(LBXTC), minTChol = min(LBXTC), maxTChol = max(LBXTC) ) `summarise()` has grouped output by 'RIAGENDR'. You can override using the `.groups` argument. # Print out the results Example2 # A tibble: 10 × 10 # Groups: RIAGENDR [2] RIAGENDR AGEGROUP Men Women MeanPFASR sdPFASR MeanTChol sdTChol minTChol <dbl> <chr> <int> <int> <chr> <chr> <dbl> <dbl> <dbl> 1 1 G0TO18 175 0 8.47e-07 8.68e-07 152. 28.6 94 2 1 G19TO35 235 0 1.68e-06 1.88e-06 180. 35.6 98 3 1 G36TO65 382 0 3.24e-06 5.99e-06 198. 46.7 111 4 1 G66TO79 117 0 3.95e-06 3.92e-06 169. 35.1 90 5 1 G80 44 0 5.12e-06 6.42e-06 165. 35.0 101 6 2 G0TO18 0 141 7.70e-07 7.88e-07 158. 29.5 103 7 2 G19TO35 0 238 9.68e-07 1.31e-06 176. 38.5 100 8 2 G36TO65 0 459 2.35e-06 3.80e-06 201. 36.3 106 9 2 G66TO79 0 124 4.58e-06 6.32e-06 202. 48.5 84 10 2 G80 0 53 5.86e-06 6.29e-06 187. 34.1 126 # … with 1 more variable: maxTChol <dbl>

The print-out might be truncated (depending on the format of this document.)

As expected the output is a tibble. A subtle but important note is in the second line of the output: # Groups: RIAGENDR [2]. It may be important at a later stage to use the dplyr function ungroup() to modify e.g. the column name or the values within that column.

The highest cholesterol value is 545 and is for a person of gender code 1 in age group G36TO65.

(See section 13.2.4 later to learn how these numbers were embedded within the text automatically without copy/paste!)

10.7.2.1 Sorting / arranging

The Example2 is a summary table (in tibble / data frame format) and can be further sorted with the arrange() function. For example to see a table with the cholesterol levels:

Example2 %>% select(RIAGENDR, AGEGROUP, maxTChol) %>% arrange(-maxTChol) %>% head(2) # A tibble: 2 × 3 # Groups: RIAGENDR [2] RIAGENDR AGEGROUP maxTChol <dbl> <chr> <dbl> 1 1 G36TO65 545 2 2 G66TO79 358

10.7.3 Base R Bar plot

This summary table can be used with the base R function barplot() to create a representation of the mean cholesterol data. The default plot yields just gray bars and not horizontal label. Here are examples plotted together on a single graph that show the effect of some options.

As a reminder:

  • las=2 is rotating the labels (seen in section 6.3.)
  • col = RIAGENDR + 1 selects two of the nine colors in the base R color palette (see colors in section 5.5.) Since RIAGENDR values are 1 and 2 the resulting colors would be black and red which is not very appealing. Adding + 1 will choose red and green. Other numbers will select the next colors in the list.
  • names.arg= specifies the source of the horizontal axis names displayed.

These commands use the with() function but could also be written with the $ method, for example with Example2$AGEGROUP.

par(mfrow = c(1,2)) # Plain version with(Example2,barplot(MeanTChol)) # Add white/gray alternate colors, rotated horiz labels with(Example2,barplot(MeanTChol, col = RIAGENDR + 1, names.arg=AGEGROUP, las =2 ))

par(mfrow = c(1,1))

Error bars? There are methods in base R to add error bars and various examples can be found online. However, the methods are all quirky and the best is now to use ggplot to create such graphs.

10.7.4 ggplot2 versions

Example plots have been moved to section 11.2.

  1. https://rpubs.com/daranzolin/mutateifelse↩︎

  2. https://cmdlinetips.com/2019/04/how-to-recode-a-column-with-dplyr-in-r/↩︎

All illustrations and materials CC-BY-NC unless noted otherwise.

Tag » How To Group Ages In Rstudio