Data Wrangling with dplyr and tidyr

Intro to the ‘tidyverse’

Now that we have dipped our feet into plots and stats in R, I think you are getting a better sense of the fact that ‘wrangling’ or ‘manipulating’ data is one of the biggest steps to becoming proficient in R and all that it has to offer.

For example, for any given analysis, you may have to subset data, filter out certain data that don’t meet some criteria, focus in on a select set of variables of interest, calculate means, and variance for different groups, etc. etc.

These tasks are where the packages dplyr, tidyr and other packages in the tidyverse–a series of packages designed for all kinds of data tasks. This also includes the popular ggplot2 package for graphics.

The tidyverse packages are constructed by Hadley Wickam. There are several books that cover how to use these packages, including R for Data Science which is available for free as an online book

In this module, we’ll be learning some functions from the packages dplyr and tidyr.

We will do this by playing with data from the World Bank.

Installing and loading packages we need for this module

One can install each package separately, but you can also just install all “tidyverse” packages simply by running this command:

install.packages("tidyverse")
install.packages("wbstats")

Note that this simply downloads the packages onto your computer. When you are ready to use them, you will have to load the package onto the environment by running the function

You now have the package downloaded on your computer, but to actually use it, you have to load the package. We can load the entire tidyverse package (or, if you prefer, you can just load the tidyr package).

library(wbstats)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Two important thing to notice here. First, the message tells you what packages were actually loaded as part of the tidyverse “metapackage”. You see that this includes 8 packages: ggplot2,tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. Second, the message tells you that there are two functions in the dplyr package that conflict with existing functions: filter() and lag(). This is sometimes very important to know! This means that the filter() function works differently before and after loading this package.

Some things to know about getting started with ‘tidyverse’

Pipe Operator (%>%): tidyverse makes use of the pipe operator %>%, which allows you to carry over the output of one function to the next function. This can make series of data manipulation sequences much more efficient.

Tibbles: “tibble” is a special class of dataframe that is used in tidyverse. It is largely the same as a dataframe but it has some features (or rather, lack of features) that make for ‘defensive coding’. That is, it forces you to avoid dangerous operations, such as changing variable names or types (you have to explicitly do this) or allow “partial matching”.

To learn more about tibbles, start here

2. Working with tidyr and dplyr

2.1. tidyr

I use tidyr mostly for reshaping data to move between “long-format” and “wide-format” data.

Here is link to the tidyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/master/tidyr.pdf

Main tidyr functions

pivot_longer(): “lengthen” data by collapsing several columns into two.
pivot_wider(): “widen” data by expanding two columns into multiple columns
drop_na(): remove rows that contain NA
separate(): separate values in a column into multiple columns
unite(): paste together values in two columns

2.2. dplyr

dplyr is a package that helps you wrangle your data into shape to aid you in the process of visualization and analysis.

Here is a link to the dplyr cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/data-transformation.pdf

Main dplyr functions

pull(): select one column and save as a vector.
select(): select columns by criteria
filter(): filter rows by criteria
mutate(): add new variable using functions
group_by(): group the data together based on a given variable (or variables). Often used when calculating summary stats.
summarise(): calculate summary statistic for a given variable
arrange(): change order of rows
left_join(), right_join(), inner_join(), full_join(): set of functions to help merge data tables.
nest_join(): create nested datasets (advanced… I don’t know how to use this yet)

3. Demonstrating the basic functions

3.1 Using pipes (`%>%`) to chain together sequence of actions!

First, I’m going to introduce the “pipe”–perhaps the most useful part of the tidyverse grammar (which actually comes from another amazing package called magrittr, if you care…).

Basically, piping is when the %>% operator is used to forward a value, or the result of an expression, into the next function call/expression.

Take for example the billboard dataset, which is part of the tidyr package. It contains data on rankings of songs on the Billboard Top 100 in the year 2000.

billboard

## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

You can use the filter() function (see more below) to show just the data for songs by The Backstreet Boys.

filter(billboard, artist=="Backstreet Boys, The")

## # A tibble: 3 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14      39    25    24    15    12    12    10     9
## 2 Backstreet… Show… 2000-01-01      74    62    55    25    16    14    12    10
## 3 Backstreet… The … 2000-05-27      58    50    43    37    31    30    39    47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

But you can run the same code by using %>%, like this:

billboard %>% filter(artist=="Backstreet Boys, The")

## # A tibble: 3 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14      39    25    24    15    12    12    10     9
## 2 Backstreet… Show… 2000-01-01      74    62    55    25    16    14    12    10
## 3 Backstreet… The … 2000-05-27      58    50    43    37    31    30    39    47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

What this does is tell R: “Take billboard. Then, filter the data to show just the songs by The Backstreet Boys. And in this case, the filter(., ) says that the forwarded value should be used in place of”.”.

… but you actually don’t even need to include the “.” here–tidyverse automatically applies the forwarded value as the first entry in the function, so you can just do:

billboard %>% filter(artist=="Backstreet Boys, The")

## # A tibble: 3 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14      39    25    24    15    12    12    10     9
## 2 Backstreet… Show… 2000-01-01      74    62    55    25    16    14    12    10
## 3 Backstreet… The … 2000-05-27      58    50    43    37    31    30    39    47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Right now, this seems a bit puzzling and not that useful… but, you will quickly see how the %>% operator can help you build nice pipelines (pun intended) for data wrangling!

From here on out, I will build the codes using pipes as a default.

3.2. Filter by row values

As I’ve shown already, you can use the filter() function in dplyr to select rows based on some criteria.

I can actually use multiple criteria to filter data. Let’s say I now want to see the data for songs by either The Backstreet Boys or N’Sync:

billboard %>% filter(artist=="Backstreet Boys, The" | artist=="N'Sync")

## # A tibble: 6 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14      39    25    24    15    12    12    10     9
## 2 Backstreet… Show… 2000-01-01      74    62    55    25    16    14    12    10
## 3 Backstreet… The … 2000-05-27      58    50    43    37    31    30    39    47
## 4 N'Sync      Bye … 2000-01-29      42    20    19    14    13     7     6     5
## 5 N'Sync      It's… 2000-05-06      82    70    51    39    26    19    15     9
## 6 N'Sync      This… 2000-09-30      68    31    19    15    11     6     7     7
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

Recall that there is a conflict over the function name filter()–there is another filter() function in base R that is separate from the tidyverse version. To make sure you are using the version from the dplyr package, I often use dplyr::filter() when I run this function.

billboard %>% dplyr::filter(artist=="Backstreet Boys, The" | artist=="N'Sync")

## # A tibble: 6 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14      39    25    24    15    12    12    10     9
## 2 Backstreet… Show… 2000-01-01      74    62    55    25    16    14    12    10
## 3 Backstreet… The … 2000-05-27      58    50    43    37    31    30    39    47
## 4 N'Sync      Bye … 2000-01-29      42    20    19    14    13     7     6     5
## 5 N'Sync      It's… 2000-05-06      82    70    51    39    26    19    15     9
## 6 N'Sync      This… 2000-09-30      68    31    19    15    11     6     7     7
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

3.3. Select columns

Sometimes, you don’t need all of the data. Let’s say we just want to see what position each song started at–so we just want the artist, track, date entered and the week 1 rank. You can do this with select()

billboard %>% select(artist, track, date.entered, wk1)

## # A tibble: 317 × 4
##    artist         track                   date.entered   wk1
##    <chr>          <chr>                   <date>       <dbl>
##  1 2 Pac          Baby Don't Cry (Keep... 2000-02-26      87
##  2 2Ge+her        The Hardest Part Of ... 2000-09-02      91
##  3 3 Doors Down   Kryptonite              2000-04-08      81
##  4 3 Doors Down   Loser                   2000-10-21      76
##  5 504 Boyz       Wobble Wobble           2000-04-15      57
##  6 98^0           Give Me Just One Nig... 2000-08-19      51
##  7 A*Teens        Dancing Queen           2000-07-08      97
##  8 Aaliyah        I Don't Wanna           2000-01-29      84
##  9 Aaliyah        Try Again               2000-03-18      59
## 10 Adams, Yolanda Open My Heart           2000-08-26      76
## # ℹ 307 more rows

The nice thing about the select function is that you don’t need to put the column names in quotes or anything–just type in the columns you want.

or, type in the columns you DON’T want:

billboard %>% select(-track, -date.entered)

## # A tibble: 317 × 77
##    artist        wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8   wk9  wk10  wk11
##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac          87    82    72    77    87    94    99    NA    NA    NA    NA
##  2 2Ge+her        91    87    92    NA    NA    NA    NA    NA    NA    NA    NA
##  3 3 Doors Do…    81    70    68    67    66    57    54    53    51    51    51
##  4 3 Doors Do…    76    76    72    69    67    65    55    59    62    61    61
##  5 504 Boyz       57    34    25    17    17    31    36    49    53    57    64
##  6 98^0           51    39    34    26    26    19     2     2     3     6     7
##  7 A*Teens        97    97    96    95   100    NA    NA    NA    NA    NA    NA
##  8 Aaliyah        84    62    51    41    38    35    35    38    38    36    37
##  9 Aaliyah        59    53    38    28    21    18    16    14    12    10     9
## 10 Adams, Yol…    76    76    74    69    68    67    61    58    57    59    66
## # ℹ 307 more rows
## # ℹ 65 more variables: wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## #   wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## #   wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## #   wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## #   wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## #   wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …

Combining the filter() and select() functions allow you to manage the data in flexible ways. And piping makes it easy to do this:

billboard %>% dplyr::filter(artist=="Backstreet Boys, The") %>% select(artist, track, date.entered, wk1)

## # A tibble: 3 × 4
##   artist               track                   date.entered   wk1
##   <chr>                <chr>                   <date>       <dbl>
## 1 Backstreet Boys, The Shape Of My Heart       2000-10-14      39
## 2 Backstreet Boys, The Show Me The Meaning ... 2000-01-01      74
## 3 Backstreet Boys, The The One                 2000-05-27      58

3.4. Converting wide-format data to long-format data with `pivot_longer()`

“Wide-format” data is one in which each row is a subject/entity that is measured repeatedly, and each measurement appears on different columns. The billboard data is a prime example of wide-format data because it lists the rankings of a song for each week in separate columns.

Let’s now convert this into a long-format, in which we have a column for “week” and then the rank of that song for that week in a column called “rank”:

billboard %>% pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank")

## # A tibble: 24,092 × 5
##    artist track                   date.entered week   rank
##    <chr>  <chr>                   <date>       <chr> <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
## # ℹ 24,082 more rows

What happened here?

The minimal arguments that are required in here are:

data: (self explanatory)
cols: The columns that we want to collapse into a single column. Here, we want all the columns that start with “wk”, which contain the ranking of that song in that week. tidyverse has a friendly function called starts_with() that we use here. An alternative way to do this would be to exclude all the other columns, which we could do with -c(artist, track, date.entered)
names_to: The name of a new column that will contain names of the columns that you collapsed. Here, naming this “week” makes sense.
values_to: The name of a new column that will have the values for each ID x names_from combination. In this case, this is “rank”

3.5. Converting long-format data to wide-format data with `pivot_wider()`

Let’s save the long-format version of the billboard data:

bb_long=billboard %>% pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank")

We can then revert back to the wide-format version this way:

bb_long %>% pivot_wider(names_from = "week", values_from = "rank")

## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

So, what happened here? The minimal arguments that are required in here are:

data: is bb_long, forwarded by the %>%
names_from: The column that you want to expand into different columns. In this case, it is “age” because you want to see the heights of each tree at different ages as different columns.
values_from: The column that will have the values for each ID x names_from combination. In this case, this is “height”

3.6. Group and Summarize data

dplyr makes the craft of summarizing data much easier… if you get comfortable with the grammar. Here, I will show you how to use group_by() and summarise() functions to get summary data by artist, like their best ranking and how many weeks they spent on the Top 100 chart.

Let’s go back to the code that created the long-format version of the billboard data, but let’s drop the rows that contain NAs using drop_na()

billboard %>% 
  pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
  drop_na()

## # A tibble: 5,307 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92
## # ℹ 5,297 more rows

Now, we can calculate the “best rank” of each artist by grouping the data by “artist”, and finding the minimum ranking that the artist had:

billboard %>% 
  pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
  drop_na() %>%
  group_by(artist) %>%
  summarise(artist.best=min(rank, na.rm=T))

## # A tibble: 228 × 2
##    artist              artist.best
##    <chr>                     <dbl>
##  1 2 Pac                        72
##  2 2Ge+her                      87
##  3 3 Doors Down                  3
##  4 504 Boyz                     17
##  5 98^0                          2
##  6 A*Teens                      95
##  7 Aaliyah                       1
##  8 Adams, Yolanda               57
##  9 Adkins, Trace                65
## 10 Aguilera, Christina           1
## # ℹ 218 more rows

Let’s try another one: we can calculate the number of weeks that an artist spent on Top 100 by simply counting the number of rows that the artist shows up in (after having dropped all of the rows containing NAs, that should be what is left).

billboard %>% 
  pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
  drop_na() %>%
  group_by(artist) %>%
  summarise(artist.weeks=n())

## # A tibble: 228 × 2
##    artist              artist.weeks
##    <chr>                      <int>
##  1 2 Pac                          7
##  2 2Ge+her                        3
##  3 3 Doors Down                  73
##  4 504 Boyz                      18
##  5 98^0                          20
##  6 A*Teens                        5
##  7 Aaliyah                       52
##  8 Adams, Yolanda                20
##  9 Adkins, Trace                 11
## 10 Aguilera, Christina           67
## # ℹ 218 more rows

We can also do both summary functions in one go. Let’s do that and save the table as artist_dat:

artist_dat=billboard %>% 
  pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
  drop_na() %>%
  group_by(artist) %>%
  summarise(artist.weeks=n(), artist.best=min(rank))

artist_dat

## # A tibble: 228 × 3
##    artist              artist.weeks artist.best
##    <chr>                      <int>       <dbl>
##  1 2 Pac                          7          72
##  2 2Ge+her                        3          87
##  3 3 Doors Down                  73           3
##  4 504 Boyz                      18          17
##  5 98^0                          20           2
##  6 A*Teens                        5          95
##  7 Aaliyah                       52           1
##  8 Adams, Yolanda                20          57
##  9 Adkins, Trace                 11          65
## 10 Aguilera, Christina           67           1
## # ℹ 218 more rows

3.7. Add new variables using `mutate()`

You can make new variables (columns). You’ll often do this if want to calculate some new variable based on existing variables.

artist_dat %>%
  mutate(top5= artist.best <=5) %>% #TRUE if artist's song was ever in Top 5
  mutate(weeks.prop = artist.weeks/max(artist.weeks)) #weeks on list as proportion of maximum value

## # A tibble: 228 × 5
##    artist              artist.weeks artist.best top5  weeks.prop
##    <chr>                      <int>       <dbl> <lgl>      <dbl>
##  1 2 Pac                          7          72 FALSE     0.0673
##  2 2Ge+her                        3          87 FALSE     0.0288
##  3 3 Doors Down                  73           3 TRUE      0.702 
##  4 504 Boyz                      18          17 FALSE     0.173 
##  5 98^0                          20           2 TRUE      0.192 
##  6 A*Teens                        5          95 FALSE     0.0481
##  7 Aaliyah                       52           1 TRUE      0.5   
##  8 Adams, Yolanda                20          57 FALSE     0.192 
##  9 Adkins, Trace                 11          65 FALSE     0.106 
## 10 Aguilera, Christina           67           1 TRUE      0.644 
## # ℹ 218 more rows

3.8. Join two different data

The four main join functions all seek to merge data using matching columns (either matching column names, or manually designated using the by= argument). But they differ in which rows they will keep:

left_join(x, y): match up the values in designated columns of x and y, and keep all rows in x. NAs show up when a value is present in x but not y.
right_join(x, y): match up the values in designated columns of x and y, and keep all rows in y. NAs show up when a value is present in y but not x.
inner_join(x, y): match up the values in designated columns of x and y, and keep only rows in which x and y values matched. No NAs show up.
`full_join(x, y): match up the values in designated columns of x and y, and keep all rows in x and y, even if they don’t match. NAs whenever value in one table doesn’t have a match in the other.

Let’s demonstrate this by comparing the billboard data with the artist_dat table that we created above. The billboard data has more rows than artist_dat because some artists show up in multiple rows (if they have multiple songs in the charts). We will remove the columns with “wk” so we can see the results more easily.

left_join(billboard, artist_dat) %>%
  select(-starts_with("wk"), -date.entered)

## # A tibble: 317 × 4
##    artist         track                   artist.weeks artist.best
##    <chr>          <chr>                          <int>       <dbl>
##  1 2 Pac          Baby Don't Cry (Keep...            7          72
##  2 2Ge+her        The Hardest Part Of ...            3          87
##  3 3 Doors Down   Kryptonite                        73           3
##  4 3 Doors Down   Loser                             73           3
##  5 504 Boyz       Wobble Wobble                     18          17
##  6 98^0           Give Me Just One Nig...           20           2
##  7 A*Teens        Dancing Queen                      5          95
##  8 Aaliyah        I Don't Wanna                     52           1
##  9 Aaliyah        Try Again                         52           1
## 10 Adams, Yolanda Open My Heart                     20          57
## # ℹ 307 more rows

You can see that whenever the artist has multiple songs on the list, the “artist.weeks” and “artist.best” variables are repeated (because these are the stats for the artist).

3.9 Why do we need to know how to deal with long- vs. wide-format data, and how to plot results of time-varying data in ggplot.

Long- and Wide-format data have different strengths and weaknesses.

In some ways, wide-format data is easier for us to keep track, especially if you have something that is repeatedly measured for each subject across time (like weekly rankings of a song). However, it is not necessarily the easiest way for R to handle your data.

Long-format data is what packages like ggplot likes to deal with.

library(stringr)
bb_long=bb_long %>% mutate(week=as.numeric(str_replace(week, "wk", "")))
ggplot(bb_long, aes(x=week, y=-rank, group=track)) +
  geom_line(alpha=0.5)

## Warning: Removed 18723 rows containing missing values (`geom_line()`).

Get more practice with Worked Examples

Go to the Worked Example on “Wrangling World Bank Data”

Data Wrangling with dplyr and tidyr

Dai Shizuka

updated 09/26/23

Intro to the ‘tidyverse’

Installing and loading packages we need for this module

2. Working with tidyr and dplyr

2.1. tidyr

Main tidyr functions

2.2. dplyr

Main dplyr functions

3. Demonstrating the basic functions

3.1 Using pipes (`%>%`) to chain together sequence of actions!

3.2. Filter by row values

3.3. Select columns

3.4. Converting wide-format data to long-format data with `pivot_longer()`

3.5. Converting long-format data to wide-format data with `pivot_wider()`

3.6. Group and Summarize data

3.7. Add new variables using `mutate()`

3.8. Join two different data

3.9 Why do we need to know how to deal with long- vs. wide-format data, and how to plot results of time-varying data in ggplot.

Get more practice with Worked Examples

Data Wrangling with dplyr and tidyr

Dai Shizuka

updated 09/26/23

Intro to the ‘tidyverse’

Installing and loading packages we need for this module

2. Working with tidyr and dplyr

2.1. tidyr

Main tidyr functions

2.2. dplyr

Main dplyr functions

3. Demonstrating the basic functions

3.1 Using pipes (%>%) to chain together sequence of actions!

3.2. Filter by row values

3.3. Select columns

3.4. Converting wide-format data to long-format data with pivot_longer()

3.5. Converting long-format data to wide-format data with pivot_wider()

3.6. Group and Summarize data

3.7. Add new variables using mutate()

3.8. Join two different data

3.9 Why do we need to know how to deal with long- vs. wide-format data, and how to plot results of time-varying data in ggplot.

Get more practice with Worked Examples

3.1 Using pipes (`%>%`) to chain together sequence of actions!

3.4. Converting wide-format data to long-format data with `pivot_longer()`

3.5. Converting long-format data to wide-format data with `pivot_wider()`

3.7. Add new variables using `mutate()`