3. Demonstrating the basic functions
3.1 Using pipes (%>%
) to chain together sequence of
actions!
First, I’m going to introduce the “pipe”–perhaps the most useful part
of the tidyverse grammar (which actually comes from another amazing
package called magrittr
, if you care…).
Basically, piping is when the %>%
operator is used to
forward a value, or the result of an expression, into the next function
call/expression.
Take for example the billboard
dataset, which is part of
the tidyr package. It contains data on rankings of songs on the
Billboard Top 100 in the year 2000.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
You can use the filter()
function (see more below) to
show just the data for songs by The Backstreet Boys.
filter(billboard, artist=="Backstreet Boys, The")
## # A tibble: 3 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14 39 25 24 15 12 12 10 9
## 2 Backstreet… Show… 2000-01-01 74 62 55 25 16 14 12 10
## 3 Backstreet… The … 2000-05-27 58 50 43 37 31 30 39 47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
But you can run the same code by using %>%
, like
this:
billboard %>% filter(artist=="Backstreet Boys, The")
## # A tibble: 3 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14 39 25 24 15 12 12 10 9
## 2 Backstreet… Show… 2000-01-01 74 62 55 25 16 14 12 10
## 3 Backstreet… The … 2000-05-27 58 50 43 37 31 30 39 47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
What this does is tell R: “Take billboard
. Then, filter
the data to show just the songs by The Backstreet Boys. And in this
case, the filter(., )
says that the forwarded value should
be used in place of”.”.
… but you actually don’t even need to include the “.” here–tidyverse
automatically applies the forwarded value as the first entry in the
function, so you can just do:
billboard %>% filter(artist=="Backstreet Boys, The")
## # A tibble: 3 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14 39 25 24 15 12 12 10 9
## 2 Backstreet… Show… 2000-01-01 74 62 55 25 16 14 12 10
## 3 Backstreet… The … 2000-05-27 58 50 43 37 31 30 39 47
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Right now, this seems a bit puzzling and not that useful… but, you
will quickly see how the %>%
operator can help you build
nice pipelines (pun intended) for data wrangling!
From here on out, I will build the codes using pipes as a
default.
3.2. Filter by row values
As I’ve shown already, you can use the filter()
function
in dplyr to select rows based on some criteria.
I can actually use multiple criteria to filter data. Let’s say I now
want to see the data for songs by either The Backstreet Boys or
N’Sync:
billboard %>% filter(artist=="Backstreet Boys, The" | artist=="N'Sync")
## # A tibble: 6 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14 39 25 24 15 12 12 10 9
## 2 Backstreet… Show… 2000-01-01 74 62 55 25 16 14 12 10
## 3 Backstreet… The … 2000-05-27 58 50 43 37 31 30 39 47
## 4 N'Sync Bye … 2000-01-29 42 20 19 14 13 7 6 5
## 5 N'Sync It's… 2000-05-06 82 70 51 39 26 19 15 9
## 6 N'Sync This… 2000-09-30 68 31 19 15 11 6 7 7
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
Recall that there is a conflict over the function name
filter()
–there is another filter()
function in
base R that is separate from the tidyverse version. To make sure you are
using the version from the dplyr package, I often use
dplyr::filter()
when I run this function.
billboard %>% dplyr::filter(artist=="Backstreet Boys, The" | artist=="N'Sync")
## # A tibble: 6 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Backstreet… Shap… 2000-10-14 39 25 24 15 12 12 10 9
## 2 Backstreet… Show… 2000-01-01 74 62 55 25 16 14 12 10
## 3 Backstreet… The … 2000-05-27 58 50 43 37 31 30 39 47
## 4 N'Sync Bye … 2000-01-29 42 20 19 14 13 7 6 5
## 5 N'Sync It's… 2000-05-06 82 70 51 39 26 19 15 9
## 6 N'Sync This… 2000-09-30 68 31 19 15 11 6 7 7
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
3.3. Select columns
Sometimes, you don’t need all of the data. Let’s say we just want to
see what position each song started at–so we just want the artist,
track, date entered and the week 1 rank. You can do this with
select()
billboard %>% select(artist, track, date.entered, wk1)
## # A tibble: 317 × 4
## artist track date.entered wk1
## <chr> <chr> <date> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 87
## 2 2Ge+her The Hardest Part Of ... 2000-09-02 91
## 3 3 Doors Down Kryptonite 2000-04-08 81
## 4 3 Doors Down Loser 2000-10-21 76
## 5 504 Boyz Wobble Wobble 2000-04-15 57
## 6 98^0 Give Me Just One Nig... 2000-08-19 51
## 7 A*Teens Dancing Queen 2000-07-08 97
## 8 Aaliyah I Don't Wanna 2000-01-29 84
## 9 Aaliyah Try Again 2000-03-18 59
## 10 Adams, Yolanda Open My Heart 2000-08-26 76
## # ℹ 307 more rows
The nice thing about the select function is that you don’t need to
put the column names in quotes or anything–just type in the columns you
want.
or, type in the columns you DON’T want:
billboard %>% select(-track, -date.entered)
## # A tibble: 317 × 77
## artist wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9 wk10 wk11
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac 87 82 72 77 87 94 99 NA NA NA NA
## 2 2Ge+her 91 87 92 NA NA NA NA NA NA NA NA
## 3 3 Doors Do… 81 70 68 67 66 57 54 53 51 51 51
## 4 3 Doors Do… 76 76 72 69 67 65 55 59 62 61 61
## 5 504 Boyz 57 34 25 17 17 31 36 49 53 57 64
## 6 98^0 51 39 34 26 26 19 2 2 3 6 7
## 7 A*Teens 97 97 96 95 100 NA NA NA NA NA NA
## 8 Aaliyah 84 62 51 41 38 35 35 38 38 36 37
## 9 Aaliyah 59 53 38 28 21 18 16 14 12 10 9
## 10 Adams, Yol… 76 76 74 69 68 67 61 58 57 59 66
## # ℹ 307 more rows
## # ℹ 65 more variables: wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## # wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## # wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## # wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## # wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## # wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, …
Combining the filter()
and select()
functions allow you to manage the data in flexible ways. And piping
makes it easy to do this:
billboard %>% dplyr::filter(artist=="Backstreet Boys, The") %>% select(artist, track, date.entered, wk1)
## # A tibble: 3 × 4
## artist track date.entered wk1
## <chr> <chr> <date> <dbl>
## 1 Backstreet Boys, The Shape Of My Heart 2000-10-14 39
## 2 Backstreet Boys, The Show Me The Meaning ... 2000-01-01 74
## 3 Backstreet Boys, The The One 2000-05-27 58
3.5. Converting long-format data to wide-format data with
pivot_wider()
Let’s save the long-format version of the billboard data:
bb_long=billboard %>% pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank")
We can then revert back to the wide-format version this way:
bb_long %>% pivot_wider(names_from = "week", values_from = "rank")
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
So, what happened here? The minimal arguments that are required in
here are:
data
: is bb_long
, forwarded by the
%>%
names_from
: The column that you want to expand into
different columns. In this case, it is “age” because you want to see the
heights of each tree at different ages as different columns.
values_from
: The column that will have the values for
each ID x names_from combination. In this case, this is “height”
3.6. Group and Summarize data
dplyr makes the craft of summarizing data much easier… if
you get comfortable with the grammar. Here, I will show you how to use
group_by()
and summarise()
functions to get
summary data by artist, like their best ranking and how many weeks they
spent on the Top 100 chart.
Let’s go back to the code that created the long-format version of the
billboard data, but let’s drop the rows that contain NAs using
drop_na()
billboard %>%
pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
drop_na()
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Now, we can calculate the “best rank” of each artist by grouping the
data by “artist”, and finding the minimum ranking that the artist
had:
billboard %>%
pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
drop_na() %>%
group_by(artist) %>%
summarise(artist.best=min(rank, na.rm=T))
## # A tibble: 228 × 2
## artist artist.best
## <chr> <dbl>
## 1 2 Pac 72
## 2 2Ge+her 87
## 3 3 Doors Down 3
## 4 504 Boyz 17
## 5 98^0 2
## 6 A*Teens 95
## 7 Aaliyah 1
## 8 Adams, Yolanda 57
## 9 Adkins, Trace 65
## 10 Aguilera, Christina 1
## # ℹ 218 more rows
Let’s try another one: we can calculate the number of weeks that an
artist spent on Top 100 by simply counting the number of rows that the
artist shows up in (after having dropped all of the rows containing NAs,
that should be what is left).
billboard %>%
pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
drop_na() %>%
group_by(artist) %>%
summarise(artist.weeks=n())
## # A tibble: 228 × 2
## artist artist.weeks
## <chr> <int>
## 1 2 Pac 7
## 2 2Ge+her 3
## 3 3 Doors Down 73
## 4 504 Boyz 18
## 5 98^0 20
## 6 A*Teens 5
## 7 Aaliyah 52
## 8 Adams, Yolanda 20
## 9 Adkins, Trace 11
## 10 Aguilera, Christina 67
## # ℹ 218 more rows
We can also do both summary functions in one go. Let’s do that and
save the table as artist_dat:
artist_dat=billboard %>%
pivot_longer(cols=starts_with("wk"), names_to="week", values_to="rank") %>%
drop_na() %>%
group_by(artist) %>%
summarise(artist.weeks=n(), artist.best=min(rank))
artist_dat
## # A tibble: 228 × 3
## artist artist.weeks artist.best
## <chr> <int> <dbl>
## 1 2 Pac 7 72
## 2 2Ge+her 3 87
## 3 3 Doors Down 73 3
## 4 504 Boyz 18 17
## 5 98^0 20 2
## 6 A*Teens 5 95
## 7 Aaliyah 52 1
## 8 Adams, Yolanda 20 57
## 9 Adkins, Trace 11 65
## 10 Aguilera, Christina 67 1
## # ℹ 218 more rows
3.7. Add new variables using mutate()
You can make new variables (columns). You’ll often do this if want to
calculate some new variable based on existing variables.
artist_dat %>%
mutate(top5= artist.best <=5) %>% #TRUE if artist's song was ever in Top 5
mutate(weeks.prop = artist.weeks/max(artist.weeks)) #weeks on list as proportion of maximum value
## # A tibble: 228 × 5
## artist artist.weeks artist.best top5 weeks.prop
## <chr> <int> <dbl> <lgl> <dbl>
## 1 2 Pac 7 72 FALSE 0.0673
## 2 2Ge+her 3 87 FALSE 0.0288
## 3 3 Doors Down 73 3 TRUE 0.702
## 4 504 Boyz 18 17 FALSE 0.173
## 5 98^0 20 2 TRUE 0.192
## 6 A*Teens 5 95 FALSE 0.0481
## 7 Aaliyah 52 1 TRUE 0.5
## 8 Adams, Yolanda 20 57 FALSE 0.192
## 9 Adkins, Trace 11 65 FALSE 0.106
## 10 Aguilera, Christina 67 1 TRUE 0.644
## # ℹ 218 more rows
3.8. Join two different data
The four main join functions all seek to merge data using matching
columns (either matching column names, or manually designated using the
by=
argument). But they differ in which rows they will
keep:
left_join(x, y)
: match up the values in designated
columns of x and y, and keep all rows in x. NAs show up when a value is
present in x but not y.
right_join(x, y)
: match up the values in designated
columns of x and y, and keep all rows in y. NAs show up when a value is
present in y but not x.
inner_join(x, y)
: match up the values in designated
columns of x and y, and keep only rows in which x and y values matched.
No NAs show up.
`full_join(x, y): match up the values in designated columns of x
and y, and keep all rows in x and y, even if they don’t match. NAs
whenever value in one table doesn’t have a match in the other.
Let’s demonstrate this by comparing the billboard
data
with the artist_dat
table that we created above. The
billboard
data has more rows than artist_dat
because some artists show up in multiple rows (if they have multiple
songs in the charts). We will remove the columns with “wk” so we can see
the results more easily.
left_join(billboard, artist_dat) %>%
select(-starts_with("wk"), -date.entered)
## # A tibble: 317 × 4
## artist track artist.weeks artist.best
## <chr> <chr> <int> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 7 72
## 2 2Ge+her The Hardest Part Of ... 3 87
## 3 3 Doors Down Kryptonite 73 3
## 4 3 Doors Down Loser 73 3
## 5 504 Boyz Wobble Wobble 18 17
## 6 98^0 Give Me Just One Nig... 20 2
## 7 A*Teens Dancing Queen 5 95
## 8 Aaliyah I Don't Wanna 52 1
## 9 Aaliyah Try Again 52 1
## 10 Adams, Yolanda Open My Heart 20 57
## # ℹ 307 more rows
You can see that whenever the artist has multiple songs on the list,
the “artist.weeks” and “artist.best” variables are repeated (because
these are the stats for the artist).