Worked Example: Wrangling World Bank Population Data

Things you will learn in this Worked Example

Practice wrangling data with dplyr and tidyr
Learn tricks to download data from public data repositories like the World Bank
More practice with ggplot
A little bit of stats

We will accomplish these while exploring the (admittedly superficial) patterns in population growth rates across countries.

By the end, you’ll be able to produce graphs like this:

Set up: load packages

library(wbstats)
library(tidyverse)

1. The Data: Population growth trends across the world

The World Bank actually makes it very easy to download a huge amount of data very easily. You could just go to the World Bankd Open Data site and search for data and download them in a few clicks. There are even packages such as WDI and wbstats that allow you to query and pull data from this site from within R.

However, for the purposes of this exercise, we will deal with several datasets that are included in the packages we have loaded above.

`world_bank_pop`: Population data from World Bank (2000-2017)

First is the world_bank_pop dataset that is included in the tidyr package (which is part of the the tidyverse suite). Start by pulling up the help file for the dataset

?world_bank_pop

Let’s take a peak at the data, which is in “tibble” format:

world_bank_pop

## # A tibble: 1,056 × 20
##    country indic…¹ `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`   `2007`
##    <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1 ABW     SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4  4.47e+4
##  2 ABW     SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
##  3 ABW     SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5  1.01e+5
##  4 ABW     SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1  3.84e-1
##  5 AFG     SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6  6.15e+6
##  6 AFG     SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0  3.65e+0
##  7 AFG     SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7  2.59e+7  2.66e+7
##  8 AFG     SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0  3.23e+0  2.76e+0
##  9 AGO     SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7  1.15e+7  1.21e+7
## 10 AGO     SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0  4.92e+0  4.89e+0
## # … with 1,046 more rows, 10 more variables: `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, and abbreviated variable name
## #   ¹indicator

Notice that each country is repeated across 4 rows. This is because there are actually 4 different “indicators” for each country, and so each country x indicator combination takes up a row. The rest of the columns are years, from 2000 to 2017.

Also notice that the numbers in the column names have a backquote, or “`” around them. In R, this allows numbers to be interpreted as text (and column names have to be text).

So, if you wanted to pull up the “2001” column, this will NOT work…

world_bank_pop$2001

…but this will work

world_bank_pop$`2001`

2. Exclude aggregate data to get data for countries only

We are going to start by diving into the world_bank_pop dataset.

Let’s take a peek at the dataset again and see what we have…

world_bank_pop

## # A tibble: 1,056 × 20
##    country indic…¹ `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`   `2007`
##    <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1 ABW     SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4  4.47e+4
##  2 ABW     SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
##  3 ABW     SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5  1.01e+5
##  4 ABW     SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1  3.84e-1
##  5 AFG     SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6  6.15e+6
##  6 AFG     SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0  3.65e+0
##  7 AFG     SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7  2.59e+7  2.66e+7
##  8 AFG     SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0  3.23e+0  2.76e+0
##  9 AGO     SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7  1.15e+7  1.21e+7
## 10 AGO     SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0  4.92e+0  4.89e+0
## # … with 1,046 more rows, 10 more variables: `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, and abbreviated variable name
## #   ¹indicator

I want to start by looking at how many unique countries we actually have here.

length(unique(world_bank_pop$country)) #this gives us the number of unique values in the "country" column

## [1] 264

You can see that there are way more “countries” here than there are in the world. As of 2022, there are 193 countries in the United Nations, though if we add soverign states that are not recognized by UN (e.g., Taiwan, Kosovo, etc.), the list is about 206… and there are more if we include disputed territories.

The reason the World Bank data has even more than that is because some of the “countries” include “aggregates” like “Arab World” or “High-income Countries”.

Unfortunately, the world_bank_pop dataset does not include any variables that allow us to differentiate the countries from the aggregates! However, we can solve this problem by pulling the metadata for countries from the World Bank. We can do this by using the wb_countries() function from the wbstats package.

Let’s pull the country metadata and save it as an object called “metadata”:

metadata=wb_countries()

metadata

## # A tibble: 299 × 18
##    iso3c iso2c country    capit…¹ longi…² latit…³ regio…⁴ regio…⁵ region admin…⁶
##    <chr> <chr> <chr>      <chr>     <dbl>   <dbl> <chr>   <chr>   <chr>  <chr>  
##  1 ABW   AW    Aruba      Oranje…  -70.0    12.5  LCN     ZJ      Latin… <NA>   
##  2 AFE   ZH    Africa Ea… <NA>      NA      NA    <NA>    <NA>    Aggre… <NA>   
##  3 AFG   AF    Afghanist… Kabul     69.2    34.5  SAS     8S      South… SAS    
##  4 AFR   A9    Africa     <NA>      NA      NA    <NA>    <NA>    Aggre… <NA>   
##  5 AFW   ZI    Africa We… <NA>      NA      NA    <NA>    <NA>    Aggre… <NA>   
##  6 AGO   AO    Angola     Luanda    13.2    -8.81 SSF     ZG      Sub-S… SSA    
##  7 ALB   AL    Albania    Tirane    19.8    41.3  ECS     Z7      Europ… ECA    
##  8 AND   AD    Andorra    Andorr…    1.52   42.5  ECS     Z7      Europ… <NA>   
##  9 ARB   1A    Arab World <NA>      NA      NA    <NA>    <NA>    Aggre… <NA>   
## 10 ARE   AE    United Ar… Abu Dh…   54.4    24.5  MEA     ZQ      Middl… <NA>   
## # … with 289 more rows, 8 more variables: admin_region_iso2c <chr>,
## #   admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## #   income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## #   lending_type <chr>, and abbreviated variable names ¹capital_city,
## #   ²longitude, ³latitude, ⁴region_iso3c, ⁵region_iso2c, ⁶admin_region_iso3c

head(metadata$region)

## [1] "Latin America & Caribbean" "Aggregates"               
## [3] "South Asia"                "Aggregates"               
## [5] "Aggregates"                "Sub-Saharan Africa"

So this column shows us the “Aggregates”.

So, we can use this to filter the metadata to exclude the Aggregate codes.

metadata %>% dplyr::filter(region != "Aggregates")

## # A tibble: 218 × 18
##    iso3c iso2c country    capit…¹ longi…² latit…³ regio…⁴ regio…⁵ region admin…⁶
##    <chr> <chr> <chr>      <chr>     <dbl>   <dbl> <chr>   <chr>   <chr>  <chr>  
##  1 ABW   AW    Aruba      Oranje…  -70.0    12.5  LCN     ZJ      Latin… <NA>   
##  2 AFG   AF    Afghanist… Kabul     69.2    34.5  SAS     8S      South… SAS    
##  3 AGO   AO    Angola     Luanda    13.2    -8.81 SSF     ZG      Sub-S… SSA    
##  4 ALB   AL    Albania    Tirane    19.8    41.3  ECS     Z7      Europ… ECA    
##  5 AND   AD    Andorra    Andorr…    1.52   42.5  ECS     Z7      Europ… <NA>   
##  6 ARE   AE    United Ar… Abu Dh…   54.4    24.5  MEA     ZQ      Middl… <NA>   
##  7 ARG   AR    Argentina  Buenos…  -58.4   -34.6  LCN     ZJ      Latin… LAC    
##  8 ARM   AM    Armenia    Yerevan   44.5    40.2  ECS     Z7      Europ… ECA    
##  9 ASM   AS    American … Pago P… -171.    -14.3  EAS     Z4      East … EAP    
## 10 ATG   AG    Antigua a… Saint …  -61.8    17.1  LCN     ZJ      Latin… <NA>   
## # … with 208 more rows, 8 more variables: admin_region_iso2c <chr>,
## #   admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## #   income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## #   lending_type <chr>, and abbreviated variable names ¹capital_city,
## #   ²longitude, ³latitude, ⁴region_iso3c, ⁵region_iso2c, ⁶admin_region_iso3c

Let’s also pare down the data to a few variables we might be interested in, including region and income level:

metadata %>% 
  dplyr::filter(region != "Aggregates") %>%
  select(iso3c, iso2c, country, region, income_level)

## # A tibble: 218 × 5
##    iso3c iso2c country              region                     income_level     
##    <chr> <chr> <chr>                <chr>                      <chr>            
##  1 ABW   AW    Aruba                Latin America & Caribbean  High income      
##  2 AFG   AF    Afghanistan          South Asia                 Low income       
##  3 AGO   AO    Angola               Sub-Saharan Africa         Lower middle inc…
##  4 ALB   AL    Albania              Europe & Central Asia      Upper middle inc…
##  5 AND   AD    Andorra              Europe & Central Asia      High income      
##  6 ARE   AE    United Arab Emirates Middle East & North Africa High income      
##  7 ARG   AR    Argentina            Latin America & Caribbean  Upper middle inc…
##  8 ARM   AM    Armenia              Europe & Central Asia      Upper middle inc…
##  9 ASM   AS    American Samoa       East Asia & Pacific        Upper middle inc…
## 10 ATG   AG    Antigua and Barbuda  Latin America & Caribbean  High income      
## # … with 208 more rows

Ok, this looks useful. Let’s save it as a data table called countries

countries=metadata %>% 
  dplyr::filter(region != "Aggregates") %>%
  select(iso3c, iso2c, country.name=country, region, income_level) %>%
  mutate(income_level=factor(income_level, levels=c("Low income", "Lower middle income", "Upper middle income", "High income")))

Now, we are going to use the inner_join() function to make a version of the world bank population data that is only for actual countries (not aggregates):

inner_join(world_bank_pop, countries, by=c("country" = "iso3c"))

## # A tibble: 868 × 24
##    country indic…¹ `2000` `2001` `2002` `2003`  `2004`  `2005`   `2006`   `2007`
##    <chr>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
##  1 ABW     SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4  4.47e+4
##  2 ABW     SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
##  3 ABW     SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5  1.01e+5
##  4 ABW     SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1  3.84e-1
##  5 AFG     SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6  6.15e+6
##  6 AFG     SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0  4.12e+0  3.65e+0
##  7 AFG     SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7  2.59e+7  2.66e+7
##  8 AFG     SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0  3.23e+0  2.76e+0
##  9 AGO     SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7  1.15e+7  1.21e+7
## 10 AGO     SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0  4.92e+0  4.89e+0
## # … with 858 more rows, 14 more variables: `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## #   `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, iso2c <chr>, country.name <chr>,
## #   region <chr>, income_level <fct>, and abbreviated variable name ¹indicator

You can see that the inner_join() function worked if you scroll all the way to the right of the output table.

Let’s rearrange the order of the columns using relocate() so that we can have all the useful metadata appear first (left-most):

inner_join(world_bank_pop, countries, by=c("country" = "iso3c")) %>%
  relocate(country, iso2c, country.name, region, income_level)

## # A tibble: 868 × 24
##    country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
##    <chr>   <chr> <chr>        <chr>  <fct>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
##  1 ABW     AW    Aruba        Latin… High i… SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4
##  2 ABW     AW    Aruba        Latin… High i… SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0
##  3 ABW     AW    Aruba        Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
##  4 ABW     AW    Aruba        Latin… High i… SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0
##  5 AFG     AF    Afghanistan  South… Low in… SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6
##  6 AFG     AF    Afghanistan  South… Low in… SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0
##  7 AFG     AF    Afghanistan  South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
##  8 AFG     AF    Afghanistan  South… Low in… SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0
##  9 AGO     AO    Angola       Sub-S… Lower … SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6
## 10 AGO     AO    Angola       Sub-S… Lower … SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0
## # … with 858 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## #   ²indicator

This looks good. Let’s save it as an object we’ll call popdat_countries

popdat_countries=inner_join(world_bank_pop, countries, by=c("country" = "iso3c")) %>%
  relocate(country, iso2c, country.name, region, income_level)

popdat_countries

## # A tibble: 868 × 24
##    country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
##    <chr>   <chr> <chr>        <chr>  <fct>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
##  1 ABW     AW    Aruba        Latin… High i… SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4
##  2 ABW     AW    Aruba        Latin… High i… SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0
##  3 ABW     AW    Aruba        Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
##  4 ABW     AW    Aruba        Latin… High i… SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0
##  5 AFG     AF    Afghanistan  South… Low in… SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6
##  6 AFG     AF    Afghanistan  South… Low in… SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0
##  7 AFG     AF    Afghanistan  South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
##  8 AFG     AF    Afghanistan  South… Low in… SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0
##  9 AGO     AO    Angola       Sub-S… Lower … SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6
## 10 AGO     AO    Angola       Sub-S… Lower … SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0
## # … with 858 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## #   ²indicator

3. Which are the world’s largest countries (in 2017)?

As mentioned at the beginning, each row of data here is a country x indicator combination, with 4 different indicators. The indicators are:

SP.POP.GROW = population growth
SP.POP.TOTL = total population
SP.URB.GROW = urban population growth
SP.URB.TOTL = total urban population

So, we will create a data table called pop_totals that is just the “total population” data.

pop_totals=popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.TOTL") 

pop_totals

## # A tibble: 217 × 24
##    country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
##    <chr>   <chr> <chr>        <chr>  <fct>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
##  1 ABW     AW    Aruba        Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
##  2 AFG     AF    Afghanistan  South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
##  3 AGO     AO    Angola       Sub-S… Lower … SP.POP… 1.64e7 1.70e7 1.76e7 1.82e7
##  4 ALB     AL    Albania      Europ… Upper … SP.POP… 3.09e6 3.06e6 3.05e6 3.04e6
##  5 AND     AD    Andorra      Europ… High i… SP.POP… 6.54e4 6.73e4 7.00e4 7.32e4
##  6 ARE     AE    United Arab… Middl… High i… SP.POP… 3.15e6 3.33e6 3.51e6 3.74e6
##  7 ARG     AR    Argentina    Latin… Upper … SP.POP… 3.71e7 3.75e7 3.79e7 3.83e7
##  8 ARM     AM    Armenia      Europ… Upper … SP.POP… 3.07e6 3.05e6 3.03e6 3.02e6
##  9 ASM     AS    American Sa… East … Upper … SP.POP… 5.75e4 5.82e4 5.87e4 5.91e4
## 10 ATG     AG    Antigua and… Latin… High i… SP.POP… 8.36e4 8.51e4 8.63e4 8.73e4
## # … with 207 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## #   ²indicator

From this population size data, let’s figure out what is the 90th percentile in population size in 2017. We are going to get this number by doing pull() to get the 2017 population sizes as a vector, and use quantile() to get the 90th percentile. Save that value as upper_pop.

upper_pop=pop_totals %>%
  pull(`2017`) %>% 
  quantile(., probs=0.90, na.rm=T) # this allows us to find the value at Nth percentile

upper_pop

##      90% 
## 63286844

So, the 90th percentile of population sizes in 2017 is 63,286,844

Now, we can use this 90th percentile value to filter the population size data to just the largest (>= 90th percentile) countries. We will create a one-column tibble called large_countries() to save this list of country codes.

pop_totals %>% 
  dplyr::filter(`2017` > upper_pop)

## # A tibble: 22 × 24
##    country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
##    <chr>   <chr> <chr>        <chr>  <fct>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
##  1 BGD     BD    Bangladesh   South… Lower … SP.POP… 1.32e8 1.34e8 1.37e8 1.39e8
##  2 BRA     BR    Brazil       Latin… Upper … SP.POP… 1.75e8 1.78e8 1.80e8 1.82e8
##  3 CHN     CN    China        East … Upper … SP.POP… 1.26e9 1.27e9 1.28e9 1.29e9
##  4 COD     CD    Congo, Dem.… Sub-S… Low in… SP.POP… 4.71e7 4.84e7 4.98e7 5.14e7
##  5 DEU     DE    Germany      Europ… High i… SP.POP… 8.22e7 8.23e7 8.25e7 8.25e7
##  6 EGY     EG    Egypt, Arab… Middl… Lower … SP.POP… 6.99e7 7.12e7 7.26e7 7.40e7
##  7 ETH     ET    Ethiopia     Sub-S… Low in… SP.POP… 6.65e7 6.85e7 7.05e7 7.25e7
##  8 FRA     FR    France       Europ… High i… SP.POP… 6.09e7 6.14e7 6.18e7 6.22e7
##  9 GBR     GB    United King… Europ… High i… SP.POP… 5.89e7 5.91e7 5.94e7 5.96e7
## 10 IDN     ID    Indonesia    East … Lower … SP.POP… 2.12e8 2.15e8 2.18e8 2.21e8
## # … with 12 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## #   ²indicator

Ok, now let’s select just the country code, country name and population size of the world’s largest countries (90th percentile) and arrange it in descending order using arrange()

largest_countries=pop_totals %>% 
  dplyr::filter(`2017` > upper_pop) %>% 
  select(country, country.name, `2017`) %>%
  arrange(desc(`2017`))

print(largest_countries, n=22)

## # A tibble: 22 × 3
##    country country.name           `2017`
##    <chr>   <chr>                   <dbl>
##  1 CHN     China              1386395000
##  2 IND     India              1339180127
##  3 USA     United States       325719178
##  4 IDN     Indonesia           263991379
##  5 BRA     Brazil              209288278
##  6 PAK     Pakistan            197015955
##  7 NGA     Nigeria             190886311
##  8 BGD     Bangladesh          164669751
##  9 RUS     Russian Federation  144495044
## 10 MEX     Mexico              129163276
## 11 JPN     Japan               126785797
## 12 ETH     Ethiopia            104957438
## 13 PHL     Philippines         104918090
## 14 EGY     Egypt, Arab Rep.     97553151
## 15 VNM     Vietnam              95540800
## 16 DEU     Germany              82695000
## 17 COD     Congo, Dem. Rep.     81339988
## 18 IRN     Iran, Islamic Rep.   81162788
## 19 TUR     Turkiye              80745020
## 20 THA     Thailand             69037513
## 21 FRA     France               67118648
## 22 GBR     United Kingdom       66022273

4. How population growth rates have changed across years, by income level of countries

Now, let’s create a tibble that contains the population growth data as well as the income level designation for each country.

I’m also going to remove countries for which we are missing data using drop_na().

popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW")  %>%
  drop_na()

## # A tibble: 214 × 24
##    country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
##    <chr>   <chr> <chr>        <chr>  <fct>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>
##  1 ABW     AW    Aruba        Latin… High i… SP.POP…  2.06   2.23   2.23   2.11 
##  2 AFG     AF    Afghanistan  South… Low in… SP.POP…  3.49   4.25   4.72   4.82 
##  3 AGO     AO    Angola       Sub-S… Lower … SP.POP…  3.03   3.25   3.41   3.53 
##  4 ALB     AL    Albania      Europ… Upper … SP.POP… -0.637 -0.938 -0.300 -0.374
##  5 AND     AD    Andorra      Europ… High i… SP.POP…  1.57   2.94   3.94   4.38 
##  6 ARE     AE    United Arab… Middl… High i… SP.POP…  5.43   5.28   5.30   6.48 
##  7 ARG     AR    Argentina    Latin… Upper … SP.POP…  1.11   1.11   1.11   1.10 
##  8 ARM     AM    Armenia      Europ… Upper … SP.POP… -0.631 -0.619 -0.551 -0.532
##  9 ASM     AS    American Sa… East … Upper … SP.POP…  1.31   1.13   0.951  0.655
## 10 ATG     AG    Antigua and… Latin… High i… SP.POP…  2.12   1.75   1.41   1.18 
## # … with 204 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## #   `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## #   ²indicator

Now, we are going to reshape this data into long-format.

To do this, I’m going to first pare down by removing columns we don’t need. I’m going to keep “country” because I want the value for each country for each year. I’m also going to keep “income_level” because I’m going to use that to summarise the population growth data later. Then, I’m going to use the pivot_longer() function to rearrange all of the columns with years as the heading into two columns: one column with the year, and one column with the population growth rate.

popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW") %>%
  drop_na() %>%
  select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
  pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate")

## # A tibble: 3,852 × 4
##    country income_level year  growth_rate
##    <chr>   <fct>        <chr>       <dbl>
##  1 ABW     High income  2000       2.06  
##  2 ABW     High income  2001       2.23  
##  3 ABW     High income  2002       2.23  
##  4 ABW     High income  2003       2.11  
##  5 ABW     High income  2004       1.76  
##  6 ABW     High income  2005       1.30  
##  7 ABW     High income  2006       0.798 
##  8 ABW     High income  2007       0.384 
##  9 ABW     High income  2008       0.131 
## 10 ABW     High income  2009       0.0986
## # … with 3,842 more rows

Ok, now that we have the data in long format, it is easier to summarize the annual population growth rates across all years by income level.

To do this, I’m going to group_by() income level, and then summarize the data using a nifty mean_se() function, which gives me the mean AND the mean +/- standard error. This is a nice function to get the confidence band around the mean.

popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW") %>%
  drop_na() %>%
  select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
  pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate") %>%
  group_by(income_level, year) %>%
  summarise(mean_se(growth_rate)) #calculates mean + standard error of the mean.

## # A tibble: 72 × 5
## # Groups:   income_level [4]
##    income_level year      y  ymin  ymax
##    <fct>        <chr> <dbl> <dbl> <dbl>
##  1 Low income   2000   2.97  2.78  3.17
##  2 Low income   2001   2.91  2.76  3.05
##  3 Low income   2002   2.86  2.71  3.02
##  4 Low income   2003   2.84  2.66  3.01
##  5 Low income   2004   2.84  2.68  3.00
##  6 Low income   2005   2.87  2.73  3.01
##  7 Low income   2006   2.90  2.76  3.04
##  8 Low income   2007   2.91  2.77  3.06
##  9 Low income   2008   2.89  2.74  3.04
## 10 Low income   2009   2.82  2.67  2.97
## # … with 62 more rows

Ok, this looks good, so I’m going to save that data table as popgrowth_income.

popgrowth_income=popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW") %>%
  drop_na() %>%
  select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
  pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate") %>%
  group_by(income_level, year) %>%
  summarise(mean_se(growth_rate)) #calculates mean + standard error of the mean.

Now, we have the average (+/- standard error) population growth rate for countries at each income level across each year for 2000 to 2017. Here, the mean is “y” and lower and upper standard error is “ymin” and “ymax”, respectively. Let’s plot this. Let’s start with a simple line graph:

ggplot(popgrowth_income, aes(x=as.numeric(year), y=y, group=income_level)) +
  geom_line(aes(color=income_level)) +
  theme_bw()

We can make this fancy by adding the confidence band using geom_ribbon(), and cleaning up the axis labels:

ggplot(popgrowth_income, aes(x=as.numeric(year), y=y, group=income_level)) +
  geom_line(aes(color=income_level)) +
  geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, fill="gray") +
  theme_bw() +
  ylab("Average Population Growth Rate") +
  xlab("Year")

5. How does the relationship between population growth rate and income level differ between global regions?

Now, let’s look at how income levels have different effects on population growth rates depending on the region of the world. to do this, we have to pick a year to analyze. Let’s go with the latest data, 2017.

We can do that simply by going from the original popdat_countries (the raw data with the “aggregates” removed), then filtering it to just the data on population growth, then just looking at the 2017 data for each country. We’ll also keep the region and income level data. We’ll remove rows that have NAs.

popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW") %>%
  select(country, `2017`, region, income_level) %>%
  drop_na()

## # A tibble: 215 × 4
##    country  `2017` region                     income_level       
##    <chr>     <dbl> <chr>                      <fct>              
##  1 ABW      0.421  Latin America & Caribbean  High income        
##  2 AFG      2.49   South Asia                 Low income         
##  3 AGO      3.31   Sub-Saharan Africa         Lower middle income
##  4 ALB     -0.0920 Europe & Central Asia      Upper middle income
##  5 AND     -0.410  Europe & Central Asia      High income        
##  6 ARE      1.40   Middle East & North Africa High income        
##  7 ARG      0.961  Latin America & Caribbean  Upper middle income
##  8 ARM      0.192  Europe & Central Asia      Upper middle income
##  9 ASM      0.0755 East Asia & Pacific        Upper middle income
## 10 ATG      1.03   Latin America & Caribbean  High income        
## # … with 205 more rows

Ok, that looks like what we want, so I’ll save it as popdat_2017:

popdat_2017=popdat_countries %>% 
  dplyr::filter(indicator=="SP.POP.GROW") %>%
  select(country, `2017`, region, income_level) %>%
  drop_na()

Now we can plot population growth by income level, and use region to create facets. Here, I’m going to use income_level as both the x axis and color because the x-axis label is going to be hard to read.

ggplot(popdat_2017, aes(x=income_level, y=`2017`)) +
  geom_boxplot(aes(color=income_level)) +
  facet_wrap(~region, nrow=3)

I can see that some regions don’t have enough variation in income level to be useful (e.g., North America and South Asia). I can actually filter the data within the ggplot function to remove those:

Lower income countries have higher population growth rate than those with high income in some regions (sub-Saharan Africa), while in others growth rate is bigger for high earners.

ggplot(popdat_2017 %>% dplyr::filter(region!="North America" & region!="South Asia"), aes(x=income_level, y=`2017`)) +
  geom_boxplot(aes(color=income_level)) +
  facet_wrap(~region, nrow=2) +
  theme_bw() +
  theme(legend.position = "bottom") +
  ylab("Population Growth Rate in 2017")

It looks like the relationship between income level and population growth rate indeed changes between regions. For example, in Europe/Central Asia, Latin America/Caribbean, and Sub-Saharan Africa, countries with lower income have higher population growth rate. But the relationship is opposite in the Middle East & North Africa. There, the countries with higher incomes have higher population growth rate.

How would we test the idea that the relationship between income level and population growth rate depends on the region? One way is to run a linear model including the interaction effect of the two variables:

lm.fit=lm(`2017`~region*income_level, data=popdat_2017)

anova(lm.fit)

## Analysis of Variance Table
## 
## Response: 2017
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## region                6 131.828 21.9713 35.7109 < 2.2e-16 ***
## income_level          3   2.949  0.9830  1.5977  0.191299    
## region:income_level  12  20.996  1.7497  2.8438  0.001279 ** 
## Residuals           193 118.744  0.6153                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Worked Example: Wrangling World Bank Population Data

Dai Shizuka

updated 09/15/22

Things you will learn in this Worked Example

Set up: load packages

1. The Data: Population growth trends across the world

`world_bank_pop`: Population data from World Bank (2000-2017)

2. Exclude aggregate data to get data for countries only

3. Which are the world’s largest countries (in 2017)?

4. How population growth rates have changed across years, by income level of countries

5. How does the relationship between population growth rate and income level differ between global regions?

Worked Example: Wrangling World Bank Population Data

Dai Shizuka

updated 09/15/22

Things you will learn in this Worked Example

Set up: load packages

1. The Data: Population growth trends across the world

world_bank_pop: Population data from World Bank (2000-2017)

2. Exclude aggregate data to get data for countries only

3. Which are the world’s largest countries (in 2017)?

4. How population growth rates have changed across years, by income level of countries

5. How does the relationship between population growth rate and income level differ between global regions?

`world_bank_pop`: Population data from World Bank (2000-2017)