We will accomplish these while exploring the (admittedly superficial) patterns in population growth rates across countries.
By the end, you’ll be able to produce graphs like this:
library(wbstats)
library(tidyverse)
The World Bank actually makes it very easy to download a huge amount
of data very easily. You could just go to the World Bankd Open Data site and
search for data and download them in a few clicks. There are even
packages such as WDI
and wbstats
that allow
you to query and pull data from this site from within R.
However, for the purposes of this exercise, we will deal with several datasets that are included in the packages we have loaded above.
world_bank_pop
: Population data from World Bank
(2000-2017)First is the world_bank_pop
dataset that is included in
the tidyr
package (which is part of the the tidyverse
suite). Start by pulling up the help file for the dataset
?world_bank_pop
Let’s take a peak at the data, which is in “tibble” format:
world_bank_pop
## # A tibble: 1,056 × 20
## country indic…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4
## 2 ABW SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
## 3 ABW SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5
## 4 ABW SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1
## 5 AFG SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6
## 6 AFG SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0
## 7 AFG SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7
## 8 AFG SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0
## 9 AGO SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7
## 10 AGO SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0
## # … with 1,046 more rows, 10 more variables: `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, and abbreviated variable name
## # ¹indicator
Notice that each country is repeated across 4 rows. This is because there are actually 4 different “indicators” for each country, and so each country x indicator combination takes up a row. The rest of the columns are years, from 2000 to 2017.
Also notice that the numbers in the column names have a backquote, or “`” around them. In R, this allows numbers to be interpreted as text (and column names have to be text).
So, if you wanted to pull up the “2001” column, this will NOT work…
world_bank_pop$2001
…but this will work
world_bank_pop$`2001`
We are going to start by diving into the world_bank_pop
dataset.
Let’s take a peek at the dataset again and see what we have…
world_bank_pop
## # A tibble: 1,056 × 20
## country indic…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4
## 2 ABW SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
## 3 ABW SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5
## 4 ABW SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1
## 5 AFG SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6
## 6 AFG SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0
## 7 AFG SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7
## 8 AFG SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0
## 9 AGO SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7
## 10 AGO SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0
## # … with 1,046 more rows, 10 more variables: `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, and abbreviated variable name
## # ¹indicator
I want to start by looking at how many unique countries we actually have here.
length(unique(world_bank_pop$country)) #this gives us the number of unique values in the "country" column
## [1] 264
You can see that there are way more “countries” here than there are in the world. As of 2022, there are 193 countries in the United Nations, though if we add soverign states that are not recognized by UN (e.g., Taiwan, Kosovo, etc.), the list is about 206… and there are more if we include disputed territories.
The reason the World Bank data has even more than that is because some of the “countries” include “aggregates” like “Arab World” or “High-income Countries”.
Unfortunately, the world_bank_pop
dataset does not
include any variables that allow us to differentiate the countries from
the aggregates! However, we can solve this problem by
pulling the metadata for countries from the World Bank. We can do this
by using the wb_countries()
function from the
wbstats package.
Let’s pull the country metadata and save it as an object called “metadata”:
metadata=wb_countries()
metadata
## # A tibble: 299 × 18
## iso3c iso2c country capit…¹ longi…² latit…³ regio…⁴ regio…⁵ region admin…⁶
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 ABW AW Aruba Oranje… -70.0 12.5 LCN ZJ Latin… <NA>
## 2 AFE ZH Africa Ea… <NA> NA NA <NA> <NA> Aggre… <NA>
## 3 AFG AF Afghanist… Kabul 69.2 34.5 SAS 8S South… SAS
## 4 AFR A9 Africa <NA> NA NA <NA> <NA> Aggre… <NA>
## 5 AFW ZI Africa We… <NA> NA NA <NA> <NA> Aggre… <NA>
## 6 AGO AO Angola Luanda 13.2 -8.81 SSF ZG Sub-S… SSA
## 7 ALB AL Albania Tirane 19.8 41.3 ECS Z7 Europ… ECA
## 8 AND AD Andorra Andorr… 1.52 42.5 ECS Z7 Europ… <NA>
## 9 ARB 1A Arab World <NA> NA NA <NA> <NA> Aggre… <NA>
## 10 ARE AE United Ar… Abu Dh… 54.4 24.5 MEA ZQ Middl… <NA>
## # … with 289 more rows, 8 more variables: admin_region_iso2c <chr>,
## # admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## # income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## # lending_type <chr>, and abbreviated variable names ¹capital_city,
## # ²longitude, ³latitude, ⁴region_iso3c, ⁵region_iso2c, ⁶admin_region_iso3c
head(metadata$region)
## [1] "Latin America & Caribbean" "Aggregates"
## [3] "South Asia" "Aggregates"
## [5] "Aggregates" "Sub-Saharan Africa"
So this column shows us the “Aggregates”.
So, we can use this to filter the metadata to exclude the Aggregate codes.
metadata %>% dplyr::filter(region != "Aggregates")
## # A tibble: 218 × 18
## iso3c iso2c country capit…¹ longi…² latit…³ regio…⁴ regio…⁵ region admin…⁶
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 ABW AW Aruba Oranje… -70.0 12.5 LCN ZJ Latin… <NA>
## 2 AFG AF Afghanist… Kabul 69.2 34.5 SAS 8S South… SAS
## 3 AGO AO Angola Luanda 13.2 -8.81 SSF ZG Sub-S… SSA
## 4 ALB AL Albania Tirane 19.8 41.3 ECS Z7 Europ… ECA
## 5 AND AD Andorra Andorr… 1.52 42.5 ECS Z7 Europ… <NA>
## 6 ARE AE United Ar… Abu Dh… 54.4 24.5 MEA ZQ Middl… <NA>
## 7 ARG AR Argentina Buenos… -58.4 -34.6 LCN ZJ Latin… LAC
## 8 ARM AM Armenia Yerevan 44.5 40.2 ECS Z7 Europ… ECA
## 9 ASM AS American … Pago P… -171. -14.3 EAS Z4 East … EAP
## 10 ATG AG Antigua a… Saint … -61.8 17.1 LCN ZJ Latin… <NA>
## # … with 208 more rows, 8 more variables: admin_region_iso2c <chr>,
## # admin_region <chr>, income_level_iso3c <chr>, income_level_iso2c <chr>,
## # income_level <chr>, lending_type_iso3c <chr>, lending_type_iso2c <chr>,
## # lending_type <chr>, and abbreviated variable names ¹capital_city,
## # ²longitude, ³latitude, ⁴region_iso3c, ⁵region_iso2c, ⁶admin_region_iso3c
Let’s also pare down the data to a few variables we might be interested in, including region and income level:
metadata %>%
dplyr::filter(region != "Aggregates") %>%
select(iso3c, iso2c, country, region, income_level)
## # A tibble: 218 × 5
## iso3c iso2c country region income_level
## <chr> <chr> <chr> <chr> <chr>
## 1 ABW AW Aruba Latin America & Caribbean High income
## 2 AFG AF Afghanistan South Asia Low income
## 3 AGO AO Angola Sub-Saharan Africa Lower middle inc…
## 4 ALB AL Albania Europe & Central Asia Upper middle inc…
## 5 AND AD Andorra Europe & Central Asia High income
## 6 ARE AE United Arab Emirates Middle East & North Africa High income
## 7 ARG AR Argentina Latin America & Caribbean Upper middle inc…
## 8 ARM AM Armenia Europe & Central Asia Upper middle inc…
## 9 ASM AS American Samoa East Asia & Pacific Upper middle inc…
## 10 ATG AG Antigua and Barbuda Latin America & Caribbean High income
## # … with 208 more rows
Ok, this looks useful. Let’s save it as a data table called
countries
countries=metadata %>%
dplyr::filter(region != "Aggregates") %>%
select(iso3c, iso2c, country.name=country, region, income_level) %>%
mutate(income_level=factor(income_level, levels=c("Low income", "Lower middle income", "Upper middle income", "High income")))
Now, we are going to use the inner_join()
function to
make a version of the world bank population data that is only for actual
countries (not aggregates):
inner_join(world_bank_pop, countries, by=c("country" = "iso3c"))
## # A tibble: 868 × 24
## country indic…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ABW SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4
## 2 ABW SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
## 3 ABW SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5
## 4 ABW SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1
## 5 AFG SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6
## 6 AFG SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0
## 7 AFG SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7
## 8 AFG SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0
## 9 AGO SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7
## 10 AGO SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0
## # … with 858 more rows, 14 more variables: `2008` <dbl>, `2009` <dbl>,
## # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
## # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, iso2c <chr>, country.name <chr>,
## # region <chr>, income_level <fct>, and abbreviated variable name ¹indicator
You can see that the inner_join()
function worked if you
scroll all the way to the right of the output table.
Let’s rearrange the order of the columns using
relocate()
so that we can have all the useful metadata
appear first (left-most):
inner_join(world_bank_pop, countries, by=c("country" = "iso3c")) %>%
relocate(country, iso2c, country.name, region, income_level)
## # A tibble: 868 × 24
## country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
## <chr> <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ABW AW Aruba Latin… High i… SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4
## 2 ABW AW Aruba Latin… High i… SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0
## 3 ABW AW Aruba Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
## 4 ABW AW Aruba Latin… High i… SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0
## 5 AFG AF Afghanistan South… Low in… SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6
## 6 AFG AF Afghanistan South… Low in… SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0
## 7 AFG AF Afghanistan South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
## 8 AFG AF Afghanistan South… Low in… SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0
## 9 AGO AO Angola Sub-S… Lower … SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6
## 10 AGO AO Angola Sub-S… Lower … SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0
## # … with 858 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## # ²indicator
This looks good. Let’s save it as an object we’ll call
popdat_countries
popdat_countries=inner_join(world_bank_pop, countries, by=c("country" = "iso3c")) %>%
relocate(country, iso2c, country.name, region, income_level)
popdat_countries
## # A tibble: 868 × 24
## country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
## <chr> <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ABW AW Aruba Latin… High i… SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4
## 2 ABW AW Aruba Latin… High i… SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0
## 3 ABW AW Aruba Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
## 4 ABW AW Aruba Latin… High i… SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0
## 5 AFG AF Afghanistan South… Low in… SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6
## 6 AFG AF Afghanistan South… Low in… SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0
## 7 AFG AF Afghanistan South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
## 8 AFG AF Afghanistan South… Low in… SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0
## 9 AGO AO Angola Sub-S… Lower … SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6
## 10 AGO AO Angola Sub-S… Lower … SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0
## # … with 858 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## # ²indicator
As mentioned at the beginning, each row of data here is a country x indicator combination, with 4 different indicators. The indicators are:
SP.POP.GROW = population growth
SP.POP.TOTL = total population
SP.URB.GROW = urban population growth
SP.URB.TOTL = total urban population
So, we will create a data table called pop_totals
that
is just the “total population” data.
pop_totals=popdat_countries %>%
dplyr::filter(indicator=="SP.POP.TOTL")
pop_totals
## # A tibble: 217 × 24
## country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
## <chr> <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ABW AW Aruba Latin… High i… SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4
## 2 AFG AF Afghanistan South… Low in… SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7
## 3 AGO AO Angola Sub-S… Lower … SP.POP… 1.64e7 1.70e7 1.76e7 1.82e7
## 4 ALB AL Albania Europ… Upper … SP.POP… 3.09e6 3.06e6 3.05e6 3.04e6
## 5 AND AD Andorra Europ… High i… SP.POP… 6.54e4 6.73e4 7.00e4 7.32e4
## 6 ARE AE United Arab… Middl… High i… SP.POP… 3.15e6 3.33e6 3.51e6 3.74e6
## 7 ARG AR Argentina Latin… Upper … SP.POP… 3.71e7 3.75e7 3.79e7 3.83e7
## 8 ARM AM Armenia Europ… Upper … SP.POP… 3.07e6 3.05e6 3.03e6 3.02e6
## 9 ASM AS American Sa… East … Upper … SP.POP… 5.75e4 5.82e4 5.87e4 5.91e4
## 10 ATG AG Antigua and… Latin… High i… SP.POP… 8.36e4 8.51e4 8.63e4 8.73e4
## # … with 207 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## # ²indicator
From this population size data, let’s figure out what is the 90th
percentile in population size in 2017. We are going to get this number
by doing pull()
to get the 2017 population sizes as a
vector, and use quantile()
to get the 90th percentile. Save
that value as upper_pop
.
upper_pop=pop_totals %>%
pull(`2017`) %>%
quantile(., probs=0.90, na.rm=T) # this allows us to find the value at Nth percentile
upper_pop
## 90%
## 63286844
So, the 90th percentile of population sizes in 2017 is 63,286,844
Now, we can use this 90th percentile value to filter the population
size data to just the largest (>= 90th percentile) countries. We will
create a one-column tibble called large_countries()
to save
this list of country codes.
pop_totals %>%
dplyr::filter(`2017` > upper_pop)
## # A tibble: 22 × 24
## country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
## <chr> <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 BGD BD Bangladesh South… Lower … SP.POP… 1.32e8 1.34e8 1.37e8 1.39e8
## 2 BRA BR Brazil Latin… Upper … SP.POP… 1.75e8 1.78e8 1.80e8 1.82e8
## 3 CHN CN China East … Upper … SP.POP… 1.26e9 1.27e9 1.28e9 1.29e9
## 4 COD CD Congo, Dem.… Sub-S… Low in… SP.POP… 4.71e7 4.84e7 4.98e7 5.14e7
## 5 DEU DE Germany Europ… High i… SP.POP… 8.22e7 8.23e7 8.25e7 8.25e7
## 6 EGY EG Egypt, Arab… Middl… Lower … SP.POP… 6.99e7 7.12e7 7.26e7 7.40e7
## 7 ETH ET Ethiopia Sub-S… Low in… SP.POP… 6.65e7 6.85e7 7.05e7 7.25e7
## 8 FRA FR France Europ… High i… SP.POP… 6.09e7 6.14e7 6.18e7 6.22e7
## 9 GBR GB United King… Europ… High i… SP.POP… 5.89e7 5.91e7 5.94e7 5.96e7
## 10 IDN ID Indonesia East … Lower … SP.POP… 2.12e8 2.15e8 2.18e8 2.21e8
## # … with 12 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## # ²indicator
Ok, now let’s select just the country code, country name and
population size of the world’s largest countries (90th percentile) and
arrange it in descending order using arrange()
largest_countries=pop_totals %>%
dplyr::filter(`2017` > upper_pop) %>%
select(country, country.name, `2017`) %>%
arrange(desc(`2017`))
print(largest_countries, n=22)
## # A tibble: 22 × 3
## country country.name `2017`
## <chr> <chr> <dbl>
## 1 CHN China 1386395000
## 2 IND India 1339180127
## 3 USA United States 325719178
## 4 IDN Indonesia 263991379
## 5 BRA Brazil 209288278
## 6 PAK Pakistan 197015955
## 7 NGA Nigeria 190886311
## 8 BGD Bangladesh 164669751
## 9 RUS Russian Federation 144495044
## 10 MEX Mexico 129163276
## 11 JPN Japan 126785797
## 12 ETH Ethiopia 104957438
## 13 PHL Philippines 104918090
## 14 EGY Egypt, Arab Rep. 97553151
## 15 VNM Vietnam 95540800
## 16 DEU Germany 82695000
## 17 COD Congo, Dem. Rep. 81339988
## 18 IRN Iran, Islamic Rep. 81162788
## 19 TUR Turkiye 80745020
## 20 THA Thailand 69037513
## 21 FRA France 67118648
## 22 GBR United Kingdom 66022273
Now, let’s create a tibble that contains the population growth data as well as the income level designation for each country.
I’m also going to remove countries for which we are missing data
using drop_na()
.
popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
drop_na()
## # A tibble: 214 × 24
## country iso2c country.name region incom…¹ indic…² `2000` `2001` `2002` `2003`
## <chr> <chr> <chr> <chr> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ABW AW Aruba Latin… High i… SP.POP… 2.06 2.23 2.23 2.11
## 2 AFG AF Afghanistan South… Low in… SP.POP… 3.49 4.25 4.72 4.82
## 3 AGO AO Angola Sub-S… Lower … SP.POP… 3.03 3.25 3.41 3.53
## 4 ALB AL Albania Europ… Upper … SP.POP… -0.637 -0.938 -0.300 -0.374
## 5 AND AD Andorra Europ… High i… SP.POP… 1.57 2.94 3.94 4.38
## 6 ARE AE United Arab… Middl… High i… SP.POP… 5.43 5.28 5.30 6.48
## 7 ARG AR Argentina Latin… Upper … SP.POP… 1.11 1.11 1.11 1.10
## 8 ARM AM Armenia Europ… Upper … SP.POP… -0.631 -0.619 -0.551 -0.532
## 9 ASM AS American Sa… East … Upper … SP.POP… 1.31 1.13 0.951 0.655
## 10 ATG AG Antigua and… Latin… High i… SP.POP… 2.12 1.75 1.41 1.18
## # … with 204 more rows, 14 more variables: `2004` <dbl>, `2005` <dbl>,
## # `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
## # `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
## # `2016` <dbl>, `2017` <dbl>, and abbreviated variable names ¹income_level,
## # ²indicator
Now, we are going to reshape this data into long-format.
To do this, I’m going to first pare down by removing columns we don’t
need. I’m going to keep “country” because I want the value for each
country for each year. I’m also going to keep “income_level” because I’m
going to use that to summarise the population growth data later. Then,
I’m going to use the pivot_longer()
function to rearrange
all of the columns with years as the heading into two columns: one
column with the year, and one column with the population growth
rate.
popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
drop_na() %>%
select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate")
## # A tibble: 3,852 × 4
## country income_level year growth_rate
## <chr> <fct> <chr> <dbl>
## 1 ABW High income 2000 2.06
## 2 ABW High income 2001 2.23
## 3 ABW High income 2002 2.23
## 4 ABW High income 2003 2.11
## 5 ABW High income 2004 1.76
## 6 ABW High income 2005 1.30
## 7 ABW High income 2006 0.798
## 8 ABW High income 2007 0.384
## 9 ABW High income 2008 0.131
## 10 ABW High income 2009 0.0986
## # … with 3,842 more rows
Ok, now that we have the data in long format, it is easier to summarize the annual population growth rates across all years by income level.
To do this, I’m going to group_by()
income level, and
then summarize the data using a nifty mean_se()
function,
which gives me the mean AND the mean +/- standard error. This is a nice
function to get the confidence band around the mean.
popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
drop_na() %>%
select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate") %>%
group_by(income_level, year) %>%
summarise(mean_se(growth_rate)) #calculates mean + standard error of the mean.
## # A tibble: 72 × 5
## # Groups: income_level [4]
## income_level year y ymin ymax
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 Low income 2000 2.97 2.78 3.17
## 2 Low income 2001 2.91 2.76 3.05
## 3 Low income 2002 2.86 2.71 3.02
## 4 Low income 2003 2.84 2.66 3.01
## 5 Low income 2004 2.84 2.68 3.00
## 6 Low income 2005 2.87 2.73 3.01
## 7 Low income 2006 2.90 2.76 3.04
## 8 Low income 2007 2.91 2.77 3.06
## 9 Low income 2008 2.89 2.74 3.04
## 10 Low income 2009 2.82 2.67 2.97
## # … with 62 more rows
Ok, this looks good, so I’m going to save that data table as
popgrowth_income
.
popgrowth_income=popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
drop_na() %>%
select(-indicator, -country.name, -iso2c, -region) %>% #pare down the data a bit & pivot longer
pivot_longer(cols=-c(country, income_level), names_to="year", values_to="growth_rate") %>%
group_by(income_level, year) %>%
summarise(mean_se(growth_rate)) #calculates mean + standard error of the mean.
Now, we have the average (+/- standard error) population growth rate for countries at each income level across each year for 2000 to 2017. Here, the mean is “y” and lower and upper standard error is “ymin” and “ymax”, respectively. Let’s plot this. Let’s start with a simple line graph:
ggplot(popgrowth_income, aes(x=as.numeric(year), y=y, group=income_level)) +
geom_line(aes(color=income_level)) +
theme_bw()
We can make this fancy by adding the confidence band using
geom_ribbon()
, and cleaning up the axis labels:
ggplot(popgrowth_income, aes(x=as.numeric(year), y=y, group=income_level)) +
geom_line(aes(color=income_level)) +
geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, fill="gray") +
theme_bw() +
ylab("Average Population Growth Rate") +
xlab("Year")
Now, let’s look at how income levels have different effects on population growth rates depending on the region of the world. to do this, we have to pick a year to analyze. Let’s go with the latest data, 2017.
We can do that simply by going from the original
popdat_countries
(the raw data with the “aggregates”
removed), then filtering it to just the data on population growth, then
just looking at the 2017 data for each country. We’ll also keep the
region and income level data. We’ll remove rows that have NAs.
popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
select(country, `2017`, region, income_level) %>%
drop_na()
## # A tibble: 215 × 4
## country `2017` region income_level
## <chr> <dbl> <chr> <fct>
## 1 ABW 0.421 Latin America & Caribbean High income
## 2 AFG 2.49 South Asia Low income
## 3 AGO 3.31 Sub-Saharan Africa Lower middle income
## 4 ALB -0.0920 Europe & Central Asia Upper middle income
## 5 AND -0.410 Europe & Central Asia High income
## 6 ARE 1.40 Middle East & North Africa High income
## 7 ARG 0.961 Latin America & Caribbean Upper middle income
## 8 ARM 0.192 Europe & Central Asia Upper middle income
## 9 ASM 0.0755 East Asia & Pacific Upper middle income
## 10 ATG 1.03 Latin America & Caribbean High income
## # … with 205 more rows
Ok, that looks like what we want, so I’ll save it as
popdat_2017
:
popdat_2017=popdat_countries %>%
dplyr::filter(indicator=="SP.POP.GROW") %>%
select(country, `2017`, region, income_level) %>%
drop_na()
Now we can plot population growth by income level, and use region to create facets. Here, I’m going to use income_level as both the x axis and color because the x-axis label is going to be hard to read.
ggplot(popdat_2017, aes(x=income_level, y=`2017`)) +
geom_boxplot(aes(color=income_level)) +
facet_wrap(~region, nrow=3)
I can see that some regions don’t have enough variation in income level to be useful (e.g., North America and South Asia). I can actually filter the data within the ggplot function to remove those:
Lower income countries have higher population growth rate than those with high income in some regions (sub-Saharan Africa), while in others growth rate is bigger for high earners.
ggplot(popdat_2017 %>% dplyr::filter(region!="North America" & region!="South Asia"), aes(x=income_level, y=`2017`)) +
geom_boxplot(aes(color=income_level)) +
facet_wrap(~region, nrow=2) +
theme_bw() +
theme(legend.position = "bottom") +
ylab("Population Growth Rate in 2017")
It looks like the relationship between income level and population growth rate indeed changes between regions. For example, in Europe/Central Asia, Latin America/Caribbean, and Sub-Saharan Africa, countries with lower income have higher population growth rate. But the relationship is opposite in the Middle East & North Africa. There, the countries with higher incomes have higher population growth rate.
How would we test the idea that the relationship between income level and population growth rate depends on the region? One way is to run a linear model including the interaction effect of the two variables:
lm.fit=lm(`2017`~region*income_level, data=popdat_2017)
anova(lm.fit)
## Analysis of Variance Table
##
## Response: 2017
## Df Sum Sq Mean Sq F value Pr(>F)
## region 6 131.828 21.9713 35.7109 < 2.2e-16 ***
## income_level 3 2.949 0.9830 1.5977 0.191299
## region:income_level 12 20.996 1.7497 2.8438 0.001279 **
## Residuals 193 118.744 0.6153
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1