This Post Assumes Some Familiarity with {WDI} ⤵️

My undergraduate students reading this post, thinking about potential topics for their quantitative methods course or their C-papers, should read my earlier tutorial on how to use the {WDI} package in R.

There's no business like Mr. Jim Business

Students in my quantitative methods class are (ideally) thinking about their end-of-the-course short papers and the BA theses that will (ideally!) make use of some of the methods and techniques I teach them. Part of that entails thinking of a question that can be answered with these methods and finding data to explore. That naturally draws the student to the World Bank, which maintains a nice repository of data on a whole host of topics. If you’re interested in topics of economic development, population growth, corruption, education levels—or almost anything else in the cross-national context—the World Bank’s DataBank has you covered.

What’s less nice is how a student would think to obtain the data that interests them. The student might end up at a portal like this one. They’d have to fumble through what exact database they want, select what countries they want and over what time period, and then download the data to an Excel file. The Excel file would be less than appetizing to look at: years appear as columns, with unhelpful names like X2010 for observations in the year 2010. This particular format might overwhelm the student if they wanted to add anything to it, especially if they had the whole international system along with assorted regional or economic groups.

There’s a better way, I promise. Use the {WDI} package in R, and consult this previous guide of mine. All you need are the {WDI} package, an idea of what you want, and the knowledge of how the World Bank communicates indicators to you. The {WDI} package will get what you want and some assorted “tidy” verbs will convert what {WDI} returns to a simple cross-sectional data set for you to explore.

First, here are the R packages I’ll be using. My students should have all these installed by virtue of the course description, except for the {WDI} package.

library(tidyverse)    # for most things
library(stevedata)    # v. 1.4, for country_isocodes
library(stevemisc)    # for lag_at()
library(WDI)          # the star of the show for this post
library(modelsummary) # for the regression table at the end.
library(tinytable)    # OPTIONAL, for you: for customizing the regression table at the end.

Here’s a table of contents.

  1. An Applied Example: Some Economic Indicators and the “Doing Business” Project
  2. Convert a Panel to a Cross-Section (From “Easiest” to “Still Easy (but with Five More Lines of Code)”)
  3. A Simple Regression, and a Conclusion

Alrightie, let’s get started.

An Applied Example: Some Economic Indicators and the “Doing Business” Project

My previous guide mentioned that I had a PhD student from my time at Clemson University who was interested in the following indicators available on the World Bank. These are access to electricity (as a percent of the population) [EG.ELC.ACCS.ZS], the current account balance [BN.CAB.XOKA.GD.ZS], the “ease of doing business” score [IC.BUS.DFRN.XQ], the consumer price index [FP.CPI.TOTL.ZG], and the interest rate spread [FR.INR.LNDP].

Here’s where I’ll note, especially as I don’t want students simply mimicking me: I forget why my student wanted these indicators. I only remember that he wanted them (and that he was interested in Sub-Saharan Africa). I can tell you what these assorted variables are, and even point you to the Doing Business project for more information on what that particular estimate is communicating.1 I don’t know what relationship he was interested in exploring, though, and you should definitely know what you’re doing and why you’re doing it. What follows is theoretically thoughtless, and that’s not permission for you to be theoretically thoughtless too. Still, it’s fine for the intended purpose: teaching students how to make simple cross-sectional data sets from data made available by the World Bank.

Now, let’s fire up the WDI() function knowing what information we want from the World Bank. Here’s the function we’re going to call, and let me explain more after the code block.

WDI(indicator = c("aepp" = "EG.ELC.ACCS.ZS", # access to electricity
                  "cab" = "BN.CAB.XOKA.GD.ZS", # current account balance
                  "edb" = "IC.BUS.DFRN.XQ", # ease of doing business
                  "cpi" = "FP.CPI.TOTL.ZG", # CPI
                  "irs" = "FR.INR.LNDP"), # interest rate spread
    start = 2014, end = 2019,
    country = country_isocodes$iso3c) %>% 
  as_tibble() -> rawData

First, the indicator argument in the WDI() function takes the indicators of interest, as stored by the World Bank. The guide I wrote in 2021 should communicate how you could minimally use the indicator argument in this function, though I’m doing what the package author recommends doing if you know you’re going to be renaming your columns anyway. In the above function, we’re grabbing the access to electricity indicator (EG.ELC.ACCS.ZS) and, once we do, we’re going to assign it to a column called aepp. Likewise, we’re going to grab the current account balance indicator (BN.CAB.XOKA.GD.ZS) and assign it to a column called cab. From there, you should be able to see how to do this for the three remaining columns.
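For completeness, here’s a minimal sketch of that non-renaming approach (using just the ease of doing business indicator, for illustration): the column comes back named by the raw indicator code, and you rename it yourself.

WDI(indicator = "IC.BUS.DFRN.XQ",
    start = 2014, end = 2019) %>%
  rename(edb = IC.BUS.DFRN.XQ) %>% # rename the raw code to something legible
  as_tibble()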

Next, let’s think a little bit about what we’re doing here. For this case, let’s treat the ease of doing business score as our dependent variable (i.e. the thing we want to explain). I can see from exploring the World Bank’s data repository that the Doing Business project was discontinued as of Sept. 16, 2021. The most recent year for which it has data is 2019. Knowing the data are fairly recent, and that I’m interested in a simple cross-sectional analysis, it would be a waste of time to ask for information from too far before the most recent year. Thus, I want to focus on just a few years: let’s say 2014 to 2019. That explains the start = 2014 and end = 2019 arguments you see in the code above.

Finally, let’s not overwhelm ourselves with what WDI() will return without any additional guidance. WDI() works primarily with ISO codes, but, by default, it will return everything for which it could plausibly have data. This includes countries (e.g. Sweden, the United States, Mexico) but also assorted regional groupings (e.g. North America, Latin America & the Caribbean), organizational groupings (e.g. European Union, OECD states), economic groupings (e.g. HIPCs, LDCs), and even the world (among some others). This would be a good opportunity to both know your state classification systems and know the population of cases you ultimately want to describe. You probably care just about sovereign states (“countries”), so why ask for the other stuff? By default, WDI() will get that for you unless you supply something different in the country argument.
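If you’d rather not supply country codes at all, an alternative sketch is to ask for everything with extra = TRUE and drop the non-country groupings after the fact; to the best of my knowledge, the region column it returns labels those groupings as “Aggregates”.

WDI(indicator = c("edb" = "IC.BUS.DFRN.XQ"),
    start = 2014, end = 2019, extra = TRUE) %>%
  filter(region != "Aggregates") %>% # drop regional/economic groupings
  as_tibble() -> justCountries

Still, supplying the ISO codes you want up front is the cleaner habit.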

That’s one reason why I have the country_isocodes data set in {stevedata}: it allows for some convenient subsetting. Here’s a simple summary of that data set.

country_isocodes
#> # A tibble: 249 × 4
#>    iso2c iso3c iso3n name                
#>    <chr> <chr> <chr> <chr>               
#>  1 AW    ABW   533   Aruba               
#>  2 AF    AFG   004   Afghanistan         
#>  3 AO    AGO   024   Angola              
#>  4 AI    AIA   660   Anguilla            
#>  5 AX    ALA   248   Åland Islands       
#>  6 AL    ALB   008   Albania             
#>  7 AD    AND   020   Andorra             
#>  8 AE    ARE   784   United Arab Emirates
#>  9 AR    ARG   032   Argentina           
#> 10 AM    ARM   051   Armenia             
#> # ℹ 239 more rows

The country argument in WDI() takes either two-character or three-character ISO codes and returns all observations included in what you asked. If you wanted just the United States, Canada, and Mexico, it would be something like country = c("US", "MX", "CA") or the three-character equivalent of country = c("USA", "MEX", "CAN"). In our simple example, however, it’s anything in the iso2c column in the country_isocodes data. Be forewarned that WDI() is verbose and will alert you to anything it can’t find in the World Bank data (e.g. the World Bank has no data for Åland Islands), though the message it returns (and which I suppress here) is just a warning and not an error, per se.
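For example, a quick sketch if you wanted just those three North American states (the indicator here is only for illustration):

WDI(indicator = c("edb" = "IC.BUS.DFRN.XQ"),
    country = c("USA", "MEX", "CAN"),
    start = 2014, end = 2019) %>%
  as_tibble()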

Run the full WDI() call from the top of this section and this is what will come back.2

rawData
#> # A tibble: 1,290 × 9
#>    country     iso2c iso3c  year  aepp    cab   edb    cpi   irs
#>    <chr>       <chr> <chr> <int> <dbl>  <dbl> <dbl>  <dbl> <dbl>
#>  1 Afghanistan AF    AFG    2014  89.5 -15.8   NA    4.67  NA   
#>  2 Afghanistan AF    AFG    2015  71.5 -21.9   39.3 -0.662 NA   
#>  3 Afghanistan AF    AFG    2016  97.7 -15.0   38.9  4.38  NA   
#>  4 Afghanistan AF    AFG    2017  97.7 -19.0   37.1  4.98  NA   
#>  5 Afghanistan AF    AFG    2018  93.4 -21.6   44.2  0.626 NA   
#>  6 Afghanistan AF    AFG    2019  97.7 -20.2   44.1  2.30  NA   
#>  7 Albania     AL    ALB    2014 100   -10.8   NA    1.63   6.06
#>  8 Albania     AL    ALB    2015 100    -8.60  58.1  1.90   6.48
#>  9 Albania     AL    ALB    2016  99.9  -7.59  64.2  1.28   5.90
#> 10 Albania     AL    ALB    2017  99.9  -7.54  66.8  1.99   5.45
#> # ℹ 1,280 more rows

Some basic exploration of the output will show that there are often observations for which we have no data whatsoever on a key indicator, like interest rate spreads for Afghanistan or Austria, the consumer price index for Argentina and Eritrea, or the current account balance for Chad. Some have situational missingness (e.g. four years of missing data on interest rate spreads for Bahrain, three years of the consumer price index for Tajikistan). One observation, American Samoa, has no information whatsoever and should not be included.
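If you want to see that for yourself, here’s one quick sketch (given the rawData object above) that counts the missing values by country across our five indicators:

rawData %>%
  group_by(country) %>%
  summarize(across(aepp:irs, ~sum(is.na(.x)))) %>% # count NAs per indicator
  filter(if_any(aepp:irs, ~.x > 0)) # keep countries with any missingness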

Convert a Panel to a Cross-Section (From “Easiest” to “Still Easy (but with Five More Lines of Code)”)

The data created above and assigned to an object called rawData is what we’d call a “panel” in the social sciences. “Panels” are individual units observed repeatedly over (effectively) the same period of time. There are a few options for converting such a panel to what we’d call a “cross-section” (i.e. observations all gathered at (around) the same time, with no temporal component). These range from “easiest” to “still easy (but with five more lines of code)”.

Easiest: Subset to a Single Year (e.g. Most Recent Year)

The easiest would be a simple subset of the panel to a single year of observation. In the data created above, this would be a simple matter of filtering the data to, say, 2019 (which happens to be the most recent year).

rawData %>% 
  filter(year == 2019) -> Option1

Option1
#> # A tibble: 215 × 9
#>    country             iso2c iso3c  year  aepp     cab   edb   cpi   irs
#>    <chr>               <chr> <chr> <int> <dbl>   <dbl> <dbl> <dbl> <dbl>
#>  1 Afghanistan         AF    AFG    2019  97.7 -20.2    44.1  2.30 NA   
#>  2 Albania             AL    ALB    2019 100    -7.91   67.7  1.41  5.78
#>  3 Algeria             DZ    DZA    2019  99.5  -8.76   48.6  1.95  6.25
#>  4 American Samoa      AS    ASM    2019  NA    NA      NA   NA    NA   
#>  5 Andorra             AD    AND    2019 100    18.0    NA   NA    NA   
#>  6 Angola              AO    AGO    2019  45.6   7.25   41.3 17.1  12.9 
#>  7 Antigua and Barbuda AG    ATG    2019 100    -6.51   60.3  1.43  7.03
#>  8 Argentina           AR    ARG    2019 100    -0.780  59.0 NA    20.0 
#>  9 Armenia             AM    ARM    2019 100    -7.06   74.5  1.44  3.66
#> 10 Aruba               AW    ABW    2019 100     2.50   NA    4.26  3.5 
#> # ℹ 205 more rows

Option1 %>%
  na.omit
#> # A tibble: 99 × 9
#>    country             iso2c iso3c  year  aepp    cab   edb   cpi   irs
#>    <chr>               <chr> <chr> <int> <dbl>  <dbl> <dbl> <dbl> <dbl>
#>  1 Albania             AL    ALB    2019 100   -7.91   67.7  1.41  5.78
#>  2 Algeria             DZ    DZA    2019  99.5 -8.76   48.6  1.95  6.25
#>  3 Angola              AO    AGO    2019  45.6  7.25   41.3 17.1  12.9 
#>  4 Antigua and Barbuda AG    ATG    2019 100   -6.51   60.3  1.43  7.03
#>  5 Armenia             AM    ARM    2019 100   -7.06   74.5  1.44  3.66
#>  6 Australia           AU    AUS    2019 100    0.350  81.2  1.61  3.54
#>  7 Azerbaijan          AZ    AZE    2019 100    8.70   78.5  2.61  7.59
#>  8 Bahamas, The        BS    BHS    2019 100   -2.65   59.9  2.49  3.66
#>  9 Bangladesh          BD    BGD    2019  92.2 -0.839  45.0  5.59  2.78
#> 10 Belarus             BY    BLR    2019 100   -1.93   74.3  5.60  2.52
#> # ℹ 89 more rows

The above code shows that we have 215 cross-sectional units, but any regression model we employ on these data would have just 99 observations because of missing data either in what’s going to be our dependent variable (the ease of doing business score for Andorra), or independent variables (e.g. the consumer price index for Argentina), or both (e.g. American Samoa).
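You can confirm both of those numbers directly from the output above:

nrow(Option1)
#> [1] 215
nrow(Option1 %>% na.omit)
#> [1] 99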

No matter, this is the path of the absolute least resistance for converting a panel to a cross-section. You can’t fail with this route, but the effort required to do this matches the effort that went into thinking about the desirability of this option.

Also Easy: Lag the IVs a Year, then Subset

There are two things that present themselves in our data that are teachable moments. First, this isn’t the kind of class where I can spam the word “endogeneity” at students, but some basic logic suggests it’s perilous to treat the ease of doing business score in 2019 as a function of the interest rate spread in 2019. Both are observed (effectively) at the same time. Discerning causal relationships is hard enough as it is, and it’s why practitioners like to lag independent variables by a time period (year, in this case). We can at least say with confidence that 2018 observations can only affect 2019 observations (in the dependent variable), and that 2019 cannot affect 2018.3 My recent discussion of Mitchell’s (1968) analysis of inequality and government control in South Vietnam comes with an appreciation that even he was aware of this. His analysis is careful to make sure everything that could possibly explain South Vietnamese control of its provinces in 1965 is observed before 1965.

First, let’s take a look at New Zealand as a proof of concept for some of the information gain we’re going to get by a year lag.

rawData %>% filter(iso2c == "NZ")
#> # A tibble: 6 × 9
#>   country     iso2c iso3c  year  aepp   cab   edb   cpi   irs
#>   <chr>       <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 New Zealand NZ    NZL    2014   100 -3.09  NA   1.23   1.79
#> 2 New Zealand NZ    NZL    2015   100 -2.62  87.1 0.293  2.03
#> 3 New Zealand NZ    NZL    2016   100 -2.13  87.2 0.646  1.79
#> 4 New Zealand NZ    NZL    2017   100 -2.81  87.0 1.85   1.46
#> 5 New Zealand NZ    NZL    2018   100 -4.05  87.0 1.60  -3.26
#> 6 New Zealand NZ    NZL    2019   100 -2.79  86.8 1.62  NA

New Zealand has missing data for the interest rate spread in 2019, but the panel is otherwise complete for the other observations. Taking a year lag allows us to keep New Zealand in our data.

I have a suite of functions—my so-called _at() functions—for applying a single function to multiple columns in one fell swoop. lag_at(), in this case, creates lagged variables with a prefix of l[o]_, where o corresponds with the order of the lag. The default here is 1, as we want just a single year lag. We can (and must) specify that these are grouped data, so we’re not lagging Albania’s observation of a current account balance in 2014 based on Afghanistan’s observation in 2019.

Let’s observe what this does.

rawData %>% 
  lag_at(c("aepp", "cab", "cpi", "irs"),
        .by = iso2c)
#> # A tibble: 1,290 × 13
#>    country     iso2c iso3c  year  aepp    cab   edb    cpi   irs l1_aepp l1_cab
#>    <chr>       <chr> <chr> <int> <dbl>  <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl>
#>  1 Afghanistan AF    AFG    2014  89.5 -15.8   NA    4.67  NA       NA    NA   
#>  2 Afghanistan AF    AFG    2015  71.5 -21.9   39.3 -0.662 NA       89.5 -15.8 
#>  3 Afghanistan AF    AFG    2016  97.7 -15.0   38.9  4.38  NA       71.5 -21.9 
#>  4 Afghanistan AF    AFG    2017  97.7 -19.0   37.1  4.98  NA       97.7 -15.0 
#>  5 Afghanistan AF    AFG    2018  93.4 -21.6   44.2  0.626 NA       97.7 -19.0 
#>  6 Afghanistan AF    AFG    2019  97.7 -20.2   44.1  2.30  NA       93.4 -21.6 
#>  7 Albania     AL    ALB    2014 100   -10.8   NA    1.63   6.06    NA    NA   
#>  8 Albania     AL    ALB    2015 100    -8.60  58.1  1.90   6.48   100   -10.8 
#>  9 Albania     AL    ALB    2016  99.9  -7.59  64.2  1.28   5.90   100    -8.60
#> 10 Albania     AL    ALB    2017  99.9  -7.54  66.8  1.99   5.45    99.9  -7.59
#> # ℹ 1,280 more rows
#> # ℹ 2 more variables: l1_cpi <dbl>, l1_irs <dbl>

Notice lag_at() takes a character vector corresponding with the columns for which the user wants lags and creates new columns with that lagged information. Because we wanted just a lag of order 1 (i.e. the default), we get four new columns of l1_aepp, l1_cab, l1_cpi, and l1_irs corresponding with the first-order lags of access to electricity, current account balance, consumer price index, and interest rate spread (respectively).
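For intuition, here’s a rough dplyr-only equivalent of what lag_at() is doing for us here. This is a sketch, assuming dplyr >= 1.1 (for the .by argument) and that rows are sorted by year within country, which is how WDI() returns them:

rawData %>%
  mutate(across(c(aepp, cab, cpi, irs), ~lag(.x, 1), # first-order lags
                .names = "l1_{.col}"),
         .by = iso2c) # lag within country, not across countries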

Now that we see what it does, let’s do our second option. Notice how easy this is: it’s just two more lines of code, and the second of those is optional (it’s using the select() function to do some column management).

rawData %>% 
  lag_at(c("aepp", "cab", "cpi", "irs"),
        .by = iso2c) %>%
  select(country:year, edb, l1_aepp:l1_irs) %>%
  filter(year == 2019) -> Option2

Option2 %>%
  na.omit
#> # A tibble: 102 × 9
#>    country             iso2c iso3c  year   edb l1_aepp   l1_cab l1_cpi l1_irs
#>    <chr>               <chr> <chr> <int> <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
#>  1 Albania             AL    ALB    2019  67.7   100    -6.70     2.03   5.18
#>  2 Algeria             DZ    DZA    2019  48.6    99.6  -8.69     4.27   6.25
#>  3 Angola              AO    AGO    2019  41.3    45.3   9.32    19.6   13.8 
#>  4 Antigua and Barbuda AG    ATG    2019  60.3   100   -14.0      1.21   7.32
#>  5 Armenia             AM    ARM    2019  74.5    99.9  -7.23     2.52   4.13
#>  6 Australia           AU    AUS    2019  81.2   100    -2.23     1.91   3.28
#>  7 Azerbaijan          AZ    AZE    2019  78.5   100    12.7      2.27   7.17
#>  8 Bahamas, The        BS    BHS    2019  59.9   100    -8.84     2.27   3.41
#>  9 Bangladesh          BD    BGD    2019  45.0    86.9  -2.21     5.54   2.99
#> 10 Belarus             BY    BLR    2019  74.3    99.3   0.0381   4.87   2.78
#> # ℹ 92 more rows

This approach gains us three observations that would otherwise be lost to missingness in 2019. anti_join() will tell us what these observations are.

anti_join(Option2 %>% na.omit, Option1 %>% na.omit)
#> # A tibble: 3 × 9
#>   country     iso2c iso3c  year   edb l1_aepp l1_cab l1_cpi l1_irs
#>   <chr>       <chr> <chr> <int> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Chile       CL    CHL    2019  72.6   100    -4.48   2.43   1.48
#> 2 New Zealand NZ    NZL    2019  86.8   100    -4.05   1.60  -3.26
#> 3 Uganda      UG    UGA    2019  60.0    41.9  -6.33   2.62  10.5

While it’s not always the case that you “gain” more observations with this route, it happens to be the case here, and we’ve demonstrated that we’ve thought through a rudimentary concern in the social sciences. In our data, 2018 can only explain (“cause”) variation in 2019, and not the other way around.

Still Easy (but with Five More Lines of Code): Fill Based on Most Recent Available Year

We could alternatively take a page from what I see the Quality of Government project doing with its cross-sectional data. In their data, as of Jan. 2024, observations are included from 2020. If 2020 is not available, it will take 2021. If no data exist for 2021, it’ll take 2019. No matter, the cross-sectional data frame is effectively one that “fills” to a referent year based on what’s available in the referent year or the years surrounding it.
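In code terms, that preference ordering is basically a coalesce. Here’s a toy sketch with hypothetical wide columns x2019, x2020, and x2021 (take 2020 if you can, then 2021, then 2019):

tibble(x2019 = c(1, 2, 3),
       x2020 = c(10, NA, NA),
       x2021 = c(NA, 20, NA)) %>%
  mutate(x = coalesce(x2020, x2021, x2019))
#> # A tibble: 3 × 4
#>   x2019 x2020 x2021     x
#>   <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    10
#> 2     2    NA    20    20
#> 3     3    NA    NA     3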

All of that may seem like a mouthful, but let’s take a look at Japan to get an idea of what we want to do.

rawData %>% filter(iso2c == "JP")
#> # A tibble: 6 × 9
#>   country iso2c iso3c  year  aepp   cab   edb    cpi    irs
#>   <chr>   <chr> <chr> <int> <dbl> <dbl> <dbl>  <dbl>  <dbl>
#> 1 Japan   JP    JPN    2014   100 0.742  NA    2.76   0.804
#> 2 Japan   JP    JPN    2015   100 3.07   77.5  0.795  0.737
#> 3 Japan   JP    JPN    2016   100 3.94   77.9 -0.127  0.744
#> 4 Japan   JP    JPN    2017   100 4.12   78.0  0.484  0.673
#> 5 Japan   JP    JPN    2018   100 3.52   78.0  0.989 NA    
#> 6 Japan   JP    JPN    2019   100 3.45   78.0  0.469 NA

rawData %>% 
  lag_at(c("aepp", "cab", "cpi", "irs"),
        .by = iso2c) %>%
  select(country:year, edb, l1_aepp:l1_irs) %>%
  filter(iso2c == "JP")
#> # A tibble: 6 × 9
#>   country iso2c iso3c  year   edb l1_aepp l1_cab l1_cpi l1_irs
#>   <chr>   <chr> <chr> <int> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Japan   JP    JPN    2014  NA        NA NA     NA     NA    
#> 2 Japan   JP    JPN    2015  77.5     100  0.742  2.76   0.804
#> 3 Japan   JP    JPN    2016  77.9     100  3.07   0.795  0.737
#> 4 Japan   JP    JPN    2017  78.0     100  3.94  -0.127  0.744
#> 5 Japan   JP    JPN    2018  78.0     100  4.12   0.484  0.673
#> 6 Japan   JP    JPN    2019  78.0     100  3.52   0.989 NA

Japan is missing the interest rate spread variable for 2018 and 2019. Because the first-order lag of the interest rate spread variable (l1_irs) for 2019 wants an observation from 2018 (which it does not have), this becomes an NA and Japan would drop from our cross-sectional analysis. However, we could simply carry forward the most recent observation Japan has (2017) as a plug-in for the missing interest rate spread in 2018. There will be occasions where this might be less than desirable, but it’s perfectly fine for a case like this.4

Thus, the third option here is to complement the first-order lags with a group-by fill using the fill() function in {tidyr}. To the best of my knowledge, fill() doesn’t recognize the .by argument like other so-called “tidy” verbs, but it does work with the traditional group_by() approach. Observe the gains in available data for analysis we’ll get from this method.

# Reminder of how many observations we have in the first method
nrow(rawData %>% filter(year == 2019) %>% na.omit)
#> [1] 99

rawData %>%
  # Step 1: lag_at(), .by our group
  lag_at(c("aepp", "cab", "cpi", "irs"),
         .by = iso2c) %>%
  # Step 2: group_by() our group
  group_by(iso2c) %>%
  # Step 3: fill down the first-order lags.
  fill(l1_aepp:l1_irs,
       .direction = "down") %>%
  #  Step 4: practice safe group_by()
  ungroup() %>%
  # Step 5: select what we want, though this is basically optional
  select(country:year, edb, l1_aepp:l1_irs) %>%
  # Step 6: filter to most recent year (2019)
  filter(year == 2019) -> Option3

Option3 %>% na.omit
#> # A tibble: 123 × 9
#>    country             iso2c iso3c  year   edb l1_aepp l1_cab l1_cpi l1_irs
#>    <chr>               <chr> <chr> <int> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
#>  1 Albania             AL    ALB    2019  67.7   100    -6.70   2.03   5.18
#>  2 Algeria             DZ    DZA    2019  48.6    99.6  -8.69   4.27   6.25
#>  3 Angola              AO    AGO    2019  41.3    45.3   9.32  19.6   13.8 
#>  4 Antigua and Barbuda AG    ATG    2019  60.3   100   -14.0    1.21   7.32
#>  5 Armenia             AM    ARM    2019  74.5    99.9  -7.23   2.52   4.13
#>  6 Australia           AU    AUS    2019  81.2   100    -2.23   1.91   3.28
#>  7 Azerbaijan          AZ    AZE    2019  78.5   100    12.7    2.27   7.17
#>  8 Bahamas, The        BS    BHS    2019  59.9   100    -8.84   2.27   3.41
#>  9 Bahrain             BH    BHR    2019  76.0   100    -6.44   2.09   4.17
#> 10 Bangladesh          BD    BGD    2019  45.0    86.9  -2.21   5.54   2.99
#> # ℹ 113 more rows

This method nets us a more inclusive list than the other two methods, using the most recent available data to “fill” in, where necessary, for missing observations. Notice there are other options for the .direction argument, though I’m deliberate in selecting “down” to make sure only past observations can stand in for more current observations (i.e. I won’t grab an observation from 2017 to fill in for 2016).
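If the behavior of .direction = "down" seems opaque, here’s a toy sketch of it in isolation:

tibble(year = 2014:2019,
       x = c(1, NA, 3, NA, NA, 6)) %>%
  fill(x, .direction = "down") # only past values stand in for later ones
#> # A tibble: 6 × 2
#>    year     x
#>   <int> <dbl>
#> 1  2014     1
#> 2  2015     1
#> 3  2016     3
#> 4  2017     3
#> 5  2018     3
#> 6  2019     6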

A Simple Regression, and a Conclusion

For the sake of an end-of-the-course paper in my BA-level quantitative methods course, I’d be happy with any one of these options that makes use of data from the World Bank’s DataBank (by way of {WDI}), though the last of them would impress me the most. I’ll only offer the caveat that there is no guarantee that all three of these would produce identical results.

Again assuming we want to explain variation in the ease of doing business score as a function of the other indicators we got, three simple linear models will produce results that aren’t identical. Observe.

M1 <- lm(edb ~ aepp + cab + cpi + irs, Option1)
M2 <- lm(edb ~ l1_aepp + l1_cab + l1_cpi + l1_irs, Option2)
M3 <- lm(edb ~ l1_aepp + l1_cab + l1_cpi + l1_irs, Option3)

modelsummary(list("Subset: 2019" = M1,
                  "Subset: 2019 (w/ IV Lags)" = M2,
                  "Subset: 2019 (w/ IV Lags and Fills)" = M3),
             stars = TRUE,
             title = "The Covariates of the Ease of Doing Business in 2019",
             coef_map = c(
               "aepp" = "Access to Electricity",
               "l1_aepp" = "Access to Electricity",
               "cab" = "Current Account Balance",
               "l1_cab" = "Current Account Balance",
               "cpi" = "Consumer Price Index",
               "l1_cpi" = "Consumer Price Index",
               "irs" = "Interest Rate Spread",
               "l1_irs" = "Interest Rate Spread"
             ),
             gof_map = c("nobs", "r.squared", "adj.r.squared"))
The Covariates of the Ease of Doing Business in 2019

                          Subset: 2019   Subset: 2019    Subset: 2019
                                         (w/ IV Lags)    (w/ IV Lags and Fills)
Access to Electricity     0.194***       0.167***        0.206***
                          (0.042)        (0.040)         (0.033)
Current Account Balance   0.171+         0.344***        0.210*
                          (0.091)        (0.091)         (0.092)
Consumer Price Index      -0.013         -0.202+         -0.156***
                          (0.036)        (0.109)         (0.035)
Interest Rate Spread      -0.603**       -0.516**        -0.491***
                          (0.179)        (0.156)         (0.143)
Num.Obs.                  99             102             123
R2                        0.379          0.460           0.454
R2 Adj.                   0.353          0.437           0.436

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
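As promised at the top, {tinytable} is optional if you want to customize this table further. A minimal sketch, assuming a recent {modelsummary} (v2.0 or later) that returns a {tinytable} object by default; the styling choice here is only for illustration:

modelsummary(list("Subset: 2019" = M1,
                  "Subset: 2019 (w/ IV Lags)" = M2,
                  "Subset: 2019 (w/ IV Lags and Fills)" = M3),
             stars = TRUE) %>%
  tinytable::style_tt(j = 1, bold = TRUE) # embolden the variable-name column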

As the composition of the sample changes, so too do the test statistics. Notice as well that the current account balance and consumer price index variables cross different thresholds of significance across the three models. I’ll withhold comment about the advisability of this exact regression given the above caveat that this applied example is purposely thoughtless.5

I’ll clarify here that this isn’t supposed to be a serious analysis. Rather, it’s supposed to be a tutorial that guides students on how to use {WDI} to do some introductory analyses that are suitable for their current level. Come armed with questions that you can answer with data, and think critically about what you want to do and why you want to do it. Using {WDI} and doing some basic lags/fills are quite simple by comparison. It’s just a few lines of code.

  1. The statement announcing the discontinuation of the Doing Business project casts considerable doubt on whether these data should be used whatsoever. 

  2. If you snooped on the source code for this post, you’d see that I saved the output of this function to a data set and work with that for this post. It’s great that this API exists, but accessing it can be a bit slow. With that in mind, it might be wise to consider this the kind of “raw data” you’d have for a project and keep it stored somewhere to process to “clean” data. See some posts of mine (here and here) for what I call this “data-laundering” approach to project management. 

  3. Yeah, concerns for causal identification are not so easily dismissed by simple year lags, but that’s a topic for another class. 

  4. In our data, Venezuela loses its consumer price index observations for 2018 and 2019 after observing 255% in 2017. Using 2017 to fill in 2018 may borrow trouble (i.e. this is Venezuela we’re talking about, and inflation in 2018 was likely much worse than it was in 2017). However, consumer price indices already behave poorly in the linear model context, and imputing 2017’s value for 2018 would at least let you infer the thing that matters: Venezuela was in a hyperinflation crisis. Use your head with the data limitations in mind. 

  5. For example, it would make sense to transform some of these variables. The consumer price index will always have a grotesque scale in a cross-national context, the interest rate spread and current account balance have similar quirks, and access to electricity tops out at 100% (which applies to over 30% of observations). Think carefully about what you’re doing and why you’re doing it.