I Have Other Musings on This That I Want My Students to Read ⤵️

Spiritually, this post is identical to one more focused on creating dyad- or state-year data for analyses of international conflict. {peacesciencer} talks a little bit about this as well. Likewise, I’m assuming some familiarity with state classification systems, which I talk about a bit on my blog and for {peacesciencer}. My blog has other things for my students to read about merging data. {peacesciencer} also talks about this, albeit in a more narrow context.

Note quite, but same energy for a lead image. Hot #take: the Bobby Sherman bubblegum pop cover is better.
Note quite, but same energy for a lead image. Hot #take: the Bobby Sherman bubblegum pop cover is better.

The idea for this post comes from an uncomfortable encounter with a student recently. The student in question proposed that they were doing a time-series analysis of a country from something like 2000 to 2023. The data were purportedly yearly. They reported an N in their model of over 5,000 observations. That obviously can’t be right, but understanding what is “right” may not be so straightforward for students who are leaning on data sets they download to think of their population for them. Let he who is without sin cast the first stone; I was a graduate student myself once. However, the more seasoned I’ve become with this stuff, the more I’ve appreciated that taking control of this stuff—with code—is going to make your life a lot easier as a researcher. It’ll also, hopefully, help the professor (this professor) avoid the discomfort of having to ask the student why an N of 24 suddenly became an N of over 5,000. Did something go horribly wrong in the merge process, and/or did the left hand not know what the right hand was doing? If you don’t know, I’ll have to ask.

This is something of a quick hitter, because I’m repeating myself a fair bit. Here’s a table of contents.

  1. What is a “Population” in this Context?
  2. What’s My Population?
  3. How Do I Create My Data?
  4. Conclusion: Why Does This Matter?

Here are the R packages we’ll be using. Importantly, {stevedata} version 1.5.0 is in development and has the wb_groups data that will feature prominently here.

library(tidyverse)
library(stevedata) # forthcoming, v. 1.5.0
library(peacesciencer)
library(WDI)

Alrightie, let’s get going.

What is a “Population” in this Context?

Students for whom this is applicable will probably remember their professor (me) teaching them about how inference is made from a sample to a population. The population parameter might be unknowable in the real world, but our statistical tools make inferential claims by way of ruling out things as incompatible with the data. In the classic case, the population is the thing we want to know about based on samples of it.

The “population” in this context isn’t referring to something wholly different, per se, but it is different. Instead, the “population” in this context is the universe of relevant cases we want to describe. If, say, the goal is to make inferences about the five Nordic countries, then the “population” is Sweden, Norway, Finland, Iceland, and Denmark. That population is five units. If, say, the goal is to make inferences about South Asia, then the “population” (per World Bank classifications) is Afghanistan, Bangladesh, Bhutan, India, Maldives, Nepal, Pakistan, and Sri Lanka. That population is eight units. Perhaps missing data creates a subset of that population (i.e. maybe we don’t have data on something for Maldives), or we might be interested in just the Scandinavian part of the Nordic countries (which would exclude Finland).

However, that means the size of the population decreases for these reasons, and never increases. Your “population” should never increase in your data.1 Please keep that in mind.

This part is simple, certainly for bite-sized “populations” like this. There is an added wrinkle when there is a temporal component to the population. The “population” is observed over some repeated interval of time. For a lot of international relations applications, this is yearly. There is a Sweden in 2020 and a Sweden in 2021. There is an India-Pakistan dyad in 1970 and an India-Pakistan dyad in 1971. Perhaps the most abstract sense, the “population” is unchanged but the underlying data aren’t unchanged. Our unit of analysis from this population has changed from “thing” (i.e. states) to “thing-time” (e.g. state-years, state-quarters).

We need to be super mindful about what that means for the data we’re ultimately going to have. It seems daunting, but it really isn’t. You just have to know what you’re doing and take control of your data-generating process.

What’s My Population?

I don’t know; you tell me.

No, seriously, you tell me and we can proceed from there. I have a suite of data sets in either {peacesciencer} or {stevedata} that can help you with this. For example, if you were interested in the universe (“population”) of Correlates of War states, you can get that from the Correlates of War project (or in {peacesciencer}):

cow_states
#> # A tibble: 243 × 10
#>    stateabb ccode statenme  styear stmonth stday endyear endmonth endday version
#>    <chr>    <dbl> <chr>      <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>   <dbl>
#>  1 USA          2 United S…   1816       1     1    2016       12     31    2016
#>  2 CAN         20 Canada      1920       1    10    2016       12     31    2016
#>  3 BHM         31 Bahamas     1973       7    10    2016       12     31    2016
#>  4 CUB         40 Cuba        1902       5    20    1906        9     25    2016
#>  5 CUB         40 Cuba        1909       1    23    2016       12     31    2016
#>  6 HAI         41 Haiti       1859       1     1    1915        7     28    2016
#>  7 HAI         41 Haiti       1934       8    15    2016       12     31    2016
#>  8 DOM         42 Dominica…   1894       1     1    1916       11     29    2016
#>  9 DOM         42 Dominica…   1924       9    29    2016       12     31    2016
#> 10 JAM         51 Jamaica     1962       8     6    2016       12     31    2016
#> # ℹ 233 more rows

The data here suggest we have a population of 243 cases… except we don’t. Do you see from the output that we have duplicate entries for Cuba, Haiti, and the Dominican Republic in the first 10 rows? Those emerge as artifacts of the United States temporarily eliminating those states by occupying them for a stretch of several years before leaving, which then results in those states reappearing in the state system. You can helpfully see those dates communicated in the data, but it does mean there is an implicit time component in these data. If you wanted the true size of the population, irregarding time, you’d want to subset to unique Correlates of War state codes like this.

cow_states %>% slice(1, .by=ccode)
#> # A tibble: 217 × 10
#>    stateabb ccode statenme  styear stmonth stday endyear endmonth endday version
#>    <chr>    <dbl> <chr>      <dbl>   <dbl> <dbl>   <dbl>    <dbl>  <dbl>   <dbl>
#>  1 USA          2 United S…   1816       1     1    2016       12     31    2016
#>  2 CAN         20 Canada      1920       1    10    2016       12     31    2016
#>  3 BHM         31 Bahamas     1973       7    10    2016       12     31    2016
#>  4 CUB         40 Cuba        1902       5    20    1906        9     25    2016
#>  5 HAI         41 Haiti       1859       1     1    1915        7     28    2016
#>  6 DOM         42 Dominica…   1894       1     1    1916       11     29    2016
#>  7 JAM         51 Jamaica     1962       8     6    2016       12     31    2016
#>  8 TRI         52 Trinidad…   1962       8    31    2016       12     31    2016
#>  9 BAR         53 Barbados    1966      11    30    2016       12     31    2016
#> 10 DMA         54 Dominica    1978      11     3    2016       12     31    2016
#> # ℹ 207 more rows

Thus, we have 217 unique states that have ever existed in the population/universe of Correlates of War states.2 If you use create_stateyears() in that package, you’ll get that information processed for you in creating state-year data. We’ll talk more about the temporal component in the next section, but here it is in action creating a panel of five years for all countries in the Correlates of War state system data from 1816 to 1820.

create_stateyears(subset_year = c(1816:1820))
#> # A tibble: 115 × 3
#>    ccode statenme                  year
#>    <dbl> <chr>                    <int>
#>  1     2 United States of America  1816
#>  2     2 United States of America  1817
#>  3     2 United States of America  1818
#>  4     2 United States of America  1819
#>  5     2 United States of America  1820
#>  6   200 United Kingdom            1816
#>  7   200 United Kingdom            1817
#>  8   200 United Kingdom            1818
#>  9   200 United Kingdom            1819
#> 10   200 United Kingdom            1820
#> # ℹ 105 more rows

You wouldn’t be interested in this sliver of the overall panel, but there it is anyway.

Almost none of my students (unfortunately 😢) are interested in the kinds of conflict analyses I’ve done or typically read, but they are generally interested in panel models or time series data that might lean on data made available by the World Bank. However, the World Bank is generous to the point of too generous with the data it makes available. Sometimes a student is really interested in low-income countries, or some geographical region. If you’re not explicit with {WDI} when you get data from the World Bank, it will grab everything for you. It’s understandable that it does that, because you didn’t give it guidance about what to include or exclude with respect to a population that interests you.

This would be a good time to read about the assorted classification systems the World Bank employs. I have a version of these data in {stevedata} (forthcoming v. 1.5.0) as wb_groups. Here, you can see what are the assorted classification systems and what states are in them.

wb_groups
#> # A tibble: 2,085 × 4
#>    wbgc  wbgn                        iso3c name            
#>    <chr> <chr>                       <chr> <chr>           
#>  1 AFE   Africa Eastern and Southern AGO   Angola          
#>  2 AFE   Africa Eastern and Southern BWA   Botswana        
#>  3 AFE   Africa Eastern and Southern BDI   Burundi         
#>  4 AFE   Africa Eastern and Southern COM   Comoros         
#>  5 AFE   Africa Eastern and Southern COD   Congo, Dem. Rep.
#>  6 AFE   Africa Eastern and Southern ERI   Eritrea         
#>  7 AFE   Africa Eastern and Southern SWZ   Eswatini        
#>  8 AFE   Africa Eastern and Southern ETH   Ethiopia        
#>  9 AFE   Africa Eastern and Southern KEN   Kenya           
#> 10 AFE   Africa Eastern and Southern LSO   Lesotho         
#> # ℹ 2,075 more rows

wb_groups %>% count(wbgn) %>% data.frame
#>                                                  wbgn   n
#> 1                         Africa Eastern and Southern  26
#> 2                          Africa Western and Central  22
#> 3                                          Arab World  22
#> 4                              Caribbean small states  11
#> 5                      Central Europe and the Baltics  11
#> 6                          Early-demographic dividend  62
#> 7                                 East Asia & Pacific  38
#> 8                    East Asia & Pacific (IDA & IBRD)  23
#> 9         East Asia & Pacific (excluding high income)  22
#> 10                                          Euro area  20
#> 11                              Europe & Central Asia  58
#> 12                 Europe & Central Asia (IDA & IBRD)  23
#> 13      Europe & Central Asia (excluding high income)  18
#> 14                                     European Union  27
#> 15           Fragile and conflict affected situations  39
#> 16             Heavily indebted poor countries (HIPC)  39
#> 17                                        High income  86
#> 18                                          IBRD only  67
#> 19                                   IDA & IBRD total 145
#> 20                                          IDA blend  18
#> 21                                           IDA only  60
#> 22                                          IDA total  78
#> 23                          Late-demographic dividend  54
#> 24                          Latin America & Caribbean  42
#> 25             Latin America & Caribbean (IDA & IBRD)  31
#> 26  Latin America & Caribbean (excluding high income)  23
#> 27       Least developed countries: UN classification  45
#> 28                                Low & middle income 131
#> 29                                         Low income  26
#> 30                                Lower middle income  51
#> 31                         Middle East & North Africa  21
#> 32            Middle East & North Africa (IDA & IBRD)  12
#> 33 Middle East & North Africa (excluding high income)  13
#> 34                                      Middle income 105
#> 35                                      North America   3
#> 36                                       OECD members  38
#> 37                                 Other small states  18
#> 38                        Pacific island small states  11
#> 39                          Post-demographic dividend  38
#> 40                           Pre-demographic dividend  37
#> 41                                 Small states (SST)  40
#> 42                                         South Asia   8
#> 43                            South Asia (IDA & IBRD)   8
#> 44                                 Sub-Saharan Africa  48
#> 45                    Sub-Saharan Africa (IDA & IBRD)  48
#> 46         Sub-Saharan Africa (excluding high income)  47
#> 47                                Upper middle income  54
#> 48                                              World 218

Here we again refer to how this section started, but let’s assume the population to which we want to infer is “low-income countries”. We can identify the units in that population with no problem whatsoever.

wb_groups %>% filter(wbgn == "Low income") 
#> # A tibble: 26 × 4
#>    wbgc  wbgn       iso3c name                    
#>    <chr> <chr>      <chr> <chr>                   
#>  1 LIC   Low income AFG   Afghanistan             
#>  2 LIC   Low income BFA   Burkina Faso            
#>  3 LIC   Low income BDI   Burundi                 
#>  4 LIC   Low income CAF   Central African Republic
#>  5 LIC   Low income TCD   Chad                    
#>  6 LIC   Low income COD   Congo, Dem. Rep.        
#>  7 LIC   Low income ERI   Eritrea                 
#>  8 LIC   Low income ETH   Ethiopia                
#>  9 LIC   Low income GMB   Gambia, The             
#> 10 LIC   Low income GNB   Guinea-Bissau           
#> # ℹ 16 more rows

wb_groups %>% filter(wbgn == "Low income") %>% pull(name)
#>  [1] "Afghanistan"               "Burkina Faso"             
#>  [3] "Burundi"                   "Central African Republic" 
#>  [5] "Chad"                      "Congo, Dem. Rep."         
#>  [7] "Eritrea"                   "Ethiopia"                 
#>  [9] "Gambia, The"               "Guinea-Bissau"            
#> [11] "Korea, Dem. People's Rep." "Liberia"                  
#> [13] "Madagascar"                "Malawi"                   
#> [15] "Mali"                      "Mozambique"               
#> [17] "Niger"                     "Rwanda"                   
#> [19] "Sierra Leone"              "Somalia"                  
#> [21] "South Sudan"               "Sudan"                    
#> [23] "Syrian Arab Republic"      "Togo"                     
#> [25] "Uganda"                    "Yemen, Rep."

Perhaps we can anticipate a few issues that might emerge with this population. For example, there is no South Sudan before 2011 and North Korea is a notorious data desert. Since these classifications are current (as of the 2025 fiscal year), we can’t say what this population would’ve looked like in 2005 (beyond the obvious absence of South Sudan). No matter, I want to at least impress that we are being reasonably deliberate about identifying our population outright. Our population might be the universe of Correlates of War states, which we can subset to regions (once your eagle eye identifies how Correlates of War state codes crudely communicate geographical regions). Our population might be low-income countries, which may or may not have some data issues. We’re being transparent about identifying populations of interest based on assorted tools at our disposal. Use them to your advantage.

How Do I Create My Data?

First, you need to identify your population. Let’s keep this exercise reasonably simple and focus on South Asia.

wb_groups %>% filter(wbgn == "South Asia") -> southAsia

southAsia
#> # A tibble: 8 × 4
#>   wbgc  wbgn       iso3c name       
#>   <chr> <chr>      <chr> <chr>      
#> 1 SAS   South Asia AFG   Afghanistan
#> 2 SAS   South Asia BGD   Bangladesh 
#> 3 SAS   South Asia BTN   Bhutan     
#> 4 SAS   South Asia IND   India      
#> 5 SAS   South Asia MDV   Maldives   
#> 6 SAS   South Asia NPL   Nepal      
#> 7 SAS   South Asia PAK   Pakistan   
#> 8 SAS   South Asia LKA   Sri Lanka

As beginners, our eyes gravitate toward the country names. As researchers, they should go to the three-character ISO codes. That’s ultimately what the World Bank is (mostly) using for benchmarking their data and a failure to be diligent about benchmarking to ISO codes creates some occasional headaches. See for yourself.

wbd_example %>%
  filter(iso3c == "TUR" | iso3c == "CZE") %>% filter(year %in% c(2000, 2020)) %>% 
  arrange(iso3c, year)
#> # A tibble: 6 × 7
#>   country        iso2c iso3c  year rgdppc lifeexp    hci
#>   <chr>          <chr> <chr> <int>  <dbl>   <dbl>  <dbl>
#> 1 Czechia        CZ    CZE    2000 12312.    75.0 NA    
#> 2 Czech Republic CZ    CZE    2020    NA     NA    0.752
#> 3 Czechia        CZ    CZE    2020 19048.    78.2 NA    
#> 4 Turkiye        TR    TUR    2000  6455.    71.9 NA    
#> 5 Turkey         TR    TUR    2020    NA     NA    0.649
#> 6 Turkiye        TR    TUR    2020 12072.    75.8 NA

You can see what happened here, and this came as is from the World Bank. Don’t lean on an English country name to help with you anything important.

No matter, let’s return to South Asia and take control of our data-generating process. We have our population of interest, but what is our unit of analysis once we add a temporal component? In almost every instance, the temporal unit would be years. Most data of interest to us in the cross-national context is typically aggregated to years. So, let’s go from there.

Creating a Panel of State-Years

Let’s assume we wanted to proceed with a panel of these eight South Asian states from 2000 to 2020. If that’s what we wanted, then basic math say we should have 168 observations for our population of eight cases (i.e. 21*8). We could hope we get that right in Microsoft Excel, or could we do it ourselves based on my 2019 post. I’ll be honest that I return to this procedure all the time in creating data sets to analyze.

southAsia %>%
  rowwise() %>% # think rowwise
  # below: create an embedded list, as a column, of a sequence from 2000 to 2020
  # This will increment by 1.
  mutate(year = list(seq(2000, 2020, by = 1))) %>% 
  # unnest, to get our panel
  unnest(year)
#> # A tibble: 168 × 5
#>    wbgc  wbgn       iso3c name         year
#>    <chr> <chr>      <chr> <chr>       <dbl>
#>  1 SAS   South Asia AFG   Afghanistan  2000
#>  2 SAS   South Asia AFG   Afghanistan  2001
#>  3 SAS   South Asia AFG   Afghanistan  2002
#>  4 SAS   South Asia AFG   Afghanistan  2003
#>  5 SAS   South Asia AFG   Afghanistan  2004
#>  6 SAS   South Asia AFG   Afghanistan  2005
#>  7 SAS   South Asia AFG   Afghanistan  2006
#>  8 SAS   South Asia AFG   Afghanistan  2007
#>  9 SAS   South Asia AFG   Afghanistan  2008
#> 10 SAS   South Asia AFG   Afghanistan  2009
#> # ℹ 158 more rows

If we’re thinking ahead to merging data into this panel, it’s good to know that we should not ever have more than 168 observations in the data. If that happened, it might be because we did something inadvisable (like merging “Czechia” to “Czech Republic” for both a country name and three-character ISO code).

Creating a Panel of State-Quarters or State-Months

There are conceivably more granular periods over which the population to comprise our panel can be observed. While I’ve yet to get comfortable with the International Monetary Fund’s (IMF’s) application programming interface (API), at least through R, I know the IMF has data for countries that are more granular than yearly. Generally, some data (like trade) are available monthly or quarterly. If we wanted a panel of state-months or state-quarters from 2000 to 2020, we could create it like this. This will lean on {lubridate}, which is in {tidyverse}.

# Create state-quarters, one way...
southAsia %>%
  rowwise() %>% # think rowwise
  mutate(date = list(seq(ymd(20000101),
                         ymd(20201201), 
                         by = "1 quarter"))) %>%
  # unnest, to get our panel
  unnest(date) %>%
  mutate(quarter = quarter(date))
#> # A tibble: 672 × 6
#>    wbgc  wbgn       iso3c name        date       quarter
#>    <chr> <chr>      <chr> <chr>       <date>       <int>
#>  1 SAS   South Asia AFG   Afghanistan 2000-01-01       1
#>  2 SAS   South Asia AFG   Afghanistan 2000-04-01       2
#>  3 SAS   South Asia AFG   Afghanistan 2000-07-01       3
#>  4 SAS   South Asia AFG   Afghanistan 2000-10-01       4
#>  5 SAS   South Asia AFG   Afghanistan 2001-01-01       1
#>  6 SAS   South Asia AFG   Afghanistan 2001-04-01       2
#>  7 SAS   South Asia AFG   Afghanistan 2001-07-01       3
#>  8 SAS   South Asia AFG   Afghanistan 2001-10-01       4
#>  9 SAS   South Asia AFG   Afghanistan 2002-01-01       1
#> 10 SAS   South Asia AFG   Afghanistan 2002-04-01       2
#> # ℹ 662 more rows

# Create state-months.
# You'll notice this is just copy-pasting the above and changing a few things.
southAsia %>%
  rowwise() %>% # think rowwise
  mutate(date = list(seq(ymd(20000101),
                         ymd(20201201), 
                         by = "1 month"))) %>%
  # unnest, to get our panel
  unnest(date) %>%
  mutate(month = month(date))
#> # A tibble: 2,016 × 6
#>    wbgc  wbgn       iso3c name        date       month
#>    <chr> <chr>      <chr> <chr>       <date>     <dbl>
#>  1 SAS   South Asia AFG   Afghanistan 2000-01-01     1
#>  2 SAS   South Asia AFG   Afghanistan 2000-02-01     2
#>  3 SAS   South Asia AFG   Afghanistan 2000-03-01     3
#>  4 SAS   South Asia AFG   Afghanistan 2000-04-01     4
#>  5 SAS   South Asia AFG   Afghanistan 2000-05-01     5
#>  6 SAS   South Asia AFG   Afghanistan 2000-06-01     6
#>  7 SAS   South Asia AFG   Afghanistan 2000-07-01     7
#>  8 SAS   South Asia AFG   Afghanistan 2000-08-01     8
#>  9 SAS   South Asia AFG   Afghanistan 2000-09-01     9
#> 10 SAS   South Asia AFG   Afghanistan 2000-10-01    10
#> # ℹ 2,006 more rows

# Create state-quarters, another way...
southAsia %>%
  rowwise() %>% # think rowwise
  mutate(date = list(seq(ymd(20000101),
                         ymd(20201201), 
                         by = "1 month"))) %>%
  # unnest, to get our panel
  unnest(date) %>%
  mutate(month = month(date)) %>%
  filter(month %in% c(1,4,7,10))
#> # A tibble: 672 × 6
#>    wbgc  wbgn       iso3c name        date       month
#>    <chr> <chr>      <chr> <chr>       <date>     <dbl>
#>  1 SAS   South Asia AFG   Afghanistan 2000-01-01     1
#>  2 SAS   South Asia AFG   Afghanistan 2000-04-01     4
#>  3 SAS   South Asia AFG   Afghanistan 2000-07-01     7
#>  4 SAS   South Asia AFG   Afghanistan 2000-10-01    10
#>  5 SAS   South Asia AFG   Afghanistan 2001-01-01     1
#>  6 SAS   South Asia AFG   Afghanistan 2001-04-01     4
#>  7 SAS   South Asia AFG   Afghanistan 2001-07-01     7
#>  8 SAS   South Asia AFG   Afghanistan 2001-10-01    10
#>  9 SAS   South Asia AFG   Afghanistan 2002-01-01     1
#> 10 SAS   South Asia AFG   Afghanistan 2002-04-01     4
#> # ℹ 662 more rows

Creating a Panel of State-Days

It’s conceivable, however implausible, that a student might be interested in a panel of state-days. In this case, perhaps the student is interested in daily exchange rates of these eight currencies vis-a-vis the U.S. dollar as their currencies are traded on the foreign exchange market. If so, you’re really just changing one line of code to create what you want.

southAsia %>%
  rowwise() %>% # think rowwise
  mutate(date = list(seq(ymd(20000101),
                         ymd(20201201), 
                         by = "1 day"))) %>%
  # unnest, to get our panel
  unnest(date)
#> # A tibble: 61,128 × 5
#>    wbgc  wbgn       iso3c name        date      
#>    <chr> <chr>      <chr> <chr>       <date>    
#>  1 SAS   South Asia AFG   Afghanistan 2000-01-01
#>  2 SAS   South Asia AFG   Afghanistan 2000-01-02
#>  3 SAS   South Asia AFG   Afghanistan 2000-01-03
#>  4 SAS   South Asia AFG   Afghanistan 2000-01-04
#>  5 SAS   South Asia AFG   Afghanistan 2000-01-05
#>  6 SAS   South Asia AFG   Afghanistan 2000-01-06
#>  7 SAS   South Asia AFG   Afghanistan 2000-01-07
#>  8 SAS   South Asia AFG   Afghanistan 2000-01-08
#>  9 SAS   South Asia AFG   Afghanistan 2000-01-09
#> 10 SAS   South Asia AFG   Afghanistan 2000-01-10
#> # ℹ 61,118 more rows

There is the obvious caveat that these are just every day of all years for eight states in South Asia from 2000 to 2020. It won’t communicate days in which the foreign exchange market is closed (though eliminating weekends isn’t difficult at all). No matter, if you’re getting your exchange rate data from something like {quantmod}, that’ll become apparent. In which case, days in which the foreign exchange market are closed alter the unit of analysis slightly. They’re no longer state-days, but state-trading days.

Conclusion: Why Does This Matter?

The banner above this post notes that I’ve mused on this exact thing six times, and this would be the seventh. Coming up next is another instructional bit about the use of {WDI}, which I’ve also done here and here. I’m repeating myself because I want students to note it takes very little effort to be deliberate in defining the population of interest (in IR applications) and takes almost no effort to create the bare bones of the data themselves. You should not have to lean on data you download to take care of that for you, because there is no guarantee it will. Data you download doesn’t necessarily know the population of interest to you, only the population of interest to the data. Define your population, and you know what you’re doing. Define your population, as observed over time, and you have the exact dimensions of your unit of analysis. If something is posing an issue toward creating the full data set of interest, it’ll be easier to spot.

I say this because code in {peacesciencer} and just about everything I do leans on left_join() in {dplyr}. I’m a left_join() absolutist. Being abundantly clear about the population of interest and creating the data from scratch allows you full control over the data-generating process. It’ll also allow you to more efficiently use functions like WDI() in {WDI}. Observe:

southAsia$iso3c
#> [1] "AFG" "BGD" "BTN" "IND" "MDV" "NPL" "PAK" "LKA"

WDI(country = southAsia$iso3c,
    indicator = c("gdppc" = "NY.GDP.PCAP.KD"),
    start = 2000, end = 2020) -> wbd

wbd %>% as_tibble() # cool, I know these dimensions match.
#> # A tibble: 168 × 5
#>    country     iso2c iso3c  year gdppc
#>    <chr>       <chr> <chr> <int> <dbl>
#>  1 Afghanistan AF    AFG    2020  528.
#>  2 Afghanistan AF    AFG    2019  558.
#>  3 Afghanistan AF    AFG    2018  553.
#>  4 Afghanistan AF    AFG    2017  563.
#>  5 Afghanistan AF    AFG    2016  564.
#>  6 Afghanistan AF    AFG    2015  566.
#>  7 Afghanistan AF    AFG    2014  575.
#>  8 Afghanistan AF    AFG    2013  581.
#>  9 Afghanistan AF    AFG    2012  569.
#> 10 Afghanistan AF    AFG    2011  525.
#> # ℹ 158 more rows

wbd %>% as_tibble() %>% # trust the three-character ISO codes...
  select(iso3c, year, gdppc) -> wbd

southAsia %>%
  rowwise() %>% # thing rowwise
  # below: create an embedded list, as a column, of a sequence from 2000 to 2020
  # This will increment by 1.
  mutate(year = list(seq(2000, 2020, by = 1))) %>% 
  # unnest, to get our panel
  unnest(year) %>%
  # Everything above was copy-pasted, but's join in our GDP per capita data...
  left_join(., wbd) -> southAsia

southAsia # nailed it
#> # A tibble: 168 × 6
#>    wbgc  wbgn       iso3c name         year gdppc
#>    <chr> <chr>      <chr> <chr>       <dbl> <dbl>
#>  1 SAS   South Asia AFG   Afghanistan  2000  308.
#>  2 SAS   South Asia AFG   Afghanistan  2001  277.
#>  3 SAS   South Asia AFG   Afghanistan  2002  338.
#>  4 SAS   South Asia AFG   Afghanistan  2003  346.
#>  5 SAS   South Asia AFG   Afghanistan  2004  339.
#>  6 SAS   South Asia AFG   Afghanistan  2005  364.
#>  7 SAS   South Asia AFG   Afghanistan  2006  368.
#>  8 SAS   South Asia AFG   Afghanistan  2007  411.
#>  9 SAS   South Asia AFG   Afghanistan  2008  418.
#> 10 SAS   South Asia AFG   Afghanistan  2009  489.
#> # ℹ 158 more rows

Notice I trusted the three-character ISO codes to communicate the population of interest to me, and the years to match one-to-one (as they do) with what’s in my panel. I just needed the information I want (GDP per capita) and the information that would help me merge into the data frame I’m creating (the three-character ISO code, and importantly the year).

Be deliberate, but trust the process. As you learn to trust the process, you’ll also get a better idea of what went wrong if/when something does go wrong.

  1. You can obviously toggle this a bit if there is sufficient weirdness in your population. For example, Bangladesh was an exclave of Pakistan before a war of liberation (with assistance from India) created it in December 1971. Thus, there would be no Bangladesh in 1970, but there would be a Bangladesh for about 15 days in 1971. However, that “weirdness” only manifests when we’ve included a temporal component to how we understand the “population”. 

  2. You are welcome to read about some of the peculiarities of this state classification system, though it is ubiquitous in the study of inter-state conflict. By far the biggest open questions would concern cases like Germany, Yugoslavia/Serbia, and Yemen. I riff on those a little bit on {peacesciencer} and what are the implications of those cases.