Create Country-Year and (Non)-Directed Dyad-Year Data With Just a Few Lines in R

This Functionality is Now in `{peacesciencer}` ⤵️

This post became the basis for {peacesciencer}, which you can now install from CRAN. The processes described here ultimately became create_dyadyears() and create_stateyears() in that package. Please check out the package’s website for its continued development.

Image: The most accurate map in the world, just 'cause.

I’m writing this mostly as a note to myself since I have to remind myself how I do this every time I do it.

The long and short of this post is I remember being in grad school and working with international conflict data (mostly Correlates of War) and creating a lot of country-year and dyad-year data sets to analyze questions of interest to the earlier research I did with my adviser. Works in this line of research include country-year-level analyses of territorial threat, state capacity, and civil war and directed dyad-year analyses on a host of questions of interest to the democratic peace research agenda: namely whether democratic conflict selection advantages and democratic dispute resolution are epiphenomenal to territorial peace.

I also remember EUGene, an invaluable program at the time that could seamlessly generate non-directed dyad-year conflict data, directed dyad-year conflict data, and country-year data, complete with several covariates and conflict data of interest to the researcher, in just minutes. Yet my professional development occurred at a juncture when EUGene started to lose some of its value. The causes here were multiple and the reader should not interpret this as the fault of the developers. For one, EUGene existed only as a Windows binary. I got proficient at installing Wine and using it to install and run EUGene on my Linux desktop (and later my MacBook), but this was a tedious extra step. Further, EUGene was limited to the data it had and still mostly served its original purpose: to facilitate analyses and replications of analyses from the 1990s still indebted to version 2.1 of the Correlates of War Militarized Interstate Dispute (MID) data. Updates that could incorporate newer conflict data (i.e. the update to version 3 in 2004 and version 4 in 2014) and the revisions my adviser and I were proposing to the MID data lagged behind user demand for these features. Thus, I had to figure out how to do what EUGene was previously doing for me.

Fortunately, I learned that a few lines of R code can create country-year and dyad-year panel data frames from basic country-level information, as the remainder of this post will show. The lines necessary to create these data became even fewer when {plyr} gave way to {dplyr} in my workflow. All the user needs is the Correlates of War State System Membership data. I’ll be using v2016 but any version should be fine for this purpose.

Load the {tidyverse} package and then read the data (wherever you stored it) into your R session.

library(tidyverse)
library(knitr) # for kable(), used for the tables below
States <- read_csv("~/Dropbox/data/cow/states/states2016.csv")

Country-Year Data

The following code will create a simple country-year data frame that a user can populate with country-year-level data of interest (e.g. civil war data, various IPE data). I’ll annotate the code below so the reader can see what it’s doing.

States %>%
  # This mutate command below is optional.
  # Basically: the data extend to 2016. If you want to extend them to a more recent year, change it.
  # If you don't need it (e.g. most CoW conflict data end in 2010), leave it alone or comment it out.
  mutate(endyear = ifelse(endyear == 2016, 2018, endyear)) %>%
  # Prepare the pipe to think rowwise, so seq() below runs once per row.
  # If you don't, the next mutate command will fail.
  rowwise() %>%
  # Create a list-column in the tibble that we're going to expand soon.
  mutate(year = list(seq(styear, endyear))) %>%
  # Drop the rowwise grouping now that the list-column exists.
  ungroup() %>%
  # Unnest the list-column, which will expand the data to one row per year.
  # Naming the column avoids the warning newer versions of {tidyr} give for a bare unnest().
  unnest(year) %>%
  # Arrange by ccode, year, just to be sure.
  arrange(ccode, year) %>%
  # Keep just ccode, statenme, and year.
  select(ccode, statenme, year) %>%
  # Make sure there are no duplicate country-year observations.
  # There shouldn't be. But be sure.
  # Finally: assign to object.
  distinct(ccode, statenme, year) -> CY

As a proof of concept, here are all the years for the Grand Duchy of Mecklenburg-Schwerin in the Correlates of War State System Membership data. These years run from its emergence as a state system member on Jan. 1, 1843 to its elimination as an independent state upon its entry into the North German Confederation. This alliance was an effective concession of state sovereignty by Friedrich Franz II to Prussia.

CY %>% filter(ccode == 280) %>%
  kable(., format="html",
        table.attr='id="stevetable"',
        caption = "A Simple Country-Year Panel for Mecklenburg Schwerin, 1843-1867",
        align=c("c","l","c"))

A Simple Country-Year Panel for Mecklenburg Schwerin, 1843-1867
ccode	statenme	year
280	Mecklenburg Schwerin	1843
280	Mecklenburg Schwerin	1844
280	Mecklenburg Schwerin	1845
280	Mecklenburg Schwerin	1846
280	Mecklenburg Schwerin	1847
280	Mecklenburg Schwerin	1848
280	Mecklenburg Schwerin	1849
280	Mecklenburg Schwerin	1850
280	Mecklenburg Schwerin	1851
280	Mecklenburg Schwerin	1852
280	Mecklenburg Schwerin	1853
280	Mecklenburg Schwerin	1854
280	Mecklenburg Schwerin	1855
280	Mecklenburg Schwerin	1856
280	Mecklenburg Schwerin	1857
280	Mecklenburg Schwerin	1858
280	Mecklenburg Schwerin	1859
280	Mecklenburg Schwerin	1860
280	Mecklenburg Schwerin	1861
280	Mecklenburg Schwerin	1862
280	Mecklenburg Schwerin	1863
280	Mecklenburg Schwerin	1864
280	Mecklenburg Schwerin	1865
280	Mecklenburg Schwerin	1866
280	Mecklenburg Schwerin	1867
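
To see what the rowwise() and unnest() steps are doing, the same logic can be run on a tiny, made-up version of the state system data. This is only a sketch: the two states and their years below are hypothetical (only the column names ccode, statenme, styear, and endyear come from the real data), and the second state gets two separate spells to mimic a state that leaves and later re-enters the system.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the State System Membership data.
# "Beta" has two spells: 1900-1901 and 1905-1906.
toy_states <- tibble(
  ccode    = c(1, 2, 2),
  statenme = c("Alpha", "Beta", "Beta"),
  styear   = c(1900, 1900, 1905),
  endyear  = c(1902, 1901, 1906)
)

toy_cy <- toy_states %>%
  # One sequence of years per row (i.e. per spell).
  rowwise() %>%
  mutate(year = list(seq(styear, endyear))) %>%
  ungroup() %>%
  # One row per state-year.
  unnest(year) %>%
  distinct(ccode, statenme, year) %>%
  arrange(ccode, year)

toy_cy
```

Each spell contributes endyear - styear + 1 rows, so the gap in Beta's membership (1902-1904) never shows up in the panel.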

Directed Dyad-Year Data

Directed dyad-year data are useful when the researcher is interested in, say, conflict onsets in a given dyad-year and it is important who initiates the dispute. In this interpretation, “France-Germany, 1911” and “Germany-France, 1911” are importantly different observations because Germany initiated the Agadir Crisis (MID#0315) against France and that distinction matters. There is likely an application of directed dyad-year panel frames for IPE researchers interested in, say, directional trade flows. However, I’ve never done an analysis like this before.

Creating directed dyad-year data from the Correlates of War State System Membership data is remarkably easy.

States %>% 
  # Select just the stuff we need
  select(ccode, styear, endyear) %>% 
  # Expand the data, create two ccodes as well
  expand(ccode1=ccode, ccode2=ccode, year=seq(1816,2016)) %>% 
  # Filter out where ccode1 == ccode2
  filter(ccode1!=ccode2) %>% 
  # When you're merging into dyad-year data, prepare to do it twice.
  # Basically: merge in data (here, minimally: the info from the `States` data) for ccode1
  left_join(., States, by=c("ccode1"="ccode")) %>%
  # ...and filter out cases where the years don't align.
  filter(year >= styear & year <= endyear) %>%
  # Get rid of styear and endyear to do it again.
  select(-styear,-endyear) %>% 
  # And do it again, this time for ccode2
  left_join(., States, by=c("ccode2"="ccode")) %>%
  # Again, filter out cases where years don't align.
  filter(year >= styear & year <= endyear) %>%
  # And select just what we need.
  select(ccode1, ccode2, year) -> DDY

I want to belabor a few points about the nature of the directed dyad-year data I just created.

First, when populating a dyad-year (directed or non-directed) panel data frame with some covariates (e.g. Polity data, GDP data, whatever), be prepared to merge in data “twice.” That is, the data frame with which the user is primarily working is dyad-year, but the data the user wants to add are country-year (e.g. the Polity score for France or Prussia/Germany in 1826, 1827, 1828, and so on). Thus, the user will need to merge it “twice.” First, rename the country code variable in the country-year data to something like ccode1 and merge; this attaches the data to the first state in the dyad. Then rename that country code variable to something like ccode2 and merge again; this attaches the data to the second state in the dyad. The covariate columns themselves also need distinct names for each pass (e.g. a 1 or 2 suffix) so the second merge does not collide with the first.

For what it’s worth, the left_join() function in {dplyr} lets you merge on keys that have different names in the two data frames, which spares you the renaming step. That’s what the by argument is doing in the left_join() calls in the piped chain above.
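
A minimal sketch of the “merge twice” pattern follows. Everything in it is made up for illustration: toy_ddy, toy_cov, and the score column are hypothetical names, and the score values are not real Polity scores. Only the country codes (220 for France, 255 for Germany) are actual Correlates of War codes.

```r
library(dplyr)

# Hypothetical directed dyad-year rows.
toy_ddy <- tibble(
  ccode1 = c(220, 255),
  ccode2 = c(255, 220),
  year   = c(1911, 1911)
)

# Hypothetical country-year covariate (made-up values).
toy_cov <- tibble(
  ccode = c(220, 255),
  year  = c(1911, 1911),
  score = c(8, 2)
)

toy_merged <- toy_ddy %>%
  # First pass: the covariate describes ccode1.
  left_join(toy_cov %>% rename(score1 = score),
            by = c("ccode1" = "ccode", "year" = "year")) %>%
  # Second pass: the same covariate now describes ccode2.
  left_join(toy_cov %>% rename(score2 = score),
            by = c("ccode2" = "ccode", "year" = "year"))

toy_merged
```

Renaming score to score1 and score2 before each join keeps the two passes from colliding on the same column name.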

Second, the filtering of the years will make sure the user is not left with any observations where the countries did not exist at the same time. For example, Kuwait and Baden were never independent states at the same time. That is an important detail. Obviously states like Belize and Wuerttemberg can’t have a conflict with each other when they never existed at the same time.
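
The effect of the year filters is easy to check on a toy example. The sketch below reuses the same pipeline on three hypothetical states (the codes and years are invented): states 1 and 2 overlap from 1850 to 1870, while state 3 does not enter the system until after both are gone.

```r
library(dplyr)
library(tidyr)

# Hypothetical states; only 1 and 2 ever exist at the same time.
toy_states <- tibble(
  ccode   = c(1, 2, 3),
  styear  = c(1816, 1850, 1961),
  endyear = c(1870, 1900, 2016)
)

toy_ddy <- toy_states %>%
  # Every ordered pair of states in every year.
  expand(ccode1 = ccode, ccode2 = ccode, year = seq(1816, 2016)) %>%
  filter(ccode1 != ccode2) %>%
  # Keep years in which ccode1 was a system member...
  left_join(toy_states, by = c("ccode1" = "ccode")) %>%
  filter(year >= styear & year <= endyear) %>%
  select(-styear, -endyear) %>%
  # ...and in which ccode2 was a system member too.
  left_join(toy_states, by = c("ccode2" = "ccode")) %>%
  filter(year >= styear & year <= endyear) %>%
  select(ccode1, ccode2, year)

# Two directed dyads survive (1-2 and 2-1), 21 years each.
toy_ddy %>% count(ccode1, ccode2)
```

State 3 never appears in the result because no year satisfies both filters for any dyad that includes it.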

Third, the data take no consideration of whether the dyads here are “politically relevant” or “politically active”. They are universal. This means two things. One, this is a long data set: 1,912,350 rows. It’s also why I dropped all other columns, to keep the data frame from consuming too much memory in R for this simple exercise.

Two, the user may want to apply some case exclusion rules like “political relevance” or “political activity” before merging in other data because there are a lot of irrelevant dyads, and carrying them along will start to devour memory in the R session pretty quickly. When I discuss this topic with my students, I always bring up my favorite dyad, Mongolia-Nigeria, because the probability of conflict between the two states, let alone a fatal MID, is effectively zero even if the thought experiment of what such a fatal conflict would resemble will have your brain wandering to weird places.

However, Mongolia-Nigeria is in this data set, as is Nigeria-Mongolia.

DDY %>%
  filter(ccode1 %in% c(475, 712) & ccode2 %in% c(475, 712)) %>%
  filter(year <= 1970) %>%
  mutate(statenme1 = countrycode::countrycode(ccode1, "cown", "country.name"),
         statenme2 = countrycode::countrycode(ccode2, "cown", "country.name")) %>%
  select(ccode1, ccode2, statenme1, statenme2, year) %>%
  kable(., format="html",
        table.attr='id="stevetable"',
        caption = "A Simple Directed Dyad-Year Panel for Mongolia and Nigeria, 1960-1970",
        align=c("c","c","l","l","c"))

A Simple Directed Dyad-Year Panel for Mongolia and Nigeria, 1960-1970
ccode1	ccode2	statenme1	statenme2	year
475	712	Nigeria	Mongolia	1960
475	712	Nigeria	Mongolia	1961
475	712	Nigeria	Mongolia	1962
475	712	Nigeria	Mongolia	1963
475	712	Nigeria	Mongolia	1964
475	712	Nigeria	Mongolia	1965
475	712	Nigeria	Mongolia	1966
475	712	Nigeria	Mongolia	1967
475	712	Nigeria	Mongolia	1968
475	712	Nigeria	Mongolia	1969
475	712	Nigeria	Mongolia	1970
712	475	Mongolia	Nigeria	1960
712	475	Mongolia	Nigeria	1961
712	475	Mongolia	Nigeria	1962
712	475	Mongolia	Nigeria	1963
712	475	Mongolia	Nigeria	1964
712	475	Mongolia	Nigeria	1965
712	475	Mongolia	Nigeria	1966
712	475	Mongolia	Nigeria	1967
712	475	Mongolia	Nigeria	1968
712	475	Mongolia	Nigeria	1969
712	475	Mongolia	Nigeria	1970

Again, this will create universal directed dyad-year data. It’s imperative for the user to make reasonable case exclusion rules on top of that. There may be a reason to have a dyad like Mongolia-Nigeria, Belize-Botswana, or Estonia-Trinidad and Tobago in the data, no matter how rarely those countries interact. However, that’s for the user to justify in the analysis section of the paper s/he will write.
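
As a rough illustration of what a case exclusion rule might look like in code, here is a sketch of one common operationalization of political relevance: keep a dyad if the two states are contiguous or if at least one of them is a major power. The lookup tables toy_contig and toy_majors, and their columns contiguous and major, are hypothetical stand-ins. Real contiguity and major power data would need to be read in and will be structured differently.

```r
library(dplyr)

# Hypothetical universal directed dyad-years for a single year.
toy_ddy <- tibble(
  ccode1 = c(2, 2, 475, 712),
  ccode2 = c(20, 712, 712, 475),
  year   = 1965
)

# Hypothetical lookup tables (assumed structure, not real data files).
toy_contig <- tibble(ccode1 = 2, ccode2 = 20, year = 1965, contiguous = 1)
toy_majors <- tibble(ccode = 2, year = 1965, major = 1)

toy_relevant <- toy_ddy %>%
  # Dyad-level information merges once...
  left_join(toy_contig, by = c("ccode1", "ccode2", "year")) %>%
  # ...while country-level information merges twice.
  left_join(toy_majors %>% rename(major1 = major),
            by = c("ccode1" = "ccode", "year" = "year")) %>%
  left_join(toy_majors %>% rename(major2 = major),
            by = c("ccode2" = "ccode", "year" = "year")) %>%
  # No match in a lookup table means "not contiguous" or "not a major power".
  mutate(across(c(contiguous, major1, major2), ~ coalesce(.x, 0))) %>%
  filter(contiguous == 1 | major1 == 1 | major2 == 1)

toy_relevant
```

In this toy version, both Mongolia-Nigeria rows drop out while the two dyads involving ccode 2 stay.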

Non-Directed Dyad-Year Data

Non-directed dyad-year data are (I think) the most common form of data in dyad-year analyses of inter-state conflict. Here, the distinction between who initiated a MID and who was targeted in a MID is irrelevant. It does not matter that Germany initiated a MID against France in 1911 in Agadir. It only matters that there was a MID between those two countries at that point in time.

In practice, creating non-directed dyad-year data means ccode1 is whichever state has the lower ccode in the dyad. That means the United States (ccode == 2) will always be first in any non-directed dyad in which it is a party and Samoa (ccode == 990) will never be first in any non-directed dyad in which it is a member.

Once you understand that distinction, creating a non-directed dyad-year data frame from the directed dyad-year data frame is simple.

DDY %>% filter(ccode2 > ccode1) -> NDY

It’s that simple. Notice the Mongolia-Nigeria directed dyad drops out because Mongolia has the higher country code.

NDY %>%
  filter(ccode1 %in% c(475, 712) & ccode2 %in% c(475, 712)) %>%
  filter(year <= 1970) %>%
  mutate(statenme1 = countrycode::countrycode(ccode1, "cown", "country.name"),
         statenme2 = countrycode::countrycode(ccode2, "cown", "country.name")) %>%
  select(ccode1, ccode2, statenme1, statenme2, year) %>%
  kable(., format="html",
        table.attr='id="stevetable"',
        caption = "A Simple Non-Directed Dyad-Year Panel for Mongolia and Nigeria, 1960-1970",
        align=c("c","c","l","l","c"))

A Simple Non-Directed Dyad-Year Panel for Mongolia and Nigeria, 1960-1970
ccode1	ccode2	statenme1	statenme2	year
475	712	Nigeria	Mongolia	1960
475	712	Nigeria	Mongolia	1961
475	712	Nigeria	Mongolia	1962
475	712	Nigeria	Mongolia	1963
475	712	Nigeria	Mongolia	1964
475	712	Nigeria	Mongolia	1965
475	712	Nigeria	Mongolia	1966
475	712	Nigeria	Mongolia	1967
475	712	Nigeria	Mongolia	1968
475	712	Nigeria	Mongolia	1969
475	712	Nigeria	Mongolia	1970
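
One practical consequence of the lower-ccode-first convention: dyadic data that arrive in some other order need to be reordered before they will merge into a non-directed frame. The sketch below shows one way to do that with pmin() and pmax(). The toy_event data frame and its mid column are hypothetical, built to mirror the Germany-France example above.

```r
library(dplyr)

# Hypothetical directed record: Germany (255) listed first, France (220) second.
toy_event <- tibble(ccode1 = 255, ccode2 = 220, year = 1911, mid = 1)

toy_event_nd <- toy_event %>%
  # Compute both new values before overwriting either column.
  mutate(low  = pmin(ccode1, ccode2),
         high = pmax(ccode1, ccode2)) %>%
  mutate(ccode1 = low, ccode2 = high) %>%
  select(-low, -high)

toy_event_nd
```

The temporary low and high columns matter. Overwriting ccode1 first and then computing pmax() from the already-changed column would give the wrong ccode2.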
