Create Different Kinds of Data in `{peacesciencer}`

This tutorial is a companion to the user guide, which shows how to create different kinds of data in peacesciencer. However, space considerations in that guide (it was written with publication in a peer-reviewed journal in mind) preclude the full “knitting” experience (i.e. giving the user a preview of what the data look like). What follows is a brief guide that expands on the tutorial section of that user guide for creating different kinds of data in peacesciencer.

This vignette leans on the tidyverse package, which features in almost anything you would (optimally) do with peacesciencer. I will also load lubridate. Internal functions in peacesciencer use lubridate (it is a formal dependency of peacesciencer), but users may want to load it themselves for additional date processing outside of peacesciencer.

library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
#> ✓ tibble  3.1.4     ✓ dplyr   1.0.7
#> ✓ tidyr   1.1.4     ✓ stringr 1.4.0
#> ✓ readr   2.0.2     ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(peacesciencer)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

State-Year Data

The most basic form of data peacesciencer creates is state-year, by way of create_stateyears(). create_stateyears() has two arguments: system and mry. system takes either “cow” or “gw”, depending on whether the user wants Correlates of War state-years or Gleditsch-Ward state-years. It defaults to “cow” in the absence of a user-specified override given the prominence of Correlates of War data in the peace science ecosystem. mry takes a logical (TRUE or FALSE), depending on whether the user wants the function to extend to the most recently concluded calendar year (2020). The Correlates of War state system data extend to the end of 2016 while the Gleditsch-Ward state system data extend to the end of 2017. This argument allows the researcher to extend the data a few years, under the (reasonable) assumption that there have been no fundamental changes to the composition of the state system since these data sets were last updated. mry defaults to TRUE in the absence of a user-specified override.

This will create Correlates of War state-year data from 1816 to 2020.

create_stateyears()
#> # A tibble: 16,731 × 3
#>    ccode statenme                  year
#>    <dbl> <chr>                    <int>
#>  1     2 United States of America  1816
#>  2     2 United States of America  1817
#>  3     2 United States of America  1818
#>  4     2 United States of America  1819
#>  5     2 United States of America  1820
#>  6     2 United States of America  1821
#>  7     2 United States of America  1822
#>  8     2 United States of America  1823
#>  9     2 United States of America  1824
#> 10     2 United States of America  1825
#> # … with 16,721 more rows

This will create Gleditsch-Ward state-year data from 1816 to 2017.

create_stateyears(system = "gw", mry = FALSE)
#> # A tibble: 17,767 × 3
#>    gwcode statename                 year
#>     <dbl> <chr>                    <int>
#>  1      2 United States of America  1816
#>  2      2 United States of America  1817
#>  3      2 United States of America  1818
#>  4      2 United States of America  1819
#>  5      2 United States of America  1820
#>  6      2 United States of America  1821
#>  7      2 United States of America  1822
#>  8      2 United States of America  1823
#>  9      2 United States of America  1824
#> 10      2 United States of America  1825
#> # … with 17,757 more rows
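
For intuition about what mry does, here is a minimal sketch on a toy state system. It is not the package's internal code: the toy_states table, its column names, and the two cutoff years are all hypothetical, chosen only to show how spells of system membership expand into state-years and how spells that run to the end of the data can be carried forward.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy state system: one row per spell of system membership.
toy_states <- tibble(
  code    = c(1, 2),
  name    = c("State A", "State B"),
  styear  = c(2010, 2012),
  endyear = c(2013, 2013)
)

last_data_year <- 2013  # assumed last year covered by the toy data
extend_to      <- 2015  # assumed most recently concluded calendar year

toy_stateyears <- toy_states %>%
  # Carry forward only the spells that run to the end of the data.
  mutate(endyear = if_else(endyear == last_data_year, extend_to, endyear)) %>%
  # Build the sequence of years for each spell, then expand to one row per year.
  rowwise() %>%
  mutate(year = list(seq(styear, endyear))) %>%
  ungroup() %>%
  unnest(year) %>%
  select(code, name, year)

toy_stateyears
```

State A contributes six rows (2010 to 2015) and State B contributes four (2012 to 2015). A state whose spell ended before the last data year would not be extended.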

Dyad-Year Data

create_dyadyears() is one of the most useful functions in peacesciencer, transforming the raw Correlates of War state system data (cow_states in peacesciencer) or Gleditsch-Ward state system data (gw_states) into all possible dyad-years. It has three arguments. system and mry operate the same as they do in create_stateyears(). There is an additional argument—directed—that also takes a logical (TRUE or FALSE). The default here is TRUE, returning directed dyad-year data (useful for dyadic conflict analyses where the initiator/target distinction matters). FALSE returns non-directed dyad-year data, useful for cases where the initiator/target distinction does not matter and the researcher cares more about the presence or absence of a conflict. The convention for non-directed dyad-year data is that ccode2 > ccode1 and the underlying code of create_dyadyears() simply takes the directed dyad-year data and lops it in half with that rule.

Here are all Correlates of War dyad-years from 1816 to 2020.

create_dyadyears()
#> Joining, by = c("ccode1", "ccode2", "year")
#> # A tibble: 2,063,610 × 3
#>    ccode1 ccode2  year
#>     <dbl>  <dbl> <int>
#>  1      2     20  1920
#>  2      2     20  1921
#>  3      2     20  1922
#>  4      2     20  1923
#>  5      2     20  1924
#>  6      2     20  1925
#>  7      2     20  1926
#>  8      2     20  1927
#>  9      2     20  1928
#> 10      2     20  1929
#> # … with 2,063,600 more rows

Here are all Gleditsch-Ward dyad-years with the same temporal domain.

create_dyadyears(system = "gw")
#> Joining, by = c("gwcode1", "gwcode2", "year")
#> # A tibble: 2,029,622 × 3
#>    gwcode1 gwcode2  year
#>      <dbl>   <dbl> <int>
#>  1       2      20  1867
#>  2       2      20  1868
#>  3       2      20  1869
#>  4       2      20  1870
#>  5       2      20  1871
#>  6       2      20  1872
#>  7       2      20  1873
#>  8       2      20  1874
#>  9       2      20  1875
#> 10       2      20  1876
#> # … with 2,029,612 more rows
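
The ccode2 > ccode1 convention for non-directed data is easy to see on a small example. This is a sketch with three illustrative state codes in a single year, not the package's own code; in practice create_dyadyears(directed = FALSE) does this for you.

```r
library(dplyr)
library(tidyr)

codes <- c(2, 20, 200)  # three illustrative state codes for one year

# Directed dyads: every ordered pair of distinct states.
directed <- expand_grid(ccode1 = codes, ccode2 = codes, year = 1950) %>%
  filter(ccode1 != ccode2)

# Non-directed dyads: keep one row per pair with the ccode2 > ccode1 rule.
nondirected <- directed %>%
  filter(ccode2 > ccode1)

nrow(directed)     # 6
nrow(nondirected)  # 3
```

The non-directed data have exactly half the rows of the directed data because each pair of states appears twice in the directed data, once in each order.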

Dyadic Dispute-Year Data

Dyadic dispute-year data come pre-processed in peacesciencer. Another vignette shows how these are transformed to true dyad-year data, but they are also available for analysis. For example, the (directed) dyadic dispute-year Gibler-Miller-Little (GML) MID data are available as gml_dirdisp. Here, we can add information to these dyadic dispute-years to identify contiguity relationships and Correlates of War major power status.

gml_dirdisp %>% add_contiguity() %>% add_cow_majors()
#> Joining, by = c("ccode1", "ccode2", "year")
#> # A tibble: 10,276 × 42
#>    dispnum ccode1 ccode2  year midongoing midonset sidea1 sidea2 revstate1
#>      <dbl>  <dbl>  <dbl> <dbl>      <dbl>    <dbl>  <dbl>  <dbl>     <dbl>
#>  1       2      2    200  1902          1        1      1      0         1
#>  2       2    200      2  1902          1        1      0      1         1
#>  3       3    300    345  1913          1        1      1      0         1
#>  4       3    345    300  1913          1        1      0      1         0
#>  5       4    200    339  1946          1        1      0      1         0
#>  6       4    339    200  1946          1        1      1      0         0
#>  7       7    200    651  1951          1        1      1      0         0
#>  8       7    200    651  1952          1        0      1      0         0
#>  9       7    651    200  1951          1        1      0      1         1
#> 10       7    651    200  1952          1        0      0      1         1
#> # … with 10,266 more rows, and 33 more variables: revstate2 <dbl>,
#> #   revtype11 <dbl>, revtype12 <dbl>, revtype21 <dbl>, revtype22 <dbl>,
#> #   fatality1 <dbl>, fatality2 <dbl>, fatalpre1 <dbl>, fatalpre2 <dbl>,
#> #   hiact1 <dbl>, hiact2 <dbl>, hostlev1 <dbl>, hostlev2 <dbl>, orig1 <dbl>,
#> #   orig2 <dbl>, hiact <dbl>, hostlev <dbl>, mindur <dbl>, maxdur <dbl>,
#> #   outcome <dbl>, settle <dbl>, fatality <dbl>, fatalpre <dbl>, stmon <dbl>,
#> #   endmon <dbl>, recip <dbl>, numa <dbl>, numb <dbl>, ongo2010 <dbl>, …

Users interested in the Correlates of War MID data can find them as cow_mid_dirdisps. Future updates may change the object names for better standardization, but this is how it is now.

State-Day Data

peacesciencer comes with a create_statedays() function. This is admittedly more of a proof of concept, as it is difficult to think of many daily data sets in peace science, certainly ones with coverage into the 19th century. No matter, create_statedays() will create these data. It has the same system and mry arguments (and the same defaults) as create_stateyears().

Here are all Correlates of War state-days from 1816 to 2020.

create_statedays()
#> # A tibble: 6,061,091 × 3
#>    ccode statenme                 date      
#>    <dbl> <chr>                    <date>    
#>  1     2 United States of America 1816-01-01
#>  2     2 United States of America 1816-01-02
#>  3     2 United States of America 1816-01-03
#>  4     2 United States of America 1816-01-04
#>  5     2 United States of America 1816-01-05
#>  6     2 United States of America 1816-01-06
#>  7     2 United States of America 1816-01-07
#>  8     2 United States of America 1816-01-08
#>  9     2 United States of America 1816-01-09
#> 10     2 United States of America 1816-01-10
#> # … with 6,061,081 more rows

Here are all Gleditsch-Ward state-days with the same temporal domain.

create_statedays(system = "gw")
#> # A tibble: 6,638,781 × 3
#>    gwcode statename                date      
#>     <dbl> <chr>                    <date>    
#>  1      2 United States of America 1816-01-01
#>  2      2 United States of America 1816-01-02
#>  3      2 United States of America 1816-01-03
#>  4      2 United States of America 1816-01-04
#>  5      2 United States of America 1816-01-05
#>  6      2 United States of America 1816-01-06
#>  7      2 United States of America 1816-01-07
#>  8      2 United States of America 1816-01-08
#>  9      2 United States of America 1816-01-09
#> 10      2 United States of America 1816-01-10
#> # … with 6,638,771 more rows

I can imagine an application where a user may want to think of daily conflict episodes within the Gleditsch-Ward domain. The UCDP armed conflict data have more precise dates than, say, the Correlates of War MID data, making such an analysis possible. However, the UCDP data include no conflicts before 1946, and you should reflect that in peacesciencer with something like this. This will require lubridate.

create_statedays(system = "gw") %>%
  filter(year(date) >= 1946)
#> # A tibble: 3,870,980 × 3
#>    gwcode statename                date      
#>     <dbl> <chr>                    <date>    
#>  1      2 United States of America 1946-01-01
#>  2      2 United States of America 1946-01-02
#>  3      2 United States of America 1946-01-03
#>  4      2 United States of America 1946-01-04
#>  5      2 United States of America 1946-01-05
#>  6      2 United States of America 1946-01-06
#>  7      2 United States of America 1946-01-07
#>  8      2 United States of America 1946-01-08
#>  9      2 United States of America 1946-01-09
#> 10      2 United States of America 1946-01-10
#> # … with 3,870,970 more rows

State-Month Data

State-months are simple aggregations of state-days. You can accomplish this with a few extra commands after create_statedays().

create_statedays(system = "gw") %>%
  mutate(year = year(date),
         month = month(date)) %>%
  distinct(gwcode, statename, year, month)
#> # A tibble: 218,194 × 4
#>    gwcode statename                 year month
#>     <dbl> <chr>                    <dbl> <dbl>
#>  1      2 United States of America  1816     1
#>  2      2 United States of America  1816     2
#>  3      2 United States of America  1816     3
#>  4      2 United States of America  1816     4
#>  5      2 United States of America  1816     5
#>  6      2 United States of America  1816     6
#>  7      2 United States of America  1816     7
#>  8      2 United States of America  1816     8
#>  9      2 United States of America  1816     9
#> 10      2 United States of America  1816    10
#> # … with 218,184 more rows
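
To see what this aggregation does at the edges of a state's tenure, here is a small sketch on hypothetical data: a made-up state (code 1) that exists from mid-January to early March of 1990. The toy_days object and its columns are assumptions for illustration only.

```r
library(dplyr)
library(lubridate)

# Hypothetical state-days for one state with a partial first and last month.
toy_days <- tibble(
  code = 1,
  date = seq(as.Date("1990-01-15"), as.Date("1990-03-10"), by = "day")
)

# Same aggregation as above: extract year and month, keep unique combinations.
toy_months <- toy_days %>%
  mutate(year = year(date),
         month = month(date)) %>%
  distinct(code, year, month)

toy_months
```

The result has three rows (January, February, and March), so a state counts as present in any month in which it has at least one state-day.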

State-Quarter Data

There is some assumption about what a “quarter” would look like in a more general context, but it might look something like this. Again, this is an aggregation of create_statedays().

create_statedays(system = "gw") %>%
  mutate(year = year(date),
         month = month(date)) %>%
  filter(month %in% c(1, 4, 7, 10)) %>%
  mutate(quarter = case_when(
    month == 1 ~ "Q1",
    month == 4 ~ "Q2",
    month == 7 ~ "Q3",
    month == 10 ~ "Q4"
  )) %>%
  distinct(gwcode, statename, year, quarter)
#> # A tibble: 72,687 × 4
#>    gwcode statename                 year quarter
#>     <dbl> <chr>                    <dbl> <chr>  
#>  1      2 United States of America  1816 Q1     
#>  2      2 United States of America  1816 Q2     
#>  3      2 United States of America  1816 Q3     
#>  4      2 United States of America  1816 Q4     
#>  5      2 United States of America  1817 Q1     
#>  6      2 United States of America  1817 Q2     
#>  7      2 United States of America  1817 Q3     
#>  8      2 United States of America  1817 Q4     
#>  9      2 United States of America  1818 Q1     
#> 10      2 United States of America  1818 Q2     
#> # … with 72,677 more rows
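
One assumption in the approach above is that a state-quarter is recorded only if the state has at least one day in the first month of that quarter. The sketch below, on hypothetical data, compares that rule with an alternative that labels every day with lubridate's quarter() function. The toy_days object is made up for illustration, and which rule is appropriate depends on the application.

```r
library(dplyr)
library(lubridate)

# Hypothetical state that exists from mid-February to late August 1990.
toy_days <- tibble(
  code = 1,
  date = seq(as.Date("1990-02-10"), as.Date("1990-08-20"), by = "day")
)

# Rule used above: keep only days in the first month of each quarter.
by_first_month <- toy_days %>%
  mutate(year = year(date),
         month = month(date)) %>%
  filter(month %in% c(1, 4, 7, 10)) %>%
  mutate(quarter = case_when(
    month == 1 ~ "Q1",
    month == 4 ~ "Q2",
    month == 7 ~ "Q3",
    month == 10 ~ "Q4"
  )) %>%
  distinct(code, year, quarter)

# Alternative: assign each day to its calendar quarter, then deduplicate.
by_any_day <- toy_days %>%
  mutate(year = year(date),
         quarter = paste0("Q", lubridate::quarter(date))) %>%
  distinct(code, year, quarter)

by_first_month$quarter  # "Q2" "Q3"
by_any_day$quarter      # "Q1" "Q2" "Q3"
```

The first rule drops Q1 here because the toy state has no days in January. The second keeps it because the state has days in February and March.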

Leader-Day (Leader-Month, Leader-Year) Data

peacesciencer has leader-level units of analysis as well, which can be easily created with the modified Archigos (archigos) data in peacesciencer. The data are version 4.1.

archigos
#> # A tibble: 3,409 × 11
#>    obsid  gwcode leadid   leader yrborn gender startdate  enddate    entry exit 
#>    <chr>   <dbl> <chr>    <chr>   <dbl> <chr>  <date>     <date>     <chr> <chr>
#>  1 USA-1…      2 81dcc17… Grant    1822 M      1869-03-04 1877-03-04 Regu… Regu…
#>  2 USA-1…      2 81dcc17… Hayes    1822 M      1877-03-04 1881-03-04 Regu… Regu…
#>  3 USA-1…      2 81dcf24… Garfi…   1831 M      1881-03-04 1881-09-19 Regu… Irre…
#>  4 USA-1…      2 81dcf24… Arthur   1829 M      1881-09-19 1885-03-04 Regu… Regu…
#>  5 USA-1…      2 34fb155… Cleve…   1837 M      1885-03-04 1889-03-04 Regu… Regu…
#>  6 USA-1…      2 81dcf24… Harri…   1833 M      1889-03-04 1893-03-04 Regu… Regu…
#>  7 USA-1…      2 34fb155… Cleve…   1837 M      1893-03-04 1897-03-04 Regu… Regu…
#>  8 USA-1…      2 81dcf24… McKin…   1843 M      1897-03-04 1901-09-14 Regu… Irre…
#>  9 USA-1…      2 81dd231… Roose…   1858 M      1901-09-14 1909-03-04 Regu… Regu…
#> 10 USA-1…      2 81dd231… Taft     1857 M      1909-03-04 1913-03-04 Regu… Regu…
#> # … with 3,399 more rows, and 1 more variable: exitcode <chr>

create_leaderdays() will create leader-day data from archigos.

create_leaderdays()
#> # A tibble: 5,298,380 × 5
#>    obsid    gwcode leader date       yrinoffice
#>    <chr>     <dbl> <chr>  <date>          <dbl>
#>  1 USA-1869      2 Grant  1869-03-04          1
#>  2 USA-1869      2 Grant  1869-03-05          1
#>  3 USA-1869      2 Grant  1869-03-06          1
#>  4 USA-1869      2 Grant  1869-03-07          1
#>  5 USA-1869      2 Grant  1869-03-08          1
#>  6 USA-1869      2 Grant  1869-03-09          1
#>  7 USA-1869      2 Grant  1869-03-10          1
#>  8 USA-1869      2 Grant  1869-03-11          1
#>  9 USA-1869      2 Grant  1869-03-12          1
#> 10 USA-1869      2 Grant  1869-03-13          1
#> # … with 5,298,370 more rows

I do want to note one thing about the leader-level functions in this package. Whereas Correlates of War state system membership is often the default system for a lot of functions (prominently create_stateyears() and create_dyadyears()), the Gleditsch-Ward system is the default system for the leader-level functions because that is the state system around which the Archigos project created its leader data. Moreover, the leader data aren’t exactly tethered to the Gleditsch-Ward state system for dates either (e.g. there are leader entries for Gleditsch-Ward states that aren’t in the system yet). In a case like this, you can standardize these leader data to either the Correlates of War system or the Gleditsch-Ward system with the standardize argument. By default, the option here is “none” (i.e. return all available leader-days recorded in the Archigos data). “cow” or “gw” standardizes the leader data to Correlates of War state system membership or Gleditsch-Ward state system membership, respectively.

create_leaderdays(standardize = "cow")
#> Joining, by = c("gwcode", "year")
#> Joining, by = c("ccode", "date")
#> # A tibble: 4,824,967 × 5
#>    obsid    ccode leader date       yrinoffice
#>    <chr>    <dbl> <chr>  <date>          <dbl>
#>  1 USA-1869     2 Grant  1869-03-04          1
#>  2 USA-1869     2 Grant  1869-03-05          1
#>  3 USA-1869     2 Grant  1869-03-06          1
#>  4 USA-1869     2 Grant  1869-03-07          1
#>  5 USA-1869     2 Grant  1869-03-08          1
#>  6 USA-1869     2 Grant  1869-03-09          1
#>  7 USA-1869     2 Grant  1869-03-10          1
#>  8 USA-1869     2 Grant  1869-03-11          1
#>  9 USA-1869     2 Grant  1869-03-12          1
#> 10 USA-1869     2 Grant  1869-03-13          1
#> # … with 4,824,957 more rows

The user may want to think about some additional post-processing on top of this, but this is enough to get started. From there, the same process that creates state-months can create something like leader-months.

create_leaderdays() %>%
  mutate(year = year(date),
         month = month(date)) %>%
  group_by(gwcode, obsid, year, month) %>%
  slice(1)
#> # A tibble: 177,128 × 7
#> # Groups:   gwcode, obsid, year, month [177,128]
#>    obsid    gwcode leader date       yrinoffice  year month
#>    <chr>     <dbl> <chr>  <date>          <dbl> <dbl> <dbl>
#>  1 USA-1869      2 Grant  1869-03-04          1  1869     3
#>  2 USA-1869      2 Grant  1869-04-01          1  1869     4
#>  3 USA-1869      2 Grant  1869-05-01          1  1869     5
#>  4 USA-1869      2 Grant  1869-06-01          1  1869     6
#>  5 USA-1869      2 Grant  1869-07-01          1  1869     7
#>  6 USA-1869      2 Grant  1869-08-01          1  1869     8
#>  7 USA-1869      2 Grant  1869-09-01          1  1869     9
#>  8 USA-1869      2 Grant  1869-10-01          1  1869    10
#>  9 USA-1869      2 Grant  1869-11-01          1  1869    11
#> 10 USA-1869      2 Grant  1869-12-01          1  1869    12
#> # … with 177,118 more rows
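
Note that the output above is still a grouped tibble (see the “Groups:” line), which will affect later calls such as mutate() or summarize(). A minimal sketch on hypothetical leader-days shows the same pipeline with an ungroup() at the end. The toy_leaderdays object and its values are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical leader-days for a single leader spell.
toy_leaderdays <- tibble(
  obsid = "AAA-1990",
  code = 1,
  date = seq(as.Date("1990-01-20"), as.Date("1990-03-05"), by = "day")
)

toy_leadermonths <- toy_leaderdays %>%
  mutate(year = year(date),
         month = month(date)) %>%
  group_by(code, obsid, year, month) %>%
  slice(1) %>%   # keep the first observed day of each leader-month
  ungroup()      # drop the grouping so later steps work on the full data

toy_leadermonths
```

The three resulting rows are dated 1990-01-20, 1990-02-01, and 1990-03-01, i.e. the first day the leader is observed in each month.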

And here are leader-years, which are pre-packaged as a peacesciencer function. The package also adds some information about leader gender, an approximation of the leader’s age that year (i.e. year - yrborn), and a running count (starting at 1) for the leader’s tenure (in years).

create_leaderyears()
#> # A tibble: 17,686 × 7
#>    obsid    gwcode leader gender  year yrinoffice leaderage
#>    <chr>     <dbl> <chr>  <chr>  <dbl>      <dbl>     <dbl>
#>  1 USA-1869      2 Grant  M       1869          1        47
#>  2 USA-1869      2 Grant  M       1870          2        48
#>  3 USA-1869      2 Grant  M       1871          3        49
#>  4 USA-1869      2 Grant  M       1872          4        50
#>  5 USA-1869      2 Grant  M       1873          5        51
#>  6 USA-1869      2 Grant  M       1874          6        52
#>  7 USA-1869      2 Grant  M       1875          7        53
#>  8 USA-1869      2 Grant  M       1876          8        54
#>  9 USA-1869      2 Grant  M       1877          9        55
#> 10 USA-1877      2 Hayes  M       1877          1        55
#> # … with 17,676 more rows
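
The two derived columns are simple to reproduce by hand. The sketch below rebuilds them for Grant's entry using the birth year and tenure shown in the output above. It is an illustration of the arithmetic, not the package's internal code, and toy_leader is a hypothetical object.

```r
library(dplyr)

# Grant's leader-years, using yrborn from archigos and the years shown above.
toy_leader <- tibble(
  obsid = "USA-1869",
  leader = "Grant",
  yrborn = 1822,
  year = 1869:1877
)

toy_leaderyears <- toy_leader %>%
  group_by(obsid) %>%
  mutate(yrinoffice = row_number(),     # running count of years in office
         leaderage = year - yrborn) %>% # approximate age in that year
  ungroup()

toy_leaderyears
```

The age is approximate because it ignores the leader's birthday within the year: it is 47 for 1869 and 55 for 1877, matching the create_leaderyears() output above.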

Leader Dyad-Year Data

peacesciencer can also create leader dyad-year data by way of create_leaderdyadyears(). You can see some of the underlying code that is creating these data. It’s a lot of code, it would take a lot of time to run from scratch, and the ensuing output is too large to store as an R data object in the package because CRAN hard-caps package size at 5 MB. Instead, users who want these data should run download_extdata() when they first install or update the package. Thereafter, they can run create_leaderdyadyears() to create the full universe of leader dyad-year data.

# create_leaderdyadyears() is effectively doing this.
# Let's do the G-W leader dyad-year data for illustration's sake.
# Do note: `download_extdata()` will download these data and stick them in the package directory
# Thus, it is *not* downloading the data fresh each time.

readRDS(url("http://svmiller.com/R/peacesciencer/gw_dir_leader_dyad_years.rds")) %>%
  declare_attributes(data_type = "leader_dyad_year", system = "gw")
#> # A tibble: 2,336,990 × 11
#>     year obsid1   obsid2   gwcode1 gwcode2 gender1 gender2 leaderage1 leaderage2
#>    <int> <chr>    <chr>      <dbl>   <dbl> <chr>   <chr>        <dbl>      <dbl>
#>  1  1870 AFG-1868 AUH-1848     700     300 M       M               45         40
#>  2  1870 AFG-1868 BAV-1864     700     245 M       M               45         39
#>  3  1870 AFG-1868 BRA-1840     700     140 M       M               45         45
#>  4  1870 AFG-1868 CHN-1861     700     710 M       M               45         35
#>  5  1870 AFG-1868 COS-1870     700      94 M       M               45         39
#>  6  1870 AFG-1868 ECU-1869     700     130 M       M               45         49
#>  7  1870 AFG-1868 GMY-1858     700     255 M       M               45         73
#>  8  1870 AFG-1868 GRC-1863     700     350 M       M               45         25
#>  9  1870 AFG-1868 IRN-1848     700     630 M       M               45         39
#> 10  1870 AFG-1868 JPN-1868     700     740 M       M               45         18
#> # … with 2,336,980 more rows, and 2 more variables: yrinoffice1 <dbl>,
#> #   yrinoffice2 <dbl>

# ^ compare with:
# download_extdata()
# create_leaderdyadyears()
