{peacesciencer}
This tutorial is a companion to the user guide, which shows how to create different kinds of data in peacesciencer. However, space considerations (for ideal publication in a peer-reviewed journal) preclude the full “knitting” experience (i.e. giving the user a preview of what the data look like). What follows is a brief guide that expands on the tutorial section of that user guide for creating different kinds of data in peacesciencer.
This vignette leans on the tidyverse package, which belongs in almost any workflow that uses peacesciencer. I will also load lubridate. Internal functions in peacesciencer use lubridate (it is a formal dependency of peacesciencer), but users may want to load it themselves for date manipulation outside of peacesciencer functions.
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
#> ✓ tibble 3.1.4 ✓ dplyr 1.0.7
#> ✓ tidyr 1.1.4 ✓ stringr 1.4.0
#> ✓ readr 2.0.2 ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
library(peacesciencer)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
The most basic form of data peacesciencer creates is state-year, by way of create_stateyears(). create_stateyears() has two arguments: system and mry. system takes either “cow” or “gw”, depending on whether the user wants Correlates of War state-years or Gleditsch-Ward state-years. It defaults to “cow” in the absence of a user-specified override, given the prominence of Correlates of War data in the peace science ecosystem. mry takes a logical (TRUE or FALSE), depending on whether the user wants the function to extend to the most recently concluded calendar year (2020). The Correlates of War state system data extend to the end of 2016 while the Gleditsch-Ward state system data extend to the end of 2017. This argument allows the researcher to extend the data a few years, under the (reasonable) assumption that there have been no fundamental changes to the composition of the state system since these data sets were last updated. mry defaults to TRUE in the absence of a user-specified override.
This will create Correlates of War state-year data from 1816 to 2020.
create_stateyears()
#> # A tibble: 16,731 × 3
#> ccode statenme year
#> <dbl> <chr> <int>
#> 1 2 United States of America 1816
#> 2 2 United States of America 1817
#> 3 2 United States of America 1818
#> 4 2 United States of America 1819
#> 5 2 United States of America 1820
#> 6 2 United States of America 1821
#> 7 2 United States of America 1822
#> 8 2 United States of America 1823
#> 9 2 United States of America 1824
#> 10 2 United States of America 1825
#> # … with 16,721 more rows
This will create Gleditsch-Ward state-year data from 1816 to 2017.
create_stateyears(system = "gw", mry = FALSE)
#> # A tibble: 17,767 × 3
#> gwcode statename year
#> <dbl> <chr> <int>
#> 1 2 United States of America 1816
#> 2 2 United States of America 1817
#> 3 2 United States of America 1818
#> 4 2 United States of America 1819
#> 5 2 United States of America 1820
#> 6 2 United States of America 1821
#> 7 2 United States of America 1822
#> 8 2 United States of America 1823
#> 9 2 United States of America 1824
#> 10 2 United States of America 1825
#> # … with 17,757 more rows
create_dyadyears() is one of the most useful functions in peacesciencer, transforming the raw Correlates of War state system data (cow_states in peacesciencer) or Gleditsch-Ward state system data (gw_states) into all possible dyad-years. It has three arguments. system and mry operate the same as they do in create_stateyears(). There is an additional argument, directed, that also takes a logical (TRUE or FALSE). The default here is TRUE, returning directed dyad-year data (useful for dyadic conflict analyses where the initiator/target distinction matters). FALSE returns non-directed dyad-year data, useful for cases where the initiator/target distinction does not matter and the researcher cares more about the presence or absence of a conflict. The convention for non-directed dyad-year data is that ccode2 > ccode1, and the underlying code of create_dyadyears() simply takes the directed dyad-year data and lops it in half with that rule.
Here are all Correlates of War dyad-years from 1816 to 2020.
create_dyadyears()
#> Joining, by = c("ccode1", "ccode2", "year")
#> # A tibble: 2,063,610 × 3
#> ccode1 ccode2 year
#> <dbl> <dbl> <int>
#> 1 2 20 1920
#> 2 2 20 1921
#> 3 2 20 1922
#> 4 2 20 1923
#> 5 2 20 1924
#> 6 2 20 1925
#> 7 2 20 1926
#> 8 2 20 1927
#> 9 2 20 1928
#> 10 2 20 1929
#> # … with 2,063,600 more rows
Here are all Gleditsch-Ward dyad-years with the same temporal domain.
create_dyadyears(system = "gw")
#> Joining, by = c("gwcode1", "gwcode2", "year")
#> # A tibble: 2,029,622 × 3
#> gwcode1 gwcode2 year
#> <dbl> <dbl> <int>
#> 1 2 20 1867
#> 2 2 20 1868
#> 3 2 20 1869
#> 4 2 20 1870
#> 5 2 20 1871
#> 6 2 20 1872
#> 7 2 20 1873
#> 8 2 20 1874
#> 9 2 20 1875
#> 10 2 20 1876
#> # … with 2,029,612 more rows
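The non-directed convention can be illustrated on a toy example. What follows is a minimal sketch, not the internal code of create_dyadyears(): the three state codes and the single year are made up for illustration, and directed and nondirected are hypothetical object names.

```r
library(dplyr)
library(tidyr)

# Hypothetical three-state system observed in a single year.
codes <- c(2, 20, 200)

# All ordered pairs of distinct states: the directed dyad-years.
directed <- expand_grid(ccode1 = codes, ccode2 = codes, year = 1950L) %>%
  filter(ccode1 != ccode2)

# Keep only the pairs where ccode2 > ccode1: the non-directed dyad-years.
nondirected <- directed %>%
  filter(ccode2 > ccode1)

nrow(directed)     # 6 directed dyad-years
nrow(nondirected)  # 3 non-directed dyad-years
```

In practice, create_dyadyears(directed = FALSE) applies this rule for you, so a manual filter like this one should only be needed for data created some other way.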
Dyadic dispute-year data come pre-processed in peacesciencer. Another vignette shows how these are transformed to true dyad-year data, but they are also available for analysis as they are. For example, the (directed) dyadic dispute-year Gibler-Miller-Little (GML) MID data are available as gml_dirdisp. Here, we can add information to these dyadic dispute-years to identify contiguity relationships and Correlates of War major power status.
gml_dirdisp %>% add_contiguity() %>% add_cow_majors()
#> Joining, by = c("ccode1", "ccode2", "year")
#> # A tibble: 10,276 × 42
#> dispnum ccode1 ccode2 year midongoing midonset sidea1 sidea2 revstate1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 2 200 1902 1 1 1 0 1
#> 2 2 200 2 1902 1 1 0 1 1
#> 3 3 300 345 1913 1 1 1 0 1
#> 4 3 345 300 1913 1 1 0 1 0
#> 5 4 200 339 1946 1 1 0 1 0
#> 6 4 339 200 1946 1 1 1 0 0
#> 7 7 200 651 1951 1 1 1 0 0
#> 8 7 200 651 1952 1 0 1 0 0
#> 9 7 651 200 1951 1 1 0 1 1
#> 10 7 651 200 1952 1 0 0 1 1
#> # … with 10,266 more rows, and 33 more variables: revstate2 <dbl>,
#> # revtype11 <dbl>, revtype12 <dbl>, revtype21 <dbl>, revtype22 <dbl>,
#> # fatality1 <dbl>, fatality2 <dbl>, fatalpre1 <dbl>, fatalpre2 <dbl>,
#> # hiact1 <dbl>, hiact2 <dbl>, hostlev1 <dbl>, hostlev2 <dbl>, orig1 <dbl>,
#> # orig2 <dbl>, hiact <dbl>, hostlev <dbl>, mindur <dbl>, maxdur <dbl>,
#> # outcome <dbl>, settle <dbl>, fatality <dbl>, fatalpre <dbl>, stmon <dbl>,
#> # endmon <dbl>, recip <dbl>, numa <dbl>, numb <dbl>, ongo2010 <dbl>, …
Users interested in the Correlates of War MID data can use cow_mid_dirdisps. Future updates may change the object names for better standardization, but this is the current name.
peacesciencer comes with a create_statedays() function. This is admittedly more of a proof of concept, since it is difficult to think of many daily data sets in peace science, certainly ones with coverage into the 19th century. No matter, create_statedays() will create these data. It has the same system and mry arguments (and the same defaults) as create_stateyears().
Here are all Correlates of War state-days from 1816 to 2020.
create_statedays()
#> # A tibble: 6,061,091 × 3
#> ccode statenme date
#> <dbl> <chr> <date>
#> 1 2 United States of America 1816-01-01
#> 2 2 United States of America 1816-01-02
#> 3 2 United States of America 1816-01-03
#> 4 2 United States of America 1816-01-04
#> 5 2 United States of America 1816-01-05
#> 6 2 United States of America 1816-01-06
#> 7 2 United States of America 1816-01-07
#> 8 2 United States of America 1816-01-08
#> 9 2 United States of America 1816-01-09
#> 10 2 United States of America 1816-01-10
#> # … with 6,061,081 more rows
Here are all Gleditsch-Ward state-days with the same temporal domain.
create_statedays(system = "gw")
#> # A tibble: 6,638,781 × 3
#> gwcode statename date
#> <dbl> <chr> <date>
#> 1 2 United States of America 1816-01-01
#> 2 2 United States of America 1816-01-02
#> 3 2 United States of America 1816-01-03
#> 4 2 United States of America 1816-01-04
#> 5 2 United States of America 1816-01-05
#> 6 2 United States of America 1816-01-06
#> 7 2 United States of America 1816-01-07
#> 8 2 United States of America 1816-01-08
#> 9 2 United States of America 1816-01-09
#> 10 2 United States of America 1816-01-10
#> # … with 6,638,771 more rows
I can conjure an application where a user may want to think of daily conflict episodes within the Gleditsch-Ward domain. The UCDP armed conflict data have more precise dates than, say, the Correlates of War MID data, making such an analysis possible. However, those conflict data do not start until 1946, and you should subset the peacesciencer data accordingly with something like this. This requires lubridate for the year() function.
create_statedays(system = "gw") %>%
  filter(year(date) >= 1946)
#> # A tibble: 3,870,980 × 3
#> gwcode statename date
#> <dbl> <chr> <date>
#> 1 2 United States of America 1946-01-01
#> 2 2 United States of America 1946-01-02
#> 3 2 United States of America 1946-01-03
#> 4 2 United States of America 1946-01-04
#> 5 2 United States of America 1946-01-05
#> 6 2 United States of America 1946-01-06
#> 7 2 United States of America 1946-01-07
#> 8 2 United States of America 1946-01-08
#> 9 2 United States of America 1946-01-09
#> 10 2 United States of America 1946-01-10
#> # … with 3,870,970 more rows
State-months are simple aggregations of state-days. You can create them with a few extra commands after create_statedays().
create_statedays(system = "gw") %>%
  mutate(year = year(date),
         month = month(date)) %>%
  distinct(gwcode, statename, year, month)
#> # A tibble: 218,194 × 4
#> gwcode statename year month
#> <dbl> <chr> <dbl> <dbl>
#> 1 2 United States of America 1816 1
#> 2 2 United States of America 1816 2
#> 3 2 United States of America 1816 3
#> 4 2 United States of America 1816 4
#> 5 2 United States of America 1816 5
#> 6 2 United States of America 1816 6
#> 7 2 United States of America 1816 7
#> 8 2 United States of America 1816 8
#> 9 2 United States of America 1816 9
#> 10 2 United States of America 1816 10
#> # … with 218,184 more rows
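To see what the aggregation is doing, it can help to run the same steps on a handful of rows. This is a small sketch on made-up data: the state (gwcode 999, "Hypothetical State") and its dates are assumptions for illustration and do not appear in any state system data.

```r
library(dplyr)
library(lubridate)

# Hypothetical state-days: one made-up state observed from mid-January
# to early March 1990.
toy_days <- tibble(
  gwcode = 999,
  statename = "Hypothetical State",
  date = seq(as.Date("1990-01-15"), as.Date("1990-03-10"), by = "day")
)

# Same steps as above: extract year and month, then keep one row per
# state-year-month combination.
toy_months <- toy_days %>%
  mutate(year = year(date),
         month = month(date)) %>%
  distinct(gwcode, statename, year, month)

nrow(toy_days)    # 55 state-days
nrow(toy_months)  # 3 state-months (January, February, March)
```

Note that a partial month still counts as a state-month under this approach: the toy state is only observed from January 15 onward, but January is retained.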
What counts as a “quarter” in a more general context requires some assumption, but it might look something like this. Again, this is an aggregation of create_statedays().
create_statedays(system = "gw") %>%
  mutate(year = year(date),
         month = month(date)) %>%
  filter(month %in% c(1, 4, 7, 10)) %>%
  mutate(quarter = case_when(
    month == 1 ~ "Q1",
    month == 4 ~ "Q2",
    month == 7 ~ "Q3",
    month == 10 ~ "Q4"
  )) %>%
  distinct(gwcode, statename, year, quarter)
#> # A tibble: 72,687 × 4
#> gwcode statename year quarter
#> <dbl> <chr> <dbl> <chr>
#> 1 2 United States of America 1816 Q1
#> 2 2 United States of America 1816 Q2
#> 3 2 United States of America 1816 Q3
#> 4 2 United States of America 1816 Q4
#> 5 2 United States of America 1817 Q1
#> 6 2 United States of America 1817 Q2
#> 7 2 United States of America 1817 Q3
#> 8 2 United States of America 1817 Q4
#> 9 2 United States of America 1818 Q1
#> 10 2 United States of America 1818 Q2
#> # … with 72,677 more rows
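The assumption built into the code above is that a state-quarter is recorded only when the state is observed in the first month of that quarter. A different assumption is that any observed day in a quarter is enough, which lubridate's quarter() function makes easy to express. The sketch below compares the two rules on made-up data; the state (gwcode 999), its dates, and the object names first_month_rule and any_day_rule are all hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical state-days: a made-up state observed from 10 February
# to 20 May 1990, so it misses January but covers all of April.
toy_days <- tibble(
  gwcode = 999,
  statename = "Hypothetical State",
  date = seq(as.Date("1990-02-10"), as.Date("1990-05-20"), by = "day")
)

# Rule used above: keep a quarter only if its first month is observed.
first_month_rule <- toy_days %>%
  mutate(year = year(date),
         month = month(date)) %>%
  filter(month %in% c(1, 4, 7, 10)) %>%
  mutate(quarter = case_when(
    month == 1 ~ "Q1",
    month == 4 ~ "Q2",
    month == 7 ~ "Q3",
    month == 10 ~ "Q4"
  )) %>%
  distinct(gwcode, statename, year, quarter)

# Alternative rule: keep a quarter if any day in it is observed.
any_day_rule <- toy_days %>%
  mutate(year = year(date),
         quarter = paste0("Q", quarter(date))) %>%
  distinct(gwcode, statename, year, quarter)

first_month_rule$quarter  # "Q2"
any_day_rule$quarter      # "Q1" "Q2"
```

Which rule is appropriate depends on the application. The two only differ for states that enter or exit the system partway through a quarter.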
peacesciencer has leader-level units of analysis as well, which can be easily created with the modified Archigos (archigos) data in peacesciencer. The data are version 4.1.
archigos
#> # A tibble: 3,409 × 11
#> obsid gwcode leadid leader yrborn gender startdate enddate entry exit
#> <chr> <dbl> <chr> <chr> <dbl> <chr> <date> <date> <chr> <chr>
#> 1 USA-1… 2 81dcc17… Grant 1822 M 1869-03-04 1877-03-04 Regu… Regu…
#> 2 USA-1… 2 81dcc17… Hayes 1822 M 1877-03-04 1881-03-04 Regu… Regu…
#> 3 USA-1… 2 81dcf24… Garfi… 1831 M 1881-03-04 1881-09-19 Regu… Irre…
#> 4 USA-1… 2 81dcf24… Arthur 1829 M 1881-09-19 1885-03-04 Regu… Regu…
#> 5 USA-1… 2 34fb155… Cleve… 1837 M 1885-03-04 1889-03-04 Regu… Regu…
#> 6 USA-1… 2 81dcf24… Harri… 1833 M 1889-03-04 1893-03-04 Regu… Regu…
#> 7 USA-1… 2 34fb155… Cleve… 1837 M 1893-03-04 1897-03-04 Regu… Regu…
#> 8 USA-1… 2 81dcf24… McKin… 1843 M 1897-03-04 1901-09-14 Regu… Irre…
#> 9 USA-1… 2 81dd231… Roose… 1858 M 1901-09-14 1909-03-04 Regu… Regu…
#> 10 USA-1… 2 81dd231… Taft 1857 M 1909-03-04 1913-03-04 Regu… Regu…
#> # … with 3,399 more rows, and 1 more variable: exitcode <chr>
create_leaderdays() will create leader-day data from archigos.
create_leaderdays()
#> # A tibble: 5,298,380 × 5
#> obsid gwcode leader date yrinoffice
#> <chr> <dbl> <chr> <date> <dbl>
#> 1 USA-1869 2 Grant 1869-03-04 1
#> 2 USA-1869 2 Grant 1869-03-05 1
#> 3 USA-1869 2 Grant 1869-03-06 1
#> 4 USA-1869 2 Grant 1869-03-07 1
#> 5 USA-1869 2 Grant 1869-03-08 1
#> 6 USA-1869 2 Grant 1869-03-09 1
#> 7 USA-1869 2 Grant 1869-03-10 1
#> 8 USA-1869 2 Grant 1869-03-11 1
#> 9 USA-1869 2 Grant 1869-03-12 1
#> 10 USA-1869 2 Grant 1869-03-13 1
#> # … with 5,298,370 more rows
I do want to note one thing about the leader-level functions in this package. Whereas Correlates of War state system membership is the default system for a lot of functions (prominently create_stateyears() and create_dyadyears()), the Gleditsch-Ward system is the default for the leader-level functions because that is the state system around which the Archigos project created its leader data. Moreover, the leader data aren’t exactly tethered to the Gleditsch-Ward state system for dates either (e.g. there are leader entries for Gleditsch-Ward states that aren’t in the system yet). In a case like this, you can standardize these leader data to either the Correlates of War system or the Gleditsch-Ward system with the standardize argument. By default, the option here is “none” (i.e. return all available leader-days recorded in the Archigos data). “cow” or “gw” standardizes the leader data to Correlates of War state system membership or Gleditsch-Ward state system membership, respectively.
create_leaderdays(standardize = "cow")
#> Joining, by = c("gwcode", "year")
#> Joining, by = c("ccode", "date")
#> # A tibble: 4,824,967 × 5
#> obsid ccode leader date yrinoffice
#> <chr> <dbl> <chr> <date> <dbl>
#> 1 USA-1869 2 Grant 1869-03-04 1
#> 2 USA-1869 2 Grant 1869-03-05 1
#> 3 USA-1869 2 Grant 1869-03-06 1
#> 4 USA-1869 2 Grant 1869-03-07 1
#> 5 USA-1869 2 Grant 1869-03-08 1
#> 6 USA-1869 2 Grant 1869-03-09 1
#> 7 USA-1869 2 Grant 1869-03-10 1
#> 8 USA-1869 2 Grant 1869-03-11 1
#> 9 USA-1869 2 Grant 1869-03-12 1
#> 10 USA-1869 2 Grant 1869-03-13 1
#> # … with 4,824,957 more rows
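One way to picture what standardization accomplishes is as a filtering join: a leader-day survives only if the state is a system member on that date. The following is a conceptual sketch under that assumption, and it is not the package's implementation. The leader, the state (gwcode 999), the dates, and the object names are all made up for illustration.

```r
library(dplyr)

# Hypothetical leader-days: a made-up leader recorded from 1 January 1990.
leader_days <- tibble(
  obsid = "HYP-1990",
  gwcode = 999,
  leader = "Hypothetical Leader",
  date = seq(as.Date("1990-01-01"), as.Date("1990-01-10"), by = "day")
)

# Hypothetical state-days: the state only enters the system on 6 January.
state_days <- tibble(
  gwcode = 999,
  date = seq(as.Date("1990-01-06"), as.Date("1990-12-31"), by = "day")
)

# Keep only the leader-days that match a state-day in the system.
standardized <- leader_days %>%
  semi_join(state_days, by = c("gwcode", "date"))

nrow(leader_days)   # 10 leader-days recorded
nrow(standardized)  # 5 leader-days remain (6 January through 10 January)
```

This mirrors the drop in rows seen above when moving from the default output to standardize = "cow": leader-days that fall outside state system membership are removed.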
The user may want to think about some additional post-processing on top of this, but this is enough to get started. From there, the same process that creates state-months can create something like leader-months.
create_leaderdays() %>%
  mutate(year = year(date),
         month = month(date)) %>%
  group_by(gwcode, obsid, year, month) %>%
  slice(1)
#> # A tibble: 177,128 × 7
#> # Groups: gwcode, obsid, year, month [177,128]
#> obsid gwcode leader date yrinoffice year month
#> <chr> <dbl> <chr> <date> <dbl> <dbl> <dbl>
#> 1 USA-1869 2 Grant 1869-03-04 1 1869 3
#> 2 USA-1869 2 Grant 1869-04-01 1 1869 4
#> 3 USA-1869 2 Grant 1869-05-01 1 1869 5
#> 4 USA-1869 2 Grant 1869-06-01 1 1869 6
#> 5 USA-1869 2 Grant 1869-07-01 1 1869 7
#> 6 USA-1869 2 Grant 1869-08-01 1 1869 8
#> 7 USA-1869 2 Grant 1869-09-01 1 1869 9
#> 8 USA-1869 2 Grant 1869-10-01 1 1869 10
#> 9 USA-1869 2 Grant 1869-11-01 1 1869 11
#> 10 USA-1869 2 Grant 1869-12-01 1 1869 12
#> # … with 177,118 more rows
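Two details of this approach are worth seeing on a small example: grouping on obsid keeps both leaders when a handover happens mid-month, and the result stays grouped unless you call ungroup(). This is a sketch on made-up data; the two leaders, the state (gwcode 999), and the dates are hypothetical. The shared handover day follows the pattern in the Archigos data above, where one leader's end date is the next leader's start date.

```r
library(dplyr)
library(lubridate)

# Hypothetical leader-days: two made-up leaders of the same state, with a
# handover on 1990-01-20 (both leaders are recorded on that day).
toy_leader_days <- bind_rows(
  tibble(obsid = "HYP-1989", gwcode = 999, leader = "Leader A",
         date = seq(as.Date("1990-01-01"), as.Date("1990-01-20"), by = "day")),
  tibble(obsid = "HYP-1990", gwcode = 999, leader = "Leader B",
         date = seq(as.Date("1990-01-20"), as.Date("1990-02-05"), by = "day"))
)

# Same steps as above, plus ungroup() so later operations are not
# silently performed by group.
toy_leader_months <- toy_leader_days %>%
  mutate(year = year(date),
         month = month(date)) %>%
  group_by(gwcode, obsid, year, month) %>%
  slice(1) %>%
  ungroup()

nrow(toy_leader_months)  # 3 leader-months: A in January, B in January, B in February
```

The date column in the result is the first recorded day of each leader-month, so it is only the first of the month when the leader was already in office on that day.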
And here are leader-years, which are pre-packaged as a peacesciencer function. The function also adds some information about leader gender, an approximation of the leader’s age that year (i.e. year - yrborn), and a running count (starting at 1) for the leader’s tenure (in years).
create_leaderyears()
#> # A tibble: 17,686 × 7
#> obsid gwcode leader gender year yrinoffice leaderage
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 USA-1869 2 Grant M 1869 1 47
#> 2 USA-1869 2 Grant M 1870 2 48
#> 3 USA-1869 2 Grant M 1871 3 49
#> 4 USA-1869 2 Grant M 1872 4 50
#> 5 USA-1869 2 Grant M 1873 5 51
#> 6 USA-1869 2 Grant M 1874 6 52
#> 7 USA-1869 2 Grant M 1875 7 53
#> 8 USA-1869 2 Grant M 1876 8 54
#> 9 USA-1869 2 Grant M 1877 9 55
#> 10 USA-1877 2 Hayes M 1877 1 55
#> # … with 17,676 more rows
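The two derived columns are simple to reproduce by hand. Here is a minimal sketch using Grant's birth year (1822) and years in office (1869 to 1877) as they appear in the output above. It illustrates the arithmetic only and is not the internal code of create_leaderyears(); grant and grant_years are hypothetical object names.

```r
library(dplyr)

# Grant's leader-years, with the birth year taken from the archigos output.
grant <- tibble(
  obsid = "USA-1869",
  leader = "Grant",
  yrborn = 1822,
  year = 1869:1877
)

grant_years <- grant %>%
  group_by(obsid) %>%
  mutate(leaderage = year - yrborn,    # approximate age in that year
         yrinoffice = row_number()) %>% # running count of tenure, starting at 1
  ungroup()

grant_years$leaderage   # 47 48 49 50 51 52 53 54 55
grant_years$yrinoffice  # 1 2 3 4 5 6 7 8 9
```

The age is an approximation because it ignores the month and day of birth: the leader may not yet have had a birthday in a given year.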
peacesciencer can also create leader dyad-year data by way of create_leaderdyadyears(). You can see some of the underlying code that creates these data. It’s a lot of code, it would take a lot of time to run from scratch, and the ensuing output is too large to store as an R data object in the package because CRAN hard-caps package size at 5 MB. Instead, users who want these data should run download_extdata() when they first install or update the package. After that, they can run create_leaderdyadyears() to create the full universe of leader dyad-year data.
# create_leaderdyadyears() is effectively doing this.
# Let's do the G-W leader dyad-year data for illustration's sake.
# Do note: `download_extdata()` will download these data and stick them in the package directory
# Thus, it is *not* downloading the data fresh each time.
readRDS(url("http://svmiller.com/R/peacesciencer/gw_dir_leader_dyad_years.rds")) %>%
  declare_attributes(data_type = "leader_dyad_year", system = "gw")
#> # A tibble: 2,336,990 × 11
#> year obsid1 obsid2 gwcode1 gwcode2 gender1 gender2 leaderage1 leaderage2
#> <int> <chr> <chr> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1870 AFG-1868 AUH-1848 700 300 M M 45 40
#> 2 1870 AFG-1868 BAV-1864 700 245 M M 45 39
#> 3 1870 AFG-1868 BRA-1840 700 140 M M 45 45
#> 4 1870 AFG-1868 CHN-1861 700 710 M M 45 35
#> 5 1870 AFG-1868 COS-1870 700 94 M M 45 39
#> 6 1870 AFG-1868 ECU-1869 700 130 M M 45 49
#> 7 1870 AFG-1868 GMY-1858 700 255 M M 45 73
#> 8 1870 AFG-1868 GRC-1863 700 350 M M 45 25
#> 9 1870 AFG-1868 IRN-1848 700 630 M M 45 39
#> 10 1870 AFG-1868 JPN-1868 700 740 M M 45 18
#> # … with 2,336,980 more rows, and 2 more variables: yrinoffice1 <dbl>,
#> # yrinoffice2 <dbl>
# ^ compare with:
# download_extdata()
# create_leaderdyadyears()