library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
#> ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
#> ✓ tibble  3.1.4     ✓ dplyr   1.0.7
#> ✓ tidyr   1.1.4     ✓ stringr 1.4.0
#> ✓ readr   2.0.2     ✓ forcats 0.5.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(peacesciencer)
#> {peacesciencer} includes additional remote data for separate download. Please type ?download_extdata() for more information.
#> This message disappears on load when these data are downloaded and in the package's `extdata` directory.
library(kableExtra)
#> 
#> Attaching package: 'kableExtra'
#> The following object is masked from 'package:dplyr':
#> 
#>     group_rows
create_bench <- readRDS("~/Dropbox/projects/peacesciencer/data-raw/times/create_bench.rds")
state_bench <- readRDS("~/Dropbox/projects/peacesciencer/data-raw/times/state_bench.rds")
dyad_bench <- readRDS("~/Dropbox/projects/peacesciencer/data-raw/times/dyad_bench.rds")

I had time around watching the UEFA Euro 2020 action to evaluate the expected run times in peacesciencer. The TV is in the living room, which meant I could have my laptop open for this. My laptop is a more appropriate computer for the job because my desktop is comically overpowered and may set unrealistic expectations (for me) about how other users might experience peacesciencer.[1]
A subdirectory on the project’s GitHub shows what I did here. I grouped the functions in peacesciencer into two types, with one type having two subcomponents. The first type creates base data. These are create_statedays(), create_stateyears(), and create_dyadyears(). The second type broadly changes data, either adding to it or subtracting from it. For convenience’s sake, it’s good to think of this family of peacesciencer functions as applicable to state-year or dyad-year data. With that in mind, I used the {microbenchmark} package to run each of the relevant functions 100 times, across those two types (and three overall groups), to see how long these functions can take for a user with a computer similar to mine. The times are recorded in nanoseconds (the tables below report them in seconds), and the benchmarking happened while I was also fiddling with other things, perusing the internet, and watching UEFA Euro 2020. Thus, what I offer here is illustrative, but still useful.
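For readers who have not used {microbenchmark}, here is a minimal sketch of what one such benchmark could look like, assuming both {microbenchmark} and {peacesciencer} are installed. It is not the exact script from the project's subdirectory, and the object name `create_bench_sketch` is made up for illustration.

```r
library(microbenchmark)
library(peacesciencer)

# Run each base-data function 100 times with its default arguments.
# Expect this to take several minutes, mostly because of create_dyadyears().
create_bench_sketch <- microbenchmark(
  create_statedays(),
  create_stateyears(),
  create_dyadyears(),
  times = 100
)

# The result is a data frame with one row per run:
# `expr` identifies the call and `time` is the elapsed time in nanoseconds.
head(as.data.frame(create_bench_sketch))
```

Lowering `times` gives a quicker, if noisier, picture on a slower machine.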
A user can see the associated R Markdown file for this vignette to see the code for processing/formatting, so I want to focus on just the substance. Here is a summary table of the run times for the create_* functions from this exercise. Note that these functions were executed with the default options, so these are all Correlates of War state system data from 1816 to the most recently concluded calendar year.
| Function | Average (s) | Median (s) | 95% Interval (s) | Minimum (s) | Maximum (s) | 
|---|---|---|---|---|---|
| create_statedays() | 0.963 | 0.948 | [0.43, 1.791] | 0.370 | 2.260 | 
| create_stateyears() | 0.039 | 0.035 | [0.027, 0.066] | 0.026 | 0.199 | 
| create_dyadyears() | 4.757 | 4.649 | [3.565, 6.46] | 3.437 | 7.167 | 
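The columns in the table above can be computed from raw nanosecond timings along the following lines. This is a sketch rather than the code in the vignette's R Markdown file: the helper `summarize_bench()` and the toy timings are hypothetical, and I am assuming the 95% interval is the 2.5th to 97.5th percentile of the run times.

```r
# Hypothetical raw timings in nanoseconds, made up purely for illustration.
toy_bench <- data.frame(
  expr = rep(c("f()", "g()"), each = 5),
  time = c(1.0e9, 1.2e9, 0.9e9, 1.1e9, 1.3e9,
           2.0e7, 3.0e7, 2.5e7, 2.2e7, 4.0e7)
)

# Hypothetical helper: one summary row per benchmarked expression, in seconds.
summarize_bench <- function(bench) {
  secs <- bench$time / 1e9  # nanoseconds -> seconds
  rows <- lapply(split(secs, as.character(bench$expr)), function(x) {
    data.frame(
      average = mean(x),
      median  = median(x),
      lower95 = unname(quantile(x, 0.025)),  # assumed definition of the interval
      upper95 = unname(quantile(x, 0.975)),
      minimum = min(x),
      maximum = max(x)
    )
  })
  out <- do.call(rbind, rows)
  out <- data.frame(fn = rownames(out), out, row.names = NULL)
  # Order from most to least time-consuming, as in the tables here.
  out[order(-out$average), ]
}

bench_summary <- summarize_bench(toy_bench)
bench_summary
```

The same helper should work on a {microbenchmark} result, since that object also carries `expr` and `time` columns.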
The simulations show that creating dyad-years is by far the most time-intensive data-creating function in peacesciencer. This is not terribly surprising. The code for create_dyadyears() is transforming the raw Correlates of War (or Gleditsch-Ward) state system data. The nature of this transformation is invariably going to take more time than state-year or even state-day summaries of the data. That said, about 4-5 seconds for creating these data is pretty damn good, all things considered.
These are the run times for the functions that add to state-year data, ordered from most time-consuming to least time-consuming.
| Function | Average (s) | Median (s) | 95% Interval (s) | Minimum (s) | Maximum (s) | 
|---|---|---|---|---|---|
| add_archigos() | 6.291 | 6.185 | [5.25, 7.726] | 4.948 | 8.195 | 
| add_creg_fractionalization() [G-W] | 3.389 | 3.313 | [2.93, 4.052] | 2.817 | 4.229 | 
| add_creg_fractionalization() [CoW] | 3.350 | 3.284 | [2.919, 4.137] | 2.845 | 4.561 | 
| add_capital_distance() | 2.310 | 2.256 | [1.975, 2.86] | 1.936 | 3.234 | 
| add_strategic_rivalries() | 2.254 | 2.199 | [2.01, 2.717] | 1.940 | 3.123 | 
| add_cow_wars(type = "intra") | 0.615 | 0.581 | [0.514, 0.917] | 0.508 | 0.972 | 
| add_contiguity() | 0.391 | 0.373 | [0.334, 0.614] | 0.318 | 0.798 | 
| add_gwcode_to_cow() | 0.379 | 0.353 | [0.309, 0.633] | 0.303 | 0.691 | 
| add_ucdp_acd() | 0.153 | 0.145 | [0.129, 0.205] | 0.128 | 0.407 | 
| add_minimum_distance() [G-W] | 0.151 | 0.142 | [0.119, 0.214] | 0.113 | 0.572 | 
| add_minimum_distance() [CoW] | 0.142 | 0.136 | [0.117, 0.187] | 0.108 | 0.433 | 
| add_peace_years() [CoW Intra-State Wars] | 0.119 | 0.114 | [0.098, 0.147] | 0.097 | 0.398 | 
| add_peace_years() [(G-W) UCDP ACD] | 0.119 | 0.114 | [0.096, 0.181] | 0.092 | 0.250 | 
| add_cow_majors() | 0.027 | 0.025 | [0.021, 0.043] | 0.020 | 0.045 | 
| add_ccode_to_gw() | 0.010 | 0.009 | [0.008, 0.015] | 0.008 | 0.018 | 
| add_democracy() [G-W] | 0.010 | 0.007 | [0.006, 0.012] | 0.005 | 0.302 | 
| add_sdp_gdp() [G-W] | 0.009 | 0.008 | [0.006, 0.014] | 0.006 | 0.026 | 
| add_sdp_gdp() [CoW] | 0.008 | 0.007 | [0.006, 0.013] | 0.006 | 0.015 | 
| add_cow_trade() | 0.007 | 0.006 | [0.005, 0.011] | 0.005 | 0.012 | 
| add_democracy() [CoW] | 0.007 | 0.006 | [0.005, 0.012] | 0.005 | 0.019 | 
| add_igos() | 0.007 | 0.006 | [0.005, 0.01] | 0.005 | 0.020 | 
| add_nmc() | 0.007 | 0.007 | [0.005, 0.015] | 0.005 | 0.021 | 
| add_rugged_terrain() [G-W] | 0.007 | 0.006 | [0.005, 0.013] | 0.005 | 0.022 | 
| add_rugged_terrain() [CoW] | 0.006 | 0.006 | [0.005, 0.01] | 0.005 | 0.014 | 
There are five functions for which the average execution time is over a second. Knowing what I know about how I wrote these functions, a few of them make some sense. add_archigos() takes the most time by far, an average of over six seconds, largely because it needs to rowwise-transform a subset of the raw data to extend dates into leader-days, then calculate the relevant variables as a group-by mutate, and then do the most time-consuming operation I sometimes bury in these functions: a group-by slice for eliminating duplicates. add_creg_fractionalization() has this same group-by slice largely because its state codes are not quite Correlates of War and not quite Gleditsch-Ward, and a group-by slice is one of my go-tos for eliminating grouped duplicates. add_capital_distance() is a bit time-consuming because it’s doing on-the-fly “as the crow flies” distance estimates between state capitals using the provided latitude/longitude coordinates. add_strategic_rivalries() doesn’t have any of these, but it has a lot of buried if-elses for how a user may want to calculate the presence of a rivalry type at the state-year.
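To make the "group-by slice" idea concrete, here is a small sketch with {dplyr}. The toy data are made up, and this is not the internal code of add_archigos() or add_creg_fractionalization(); it only illustrates the general pattern, next to a distinct() call that, as far as I know, gives the same first-row-per-group result and is often cheaper.

```r
library(dplyr)

# Hypothetical state-year rows with one duplicated ccode-year (made up for illustration).
toy <- tibble(
  ccode = c(2, 2, 2, 20, 20),
  year  = c(2000, 2000, 2001, 2000, 2001),
  value = c(0.5, 0.6, 0.7, 0.1, 0.2)
)

# Group-by slice: keep the first row within each ccode-year group.
by_slice <- toy %>%
  group_by(ccode, year) %>%
  slice(1) %>%
  ungroup()

# distinct() with .keep_all = TRUE also keeps the first row per ccode-year,
# without explicitly grouping the data first.
by_distinct <- toy %>%
  distinct(ccode, year, .keep_all = TRUE)

by_slice
```

With many thousands of groups, the per-group work in the slice is where the time tends to go, which fits the run times above.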
The dyad-year run times are a little bit more interesting and will merit some further explanation. Most of these are a little time-consuming because of the reasons mentioned above (e.g. group-by slices, as in the add_creg_fractionalization() and add_archigos() cases). The peace-year calculations are a little time-consuming, but ultimately have a straightforward explanation.
| Function | Average (s) | Median (s) | 95% Interval (s) | Minimum (s) | Maximum (s) | 
|---|---|---|---|---|---|
| add_peace_years() [CoW-MID] | 12.913 | 12.566 | [11.574, 15.679] | 11.297 | 15.982 | 
| add_peace_years() [GML MID] | 12.319 | 11.967 | [11.057, 14.449] | 10.914 | 16.352 | 
| add_archigos() | 5.994 | 5.837 | [5.068, 7.524] | 4.904 | 7.815 | 
| add_creg_fractionalization() [CoW] | 3.588 | 3.430 | [3.143, 4.454] | 3.068 | 4.911 | 
| add_creg_fractionalization() [G-W] | 3.548 | 3.433 | [3.143, 4.447] | 3.043 | 4.582 | 
| filter_prd() [+ add_contiguity() + add_cow_majors()] | 2.868 | 2.840 | [2.471, 3.616] | 2.352 | 3.746 | 
| add_contiguity() | 2.071 | 1.974 | [1.778, 2.671] | 1.762 | 2.853 | 
| add_capital_distance() | 1.881 | 1.851 | [1.623, 2.322] | 1.549 | 2.484 | 
| add_cow_wars(type = "inter") | 1.395 | 1.353 | [1.19, 1.79] | 1.188 | 2.119 | 
| add_cow_trade() | 1.025 | 0.954 | [0.854, 1.36] | 0.851 | 1.564 | 
| add_igos() | 1.002 | 0.961 | [0.874, 1.318] | 0.871 | 1.561 | 
| add_minimum_distance() [CoW] | 0.956 | 0.902 | [0.787, 1.457] | 0.772 | 1.620 | 
| add_minimum_distance() [G-W] | 0.929 | 0.894 | [0.795, 1.288] | 0.792 | 1.406 | 
| add_gwcode_to_cow() | 0.773 | 0.740 | [0.671, 1.03] | 0.648 | 1.135 | 
| add_atop_alliance() | 0.634 | 0.571 | [0.492, 0.933] | 0.488 | 0.985 | 
| add_cow_majors() | 0.623 | 0.583 | [0.514, 0.873] | 0.503 | 1.156 | 
| add_nmc() | 0.572 | 0.528 | [0.458, 0.83] | 0.451 | 0.961 | 
| add_sdp_gdp() [G-W] | 0.521 | 0.478 | [0.415, 0.853] | 0.414 | 0.896 | 
| add_cow_alliance() | 0.515 | 0.462 | [0.416, 0.787] | 0.412 | 0.864 | 
| add_sdp_gdp() [CoW] | 0.497 | 0.449 | [0.412, 0.819] | 0.402 | 0.997 | 
| add_democracy() [G-W] | 0.444 | 0.408 | [0.369, 0.737] | 0.358 | 0.874 | 
| add_gml_mids(keep=NULL) | 0.444 | 0.399 | [0.345, 0.755] | 0.336 | 0.908 | 
| add_democracy() [CoW] | 0.443 | 0.411 | [0.371, 0.638] | 0.361 | 0.805 | 
| add_cow_mids(keep=NULL) | 0.440 | 0.408 | [0.359, 0.713] | 0.346 | 0.863 | 
| add_strategic_rivalries() | 0.418 | 0.382 | [0.345, 0.739] | 0.340 | 0.915 | 
| add_ccode_to_gw() | 0.414 | 0.383 | [0.348, 0.653] | 0.342 | 0.762 | 
| add_rugged_terrain() [CoW] | 0.402 | 0.360 | [0.311, 0.736] | 0.308 | 0.840 | 
| add_rugged_terrain() [G-W] | 0.384 | 0.347 | [0.306, 0.718] | 0.301 | 0.734 | 
Basically, add_peace_years() works generally with a variety of data types you feed it. It’s also implicitly a grouped function. For state-year data, that means you have about 217 “groups” (i.e. states) in the Correlates of War cases. If you want, as I do here, the full damn universe of Correlates of War dyads from 1816 to 2020, that means you’ll have 41,252 dyads for which add_peace_years() will calculate your peace spells. So yeah, that’s going to take some time. You can cut that roughly in half by filtering the data to just politically relevant dyads before calculating peace years.
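A sketch of that filter-first ordering, using only functions named in the tables above. It assumes the version of peacesciencer this vignette describes, and that filter_prd() wants contiguity and major-power information added beforehand, as the table entry suggests; the object name `prd_peace_years` is made up.

```r
library(dplyr)
library(peacesciencer)

# Reduce the dyad-years to politically relevant dyads first, so that
# add_peace_years() has far fewer dyads to calculate spells for.
prd_peace_years <- create_dyadyears() %>%
  add_contiguity() %>%
  add_cow_majors() %>%
  filter_prd() %>%
  add_cow_mids(keep = NULL) %>%
  add_peace_years()
```

Calling add_peace_years() before filter_prd() should produce the same peace years for the retained dyads, just after more waiting.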
Ultimately, the examples on the README show that you can do most things in peacesciencer in a matter of seconds. Unless you’re stress-testing the package’s ability to do everything on the full universe of dyad-year data, you can create the kind of data you want in well under a minute. Some functions take longer than others, mostly because of some hacks I built into these functions on the premise that I know they’ll work as I intend them to work (even if a more optimal alternative is possible).
[1] My laptop is pretty good as far as performance laptops go. At the least, it has 16 GB of RAM. That is on the high end as far as most consumer laptops go, but dedicated professionals may have a laptop similar to what I have.