How `{peacesciencer}` Coerces Dispute-Year Data into Dyad-Year Data
Source:vignettes/coerce-dispute-year-dyad-year.Rmd
coerce-dispute-year-dyad-year.Rmd
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.1 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.4.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(peacesciencer)
#> {peacesciencer} includes additional remote data for separate download. Please type ?download_extdata() for more information.
#> This message disappears on load when these data are downloaded and in the package's `extdata` directory.
library(kableExtra)
#>
#> Attaching package: 'kableExtra'
#>
#> The following object is masked from 'package:dplyr':
#>
#> group_rows
Dyad-year models—whether directed or non-directed—seek to explain variation in conflict onset by reference to some covariates of interest. This unit of analysis in these models is the dyad-year (e.g. USA-Canada, 1920; USA-Canada, 1921) and not the dispute-year. A researcher who is not careful about the difference will end up with duplicate dyad-year observations for dyads in which there were multiple confrontations ongoing in a calendar year.
Here is my favorite case in point: the Italy-France dyad in 1860. This dyad not only had three unique disputes occurring in 1860, they also had three unique onsets that year. Two were even wars even as France was a passive participant in those. Heck, they even all started effectively at the same time, but concerned different components of the wars of Italian unification happening at that time. A researcher should be mindful about this: their unit of analysis is supposed to be the dyad-year, not the dispute-year. Not knowing the difference is the difference of having three Italy-France observations for 1860 or just one. The researcher who has a dyad-year design wants the latter, not the former.
haven::read_dta("~/Dropbox/data/cow/mid/5/MIDB 5.0.dta") %>%
filter(dispnum %in% c(112, 113, 306)) %>%
select(dispnum:sidea, fatality, hiact, hostlev)
#> # A tibble: 8 × 13
#> dispnum stabb ccode stday stmon styear endday endmon endyear sidea fatality
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 112 ITA 325 7 9 1860 29 9 1860 1 4
#> 2 112 FRN 220 10 9 1860 29 9 1860 0 0
#> 3 112 PAP 327 7 9 1860 29 9 1860 0 5
#> 4 113 ITA 325 18 9 1860 13 2 1861 1 5
#> 5 113 FRN 220 13 11 1860 19 1 1861 0 0
#> 6 113 SIC 329 18 9 1860 13 2 1861 0 4
#> 7 306 ITA 325 17 9 1860 19 1 1861 0 -9
#> 8 306 FRN 220 17 9 1860 19 1 1861 1 -9
#> # ℹ 2 more variables: hiact <dbl>, hostlev <dbl>
This means a researcher must make careful design decisions about
which cases to exclude from their data. There is no correct answer here,
per se. There is good reason theoretically to employ certain case
exclusion rules before others, which is what
peacesciencer will do by default in the
add_cow_mids()
and add_gml_mids()
functions.
This vignette will explain what peacesciencer does by
default. Users who want to employ their own case exclusion rules are
free to use the “whittle” class of functions
(e.g. whittle_conflicts_onsets()
,
whittle_conflicts_fatality()
) on the dyadic dispute-year
data included in this package. I will start with version 5.0 of the
CoW-MID data and the dyadic dispute-year data I created from it.
Converting CoW-MID Dyadic Dispute-Year Data into Dyad-Year Data
First, let’s identify where there are dyad-year duplicates in the data.
cow_mid_dirdisps %>%
# make it non-directed for ease of presentation
filter(ccode2 > ccode1) %>%
group_by(ccode1, ccode2, year) %>%
summarize(n = n(),
mids = paste0(dispnum, collapse = ", ")) %>%
arrange(-n) %>%
filter(n > 1) %>%
ungroup()
#> `summarise()` has grouped output by 'ccode1', 'ccode2'. You can override using
#> the `.groups` argument.
#> # A tibble: 498 × 5
#> ccode1 ccode2 year n mids
#> <dbl> <dbl> <dbl> <int> <chr>
#> 1 200 365 1920 6 186, 197, 1133, 2363, 2364, 2604
#> 2 2 365 1958 5 125, 173, 608, 2215, 2216
#> 3 651 666 1959 5 3375, 3405, 3419, 3421, 3430
#> 4 651 666 1960 5 3375, 3405, 3419, 3422, 3430
#> 5 652 666 1955 5 3404, 3405, 3416, 3417, 3418
#> 6 2 365 1962 4 61, 1353, 2219, 3361
#> 7 2 365 1967 4 345, 2930, 2931, 2934
#> 8 200 365 1919 4 197, 2363, 2604, 2605
#> 9 651 666 1958 4 3375, 3405, 3419, 3420
#> 10 652 666 1954 4 3403, 3404, 3415, 3417
#> # ℹ 488 more rows
The absolute most in the data is the United Kingdom-Soviet Union dyad, which had six conflicts ongoing and/or initiated in 1920. Next most is a tie between the United States-Soviet Union dyad in 1958, the Egypt-Israel dyad (1959, 1960), and the Syria-Israel dyad (1955). All told, there are 498 dyad-years that duplicate in the dyadic dispute-year data. We need to whittle those down to where there is no more than one dyad-year in these data.
First: Select Unique Onsets
The primary aim is to preserve the unique onsets. The case of the United States-United Kingdom dyad in 1903 will illustrate what’s at stake here. Here, the United States and United Kingdom had three MIDs ongoing in 1903. Two (MID#0002 and MID#0254) began in 1902. The third, MID#3301, is a new onset. In this case, we want to remove the observation for MID#0002 and MID#0254 and keep the observation for MID3301.
cow_mid_dirdisps %>%
filter(ccode1 == 2 & ccode2 == 200 & year == 1903) %>%
select(dispnum:disponset) %>%
kbl(.,
caption = "United States-United Kingdom Dyadic Dispute-Years in 1903",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | dispongoing | disponset |
---|---|---|---|---|---|
2 | 2 | 200 | 1903 | 1 | 0 |
254 | 2 | 200 | 1903 | 1 | 0 |
3301 | 2 | 200 | 1903 | 1 | 1 |
Here’s how peacesciencer does this first cut. Grouping
by dyad-year (i.e. group_by(ccode1, ccode2, year)
), it
creates a new variable that equals 1 if the number of rows by dyad-year
is more than 1. Maintaining the same grouped structure, it calculates
the standard deviation of the the disponset
variable. Cases
where no standard deviation could be calculate are cases where the
dyad-year does not duplicate and these are assigned as 0. Next, it
creates a simple removeme
column that equals 1 if 1) it’s a
duplicated dyad-year, and 2) it’s not a unique onset, and 3) the
standard deviation is greater than 0 (i.e. there is at least one onset
in that dyad-year). It then removes cases where
removeme == 1
.
cow_mid_dirdisps %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
# Remove anything that's not a unique MID onset
mutate(sd = sd(disponset),
sd = ifelse(is.na(sd), 0, sd)) %>%
mutate(removeme = ifelse(duplicated == 1 & disponset == 0 & sd > 0,
1, 0)) %>%
filter(removeme != 1) %>%
# remove detritus
select(-removeme, -sd) %>%
# practice safe group_by()
ungroup() -> hold_this
# ^ The `hold_this` naming convention is my favorite for intermediate objects.
# It's also a bad idea to overwrite data objects that come in this package.
Observe how it fixed that USA-United Kingdom observation in 1903.
hold_this %>%
filter(ccode1 == 2 & ccode2 == 200 & year == 1903) %>%
select(dispnum:disponset)
#> # A tibble: 1 × 6
#> dispnum ccode1 ccode2 year dispongoing disponset
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3301 2 200 1903 1 1
It did not fix the Italy-France problem from 1860, but that’s because all three dispute-years were onsets that year.
hold_this %>%
filter(ccode1 == 220 & ccode2 == 325 & year == 1860) %>%
select(dispnum:disponset) %>%
kbl(., caption = "France-Italy Dyadic Dispute-Years in 1903",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | dispongoing | disponset |
---|---|---|---|---|---|
112 | 220 | 325 | 1860 | 1 | 1 |
113 | 220 | 325 | 1860 | 1 | 1 |
306 | 220 | 325 | 1860 | 1 | 1 |
This just tells us we’re not done, but we knew we wouldn’t be. We need more exclusion rules to whittle down the data.
Second: Keep the Highest Dispute-Level Fatality
If presented the opportunity to keep one dispute and drop another where two appear in a year, researchers will likely prefer the more “serious” one rather than the one that might have been a simple threat to use or show of force. Consider this Russia-Ottoman Empire (Turkey) dyad-year in 1853. There are two unique onsets between the two that year. One (MID#0057) became the Crimean War, an important conflict! The other (MID#0126) was an apparent show of force with no fatalities. Under those conditions, it’s an easy call to keep the one with more fatalities.
hold_this %>%
filter(ccode1 == 365 & ccode2 == 640 & year == 1853) %>%
select(dispnum:disponset, fatality1:fatality2, hiact1, hiact2) %>%
kbl(.,
caption = "Russia-Ottoman Empire Dyadic Dispute-Years in 1853",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | dispongoing | disponset | fatality1 | fatality2 | hiact1 | hiact2 |
---|---|---|---|---|---|---|---|---|---|
57 | 365 | 640 | 1853 | 1 | 1 | 6 | 6 | 20 | 20 |
126 | 365 | 640 | 1853 | 1 | 1 | 0 | 0 | 7 | 0 |
There is one limitation with CoW-MID data toward this end. We obviously know CoW-MID only assigns fatalities at the end of the dispute to the participants, so we’d have no way of knowing a priori how many fatalities in that Russia-Turkey dyad were in 1853. We could have a situation like Belgium-Germany in 1939-1940. In that case, the highest action in which Belgium engaged against Germany in 1939 was a mobilization and the war that momentarily eliminated Belgium from the international system happened the next year. We also don’t know to what extent Turkey was responsible for Russia’s fatalities. The Crimean War was a multilateral war pitting the Russians against the United Kingdom, Austria-Hungary, Italy, Turkey, and France.
Thus, what follows is crude, but still useful. We’ll use the
dispute-level fatality information as a stand-in here and keep the
duplicate dyad-year observation with the highest fatality score. We’ll
also need to take inventory of how to handle the cases where
fatality == -9
. In a forthcoming data release, we find that
cases of missing fatalities in the CoW-MID data mean that there were
fatalities in more than half of the cases. Some were even wars! However,
we’d have no way of knowing this from CoW-MID. We’ll be safe and recode
-9 to be .5, indicating more than 0 fatalities but “less” than the
fatality level of 1 (1-25 deaths) in that CoW-MID can at least
confidently say the latter happened.
hold_this %>%
left_join(., cow_mid_disps %>% select(dispnum, fatality)) %>%
mutate(fatality = ifelse(fatality == -9, .5, fatality)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the highest fatality
filter(fatality == max(fatality)) %>%
mutate(fatality = ifelse(fatality == .5, -9, fatality)) %>%
arrange(ccode1, ccode2, year) %>%
# practice safe group_by()
ungroup() -> hold_this
#> Joining with `by = join_by(dispnum)`
This will fix the Russia-Turkey-1853 problem.
hold_this %>% filter(ccode1 == 365 & ccode2 == 640 & year == 1853)
#> # A tibble: 1 × 20
#> dispnum ccode1 ccode2 year dispongoing disponset sidea1 sidea2 fatality1
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 57 365 640 1853 1 1 1 0 6
#> # ℹ 11 more variables: fatality2 <dbl>, fatalpre1 <dbl>, fatalpre2 <dbl>,
#> # hiact1 <dbl>, hiact2 <dbl>, hostlev1 <dbl>, hostlev2 <dbl>, orig1 <dbl>,
#> # orig2 <dbl>, duplicated <dbl>, fatality <dbl>
It won’t fix cases where there were multiple disputes initiated in the same year in the dyad, but no one died. There are lot of these. So, we’ll need more case exclusion rules.
Third: Keep the Highest Dispute-Level Hostility
The next case exclusion rule will want to continue isolating those serious MIDs from MIDs of lesser severity. Consider this case of India and Pakistan in 1963.
hold_this %>%
filter(ccode1 == 750 & ccode2 == 770 & year == 1963) %>%
select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2) %>%
kbl(., caption = "India-Pakistan Dyadic Dispute-Years in 1963",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | disponset | fatality1 | fatality2 | hiact1 | hiact2 |
---|---|---|---|---|---|---|---|---|
1317 | 750 | 770 | 1963 | 1 | 0 | 0 | 0 | 14 |
2630 | 750 | 770 | 1963 | 1 | 0 | 0 | 0 | 1 |
These are two unique MID onsets in 1963 and neither was fatal, meaning this duplicate dyad-year is still here. However, MID#2630 was just a threat to use force whereas MID#1317 had an occupation of territory (by Pakistan against India). The former is a threat. The latter is a use. MID#2630 has a higher hostility level and that is the MID we’ll want to keep. The same caveat applies, as it did with fatalities, so we’ll have to use the dispute-level hostility variable as a plug-in here.
hold_this %>%
left_join(., cow_mid_disps %>% select(dispnum, hostlev)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the highest hostlev
filter(hostlev == max(hostlev)) %>%
arrange(ccode1, ccode2, year) %>%
# practice safe group_by()
ungroup() -> hold_this
#> Joining with `by = join_by(dispnum)`
This will at least fix that India-Pakistan observation in 1963, and others like it.
hold_this %>%
filter(ccode1 == 750 & ccode2 == 770 & year == 1963) %>%
select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2)
#> # A tibble: 1 × 9
#> dispnum ccode1 ccode2 year disponset fatality1 fatality2 hiact1 hiact2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1317 750 770 1963 1 0 0 0 14
Fourth: Keep the Highest Dispute-Level (Minimum, Then Maximum) Duration
At this point, we still have duplicate dyad-years remaining in these data, but we’ve selected on cases that are fairly similar to each other (at least given the dispute- and participant-level data that are available). The duplicates that remain will be unique onsets with the same fatality levels and hostility levels. The next available measure that approximates dispute severity is duration. Consider this duplicate observation of Colombia-Peru in 1852 and the corresponding MIDs (MID#1506 and MID#1523).
haven::read_dta("~/Dropbox/data/cow/mid/5/MIDB 5.0.dta") %>%
filter(dispnum %in% c(1506, 1523)) %>%
select(dispnum:sidea, fatality, hiact, hostlev)
#> # A tibble: 7 × 13
#> dispnum stabb ccode stday stmon styear endday endmon endyear sidea fatality
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1506 VEN 101 -9 10 1852 -9 11 1852 1 0
#> 2 1506 CHL 155 14 9 1852 -9 11 1852 0 0
#> 3 1506 PER 135 -9 8 1852 -9 11 1852 0 0
#> 4 1506 COL 100 -9 8 1852 -9 11 1852 1 0
#> 5 1523 PER 135 -9 3 1852 18 7 1852 0 0
#> 6 1523 CHL 155 2 6 1852 2 6 1852 1 0
#> 7 1523 COL 100 -9 3 1852 18 7 1852 1 0
#> # ℹ 2 more variables: hiact <dbl>, hostlev <dbl>
These MIDs look fairly similar. They both started the same year. They
both have the same level of fatalities (none). They both have the same
hostility level (a show of force). It would be tough to read tea leaves
to argue that an alert (hiact: 8
) is “greater” than a show
of force (hiact: 7
) even as 8 > 7 (i.e. CoW-MID action
codes have never been truly ordinal). Further, they’re both multilateral
MIDs. MID#1506 pit Venezuela and Colombia against Chile and Peru whereas
MID#1523 pit Chile and Colombia against Peru. Both even unhelpfully have
some unknown duration to them. There are -9s in start days in both.
However, MID#1523 has the highest minimum duration. It lasted at least 110 days (and as many as 140) whereas MID#1506 has a minimum duration of 63 days (and a maximum duration of 122 days). Under those conditions, we will keep the one with the minimum duration and then, where duplicates still remain, keep the one with the highest maximum duration.
hold_this %>%
left_join(., cow_mid_disps %>% select(dispnum, mindur, maxdur)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the highest mindur
filter(mindur == max(mindur)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the highest maxdur
filter(maxdur == max(maxdur)) %>%
# practice safe group_by()
ungroup() -> hold_this
#> Joining with `by = join_by(dispnum)`
This will fix that Colombia-Peru problem in 1852.
hold_this %>%
filter(ccode1 == 135 & ccode2 == 100 & year == 1852) %>%
select(dispnum:year, disponset, fatality1, fatality2, hiact1, hiact2)
#> # A tibble: 1 × 9
#> dispnum ccode1 ccode2 year disponset fatality1 fatality2 hiact1 hiact2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1523 135 100 1852 1 0 0 0 8
Final Case Exclusions for the CoW-MID Data
We had started with 498 duplicate directed dyad-years in the dyadic dispute-year data. We’re now down to just 24 directed (12 non-directed) dyad-years. A glance at these remaining observations suggest the substance here is very similar. For example, MID#4428 and MID#4430 are both one-day border fortifications between Kyrgyzstan and Uzbekistan in 2005. MID#2171 and MID#2172 are both one-day threats to use force between Cyprus and Turkey in 1965.
hold_this %>%
group_by(ccode1, ccode2, year) %>%
filter(n() > 1) %>% filter(ccode2 > ccode1) %>%
select(dispnum:disponset, hiact1:hiact2, fatality:maxdur) %>%
kbl(., caption = "Duplicate Non-Directed Dyad-Years Still Remaining",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | dispongoing | disponset | hiact1 | hiact2 | fatality | hostlev | mindur | maxdur |
---|---|---|---|---|---|---|---|---|---|---|---|
2233 | 2 | 365 | 1986 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
3637 | 2 | 365 | 1986 | 1 | 1 | 0 | 7 | 0 | 3 | 1 | 1 |
2171 | 352 | 640 | 1965 | 1 | 1 | 1 | 1 | 0 | 2 | 1 | 1 |
2172 | 352 | 640 | 1965 | 1 | 1 | 0 | 1 | 0 | 2 | 1 | 1 |
4416 | 365 | 372 | 2003 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
4420 | 365 | 372 | 2003 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
2800 | 541 | 560 | 1987 | 1 | 1 | 15 | 12 | 0 | 4 | 1 | 1 |
2801 | 541 | 560 | 1987 | 1 | 1 | 0 | 16 | 0 | 4 | 1 | 1 |
4428 | 703 | 704 | 2005 | 1 | 1 | 11 | 11 | 0 | 3 | 1 | 1 |
4430 | 703 | 704 | 2005 | 1 | 1 | 11 | 0 | 0 | 3 | 1 | 1 |
4225 | 731 | 740 | 1999 | 1 | 1 | 12 | 7 | 0 | 3 | 4 | 4 |
4322 | 731 | 740 | 1999 | 1 | 1 | 0 | 8 | 0 | 3 | 4 | 4 |
The final case exclusion rules will round us home. First, a few of these duplicate dyad-years feature a case where one dispute was reciprocated and the other was not. For example, MID#4428 was a mutual border fortification while MID#4430 was just one border fortification directed by Kyrgyzstan against Uzbekistan. Thus, we should keep the one that involved at least two codable incidents rather than the MID in which there was just one codable incident.
A reader may object here that reciprocation should feature higher in the proverbial chain, given its prominence in the audience cost literature. I caution that we should not do this. Gibler and Miller (also with Little) have driven home that the reciprocation variable is an information-poor variable. It only minimally tells you that Side B in a MID initiated a militarized incident or was involved in an attack in which there was no clear initiator. In our review of the conflict data, we find that attacks or ambushes initiated by Side A are countered when they happen more than half the time. Further, inferences made from the reciprocation variable are among the most sensitive to the errors we report in the CoW-MID data. For that reason, we discourage researchers from using this variable for their analyses and, for this application, it’s why peacesciencer uses the dispute-level reciprocation variable near the bottom of the rung in its case exclusions.
Still, here’s how to do that.
hold_this %>%
left_join(., cow_mid_disps %>% select(dispnum, recip)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the reciprocated ones, where non-reciprocated ones exist
filter(recip == max(recip)) %>%
arrange(ccode1, ccode2, year) %>%
# practice safe group_by()
ungroup() -> hold_this
#> Joining with `by = join_by(dispnum)`
We’re down to just three duplicate dyad-years now. The only reason MID#4428 and MID#4430 are both still there is CoW-MID has MID#4428 as unreciprocated at the dispute-level while it also has a militarized incident for Side B in the dispute. This is a CoW-MID issue and not a peacesciencer issue.
hold_this %>%
group_by(ccode1, ccode2, year) %>%
filter(n() > 1) %>% filter(ccode2 > ccode1) %>%
select(dispnum:disponset, hiact1:hiact2, fatality:maxdur) %>%
kbl(., caption = "Duplicate Non-Directed Dyad-Years Still Remaining",
booktabs = TRUE, longtable = TRUE) %>%
kable_styling(position = "center", full_width = F,
bootstrap_options = "striped")
dispnum | ccode1 | ccode2 | year | dispongoing | disponset | hiact1 | hiact2 | fatality | hostlev | mindur | maxdur |
---|---|---|---|---|---|---|---|---|---|---|---|
2233 | 2 | 365 | 1986 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
3637 | 2 | 365 | 1986 | 1 | 1 | 0 | 7 | 0 | 3 | 1 | 1 |
4416 | 365 | 372 | 2003 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
4420 | 365 | 372 | 2003 | 1 | 1 | 7 | 0 | 0 | 3 | 1 | 1 |
4428 | 703 | 704 | 2005 | 1 | 1 | 11 | 11 | 0 | 3 | 1 | 1 |
4430 | 703 | 704 | 2005 | 1 | 1 | 11 | 0 | 0 | 3 | 1 | 1 |
All three are effectively identical MIDs. They start the same year. They have the same fatality-level, hostility-level, duration, and both are either are reciprocated or not-reciprocated (that MID#4428/MID#4430 issue notwithstanding). Thus, we will select the one that has the lowest start month.
hold_this %>%
left_join(., cow_mid_disps %>% select(dispnum, stmon)) %>%
arrange(ccode1, ccode2, year) %>%
group_by(ccode1, ccode2, year) %>%
mutate(duplicated = ifelse(n() > 1, 1, 0)) %>%
group_by(ccode1, ccode2, year, duplicated) %>%
# Keep the reciprocated ones, where non-reciprocated ones exist
filter(stmon == min(stmon)) %>%
arrange(ccode1, ccode2, year) %>%
# practice safe group_by()
ungroup() -> hold_this
#> Joining with `by = join_by(dispnum)`
# And we're done
And this is enough to eliminate duplicate dyad-years.
hold_this %>%
group_by(ccode1, ccode2, year) %>%
filter(n() > 1)
#> # A tibble: 0 × 25
#> # Groups: ccode1, ccode2, year [0]
#> # ℹ 25 variables: dispnum <dbl>, ccode1 <dbl>, ccode2 <dbl>, year <dbl>,
#> # dispongoing <dbl>, disponset <dbl>, sidea1 <dbl>, sidea2 <dbl>,
#> # fatality1 <dbl>, fatality2 <dbl>, fatalpre1 <dbl>, fatalpre2 <dbl>,
#> # hiact1 <dbl>, hiact2 <dbl>, hostlev1 <dbl>, hostlev2 <dbl>, orig1 <dbl>,
#> # orig2 <dbl>, duplicated <dbl>, fatality <dbl>, hostlev <dbl>, mindur <dbl>,
#> # maxdur <dbl>, recip <dbl>, stmon <dbl>