Users who wish to improve their own data management skills in R
by looking at how peacesciencer functions are written will
see that the basic foundation of peacesciencer’s functions
consists of so-called “join” functions. The “join” functions themselves
come from dplyr, a critical dependency for
peacesciencer and the effective engine of {tidyverse}
(which I suggest as a basic workflow tool, and which the user may already be
using). Users who have absolutely no idea what these functions do are
welcome to find more thorough texts about these different types of
joins. Their functionality and terminology have a clear basis in
SQL, a language for managing and manipulating data in relational
databases that first appeared in 1974. My goal here is not to
offer a crash course on all these potential “join” functions, though
helpful visual primers are available for both R and SQL.
Instead, I will offer the basic principles these visual primers are
communicating as they apply to peacesciencer.
Left (Outer) Join
The first type of join is the most important type of join function in
peacesciencer. Indeed, almost every function in this
package that deals with adding variables to a type of data created in
peacesciencer includes it. This is the “left join”
(left_join() in dplyr), alternatively known as the “left outer join”
(one kind of “outer join”) in the SQL context, and is a type of
“mutating join” in the tidyverse context. In plain
English, left_join() assumes two data objects, a “left”
object (x) and a “right” object (y), and
returns all rows from the left object (x) with matching
information in the right object (y) by a set of common
matching keys (or columns in both x and y).
Here is a simple example of how this works in the
peacesciencer context. Assume a simple state-year data
set of the United States (ccode: 2), Canada (ccode: 20), and the
United Kingdom (ccode: 200) for all years from 2016 to 2020. Recreating
this simple kind of data is no problem in R and will serve as our “left
object” (x) for this simple example.
library(tidyverse) # attaches tibble, dplyr, and tidyr, which this example uses

tibble(ccode = c(2, 20, 200)) %>%
  # rowwise() is a great trick for nesting sequences in tibbles
  # This parlor trick, for example, generates state-year data out of raw state
  # data in create_stateyears()
  rowwise() %>%
  # create a sequence as a nested column
  mutate(year = list(seq(2016, 2020))) %>%
  # unnest the column
  unnest(year) -> x
x
#> # A tibble: 15 × 2
#> ccode year
#> <dbl> <int>
#> 1 2 2016
#> 2 2 2017
#> 3 2 2018
#> 4 2 2019
#> 5 2 2020
#> 6 20 2016
#> 7 20 2017
#> 8 20 2018
#> 9 20 2019
#> 10 20 2020
#> 11 200 2016
#> 12 200 2017
#> 13 200 2018
#> 14 200 2019
#> 15 200 2020
Let’s assume we’re building toward the kind of state-year analysis I describe in the manuscript accompanying this package. For example, the canonical civil conflict analysis by Fearon and Laitin (2003) has an outcome that varies by year, but several independent variables that are time-invariant and serve as variables for making state-to-state comparisons in their model of civil war onset (e.g. ethnic fractionalization, religious fractionalization, terrain ruggedness). In a similar manner, we have a basic ranking of the United States, Canada, and the United Kingdom in our case. Minimally, the United States scores “low”, Canada scores “medium”, and the United Kingdom scores “high” on some metric. There is no variation by time in this simple example.
tibble(ccode = c(2, 20, 200),
       ranking = c("low", "medium", "high")) -> y
y
#> # A tibble: 3 × 2
#> ccode ranking
#> <dbl> <chr>
#> 1 2 low
#> 2 20 medium
#> 3 200 high
This is the “right object” (y) that we want to add to
the “left object” that serves as our main data frame. Notice that
x has no variable for the ranking information we want. It
does, however, have matching observations for the state identifiers
corresponding with the Correlates of War state codes for the U.S.,
Canada, and the United Kingdom. The left join (as
left_join()) merges y into x,
returning all rows of x with matching information in
y based on columns they share in common (here: ccode).
# alternatively, as I tend to do it: x %>% left_join(., y)
left_join(x, y)
#> Joining with `by = join_by(ccode)`
#> # A tibble: 15 × 3
#> ccode year ranking
#> <dbl> <int> <chr>
#> 1 2 2016 low
#> 2 2 2017 low
#> 3 2 2018 low
#> 4 2 2019 low
#> 5 2 2020 low
#> 6 20 2016 medium
#> 7 20 2017 medium
#> 8 20 2018 medium
#> 9 20 2019 medium
#> 10 20 2020 medium
#> 11 200 2016 high
#> 12 200 2017 high
#> 13 200 2018 high
#> 14 200 2019 high
#> 15 200 2020 high
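The "Joining with" message in the output shows that left_join() inferred the key from the one column name the two objects share. As a minimal sketch, assuming the same toy x and y (rebuilt here with tidyr's expand_grid() instead of the rowwise() approach, purely so the block runs on its own), the key can be named explicitly and the row count checked afterward:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the toy objects so this block is self-contained.
x <- expand_grid(ccode = c(2, 20, 200), year = 2016:2020)
y <- tibble(ccode = c(2, 20, 200),
            ranking = c("low", "medium", "high"))

# Naming the key silences the "Joining with" message and guards against
# accidentally matching on some other column the two objects happen to share.
merged <- left_join(x, y, by = "ccode")

# The left join should add a column without changing the number of rows.
nrow(merged) == nrow(x)
```

Spelling out the key is optional in an example this small, but it documents the intent of the merge for anyone reading the script later.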
This is obviously a very simple example, but it scales well even if
there is some additional complexity. For example, let’s assume we added
a simple five-year panel of Australia (ccode: 900) to the
“left object” (x). However, we have no corresponding
information about Australia in the “right object” (y). Here
is what the left join would produce under these circumstances.
tibble(ccode = 900,
       year = c(2016:2020)) %>%
  bind_rows(x, .) -> x
x
#> # A tibble: 20 × 2
#> ccode year
#> <dbl> <int>
#> 1 2 2016
#> 2 2 2017
#> 3 2 2018
#> 4 2 2019
#> 5 2 2020
#> 6 20 2016
#> 7 20 2017
#> 8 20 2018
#> 9 20 2019
#> 10 20 2020
#> 11 200 2016
#> 12 200 2017
#> 13 200 2018
#> 14 200 2019
#> 15 200 2020
#> 16 900 2016
#> 17 900 2017
#> 18 900 2018
#> 19 900 2019
#> 20 900 2020
left_join(x, y)
#> Joining with `by = join_by(ccode)`
#> # A tibble: 20 × 3
#> ccode year ranking
#> <dbl> <int> <chr>
#> 1 2 2016 low
#> 2 2 2017 low
#> 3 2 2018 low
#> 4 2 2019 low
#> 5 2 2020 low
#> 6 20 2016 medium
#> 7 20 2017 medium
#> 8 20 2018 medium
#> 9 20 2019 medium
#> 10 20 2020 medium
#> 11 200 2016 high
#> 12 200 2017 high
#> 13 200 2018 high
#> 14 200 2019 high
#> 15 200 2020 high
#> 16 900 2016 NA
#> 17 900 2017 NA
#> 18 900 2018 NA
#> 19 900 2019 NA
#> 20 900 2020 NA
Because we have no ranking for Australia in this simple example, the
left join returns NAs (i.e. missing values) for Australia. The original
number of rows of x under these conditions is unaffected.
What would happen if we had an observation in y that has
no corresponding match in x? For example, let’s assume our
y data also included a ranking for Denmark
(ccode: 390), though Denmark does not appear in
x. Here is what would happen under these circumstances.
tibble(ccode = 390,
       ranking = "high") %>%
  bind_rows(y, .) -> y
y
#> # A tibble: 4 × 2
#> ccode ranking
#> <dbl> <chr>
#> 1 2 low
#> 2 20 medium
#> 3 200 high
#> 4 390 high
left_join(x, y)
#> Joining with `by = join_by(ccode)`
#> # A tibble: 20 × 3
#> ccode year ranking
#> <dbl> <int> <chr>
#> 1 2 2016 low
#> 2 2 2017 low
#> 3 2 2018 low
#> 4 2 2019 low
#> 5 2 2020 low
#> 6 20 2016 medium
#> 7 20 2017 medium
#> 8 20 2018 medium
#> 9 20 2019 medium
#> 10 20 2020 medium
#> 11 200 2016 high
#> 12 200 2017 high
#> 13 200 2018 high
#> 14 200 2019 high
#> 15 200 2020 high
#> 16 900 2016 NA
#> 17 900 2017 NA
#> 18 900 2018 NA
#> 19 900 2019 NA
#> 20 900 2020 NA
Notice the output of this left join is identical to the output above.
Australia is in x, but not in y. Thus, the
rows for Australia are returned but the absence of ranking information
for Australia in y means the variable is NA
for Australia after the merge. Denmark is in y, but not
x. Because the left join returns all rows in x
with matching information in y, the absence of observations
for Denmark in x means there is nowhere for the ranking
information to go in the merge. Thus, Denmark’s ranking is ignored.
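Both kinds of mismatch can be listed before the merge rather than discovered afterward. The following is a minimal sketch, assuming the same toy x and y (rebuilt here so the block runs on its own); the object names x_without_match and y_without_match are hypothetical. It uses anti_join(), the filtering join discussed later in this vignette.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Rebuild the toy objects: Australia (900) is only in x, Denmark (390) only in y.
x <- expand_grid(ccode = c(2, 20, 200, 900), year = 2016:2020)
y <- tibble(ccode = c(2, 20, 200, 390),
            ranking = c("low", "medium", "high", "high"))

# States in x with no ranking in y: these rows get NA after the left join.
x_without_match <- x %>%
  distinct(ccode) %>%
  anti_join(y, by = "ccode")

# States in y that never appear in x: the left join ignores these.
y_without_match <- anti_join(y, x, by = "ccode")

x_without_match
y_without_match
```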
Why the Left Join, in Particular?
An interested user may ask what’s so special about this kind of join
that it appears everywhere in peacesciencer. One reply is
that my use of left_join() is in part a matter of
taste. I could just as well be doing this vignette by reference to the
“right join”, the mirror join to the “left join.” The right join in
dplyr, right_join(x, y), returns all records
from y with matching rows in x by common
columns, though the equivalency would depend on reversing the order of
x and y (i.e. left_join(x, y)
produces the same information as right_join(y, x)). The
arrangement of columns would differ between left_join() and
right_join() in this simple application even if the same
underlying information is there. Ultimately, I tend to think
“left-handed” when it comes to data management and instruct my
students to do the same when I introduce them to data transformation in
R. I like the intuition, especially in the pipe-based workflow, to start
with a master data object at the top of the pipe and keep it “left” as I
add information to it. It has the benefit of keeping the units of
analysis (e.g. state-years in this simple setup) as the first columns
the user sees as well. This is my preferred approach to data
transformation and left_join() recurs in
peacesciencer as a result.
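The equivalence between the two joins can be checked directly. This is a minimal sketch, assuming the toy x and y from the earlier examples (rebuilt here so the block runs on its own); from_left, from_right, and same_info are hypothetical names used only for illustration.

```r
library(dplyr)
library(tidyr)
library(tibble)

x <- expand_grid(ccode = c(2, 20, 200, 900), year = 2016:2020)
y <- tibble(ccode = c(2, 20, 200, 390),
            ranking = c("low", "medium", "high", "high"))

from_left  <- left_join(x, y, by = "ccode")
from_right <- right_join(y, x, by = "ccode")

# The columns arrive in a different order...
names(from_left)
names(from_right)

# ...so put both results in the same column and row order before comparing.
a <- from_left %>% arrange(ccode, year)
b <- from_right %>% select(ccode, year, ranking) %>% arrange(ccode, year)

same_info <- identical(a$ccode, b$ccode) &&
  identical(a$year, b$year) &&
  identical(a$ranking, b$ranking)
same_info
```

In the right join, the unit-of-analysis columns no longer lead the data frame, which is the practical reason for preferring the left join here.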
Beyond that matter of taste, the left join is everywhere in
peacesciencer because the project works hard to
recreate the appropriate universe of cases of interest to the user and
allow the user to add stuff to it as they see fit.
create_stateyears() will create the entire universe of
state-years from 1816 to the present for a state-year analysis.
create_dyadyears() will create the entire universe of
dyad-years from 1816 to the present for a dyad-year analysis. The logic,
as it is implemented in peacesciencer’s multiple
functions, is that the type of data the user wants to create has been created
for them. The user does not want to expand the data any further than
that, though the user may want to do something like reduce the full
universe of 1816-2020 state-years to just 1946-2010. However, this is a
universe partially discarded, not a universe that has been augmented or
expanded.
With that in mind, every function’s use of the left join assumes the data object it receives represents the full universe of cases of interest to the researcher. The left join is just adding information to it, based on matching information in one of its many data sets. When done carefully, the left join is a dutiful way of adding information to a data set without changing the number of rows of the original data set. The number of columns will obviously expand, but the number of rows is unaffected.
Potential Problems of the Left Join
“When done carefully” is doing some heavy lifting in that last sentence. So, let me explain some situations where the left join will produce problems for the researcher (even if the join itself is doing what it is supposed to do from an operational standpoint).
The first is less of a problem, at least as I have implemented it in
peacesciencer, and more of a caution. In the above
example, our panel consists of just the U.S., Canada, the United
Kingdom, and Australia. We happen to have a ranking for Denmark, but
Denmark wasn’t in our panel of (effectively, exclusively) Anglophone
states. Therefore, no row is created for Denmark. If it were that
important that the left join create those rows for Denmark, we should
have had it in the first place (i.e. a panel for Denmark should have
been in x before the merge). In this case, the left join is
behaving as it should. We should have had Denmark in the panel before
trying to match information to it.
peacesciencer circumvents this issue by creating
universal data (e.g. all state-years, all dyad-years, all available
leader-years) that the user is free to subset as they see fit. Users
should run one of the “create” functions
(e.g. create_stateyears(), create_dyadyears())
at the top of their script before adding information to it because the
left join, as implemented everywhere in this package, is building in an
assumption that the universe of cases of interest to the user is
represented in the “left object” for a left outer join. Basically, do
not expect the left join to create new rows in x in a
situation where there is a state represented in y but not
in x. It will not. This type of join assumes the universe
of cases of interest to the researcher already appears in the “left
object.”
The second situation is a bigger problem. Sometimes, often when bouncing
between information denominated in Correlates of War states and
Gleditsch-Ward states, there is an unwanted duplicate observation in
the data frame to be merged into the primary data of interest to the
user. Let’s go back to our simple example of x and
y here. Everything here performs nicely, though Australia
(in x) has no ranking and Denmark (in y) is
not in our panel of state-years because it wasn’t part of the original
universe of cases of interest to us.
x
#> # A tibble: 20 × 2
#> ccode year
#> <dbl> <int>
#> 1 2 2016
#> 2 2 2017
#> 3 2 2018
#> 4 2 2019
#> 5 2 2020
#> 6 20 2016
#> 7 20 2017
#> 8 20 2018
#> 9 20 2019
#> 10 20 2020
#> 11 200 2016
#> 12 200 2017
#> 13 200 2018
#> 14 200 2019
#> 15 200 2020
#> 16 900 2016
#> 17 900 2017
#> 18 900 2018
#> 19 900 2019
#> 20 900 2020
y
#> # A tibble: 4 × 2
#> ccode ranking
#> <dbl> <chr>
#> 1 2 low
#> 2 20 medium
#> 3 200 high
#> 4 390 high
Let’s assume, however, we mistakenly entered the United Kingdom twice
into a copy of y that we will call y2. We know these data are supposed
to be simple state-level rankings. Each state is supposed to be in there
just once. The United Kingdom appears in there twice.
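The code that builds y2 is not shown here. A minimal reconstruction, assuming y2 is nothing more than y with the United Kingdom's row entered a second time (the exact construction in the original may differ), could look like this. The object name dupes is hypothetical.

```r
library(dplyr)
library(tibble)

# y as it stands at this point in the example (rebuilt so the block runs alone).
y <- tibble(ccode = c(2, 20, 200, 390),
            ranking = c("low", "medium", "high", "high"))

# Assumed reconstruction: append a second, identical UK row to y.
y2 <- y %>%
  bind_rows(tibble(ccode = 200, ranking = "high")) %>%
  arrange(ccode)

# Counting rows per state code exposes the duplicate before any merge.
dupes <- y2 %>%
  count(ccode) %>%
  filter(n > 1)

y2
dupes
```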
If we were to left join y2 into x, we get
an unwelcome result. The United Kingdom is duplicated for all yearly
observations.
left_join(x, y2) %>% data.frame
#> Joining with `by = join_by(ccode)`
#> Warning in left_join(x, y2): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 11 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
#> ccode year ranking
#> 1 2 2016 low
#> 2 2 2017 low
#> 3 2 2018 low
#> 4 2 2019 low
#> 5 2 2020 low
#> 6 20 2016 medium
#> 7 20 2017 medium
#> 8 20 2018 medium
#> 9 20 2019 medium
#> 10 20 2020 medium
#> 11 200 2016 high
#> 12 200 2016 high
#> 13 200 2017 high
#> 14 200 2017 high
#> 15 200 2018 high
#> 16 200 2018 high
#> 17 200 2019 high
#> 18 200 2019 high
#> 19 200 2020 high
#> 20 200 2020 high
#> 21 900 2016 <NA>
#> 22 900 2017 <NA>
#> 23 900 2018 <NA>
#> 24 900 2019 <NA>
#> 25 900 2020 <NA>
It doesn’t matter that the duplicate ranking in y2 for
the UK was the same. It would be messier, sure, if the ranking were
different for the duplicate observation, but it matters more here that
it was duplicated. In a panel like this, a user who is not careful will
end up overweighting the observations that are duplicated. In a
simple example like this, after subsetting to just complete cases
(i.e. dropping Australia, which has no ranking), the UK is 50% of all
observations despite the fact it should be just a third of observations.
That’s not ideal for a researcher.
peacesciencer goes above and beyond to make sure this doesn’t happen in the data it creates. Functions are aggressively tested to make sure nothing duplicates, and various parlor tricks (prominently group-by slices) are used internally to cull those duplicate observations. The release of a function that makes prominent use of the left join is done with the assurance it doesn’t create a duplicate. No matter, this is the biggest peril of the left join for a researcher who may want to duplicate what peacesciencer does on their own. Always inspect the data you merge, and the output.
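For a researcher doing this kind of merge by hand, a few defensive habits catch the problem early. What follows is a minimal sketch and not peacesciencer's internal code. It assumes the toy x from above and a y2 with the United Kingdom entered twice (both rebuilt here), and it assumes dplyr 1.1.0 or later for the relationship argument. The names y_clean, merged, and failed are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

x <- expand_grid(ccode = c(2, 20, 200, 900), year = 2016:2020)
y2 <- tibble(ccode = c(2, 20, 200, 200, 390),
             ranking = c("low", "medium", "high", "high", "high"))

# 1. Cull duplicates with a group-by slice: keep the first row per state.
y_clean <- y2 %>%
  group_by(ccode) %>%
  slice(1) %>%
  ungroup()

# 2. Declare the expected relationship so the join errors if any row of x
#    matches more than one row of y.
merged <- left_join(x, y_clean, by = "ccode", relationship = "many-to-one")

# 3. Confirm the left join did not change the number of rows.
stopifnot(nrow(merged) == nrow(x))

# The same call on the uncleaned y2 fails loudly instead of duplicating rows.
failed <- tryCatch(
  {
    left_join(x, y2, by = "ccode", relationship = "many-to-one")
    FALSE
  },
  error = function(e) TRUE
)
failed
```

Keeping the first row per group is only safe when the duplicates are identical, as they are here. If the duplicated rows disagreed, the researcher would need to decide which one is correct before slicing.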
Semi-Join
The “semi-join” (semi_join() in dplyr)
returns all rows from the left object (x) that have
matching values in the right object (y). It is a type of
“filtering join”, which affects the observations and not the variables.
It appears just twice in peacesciencer, serving as a
final join in create_leaderdays() and
create_leaderyears(). In both cases, it serves as a means
of standardizing leader data (denominated in the Gleditsch-Ward system,
if not necessarily Gleditsch-Ward system dates) to the Correlates of War
or Gleditsch-Ward system.
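To make "affects the observations and not the variables" concrete, here is a minimal sketch on a toy state-year panel. The objects are hypothetical and built only for this illustration, and y deliberately lists Canada (ccode: 20) twice.

```r
library(dplyr)
library(tidyr)
library(tibble)

x <- expand_grid(ccode = c(2, 20, 900), year = 2016:2017)
y <- tibble(ccode = c(2, 20, 20, 390),
            ranking = c("low", "medium", "medium", "high"))

# Keep the rows of x whose ccode appears in y. No column from y is added,
# and the duplicate entry for 20 in y does not duplicate any row of x.
kept <- semi_join(x, y, by = "ccode")

# For a single key, this amounts to filtering on key membership.
by_filter <- filter(x, ccode %in% y$ccode)

names(kept)
nrow(kept)
```

This behavior is what distinguishes a filtering join from a mutating join like left_join(), where the duplicate in y would have mattered.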
Here is a basic example of what a semi-join is doing in this context, with an illustration of the kind of difficulties that manifest in standardizing Archigos’ leader data to the Correlates of War state system. Assume this simple state system that has just two states, “Lincoln” and “Morrill”, over a two-week period at the start of 1975 (Jan. 1, 1975 to Jan. 14, 1975). In this simple system, “Lincoln” is a state for the full two-week period (Jan. 1-Jan. 14) whereas “Morrill” is a state for just the first seven days (Jan. 1-Jan. 7) because, let’s say, “Lincoln” occupied “Morrill” and ended its statehood. We also happen to have some leader data for these two states. Over this two-week period, our leader data suggest “Lincoln” had just one continuous leader, “Archie”, whereas “Morrill” had three. “Brian” was the leader of “Morrill” before he retired from office and was replaced by “Cornelius.” However, “Cornelius” was deposed when “Lincoln” invaded “Morrill” and was replaced by a puppet head of state, “Pete.” Our data look like this.
library(lubridate) # for make_date()

tibble(code = c("Lincoln", "Morrill"),
       stdate = make_date(1975, 01, 01),
       enddate = c(make_date(1975, 01, 14),
                   make_date(1975, 01, 07))) -> state_system
state_system
#> # A tibble: 2 × 3
#> code stdate enddate
#> <chr> <date> <date>
#> 1 Lincoln 1975-01-01 1975-01-14
#> 2 Morrill 1975-01-01 1975-01-07
tibble(code = c("Lincoln", "Morrill", "Morrill", "Morrill"),
       leader = c("Archie", "Brian", "Cornelius", "Pete"),
       stdate = c(make_date(1975, 01, 01), make_date(1975, 01, 01),
                  make_date(1975, 01, 04), make_date(1975, 01, 08)),
       enddate = c(make_date(1975, 01, 14), make_date(1975, 01, 04),
                   make_date(1975, 01, 08), make_date(1975, 01, 14))) -> leaders
leaders
#> # A tibble: 4 × 4
#> code leader stdate enddate
#> <chr> <chr> <date> <date>
#> 1 Lincoln Archie 1975-01-01 1975-01-14
#> 2 Morrill Brian 1975-01-01 1975-01-04
#> 3 Morrill Cornelius 1975-01-04 1975-01-08
#> 4 Morrill Pete 1975-01-08 1975-01-14
We can use some basic rowwise() transformation to recast
these data as daily, resulting in state-day data and leader-day
data.
state_system %>%
  rowwise() %>%
  mutate(date = list(seq(stdate, enddate, by = '1 day'))) %>%
  unnest(date) %>%
  select(code, date) -> state_days
state_days %>% data.frame
#> code date
#> 1 Lincoln 1975-01-01
#> 2 Lincoln 1975-01-02
#> 3 Lincoln 1975-01-03
#> 4 Lincoln 1975-01-04
#> 5 Lincoln 1975-01-05
#> 6 Lincoln 1975-01-06
#> 7 Lincoln 1975-01-07
#> 8 Lincoln 1975-01-08
#> 9 Lincoln 1975-01-09
#> 10 Lincoln 1975-01-10
#> 11 Lincoln 1975-01-11
#> 12 Lincoln 1975-01-12
#> 13 Lincoln 1975-01-13
#> 14 Lincoln 1975-01-14
#> 15 Morrill 1975-01-01
#> 16 Morrill 1975-01-02
#> 17 Morrill 1975-01-03
#> 18 Morrill 1975-01-04
#> 19 Morrill 1975-01-05
#> 20 Morrill 1975-01-06
#> 21 Morrill 1975-01-07
leaders %>%
  rowwise() %>%
  mutate(date = list(seq(stdate, enddate, by = '1 day'))) %>%
  unnest(date) %>%
  select(code, leader, date) -> leader_days
leader_days %>% data.frame
#> code leader date
#> 1 Lincoln Archie 1975-01-01
#> 2 Lincoln Archie 1975-01-02
#> 3 Lincoln Archie 1975-01-03
#> 4 Lincoln Archie 1975-01-04
#> 5 Lincoln Archie 1975-01-05
#> 6 Lincoln Archie 1975-01-06
#> 7 Lincoln Archie 1975-01-07
#> 8 Lincoln Archie 1975-01-08
#> 9 Lincoln Archie 1975-01-09
#> 10 Lincoln Archie 1975-01-10
#> 11 Lincoln Archie 1975-01-11
#> 12 Lincoln Archie 1975-01-12
#> 13 Lincoln Archie 1975-01-13
#> 14 Lincoln Archie 1975-01-14
#> 15 Morrill Brian 1975-01-01
#> 16 Morrill Brian 1975-01-02
#> 17 Morrill Brian 1975-01-03
#> 18 Morrill Brian 1975-01-04
#> 19 Morrill Cornelius 1975-01-04
#> 20 Morrill Cornelius 1975-01-05
#> 21 Morrill Cornelius 1975-01-06
#> 22 Morrill Cornelius 1975-01-07
#> 23 Morrill Cornelius 1975-01-08
#> 24 Morrill Pete 1975-01-08
#> 25 Morrill Pete 1975-01-09
#> 26 Morrill Pete 1975-01-10
#> 27 Morrill Pete 1975-01-11
#> 28 Morrill Pete 1975-01-12
#> 29 Morrill Pete 1975-01-13
#> 30 Morrill Pete 1975-01-14
If we wanted to standardize these leader-day data to the state system data, we would semi-join the leader-day data (the left object) with the state-day object (the right object), returning just the leader-day data with valid days in the state system data.
leader_days %>%
  semi_join(., state_days) %>%
  data.frame
#> Joining with `by = join_by(code, date)`
#> code leader date
#> 1 Lincoln Archie 1975-01-01
#> 2 Lincoln Archie 1975-01-02
#> 3 Lincoln Archie 1975-01-03
#> 4 Lincoln Archie 1975-01-04
#> 5 Lincoln Archie 1975-01-05
#> 6 Lincoln Archie 1975-01-06
#> 7 Lincoln Archie 1975-01-07
#> 8 Lincoln Archie 1975-01-08
#> 9 Lincoln Archie 1975-01-09
#> 10 Lincoln Archie 1975-01-10
#> 11 Lincoln Archie 1975-01-11
#> 12 Lincoln Archie 1975-01-12
#> 13 Lincoln Archie 1975-01-13
#> 14 Lincoln Archie 1975-01-14
#> 15 Morrill Brian 1975-01-01
#> 16 Morrill Brian 1975-01-02
#> 17 Morrill Brian 1975-01-03
#> 18 Morrill Brian 1975-01-04
#> 19 Morrill Cornelius 1975-01-04
#> 20 Morrill Cornelius 1975-01-05
#> 21 Morrill Cornelius 1975-01-06
#> 22 Morrill Cornelius 1975-01-07
Notice that Pete drops from these data because, in this simple
example, Pete was a puppet head of state installed by Archie when
“Lincoln” invaded and occupied “Morrill”. The semi-join here is simply
standardizing the leader data to the state system data, which is
effectively what’s happening with the semi-joins in
create_leaderdays() (and its aggregation function:
create_leaderyears()).
Anti-Join
The anti-join is another type of filtering join, returning all rows
from the left object (x) without a match in the
right object (y). This type of join appears just once in
peacesciencer. Prominently, peacesciencer
prepares and presents two data sets in this
package (false_cow_dyads and false_gw_dyads) that represent directed
dyad-years in the Correlates of War and Gleditsch-Ward systems in which
both states were active in the same year, but never at the same time in
that year.
Here are those dyads for context.
false_cow_dyads
#> # A tibble: 60 × 4
#> ccode1 ccode2 year in_ps
#> <dbl> <dbl> <int> <dbl>
#> 1 115 817 1975 1
#> 2 210 255 1945 1
#> 3 211 255 1945 1
#> 4 223 678 1990 1
#> 5 223 680 1990 1
#> 6 255 210 1945 1
#> 7 255 211 1945 1
#> 8 255 260 1990 1
#> 9 255 265 1990 1
#> 10 255 290 1945 1
#> # ℹ 50 more rows
false_gw_dyads
#> # A tibble: 38 × 4
#> gwcode1 gwcode2 year in_ps
#> <dbl> <dbl> <int> <dbl>
#> 1 99 100 1830 1
#> 2 99 211 1830 1
#> 3 100 99 1830 1
#> 4 100 615 1830 1
#> 5 115 817 1975 1
#> 6 211 99 1830 1
#> 7 211 615 1830 1
#> 8 255 850 1945 1
#> 9 300 305 1918 1
#> 10 300 345 1918 1
#> # ℹ 28 more rows
These were created by two scripts that, for each year in the respective state system data, create every possible daily dyadic pairing and truncate the dyads to just those that had at least one day of overlap. This is a computationally demanding procedure compared to what peacesciencer does (which creates every possible dyadic pair in a given year, given the state system data supplied to it). However, peacesciencer’s faster approach creates the possibility of some false dyads in a given year that showed no overlap.
Consider the case of Suriname (115) and the Republic of Vietnam (817) in 1975 as illustrative here.
check_both <- function(x) {
  # Gleditsch-Ward rows for the requested state codes
  gw_states %>%
    mutate(data = "G-W") %>%
    filter(gwcode %in% x) -> gwrows

  # Correlates of War rows, with start and end dates assembled from their parts
  cow_states %>%
    mutate(startdate = ymd(paste0(styear, "/", stmonth, "/", stday)),
           enddate = ymd(paste0(endyear, "/", endmonth, "/", endday))) %>%
    select(stateabb:statenme, startdate, enddate) %>%
    mutate(data = "CoW") %>%
    rename(statename = statenme) %>%
    filter(ccode %in% x) -> cowrows

  # Stack both systems for a side-by-side comparison
  dat <- bind_rows(gwrows, cowrows) %>%
    select(gwcode, ccode, stateabb, everything())

  return(dat)
}
check_both(c(115, 817))
#> # A tibble: 4 × 7
#> gwcode ccode stateabb statename startdate enddate data
#> <dbl> <dbl> <chr> <chr> <date> <date> <chr>
#> 1 115 NA SUR Surinam 1975-11-25 2017-12-31 G-W
#> 2 817 NA RVN Vietnam, Republic of 1954-05-01 1975-04-30 G-W
#> 3 NA 115 SUR Suriname 1975-11-25 2016-12-31 CoW
#> 4 NA 817 RVN Republic of Vietnam 1954-06-04 1975-04-30 CoW
Notice that Suriname and the Republic of Vietnam were both active in
1975. Suriname appears on Nov. 25, 1975 whereas the Republic of Vietnam
exits on April 30, 1975. However, there is no daily overlap between the
two because they did not exist at any point on the same day in 1975.
These are false dyads. anti_join() is used in the create_dyadyears()
function to remove these observations before presenting them to the user.
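The difference between "active in the same year" and "active on the same day" comes down to a date-range comparison. Here is a minimal sketch in base R, using the Correlates of War dates shown in the table above. The helper overlaps_on_some_day() is hypothetical and written only for this illustration. It is not a peacesciencer function.

```r
# Hypothetical helper: TRUE if two date ranges share at least one day.
# Two ranges overlap when the later start falls on or before the earlier end.
overlaps_on_some_day <- function(start1, end1, start2, end2) {
  max(start1, start2) <= min(end1, end2)
}

# Correlates of War system dates for Suriname (115) and the
# Republic of Vietnam (817), taken from the table above.
sur_start <- as.Date("1975-11-25")
sur_end   <- as.Date("2016-12-31")
rvn_start <- as.Date("1954-06-04")
rvn_end   <- as.Date("1975-04-30")

# Suriname's entry and the Republic of Vietnam's exit fall in the same year...
same_year <- format(sur_start, "%Y") == format(rvn_end, "%Y")

# ...but the two states never coexist on any single day.
same_day <- overlaps_on_some_day(sur_start, sur_end, rvn_start, rvn_end)

same_year
same_day
```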
Here is a simple example of what an anti-join is doing with these examples in mind.
valid_dyads <- tibble(ccode1 = c(2, 20, 200),
                      ccode2 = c(20, 200, 900),
                      year = c(2016, 2017, 2018))
valid_dyads %>%
  bind_rows(., false_cow_dyads %>% select(ccode1:year)) -> valid_and_invalid
valid_and_invalid
#> # A tibble: 63 × 3
#> ccode1 ccode2 year
#> <dbl> <dbl> <dbl>
#> 1 2 20 2016
#> 2 20 200 2017
#> 3 200 900 2018
#> 4 115 817 1975
#> 5 210 255 1945
#> 6 211 255 1945
#> 7 223 678 1990
#> 8 223 680 1990
#> 9 255 210 1945
#> 10 255 211 1945
#> # ℹ 53 more rows
valid_and_invalid %>%
  # remove those invalid dyad-years
  anti_join(., false_cow_dyads)
#> Joining with `by = join_by(ccode1, ccode2, year)`
#> # A tibble: 3 × 3
#> ccode1 ccode2 year
#> <dbl> <dbl> <dbl>
#> 1 2 20 2016
#> 2 20 200 2017
#> 3 200 900 2018