A stock photo of assorted national flags (Getty Images)

My graduate studies program director asked me to teach an independent study for a graduate student this semester. The goal is to better train the student for their research agenda beyond what I could plausibly teach them in a given semester.[1] Toward that end, I’m going to offer most (if not all) of the independent study sessions as posts on my blog. This should help the student and possibly help others who stumble onto my website. Going forward, I’m probably just going to copy-paste this introduction for future posts for this independent study.

The particular student is pursuing a research program in international political economy. Substantively, much of what they want to do is outside my wheelhouse. However, I can offer some things to help the student with their research. The first lesson will be about various state (country) classification systems.

Here’s a table of contents for what follows.

  1. The Issue: There Are So Many Different Classification Systems!
  2. Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)
  3. Make One Classification System a “Master”, and Don’t Use the Country Name
  4. Use R to Create a Panel of States (and States over Time)

The Issue: There Are So Many Different Classification Systems!

It should not shock a graduate student in political science/policy analysis to learn that there is no universal standard for state classification. Indeed, various data sources and agencies will have varying definitions of what territorial unit counts as a state for classification purposes. Each data source/agency will also have a different coding scheme as well.

Take, for example, the following classification systems. The first, Correlates of War (CoW), leans on integers that range from 2 (the United States) to 990 (Samoa) to code states from 1816 to 2016. The second, the Gleditsch-Ward system, is a slight derivation of the CoW system. The overlap is substantial and the numerical range is effectively the same, but important distinctions emerge because Gleditsch-Ward interpret independent states differently. The third is the set of two-character and three-character codes provided by the Organisation Internationale de Normalisation (ISO) 3166 Maintenance Agency, which Americans will at least recognize as having tight integration with the American National Standards Institute as well as broad use elsewhere. The fourth is the United Nations’ M49 classification system. The fifth is the Geopolitical Entities, Names, and Codes (GENC) Standard (in both two-character and three-character form), which provides names and codes for U.S.-recognized entities and subdivisions. GENC supplanted the Federal Information Processing Standard (FIPS) about 10 years ago for this purpose. To round things out, we’ll include the Eurostat classification system (which greatly resembles ISO’s two-character code), the FIPS codes (which also look a lot like ISO’s two-character code), and the World Bank code (which is very similar to but slightly incompatible with ISO’s three-character code).

Here is how a few territorial units are coded, selected on whether their English country name starts with “T” and as these codes appear in the {countrycode} package.

Select Territorial Units and Their Various Codes

Country Name            CoW Code  Gleditsch-Ward Code  ISO (2)  ISO (3)  UN M49  GENC (2)  GENC (3)  Eurostat  FIPS  World Bank
Taiwan                  713       713                  TW       TWN              TW        TWN       TW        TW    TWN
Tajikistan              702       702                  TJ       TJK      762     TJ        TJK       TJ        TI    TJK
Tanzania                510       510                  TZ       TZA      834     TZ        TZA       TZ        TZ    TZA
Thailand                800       800                  TH       THA      764     TH        THA       TH        TH    THA
Timor-Leste             860       860                  TL       TLS      626     TL        TLS       TL        TT    TLS
Togo                    461       461                  TG       TGO      768     TG        TGO       TG        TO    TGO
Tokelau                                                TK       TKL      772     TK        TKL       TK        TL
Tonga                   955                            TO       TON      776     TO        TON       TO        TN    TON
Trinidad & Tobago       52        52                   TT       TTO      780     TT        TTO       TT        TD    TTO
Tunisia                 616       616                  TN       TUN      788     TN        TUN       TN        TS    TUN
Turkey                  640       640                  TR       TUR      792     TR        TUR       TR        TU    TUR
Turkmenistan            701       701                  TM       TKM      795     TM        TKM       TM        TX    TKM
Turks & Caicos Islands                                 TC       TCA      796     TC        TCA       TC        TK    TCA
Tuscany                 337
Tuvalu                  947                            TV       TUV      798     TV        TUV       TV        TV    TUV
Two Sicilies            329

It can seem daunting to see so many differences among these classification systems. With that in mind, I recommend that a student (in particular, my student this semester) do the following.

Identify a Temporal Domain for a Cross-National Analysis (Because State Codes Change Over Time)

My student is interested in a cross-national analysis of a group of states—regionally or globally, I can’t yet tell—with respect to a host of financial indicators. The extent to which the analysis involves financial indicators means the temporal domain of the analysis is not going to be that long, all things considered. However, my student is going to want to make explicit the temporal domain first because that will have some implications for state classification.

Namely, a state may undergo a massive transformation at some point in the data. Consider an analysis that leans on the full domain of data made available by the World Bank. World Bank data (e.g. GDP) are generally available as early as 1960 and may, in some cases, go to a very recently concluded calendar year (e.g. 2019, since 2020 just ended). If that’s the full domain, the student will want to be mindful of some major events that have important implications for state classification.

Consider the most obvious case here: the disintegration of the Soviet Union. Different classification systems code the disintegration of the Soviet Union differently.

  • CoW, Gleditsch-Ward: CoW and Gleditsch-Ward code the creation of new states that followed in effectively the same way. Both understand the Soviet Union as effectively dominated by Russia, which precedes and succeeds the Soviet Union with the same code the Soviet Union had (365). Moldova (359), Estonia (366), Latvia (367), Lithuania (368), Ukraine (369), Belarus (370), Armenia (371), Georgia (372), Azerbaijan (373), Turkmenistan (701), Tajikistan (702), Kyrgyzstan (703), Uzbekistan (704), and Kazakhstan (705) emerge as independent states in 1991.
  • UN M49: Per Wikipedia, the Soviet Union had a UN M49 code of 810. The disintegration of the Soviet Union creates new codes starting in 1991 for Armenia (051), Azerbaijan (031), Georgia (268), Kazakhstan (398), Kyrgyzstan (417), Tajikistan (762), Turkmenistan (795), Uzbekistan (860), Estonia (233), Latvia (428), Lithuania (440), Belarus (112), Moldova (498), Ukraine (804), and Russia (643). Importantly, the UN classification system sees the Russian Federation as a new entity entirely, and not just the dominant component of the Soviet Union.
  • ISO: ISO does not readily advertise a temporal consideration to its classification scheme. Some digging identifies an “exceptional reservation” for the Soviet Union as SU for the two-character code and SUN for the three-character code. The Russian Federation is RU and RUS, respectively. Whereas CoW, Gleditsch-Ward, and the UN M49 classifications end the Soviet Union in 1991, ISO appears to only note this code emerges in 2008 and is “transitionally reserved from September 1992.”

For these four systems, CoW and Gleditsch-Ward are in effective agreement. There might be a slight difference among the days, but not the years nor the codes. UN M49 treats Russia as separate from the Soviet Union, in contrast with CoW and Gleditsch-Ward, but is in agreement about the year of the change. ISO treats Russia as separate from the Soviet Union, in agreement with UN M49, but the year of the change is different. Different systems, different coding procedures, different results.
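To see how these systems line up for a single case, one can ask {countrycode} for several codes for the same country. What follows is a minimal sketch and not part of the original workflow. The helper name codes_for() is hypothetical. The lookup is also cross-sectional: it returns the codes {countrycode} currently associates with a country name and says nothing about when a code began or ended, which is exactly why the temporal domain has to be settled separately.

```r
library(countrycode)

# Hypothetical helper: look up one country name in several classification systems.
# "cown" = CoW numeric, "gwn" = Gleditsch-Ward numeric,
# "iso3c" = ISO three-character, "un" = UN M49.
codes_for <- function(country, systems = c("cown", "gwn", "iso3c", "un")) {
  sapply(systems, function(s) {
    # as.character() lets numeric and character codes share one vector
    as.character(countrycode(country, origin = "country.name", destination = s))
  })
}

russia_codes <- codes_for("Russia")
russia_codes
```

The result is a named character vector, one element per system, which makes it easy to eyeball where the systems agree and where they do not.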

This is just the biggest case. However, there are other major events that lead to divergences in classification systems. Among them: the unification of Vietnam, the unification of Yemen, the Ethiopian Civil War (and creation of Eritrea), the unification of Germany, and—another biggie—the disintegration of Yugoslavia.

I mention this only to note that if the temporal domain is something like 2000 to 2019, there won’t be too many issues (other than some slightly different interpretations of the split between Serbia and Montenegro around 2006). If you want the full enchilada of a temporal domain (the Correlates of War domain from 1816 to the present), there will be plenty of peculiarities/oddities in the classification system you choose that are worth knowing, especially to the extent that you’re going to be merging in data from multiple sources.

No matter, take inventory of the temporal domain you want first. State codes change over time. You’ll want to take stock of what headaches you can expect in your travels.

Make One Classification System a “Master”, and Don’t Use the Country Name

Vincent Arel-Bundock’s {countrycode} package—which I’ll discuss later—is going to be useful for getting different classification systems to integrate with each other. However, my student (and the reader) should be hesitant to treat {countrycode} as magic or to use it uncritically. Namely, my student and the reader should treat one classification system as a “master” system for the particular project.

The system that the student/reader makes the “master” system is at their discretion. However, the master system should probably be the system that emerges as a center of gravity for the particular project. For example, I do a lot of research on inter-state conflict across time and space. The bulk of the data I use is in the CoW ecosystem. Naturally, CoW’s state system membership is ultimately my “master” system. It integrates perfectly with other components of the CoW data ecosystem (e.g. trade, material capabilities). One data source I integrate into these projects—the Polity regime type data—has a different classification system. When that arises, I standardize—as well as I can—the Polity system codes to the CoW codes and integrate into my data based on the matching CoW codes. Again, {countrycode} is wonderful for this purpose (more on that later), but it is not magic and there are always going to be some cleanup issues to address in the process. But, it’s incumbent on me, in my case, to treat the CoW system as a master system because it’s the center of gravity for what I’m doing. It ultimately makes my job easier.

A student doing a lot of cross-national financial analyses will probably lean on the ISO system as the master system. Namely, ISO classification is everywhere and prominently used in International Monetary Fund and World Bank data. I believe the Penn World Table also uses the ISO system for its data.

One caution, though. The student/reader should not treat the English country name as the master system. A person who does this will be flagging discrepancies between a lot of countries/states, like “Bahamas, The”/“Bahamas”, “Brunei”/“Brunei Darussalam”, “Burma”/“Myanmar”, “Congo (Brazzaville)”/“Congo”/“Republic of Congo” and many, many more. To be fair, retaining country names in the data frame is going to be useful for diagnostic purposes, but it should not ever be the master system for classification.

Use a code, not a proper noun.
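Here is a toy illustration of that point in base R. The two data frames are made up for the example (the names source_a and source_b and the values in x and y are placeholders, not real data). They describe the same three countries, but each source spells the country names its own way.

```r
# Two toy data sources covering the same three countries.
# The x and y values are placeholders, not real data.
source_a <- data.frame(
  iso3c = c("BHS", "BRN", "MMR"),
  name  = c("Bahamas, The", "Brunei", "Burma"),
  x     = c(1, 2, 3)
)
source_b <- data.frame(
  iso3c = c("BHS", "BRN", "MMR"),
  name  = c("Bahamas", "Brunei Darussalam", "Myanmar"),
  y     = c(10, 20, 30)
)

# Merging on the English name finds no matching rows.
by_name <- merge(source_a, source_b, by = "name")
nrow(by_name)

# Merging on the code matches every row; both names survive for debugging.
by_code <- merge(source_a, source_b, by = "iso3c", suffixes = c("_a", "_b"))
nrow(by_code)
```

Keeping name_a and name_b side by side after the merge is the "use it for debugging" part: a quick glance confirms the codes paired up the countries you expected.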

Use R to Create a Panel of States (and States over Time)

The remainder of this post will advise the student on how to use a few lines in R and some R packages to generate a panel of states (and states over time). First, here are the R packages we’ll be using.

library(tidyverse) # for all things workflow
library(countrycode) # for integration among different classification systems
library(peacesciencer) # my R package for peace science stuff
library(ISOcodes) # for ISO and UN M 49 codes

I do want the student/reader to notice one thing I’m doing here. Namely, I have an underlying code and a country name alongside it as well. Don’t use the country name for classification purposes, but do use it for debugging purposes. A reader may get fluent in CoW codes or ISO codes, but, in the event of a matching issue, sometimes it’s good to see the full country name.

Create a State-Year Panel of CoW States

This comes pre-processed in my {peacesciencer} package. create_stateyears() defaults to returning CoW state system members for all available years from 1816 to the most recently concluded calendar year.

create_stateyears()
#> # A tibble: 16,731 x 3
#>    ccode statenme                  year
#>    <dbl> <chr>                    <int>
#>  1     2 United States of America  1816
#>  2     2 United States of America  1817
#>  3     2 United States of America  1818
#>  4     2 United States of America  1819
#>  5     2 United States of America  1820
#>  6     2 United States of America  1821
#>  7     2 United States of America  1822
#>  8     2 United States of America  1823
#>  9     2 United States of America  1824
#> 10     2 United States of America  1825
#> # … with 16,721 more rows
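If the project's temporal domain is narrower than the full CoW domain, a filter on the year column trims the panel. This is a small sketch that assumes a domain of 1960 to 2019 purely for illustration; the object name cow_panel is mine.

```r
library(dplyr)
library(peacesciencer)

# Keep only the CoW state-years inside an assumed 1960-2019 temporal domain.
cow_panel <- create_stateyears() %>%
  filter(year >= 1960, year <= 2019)

# Confirm the panel covers only the years we asked for.
range(cow_panel$year)
```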

Create a State-Year Panel of Gleditsch-Ward states

create_stateyears() can do the same for Gleditsch-Ward states, but requires the user to specify they want states from the Gleditsch-Ward system.

create_stateyears(system="gw")
#> # A tibble: 18,289 x 3
#>    gwcode statename                 year
#>     <dbl> <chr>                    <int>
#>  1      2 United States of America  1816
#>  2      2 United States of America  1817
#>  3      2 United States of America  1818
#>  4      2 United States of America  1819
#>  5      2 United States of America  1820
#>  6      2 United States of America  1821
#>  7      2 United States of America  1822
#>  8      2 United States of America  1823
#>  9      2 United States of America  1824
#> 10      2 United States of America  1825
#> # … with 18,279 more rows

Create a Panel of ISO Codes

ISO codes are ubiquitous in economic data. I do have some misgivings about using {countrycode} to create a panel of countries, even for the ISO codes. Recall my concern that ISO codes are not very transparent about when (or even if) a code changes at a particular point in time. No matter, the {ISOcodes} package has this information.

Recall my earlier plea, though: pick one system as a “master” system, even among ISO codes. I’m partial to the three-character ISO codes so I’ll use that here.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name)
#> # A tibble: 249 x 2
#>    Alpha_3 Name                
#>    <chr>   <chr>               
#>  1 ABW     Aruba               
#>  2 AFG     Afghanistan         
#>  3 AGO     Angola              
#>  4 AIA     Anguilla            
#>  5 ALA     Åland Islands       
#>  6 ALB     Albania             
#>  7 AND     Andorra             
#>  8 ARE     United Arab Emirates
#>  9 ARG     Argentina           
#> 10 ARM     Armenia             
#> # … with 239 more rows

{ISOcodes} does have another data frame for “retired” codes. This is ISO_3166_3 in the {ISOcodes} package. I encourage my student to take stock of how applicable some of these observations are for their particular analysis. My previous point about ISO codes—they don’t neatly communicate a temporal dimension—still holds.

ISO_3166_3 %>% as_tibble() %>%
  # Get rid of codes we don't want because we're focusing on three-character
  select(-Alpha_4, -Numeric)

A Table of Retired ISO Countries/Observations

ISO (3)  Name                                                Date Withdrawn  Comment
AFI      French Afars and Issas                              1977
ANT      Netherlands Antilles                                1993-07-12
ATB      British Antarctic Territory                         1979
BUR      Burma, Socialist Republic of the Union of           1989-12-05
BYS      Byelorussian SSR Soviet Socialist Republic          1992-06-15
CSK      Czechoslovakia, Czechoslovak Socialist Republic     1993-06-15
SCG      Serbia and Montenegro                               2006-06-05
CTE      Canton and Enderbury Islands                        1984
DDR      German Democratic Republic                          1990-10-30
DHY      Dahomey                                             1977
ATF      French Southern and Antarctic Territories           1979            now split between AQ and TF
FXX      France, Metropolitan                                1997-07-14
GEL      Gilbert and Ellice Islands                          1979            now split into Kiribati and Tuvalu
HVO      Upper Volta, Republic of                            1984
JTN      Johnston Island                                     1986
MID      Midway Islands                                      1986
NHB      New Hebrides                                        1980
ATN      Dronning Maud Land                                  1983
NTZ      Neutral Zone                                        1993-07-12      formerly between Saudi Arabia and Iraq
PCI      Pacific Islands (trust territory)                   1986            divided into FM, MH, MP, and PW
PUS      US Miscellaneous Pacific Islands                    1986
PCZ      Panama Canal Zone                                   1980
RHO      Southern Rhodesia                                   1980
SKM      Sikkim                                              1975
SUN      USSR, Union of Soviet Socialist Republics           1992-08-30
TMP      East Timor                                          2002-05-20      was Portuguese Timor
VDR      Viet-Nam, Democratic Republic of                    1977
WAK      Wake Island                                         1986
YMD      Yemen, Democratic, People's Democratic Republic of  1990-08-14
YUG      Yugoslavia, Socialist Federal Republic of           1993-07-28
ZAR      Zaire, Republic of                                  1997-07-14
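
One way to take stock of which retired codes matter for a given project is to pull the withdrawal year out of the date field and keep only the codes retired inside the temporal domain. This is a rough sketch under two assumptions: that the withdrawal column in ISO_3166_3 is named Date_withdrawn (check names(ISO_3166_3) if this errors), and that the domain of interest starts in 1990, which I picked only for illustration.

```r
library(dplyr)
library(ISOcodes)

domain_start <- 1990  # assumed start of the temporal domain, for illustration

retired_in_domain <- ISO_3166_3 %>%
  as_tibble() %>%
  select(Alpha_3, Name, Date_withdrawn) %>%
  # The date is either "YYYY" or "YYYY-MM-DD"; the first four characters are the year.
  mutate(year_withdrawn = as.integer(substr(Date_withdrawn, 1, 4))) %>%
  filter(year_withdrawn >= domain_start) %>%
  arrange(year_withdrawn)

retired_in_domain
```

Anything that survives the filter is a code that was live for at least part of the domain and may show up in older data releases.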

Create a State-Year Panel of ISO Codes

If I understand these data correctly, the last change to ISO classification (that could pose a problem for merging from a CoW perspective) concerns the separation between Serbia and Montenegro in 2006. Taking this information to heart, let’s assume we wanted a state-year panel based off ISO codes for all ISO observations from 2010 to 2020. Toward that end, we’d do something like this.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(styear = 2010,
         endyear = 2020) %>%
  rowwise() %>%
  mutate(year = list(seq(styear, endyear))) %>%
  unnest(year) %>%
  select(-styear, -endyear)
#> # A tibble: 2,739 x 3
#>    Alpha_3 Name   year
#>    <chr>   <chr> <int>
#>  1 ABW     Aruba  2010
#>  2 ABW     Aruba  2011
#>  3 ABW     Aruba  2012
#>  4 ABW     Aruba  2013
#>  5 ABW     Aruba  2014
#>  6 ABW     Aruba  2015
#>  7 ABW     Aruba  2016
#>  8 ABW     Aruba  2017
#>  9 ABW     Aruba  2018
#> 10 ABW     Aruba  2019
#> # … with 2,729 more rows
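
The same panel can be built as a cross join between the ISO codes and a one-column table of years. This is an alternative sketch, not a correction to the approach above; it assumes dplyr 1.1.0 or later, which is where cross_join() was introduced. The object name iso_panel is mine.

```r
library(dplyr)
library(ISOcodes)

iso_panel <- ISO_3166_1 %>%
  as_tibble() %>%
  select(Alpha_3, Name) %>%
  # Pair every ISO row with every year in the assumed 2010-2020 domain.
  cross_join(tibble(year = 2010:2020))

iso_panel
```

The cross join skips the rowwise()/unnest() step and returns an ordinary (ungrouped) tibble with one row per code-year.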

Create a Panel of UN M49 Codes

{ISOcodes} also has UN M49 codes (UN_M.49_Countries), though this requires some light cleaning.

UN_M.49_Countries %>% as_tibble() %>% 
  select(-ISO_Alpha_3) %>%
  mutate(Name = str_trim(Name, side="left"))
#> # A tibble: 249 x 2
#>    Code  Name               
#>    <chr> <chr>              
#>  1 004   Afghanistan        
#>  2 248   Åland Islands      
#>  3 008   Albania            
#>  4 012   Algeria            
#>  5 016   American Samoa     
#>  6 020   Andorra            
#>  7 024   Angola             
#>  8 660   Anguilla           
#>  9 010   Antarctica         
#> 10 028   Antigua and Barbuda
#> # … with 239 more rows

Use {countrycode} for Matching/Merging Across Classification Systems

While I encourage the student/reader to treat one classification system as a “master”, it’s highly unlikely the classification system that is the “master” will be the only one encountered in a particular project. For example, let’s assume our master system is the three-character ISO code. However, we’re going to merge in data (say: CoW’s trade data) that uses the CoW state system classification. {countrycode} will be very useful in matching one classification to another.

countrycode() is the primary function in Arel-Bundock’s package for that purpose. The user should create a column using the countrycode() function that identifies the source column (here: Alpha_3), identifies what type of classification that is (here: "iso3c"), and returns the equivalent code we want ("cown", for Correlates of War numeric code).

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"))
#> Warning in countrycode(Alpha_3, "iso3c", "cown"): Some values were not matched unambiguously: ABW, AIA, ALA, ASM, ATA, ATF, BES, BLM, BMU, BVT, CCK, COK, CUW, CXR, CYM, ESH, FLK, FRO, GGY, GIB, GLP, GRL, GUF, GUM, HKG, HMD, IMN, IOT, JEY, MAC, MAF, MNP, MSR, MTQ, MYT, NCL, NFK, NIU, PCN, PRI, PSE, PYF, REU, SGS, SHN, SJM, SPM, SRB, SXM, TCA, TKL, UMI, VGB, VIR, WLF
#> # A tibble: 249 x 3
#>    Alpha_3 Name                 ccode
#>    <chr>   <chr>                <dbl>
#>  1 ABW     Aruba                   NA
#>  2 AFG     Afghanistan            700
#>  3 AGO     Angola                 540
#>  4 AIA     Anguilla                NA
#>  5 ALA     Åland Islands           NA
#>  6 ALB     Albania                339
#>  7 AND     Andorra                232
#>  8 ARE     United Arab Emirates   696
#>  9 ARG     Argentina              160
#> 10 ARM     Armenia                371
#> # … with 239 more rows

I do want the reader to observe something. countrycode() cannot perfectly match observations. Because there are important differences among classification systems, perfect one-to-one matching is impossible (and it’s why I recommend treating one classification as a master system). When countrycode() cannot find a one-to-one match, it returns an NA and will tell you which inputs were not matched for your own diagnostic purposes. In our case, these are the NAs.

ISO Codes Without CoW Codes
ISO (3) Name CoW Code
ABW Aruba
AIA Anguilla
ALA Åland Islands
ASM American Samoa
ATA Antarctica
ATF French Southern Territories
BES Bonaire, Sint Eustatius and Saba
BLM Saint Barthélemy
BMU Bermuda
BVT Bouvet Island
CCK Cocos (Keeling) Islands
COK Cook Islands
CUW Curaçao
CXR Christmas Island
CYM Cayman Islands
ESH Western Sahara
FLK Falkland Islands (Malvinas)
FRO Faroe Islands
GGY Guernsey
GIB Gibraltar
GLP Guadeloupe
GRL Greenland
GUF French Guiana
GUM Guam
HKG Hong Kong
HMD Heard Island and McDonald Islands
IMN Isle of Man
IOT British Indian Ocean Territory
JEY Jersey
MAC Macao
MAF Saint Martin (French part)
MNP Northern Mariana Islands
MSR Montserrat
MTQ Martinique
MYT Mayotte
NCL New Caledonia
NFK Norfolk Island
NIU Niue
PCN Pitcairn
PRI Puerto Rico
PSE Palestine, State of
PYF French Polynesia
REU Réunion
SGS South Georgia and the South Sandwich Islands
SHN Saint Helena, Ascension and Tristan da Cunha
SJM Svalbard and Jan Mayen
SPM Saint Pierre and Miquelon
SRB Serbia
SXM Sint Maarten (Dutch part)
TCA Turks and Caicos Islands
TKL Tokelau
UMI United States Minor Outlying Islands
VGB Virgin Islands, British
VIR Virgin Islands, U.S.
WLF Wallis and Futuna
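
A list like the one above can be pulled straight from the data frame by filtering on the missing codes. This is a minimal sketch; the object name unmatched is mine, and warn = FALSE only silences the console warning (the NAs are still returned).

```r
library(dplyr)
library(countrycode)
library(ISOcodes)

unmatched <- ISO_3166_1 %>%
  as_tibble() %>%
  select(Alpha_3, Name) %>%
  # Suppress the warning here because we inspect the NAs directly.
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown", warn = FALSE)) %>%
  # Keep only the ISO observations with no CoW equivalent.
  filter(is.na(ccode))

unmatched
```

Having the country name next to each unmatched code makes it quick to sort the rows into "not a CoW state, ignore" and "should have matched, fix by hand."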

Some of this is by design. For example, there’s no CoW code for Aruba (ABW) because Aruba does not exist in the CoW system. That’ll be the bulk of the warnings returned by countrycode() for a case like this and you can safely ignore those. Some of this is, well, a headache you’ll need to fix yourself. For example, Serbia (SRB) always throws countrycode() for a loop, but Serbia has always been 345 in the CoW system. You can fix that yourself with an addendum to the mutate() wrapper. Something like ccode = ifelse(Alpha_3 == "SRB", 345, ccode) will work.

ISO_3166_1 %>% as_tibble() %>%
  # Alpha_2 = iso2c, if you wanted it.
  # I want the three-character one.
  select(Alpha_3, Name) %>%
  mutate(ccode = countrycode(Alpha_3, "iso3c", "cown"),
         ccode = ifelse(Alpha_3 == "SRB", 345, ccode)) 
#> # A tibble: 249 x 3
#>    Alpha_3 Name                 ccode
#>    <chr>   <chr>                <dbl>
#>  1 ABW     Aruba                   NA
#>  2 AFG     Afghanistan            700
#>  3 AGO     Angola                 540
#>  4 AIA     Anguilla                NA
#>  5 ALA     Åland Islands           NA
#>  6 ALB     Albania                339
#>  7 AND     Andorra                232
#>  8 ARE     United Arab Emirates   696
#>  9 ARG     Argentina              160
#> 10 ARM     Armenia                371
#> # … with 239 more rows
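
countrycode() also has a custom_match argument that does the same job as the ifelse() patch: a named vector whose names are origin codes and whose values are the destination codes to force. Here is a short sketch on three ISO codes chosen for illustration; the object name cow_codes is mine.

```r
library(countrycode)

# custom_match overrides the built-in crosswalk for the origin codes it names.
cow_codes <- countrycode(
  c("USA", "SRB", "ABW"),
  origin = "iso3c",
  destination = "cown",
  custom_match = c("SRB" = 345),
  warn = FALSE
)

cow_codes
```

This keeps all the manual fixes in one place, which is handy once a project accumulates more than one or two of them. Aruba still comes back as NA because nothing was supplied for it and it has no CoW code.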

I use this to underscore that {countrycode} is one of the most useful R packages for merging and matching across different state/country classification systems. However, it is not magic and should not be used uncritically. Always inspect the output.

  1. I’ll be using they/them pronouns here mostly for maximum anonymity.