I’ve written a little bit elsewhere on my blog about how you should store your bigger data frames with an experiment on the World Values Survey. The fundamental takeaways were the following. First, a relational database (i.e. SQL) provides the absolute fastest read times even if the object sizes of these relational databases are as large as raw text (i.e. they have no file compression). Second, serialization through the
fst package provides the second fastest load time but the file size that emerges from this serialization is discernibly larger than another serialization processes native to R. Further, serialization through
fst discards variable/value label information that may come in the data set. Third, serialization in R (i.e.
readRDS()) is a best of both worlds, of a sort. The load time is slower than what a researcher would get through a relational database or serialization via
fst, but the file size is the absolute smallest, the load time is not too slow relative to other options, and an R serialized data frame retains variable/value label information. R serialization became my go-to shortly thereafter.
However, there might be another option through quick serialization (
qs) of R objects. I only discovered this package last week and wanted to compare it with some other options as an experiment quite similar to what I did in January of last year. What follows is a comparison of different serialization processes in their read/save times.
The Data and the Setup
I’m using the General Social Survey (GSS) from 1972 to 2018 as the trial data for this experiment. I’m doing this for a few reasons. For one, the data are not as “long” as the World Values Survey’s (WVS) cross-national data from 1981 to the most recent survey waves. The World Values Survey is 341,271 rows long across all countries and all six waves while the GSS is 64,814 rows long. However, the data are much, much wider. WVS has 1,410 columns, which could be squeezed into SQL relational database that otherwise typically balks at any data frame having more than a thousand columns. The GSS have 6,108 columns from 1972 to 2018. No relational database (of which I’m aware) will store 6,000+ columns into a single database. A user could manually split those into subsets and save it as a relational database, but this is kind of impractical and the file size—a la my experiment from last year—would be prohibitively large. Further, it would not save the variable/value labels that the GSS (like the World Values Survey) provides in its data distributed as SPSS or Stata data frames.
The setup will compare the read and save times for serialization through 1) R’s serialized data frames (RDS) that comes in base R, 2) “lightning fast serialization” (fst) of data frames from the
fst package, and 3) “quick serialization” (qs) from the
qs package. The read time experiments will also be complemented with a comparison of the read times for the GSS in its SPSS format, because I’m a glutton for R commands that take several minutes to run.
The code that generates the results is available as a gist on my Github, as are the data. The
_source directory on my website’s Github directory will show additional code for formatting those results. Background information: I’m conducting this experiment on my war horse desktop rig running Microsoft R Open 3.5.3 on Ubuntu 18.04. I’m also dicking around on the internet, listening to music on YouTube while writing this post. So yes, there’s background noise in these results, not that it will materially change much.
Here is a table that compares the read times of the GSS data from 1972 to 2018. For each format, I read the data into R 10 times and stored the time it took to read the data and assign it to an object. I will only note, for clarification, that the default SPSS data frame includes all columns in uppercase. I made all the columns to be in lower case before saving it to either an fst object, qs object, or RDS object.
The results suggest some recurring themes from the last experiment on the WVS data. Namely, the SPSS and Stata binaries that our longest running survey houses provide are time-consuming to load in R. They’re necessary to provide for many reasons; not every consumer of the data uses R and these software packages come with some useful information (i.e. variable/value labels). However, they are time-consuming to load. It’s why I take every download of WVS or GSS data and immediately look for another means to store it.
Further, serialization through the
fst package results in what is easily the fastest average load time. This was true in the WVS experiment, excluding the relational database approach (that is untenable in this application). However, fst compression has one important drawback: it discards the variable/value labels, as well as some other characteristics of the dataframe. For example, I love storing things, by default, as tibbles. fst binaries do not allow for that.
There is something curious about the R serialization approach. In my WVS experiment, the average read time was around 7.3 seconds. In this GSS experiment, the average read time was over 19 seconds. This is curious for several reasons. One, the WVS data are much, much bigger in overall file size. The GSS R serialization data frame is almost 31 MB while the WVS data frame is almost 46 MB. Two, my current computer is much, much more powerful than the Macbook I used to do the WVS experiment. It should have—I would have thought—breezed through the GSS data. My only explanation that I can conjure from the top of my head is that R serialization works better with longer data than wider data. The GSS data are wider than longer and it’s those additional columns, with variable/value labels, that makes it a chore for the
readRDS() function in base R.
Between both the
fst package and
readRDS() function in base R is quick serialization with the
qs package. The average read time is almost twice the average read time from the
fst package. However, the difference is a matter of seconds. Plus, the
qs package allows the user to retain important data frame information, like variable/value labels and whether the data frame is a tibble.
I think the write-time experiment will show how much utility there is from quick serialization through the
qs package. The next table reports average write times, with minimum times, maximum times, and the ensuing file size. For comparison sake, the SPSS binary of the GSS data is 427.9 MB.
|Method||Average Time||Minimum||Maximum||File Size|
I think the table has the following three takeaways. First, the R serialized data frame has the smallest file size, much like it did in the WVS experiment. It can also retain the variable/value label data and whether you stored the data frame as a tibble or not. However, the write time is obscene relative to the other options. Combined with the longer read times, R serialization looks much less attractive (it seems) when the data are “wider” than “longer.”
Second, serialization through the
fst package is easily the fastest but the file size that emerges is easily the largest. The data frame is roughly three times the size of the R serialized data frame or the
qs serialized data frame. However, this fst binary does not retain variable/value labels and the user would have to manually declared it as a tibble when reading the data into R. This latter point is not a glaring limitation, but it would be nice to have.
Third, what emerges becomes quite the endorsement of the
qs package. Yes, the save time is about twice the save time of serialization through the
fst package. However, the difference is in seconds and not actual minutes. Further, the file size is much smaller than the
fst binary and it stores/saves variable/value label information. If you have data that look like the GSS data I use here, consider downloading it now.