Chapter 3 Import, Export, and Save Data
This chapter reviews importing external data into Python and R, including CSV files, Excel files, and other structured data files. There is often more than one way to import data into Python and R; each example below highlights one method per file type.
The data set we use for demonstration is the New York State Math Test Results by Grade from 2006–2011, downloaded from data.gov on September 30, 2021.
The final section presents approaches to exporting and saving data.
3.1 CSV
Comma separated value (CSV) files are text files with fields separated by commas. They are useful for “rectangular” data, where rows represent observations and columns represent variables or features.
Python
The pandas function read_csv() is a common approach to importing CSV files into Python.
import pandas as pd
d = pd.read_csv('data/ny_math_test.csv')
d.loc[0:2, ["Grade", "Year", "Mean Scale Score"]]
Grade Year Mean Scale Score
0 3 2006 700
1 4 2006 699
2 5 2006 691
R
The most common way of importing a CSV file into R is with the base R function read.csv().
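A minimal sketch of that call, assuming the same file path used in the Python example:

```r
d <- read.csv("data/ny_math_test.csv")
d[1:3, c("Grade", "Year", "Mean.Scale.Score")]
```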
Grade Year Mean.Scale.Score
1 3 2006 700
2 4 2006 699
3 5 2006 691
Notice that the spaces in the column names have been automatically replaced with periods.
Two packages that provide alternatives to read.csv() are readr and data.table. The function read_csv() from readr returns a tibble, and the function fread() from data.table returns a data.table.
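As a sketch of those alternatives (same assumed file path):

```r
# readr: returns a tibble
d_readr <- readr::read_csv("data/ny_math_test.csv")

# data.table: returns a data.table
d_dt <- data.table::fread("data/ny_math_test.csv")
```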
3.2 XLS/XLSX (Excel)
Excel files are native to Microsoft Excel. Prior to 2007, Excel files used the .xls extension; with the launch of Excel 2007, the extension changed to .xlsx. Excel files can have multiple sheets of data, a complexity that needs to be accounted for when importing XLS(X) data into Python and R.
Python
The pandas function read_excel() is a common approach to importing Excel files into Python. The sheet_name argument allows you to specify which sheet you want to import; you can specify a sheet by its (zero-indexed) position or by its name. Since this Excel file only has one sheet, we do not need to use the argument. In addition, specifying sheet_name = None will read in all sheets and return a dict where each key is a sheet name and each value is a DataFrame.
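A sketch of both uses. The demo.xlsx workbook and its sheet names are made up here so the example is self-contained; the chapter's file would be read the same way. Writing and reading .xlsx files with pandas requires an engine such as openpyxl.

```python
import pandas as pd

# Build a small two-sheet workbook so the example is self-contained
scores = pd.DataFrame({"Grade": [3, 4, 5], "Year": [2006, 2006, 2006]})
notes = pd.DataFrame({"Comment": ["demo"]})
with pd.ExcelWriter("demo.xlsx") as writer:
    scores.to_excel(writer, sheet_name="Scores", index=False)
    notes.to_excel(writer, sheet_name="Notes", index=False)

# Read a single sheet by name (or by zero-indexed position)
d_one = pd.read_excel("demo.xlsx", sheet_name="Scores")

# Read every sheet at once: a dict of sheet name -> DataFrame
d_all = pd.read_excel("demo.xlsx", sheet_name=None)
print(list(d_all.keys()))
```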
R
readxl is a well-documented and actively maintained package for importing Excel files into R. The workhorse function from the package is read_excel(). The sheet argument allows you to specify which sheet you want to import by position or by name. Since this Excel file has only one sheet, we can omit the argument.
library(readxl)
d_xls <- read_excel("data/ny_math_test.xlsx")
d_xls[1:3, c("Grade", "Year", "Mean Scale Score")]
# A tibble: 3 × 3
Grade Year `Mean Scale Score`
<chr> <dbl> <dbl>
1 3 2006 700
2 4 2006 699
3 5 2006 691
The result is a tibble, a tidyverse data frame.
It’s worth noting that we can use the range argument to specify a range of cells to import. For example, if the top-left corner of the data were cell B5 and the bottom-right corner were J54, we could enter range = "B5:J54" to import just that section of the data.
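A sketch of that call, using the hypothetical cell range from the text:

```r
d_sub <- read_excel("data/ny_math_test.xlsx", range = "B5:J54")
```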
3.3 JSON
JSON (JavaScript Object Notation) is a flexible format for storing data. JSON files are composed of text and can be viewed in any text editor. Because of this flexibility, the data stored in a JSON file can be quite complex, and consequently there is no one-size-fits-all method for importing JSON files into Python or R.
Python
Below is one approach to importing a JSON file. We first import Python’s built-in json module and use its loads() function to read in the contents of the JSON file. The file is accessed using the open() function and its associated read() method. Next we use the pandas function json_normalize() to convert the ‘data’ element of the JSON data into a DataFrame. Finally, we add column names to the DataFrame.
import json
import pandas as pd

# Load data using Python's built-in json module
with open('data/ny_math_test.json', 'r') as f:
    data = json.loads(f.read())

# Flatten the 'data' records into a DataFrame
d_json = pd.json_normalize(data, record_path=['data'])

# Add column names from the file's metadata
names = [data['meta']['view']['columns'][i]['name'] for i in range(23)]
d_json.columns = names
d_json.loc[0:2, ["Grade", "Year", "Mean Scale Score"]]
Grade Year Mean Scale Score
0 3 2006 700
1 4 2006 699
2 5 2006 691
Again, this is just one approach that assumes we want a DataFrame.
R
jsonlite is one of several R packages available for importing JSON files into R. The read_json() function takes a JSON file and returns a list or data frame, depending on the structure of the data file and the function's arguments. We set simplifyVector = TRUE so the data is simplified into a matrix.
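A minimal sketch of the import, assuming the same file path as the Python example:

```r
library(jsonlite)
d_json <- read_json("data/ny_math_test.json", simplifyVector = TRUE)
```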
The d_json object is a list with two elements: “meta” and “data”. The “data” element is a matrix that contains the data of interest. The “meta” element contains the column names for the data (among much else). Notice we have to “drill down” into the list to find the column names. We assign column names to the matrix using the colnames() function and then convert the matrix to a data frame using the as.data.frame() function.
colnames(d_json$data) <- d_json$meta$view$columns$fieldName
d_json <- as.data.frame(d_json$data)
d_json[1:3,c("grade", "year", "mean_scale_score")]
grade year mean_scale_score
1 3 2006 700
2 4 2006 699
3 5 2006 691
3.4 XML
XML (eXtensible Markup Language) is a markup language that was designed to store data. Like JSON files, XML files are text and can be viewed in a text editor or web browser, and this flexibility allows them to house complex data.
Python
The pandas library provides the read_xml() function for importing XML files. The records in the ny_math_test.xml file are identified as nodes named “row”. The 168 rows are nested in one outer node also called “row”. We use the xpath argument "row//row" to specify that we want to select all row elements that are descendants of the single outer row element.
import pandas as pd
d_xml = pd.read_xml('data/ny_math_test.xml', xpath = "row//row")
d_xml.loc[0:2, ["grade", "year", "mean_scale_score"]]
grade year mean_scale_score
0 3 2006 700
1 4 2006 699
2 5 2006 691
R
xml2 is a relatively small but powerful package for importing and working with XML files. The read_xml() function imports an XML file and returns an object that points to the XML document. There are a number of ways to proceed once you import an XML file, such as using the xml_find_all() function to find nodes that match an XPath expression. Below we take a simple approach and convert the XML nodes into a list using the as_list() function from xml2. Once we have the XML nodes in a list, we can use the bind_rows() function from the dplyr package to create a data frame. Notice we have to drill down into the list to select the element that contains the data. After this we need to do one more thing: unlist each of the columns into vectors. We do this by applying the unlist() function to each column of d. We save the result by assigning to d[], which overwrites each element (or column) of d with the unlisted result.
library(xml2)
d_xml <- read_xml('data/ny_math_test.xml')
d_list <- as_list(d_xml)
d <- dplyr::bind_rows(d_list$response$row)
d[] <- lapply(d, unlist)
d[1:3, c("grade", "year", "mean_scale_score")]
# A tibble: 3 × 3
grade year mean_scale_score
<chr> <chr> <chr>
1 3 2006 700
2 4 2006 699
3 5 2006 691
The result is a tibble, a tidyverse data frame. We would most likely want to convert certain columns to numeric next.
3.5 Exporting/Writing/Saving data and variables
There are several ways to export/write/save files from Python and R. The following examples highlight some of these ways.
Python
The pandas function to_csv() saves a pandas DataFrame as a CSV file.
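A minimal sketch, with a made-up DataFrame and file name standing in for the chapter's data:

```python
import pandas as pd

# A made-up DataFrame standing in for the chapter's data
d = pd.DataFrame({"Grade": [3, 4, 5], "Mean Scale Score": [700, 699, 691]})

# Write without the default integer index column
d.to_csv("scores.csv", index=False)

# Round-trip check: read the file back in
d2 = pd.read_csv("scores.csv")
print(d2.equals(d))  # True
```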
The Python package pickle allows you to write (save) any object from the Python environment and read (load) any object you have written into the Python environment.
The code below writes to a pickle file. The first line opens the file object being written to. In the open() function, file_name specifies the file path of the file object. Then ‘wb’ stands for ‘write binary’, which means the file is written in binary form (1s and 0s). The name after the as keyword, file_, is the user-selected name of the file object. The second line uses the pickle.dump() function, which requires two arguments: the object being written and the name of the file object.
import pickle
# Define the file name
file_name = 'data.pickle'
# Write the variable to the file system
with open(file_name, 'wb') as file_:
    pickle.dump(d, file_)
Conversely, the code below reads from a pickle file. The first line opens the file object being read from. In the open() function, data.pickle specifies the file path of the file object. Then ‘rb’ stands for “read binary,” which means the file is read in binary form (1s and 0s). The name after the as keyword (my_file in the example below) is the user-selected name of the file object.
The second line uses the pickle.load() function, which requires one argument: the name of the file object.
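A sketch of the read code just described. The small dump at the top is only there to make the example self-contained; in the chapter's flow, data.pickle was created by the write code above.

```python
import pickle

# Write a small object first so the example is self-contained;
# in the chapter's flow, data.pickle was created by the code above
with open('data.pickle', 'wb') as file_:
    pickle.dump({'answer': 42}, file_)

# Read the pickled object back into the Python environment
with open('data.pickle', 'rb') as my_file:
    d = pickle.load(my_file)
print(d)  # {'answer': 42}
```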
R
To export a matrix or data frame to a CSV file, use the write.csv() function. To export to a file with a different field separator, such as a tab, use write.table(). The minimal arguments for write.csv() are the object and the file name. To export a data frame named dat as a file named dat.csv to your current working directory, you’d use:
By default, a column of row names or numbers is included in the exported CSV file. To turn that off, set row.names = FALSE, like so:
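A sketch of that call:

```r
write.csv(dat, "dat.csv", row.names = FALSE)
```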
To append a matrix or data frame to an existing CSV file, use write.table() with append = TRUE; note that write.csv() ignores attempts to set the append argument.
See also sink(), cat(), and writeLines() for sending text and output to a file.
To save and load R objects for future use in R, there are two options:
- Save and load a single object using saveRDS() and readRDS().
- Save multiple objects using save() and load().
Save and load a single object
The minimal arguments for saveRDS() are the object and a file name with an .rds extension. For example, to save a single data frame named dat to your current working directory as dat.rds:
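A minimal sketch:

```r
saveRDS(dat, "dat.rds")
```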
To load the rds file into R from your current working directory, use the readRDS() function. Notice we must assign the result of readRDS() to an object. The object name need not match the file name.
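For example (dat2 here is an arbitrary, made-up name):

```r
dat2 <- readRDS("dat.rds")
```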
The advantage of saving and loading native R objects is the preservation of characteristics such as factors, attributes, classes, etc. Any object can be saved, including model objects, functions, vectors, lists, etc.
Save multiple objects
The minimal arguments for save() are the objects to save and a file name with a .rda extension. Objects can also be specified as a character vector to the list argument. For example, to save a data frame named dat, a model object named m, and a plot object named p, to your current working directory as work.rda:
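A minimal sketch:

```r
save(dat, m, p, file = "work.rda")
```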
Or with objects specified as a character vector:
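A sketch of that variant:

```r
save(list = c("dat", "m", "p"), file = "work.rda")
```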
To load the rda file from your current working directory, use the load() function. Notice we do not assign the result to an object name. The load() function loads the saved objects directly into your global environment.
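A minimal sketch:

```r
load("work.rda")
```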
Upon successful execution of the load() function, the dat, m, and p objects will be loaded into your global environment. Any objects already in your global environment with the same names will be overwritten without warning.
You can also save everything in your global environment to an rda file using the save.image() function. It works just like the save() function except you do not specify which objects to save; you simply provide a file name. If you do not specify a file name, the default name .RData is used. To load the file, use the load() function. Again, all objects will be loaded into the global environment, overwriting any existing objects with the same names.