Chapter 3 Import, Export, and Save Data

This chapter reviews importing external data into Python and R, including CSV files, Excel files, and other structured data files. There is often more than one way to import data into Python and R; each example below highlights one method per file type.

The data set we use for demonstration is the New York State Math Test Results by Grade from 2006 - 2011, downloaded from data.gov on September 30, 2021.

The final section presents approaches to exporting and saving data.

3.1 CSV

Comma-separated value (CSV) files are text files with fields separated by commas. They are useful for “rectangular” data, where rows represent observations and columns represent variables or features.

Python

The pandas function read_csv() is a common approach to importing CSV files into Python.

import pandas as pd
d = pd.read_csv('data/ny_math_test.csv')
d.loc[0:2, ["Grade", "Year", "Mean Scale Score"]]
  Grade  Year  Mean Scale Score
0     3  2006               700
1     4  2006               699
2     5  2006               691

R

The most common way of importing a CSV file into R is with the base R function read.csv().

d <- read.csv("data/ny_math_test.csv")
d[1:3, c("Grade", "Year", "Mean.Scale.Score")]
  Grade Year Mean.Scale.Score
1     3 2006              700
2     4 2006              699
3     5 2006              691

Notice that the spaces in the column names have been automatically replaced with periods.

Two packages that provide alternatives to read.csv() are readr and data.table. The function read_csv() from readr returns a tibble, and the function fread() from data.table returns a data.table.
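
For example, a minimal sketch of reading the same file with each package (assuming readr and data.table are installed):

library(readr)
library(data.table)

# readr::read_csv() returns a tibble
d_tbl <- read_csv("data/ny_math_test.csv")

# data.table::fread() returns a data.table
d_dt <- fread("data/ny_math_test.csv")

Note that, unlike read.csv(), both functions keep the spaces in the column names rather than replacing them with periods.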

3.2 XLS/XLSX (Excel)

Excel files are native to Microsoft Excel. Prior to 2007, Excel files used the XLS extension; with the release of Excel 2007, the default format changed to XLSX. Excel files can contain multiple sheets of data, a complexity that needs to be accounted for when importing XLS(X) data into Python and R.

Python

The pandas function read_excel() is a common approach to importing Excel files into Python. The sheet_name argument allows you to specify which sheet you want to import; you can specify a sheet by its (zero-indexed) position or by its name. Since this Excel file has only one sheet, we do not need to use the argument. In addition, specifying sheet_name = None will read in all sheets and return a dict whose keys are the sheet names and whose values are DataFrames.

import pandas as pd  
d = pd.read_excel('data/ny_math_test.xlsx')  
d.loc[0:2, ["Grade", "Year", "Mean Scale Score"]]  

R

readxl is a well-documented and actively maintained package for importing Excel files into R. The workhorse function from the package is read_excel(). The sheet argument allows you to specify which sheet you want to import by position or by name. Since this Excel file has only one sheet, we can omit the argument.

library(readxl)
d_xls <- read_excel("data/ny_math_test.xlsx")
d_xls[1:3, c("Grade", "Year", "Mean Scale Score")]
# A tibble: 3 × 3
  Grade  Year `Mean Scale Score`
  <chr> <dbl>              <dbl>
1 3      2006                700
2 4      2006                699
3 5      2006                691

The result is a tibble, a tidyverse data frame.

It’s worth noting that we can use the range argument to specify a range of cells to import. For example, if the top-left corner of the data were cell B5 and the bottom-right corner were cell J54, we could set range = "B5:J54" to import just that block of cells.
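
A sketch of such a call is below; the cell range is illustrative and does not correspond to the actual layout of ny_math_test.xlsx.

library(readxl)
# Illustrative only: import just the cells between B5 (top left) and J54 (bottom right)
d_sub <- read_excel("data/ny_math_test.xlsx", range = "B5:J54")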

3.3 JSON

JSON (JavaScript Object Notation) is a flexible format for storing data. JSON files are composed of text and can be viewed in any text editor. Because of this flexibility, the data stored in a JSON file can be quite complex, and consequently there is no one-size-fits-all method for importing JSON files into Python or R.

Python

Below is one approach to importing a JSON file. We first import Python’s built-in json module and use its loads() function to parse the contents of the JSON file, which we access with the open() function and its read() method. Next we use the pandas function json_normalize() to convert the 'data' element of the JSON structure into a DataFrame. Finally, we add column names to the DataFrame, taking them from the 'meta' element.

import json
# Load data using Python JSON module
with open('data/ny_math_test.json', 'r') as f:
    data = json.loads(f.read())

import pandas as pd
d_json = pd.json_normalize(data, record_path=['data'])

# Add column names taken from the JSON metadata
names = [col['name'] for col in data['meta']['view']['columns']]
d_json.columns = names

d_json.loc[0:2, ["Grade", "Year", "Mean Scale Score"]]  
  Grade  Year Mean Scale Score
0     3  2006              700
1     4  2006              699
2     5  2006              691

Again, this is just one approach that assumes we want a DataFrame.

R

jsonlite is one of several packages available for importing JSON files into R. The read_json() function takes a JSON file and returns a list or data frame, depending on the structure of the data file and the function’s arguments. We set simplifyVector = TRUE so that nested JSON arrays are simplified into vectors and matrices rather than left as nested lists.

library(jsonlite)
d_json <- read_json('data/ny_math_test.json', simplifyVector = TRUE)

The d_json object is a list with two elements: “meta” and “data”. The “data” element is a matrix that contains the data of interest. The “meta” element contains the column names for the data (among much else); notice that we have to “drill down” into the list to find them. We assign column names to the matrix using the colnames() function and then convert the matrix to a data frame using the as.data.frame() function.

colnames(d_json$data) <- d_json$meta$view$columns$fieldName
d_json <- as.data.frame(d_json$data)
d_json[1:3,c("grade", "year", "mean_scale_score")]
  grade year mean_scale_score
1     3 2006              700
2     4 2006              699
3     5 2006              691

3.4 XML

XML (eXtensible Markup Language) is a markup language that was designed to store data. Like JSON files, XML files are text and can be viewed in a text editor or web browser, and this flexibility allows them to house complex data.

Python

The pandas library provides the read_xml() function for importing XML files. The records in the ny_math_test.xml file are identified as nodes named “row”. The 168 rows are nested in one outer node also called “row”. We use the xpath argument "row//row" to specify that we want to select all row elements that are descendants of the single outer row element.

import pandas as pd
d_xml = pd.read_xml('data/ny_math_test.xml', xpath = "row//row")

d_xml.loc[0:2, ["grade", "year", "mean_scale_score"]]  
  grade  year  mean_scale_score
0     3  2006               700
1     4  2006               699
2     5  2006               691

R

xml2 is a relatively small but powerful package for importing and working with XML files. The read_xml() function imports an XML file and returns an XML document object that points to the parsed XML nodes. There are a number of ways to proceed once you import an XML file, such as using the xml_find_all() function to find nodes that match an xpath expression. Below we take a simple approach and convert the XML nodes into a list using the as_list() function that is part of the xml2 package.

Once we have the XML nodes in a list, we can use the bind_rows() function from the dplyr package to create a data frame. Notice we have to drill down into the list to select the element that contains the data. After this we need to do one more thing: unlist each of the columns into vectors. We do this by applying the unlist function to each column of d and saving the result by assigning to d[], which overwrites each element (or column) of d with the unlisted result.

library(xml2)
d_xml <- read_xml('data/ny_math_test.xml')
d_list <- as_list(d_xml)
d <- dplyr::bind_rows(d_list$response$row)
d[] <- lapply(d, unlist)
d[1:3, c("grade", "year", "mean_scale_score")]
# A tibble: 3 × 3
  grade year  mean_scale_score
  <chr> <chr> <chr>           
1 3     2006  700             
2 4     2006  699             
3 5     2006  691             

The result is a tibble, a tidyverse data frame. We would most likely want to proceed by converting certain columns, such as year and mean_scale_score, to numeric.
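
One way to do this, sketched below, is to apply the base R function type.convert() to each column; columns made up entirely of numbers are converted to numeric (or integer), and the remaining columns stay character.

# Convert columns whose values are all numbers to numeric/integer;
# as.is = TRUE keeps the other character columns as character, not factor
d[] <- lapply(d, type.convert, as.is = TRUE)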

3.5 Exporting/Writing/Saving Data and Variables

There are several ways to export/write/save files from Python and R. The following examples highlight some of these ways.

Python

The pandas DataFrame method to_csv() saves a DataFrame as a CSV file.

# Pass a file name to the function
d.to_csv("data.csv")

The pickle module in the Python standard library allows you to write (save) nearly any object from the Python environment to a file and read (load) it back into the Python environment.

The code below writes to a pickle file. The first line opens the file object being written to. In the open() function, file_name specifies the file path of the file object, and 'wb' stands for 'write binary', which means the file is written in binary form (1s and 0s). The name after the as keyword, file_, is the user-selected name for the file object.

The second line uses the pickle.dump() function. This function requires two arguments: the object being written and the file object to write it to.

import pickle

# Define the file name
file_name = 'data.pickle'

# Write the variable to the file system
with open(file_name, 'wb') as file_:
    pickle.dump(d, file_)

Conversely, the code below reads from a pickle file. The first line opens the file object being read from. In the open() function, data.pickle specifies the file path of the file object, and 'rb' stands for 'read binary', which means the file is read in binary form (1s and 0s). The name after the as keyword (my_file in the example below) is the user-selected name for the file object.

The second line uses the pickle.load() function. This function requires one argument: the file object to read from.

# Read the specified file from the file system and load into variable
with open('data.pickle', 'rb') as my_file:
    d = pickle.load(my_file)

R

To export a matrix or data frame to a CSV file, use the write.csv() function. To export to a file with a different field separator, such as a tab, use write.table(). The minimal arguments for write.csv() are the object and the file name. To export a data frame named dat as a file named dat.csv to your current working directory, you’d use:

write.csv(dat, file = "dat.csv")

By default, a column for row names or numbers is included in the exported csv file. To turn that off, set row.names = FALSE, like so:

write.csv(dat, file = "dat.csv", row.names = FALSE)

To append a matrix or data frame to an existing CSV file, use write.table() with append = TRUE; note that write.csv() ignores the append argument.
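
A sketch of appending rows is below, assuming dat2 is a second (hypothetical) data frame with the same columns as dat.

# dat2 is a hypothetical data frame with the same columns as dat.
# col.names = FALSE avoids writing a second header row mid-file.
write.table(dat2, file = "dat.csv", sep = ",",
            append = TRUE, row.names = FALSE, col.names = FALSE)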

See also sink(), cat(), and writeLines() for sending text and output to a file.
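
For instance, a minimal sketch using writeLines() (the file name notes.txt is just an example):

# Write two lines of plain text to a file
writeLines(c("First line of notes", "Second line of notes"), con = "notes.txt")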

To save and load R objects for future use in R, there are two options:

  1. Save and load a single object using saveRDS() and readRDS().
  2. Save multiple objects using save() and load().

Save and load a single object

The minimal arguments for saveRDS() are the object and a file name with an .rds extension. For example, to save a single data frame named dat to your current working directory as dat.rds:

saveRDS(dat, file = "dat.rds")

To load the rds file into R from your current working directory, use the readRDS() function. Notice we must assign the result of readRDS() to an object. The object name need not match the file name.

d <- readRDS("dat.rds")

The advantage of saving and loading native R objects is the preservation of characteristics such as factors, attributes, classes, etc. Any object can be saved, including model objects, functions, vectors, lists, etc.
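
For example, a sketch of saving and reloading a fitted model object; the model below is hypothetical, fit to the built-in mtcars data purely for illustration.

# Fit a simple (illustrative) model and save it to disk
m <- lm(mpg ~ wt, data = mtcars)
saveRDS(m, file = "m.rds")

# Reload it later (possibly in a new R session) and use it
m2 <- readRDS("m.rds")
predict(m2, newdata = data.frame(wt = 3))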

Save multiple objects

The minimal arguments for save() are the objects to save and a file name with a .rda extension. Objects can also be specified as a character vector to the list argument. For example, to save a data frame named dat, a model object named m, and a plot object called p, to your current working directory as work.rda:

save(dat, m, p, file = "work.rda")

Or with objects specified as a character vector:

save(list = c("dat", "m", "p"), file = "work.rda")

To load the rda file from your current working directory, use the load() function. Notice we do not assign the result to an object name; load() restores the saved objects directly into your global environment.

load("work.rda")

Upon successful execution of the load() function, the dat, m, and p objects will be loaded into your global environment. Any objects already in your global environment with the same name will be overwritten without warning.

You can also save everything in your global environment into an rda file using the save.image() function. It works just like the save() function except you do not specify which objects to save; you simply provide a file name. If you do not specify a file name, the default name .RData is used. To load the file, use the load() function. Again, all objects will be loaded into the global environment, overwriting any existing objects with the same name.
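
For example, a sketch of saving and later restoring an entire workspace; the file name everything.RData is just an example.

# Save every object currently in the global environment
save.image(file = "everything.RData")

# In a later session, restore all of those objects
load("everything.RData")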