Chapter 2 Data Structures
This chapter compares and contrasts data structures in Python and R.
2.1 One-dimensional data
A one-dimensional data structure can be visualized as a column in a spreadsheet or as a list of values.
Python
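There are many ways to organize one-dimensional data in Python. Three of the most common one-dimensional data structures are lists, NumPy arrays, and pandas Series. All three are ordered and mutable. Lists can hold values of different types. NumPy arrays store a single data type, so mixed input is coerced to a common type (for example, numbers become strings when combined with text). A pandas Series likewise has a single dtype, with mixed values held under the general object dtype.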
Lists in Python do not need to be explicitly declared; they are indicated by the use of square brackets.
Values in lists can be accessed by using square brackets. Python indexing begins at 0, so to extract the first element, we would use the index 0. Python also allows for negative indexing; using an index of -1 will return the last value in the list. Slicing a range in Python excludes the end index, so [1:3] returns the elements at positions 1 and 2.
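The indexing rules above can be sketched as follows. The list name `my_list` and its contents are assumptions chosen for illustration.

```python
# Hypothetical list used only to illustrate the indexing rules
my_list = [1, 2, 3, "hello"]

first = my_list[0]     # index 0 is the first element
last = my_list[-1]     # index -1 is the last element
middle = my_list[1:3]  # positions 1 and 2; the end index 3 is excluded

print(first)
print(repr(last))
print(middle)
```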
1
'hello'
[2, 3]
NumPy arrays, on the other hand, need to be declared using the numpy.array() function, and the NumPy package needs to be imported.
['1' '2' '3' 'hello']
Accessing data in a NumPy array is the same as indexing a list.
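A minimal sketch of creating and indexing a one-dimensional NumPy array, assuming an array named `arr` with the same mixed values as above. Because one element is text, NumPy coerces every element to a string.

```python
import numpy as np

# Mixed input is coerced to a single Unicode string dtype
arr = np.array([1, 2, 3, "hello"])

first = arr[0]     # first element, now the string '1'
last = arr[-1]     # last element
middle = arr[1:3]  # positions 1 and 2; end index excluded

print(arr)
print(arr.dtype.kind)  # 'U' marks a Unicode string dtype
```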
'1'
'hello'
array(['2', '3'], dtype='<U11')
pandas Series also need to be declared using the pandas.Series() function. Like NumPy, the pandas package must be imported as well. The pandas package is built on NumPy, so we can input data into a pandas Series using a NumPy array. We can extract data from a Series by indexing, similar to indexing a list or NumPy array.
import pandas as pd
import numpy as np
data = np.array([1, 2, 3, "hello"])
ser1 = pd.Series(data)
print(ser1)
0 1
1 2
2 3
3 hello
dtype: object
'1'
1 2
2 3
dtype: object
Similarly, we can use iloc[] (note the square brackets) to subset elements based on integer position.
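As a sketch of position-based selection, the snippet below rebuilds a Series like the one shown earlier (the name `ser1` is reused here as an assumption) and selects positions 1 and 2.

```python
import numpy as np
import pandas as pd

ser1 = pd.Series(np.array([1, 2, 3, "hello"]))

# iloc slices by integer position; the end position is excluded
subset = ser1.iloc[1:3]
print(subset)
```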
1 2
2 3
dtype: object
We can relabel the indices of the Series to whatever we like using the index argument of the Series() function.
import pandas as pd
import numpy as np
ser2 = pd.Series(data, index = ['a', 'b', 'c', 'd'])
print(ser2)
a 1
b 2
c 3
d hello
dtype: object
We can then use our own specified indices to select and index our data. Indexing with our labels can be done in two ways: one approach uses the .loc[] indexer and is similar to indexing arrays and lists with square brackets; the other approach uses attribute access of the form Series.label_name. Note that ranges referenced in loc[] are inclusive on both ends; normally, Python interprets ranges as having open upper ends (e.g., print(["a", "b", "c"][0:2]) prints ['a', 'b']).
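The two label-based approaches can be sketched as follows, assuming a Series named `ser2` labeled 'a' through 'd'. Attribute access only works when a label is a valid Python identifier that does not clash with an existing Series attribute.

```python
import numpy as np
import pandas as pd

ser2 = pd.Series(np.array([1, 2, 3, "hello"]), index=["a", "b", "c", "d"])

by_label = ser2.loc["b"]         # single label
label_range = ser2.loc["b":"d"]  # label slice; both ends are included
by_attr = ser2.d                 # attribute-style access to label 'd'

print(by_label)
print(label_range)
print(by_attr)
```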
'2'
b 2
c 3
d hello
dtype: object
'hello'
'2'
Element-wise mathematical operations cannot be carried out directly on lists (with a list, + concatenates and * repeats), but they can be carried out on NumPy arrays and pandas Series that hold numeric data. In general, lists are better for short data sets that will not be the targets of mathematical operations. NumPy arrays and pandas Series are better for long data sets and for data sets that will be operated on mathematically.
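A short sketch of the difference, using a hypothetical numeric list named `values`. Multiplying the list repeats it, while multiplying the array or Series doubles each element.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

repeated = values * 2                # list repetition, not arithmetic
doubled_arr = np.array(values) * 2   # element-wise multiplication
doubled_ser = pd.Series(values) * 2  # element-wise multiplication

print(repeated)              # [1, 2, 3, 1, 2, 3]
print(doubled_arr.tolist())  # [2, 4, 6]
print(doubled_ser.tolist())  # [2, 4, 6]
```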
R
In R, a one-dimensional data structure is called a vector. We can create a vector using the c() function. A vector in R can only contain one type of data (all numbers, all strings, etc.). The columns of data frames (a data structure discussed in section 2.2 below) are vectors. If multiple types of data are put into a vector, the data will be coerced according to the hierarchy logical < integer < double < complex < character. This means if you mix, say, integers and character data, all the data will be coerced to character.
[1] 23 43 55
[1] "23" "43" "hi"
Values in a vector can be accessed by position using indexing brackets. R indexes elements of a vector starting at 1. Index values are inclusive. For example, a_vec[2:3] selects the second and third elements of a_vec.
[1] 43
[1] 43 55
2.2 Two-dimensional data
Two-dimensional data are rectangular in nature, consisting of rows and columns. These can be the type of data you might find in a spreadsheet with a mix of data types in the columns; they can also be matrices as you might encounter in matrix algebra.
Python
In Python, two common two-dimensional data structures are the NumPy array (introduced above in its one-dimensional form) and the pandas DataFrame.
A two-dimensional NumPy array is made in a similar way to the one-dimensional array using the numpy.array() function.
[['1' '2' '3' 'hello']
['4' '5' '6' 'world']]
Data can be selected from a two-dimensional NumPy array via [row, column] indexing:
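A minimal sketch of [row, column] indexing, assuming a two-row array named `arr2d` with the values shown above:

```python
import numpy as np

arr2d = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])

top_left = arr2d[0, 0]      # row 0, column 0
bottom_right = arr2d[1, 3]  # row 1, column 3
block = arr2d[:, 1:3]       # all rows, columns 1 and 2

print(top_left)
print(bottom_right)
print(block)
```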
'1'
'world'
array([['2', '3'],
['5', '6']], dtype='<U11')
A pandas DataFrame is made using the pandas.DataFrame() function, and it shares many functional similarities with the pandas Series.
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data)
print(df)
0 1 2 3
0 1 2 3 hello
1 4 5 6 world
To select certain rows and columns from a pandas DataFrame based on their integer position, we can use iloc[] (as we did with the pandas Series above). The values inside the brackets following iloc reflect the rows and the columns, respectively.
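The position-based selections can be sketched as follows, assuming a DataFrame named `df` built from the same two-row array:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]]))

cell = df.iloc[0, 0]        # row 0, column 0
first_row = df.iloc[0, :]   # all columns of row 0, returned as a Series
second_col = df.iloc[:, 1]  # all rows of column 1, returned as a Series

print(cell)
print(first_row)
print(second_col)
```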
'1'
0 1
1 2
2 3
3 hello
Name: 0, dtype: object
0 2
1 5
Name: 1, dtype: object
As we did with the pandas Series, we can change the indices and the column names of the DataFrame and use those to select certain data. We change the indices using the index argument in pandas.DataFrame(), and we change the column names using the columns argument. These new labels can be used with loc[] for subsetting. (And, as mentioned above, if a range is passed to loc[], it is treated as inclusive on both ends.)
import pandas as pd
import numpy as np
data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data, index = ["a", "b"], columns = ["column 1", "column 2", "column 3", "column 4"])
print(df.loc[["a", "b"], "column 1"])
a 1
b 4
Name: column 1, dtype: object
One thing to note is that NumPy arrays can actually have an arbitrary number of dimensions, whereas pandas DataFrames can only have two.
R
Two-dimensional data structures in R include the matrix and data frame. A matrix can contain only one data type. A data frame can contain multiple vectors, each of which can consist of different data types.
Create a matrix with the matrix() function. Create a data frame with the data.frame() function. Most imported data comes into R as a data frame.
# Matrix; populated down by column by default
m <- matrix(data = c(1, 3, 5, 7), nrow = 2, ncol = 2)
m
[,1] [,2]
[1,] 1 5
[2,] 3 7
name age
1 Rob 35
2 Cindy 37
Values in a matrix and data frame can be accessed by position using indexing brackets. The first number(s) refer to rows; the second number(s) refer to columns. Leaving row or column numbers empty selects all rows or columns.
[1] 5
name age
2 Cindy 37
2.3 Three-dimensional and higher data
Three-dimensional and higher data can be visualized as multiple rectangular structures stratified by extra variables. These are sometimes referred to as arrays. Analysts usually prefer two-dimensional data frames to arrays. Data frames can accommodate multidimensional data by including the additional dimensions as variables.
Python
To create a three-dimensional or higher data structure in Python, we again use a NumPy array. We can think of the three-dimensional array as a stack of two-dimensional arrays. We construct this in the same way as the one- and two-dimensional arrays.
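As a sketch, a three-dimensional array can be built from nested lists. The name `arr3d` is an assumption; the values are the numbers 1 through 12.

```python
import numpy as np

# Two stacked 2x3 arrays
arr3d = np.array([[[1, 2, 3], [4, 5, 6]],
                  [[7, 8, 9], [10, 11, 12]]])

print(arr3d.shape)  # (2, 2, 3): strata, rows, columns
print(arr3d.ndim)   # 3
```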
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
We can also construct a three-dimensional NumPy array using the reshape() function on an existing array. The argument of reshape() is where you input your desired dimensions: strata, rows, and then columns. Here, the arange() function is used to create a NumPy array containing the numbers 1 through 12 (to recreate the same array shown above).
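A minimal sketch of this approach, with `reshaped` as a hypothetical name. Note that arange() excludes its stop value, so a stop of 13 yields the numbers 1 through 12.

```python
import numpy as np

# 12 values arranged as 2 strata, 2 rows, 3 columns
reshaped = np.arange(1, 13).reshape(2, 2, 3)
print(reshaped)
```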
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[ 7, 8, 9],
[10, 11, 12]]])
Indexing the three-dimensional array follows the same format as the two-dimensional arrays. Since we can think of the three-dimensional array as a stack of two-dimensional arrays, we can extract each “stacked” two-dimensional array. Here we extract the first of the “stacked” two-dimensional arrays:
array([[1, 2, 3],
[4, 5, 6]])
We can also extract entire rows and columns, or individual array elements:
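These selections can be sketched as follows, assuming the array is named `arr3d` and is indexed as [stratum, row, column]:

```python
import numpy as np

arr3d = np.arange(1, 13).reshape(2, 2, 3)

first_stratum = arr3d[0]  # first stacked two-dimensional array
row = arr3d[1, 0, :]      # first row of the second stratum
column = arr3d[1, :, 0]   # first column of the second stratum
element = arr3d[0, 1, 2]  # first stratum, second row, third column

print(first_stratum)
print(row)
print(column)
print(element)
```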
array([7, 8, 9])
array([ 7, 10])
6
The three-dimensional arrays can be converted to two-dimensional arrays again using the reshape() function:
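For example, assuming the same 2 x 2 x 3 array named `arr3d`, the two strata can be stacked into a single array with four rows and three columns:

```python
import numpy as np

arr3d = np.arange(1, 13).reshape(2, 2, 3)

# Collapse the strata into rows: 4 rows, 3 columns
flat2d = arr3d.reshape(4, 3)
print(flat2d)
```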
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9],
[10, 11, 12]])
R
The array() function in R can create three-dimensional and higher data structures. Arrays are like vectors and matrices in that they can only contain one data type. In fact, matrices and arrays are sometimes described as vectors with instructions on how to lay out the data.
We can specify the dimension number and size using the dim argument. Below we specify 2 rows, 3 columns, and 2 strata using a vector: c(2,3,2). This creates a three-dimensional data structure. The data in the example are simply the numbers 1 through 12.
, , 1
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
, , 2
[,1] [,2] [,3]
[1,] 7 9 11
[2,] 8 10 12
Values in arrays can be accessed by position using indexing brackets.
[1] 3
[,1] [,2]
[1,] 3 9
[2,] 4 10
The dimensions can be named using the dimnames() function. Notice the names must be a list.
, , Z = z1
Y
X y1 y2 y3
x1 1 3 5
x2 2 4 6
, , Z = z2
Y
X y1 y2 y3
x1 7 9 11
x2 8 10 12
The as.data.frame.table() function can collapse an array into a two-dimensional structure that may be easier to use with standard statistical and graphical routines. The responseName argument allows you to provide a suitable column name for the values in the array.
X Y Z value
1 x1 y1 z1 1
2 x2 y1 z1 2
3 x1 y2 z1 3
4 x2 y2 z1 4
5 x1 y3 z1 5
6 x2 y3 z1 6
7 x1 y1 z2 7
8 x2 y1 z2 8
9 x1 y2 z2 9
10 x2 y2 z2 10
11 x1 y3 z2 11
12 x2 y3 z2 12
2.4 General data structures
Both R and Python provide general “catch-all” data structures that can contain data of any shape, type, and amount.
Python
The most general data structures in Python are the list and the tuple. Both lists and tuples are ordered collections of objects called elements. The elements can be other lists/tuples, arrays, integers, objects, etc.
Lists are mutable objects; elements can be reordered or deleted, and new elements can be added after the list has been created. Tuples, on the other hand, are immutable; once a tuple is created it cannot be changed.
Lists are created using square brackets. Here, we create a list and add an element to the list after it is created using the append() method.
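A minimal sketch, assuming a list named `my_list`. Appending changes the existing list in place rather than creating a new one.

```python
my_list = [1, 2, "a", "b", [3, 4, 5]]
print(my_list)

# append() adds one element to the end of the existing list
my_list.append("c")
print(my_list)
```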
[1, 2, 'a', 'b', [3, 4, 5]]
[1, 2, 'a', 'b', [3, 4, 5], 'c']
Tuples are created using parentheses. Here we create a tuple.
(1, 2, 'a', 'b', [3, 4, 5])
Let’s try to use the append function to explore the immutability of the tuple. We expect to get an error.
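The attempt can be sketched as follows, assuming a tuple named `my_tuple`. The error is caught here so that the message can be printed without stopping the script.

```python
my_tuple = (1, 2, "a", "b", [3, 4, 5])

message = None
try:
    my_tuple.append("c")  # tuples have no append method
except AttributeError as err:
    message = str(err)
    print(message)
```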
'tuple' object has no attribute 'append'
We can refer to specific list/tuple elements by using square brackets. In the square brackets we put the index number of the element. The element in the first position is at index 0.
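As a sketch, the same bracket syntax works for both structures. The names `my_list` and `my_tuple` and their contents are assumptions for illustration.

```python
my_list = [1, 2, "a", "b", [3, 4, 5], "c"]
my_tuple = (1, 2, "a", "b", [3, 4, 5])

list_first = my_list[0]    # first element of the list
tuple_first = my_tuple[0]  # first element of the tuple
list_last = my_list[5]     # sixth element of the list
tuple_inner = my_tuple[4]  # the nested list stored in the tuple

print(list_first)
print(tuple_first)
print(repr(list_last))
print(tuple_inner)
```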
1
1
'c'
[3, 4, 5]
R
The most general data structure in R is the list. A list is an ordered collection of objects, which are referred to as the components. The components can be vectors, matrices, arrays, data frames, and other lists. The components are always numbered but can also have names. The results of statistical functions are often returned as lists.
We can create lists with the list() function. The list below contains three components: a vector named “x”, a matrix named “y”, and a data frame named “z”. Notice the m and d objects were created in the two-dimensional data section earlier in this chapter.
$x
[1] 1 2 3
$y
[,1] [,2]
[1,] 1 5
[2,] 3 7
$z
name age
1 Rob 35
2 Cindy 37
We can refer to list components by their order number or name (if present). To use order number, use indexing brackets. Single brackets return a list. Double brackets return the component itself.
$y
[,1] [,2]
[1,] 1 5
[2,] 3 7
[,1] [,2]
[1,] 1 5
[2,] 3 7
Use the $ operator to refer to components by name. This returns the component itself.
[,1] [,2]
[1,] 1 5
[2,] 3 7
Finally, it is worth noting that a data frame is a special case of a list consisting of components with the same length. The is.list() function returns TRUE if an object is a list and FALSE otherwise.
name age
1 Rob 35
2 Cindy 37
'data.frame': 2 obs. of 2 variables:
$ name: chr "Rob" "Cindy"
$ age : num 35 37
[1] TRUE