Chapter 2 Data Structures

This chapter compares and contrasts data structures in Python and R.

2.1 One-dimensional data

A one-dimensional data structure can be visualized as a column in a spreadsheet or as a list of values.

Python

There are many ways to organize one-dimensional data in Python. Three of the most common one-dimensional data structures are lists, NumPy arrays, and pandas Series. All three are ordered and mutable, and they can contain data of different types.

Lists in Python do not need to be explicitly declared; they are indicated by the use of square brackets.

l = [1, 2, 3, 'hello']

Values in lists can be accessed by using square brackets. Python indexing begins at 0, so to extract the first element, we would use the index 0. Python also allows for negative indexing; using an index of -1 will return the last value in the list. Indexing a range in Python is not inclusive of the last index.

# Extract first element
l[0]
1
# Extract last element
l[-1]
'hello'
# Extract second and third elements
l[1:3]
[2, 3]

NumPy arrays, on the other hand, need to be declared using the numpy.array() function, and the NumPy package needs to be imported.

import numpy as np

arr = np.array([1, 2, 3, 'hello'])
print(arr)
['1' '2' '3' 'hello']

Accessing data in a NumPy array is the same as indexing a list.

# Extract first element 
arr[0]
'1'
# Extract last element
arr[-1]
'hello'
# Extract second and third elements
arr[1:3]
array(['2', '3'], dtype='<U11')

pandas Series also need to be declared using the pandas.Series() function. Like NumPy, the pandas package must be imported as well. The pandas package is built on NumPy, so we can input data into a pandas Series using a NumPy array. We can extract data from a Series by indexing, similar to indexing a list or NumPy array.

import pandas as pd 
import numpy as np

data = np.array([1, 2, 3, "hello"])
ser1 = pd.Series(data)
print(ser1)
0        1
1        2
2        3
3    hello
dtype: object
# Extract first element 
ser1[0]
'1'
# Extract second and third elements 
ser1[1:3]
1    2
2    3
dtype: object

Similarly, we can use iloc[] (note the square brackets) to subset elements based on integer position.

# Extract second and third elements 
ser1.iloc[1:3]
1    2
2    3
dtype: object

We can relabel the indices of the Series to whatever we like using the index attribute within the Series() function.

import pandas as pd 
import numpy as np

ser2 = pd.Series(data, index = ['a', 'b', 'c', 'd'])
print(ser2)
a        1
b        2
c        3
d    hello
dtype: object

We can then use our own specified indices to select and index our data. Indexing with our labels can be done in two ways; one approach uses the .loc[] function and is similar to indexing arrays and lists with square brackets; the other approach follows this form: Series.label_name. Note that ranges referenced in loc[] are inclusive on both ends; normally, Python interprets ranges as having open upper ends (e.g., print(["a", "b", "c"][0:2]) returns ["a", "b"].)

# Extract element in row b
ser2.loc["b"]
'2'
# Extract elements from row b to the end
ser2.loc["b":]
b        2
c        3
d    hello
dtype: object
# Extract element in row "d"
ser2.d
'hello'
# Extract element in row "b"
ser2.b
'2'

Mathematical operations cannot be carried out on lists, but they can be carried out on NumPy arrays and pandas Series. In general, lists are better for short data sets that won’t be the targets of mathematical operations. NumPy arrays and pandas Series are better for long data sets and for data sets that will be operated on mathematically.

R

In R a one-dimensional data structure is called a vector. We can create a vector using the c() function. A vector in R can only contain one type of data (all numbers, all strings, etc). The columns of data frames—a data structure discussed in section 2.2 below—are vectors. If multiple types of data are put into a vector, the data will be coerced according to the hierarchy logical < integer < double < complex < character. This means if you mix, say, integers and character data, all the data will be coerced to character.

x1 <- c(23, 43, 55)
x1
[1] 23 43 55
# All values coerced to character
x2 <- c(23, 43, 'hi')
x2
[1] "23" "43" "hi"

Values in a vector can be accessed by position using indexing brackets. R indexes elements of a vector starting at 1. Index values are inclusive. For example, a_vec[2:3] selects the second and third elements of a_vec.

# Extract the second value
x1[2]
[1] 43
# Extract the second and third values
x1[2:3]
[1] 43 55

2.2 Two-dimensional data

Two-dimensional data are rectangular in nature, consisting of rows and columns. These can be the type of data you might find in a spreadsheet with a mix of data types in the columns; they can also be matrices as you might encounter in matrix algebra.

Python

In Python, two common two-dimensional data structures are the NumPy array (introduced above in its one-dimensional form) and the pandas DataFrame.

A two-dimensional NumPy array is made in a similar way to the one-dimensional array using the numpy.array() function.

import numpy as np

arr2d = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
print(arr2d)
[['1' '2' '3' 'hello']
 ['4' '5' '6' 'world']]

Data can be selected from a two-dimensional NumPy array via [row, column] indexing:

import numpy as np

# Extract first element 
arr2d[0, 0]
'1'
# Extract last element 
arr2d[-1, -1]
'world'
# Extract second and third columns (all rows)
arr2d[:, 1:3]
array([['2', '3'],
       ['5', '6']], dtype='<U11')

A pandas DataFrame is made using the pandas.DataFrame() function, and it shares many functional similarities with the pandas Series.

import pandas as pd
import numpy as np

data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data)
print(df)
   0  1  2      3
0  1  2  3  hello
1  4  5  6  world

To select certain rows and columns from a pandas DataFrame based on their integer position, we can use iloc[] (as we did with the pandas Series above). The values inside the brackets following iloc reflect the rows and the columns, respectively.

# Extract first element (first row, first column)
df.loc[0, 0]
'1'
# Extract first row, all columns
df.loc[0, :]
0        1
1        2
2        3
3    hello
Name: 0, dtype: object
# Extract second column, all rows
df.loc[:, 1]
0    2
1    5
Name: 1, dtype: object

As we did with the pandas Series, we can change the indices and the column names of the DataFrame and use those to select certain data. We change the indices using the index attribute in pandas.DataFrame(), and we change the column names using the columns attribute. These new labels can be used with loc[] for subsetting. (And, as mentioned above, if a range is passed to loc[], it’s treated as inclusive on both ends.)

import pandas as pd
import numpy as np

data = np.array([[1, 2, 3, "hello"], [4, 5, 6, "world"]])
df = pd.DataFrame(data, index = ["a", "b"], columns = ["column 1", "column 2", "column 3", "column 4"])
print(df.loc[["a", "b"], "column 1"])
a    1
b    4
Name: column 1, dtype: object

One thing to note is that NumPy arrays can actually have an arbitrary number of dimensions, whereas pandas DataFrames can only have two.

R

Two-dimensional data structures in R include the matrix and data frame. A matrix can contain only one data type. A data frame can contain multiple vectors, each of which can consist of different data types.

Create a matrix with the matrix() function. Create a data frame with the data.frame() function. Most imported data comes into R as a data frame.

# Matrix; populated down by column by default
m <- matrix(data = c(1, 3, 5, 7), nrow = 2, ncol = 2)
m
     [,1] [,2]
[1,]    1    5
[2,]    3    7
# Data frame
d <- data.frame(name = c("Rob", "Cindy"),
                age = c(35, 37))
d
   name age
1   Rob  35
2 Cindy  37

Values in a matrix and data frame can be accessed by position using indexing brackets. The first number(s) refers to rows; the second number(s) refers to columns. Leaving row or column numbers empty selects all rows or columns.

# Extract value in row 1, column 2
m[1, 2]
[1] 5
# Extract values in row 2
d[2, ]
   name age
2 Cindy  37

2.3 Three-dimensional and higher data

Three-dimensional and higher data can be visualized as multiple rectangular structures stratified by extra variables. These are sometimes referred to as arrays. Analysts usually prefer two-dimensional data frames to arrays. Data frames can accommodate multidimensional data by including the additional dimensions as variables.

Python

To create a three-dimensional and higher data structure in Python, we again use a NumPy array. We can think of the three-dimensional array as a stack of two-dimensional arrays. We construct this in the same way as the one- and two-dimensional arrays.

import numpy as np 

arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

We can also construct a three-dimensional NumPy array using the reshape() function on an existing array. The argument of reshape() is where you input your desired dimensions: strata, rows, and then columns. Here, the arange() function is used to create a NumPy array containing the numbers 1 through 12 (to recreate the same array shown above).

arr3d_2 = np.arange(1, 13).reshape(2, 2, 3)
arr3d_2
array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Indexing the three-dimensional array follows the same format as the two-dimensional arrays. Since we can think of the three-dimensional array as a stack of two-dimensional arrays, we can extract each “stacked” two-dimensional array. Here we extract the first of the “stacked” two-dimensional arrays:

# Extract first strata (first "stacked" 2-D array)
arr3d[0]
array([[1, 2, 3],
       [4, 5, 6]])

We can also extract entire rows and columns, or individual array elements:

# Extract first row of second strata (second "stacked" 2-D array)
arr3d[1, 0]
array([7, 8, 9])
# Extract first column of second strata 
arr3d[1, :, 0]
array([ 7, 10])
# Extract the number 6 (first strata, second row, third column)
arr3d[0, 1, 2]
6

The three-dimensional arrays can be converted to two-dimensional arrays again using the reshape function:

arr3d_2d = arr3d.reshape(4, 3)
arr3d_2d
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])

R

The array() function in R can create three-dimensional and higher data structures. Arrays are like vectors and matrices in that they can only contain one data type. In fact matrices and arrays are sometimes described as vectors with instructions on how to layout the data.

We can specify the dimension number and size using the dim argument. Below we specify 2 rows, 3 columns, and 2 strata using a vector: c(2,3,2). This creates a three-dimensional data structure. The data in the example are simply the numbers 1 through 12.

a1 <- array(data = 1:12, dim = c(2, 3, 2))
a1
, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

Values in arrays can be accessed by position using indexing brackets.

# Extract value in row 1, column 2, strata 1
a1[1, 2, 1]
[1] 3
# Extract column 2 in both strata
# Result is returned as matrix
a1[, 2, ]
     [,1] [,2]
[1,]    3    9
[2,]    4   10

The dimensions can be named using the dimnames() function. Notice the names must be a list.

dimnames(a1) <- list("X" = c("x1", "x2"), 
                     "Y" = c("y1", "y2", "y3"), 
                     "Z" = c("z1", "z2"))
a1
, , Z = z1

    Y
X    y1 y2 y3
  x1  1  3  5
  x2  2  4  6

, , Z = z2

    Y
X    y1 y2 y3
  x1  7  9 11
  x2  8 10 12

The as.data.frame.table() function can collapse an array into a two-dimensional structure that may be easier to use with standard statistical and graphical routines. The responseName argument allows you to provide a suitable column name for the values in the array.

as.data.frame.table(a1, responseName = "value")
    X  Y  Z value
1  x1 y1 z1     1
2  x2 y1 z1     2
3  x1 y2 z1     3
4  x2 y2 z1     4
5  x1 y3 z1     5
6  x2 y3 z1     6
7  x1 y1 z2     7
8  x2 y1 z2     8
9  x1 y2 z2     9
10 x2 y2 z2    10
11 x1 y3 z2    11
12 x2 y3 z2    12

2.4 General data structures

Both R and Python provide general “catch-all” data structures that can contain data of any shape, type, and amount.

Python

The most general data structures in Python are the list and the tuple. Both lists and tuples are ordered collections of objects called elements. The elements can be other lists/tuples, arrays, integers, objects, etc.

Lists are mutable objects; elements can be reordered or deleted, and new elements can be added after the list has been created. Tuples, on the other hand, are immutable; once a tuple is created it cannot be changed.

Lists are created using square brackets. Here, we create a list and add an element to the list after it is created using the append function.

lst = [1, 2, 'a', 'b', [3, 4, 5]]
lst 
[1, 2, 'a', 'b', [3, 4, 5]]
lst.append('c')
lst
[1, 2, 'a', 'b', [3, 4, 5], 'c']

Tuples are created using parenthesis. Here we create a tuple.

tuple = (1, 2, 'a','b', [3, 4, 5])
tuple 
(1, 2, 'a', 'b', [3, 4, 5])

Let’s try to use the append function to explore the immutability of the tuple. We expect to get an error.

tuple.append('c')
'tuple' object has no attribute 'append'

We can refer to specific list/tuple elements by using square brackets. In the square brackets we put the index number of the element. The element in the first position is at index 0.

# Extract the first element of the list and the tuple
lst[0]
1
tuple[0]
1
# Extract the last element of each 
lst[-1]
'c'
tuple[-1]
[3, 4, 5]

R

The most general data structure in R is the list. A list is an ordered collection of objects, which are referred to as the components. The components can be vectors, matrices, arrays, data frames, and other lists. The components are always numbered but can also have names. The results of statistical functions are often returned as lists.

We can create lists with the list() function. The list below contains three components: a vector named “x”, a matrix named “y”, and a data frame named “z”. Notice the m and d objects were created in the two-dimensional data section earlier in this chapter.

l <- list(x = c(1, 2, 3),
          y = m,
          z = d)
l
$x
[1] 1 2 3

$y
     [,1] [,2]
[1,]    1    5
[2,]    3    7

$z
   name age
1   Rob  35
2 Cindy  37

We can refer to list components by their order number or name (if present). To use order number, use indexing brackets. Single brackets returns a list. Double brackets return the component itself.

# Second element returned as list
l[2]
$y
     [,1] [,2]
[1,]    1    5
[2,]    3    7
# Second element returned as itself (matrix)
l[[2]]
     [,1] [,2]
[1,]    1    5
[2,]    3    7

Use the $ operator to refer to components by name. This returns the component itself.

l$y
     [,1] [,2]
[1,]    1    5
[2,]    3    7

Finally, it is worth noting that a data frame is a special case of a list consisting of components with the same length. The is.list() function returns TRUE if an object is a list and FALSE otherwise.

# Object d is data frame
d
   name age
1   Rob  35
2 Cindy  37
str(d)
'data.frame':   2 obs. of  2 variables:
 $ name: chr  "Rob" "Cindy"
 $ age : num  35 37
# ...And a data frame is a list
is.list(d)
[1] TRUE