Chapter 4 Data Manipulation

This chapter discusses strategies for filtering, selecting, modifying, and deriving variables in data. Unless otherwise stated, examples are for DataFrames (Python) and data frames (R) and use the mtcars data frame that is included with R.

# Python
import pandas
mtcars = pandas.read_csv('data/mtcars.csv')
# R
data(mtcars)
# Drop row names to match Python version of data
rownames(mtcars) <- NULL

4.1 View variable names and types

View and inspect the names of variables and their types (numeric, string, logical, etc.).

Python

The .info() function in pandas returns information about a DataFrame.

Setting the argument verbose to True prints the names of the columns, their lengths excluding NULL values, and their data types (dtype) in a table. The function lists the unique data types in the DataFrame, and it prints the amount of memory the DataFrame occupies.

mtcars.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 2.9 KB

Setting verbose to False excludes the table describing each column.

mtcars.info(verbose = False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Columns: 11 entries, mpg to carb
dtypes: float64(5), int64(6)
memory usage: 2.9 KB

If a DataFrame has 100 or fewer columns, the verbose argument defaults to True.

R

The str() function in R lists the names of the variables, their types, the first few values of each, and the object dimensions.

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

To see just the names of the data frame, use the names() function.

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

To see just the dimensions of the data frame, use the dim() function. It returns the number of rows and columns, respectively.

dim(mtcars)
[1] 32 11

4.2 Select variables

How to select specific columns of data frames.

Python

The period operator . provides access to a column in a DataFrame as a vector. This returns a pandas Series. A pandas Series can do everything a NumPy array can do.

mtcars.mpg
0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64

Indexing also provides access to columns as a pandas Series. Single and double quotations both work.

mtcars['mpg']
0     21.0
1     21.0
2     22.8
3     21.4
4     18.7
5     18.1
6     14.3
7     24.4
8     22.8
9     19.2
10    17.8
11    16.4
12    17.3
13    15.2
14    10.4
15    10.4
16    14.7
17    32.4
18    30.4
19    33.9
20    21.5
21    15.5
22    15.2
23    13.3
24    19.2
25    27.3
26    26.0
27    30.4
28    15.8
29    19.7
30    15.0
31    21.4
Name: mpg, dtype: float64

Operations on NumPy arrays are faster than operations on pandas Series. But in many cases, using pandas Series is fine performance-wise. This consideration is primarily important for large data sets on which many operations are performed. The .values function returns a NumPy array from a pandas DataFrame.

mtcars['mpg'].values
array([21. , 21. , 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26. , 30.4, 15.8, 19.7, 15. , 21.4])

Double indexing returns a pandas DataFrame instead of a NumPy array or pandas Series.

mtcars[['mpg']]
     mpg
0   21.0
1   21.0
2   22.8
3   21.4
4   18.7
5   18.1
6   14.3
7   24.4
8   22.8
9   19.2
10  17.8
11  16.4
12  17.3
13  15.2
14  10.4
15  10.4
16  14.7
17  32.4
18  30.4
19  33.9
20  21.5
21  15.5
22  15.2
23  13.3
24  19.2
25  27.3
26  26.0
27  30.4
28  15.8
29  19.7
30  15.0
31  21.4

The head() and tail() functions return the first five or last five values, respectively. Use the n argument to change the number of values. This function works on NumPy arrays, pandas Series and pandas DataFrames.

# First six values
mtcars.mpg.head()
0    21.0
1    21.0
2    22.8
3    21.4
4    18.7
Name: mpg, dtype: float64
# Last row of DataFrame
mtcars.tail(n = 1)
     mpg  cyl   disp   hp  drat    wt  qsec  vs  am  gear  carb
31  21.4    4  121.0  109  4.11  2.78  18.6   1   1     4     2

R

The dollar sign operator, $, provides access to a column in a data frame as a vector.

mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Double-indexing brackets also provide access to columns as a vector.

mtcars[["mpg"]]
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Single-indexing brackets work as well, but they return a data frame instead of a vector (if used with a data frame).

mtcars["mpg"]
    mpg
1  21.0
2  21.0
3  22.8
4  21.4
5  18.7
6  18.1
7  14.3
8  24.4
9  22.8
10 19.2
11 17.8
12 16.4
13 17.3
14 15.2
15 10.4
16 10.4
17 14.7
18 32.4
19 30.4
20 33.9
21 21.5
22 15.5
23 15.2
24 13.3
25 19.2
26 27.3
27 26.0
28 30.4
29 15.8
30 19.7
31 15.0
32 21.4

Single-indexing brackets also allow selection of rows when used with a comma. The syntax is rows, columns

# First three rows
mtcars[1:3, "mpg"]
[1] 21.0 21.0 22.8

Finally single-indexing brackets allow us to select multiple columns. Request columns either by name or position using a vector.

mtcars[c("mpg", "cyl")] 
    mpg cyl
1  21.0   6
2  21.0   6
3  22.8   4
4  21.4   6
5  18.7   8
6  18.1   6
7  14.3   8
8  24.4   4
9  22.8   4
10 19.2   6
11 17.8   6
12 16.4   8
13 17.3   8
14 15.2   8
15 10.4   8
16 10.4   8
17 14.7   8
18 32.4   4
19 30.4   4
20 33.9   4
21 21.5   4
22 15.5   8
23 15.2   8
24 13.3   8
25 19.2   8
26 27.3   4
27 26.0   4
28 30.4   4
29 15.8   8
30 19.7   6
31 15.0   8
32 21.4   4
# Equivalent to mtcars[1:2] 

The head() and tail() functions return the first six or last six values, respectively. Use the n argument to change the number of values. These functions work with vectors, data frames, matrices, and more.

# First six values
head(mtcars$mpg)
[1] 21.0 21.0 22.8 21.4 18.7 18.1
# Last row of data frame
tail(mtcars, n = 1)
    mpg cyl disp  hp drat   wt qsec vs am gear carb
32 21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

4.3 Filter/Subset variables

Selecting rows of a rectangular data set that meet certain conditions.

Python

We can filter rows of a DataFrame based on subsetting conditions; the data type returned depends on the filtration method.

The following example returns a DataFrame, not a Series, as there is more than one column selected from the DataFrame. Use a list, square brackets [], to subset more than one column.

mtcars[mtcars["mpg"] > 30][["mpg", "cyl"]]
     mpg  cyl
17  32.4    4
18  30.4    4
19  33.9    4
27  30.4    4

Both pandas Series and NumPy arrays can be used for fast performance and vector operations. Many functions require a vector as input.

The following code returns one column, mpg, as a pandas Series. A pandas Series is one column from a pandas DataFrame.

mtcars[mtcars["mpg"] > 30]["mpg"]
17    32.4
18    30.4
19    33.9
27    30.4
Name: mpg, dtype: float64

The following code also returns a pandas Series, but it uses the . operator to select a column, rather than square brackets [].

mtcars[mtcars["mpg"] > 30].mpg
17    32.4
18    30.4
19    33.9
27    30.4
Name: mpg, dtype: float64

Both of the following lines of code return NumPy arrays using the .values function. df1 is one dimension, reflecting just the mpg column, and df2 is two dimensions, reflecting the mpg and cyl columns.

df1 = mtcars[mtcars["mpg"] > 30]["mpg"].values
df2 = mtcars[mtcars["mpg"] > 30][["mpg", "cyl"]].values

You can also filter using multiple row conditions:

mtcars[mtcars["mpg"] > 30][mtcars["hp"] < 66]
     mpg  cyl  disp  hp  drat     wt   qsec  vs  am  gear  carb
18  30.4    4  75.7  52  4.93  1.615  18.52   1   1     4     2
19  33.9    4  71.1  65  4.22  1.835  19.90   1   1     4     1

<string>:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

R

In base R, we can use subsetting brackets or the subset() function to select rows based on some condition. Below, we demonstrate both approaches, returning only those rows with mpg greater than 30. We begin with subsetting brackets, which take three arguments:

  1. i: the condition to subset on.
  2. j: the columns to show. If none specified, all columns are returned
  3. drop: an optional logical argument (TRUE/FALSE) to determine whether or not to coerce the output to the lowest possible dimension. The default is TRUE.

We rarely type the first two argument names, i and j, when using subsetting brackets.

This example returns only the rows with mpg > 30 and all columns. Notice we need to preface mpg with mtcars$ to tell R where to find the “mpg” column and that we need to provide a comma after the condition.

mtcars[mtcars$mpg > 30, ]
    mpg cyl disp  hp drat    wt  qsec vs am gear carb
18 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
19 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
20 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
28 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

We can select the columns to return in the second argument as a vector. Notice that we only need to specify the column names as a character vector. We can also use numbers corresponding to the column positions, as well as conditional statements.

mtcars[mtcars$mpg > 30, c("mpg", "wt", "gear")]
    mpg    wt gear
18 32.4 2.200    4
19 30.4 1.615    4
20 33.9 1.835    4
28 30.4 1.513    5

Show the first three columns:

mtcars[mtcars$mpg > 30, 1:3]
    mpg cyl disp
18 32.4   4 78.7
19 30.4   4 75.7
20 33.9   4 71.1
28 30.4   4 95.1

Show columns with names consisting of only two characters. The nchar() function counts the number of characters in a string. The expression nchar(names(mtcars)) == 2 returns a vector of TRUE/FALSE values where TRUE indicates the column name is only two characters in length.

mtcars[mtcars$mpg > 30, nchar(names(mtcars)) == 2]
    hp    wt vs am
18  66 2.200  1  1
19  52 1.615  1  1
20  65 1.835  1  1
28 113 1.513  1  1

Notice that when we specify only one column, the brackets return a vector.

mtcars[mtcars$mpg > 30, "mpg"]
[1] 32.4 30.4 33.9 30.4

To get a data frame, set the drop argument to FALSE.

mtcars[mtcars$mpg > 30, "mpg", drop = FALSE]
    mpg
18 32.4
19 30.4
20 33.9
28 30.4

The subset() function allows us to refer to column names without using the $ extractor function or quoting column names. It also has a drop argument, but its default is FALSE. It has four arguments:

  1. x: the data frame to subset.
  2. subset: the condition to subset on.
  3. select: the columns to select.
  4. drop: an optional logical argument (TRUE/FALSE) to determine whether or not to coerce the output to the lowest possible dimension. The default is FALSE.

We rarely type the first three argument names, x, subset and select, when using subset().

Below we replicate the previous examples using subset().

# Return rows where mpg > 30 and all columns
subset(mtcars, mpg > 30)
    mpg cyl disp  hp drat    wt  qsec vs am gear carb
18 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
19 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
20 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
28 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
# Return rows where mpg > 30 and the mpg, wt, and gear columns
subset(mtcars, mpg > 30, c(mpg, wt, gear))
    mpg    wt gear
18 32.4 2.200    4
19 30.4 1.615    4
20 33.9 1.835    4
28 30.4 1.513    5
# Return rows where mpg > 30 and the first three columns
subset(mtcars, mpg > 30, 1:3)
    mpg cyl disp
18 32.4   4 78.7
19 30.4   4 75.7
20 33.9   4 71.1
28 30.4   4 95.1
# Return rows where mpg > 30 and columns consisting of two characters
subset(mtcars, mpg > 30, nchar(names(mtcars)) == 2)
    hp    wt vs am
18  66 2.200  1  1
19  52 1.615  1  1
20  65 1.835  1  1
28 113 1.513  1  1
# Return rows where mpg > 30 and mpg column, as a vector
subset(mtcars, mpg > 30, mpg, drop = TRUE)
[1] 32.4 30.4 33.9 30.4
# Return rows where mpg > 30 and mpg column, as a data frame
subset(mtcars, mpg > 30, mpg)
    mpg
18 32.4
19 30.4
20 33.9
28 30.4

Another difference between subsetting brackets and the subset() function is how they handle missing values. Subsetting brackets return missing values while subset() does not. We demonstrate below with a toy data frame. Notice the x column has a missing value.

dframe <- data.frame(x = c(1, NA, 5), 
                     y = c(12, 21, 34))
dframe
   x  y
1  1 12
2 NA 21
3  5 34

When we condition on x < 3, the subsetting bracket approach returns a row with NA values.

dframe[dframe$x < 3, ]
    x  y
1   1 12
NA NA NA

The subset() approach ignores the missing value.

subset(dframe, x < 3)
  x  y
1 1 12

To replicate the subset() result with the subsetting brackets, we need to include an additional condition to only show rows where x is not missing. We can do that with the is.na() function. The is.na() function returns TRUE if a value is missing and FALSE otherwise. If we preface with !, we get TRUE if a value is not missing and FALSE otherwise.

dframe[dframe$x < 3 & !is.na(dframe$x),]
  x  y
1 1 12

See also the filter() function in the dplyr package and the enhanced subsetting brackets in the data.table package.

4.4 Rename variables

Python

Column names can be changed using the function .rename(). Below, we change the column names cyl and wt to cylinder and WT, respectively.

mtcars.rename(columns = {"cyl":"cylinder", "wt":"WT"})
     mpg  cylinder   disp   hp  drat     WT   qsec  vs  am  gear  carb
0   21.0         6  160.0  110  3.90  2.620  16.46   0   1     4     4
1   21.0         6  160.0  110  3.90  2.875  17.02   0   1     4     4
2   22.8         4  108.0   93  3.85  2.320  18.61   1   1     4     1
3   21.4         6  258.0  110  3.08  3.215  19.44   1   0     3     1
4   18.7         8  360.0  175  3.15  3.440  17.02   0   0     3     2
5   18.1         6  225.0  105  2.76  3.460  20.22   1   0     3     1
6   14.3         8  360.0  245  3.21  3.570  15.84   0   0     3     4
7   24.4         4  146.7   62  3.69  3.190  20.00   1   0     4     2
8   22.8         4  140.8   95  3.92  3.150  22.90   1   0     4     2
9   19.2         6  167.6  123  3.92  3.440  18.30   1   0     4     4
10  17.8         6  167.6  123  3.92  3.440  18.90   1   0     4     4
11  16.4         8  275.8  180  3.07  4.070  17.40   0   0     3     3
12  17.3         8  275.8  180  3.07  3.730  17.60   0   0     3     3
13  15.2         8  275.8  180  3.07  3.780  18.00   0   0     3     3
14  10.4         8  472.0  205  2.93  5.250  17.98   0   0     3     4
15  10.4         8  460.0  215  3.00  5.424  17.82   0   0     3     4
16  14.7         8  440.0  230  3.23  5.345  17.42   0   0     3     4
17  32.4         4   78.7   66  4.08  2.200  19.47   1   1     4     1
18  30.4         4   75.7   52  4.93  1.615  18.52   1   1     4     2
19  33.9         4   71.1   65  4.22  1.835  19.90   1   1     4     1
20  21.5         4  120.1   97  3.70  2.465  20.01   1   0     3     1
21  15.5         8  318.0  150  2.76  3.520  16.87   0   0     3     2
22  15.2         8  304.0  150  3.15  3.435  17.30   0   0     3     2
23  13.3         8  350.0  245  3.73  3.840  15.41   0   0     3     4
24  19.2         8  400.0  175  3.08  3.845  17.05   0   0     3     2
25  27.3         4   79.0   66  4.08  1.935  18.90   1   1     4     1
26  26.0         4  120.3   91  4.43  2.140  16.70   0   1     5     2
27  30.4         4   95.1  113  3.77  1.513  16.90   1   1     5     2
28  15.8         8  351.0  264  4.22  3.170  14.50   0   1     5     4
29  19.7         6  145.0  175  3.62  2.770  15.50   0   1     5     6
30  15.0         8  301.0  335  3.54  3.570  14.60   0   1     5     8
31  21.4         4  121.0  109  4.11  2.780  18.60   1   1     4     2

Alternatively, column names can be changed by replacing the vector of column names with a new vector. Below, we create a vector of columns that replaces drat with axle_ratio using conditional match and that replaces disp with DISP using indexing.

column_names = mtcars.columns.values

# Using conditional match
column_names[column_names == "drat"] = "axle_ratio"

# Using indexing
column_names[2] = "DISP"

mtcars.columns = column_names
mtcars.columns
Index(['mpg', 'cyl', 'DISP', 'hp', 'axle_ratio', 'wt', 'qsec', 'vs', 'am',
       'gear', 'carb'],
      dtype='object')

R

Variable names can be changed by referencing their indexes (i.e., their positions among the columns in a data frame). Below the second column is cyl. We change the name to cylinders.

names(mtcars)[2]
[1] "cyl"
names(mtcars)[2] <- "cylinders"
names(mtcars)
 [1] "mpg"       "cylinders" "disp"      "hp"        "drat"      "wt"       
 [7] "qsec"      "vs"        "am"        "gear"      "carb"     

Variable names can also be changed by conditional match. Below we find the variable name that matches drat and change it to axle_ratio.

names(mtcars)[names(mtcars) == "drat"]
[1] "drat"
names(mtcars)[names(mtcars) == "drat"] <- "axle_ratio"
names(mtcars)
 [1] "mpg"        "cylinders"  "disp"       "hp"         "axle_ratio"
 [6] "wt"         "qsec"       "vs"         "am"         "gear"      
[11] "carb"      

More than one variable name can be changed using a vector of positions or matches.

names(mtcars)[c(6,8)] <- c("weight", "engine")
# Alternatively: names(mtcars)[names(mtcars) %in% c("wt", "vs")] <- c("weight", "engine")

names(mtcars)
 [1] "mpg"        "cylinders"  "disp"       "hp"         "axle_ratio"
 [6] "weight"     "qsec"       "engine"     "am"         "gear"      
[11] "carb"      

See also the rename() function in the dplyr package.

4.5 Create, replace and remove variables

We often need to create variables that are functions of other variables, or we need to replace existing variables with updated versions thereof.

Python

Adding a new variable via indexing notation and assigning a result to it adds a new column.

# Add column for kilometer per liter
mtcars['kpl'] = mtcars.mpg / 2.352

Doing the same with an existing column updates the values in the column.

# Update to liters per 100 kilometers
mtcars['kpl'] = 100 / mtcars.kpl 

Alternatively, the . notation can be used to update the values in a column.

# Update to liters per 50 Kilometers
mtcars.kpl = 50 / mtcars.kpl 

To remove a column, use the .drop() function.

# Drop the kpl variable
mtcars.drop(columns = ['kpl'])
     mpg  cyl   DISP   hp  axle_ratio     wt   qsec  vs  am  gear  carb
0   21.0    6  160.0  110        3.90  2.620  16.46   0   1     4     4
1   21.0    6  160.0  110        3.90  2.875  17.02   0   1     4     4
2   22.8    4  108.0   93        3.85  2.320  18.61   1   1     4     1
3   21.4    6  258.0  110        3.08  3.215  19.44   1   0     3     1
4   18.7    8  360.0  175        3.15  3.440  17.02   0   0     3     2
5   18.1    6  225.0  105        2.76  3.460  20.22   1   0     3     1
6   14.3    8  360.0  245        3.21  3.570  15.84   0   0     3     4
7   24.4    4  146.7   62        3.69  3.190  20.00   1   0     4     2
8   22.8    4  140.8   95        3.92  3.150  22.90   1   0     4     2
9   19.2    6  167.6  123        3.92  3.440  18.30   1   0     4     4
10  17.8    6  167.6  123        3.92  3.440  18.90   1   0     4     4
11  16.4    8  275.8  180        3.07  4.070  17.40   0   0     3     3
12  17.3    8  275.8  180        3.07  3.730  17.60   0   0     3     3
13  15.2    8  275.8  180        3.07  3.780  18.00   0   0     3     3
14  10.4    8  472.0  205        2.93  5.250  17.98   0   0     3     4
15  10.4    8  460.0  215        3.00  5.424  17.82   0   0     3     4
16  14.7    8  440.0  230        3.23  5.345  17.42   0   0     3     4
17  32.4    4   78.7   66        4.08  2.200  19.47   1   1     4     1
18  30.4    4   75.7   52        4.93  1.615  18.52   1   1     4     2
19  33.9    4   71.1   65        4.22  1.835  19.90   1   1     4     1
20  21.5    4  120.1   97        3.70  2.465  20.01   1   0     3     1
21  15.5    8  318.0  150        2.76  3.520  16.87   0   0     3     2
22  15.2    8  304.0  150        3.15  3.435  17.30   0   0     3     2
23  13.3    8  350.0  245        3.73  3.840  15.41   0   0     3     4
24  19.2    8  400.0  175        3.08  3.845  17.05   0   0     3     2
25  27.3    4   79.0   66        4.08  1.935  18.90   1   1     4     1
26  26.0    4  120.3   91        4.43  2.140  16.70   0   1     5     2
27  30.4    4   95.1  113        3.77  1.513  16.90   1   1     5     2
28  15.8    8  351.0  264        4.22  3.170  14.50   0   1     5     4
29  19.7    6  145.0  175        3.62  2.770  15.50   0   1     5     6
30  15.0    8  301.0  335        3.54  3.570  14.60   0   1     5     8
31  21.4    4  121.0  109        4.11  2.780  18.60   1   1     4     2

R

Adding a new variable name after the dollar sign notation and assigning a result to it adds a new column.

# Add column for kilometer per liter
mtcars$kpl <- mtcars$mpg / 2.352

Doing the same with an existing variable updates the values in a column.

# Update to liters per 100 Kilometers
mtcars$kpl <- 100 / mtcars$kpl 

To remove a variable, assign it NULL.

# Drop the kpl variable
mtcars$kpl <- NULL

See also the mutate() function in the dplyr package.

4.6 Create strings from numbers

You may have data that is numeric but that needs to be treated as a string.

Python

You can change the data type of a column in a DataFrame using the astype function.

mtcars['am'] = mtcars['am'].astype(str)
type(mtcars.am[0]) # check the type of the first item in am column
<class 'str'>

A potential number-to-string conversion task in Python might be formatting 5-digit American ZIP Codes. Some zip codes begin with 0, but if stored as a numeric value, the 0 is dropped. For example, consider the following pandas DataFrame. Notice that the leading 0 is dropped from two of the ZIP Codes.

zc = pandas.read_csv('data/zc.csv')
print(zc)
  state    zip
0    VT   5001
1    VA  22901
2    NH   3282

One way to fix this is using the string zfill() method. First, we convert the numeric column to string type using the method we just demonstrated. Then we access the zip column using zc.zip and the zfill() method using str.zfill with the width parameter set to 5. This pads the string with 0 on the left as necessary to make each value five characters wide.

zc['zip'] = zc['zip'].astype(str)
zc['zip'] = zc.zip.str.zfill(5)
print(zc)
  state    zip
0    VT  05001
1    VA  22901
2    NH  03282

If we knew we were importing zip codes using read_csv, we could also use the dtype argument to specify which storage type to use for the zip column. Below, we pass a dictionary that maps the str type to the zip column. The result is a properly formatted ZIP Code column.

zc = pandas.read_csv('data/zc.csv', dtype = {'zip':'str'})
print(zc)
  state    zip
0    VT  05001
1    VA  22901
2    NH  03282

R

The as.character() function takes a vector and converts it to string format.

head(mtcars$am)
[1] 1 1 1 0 0 0
head(as.character(mtcars$am))
[1] "1" "1" "1" "0" "0" "0"

To save the results of this process—this conversion—we need to assign the result to the data frame.

# Add new string variable am_ch
mtcars$am_ch <- as.character(mtcars$am)
head(mtcars$am_ch)
[1] "1" "1" "1" "0" "0" "0"

The factor() function can also be used to convert a numeric vector into a categorical variable. However, the result is not exactly a string: A factor is made of integers with character labels. Factors are useful for character data that have a fixed set of levels (e.g., “grade 1,” grade 2,” etc.).

# Convert to factor
head(mtcars$am)
[1] 1 1 1 0 0 0
head(factor(mtcars$am))
[1] 1 1 1 0 0 0
Levels: 0 1
# Convert to factor with labels
head(factor(mtcars$am, labels = c("automatic", "manual")))
[1] manual    manual    manual    automatic automatic automatic
Levels: automatic manual

As before, to save the results of this conversion to the data frame, we need to assign the results to a variable:

# Create factor variable am_fac
mtcars$am_fac <- factor(mtcars$am, labels = c("automatic", "manual"))
head(mtcars$am_fac)
[1] manual    manual    manual    automatic automatic automatic
Levels: automatic manual

A common number-to-string conversion task in R is formatting 5-digit American ZIP Codes as demonstrated above in Python. Some zip codes begin with 0, but if stored as a numeric value, the 0 is dropped.

zip_codes <- c(03766, 03748, 22901, 03264)
zip_codes
[1]  3766  3748 22901  3264

We need to store the zip code as a character value so the 0 is preserved. One way to do this is via the sprintf() function in base R. The first argument is the format string or conversion specification. A conversion specification begins with %. The following 0 and 5 says to format the zip_codes vector as a 5-digit string padded by zeroes on the left. The final i says we’re working with integer values.

sprintf("%05i", zip_codes)
[1] "03766" "03748" "22901" "03264"

See also the str_pad() function in the stringr package.

4.7 Create numbers from strings

String variables that ought to be numbers usually have some character data contained in the values, such as units (e.g., “4 cm”). To create numbers from strings, it’s important to remove any character data that cannot be converted to a number.

Python

The astype(float) or astype(int) function will coerce strings to numerical representation.

For demonstration, let’s say we have the following NumPy array:

import numpy as np
weight = np.array(["125 lbs.", "132 lbs.", "156 lbs."])

The astype(float) function throws an error due to the presence of strings. The astype() function is for NumPy arrays.

try:
  weight.astype(float)
except ValueError:
  print("ValueError: could not convert string to float: '125 lbs.'")
ValueError: could not convert string to float: '125 lbs.'

One way to approach this is to first remove the strings from the objects and then use astype(float). Below we use the strip() function to find ” lbs.” using a list comprehension.

# [] indicates a list in python
# np.array() changes the list back into an array
weight = np.array([w.strip(" lbs.") for w in weight])

Now we can use the astype() function to change the elements in weight from str to float.

weight.astype(float)
array([125., 132., 156.])

R

The as.numeric() function will attempt to coerce strings to numeric type if possible. Any non-numeric values are coerced to NA.

For demonstration, say we have the following vector:

weight <- c("125 lbs.", "132 lbs.", "156 lbs.")

The as.numeric() function returns all NA due to presence of character data.

as.numeric(weight)
Warning: NAs introduced by coercion
[1] NA NA NA

There are many ways to approach this. A common approach is to first remove the characters and then use as.numeric(). Below, we use the gsub() function to find “lbs.” and replace it with nothing (find-and-replace procedures are discussed more below).

weightN <- gsub(" lbs.", "", weight)
as.numeric(weightN)
[1] 125 132 156

The parse_number() function in the readr package can often take care of these situations automatically.

readr::parse_number(weight)
[1] 125 132 156

4.8 Combine strings

String concatenation—turning “Jane” and “Smith” into “Jane Smith”—is easily done in both languages.

Python

The + operator can combine strings in Python.

species = 'yellow-bellied sea snake'
tail_shape = 'paddle-shaped'

statement = 'The ' + species + ' has a ' + tail_shape + ' tail that helps it swim.'
print(statement)
The yellow-bellied sea snake has a paddle-shaped tail that helps it swim.

R

The paste() and paste0() functions combine strings in R. The former concatenates strings and places spaces between them; the latter concatenates sans spaces.

species <- 'rainbow boa'
appearance <- 'iridescent'
location <- 'Central and South America'

statement1 <- paste('The', species, 'has an', appearance, 'sheen.')
statement1
[1] "The rainbow boa has an iridescent sheen."
# Note that spaces must be provided explicitly when using paste0()
statement2 <- paste0('The ', species, ' is found in ', location)
statement2
[1] "The rainbow boa is found in Central and South America"

4.9 Finding and replacing patterns within strings

This section reviews key functions in Python and R for finding and replacing character patterns. The functions we discuss can search for fixed character patterns (e.g., “Meredith Rollins” to case-sensitively match that name and that name alone) or regular expression (regex) patterns (e.g., \w+ to capture all instances of >=1 alphabetic character). Note that in R, meta characters like w (to match word characters) and d (to match digits) are escaped with two backslashes (e.g., \\w and \\d). In Python, regex patterns are generally headed by r, which allows meta characters in the regex itself to be escaped with just one \ (e.g., r"\w+"). Regex is an enormous topic, and we don’t discuss it at any length here, but you can learn more about regular expressions—and how they’re implemented in different programming languages—at these websites: https://www.regular-expressions.info/; https://regexone.com/

Python

The re module provides a set of functions for searching and manipulating strings. The search() function does exactly as its name suggests: It identifies matches for a fixed or regex character pattern in a string. sub() searches for and replaces character patterns (fixed or regex). The count argument in sub() allows a user to specify how many instances of the matched pattern they want to to replace; e.g., use count = 1 to replace just the first instance of a match.

import re
statement = 'Pencils with an HB graphite grade are commonly used for writing. An HB pencil is approximately equal to a #2 pencil.'

# Search for "HB" using fixed and regex patterns
search_result1 = re.search(pattern = "HB", string = statement)
print(search_result1)
<re.Match object; span=(16, 18), match='HB'>
search_result2 = re.search(pattern = r"[H,B]{2}", string = statement)
print(search_result2)
<re.Match object; span=(16, 18), match='HB'>
# Replace all instances of "HB"
all_replaced = re.sub(pattern = 'HB', repl = 'HB (hard black)', string = statement)
print(all_replaced)
Pencils with an HB (hard black) graphite grade are commonly used for writing. An HB (hard black) pencil is approximately equal to a #2 pencil.
# Replace just the first instance of HB
one_replaced = re.sub(pattern = 'HB', repl = 'HB (hard black)', string = statement, count = 1)
print(one_replaced)
Pencils with an HB (hard black) graphite grade are commonly used for writing. An HB pencil is approximately equal to a #2 pencil.
# Search and replace using a regex pattern instead of a fixed string
# (Replace a space after a period with a line break)
regex_replaced = re.sub(pattern = r'(?<=\.)\s{1}', repl = '\n', string = statement)
print(regex_replaced)
Pencils with an HB graphite grade are commonly used for writing.
An HB pencil is approximately equal to a #2 pencil.

R

The standard-issue string-search function in base R is grep(); it returns the index of the elements in a set of one or more strings for which a pattern match was found. (grepl() acts similarly but returns a vector of TRUE/FALSE indicating whether a match was found in each string passed to the function.) The functions sub() and gsub() can be used to find and replace instances of a pattern: The former replaces just the first instance; the latter replaces all instances. The search pattern can be provided as a raw character string or as a regular expression.

statements <- c('Great Pencil Co. primarily sells pencils of the following grades: HB; B; and 3B.',
                'Great Pencil Co. has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades.')

# Search for a pattern and return the indexes of elements for which a match is found
grep(pattern = 'pencil', x = statements, fixed = TRUE) # use `fixed = T` to search for exactly the declared string: "pencil" in all lowercase, which is only found in one string
[1] 1
grep(pattern = '(?i)pencil', x = statements) # using regex allows for flexible searches, such as a case-insensitive search for "P/pencil", which, unlike "pencil" is found in in both strings
[1] 1 2
# Replace the first instance of a pattern (Co. --> Company)
revised <- sub(pattern = 'Co\\.', replacement = 'Company', x = statements)
revised
[1] "Great Pencil Company primarily sells pencils of the following grades: HB; B; and 3B."                            
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."
# Replace all instances of a pattern (; --> ,)
revised2 <- gsub(pattern = ';', replacement = ',', x = revised)
revised2
[1] "Great Pencil Company primarily sells pencils of the following grades: HB, B, and 3B."                            
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."
# Find and replace a pattern using regex (3B --> 2B)
final <- sub(pattern = '\\d{1}', replacement = '2', x = revised2)
final
[1] "Great Pencil Company primarily sells pencils of the following grades: HB, B, and 2B."                            
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."

Those functions can be used to trim excess (or all) white space in character strings.

spaced_string <- c('This      string    started  out with too    many    spaces.')
# Replace all instances of >=2 spaces with single spaces
gsub(pattern = '\\s{2,}', replacement = ' ', x = spaced_string)
[1] "This string started out with too many spaces."
# Remove all white space
collapse_these <- c('9:00 - 10:15', '10:15 - 11:30', '11:30 - 12:00')
gsub(pattern = '\\s', replacement = '', x = collapse_these)
[1] "9:00-10:15"  "10:15-11:30" "11:30-12:00"

The package stringi also provides an array of string-search and string-manipulation functions, including stri_detect(), stri_replace(), and stri_extract(), all of which easily handle fixed and regex search patterns. For example:

library(stringi)
user_dat <- data.frame(name = c('Shire, Jane E', 'Winchester, Marcus L', 'Fox, Sal'),
                       id_number = c('aaa101', 'aaa102', 'aaa103'))
user_dat
                  name id_number
1        Shire, Jane E    aaa101
2 Winchester, Marcus L    aaa102
3             Fox, Sal    aaa103
# Say we want to to eliminate the 'aaa' patterns from the user IDs and then add
# middle initials---for those users who have them---to the data frame
user_dat$id_number <- stri_replace(user_dat$id_number, regex = '^\\w{3}(?=\\d+)', replacement = '') # alternatively: stri_extract(user_dat$id_number, regex = '(?<=\\w{3})\\d+$')

user_dat$middle_initial <- stri_extract(user_dat$name, regex = '\\b\\w{1}$')
user_dat
                  name id_number middle_initial
1        Shire, Jane E       101              E
2 Winchester, Marcus L       102              L
3             Fox, Sal       103           <NA>

4.10 Change case

How to change the case of strings. The most common case transformations are lower case, upper case, and title case.

Python

The lower(), upper(), and title() functions convert strings to lower, upper, and title case, respectively. We can use a list comprehension to apply these functions to each string in a list.

col_names = [col.upper() for col in mtcars.columns]
mtcars.columns = col_names

R

The tolower() and toupper() functions convert case to lower and upper, respectively.

names(mtcars) <- toupper(names(mtcars))
names(mtcars)
 [1] "MPG"        "CYLINDERS"  "DISP"       "HP"         "AXLE_RATIO"
 [6] "WEIGHT"     "QSEC"       "ENGINE"     "AM"         "GEAR"      
[11] "CARB"       "AM_CH"      "AM_FAC"    
names(mtcars) <- tolower(names(mtcars))
names(mtcars)
 [1] "mpg"        "cylinders"  "disp"       "hp"         "axle_ratio"
 [6] "weight"     "qsec"       "engine"     "am"         "gear"      
[11] "carb"       "am_ch"      "am_fac"    

The stringr package provides a convenient title-case conversion function, str_to_title(), which capitalizes the first letter of each string.

stringr::str_to_title(names(mtcars))
 [1] "Mpg"        "Cylinders"  "Disp"       "Hp"         "Axle_ratio"
 [6] "Weight"     "Qsec"       "Engine"     "Am"         "Gear"      
[11] "Carb"       "Am_ch"      "Am_fac"    

4.11 Drop duplicate rows

How to find and drop duplicate elements.

Python

The duplicated() function determines which rows of a DataFrame are duplicates of previous rows.

First, we create a DataFrame with a duplicate row by using the pandas concat() function. concat() combines DataFrames by rows or columns (by rows is the default).

# Create DataFrame with duplicate rows
import pandas as pd
mtcars2 = pd.concat([mtcars.iloc[0:3,0:6], mtcars.iloc[0:1,0:6]])

The duplicated() function returns a logical vector. TRUE indicates a row is a duplicate of a previous row.

mtcars2.duplicated()
0    False
1    False
2    False
0     True
dtype: bool

R

The duplicated() function “determines which elements of a vector or data frame are duplicates of elements with smaller subscripts” (from ?duplicated).

# Create data frame with duplicate rows
mtcars2 <- rbind(mtcars[1:3,1:6], mtcars[1,1:6])
# Last row is duplicate of first
mtcars2
   mpg cylinders disp  hp axle_ratio weight
1 21.0         6  160 110       3.90  2.620
2 21.0         6  160 110       3.90  2.875
3 22.8         4  108  93       3.85  2.320
4 21.0         6  160 110       3.90  2.620

The duplicated() function returns a logical vector. TRUE indicates that a given element is a duplicate of a previous one (and here, element = row).

# The last row is a duplicate
duplicated(mtcars2)
[1] FALSE FALSE FALSE  TRUE

The TRUE/FALSE vector can be used to extract or drop duplicate rows. Since TRUE in indexing brackets will keep a row, we can use ! to negate the logicals and keep those that are not TRUE.

# Drop the duplicate and update the data frame
mtcars3 <- mtcars2[!duplicated(mtcars2), ]
mtcars3
   mpg cylinders disp  hp axle_ratio weight
1 21.0         6  160 110       3.90  2.620
2 21.0         6  160 110       3.90  2.875
3 22.8         4  108  93       3.85  2.320
# Extract and investigate the duplicate row
mtcars2[duplicated(mtcars2), ]
  mpg cylinders disp  hp axle_ratio weight
4  21         6  160 110        3.9   2.62

The anyDuplicated() function returns the row number of duplicate rows.

anyDuplicated(mtcars2)
[1] 4

4.12 Format dates

With formatted dates, we can calculate elapsed times, extract components of a date, properly order the names of months, and more.

Python

The Python module datetime can be used to create various date and time objects. Here, we discuss four of the main classes within datetime

The first class is the date() class. This creates a date object whose only attributes are year, month, and day.

Below, we create a date object using the date() class. The attributes are specified as integers in the arguments of date() in the following order: date(year, month, day).

import datetime as dt 

x = dt.date(2001, 4, 12)
print(x)
2001-04-12

To get today’s date, we can use the date.today() function:

today = dt.date.today()
print(today)
2023-06-14

Note that the output of both x and today are only year-month-day because they are date objects.

We can extract each of these attributes (year, month, day) from the date object as follows:

today.year 
2023
today.month
6
today.day
14

Next is the time() class. This class creates time objects containing information about only a time. The attributes that go into the time() class are hours, minutes, and seconds in that order. Like the date class, these attributes must be input as integers.

y = dt.time(11, 34, 56)
print(y)
11:34:56

If you want a time object containing only hours and minutes, or only seconds, etc. you can specify the attributes by name when creating the time object.

only_hrs = dt.time(hour = 10)
only_mins = dt.time(minute = 55)

print(only_hrs)
10:00:00
print(only_mins)
00:55:00

Again, similar to the date class, we can extract hour, minute, and second attributes from time-classed objects:

y.hour
11
y.minute
34
y.second
56
y.microsecond
0

The datetime() class that creates a datetime object containing information about both date and time. The attributes must be input as integers wiht the following default order: year, month, day, hour, minute, and second. Like the date and time classes, we can specify specific attributes in the arguments using the names thereof. If we don’t specify any time components, the datetime object defaults to time 00:00:00.

# Input attributes in order
z = dt.datetime(1981, 4, 12, 11, 34, 56)
print(z)
1981-04-12 11:34:56
# Input attributes using attribute names (any order)
z2= dt.datetime(year = 2021, day = 6, month = 12, hour = 6)
print(z2)
2021-12-06 06:00:00
# No time attributes
z3 = dt.datetime(1981, 4, 12)
print(z3)
1981-04-12 00:00:00

Again, we can extract attributes in exactly the same way as we did for the date and time classes.

z.year
1981
z.day
12
z.hour
11

The final class discussed here is the timedelta class. This class is used to store date/time differences between date objects.

The default settings for a timedelta object are as follows: timedelta(weeks = 0, days = 0, hours = 0, minutes = 0, seconds = 0, milliseconds = 0, microseconds = 0).

To add or subtract dates and times to/from these objects, we can use operators like + and -:

# Create a datetime object for the current time
d1 = dt.datetime.now()
print(d1)
2023-06-14 13:44:22.573430
# Add 550 days to the datetime object 
d2 = d1 + dt.timedelta(days = 550)
print(d2)
2024-12-15 13:44:22.573430
# Subtract five hours from the datetime object 
d3 = d1 - dt.timedelta(hours = 5)
print(d3)
2023-06-14 08:44:22.573430

Finally, we will discuss how to convert strings to datetime objects and vice versa.

The attribute strftime() converts datetime objects to strings. In the argument of strftime(), you can specify the format you would like.

d1
datetime.datetime(2023, 6, 14, 13, 44, 22, 573430)
d1.strftime("%A %m %Y")
'Wednesday 06 2023'
d1.strftime("%a %m %y")
'Wed 06 23'

The attribute strptime() converts strings into datetime objects. In the argument of strptime(), you must specify the string and then the format of the string.

d4 = "27/10/98 11:03:9.033"

d1.strptime(d4, "%d/%m/%y %H:%M:%S.%f")
datetime.datetime(1998, 10, 27, 11, 3, 9, 33000)

R

Dates in R can be stored as a Date class or a Date-Time class. Dates are stored as the number of days since January 1, 1970. Date-Times are stored as the number of seconds since January 1, 1970. With dates stored in this manner, we can calculate elapsed time in units such as days, weeks, hours, minutes, and so forth.

Below are the dates of the first five NASA Columbia Space Shuttle flights entered as a character vector.

date <- c("12 April 1981", 
          "12 November 1981", 
          "22 March 1982", 
          "27 June 1982", 
          "11 November 1982")

R does not immediately recognize these as a Date class. To format as a Date class, we can use either the base R as.Date() function or one of the convenience functions in the lubridate package. The as.Date() function requires a specified POSIX conversion specification as documented in ?strptime. Below, the conversion code %d %B %Y indicates that the date is entered as a two-digit day of month (%d), a full month name (%B), and a year with century (%Y).

date1 <- as.Date(date, format = "%d %B %Y")
date1
[1] "1981-04-12" "1981-11-12" "1982-03-22" "1982-06-27" "1982-11-11"

The dates now print in year-month-day format; however, they are stored internally as number of days since January 1, 1970. This can be seen by using as.numeric() on the date1 vector.

as.numeric(date1)
[1] 4119 4333 4463 4560 4697

The lubridate package provides a series of functions that are permutations of the letters “m”, “d”, and “y” to represent the order of date components. To format the original “date” vector, we use the dmy() function since the date components are ordered as day, month and year. We must load the lubridate package to use this function.

library(lubridate)
date2 <- dmy(date)
date2
[1] "1981-04-12" "1981-11-12" "1982-03-22" "1982-06-27" "1982-11-11"

When dates are formatted, we can easily extract information such as the day of week or the month. For example, to extract the day of week of the launches as an ordered factor, we can use the lubridate function wday() with label = TRUE and abbr = FALSE.

wday(date2, label = TRUE, abbr = FALSE)
[1] Sunday   Thursday Monday   Sunday   Thursday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

To calculate elapsed time between launches in days, we can use the base R diff() function.

diff(date2)
Time differences in days
[1] 214 130  97 137

To store a date as a Date-Time class, we need to include a time component. Below are the first five Columbia launch dates with times. (UTC refers to Universal Coordinated Time.)

datetime <- c("12 April 1981 12:00:04 UTC",
              "12 November 1981 15:10:00 UTC",
              "22 March 1982 16:00:00 UTC",
              "27 June 1982 15:00:00 UTC",
              "11 November 1982 12:19:00 UTC")

To format as a Date-Time class, we can use either the base R as.POSIXct() function or one of the convenience functions in the lubridate package. To use as.POSIXct(), we need to include additional POSIX conversion specifications for the hour, minute, and second of launch. The %H:%M:%S specification refers to hours, minutes, and seconds. The tz argument specifies the time zone of the times.

datetime1 <- as.POSIXct(datetime, 
                        format = "%d %B %Y %H:%M:%S", 
                        tz = "UTC")
datetime1
[1] "1981-04-12 12:00:04 UTC" "1981-11-12 15:10:00 UTC"
[3] "1982-03-22 16:00:00 UTC" "1982-06-27 15:00:00 UTC"
[5] "1982-11-11 12:19:00 UTC"

When we use as.numeric() on the datetime1 vector, we see it is stored as number of seconds since January 1, 1970.

as.numeric(datetime1)
[1] 355924804 374425800 385660800 394038000 405865140

Using lubridate, we can append _hms() to any of the “mdy” functions to format dates with time components as a Date-Time class. Notice the default time zone in lubridate is UTC.

datetime2 <- dmy_hms(datetime)
datetime2
[1] "1981-04-12 12:00:04 UTC" "1981-11-12 15:10:00 UTC"
[3] "1982-03-22 16:00:00 UTC" "1982-06-27 15:00:00 UTC"
[5] "1982-11-11 12:19:00 UTC"

To calculate the elapsed time between launches in hours, we can use the lubridate function time_length() with unit = "hours". Below, we use diff() and then pipe the result to time_length().

diff(datetime2) |> time_length(unit = "hours")
[1] 5139.166 3120.833 2327.000 3285.317

For more information on working with dates and times in R, see the vignette accompanying the lubridate package.

4.13 Randomly sample rows

How to take a random sample of rows from a data frame. The sample is usually either a fixed size or a proportion.

Python

The pandas package provides a function for taking a sample of fixed size or a proportion, sample(). To sample with replacement, set replace = TRUE.

By default, the random sample will change every time the code is run. To always generate the same “random” sample, set random_state to any positive integer.

To create a sample with a fixed number of rows, use the n argument.

# Sample five rows from mtcars
mtcars.sample(n = 5, replace = True)
     MPG  CYL   DISP   HP  AXLE_RATIO  ...  VS  AM  GEAR CARB       KPL
1   21.0    6  160.0  110        3.90  ...   0   1     4    4  4.464286
4   18.7    8  360.0  175        3.15  ...   0   0     3    2  3.975340
31  21.4    4  121.0  109        4.11  ...   1   1     4    2  4.549320
26  26.0    4  120.3   91        4.43  ...   0   1     5    2  5.527211
11  16.4    8  275.8  180        3.07  ...   0   0     3    3  3.486395

[5 rows x 12 columns]

To create a sample of a proportion, use the frac argument.

# Sample 20% of the rows from mtcars
mtcars.sample(frac = 0.20, random_state = 1)
     MPG  CYL   DISP   HP  AXLE_RATIO  ...  VS  AM  GEAR CARB       KPL
27  30.4    4   95.1  113        3.77  ...   1   1     5    2  6.462585
3   21.4    6  258.0  110        3.08  ...   1   0     3    1  4.549320
22  15.2    8  304.0  150        3.15  ...   0   0     3    2  3.231293
18  30.4    4   75.7   52        4.93  ...   1   1     4    2  6.462585
23  13.3    8  350.0  245        3.73  ...   0   0     3    4  2.827381
17  32.4    4   78.7   66        4.08  ...   1   1     4    1  6.887755

[6 rows x 12 columns]

The NumPy function random.choice() in combination with the loc() function can be used to sample from a DataFrame.

The random.choice() function creates a random sample according to the given parameters. The loc() function is used to access rows and columns by index.

import numpy as np

# Create a random sample of size five with replacement
random_sample = np.random.choice(len(mtcars), (5, ), replace=True)

# Use random_sample to sample from mtcars
mtcars.loc[random_sample, ]
     MPG  CYL   DISP   HP  AXLE_RATIO  ...  VS  AM  GEAR CARB       KPL
25  27.3    4   79.0   66        4.08  ...   1   1     4    1  5.803571
0   21.0    6  160.0  110        3.90  ...   0   1     4    4  4.464286
10  17.8    6  167.6  123        3.92  ...   1   0     4    4  3.784014
5   18.1    6  225.0  105        2.76  ...   1   0     3    1  3.847789
27  30.4    4   95.1  113        3.77  ...   1   1     5    2  6.462585

[5 rows x 12 columns]

As before, setting random.seed() to a positive integer will ensure that the same “random” sample is generated each time the code is run.

np.random.seed(123)
sample = np.random.choice(len(mtcars), (5,), replace=True)
mtcars.loc[sample, ]
     MPG  CYL   DISP   HP  AXLE_RATIO    WT   QSEC  VS AM  GEAR  CARB       KPL
30  15.0    8  301.0  335        3.54  3.57  14.60   0  1     5     8  3.188776
13  15.2    8  275.8  180        3.07  3.78  18.00   0  0     3     3  3.231293
30  15.0    8  301.0  335        3.54  3.57  14.60   0  1     5     8  3.188776
2   22.8    4  108.0   93        3.85  2.32  18.61   1  1     4     1  4.846939
28  15.8    8  351.0  264        4.22  3.17  14.50   0  1     5     4  3.358844

R

There are many ways to sample rows from a data frame in R. The dplyr package provides a convenience function, slice_sample(), for taking either a fixed sample size or a proportion.

# Sample five rows from mtcars
dplyr::slice_sample(mtcars, n = 5)
   mpg cylinders  disp  hp axle_ratio weight  qsec engine am gear carb am_ch
1 33.9         4  71.1  65       4.22  1.835 19.90      1  1    4    1     1
2 15.2         8 304.0 150       3.15  3.435 17.30      0  0    3    2     0
3 10.4         8 460.0 215       3.00  5.424 17.82      0  0    3    4     0
4 22.8         4 140.8  95       3.92  3.150 22.90      1  0    4    2     0
5 10.4         8 472.0 205       2.93  5.250 17.98      0  0    3    4     0
     am_fac
1    manual
2 automatic
3 automatic
4 automatic
5 automatic
# Sample 20% of rows from mtcars
dplyr::slice_sample(mtcars, prop = 0.20)
   mpg cylinders  disp  hp axle_ratio weight  qsec engine am gear carb am_ch
1 30.4         4  75.7  52       4.93  1.615 18.52      1  1    4    2     1
2 15.2         8 275.8 180       3.07  3.780 18.00      0  0    3    3     0
3 10.4         8 460.0 215       3.00  5.424 17.82      0  0    3    4     0
4 14.7         8 440.0 230       3.23  5.345 17.42      0  0    3    4     0
5 21.4         6 258.0 110       3.08  3.215 19.44      1  0    3    1     0
6 16.4         8 275.8 180       3.07  4.070 17.40      0  0    3    3     0
     am_fac
1    manual
2 automatic
3 automatic
4 automatic
5 automatic
6 automatic

To sample with replacement, set replace = TRUE.

The base R functions sample() and runif() can be combined to sample fixed sizes or approximate proportions.

# Sample five rows from mtcars
# First, get five random row numbers
i <- sample(nrow(mtcars), size = 5)
# Then, use i to subset those rows
mtcars[i, ]
    mpg cylinders  disp  hp axle_ratio weight  qsec engine am gear carb am_ch
27 26.0         4 120.3  91       4.43  2.140 16.70      0  1    5    2     1
3  22.8         4 108.0  93       3.85  2.320 18.61      1  1    4    1     1
4  21.4         6 258.0 110       3.08  3.215 19.44      1  0    3    1     0
18 32.4         4  78.7  66       4.08  2.200 19.47      1  1    4    1     1
7  14.3         8 360.0 245       3.21  3.570 15.84      0  0    3    4     0
      am_fac
27    manual
3     manual
4  automatic
18    manual
7  automatic
# Sample about 20% of rows from mtcars
# First, generate random values on range of [0,1]
i <- runif(nrow(mtcars))
# Then, use i < 0.20 logical vector to select rows that correspond to TRUE
mtcars[i < 0.20, ]
    mpg cylinders  disp  hp axle_ratio weight  qsec engine am gear carb am_ch
3  22.8         4 108.0  93       3.85  2.320 18.61      1  1    4    1     1
13 17.3         8 275.8 180       3.07  3.730 17.60      0  0    3    3     0
17 14.7         8 440.0 230       3.23  5.345 17.42      0  0    3    4     0
19 30.4         4  75.7  52       4.93  1.615 18.52      1  1    4    2     1
27 26.0         4 120.3  91       4.43  2.140 16.70      0  1    5    2     1
30 19.7         6 145.0 175       3.62  2.770 15.50      0  1    5    6     1
      am_fac
3     manual
13 automatic
17 automatic
19    manual
27    manual
30    manual

The random sample will change every time the code is run. To always generate the same “random” sample, use the set.seed() function with any positive integer.

set.seed(123)
i <- runif(nrow(mtcars))
mtcars[i < 0.20,]
    mpg cylinders  disp  hp axle_ratio weight  qsec engine am gear carb am_ch
6  18.1         6 225.0 105       2.76   3.46 20.22      1  0    3    1     0
15 10.4         8 472.0 205       2.93   5.25 17.98      0  0    3    4     0
18 32.4         4  78.7  66       4.08   2.20 19.47      1  1    4    1     1
30 19.7         6 145.0 175       3.62   2.77 15.50      0  1    5    6     1
      am_fac
6  automatic
15 automatic
18    manual
30    manual