Chapter 4 Data Manipulation
This chapter discusses strategies for filtering, selecting, modifying, and deriving variables in data. Unless otherwise stated, examples are for DataFrames (Python) and data frames (R) and use the mtcars
data frame that is included with R.
4.1 View variable names and types
View and inspect the names of variables and their types (numeric, string, logical, etc.).
Python
The .info()
function in pandas returns information about a DataFrame.
Setting the argument verbose
to True
prints the names of the columns, their lengths excluding NULL
values, and their data types (dtype
) in a table. The function lists the unique data types in the DataFrame, and it prints the amount of memory the DataFrame occupies.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 mpg 32 non-null float64
1 cyl 32 non-null int64
2 disp 32 non-null float64
3 hp 32 non-null int64
4 drat 32 non-null float64
5 wt 32 non-null float64
6 qsec 32 non-null float64
7 vs 32 non-null int64
8 am 32 non-null int64
9 gear 32 non-null int64
10 carb 32 non-null int64
dtypes: float64(5), int64(6)
memory usage: 2.9 KB
Setting verbose
to False
excludes the table describing each column.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Columns: 11 entries, mpg to carb
dtypes: float64(5), int64(6)
memory usage: 2.9 KB
If a DataFrame has 100 or fewer columns, the verbose
argument defaults to True
.
R
The str()
function in R lists the names of the variables, their types, the first few values of each, and the object dimensions.
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
To see just the names of the data frame, use the names()
function.
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
To see just the dimensions of the data frame, use the dim()
function. It returns the number of rows and columns, respectively.
[1] 32 11
4.2 Select variables
How to select specific columns of data frames.
Python
The period operator .
provides access to a column in a DataFrame as a vector. This returns a pandas Series. A pandas Series can do everything a NumPy array can do.
0 21.0
1 21.0
2 22.8
3 21.4
4 18.7
5 18.1
6 14.3
7 24.4
8 22.8
9 19.2
10 17.8
11 16.4
12 17.3
13 15.2
14 10.4
15 10.4
16 14.7
17 32.4
18 30.4
19 33.9
20 21.5
21 15.5
22 15.2
23 13.3
24 19.2
25 27.3
26 26.0
27 30.4
28 15.8
29 19.7
30 15.0
31 21.4
Name: mpg, dtype: float64
Indexing also provides access to columns as a pandas Series. Single and double quotations both work.
0 21.0
1 21.0
2 22.8
3 21.4
4 18.7
5 18.1
6 14.3
7 24.4
8 22.8
9 19.2
10 17.8
11 16.4
12 17.3
13 15.2
14 10.4
15 10.4
16 14.7
17 32.4
18 30.4
19 33.9
20 21.5
21 15.5
22 15.2
23 13.3
24 19.2
25 27.3
26 26.0
27 30.4
28 15.8
29 19.7
30 15.0
31 21.4
Name: mpg, dtype: float64
Operations on NumPy arrays are faster than operations on pandas Series. But in many cases, using pandas Series is fine performance-wise. This consideration is primarily important for large data sets on which many operations are performed. The .values
function returns a NumPy array from a pandas DataFrame.
array([21. , 21. , 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
15.2, 13.3, 19.2, 27.3, 26. , 30.4, 15.8, 19.7, 15. , 21.4])
Double indexing returns a pandas DataFrame instead of a NumPy array or pandas Series.
mpg
0 21.0
1 21.0
2 22.8
3 21.4
4 18.7
5 18.1
6 14.3
7 24.4
8 22.8
9 19.2
10 17.8
11 16.4
12 17.3
13 15.2
14 10.4
15 10.4
16 14.7
17 32.4
18 30.4
19 33.9
20 21.5
21 15.5
22 15.2
23 13.3
24 19.2
25 27.3
26 26.0
27 30.4
28 15.8
29 19.7
30 15.0
31 21.4
The head()
and tail()
functions return the first five or last five values, respectively. Use the n
argument to change the number of values. This function works on NumPy arrays, pandas Series and pandas DataFrames.
0 21.0
1 21.0
2 22.8
3 21.4
4 18.7
Name: mpg, dtype: float64
mpg cyl disp hp drat wt qsec vs am gear carb
31 21.4 4 121.0 109 4.11 2.78 18.6 1 1 4 2
R
The dollar sign operator, $
, provides access to a column in a data frame as a vector.
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Double-indexing brackets also provide access to columns as a vector.
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Single-indexing brackets work as well, but they return a data frame instead of a vector (if used with a data frame).
mpg
1 21.0
2 21.0
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
11 17.8
12 16.4
13 17.3
14 15.2
15 10.4
16 10.4
17 14.7
18 32.4
19 30.4
20 33.9
21 21.5
22 15.5
23 15.2
24 13.3
25 19.2
26 27.3
27 26.0
28 30.4
29 15.8
30 19.7
31 15.0
32 21.4
Single-indexing brackets also allow selection of rows when used with a comma. The syntax is rows, columns
[1] 21.0 21.0 22.8
Finally single-indexing brackets allow us to select multiple columns. Request columns either by name or position using a vector.
mpg cyl
1 21.0 6
2 21.0 6
3 22.8 4
4 21.4 6
5 18.7 8
6 18.1 6
7 14.3 8
8 24.4 4
9 22.8 4
10 19.2 6
11 17.8 6
12 16.4 8
13 17.3 8
14 15.2 8
15 10.4 8
16 10.4 8
17 14.7 8
18 32.4 4
19 30.4 4
20 33.9 4
21 21.5 4
22 15.5 8
23 15.2 8
24 13.3 8
25 19.2 8
26 27.3 4
27 26.0 4
28 30.4 4
29 15.8 8
30 19.7 6
31 15.0 8
32 21.4 4
The head()
and tail()
functions return the first six or last six values, respectively. Use the n
argument to change the number of values. These functions work with vectors, data frames, matrices, and more.
[1] 21.0 21.0 22.8 21.4 18.7 18.1
mpg cyl disp hp drat wt qsec vs am gear carb
32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
4.3 Filter/Subset variables
Selecting rows of a rectangular data set that meet certain conditions.
Python
We can filter rows of a DataFrame based on subsetting conditions; the data type returned depends on the filtration method.
The following example returns a DataFrame, not a Series, as there is more than one column selected from the DataFrame. Use a list, square brackets [], to subset more than one column.
mpg cyl
17 32.4 4
18 30.4 4
19 33.9 4
27 30.4 4
Both pandas Series and NumPy arrays can be used for fast performance and vector operations. Many functions require a vector as input.
The following code returns one column, mpg
, as a pandas Series. A pandas Series is one column from a pandas DataFrame.
17 32.4
18 30.4
19 33.9
27 30.4
Name: mpg, dtype: float64
The following code also returns a pandas Series, but it uses the .
operator to select a column, rather than square brackets []
.
17 32.4
18 30.4
19 33.9
27 30.4
Name: mpg, dtype: float64
Both of the following lines of code return NumPy arrays using the .values
function. df1
is one dimension, reflecting just the mpg
column, and df2
is two dimensions, reflecting the mpg
and cyl
columns.
df1 = mtcars[mtcars["mpg"] > 30]["mpg"].values
df2 = mtcars[mtcars["mpg"] > 30][["mpg", "cyl"]].values
You can also filter using multiple row conditions:
mpg cyl disp hp drat wt qsec vs am gear carb
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
<string>:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
R
In base R, we can use subsetting brackets or the subset()
function to select rows based on some condition. Below, we demonstrate both approaches, returning only those rows with mpg
greater than 30. We begin with subsetting brackets, which take three arguments:
i
: the condition to subset on.j
: the columns to show. If none specified, all columns are returneddrop
: an optional logical argument (TRUE/FALSE) to determine whether or not to coerce the output to the lowest possible dimension. The default is TRUE.
We rarely type the first two argument names, i
and j
, when using subsetting brackets.
This example returns only the rows with mpg > 30 and all columns. Notice we need to preface mpg with mtcars$
to tell R where to find the “mpg” column and that we need to provide a comma after the condition.
mpg cyl disp hp drat wt qsec vs am gear carb
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
We can select the columns to return in the second argument as a vector. Notice that we only need to specify the column names as a character vector. We can also use numbers corresponding to the column positions, as well as conditional statements.
mpg wt gear
18 32.4 2.200 4
19 30.4 1.615 4
20 33.9 1.835 4
28 30.4 1.513 5
Show the first three columns:
mpg cyl disp
18 32.4 4 78.7
19 30.4 4 75.7
20 33.9 4 71.1
28 30.4 4 95.1
Show columns with names consisting of only two characters. The nchar()
function counts the number of characters in a string. The expression nchar(names(mtcars)) == 2
returns a vector of TRUE
/FALSE
values where TRUE
indicates the column name is only two characters in length.
hp wt vs am
18 66 2.200 1 1
19 52 1.615 1 1
20 65 1.835 1 1
28 113 1.513 1 1
Notice that when we specify only one column, the brackets return a vector.
[1] 32.4 30.4 33.9 30.4
To get a data frame, set the drop
argument to FALSE.
mpg
18 32.4
19 30.4
20 33.9
28 30.4
The subset()
function allows us to refer to column names without using the $
extractor function or quoting column names. It also has a drop argument, but its default is FALSE.
It has four arguments:
x
: the data frame to subset.subset
: the condition to subset on.select
: the columns to select.drop
: an optional logical argument (TRUE/FALSE) to determine whether or not to coerce the output to the lowest possible dimension. The default is FALSE.
We rarely type the first three argument names, x
, subset
and select
, when using subset()
.
Below we replicate the previous examples using subset()
.
mpg cyl disp hp drat wt qsec vs am gear carb
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
# Return rows where mpg > 30 and the mpg, wt, and gear columns
subset(mtcars, mpg > 30, c(mpg, wt, gear))
mpg wt gear
18 32.4 2.200 4
19 30.4 1.615 4
20 33.9 1.835 4
28 30.4 1.513 5
mpg cyl disp
18 32.4 4 78.7
19 30.4 4 75.7
20 33.9 4 71.1
28 30.4 4 95.1
# Return rows where mpg > 30 and columns consisting of two characters
subset(mtcars, mpg > 30, nchar(names(mtcars)) == 2)
hp wt vs am
18 66 2.200 1 1
19 52 1.615 1 1
20 65 1.835 1 1
28 113 1.513 1 1
[1] 32.4 30.4 33.9 30.4
mpg
18 32.4
19 30.4
20 33.9
28 30.4
Another difference between subsetting brackets and the subset()
function is how they handle missing values. Subsetting brackets return missing values while subset()
does not. We demonstrate below with a toy data frame. Notice the x
column has a missing value.
x y
1 1 12
2 NA 21
3 5 34
When we condition on x < 3
, the subsetting bracket approach returns a row with NA
values.
x y
1 1 12
NA NA NA
The subset()
approach ignores the missing value.
x y
1 1 12
To replicate the subset()
result with the subsetting brackets, we need to include an additional condition to only show rows where x is not missing. We can do that with the is.na()
function. The is.na()
function returns TRUE
if a value is missing and FALSE
otherwise. If we preface with !
, we get TRUE
if a value is not missing and FALSE
otherwise.
x y
1 1 12
See also the filter()
function in the dplyr package and the enhanced subsetting brackets in the data.table package.
4.4 Rename variables
Python
Column names can be changed using the function .rename()
. Below, we change the column names cyl
and wt
to cylinder
and WT
, respectively.
mpg cylinder disp hp drat WT qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Alternatively, column names can be changed by replacing the vector of column names with a new vector. Below, we create a vector of columns that replaces drat
with axle_ratio
using conditional match and that replaces disp
with DISP
using indexing.
column_names = mtcars.columns.values
# Using conditional match
column_names[column_names == "drat"] = "axle_ratio"
# Using indexing
column_names[2] = "DISP"
mtcars.columns = column_names
mtcars.columns
Index(['mpg', 'cyl', 'DISP', 'hp', 'axle_ratio', 'wt', 'qsec', 'vs', 'am',
'gear', 'carb'],
dtype='object')
R
Variable names can be changed by referencing their indexes (i.e., their positions among the columns in a data frame). Below the second column is cyl
. We change the name to cylinders.
[1] "cyl"
[1] "mpg" "cylinders" "disp" "hp" "drat" "wt"
[7] "qsec" "vs" "am" "gear" "carb"
Variable names can also be changed by conditional match. Below we find the variable name that matches drat
and change it to axle_ratio.
[1] "drat"
[1] "mpg" "cylinders" "disp" "hp" "axle_ratio"
[6] "wt" "qsec" "vs" "am" "gear"
[11] "carb"
More than one variable name can be changed using a vector of positions or matches.
names(mtcars)[c(6,8)] <- c("weight", "engine")
# Alternatively: names(mtcars)[names(mtcars) %in% c("wt", "vs")] <- c("weight", "engine")
names(mtcars)
[1] "mpg" "cylinders" "disp" "hp" "axle_ratio"
[6] "weight" "qsec" "engine" "am" "gear"
[11] "carb"
See also the rename()
function in the dplyr package.
4.5 Create, replace and remove variables
We often need to create variables that are functions of other variables, or we need to replace existing variables with updated versions thereof.
Python
Adding a new variable via indexing notation and assigning a result to it adds a new column.
Doing the same with an existing column updates the values in the column.
Alternatively, the .
notation can be used to update the values in a column.
To remove a column, use the .drop()
function.
mpg cyl DISP hp axle_ratio wt qsec vs am gear carb
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
R
Adding a new variable name after the dollar sign notation and assigning a result to it adds a new column.
Doing the same with an existing variable updates the values in a column.
To remove a variable, assign it NULL
.
See also the mutate()
function in the dplyr package.
4.6 Create strings from numbers
You may have data that is numeric but that needs to be treated as a string.
Python
You can change the data type of a column in a DataFrame using the astype
function.
mtcars['am'] = mtcars['am'].astype(str)
type(mtcars.am[0]) # check the type of the first item in am column
<class 'str'>
A potential number-to-string conversion task in Python might be formatting 5-digit American ZIP Codes. Some zip codes begin with 0, but if stored as a numeric value, the 0 is dropped. For example, consider the following pandas DataFrame. Notice that the leading 0 is dropped from two of the ZIP Codes.
state zip
0 VT 5001
1 VA 22901
2 NH 3282
One way to fix this is using the string zfill()
method. First, we convert the numeric column to string type using the method we just demonstrated. Then we access the zip
column using zc.zip
and the zfill()
method using str.zfill
with the width parameter set to 5. This pads the string with 0
on the left as necessary to make each value five characters wide.
state zip
0 VT 05001
1 VA 22901
2 NH 03282
If we knew we were importing zip codes using read_csv
, we could also use the dtype
argument to specify which storage type to use for the zip
column. Below, we pass a dictionary that maps the str
type to the zip
column. The result is a properly formatted ZIP Code column.
state zip
0 VT 05001
1 VA 22901
2 NH 03282
R
The as.character()
function takes a vector and converts it to string format.
[1] 1 1 1 0 0 0
[1] "1" "1" "1" "0" "0" "0"
To save the results of this process—this conversion—we need to assign the result to the data frame.
[1] "1" "1" "1" "0" "0" "0"
The factor()
function can also be used to convert a numeric vector into a categorical variable. However, the result is not exactly a string: A factor is made of integers with character labels. Factors are useful for character data that have a fixed set of levels (e.g., “grade 1,” grade 2,” etc.).
[1] 1 1 1 0 0 0
[1] 1 1 1 0 0 0
Levels: 0 1
[1] manual manual manual automatic automatic automatic
Levels: automatic manual
As before, to save the results of this conversion to the data frame, we need to assign the results to a variable:
# Create factor variable am_fac
mtcars$am_fac <- factor(mtcars$am, labels = c("automatic", "manual"))
head(mtcars$am_fac)
[1] manual manual manual automatic automatic automatic
Levels: automatic manual
A common number-to-string conversion task in R is formatting 5-digit American ZIP Codes as demonstrated above in Python. Some zip codes begin with 0, but if stored as a numeric value, the 0 is dropped.
[1] 3766 3748 22901 3264
We need to store the zip code as a character value so the 0 is preserved. One way to do this is via the sprintf()
function in base R. The first argument is the format string or conversion specification. A conversion specification begins with %
. The following 0
and 5
says to format the zip_codes vector as a 5-digit string padded by zeroes on the left. The final i
says we’re working with integer values.
[1] "03766" "03748" "22901" "03264"
See also the str_pad()
function in the stringr package.
4.7 Create numbers from strings
String variables that ought to be numbers usually have some character data contained in the values, such as units (e.g., “4 cm”). To create numbers from strings, it’s important to remove any character data that cannot be converted to a number.
Python
The astype(float)
or astype(int)
function will coerce strings to numerical representation.
For demonstration, let’s say we have the following NumPy array:
The astype(float)
function throws an error due to the presence of strings. The astype()
function is for NumPy arrays.
try:
weight.astype(float)
except ValueError:
print("ValueError: could not convert string to float: '125 lbs.'")
ValueError: could not convert string to float: '125 lbs.'
One way to approach this is to first remove the strings from the objects and then use astype(float)
. Below we use the strip()
function to find ” lbs.” using a list comprehension.
# [] indicates a list in python
# np.array() changes the list back into an array
weight = np.array([w.strip(" lbs.") for w in weight])
Now we can use the astype()
function to change the elements in weight from str
to float
.
array([125., 132., 156.])
R
The as.numeric()
function will attempt to coerce strings to numeric type if possible. Any non-numeric values are coerced to NA
.
For demonstration, say we have the following vector:
The as.numeric()
function returns all NA due to presence of character data.
Warning: NAs introduced by coercion
[1] NA NA NA
There are many ways to approach this. A common approach is to first remove the characters and then use as.numeric()
. Below, we use the gsub()
function to find “lbs.” and replace it with nothing (find-and-replace procedures are discussed more below).
[1] 125 132 156
The parse_number()
function in the readr package can often take care of these situations automatically.
[1] 125 132 156
4.8 Combine strings
String concatenation—turning “Jane” and “Smith” into “Jane Smith”—is easily done in both languages.
Python
The +
operator can combine strings in Python.
species = 'yellow-bellied sea snake'
tail_shape = 'paddle-shaped'
statement = 'The ' + species + ' has a ' + tail_shape + ' tail that helps it swim.'
print(statement)
The yellow-bellied sea snake has a paddle-shaped tail that helps it swim.
R
The paste()
and paste0()
functions combine strings in R. The former concatenates strings and places spaces between them; the latter concatenates sans spaces.
species <- 'rainbow boa'
appearance <- 'iridescent'
location <- 'Central and South America'
statement1 <- paste('The', species, 'has an', appearance, 'sheen.')
statement1
[1] "The rainbow boa has an iridescent sheen."
# Note that spaces must be provided explicitly when using paste0()
statement2 <- paste0('The ', species, ' is found in ', location)
statement2
[1] "The rainbow boa is found in Central and South America"
4.9 Finding and replacing patterns within strings
This section reviews key functions in Python and R for finding and replacing character patterns. The functions we discuss can search for fixed character patterns (e.g., “Meredith Rollins” to case-sensitively match that name and that name alone) or regular expression (regex) patterns (e.g., \w+
to capture all instances of >=1 alphabetic character). Note that in R, meta characters like w
(to match word characters) and d
(to match digits) are escaped with two backslashes (e.g., \\w
and \\d
). In Python, regex patterns are generally headed by r
, which allows meta characters in the regex itself to be escaped with just one \
(e.g., r"\w+"
). Regex is an enormous topic, and we don’t discuss it at any length here, but you can learn more about regular expressions—and how they’re implemented in different programming languages—at these websites: https://www.regular-expressions.info/; https://regexone.com/
Python
The re module provides a set of functions for searching and manipulating strings. The search()
function does exactly as its name suggests: It identifies matches for a fixed or regex character pattern in a string. sub()
searches for and replaces character patterns (fixed or regex). The count
argument in sub()
allows a user to specify how many instances of the matched pattern they want to to replace; e.g., use count = 1
to replace just the first instance of a match.
import re
statement = 'Pencils with an HB graphite grade are commonly used for writing. An HB pencil is approximately equal to a #2 pencil.'
# Search for "HB" using fixed and regex patterns
search_result1 = re.search(pattern = "HB", string = statement)
print(search_result1)
<re.Match object; span=(16, 18), match='HB'>
<re.Match object; span=(16, 18), match='HB'>
# Replace all instances of "HB"
all_replaced = re.sub(pattern = 'HB', repl = 'HB (hard black)', string = statement)
print(all_replaced)
Pencils with an HB (hard black) graphite grade are commonly used for writing. An HB (hard black) pencil is approximately equal to a #2 pencil.
# Replace just the first instance of HB
one_replaced = re.sub(pattern = 'HB', repl = 'HB (hard black)', string = statement, count = 1)
print(one_replaced)
Pencils with an HB (hard black) graphite grade are commonly used for writing. An HB pencil is approximately equal to a #2 pencil.
# Search and replace using a regex pattern instead of a fixed string
# (Replace a space after a period with a line break)
regex_replaced = re.sub(pattern = r'(?<=\.)\s{1}', repl = '\n', string = statement)
print(regex_replaced)
Pencils with an HB graphite grade are commonly used for writing.
An HB pencil is approximately equal to a #2 pencil.
R
The standard-issue string-search function in base R is grep()
; it returns the index of the elements in a set of one or more strings for which a pattern match was found. (grepl()
acts similarly but returns a vector of TRUE
/FALSE
indicating whether a match was found in each string passed to the function.) The functions sub()
and gsub()
can be used to find and replace instances of a pattern: The former replaces just the first instance; the latter replaces all instances. The search pattern can be provided as a raw character string or as a regular expression.
statements <- c('Great Pencil Co. primarily sells pencils of the following grades: HB; B; and 3B.',
'Great Pencil Co. has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades.')
# Search for a pattern and return the indexes of elements for which a match is found
grep(pattern = 'pencil', x = statements, fixed = TRUE) # use `fixed = T` to search for exactly the declared string: "pencil" in all lowercase, which is only found in one string
[1] 1
grep(pattern = '(?i)pencil', x = statements) # using regex allows for flexible searches, such as a case-insensitive search for "P/pencil", which, unlike "pencil" is found in in both strings
[1] 1 2
# Replace the first instance of a pattern (Co. --> Company)
revised <- sub(pattern = 'Co\\.', replacement = 'Company', x = statements)
revised
[1] "Great Pencil Company primarily sells pencils of the following grades: HB; B; and 3B."
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."
# Replace all instances of a pattern (; --> ,)
revised2 <- gsub(pattern = ';', replacement = ',', x = revised)
revised2
[1] "Great Pencil Company primarily sells pencils of the following grades: HB, B, and 3B."
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."
# Find and replace a pattern using regex (3B --> 2B)
final <- sub(pattern = '\\d{1}', replacement = '2', x = revised2)
final
[1] "Great Pencil Company primarily sells pencils of the following grades: HB, B, and 2B."
[2] "Great Pencil Company has its headquarters in Maine, and Great Pencil Co. has supplied the Northeast for decades."
Those functions can be used to trim excess (or all) white space in character strings.
spaced_string <- c('This string started out with too many spaces.')
# Replace all instances of >=2 spaces with single spaces
gsub(pattern = '\\s{2,}', replacement = ' ', x = spaced_string)
[1] "This string started out with too many spaces."
# Remove all white space
collapse_these <- c('9:00 - 10:15', '10:15 - 11:30', '11:30 - 12:00')
gsub(pattern = '\\s', replacement = '', x = collapse_these)
[1] "9:00-10:15" "10:15-11:30" "11:30-12:00"
The package stringi also provides an array of string-search and string-manipulation functions, including stri_detect()
, stri_replace()
, and stri_extract()
, all of which easily handle fixed and regex search patterns. For example:
library(stringi)
user_dat <- data.frame(name = c('Shire, Jane E', 'Winchester, Marcus L', 'Fox, Sal'),
id_number = c('aaa101', 'aaa102', 'aaa103'))
user_dat
name id_number
1 Shire, Jane E aaa101
2 Winchester, Marcus L aaa102
3 Fox, Sal aaa103
# Say we want to to eliminate the 'aaa' patterns from the user IDs and then add
# middle initials---for those users who have them---to the data frame
user_dat$id_number <- stri_replace(user_dat$id_number, regex = '^\\w{3}(?=\\d+)', replacement = '') # alternatively: stri_extract(user_dat$id_number, regex = '(?<=\\w{3})\\d+$')
user_dat$middle_initial <- stri_extract(user_dat$name, regex = '\\b\\w{1}$')
user_dat
name id_number middle_initial
1 Shire, Jane E 101 E
2 Winchester, Marcus L 102 L
3 Fox, Sal 103 <NA>
4.10 Change case
How to change the case of strings. The most common case transformations are lower case, upper case, and title case.
Python
The lower()
, upper()
, and title()
functions convert strings to lower, upper, and title case, respectively. We can use a list comprehension to apply these functions to each string in a list.
R
The tolower()
and toupper()
functions convert case to lower and upper, respectively.
[1] "MPG" "CYLINDERS" "DISP" "HP" "AXLE_RATIO"
[6] "WEIGHT" "QSEC" "ENGINE" "AM" "GEAR"
[11] "CARB" "AM_CH" "AM_FAC"
[1] "mpg" "cylinders" "disp" "hp" "axle_ratio"
[6] "weight" "qsec" "engine" "am" "gear"
[11] "carb" "am_ch" "am_fac"
The stringr package provides a convenient title-case conversion function, str_to_title()
, which capitalizes the first letter of each string.
[1] "Mpg" "Cylinders" "Disp" "Hp" "Axle_ratio"
[6] "Weight" "Qsec" "Engine" "Am" "Gear"
[11] "Carb" "Am_ch" "Am_fac"
4.11 Drop duplicate rows
How to find and drop duplicate elements.
Python
The duplicated()
function determines which rows of a DataFrame are duplicates of previous rows.
First, we create a DataFrame with a duplicate row by using the pandas concat()
function. concat()
combines DataFrames by rows or columns (by rows is the default).
# Create DataFrame with duplicate rows
import pandas as pd
mtcars2 = pd.concat([mtcars.iloc[0:3,0:6], mtcars.iloc[0:1,0:6]])
The duplicated()
function returns a logical vector. TRUE indicates a row is a duplicate of a previous row.
0 False
1 False
2 False
0 True
dtype: bool
R
The duplicated()
function “determines which elements of a vector or data frame are duplicates of elements with smaller subscripts” (from ?duplicated
).
# Create data frame with duplicate rows
mtcars2 <- rbind(mtcars[1:3,1:6], mtcars[1,1:6])
# Last row is duplicate of first
mtcars2
mpg cylinders disp hp axle_ratio weight
1 21.0 6 160 110 3.90 2.620
2 21.0 6 160 110 3.90 2.875
3 22.8 4 108 93 3.85 2.320
4 21.0 6 160 110 3.90 2.620
The duplicated()
function returns a logical vector. TRUE
indicates that a given element is a duplicate of a previous one (and here, element = row).
[1] FALSE FALSE FALSE TRUE
The TRUE
/FALSE
vector can be used to extract or drop duplicate rows. Since TRUE
in indexing brackets will keep a row, we can use !
to negate the logicals and keep those that are not TRUE.
mpg cylinders disp hp axle_ratio weight
1 21.0 6 160 110 3.90 2.620
2 21.0 6 160 110 3.90 2.875
3 22.8 4 108 93 3.85 2.320
mpg cylinders disp hp axle_ratio weight
4 21 6 160 110 3.9 2.62
The anyDuplicated()
function returns the row number of duplicate rows.
[1] 4
4.12 Format dates
With formatted dates, we can calculate elapsed times, extract components of a date, properly order the names of months, and more.
Python
The Python module datetime can be used to create various date and time objects. Here, we discuss four of the main classes within datetime
The first class is the date()
class. This creates a date object whose only attributes are year, month, and day.
Below, we create a date object using the date()
class. The attributes are specified as integers in the arguments of date()
in the following order: date(year, month, day)
.
2001-04-12
To get today’s date, we can use the date.today()
function:
2023-06-14
Note that the output of both x
and today
are only year-month-day because they are date objects.
We can extract each of these attributes (year, month, day) from the date object as follows:
2023
6
14
Next is the time()
class. This class creates time objects containing information about only a time. The attributes that go into the time()
class are hours, minutes, and seconds in that order. Like the date class, these attributes must be input as integers.
11:34:56
If you want a time object containing only hours and minutes, or only seconds, etc. you can specify the attributes by name when creating the time object.
10:00:00
00:55:00
Again, similar to the date class, we can extract hour, minute, and second attributes from time-classed objects:
11
34
56
0
The datetime()
class that creates a datetime object containing information about both date and time. The attributes must be input as integers wiht the following default order: year, month, day, hour, minute, and second. Like the date and time classes, we can specify specific attributes in the arguments using the names thereof. If we don’t specify any time components, the datetime object defaults to time 00:00:00.
1981-04-12 11:34:56
# Input attributes using attribute names (any order)
z2= dt.datetime(year = 2021, day = 6, month = 12, hour = 6)
print(z2)
2021-12-06 06:00:00
1981-04-12 00:00:00
Again, we can extract attributes in exactly the same way as we did for the date and time classes.
1981
12
11
The final class discussed here is the timedelta class. This class is used to store date/time differences between date objects.
The default settings for a timedelta object are as follows: timedelta(weeks = 0, days = 0, hours = 0, minutes = 0, seconds = 0, milliseconds = 0, microseconds = 0)
.
To add or subtract dates and times to/from these objects, we can use operators like +
and -
:
2023-06-14 13:44:22.573430
2024-12-15 13:44:22.573430
2023-06-14 08:44:22.573430
Finally, we will discuss how to convert strings to datetime objects and vice versa.
The attribute strftime()
converts datetime objects to strings. In the argument of strftime()
, you can specify the format you would like.
datetime.datetime(2023, 6, 14, 13, 44, 22, 573430)
'Wednesday 06 2023'
'Wed 06 23'
The attribute strptime()
converts strings into datetime objects. In the argument of strptime()
, you must specify the string and then the format of the string.
datetime.datetime(1998, 10, 27, 11, 3, 9, 33000)
R
Dates in R can be stored as a Date class or a Date-Time class. Dates are stored as the number of days since January 1, 1970. Date-Times are stored as the number of seconds since January 1, 1970. With dates stored in this manner, we can calculate elapsed time in units such as days, weeks, hours, minutes, and so forth.
Below are the dates of the first five NASA Columbia Space Shuttle flights entered as a character vector.
R does not immediately recognize these as a Date class. To format as a Date class, we can use either the base R as.Date()
function or one of the convenience functions in the lubridate package. The as.Date()
function requires a specified POSIX conversion specification as documented in ?strptime
. Below, the conversion code %d %B %Y
indicates that the date is entered as a two-digit day of month (%d
), a full month name (%B
), and a year with century (%Y
).
[1] "1981-04-12" "1981-11-12" "1982-03-22" "1982-06-27" "1982-11-11"
The dates now print in year-month-day format; however, they are stored internally as number of days since January 1, 1970. This can be seen by using as.numeric()
on the date1
vector.
[1] 4119 4333 4463 4560 4697
The lubridate package provides a series of functions that are permutations of the letters “m”, “d”, and “y” to represent the order of date components. To format the original “date” vector, we use the dmy()
function since the date components are ordered as day, month and year. We must load the lubridate package to use this function.
[1] "1981-04-12" "1981-11-12" "1982-03-22" "1982-06-27" "1982-11-11"
When dates are formatted, we can easily extract information such as the day of week or the month. For example, to extract the day of week of the launches as an ordered factor, we can use the lubridate function wday()
with label = TRUE
and abbr = FALSE
.
[1] Sunday Thursday Monday Sunday Thursday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
To calculate elapsed time between launches in days, we can use the base R diff()
function.
Time differences in days
[1] 214 130 97 137
To store a date as a Date-Time class, we need to include a time component. Below are the first five Columbia launch dates with times. (UTC refers to Universal Coordinated Time.)
datetime <- c("12 April 1981 12:00:04 UTC",
"12 November 1981 15:10:00 UTC",
"22 March 1982 16:00:00 UTC",
"27 June 1982 15:00:00 UTC",
"11 November 1982 12:19:00 UTC")
To format as a Date-Time class, we can use either the base R as.POSIXct()
function or one of the convenience functions in the lubridate package. To use as.POSIXct()
, we need to include additional POSIX conversion specifications for the hour, minute, and second of launch. The %H:%M:%S
specification refers to hours, minutes, and seconds. The tz
argument specifies the time zone of the times.
[1] "1981-04-12 12:00:04 UTC" "1981-11-12 15:10:00 UTC"
[3] "1982-03-22 16:00:00 UTC" "1982-06-27 15:00:00 UTC"
[5] "1982-11-11 12:19:00 UTC"
When we use as.numeric()
on the datetime1
vector, we see it is stored as number of seconds since January 1, 1970.
[1] 355924804 374425800 385660800 394038000 405865140
Using lubridate, we can append _hms()
to any of the “mdy” functions to format dates with time components as a Date-Time class. Notice the default time zone in lubridate is UTC.
[1] "1981-04-12 12:00:04 UTC" "1981-11-12 15:10:00 UTC"
[3] "1982-03-22 16:00:00 UTC" "1982-06-27 15:00:00 UTC"
[5] "1982-11-11 12:19:00 UTC"
To calculate the elapsed time between launches in hours, we can use the lubridate function time_length()
with unit = "hours"
. Below, we use diff()
and then pipe the result to time_length()
.
[1] 5139.166 3120.833 2327.000 3285.317
For more information on working with dates and times in R, see the vignette accompanying the lubridate package.
4.13 Randomly sample rows
How to take a random sample of rows from a data frame. The sample is usually either a fixed size or a proportion.
Python
The pandas package provides a function for taking a sample of fixed size or a proportion, sample()
. To sample with replacement, set replace = TRUE
.
By default, the random sample will change every time the code is run. To always generate the same “random” sample, set random_state
to any positive integer.
To create a sample with a fixed number of rows, use the n
argument.
MPG CYL DISP HP AXLE_RATIO ... VS AM GEAR CARB KPL
1 21.0 6 160.0 110 3.90 ... 0 1 4 4 4.464286
4 18.7 8 360.0 175 3.15 ... 0 0 3 2 3.975340
31 21.4 4 121.0 109 4.11 ... 1 1 4 2 4.549320
26 26.0 4 120.3 91 4.43 ... 0 1 5 2 5.527211
11 16.4 8 275.8 180 3.07 ... 0 0 3 3 3.486395
[5 rows x 12 columns]
To create a sample of a proportion, use the frac
argument.
MPG CYL DISP HP AXLE_RATIO ... VS AM GEAR CARB KPL
27 30.4 4 95.1 113 3.77 ... 1 1 5 2 6.462585
3 21.4 6 258.0 110 3.08 ... 1 0 3 1 4.549320
22 15.2 8 304.0 150 3.15 ... 0 0 3 2 3.231293
18 30.4 4 75.7 52 4.93 ... 1 1 4 2 6.462585
23 13.3 8 350.0 245 3.73 ... 0 0 3 4 2.827381
17 32.4 4 78.7 66 4.08 ... 1 1 4 1 6.887755
[6 rows x 12 columns]
The NumPy function random.choice()
in combination with the loc()
function can be used to sample from a DataFrame.
The random.choice()
function creates a random sample according to the given parameters. The loc()
function is used to access rows and columns by index.
import numpy as np
# Create a random sample of size five with replacement
random_sample = np.random.choice(len(mtcars), (5, ), replace=True)
# Use random_sample to sample from mtcars
mtcars.loc[random_sample, ]
MPG CYL DISP HP AXLE_RATIO ... VS AM GEAR CARB KPL
25 27.3 4 79.0 66 4.08 ... 1 1 4 1 5.803571
0 21.0 6 160.0 110 3.90 ... 0 1 4 4 4.464286
10 17.8 6 167.6 123 3.92 ... 1 0 4 4 3.784014
5 18.1 6 225.0 105 2.76 ... 1 0 3 1 3.847789
27 30.4 4 95.1 113 3.77 ... 1 1 5 2 6.462585
[5 rows x 12 columns]
As before, setting random.seed()
to a positive integer will ensure that the same “random” sample is generated each time the code is run.
MPG CYL DISP HP AXLE_RATIO WT QSEC VS AM GEAR CARB KPL
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8 3.188776
13 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3 3.231293
30 15.0 8 301.0 335 3.54 3.57 14.60 0 1 5 8 3.188776
2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1 4.846939
28 15.8 8 351.0 264 4.22 3.17 14.50 0 1 5 4 3.358844
R
There are many ways to sample rows from a data frame in R. The dplyr package provides a convenience function, slice_sample()
, for taking either a fixed sample size or a proportion.
mpg cylinders disp hp axle_ratio weight qsec engine am gear carb am_ch
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 1
2 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 0
3 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 0
4 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 0
5 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 0
am_fac
1 manual
2 automatic
3 automatic
4 automatic
5 automatic
mpg cylinders disp hp axle_ratio weight qsec engine am gear carb am_ch
1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
2 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 0
3 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 0
4 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 0
5 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0
6 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 0
am_fac
1 manual
2 automatic
3 automatic
4 automatic
5 automatic
6 automatic
To sample with replacement, set replace = TRUE
.
The base R functions sample()
and runif()
can be combined to sample fixed sizes or approximate proportions.
# Sample five rows from mtcars
# First, get five random row numbers
i <- sample(nrow(mtcars), size = 5)
# Then, use i to subset those rows
mtcars[i, ]
mpg cylinders disp hp axle_ratio weight qsec engine am gear carb am_ch
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 0
18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 1
7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 0
am_fac
27 manual
3 manual
4 automatic
18 manual
7 automatic
# Sample about 20% of rows from mtcars
# First, generate random values on range of [0,1]
i <- runif(nrow(mtcars))
# Then, use i < 0.20 logical vector to select rows that correspond to TRUE
mtcars[i < 0.20, ]
mpg cylinders disp hp axle_ratio weight qsec engine am gear carb am_ch
3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 0
17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 0
19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 1
27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1
30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 1
am_fac
3 manual
13 automatic
17 automatic
19 manual
27 manual
30 manual
The random sample will change every time the code is run. To always generate the same “random” sample, use the set.seed()
function with any positive integer.
mpg cylinders disp hp axle_ratio weight qsec engine am gear carb am_ch
6 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1 0
15 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4 0
18 32.4 4 78.7 66 4.08 2.20 19.47 1 1 4 1 1
30 19.7 6 145.0 175 3.62 2.77 15.50 0 1 5 6 1
am_fac
6 automatic
15 automatic
18 manual
30 manual