data.table subsetting

The data.table package supports a powerful syntax to select rows and columns.

Selecting a single column

library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris),size=10,replace = FALSE),]
DT = data.table(iris)
DT

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          4.9         3.0          1.4         0.2     setosa
##  2:          4.6         3.6          1.0         0.2     setosa
##  3:          7.2         3.0          5.8         1.6  virginica
##  4:          5.4         3.4          1.5         0.4     setosa
##  5:          6.7         3.1          5.6         2.4  virginica
##  6:          5.9         3.0          4.2         1.5 versicolor
##  7:          5.7         3.8          1.7         0.3     setosa
##  8:          6.1         2.8          4.0         1.3 versicolor
##  9:          5.2         3.4          1.4         0.2     setosa
## 10:          7.2         3.6          6.1         2.5  virginica

The syntax below returns a one-column data.table

DT[,.(Species)]

##        Species
##  1:     setosa
##  2:     setosa
##  3:  virginica
##  4:     setosa
##  5:  virginica
##  6: versicolor
##  7:     setosa
##  8: versicolor
##  9:     setosa
## 10:  virginica

This one returns a vector

DT[,Species]

##  [1] setosa     setosa     virginica  setosa     virginica  versicolor
##  [7] setosa     versicolor setosa     virginica 
## Levels: setosa versicolor virginica

However, when with=FALSE

DT[,"Species",with=FALSE]

it will always return a data.table

##        Species
##  1:     setosa
##  2:     setosa
##  3:  virginica
##  4:     setosa
##  5:  virginica
##  6: versicolor
##  7:     setosa
##  8: versicolor
##  9:     setosa
## 10:  virginica

This is because the input for column subset is always a vector, even with length 1. To get a single column as a vector, we can use list subsetting syntax, since data.table is also a data.frame

DT[["Species"]]

##  [1] setosa     setosa     virginica  setosa     virginica  versicolor
##  [7] setosa     versicolor setosa     virginica 
## Levels: setosa versicolor virginica

What if we also want the row-filtering power of data.table? It gets tricky because the syntax for row filtering only works with a column, not a column name.

This one works

DT[Species=="setosa"]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.9         3.0          1.4         0.2  setosa
## 2:          4.6         3.6          1.0         0.2  setosa
## 3:          5.4         3.4          1.5         0.4  setosa
## 4:          5.7         3.8          1.7         0.3  setosa
## 5:          5.2         3.4          1.4         0.2  setosa

But this one doesn’t

DT["Species" == "setosa"]

## Empty data.table (0 rows and 5 cols): Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species

To use a dynamic column name in row subsetting, we need to rely on [[

DT[DT[["Species"]] == 'setosa']

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.9         3.0          1.4         0.2  setosa
## 2:          4.6         3.6          1.0         0.2  setosa
## 3:          5.4         3.4          1.5         0.4  setosa
## 4:          5.7         3.8          1.7         0.3  setosa
## 5:          5.2         3.4          1.4         0.2  setosa

Does this compromise performance?

myCol = "Species"
bm = data.table(microbenchmark(times=100,
               DT[Species=='setosa'],
               DT[DT[[myCol]] == 'setosa'],
               iris[iris[[myCol]] == 'setosa',]))
knitr::kable(bm[,.("Mean (microsec)" =mean(time)/1000, "Median (microsec)" =median(time)/1000), by=expr])

expr	Mean (microsec)	Median (microsec)
iris[iris[[myCol]] == “setosa”, ]	113.4784	88.664
DT[Species == “setosa”]	202.5178	166.072
DT[DT[[myCol]] == “setosa”]	176.4317	147.644

The bulky syntax turned out to outperform the neater ones, and to my surprise, operation on data.frame is more efficient than on data.table.

Assigment Operator `:=`

Assignment (or sub-assignment) is done in place. So we should expect this to change the original data.table

DT[Species=='versicolor',Species:='versi']

By chaining the assignment to a previous selection, we’re only modifying the copy.

DT[Species=='virginica'][,Species:='virgi']

The original data.table remains unchanged.

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##  1:          4.9         3.0          1.4         0.2    setosa
##  2:          4.6         3.6          1.0         0.2    setosa
##  3:          7.2         3.0          5.8         1.6 virginica
##  4:          5.4         3.4          1.5         0.4    setosa
##  5:          6.7         3.1          5.6         2.4 virginica
##  6:          5.9         3.0          4.2         1.5     versi
##  7:          5.7         3.8          1.7         0.3    setosa
##  8:          6.1         2.8          4.0         1.3     versi
##  9:          5.2         3.4          1.4         0.2    setosa
## 10:          7.2         3.6          6.1         2.5 virginica

data.table subsetting

Selecting a single column

Assigment Operator :=

Trang Tran

Transforming normal to uniform distribution

A tabulated list of Markdown editors

Speeding up random sampling of an array in R

PCA, SVD and Eigen decomposition

Sweep vs Matrix multiplication

SVD in different languages

Multi-core parallel computing in Julia

Affinity propagation - step by step

Configure Nginx as a reverse proxy for Rstudio server

Fast SVD

Assigment Operator `:=`