The data.table package supports a powerful syntax to select rows and columns.

Selecting a single column

library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris),size=10,replace = FALSE),]
DT = data.table(iris)
DT
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
##  1:          4.9         3.0          1.4         0.2     setosa
##  2:          4.6         3.6          1.0         0.2     setosa
##  3:          7.2         3.0          5.8         1.6  virginica
##  4:          5.4         3.4          1.5         0.4     setosa
##  5:          6.7         3.1          5.6         2.4  virginica
##  6:          5.9         3.0          4.2         1.5 versicolor
##  7:          5.7         3.8          1.7         0.3     setosa
##  8:          6.1         2.8          4.0         1.3 versicolor
##  9:          5.2         3.4          1.4         0.2     setosa
## 10:          7.2         3.6          6.1         2.5  virginica

The syntax below returns a one-column data.table

DT[,.(Species)]
##        Species
##  1:     setosa
##  2:     setosa
##  3:  virginica
##  4:     setosa
##  5:  virginica
##  6: versicolor
##  7:     setosa
##  8: versicolor
##  9:     setosa
## 10:  virginica

This one returns a vector

DT[,Species]
##  [1] setosa     setosa     virginica  setosa     virginica  versicolor
##  [7] setosa     versicolor setosa     virginica 
## Levels: setosa versicolor virginica

However, when with=FALSE

DT[,"Species",with=FALSE]

it will always return a data.table

##        Species
##  1:     setosa
##  2:     setosa
##  3:  virginica
##  4:     setosa
##  5:  virginica
##  6: versicolor
##  7:     setosa
##  8: versicolor
##  9:     setosa
## 10:  virginica

This is because the input for column subset is always a vector, even with length 1. To get a single column as a vector, we can use list subsetting syntax, since data.table is also a data.frame

DT[["Species"]]
##  [1] setosa     setosa     virginica  setosa     virginica  versicolor
##  [7] setosa     versicolor setosa     virginica 
## Levels: setosa versicolor virginica

What if we also want the row-filtering power of data.table? It gets tricky because the syntax for row filtering only works with a column, not a column name.

This one works

DT[Species=="setosa"]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.9         3.0          1.4         0.2  setosa
## 2:          4.6         3.6          1.0         0.2  setosa
## 3:          5.4         3.4          1.5         0.4  setosa
## 4:          5.7         3.8          1.7         0.3  setosa
## 5:          5.2         3.4          1.4         0.2  setosa

But this one doesn’t

DT["Species" == "setosa"]
## Empty data.table (0 rows and 5 cols): Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species

To use a dynamic column name in row subsetting, we need to rely on [[

DT[DT[["Species"]] == 'setosa']
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1:          4.9         3.0          1.4         0.2  setosa
## 2:          4.6         3.6          1.0         0.2  setosa
## 3:          5.4         3.4          1.5         0.4  setosa
## 4:          5.7         3.8          1.7         0.3  setosa
## 5:          5.2         3.4          1.4         0.2  setosa

Does this compromise performance?

myCol = "Species"
bm = data.table(microbenchmark(times=100,
               DT[Species=='setosa'],
               DT[DT[[myCol]] == 'setosa'],
               iris[iris[[myCol]] == 'setosa',]))
knitr::kable(bm[,.("Mean (microsec)" =mean(time)/1000, "Median (microsec)" =median(time)/1000), by=expr])
expr Mean (microsec) Median (microsec)
iris[iris[[myCol]] == “setosa”, ] 113.4784 88.664
DT[Species == “setosa”] 202.5178 166.072
DT[DT[[myCol]] == “setosa”] 176.4317 147.644

The bulky syntax turned out to outperform the neater ones, and to my surprise, operation on data.frame is more efficient than on data.table.

Assigment Operator :=

Assignment (or sub-assignment) is done in place. So we should expect this to change the original data.table

DT[Species=='versicolor',Species:='versi']

By chaining the assignment to a previous selection, we’re only modifying the copy.

DT[Species=='virginica'][,Species:='virgi']

The original data.table remains unchanged.

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##  1:          4.9         3.0          1.4         0.2    setosa
##  2:          4.6         3.6          1.0         0.2    setosa
##  3:          7.2         3.0          5.8         1.6 virginica
##  4:          5.4         3.4          1.5         0.4    setosa
##  5:          6.7         3.1          5.6         2.4 virginica
##  6:          5.9         3.0          4.2         1.5     versi
##  7:          5.7         3.8          1.7         0.3    setosa
##  8:          6.1         2.8          4.0         1.3     versi
##  9:          5.2         3.4          1.4         0.2    setosa
## 10:          7.2         3.6          6.1         2.5 virginica