data.table subsetting
The data.table
package supports a powerful syntax to select rows and columns.
Selecting a single column
library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris),size=10,replace = FALSE),]
DT = data.table(iris)
DT
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.9 3.0 1.4 0.2 setosa
## 2: 4.6 3.6 1.0 0.2 setosa
## 3: 7.2 3.0 5.8 1.6 virginica
## 4: 5.4 3.4 1.5 0.4 setosa
## 5: 6.7 3.1 5.6 2.4 virginica
## 6: 5.9 3.0 4.2 1.5 versicolor
## 7: 5.7 3.8 1.7 0.3 setosa
## 8: 6.1 2.8 4.0 1.3 versicolor
## 9: 5.2 3.4 1.4 0.2 setosa
## 10: 7.2 3.6 6.1 2.5 virginica
The syntax below returns a one-column data.table
DT[,.(Species)]
## Species
## 1: setosa
## 2: setosa
## 3: virginica
## 4: setosa
## 5: virginica
## 6: versicolor
## 7: setosa
## 8: versicolor
## 9: setosa
## 10: virginica
This one returns a vector
DT[,Species]
## [1] setosa setosa virginica setosa virginica versicolor
## [7] setosa versicolor setosa virginica
## Levels: setosa versicolor virginica
However, when with=FALSE
DT[,"Species",with=FALSE]
it will always return a data.table
## Species
## 1: setosa
## 2: setosa
## 3: virginica
## 4: setosa
## 5: virginica
## 6: versicolor
## 7: setosa
## 8: versicolor
## 9: setosa
## 10: virginica
This is because the input for column subset is always a vector, even with length 1.
To get a single column as a vector, we can use list subsetting syntax, since data.table
is also a data.frame
DT[["Species"]]
## [1] setosa setosa virginica setosa virginica versicolor
## [7] setosa versicolor setosa virginica
## Levels: setosa versicolor virginica
What if we also want the row-filtering power of data.table
? It gets tricky because the syntax for row filtering only works with a column
, not a column name.
This one works
DT[Species=="setosa"]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.9 3.0 1.4 0.2 setosa
## 2: 4.6 3.6 1.0 0.2 setosa
## 3: 5.4 3.4 1.5 0.4 setosa
## 4: 5.7 3.8 1.7 0.3 setosa
## 5: 5.2 3.4 1.4 0.2 setosa
But this one doesn’t
DT["Species" == "setosa"]
## Empty data.table (0 rows and 5 cols): Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
To use a dynamic column name in row subsetting, we need to rely on [[
DT[DT[["Species"]] == 'setosa']
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.9 3.0 1.4 0.2 setosa
## 2: 4.6 3.6 1.0 0.2 setosa
## 3: 5.4 3.4 1.5 0.4 setosa
## 4: 5.7 3.8 1.7 0.3 setosa
## 5: 5.2 3.4 1.4 0.2 setosa
Does this compromise performance?
myCol = "Species"
bm = data.table(microbenchmark(times=100,
DT[Species=='setosa'],
DT[DT[[myCol]] == 'setosa'],
iris[iris[[myCol]] == 'setosa',]))
knitr::kable(bm[,.("Mean (microsec)" =mean(time)/1000, "Median (microsec)" =median(time)/1000), by=expr])
expr | Mean (microsec) | Median (microsec) |
---|---|---|
iris[iris[[myCol]] == “setosa”, ] | 113.4784 | 88.664 |
DT[Species == “setosa”] | 202.5178 | 166.072 |
DT[DT[[myCol]] == “setosa”] | 176.4317 | 147.644 |
The bulky syntax turned out to outperform the neater ones, and to my surprise, operation on data.frame
is more efficient than on data.table
.
Assigment Operator :=
Assignment (or sub-assignment) is done in place. So we should expect this to change the original data.table
DT[Species=='versicolor',Species:='versi']
By chaining the assignment to a previous selection, we’re only modifying the copy.
DT[Species=='virginica'][,Species:='virgi']
The original data.table
remains unchanged.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1: 4.9 3.0 1.4 0.2 setosa
## 2: 4.6 3.6 1.0 0.2 setosa
## 3: 7.2 3.0 5.8 1.6 virginica
## 4: 5.4 3.4 1.5 0.4 setosa
## 5: 6.7 3.1 5.6 2.4 virginica
## 6: 5.9 3.0 4.2 1.5 versi
## 7: 5.7 3.8 1.7 0.3 setosa
## 8: 6.1 2.8 4.0 1.3 versi
## 9: 5.2 3.4 1.4 0.2 setosa
## 10: 7.2 3.6 6.1 2.5 virginica