Non-trivial operation on data.table columns

This note explores the use of data.table package to calculate pairwise correlation between columns, with iris data set as example.

library(data.table)
DT = data.table(iris)

The iris data is now data.table-ized

##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---                                                            
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica

To calculate the correlation between each pair of variables across the whole data set, a regular cor call is sufficient

everything = cor(DT[, !c('Species'),with=FALSE],method='pearson')

##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

If we want to calculate the correlation between each pair of variables within each Species, we can choose to partition by Species first, and repeat the call to cor(). If there are \(k\) different species, we will need to investigate \(k\) correlation matrices, each of which is of dimension \(n\times n\).

species = unique(DT[,Species])
for (sp in species) {
    print(paste("Correlation within Species", sp))
    print(cor(DT[Species==sp,!c('Species'),with=FALSE], method='pearson'))
}

## [1] "Correlation within Species setosa"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.7425467    0.2671758   0.2780984
## Sepal.Width     0.7425467   1.0000000    0.1777000   0.2327520
## Petal.Length    0.2671758   0.1777000    1.0000000   0.3316300
## Petal.Width     0.2780984   0.2327520    0.3316300   1.0000000
## [1] "Correlation within Species versicolor"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.5259107    0.7540490   0.5464611
## Sepal.Width     0.5259107   1.0000000    0.5605221   0.6639987
## Petal.Length    0.7540490   0.5605221    1.0000000   0.7866681
## Petal.Width     0.5464611   0.6639987    0.7866681   1.0000000
## [1] "Correlation within Species virginica"
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.4572278    0.8642247   0.2811077
## Sepal.Width     0.4572278   1.0000000    0.4010446   0.5377280
## Petal.Length    0.8642247   0.4010446    1.0000000   0.3221082
## Petal.Width     0.2811077   0.5377280    0.3221082   1.0000000

Many times we are only interested in the correlation of one variable with the remaining. The desired output corresponds to one row (or column) of the correlation matrix. In most cases, calculating the whole matrix can be computationally expensive, and unnecessary. To calculate only the values of interest and get a correlation matrix by Species, there are several ways

Loop over the partitions of data set, calculate the correlation, and bind all the result together

target = "Sepal.Length"
allCols = setdiff(names(DT), "Species")
cm = list()
for (sp in species) {
    cm[[sp]] =  sapply(allCols, function(x) cor(DT[Species==sp, .SD, .SDcols=c(target,x)])[1,2])
}
cm.mat = sapply(cm,c)
print(cm.mat)

##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077

The final binding can be done either with sapply, do.call(rbind), or do.call(cbind)

sapply(cm,c)

##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077

do.call(rbind,cm)

##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa                1   0.7425467    0.2671758   0.2780984
## versicolor            1   0.5259107    0.7540490   0.5464611
## virginica             1   0.4572278    0.8642247   0.2811077

do.call(cbind,cm)

##                 setosa versicolor virginica
## Sepal.Length 1.0000000  1.0000000 1.0000000
## Sepal.Width  0.7425467  0.5259107 0.4572278
## Petal.Length 0.2671758  0.7540490 0.8642247
## Petal.Width  0.2780984  0.5464611 0.2811077

It is tempting to exploit the power of data.table group-by. However, I haven’t found a way to make it work in a reproducible manner.

For example, the code below attempts to apply a correlation function on (x,y), with x being the variables in the data set, and y is the target variable, but it could not subset the target variable correctl

corr.with.me = function(x,group) {
    y = DT[Species==group, Sepal.Length]
    print("y:")
    print(y)
    return(cor(x,y,use="complete.obs"))
    # return(cor(x, DT[Species==group, .SD, .SDcols=c("Sepal.Length")], method='pearson',use = "complete.obs"))
}

DT[,lapply(.SD,corr.with.me, .BY),by=Species]

## [1] "y:"
## numeric(0)

## Error in cor(x, y, use = "complete.obs"): incompatible dimensions

Non-trivial operation on data.table columns

Trang Tran

Transforming normal to uniform distribution

A tabulated list of Markdown editors

Speeding up random sampling of an array in R

PCA, SVD and Eigen decomposition

Sweep vs Matrix multiplication

SVD in different languages

Multi-core parallel computing in Julia

Affinity propagation - step by step

Configure Nginx as a reverse proxy for Rstudio server

Fast SVD