Efficient `reduce` in R

When you need to do a Reduce operation on a list with a function that accepts any number of arguments, such as rbind() or cbind(), it’s more efficient to use do.call().

Combining rows

    library(microbenchmark)
    library(magrittr)  # provides %>%

    a = lapply(1:100, rnorm, n = 50)
    microbenchmark(Reduce(rbind, a), do.call(rbind, a)) %>% boxplot(unit = 'ms', boxwex = 0.2)

Combining columns

    microbenchmark(Reduce(cbind, a), do.call(cbind, a)) %>% boxplot(unit = 'ms', boxwex = 0.2)
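A minimal sketch of the same comparison using only base R, assuming `system.time()` over a few repetitions is an acceptable stand-in for microbenchmark. The variable names `by_reduce` and `by_docall` are illustrative, not from the original post. The likely reason for the gap is that `Reduce` calls `rbind` once per element and copies the growing matrix each time, while `do.call` makes a single call.

```r
set.seed(1)
a <- lapply(1:100, rnorm, n = 50)  # 100 numeric vectors of length 50

# Reduce: 99 successive rbind calls, each one copying the growing result
t_reduce <- system.time(for (i in 1:20) by_reduce <- Reduce(rbind, a))

# do.call: a single rbind call that receives all 100 vectors at once
t_docall <- system.time(for (i in 1:20) by_docall <- do.call(rbind, a))

# The values agree; dimnames may differ, so compare after unname()
same_values <- isTRUE(all.equal(unname(by_reduce), unname(by_docall)))

print(rbind(Reduce = t_reduce, do.call = t_docall)[, "elapsed"])
```

Timings depend on the machine, so the sketch prints them instead of assuming a particular speed-up.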

Importance sampling

\[ \DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\E}{\mathbb{E}} \]

An example

The example below is taken from [1]. Let \(X\) be a random variable with uniform distribution on \([0,10]\),

\[ X \sim \mathrm{Uniform}(0,10) \]

Consider the function \(h(x) = 10 e^{-2|x-5|}\). Suppose we want to calculate \(\E_X[h(X)]\). By definition, with the density \(f(x) = 1/10\) on \([0,10]\),

\[\begin{align} \E_X[h(X)] &= \int_{0}^{10} h(x) f(x)\,dx \\ &= \int_{0}^{10} \exp(-2|x-5|)\,dx \end{align}\]

A straightforward way to do this is sampling \(X_i\) from the Uniform(0,10) density and calculating the mean of \(h(X_i) = 10 e^{-2|X_i-5|}\)
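The straightforward estimator can be sketched in R as follows. The second half is an illustration of importance sampling with a N(5, 1) proposal; that proposal is an assumption made here for illustration and is not stated in the excerpt. The reference value \(1 - e^{-10}\) comes from evaluating the integral in closed form.

```r
set.seed(42)
n <- 1e5
h <- function(x) 10 * exp(-2 * abs(x - 5))

# Closed-form value of the integral of exp(-2|x - 5|) over [0, 10]
exact <- 1 - exp(-10)

# Plain Monte Carlo: draw X_i ~ Uniform(0, 10) and average h(X_i)
x_unif <- runif(n, min = 0, max = 10)
mc_estimate <- mean(h(x_unif))

# Importance sampling (assumed proposal: N(5, 1)):
# draw from the proposal and reweight by target density / proposal density
x_norm <- rnorm(n, mean = 5, sd = 1)
w <- dunif(x_norm, min = 0, max = 10) / dnorm(x_norm, mean = 5, sd = 1)
is_terms <- h(x_norm) * w
is_estimate <- mean(is_terms)

print(c(exact = exact, mc = mc_estimate, is = is_estimate))
print(c(sd_mc = sd(h(x_unif)), sd_is = sd(is_terms)))
```

Under this assumed proposal most draws land near \(x = 5\), where \(h\) is large, so the per-sample spread should come out smaller than with uniform sampling.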

\[ \newcommand{\matrix}[1]{\mathbf{#1}} \]

Let \(\matrix{A}\) be a data set of \(m\) points in \(\mathbb{R}^d\). One application of SVD is to create a compressed representation of \(\matrix{A}\). The rank-\(k\) approximation of \(\matrix{A}\) is created by calculating the singular value decomposition of \(\matrix{A}\)

\[ \matrix{A} = \matrix{U}\matrix{\Sigma}\matrix{V^T} \]

and reconstructing it from the first \(k \leq d\) singular values and the corresponding singular vectors:

\[ \matrix{A_k} = \matrix{U_k}\matrix{\Sigma_k}\matrix{V_k^T} \]

SVD in R

Implementations of SVD differ in how they represent the output.
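As a sketch of the rank-\(k\) reconstruction with base R's `svd()`: it returns a list with the singular values in `d` and the singular vectors in `u` and `v`, where `v` is not transposed. The sizes `m`, `d`, `k` and the random data below are illustrative assumptions.

```r
set.seed(1)
m <- 20; d <- 5; k <- 2
A <- matrix(rnorm(m * d), nrow = m, ncol = d)

s <- svd(A)  # s$d: singular values, s$u: m x d, s$v: d x d (V, not V^T)

# Keep the first k singular values and vectors
U_k <- s$u[, 1:k, drop = FALSE]
S_k <- diag(s$d[1:k], nrow = k)   # nrow = k keeps this a matrix when k = 1
V_k <- s$v[, 1:k, drop = FALSE]

A_k <- U_k %*% S_k %*% t(V_k)     # rank-k approximation, same shape as A

# Frobenius error equals the root sum of squares of the dropped singular values
err <- sqrt(sum((A - A_k)^2))
expected_err <- sqrt(sum(s$d[(k + 1):d]^2))
print(c(err = err, expected = expected_err))
```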

Motivation

Let’s look at the time it takes to calculate all pairwise correlations for \(n\) variables, with \(m = 200\) samples.

    n      dt
    1e+02  1.433043e+00
    1e+03  1.359290e+02
    2e+03  5.371534e+02
    1e+05  1.230446e+06

Given the timing above, and the extrapolated timing for \(10^{5}\) genes, which is roughly the order of the number of genes/transcripts in a transcriptomic profile, it would take 14.
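A minimal sketch of how such a timing might be taken, assuming one `cor()` call per pair of columns; the excerpt does not show the code that was timed, so this is an illustration, not the original benchmark. The number of pairs is \(n(n-1)/2\), which is why the cost grows roughly quadratically in \(n\).

```r
set.seed(1)
m <- 200   # samples, as in the text
n <- 50    # variables; kept small so the sketch runs quickly
X <- matrix(rnorm(m * n), nrow = m, ncol = n)

pairs <- combn(n, 2)  # 2 x n(n-1)/2 matrix of column index pairs

# One cor() call per pair
dt <- system.time(
  r <- apply(pairs, 2, function(p) cor(X[, p[1]], X[, p[2]]))
)["elapsed"]

# Cross-check against the full correlation matrix computed in one call
R <- cor(X)
r_from_matrix <- R[t(pairs)]

print(c(n_pairs = ncol(pairs), elapsed = unname(dt)))
```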

Allocation cost

Many high-level programming languages allow their users the luxury of extending an existing matrix or vector. The question is, how costly is that luxury? The two functions below both return an \(m \times n\) matrix by calling a random generator \(n\) times. The first function initializes the whole matrix with zeros and then fills in the values. The second extends the result one column at a time.
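The two strategies could look roughly like this in R. The function names `fill_preallocated` and `fill_by_growing` and the choice of `rnorm()` as the random generator are assumptions for illustration; this is not the post's own code.

```r
# Strategy 1: allocate the full m x n matrix of zeros, then fill each column
fill_preallocated <- function(m, n) {
  out <- matrix(0, nrow = m, ncol = n)
  for (j in seq_len(n)) out[, j] <- rnorm(m)
  out
}

# Strategy 2: start empty and extend by one column per iteration;
# each cbind() allocates a new matrix and copies the old one
fill_by_growing <- function(m, n) {
  out <- NULL
  for (j in seq_len(n)) out <- cbind(out, rnorm(m))
  out
}

# Same seed, same generator calls, so both should hold the same values
set.seed(7); A1 <- fill_preallocated(100, 200)
set.seed(7); A2 <- fill_by_growing(100, 200)

t_pre  <- system.time(fill_preallocated(100, 2000))["elapsed"]
t_grow <- system.time(fill_by_growing(100, 2000))["elapsed"]
print(c(preallocated = unname(t_pre), growing = unname(t_grow)))
```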

The data.table package supports a powerful syntax to select rows and columns.

Selecting a single column

    library(data.table)
    data("iris")
    iris = iris[sample.int(nrow(iris), size = 10, replace = FALSE), ]
    DT = data.table(iris)
    DT

    ##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    ## 1:          4.9         3.0          1.4         0.2    setosa
    ## 2:          4.6         3.6          1.0         0.2    setosa
    ## 3:          7.2         3.0          5.8         1.6 virginica
    ## 4:          5.4         3.4          1.5         0.4    setosa
    ## 5:          6.7         3.1          5.6         2.4 virginica
    ## 6:          5.
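A short sketch of common ways to select a single column from a data.table, assuming the data.table package is installed. Which of these forms the full post covers is not visible in the excerpt, so treat the list as illustrative.

```r
library(data.table)

DT <- data.table(iris)

# As a plain vector
v1 <- DT[, Species]       # unquoted column name in j
v2 <- DT[["Species"]]     # list-style extraction

# As a one-column data.table
d1 <- DT[, .(Species)]    # .() is an alias for list()
d2 <- DT[, "Species"]     # quoted name in j

# Column name held in a variable: the .. prefix looks it up in the calling scope
col <- "Species"
d3 <- DT[, ..col]

str(v1)
print(names(d1))
```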

This note explores the use of the data.table package to calculate pairwise correlations between columns, with the iris data set as an example.

    library(data.table)
    DT = data.table(iris)

The iris data is now data.table-ized:

    ##      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    ##   1:          5.1         3.5          1.4         0.2  setosa
    ##   2:          4.9         3.0          1.4         0.2  setosa
    ##   3:          4.7         3.2          1.3         0.2  setosa
    ##   4:          4.6         3.1          1.5         0.2  setosa
    ##   5:          5.0         3.6          1.4         0.2  setosa
    ##  ---
    ## 146:          6.
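One possible way to get the pairwise correlations, sketched under the assumption that only the numeric columns are of interest. The excerpt is cut off before the post shows its own approach, so this is not necessarily the author's method; the names `num_cols`, `R`, and `pair_cor` are illustrative.

```r
library(data.table)

DT <- data.table(iris)

# Keep the numeric columns only (Species is a factor)
num_cols <- names(DT)[sapply(DT, is.numeric)]

# Full correlation matrix of the numeric columns
R <- cor(DT[, num_cols, with = FALSE])

# Long format: one row per unordered pair of columns
idx <- which(upper.tri(R), arr.ind = TRUE)
pair_cor <- data.table(
  var1 = rownames(R)[idx[, 1]],
  var2 = colnames(R)[idx[, 2]],
  r    = R[idx]
)

print(pair_cor)
```

With four numeric columns in iris, this gives six pairs.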

Trang Tran


Student

USA