Motivation

Let's look at the time it takes to calculate all pairwise correlations for \(n\) variables, with \(m = 200\) samples.

```
##     n           dt (s)
## 1e+02     1.433043e+00
## 1e+03     1.359290e+02
## 2e+03     5.371534e+02
## 1e+05     1.230446e+06
```

Given the timing above, and the extrapolated timing for \(10^{5}\) genes, which is roughly the order of the number of genes/transcripts in a transcriptomic profile, it would take roughly 14 days (\(1.23 \times 10^{6}\) seconds) to compute all pairwise correlations.
The example is taken from Chapter 17 of ???. Let \(V = (X,Y,Z)\) be represented by a directed graph with edges \(X \to Y\), \(X \to Z\), and \(Y \to Z\).
Discrete random variables

Let \(V = (X,Y,Z)\) have the following joint distribution:
\[ \renewcommand{\vector}[1]{\mathbf{#1}} \newcommand{\matrix}[1]{\mathbf{#1}} \newcommand{\E}[1]{\mathbb{E}{\left(#1\right)}} \begin{align} X &\sim Bernoulli(1/2) \\ Y|X=x &\sim Bernoulli\left(\frac{e^{4x-2}}{1 + e^{4x-2}}\right) \\ Z|X=x, Y=y &\sim Bernoulli\left(\frac{e^{2(x+y)-2}}{1 + e^{2(x+y)-2}}\right) \end{align} \]
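The joint distribution can be simulated empirically by drawing each variable conditional on its parents in the graph. A minimal R sketch (the seed and sample size are assumptions, not from the original):

```r
# Simulate the joint distribution of V = (X, Y, Z) defined above.
set.seed(42)   # assumed seed, for reproducibility
n <- 1e5       # assumed sample size

expit <- function(u) exp(u) / (1 + exp(u))  # inverse logit

x <- rbinom(n, size = 1, prob = 1/2)
y <- rbinom(n, size = 1, prob = expit(4 * x - 2))
z <- rbinom(n, size = 1, prob = expit(2 * (x + y) - 2))

# Empirical check: P(Y = 1 | X = 1) should be close to expit(2)
mean(y[x == 1])
```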
Many high-level programming languages allow their users the luxury of extending an existing matrix or vector in place. The question is: how expensive is that luxury?
The two functions below both return an \(m \times n\) matrix by calling a random generator \(n\) times. The first preallocates the whole matrix with zeros and fills in the values column by column; the second grows the result by appending one column at a time.
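The two approaches can be sketched as follows (function names and dimensions are assumptions; the original code is not shown):

```r
m <- 200; n <- 500   # assumed dimensions

# Preallocate the full matrix, then fill it column by column.
fill_prealloc <- function(m, n) {
  res <- matrix(0, nrow = m, ncol = n)
  for (j in seq_len(n)) res[, j] <- rnorm(m)
  res
}

# Grow the result by appending one column at a time;
# each cbind copies everything accumulated so far (quadratic cost).
fill_grow <- function(m, n) {
  res <- NULL
  for (j in seq_len(n)) res <- cbind(res, rnorm(m))
  res
}

system.time(fill_prealloc(m, n))
system.time(fill_grow(m, n))   # expected to be noticeably slower
```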
The data.table package supports a powerful syntax to select rows and columns.
Selecting a single column

```r
library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris), size = 10, replace = FALSE), ]
DT = data.table(iris)
DT
```

```
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##  1:          4.9         3.0          1.4         0.2    setosa
##  2:          4.6         3.6          1.0         0.2    setosa
##  3:          7.2         3.0          5.8         1.6 virginica
##  4:          5.4         3.4          1.5         0.4    setosa
##  5:          6.7         3.1          5.6         2.4 virginica
## ...
```
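A single column can be selected either as a plain vector or as a one-column data.table, a standard data.table idiom sketched below:

```r
library(data.table)
DT <- data.table(iris)

v  <- DT[, Sepal.Length]      # unquoted column name -> plain numeric vector
d1 <- DT[, .(Sepal.Length)]   # .() (alias for list()) -> one-column data.table
d2 <- DT[, "Sepal.Length"]    # quoted column name -> one-column data.table

class(v)    # "numeric"
class(d1)   # "data.table" "data.frame"
```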
This note explores the use of the data.table package to calculate pairwise correlations between columns, using the iris data set as an example.
```r
library(data.table)
DT = data.table(iris)
```

The iris data is now data.table-ized:
```
##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica
```
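One straightforward way to get all pairwise correlations between the numeric columns is to build the correlation matrix and flatten its upper triangle into long form. A sketch (variable names here are assumptions, not the note's original code):

```r
library(data.table)
DT <- data.table(iris)

# Keep only the numeric columns
num_cols <- names(DT)[sapply(DT, is.numeric)]
cmat <- cor(as.matrix(DT[, .SD, .SDcols = num_cols]))

# Flatten into long pairwise form, keeping each unordered pair once
idx <- which(upper.tri(cmat), arr.ind = TRUE)
pairs <- data.table(var1 = rownames(cmat)[idx[, 1]],
                    var2 = colnames(cmat)[idx[, 2]],
                    cor  = cmat[idx])
pairs
```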
.Internal(sample()) requires exactly four arguments, in this order: n, size, replace, and prob.
If prob is not NULL, the first argument has to be an integer. To achieve output equivalent to that of sample, we need to map the sampled integers back to the desired values.
```r
internal_boolean <- function(size, replace, prob) {
  s = .Internal(sample(2, size, replace, prob))
  return(s < 2)
}

internal_boolean_rle <- function(size, replace, prob) {
  s = .Internal(sample(2, size, replace, prob))
  return(rle(s < 2))
}

N = 100000
probs = c(0.0001, 1 - 0.0001)
```
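As a quick sanity check (a sketch, with an assumed seed; the function is restated here so the snippet is self-contained), the mapped output is a logical vector whose TRUE rate tracks the first probability:

```r
# Map .Internal(sample()) output (integers in 1..2) back to logicals.
internal_boolean <- function(size, replace, prob) {
  s <- .Internal(sample(2, size, replace, prob))
  s < 2   # 1 -> TRUE (drawn with prob[1]), 2 -> FALSE (prob[2])
}

set.seed(1)   # assumed seed
N <- 100000
probs <- c(0.0001, 1 - 0.0001)
b <- internal_boolean(N, TRUE, probs)

is.logical(b)   # TRUE
mean(b)         # close to probs[1]
```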
The MASS package provides a function, fitdistr, to fit observations to a given discrete distribution by maximum likelihood.
Obtaining data

We first need to generate some data to fit. The rnegbin(n, mu, theta) function can be used to generate n samples from a negative binomial distribution with mean mu and variance mu + mu^2 / theta.
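Putting the two together, generation and fitting might look like this (the seed, sample size, and parameter values mu = 10, theta = 2 are assumptions for illustration):

```r
library(MASS)

set.seed(7)     # assumed seed
n     <- 5000   # assumed sample size
mu    <- 10     # assumed mean
theta <- 2      # assumed dispersion

# Generate negative binomial samples: mean mu, variance mu + mu^2 / theta
x <- rnegbin(n, mu = mu, theta = theta)

# Fit by maximum likelihood under two assumed families
fit_pois <- fitdistr(x, densfun = "Poisson")
fit_nb   <- fitdistr(x, densfun = "negative binomial")

fit_nb$estimate   # "size" (theta) and "mu" estimates
```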