Motivation · Parallelization · Simple parallelization: one variable per worker · Massive parallelization: a chunk of pairs per worker

Motivation

Let's look at the time it takes to calculate all pairwise correlations for \(n\) variables, with \(m = 200\) samples.

n       dt (seconds)
1e+02   1.433043e+00
1e+03   1.359290e+02
2e+03   5.371534e+02
1e+05   1.230446e+06

Given the timing above, and the extrapolated timing for \(10^{5}\) genes, which is roughly the order of the number of genes/transcripts in a transcriptomic profile, it would take roughly 14 days.
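The quadratic growth in the table can be seen directly: the number of pairs scales as \(n(n-1)/2\). The post's timings are from R; the following is a minimal Python sketch (with made-up random data) of timing all pairwise correlations for increasing \(n\):

```python
import time

import numpy as np

def all_pairwise_corr(X):
    # rowvar=False: columns are variables, rows are samples,
    # so the result is an n x n correlation matrix.
    return np.corrcoef(X, rowvar=False)

m = 200
timings = {}
for n in (100, 500, 1000):
    X = np.random.rand(m, n)
    t0 = time.perf_counter()
    C = all_pairwise_corr(X)
    timings[n] = time.perf_counter() - t0
```

Plotting `timings` against `n` (or against `n * (n - 1) / 2`) makes the quadratic scaling visible; extrapolating it to \(n = 10^{5}\) is what motivates the parallelization schemes above.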

Continue reading

Discrete random variables · Analytical estimation of probabilities · Empirical solution · Intervention · Continuous random variables · Conditioning by observation · Conditioning by intervention · Joint distribution of \((Y,Z)\) · Simulation of joint distribution \((Z,Y)\) · References

The example is taken from Chapter 17 ???. Let \(V = (X,Y,Z)\) be represented by the following graph.

Discrete random variables

Let \(V = (X,Y,Z)\) have the following joint distribution
\[
\renewcommand{\vector}[1]{\mathbf{#1}}
\newcommand{\matrix}[1]{\mathbf{#1}}
\newcommand{\E}[1]{\mathbb{E}{\left(#1\right)}}
\begin{align}
X &\sim Bernoulli(1/2) \\
Y|X=x &\sim Bernoulli\left(\frac{e^{4x-2}}{1 + e^{4x-2}}\right) \\
Z|X=x, Y=y &\sim Bernoulli\left(\frac{e^{2(x+y)-2}}{1 + e^{2(x+y)-2}}\right)
\end{align}
\]
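The conditional Bernoulli probabilities above are logistic functions of the parents, so the joint distribution can be simulated by drawing each variable in graph order. The post's simulation is in R; this is a Python analog of the same three-step draw:

```python
import math
import random

def sigmoid(t):
    # e^t / (1 + e^t), the success probability used by all three variables
    return 1.0 / (1.0 + math.exp(-t))

def draw(rnd):
    x = int(rnd.random() < 0.5)                         # X ~ Bernoulli(1/2)
    y = int(rnd.random() < sigmoid(4 * x - 2))          # Y | X = x
    z = int(rnd.random() < sigmoid(2 * (x + y) - 2))    # Z | X = x, Y = y
    return x, y, z

rnd = random.Random(1)
samples = [draw(rnd) for _ in range(100000)]
p_x = sum(s[0] for s in samples) / len(samples)
p_y_given_x1 = (sum(1 for x, y, _ in samples if x == 1 and y == 1)
                / sum(1 for x, _, _ in samples if x == 1))
```

The empirical frequencies should match the analytical values, e.g. \(P(Y=1 \mid X=1) = e^{2}/(1+e^{2}) \approx 0.88\).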

Continue reading

Allocation cost

Many high-level programming languages afford their users the luxury of extending an existing matrix or vector. The question is: how expensive can that luxury be? The two functions below each return an \(m \times n\) matrix by calling a random generator \(n\) times. The first initializes the whole matrix with zeros and then fills in the values. The second grows the result one column at a time.
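The post's two functions are in R; the same contrast can be sketched in Python with NumPy, where growing an array also reallocates and copies on every extension:

```python
import time

import numpy as np

def fill_preallocated(m, n):
    # Allocate the full m x n matrix once, then fill column by column.
    out = np.zeros((m, n))
    for j in range(n):
        out[:, j] = np.random.rand(m)
    return out

def grow_by_column(m, n):
    # Extend the result one column at a time; each hstack copies
    # everything accumulated so far, giving O(n^2) copied elements.
    out = np.empty((m, 0))
    for _ in range(n):
        out = np.hstack([out, np.random.rand(m, 1)])
    return out

m, n = 2000, 200
t0 = time.perf_counter(); A = fill_preallocated(m, n); t_fill = time.perf_counter() - t0
t0 = time.perf_counter(); B = grow_by_column(m, n); t_grow = time.perf_counter() - t0
```

Both return a matrix of the same shape; the difference shows up in `t_fill` versus `t_grow` as \(n\) increases.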

Continue reading

The data.table package supports a powerful syntax to select rows and columns.

Selecting a single column

library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris), size = 10, replace = FALSE), ]
DT = data.table(iris)
DT
##    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 1:          4.9         3.0          1.4         0.2    setosa
## 2:          4.6         3.6          1.0         0.2    setosa
## 3:          7.2         3.0          5.8         1.6 virginica
## 4:          5.4         3.4          1.5         0.4    setosa
## 5:          6.7         3.1          5.6         2.4 virginica
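For comparison, the single-column selection the excerpt describes has a close pandas analog (illustrative only; the column values below are made up, and the post itself works in R's data.table):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of iris.
df = pd.DataFrame({
    "Sepal.Length": [4.9, 4.6, 7.2, 5.4, 6.7],
    "Species": ["setosa", "setosa", "virginica", "setosa", "virginica"],
})

col = df["Sepal.Length"]     # a Series, like DT[, Sepal.Length] returning a vector
sub = df[["Sepal.Length"]]   # a one-column DataFrame, like DT[, .(Sepal.Length)]
```

As in data.table, the bracket syntax decides whether you get back a bare column or a one-column table.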

Continue reading

This note explores the use of the data.table package to calculate pairwise correlation between columns, with the iris data set as an example.

library(data.table)
DT = data.table(iris)

The iris data is now data.table-ized:

##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---
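The end goal of the post, a matrix of all pairwise column correlations, has a compact pandas equivalent. This sketch uses synthetic data standing in for iris's numeric columns (the column names mirror iris; the values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Sepal.Length": rng.normal(5.8, 0.8, 150),
    "Sepal.Width":  rng.normal(3.0, 0.4, 150),
    "Petal.Length": rng.normal(3.8, 1.8, 150),
    "Petal.Width":  rng.normal(1.2, 0.8, 150),
})

# All pairwise Pearson correlations between numeric columns.
corr = df.corr()
```

`corr` is a symmetric 4 x 4 matrix with ones on the diagonal, the same object the data.table approach builds pair by pair.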

Continue reading

.Internal(sample)

.Internal(sample()) requires exactly four arguments, in order: n, size, replacement, probabilities. If probabilities is not NULL, the first argument has to be an integer. To achieve output equivalent to that of sample, we need to map the sampled integers back to the desired values.

internal_boolean <- function(size, replace, prob) {
    s = .Internal(sample(2, size, replace, prob))
    return(s < 2)
}
internal_boolean_rle <- function(size, replace, prob) {
    s = .Internal(sample(2, size, replace, prob))
    return(rle(s < 2))
}
N = 100000
probs = c(0.0001, 1 - 0.0001)
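The trick of sampling integers and mapping them back to booleans is language-agnostic. This is a Python analog of `internal_boolean` (a sketch, not R's internal sampler): draw from {1, 2} with the given weights, then compare against 2.

```python
import random

def internal_boolean(size, probs, seed=0):
    # Sample integers 1 or 2 with the given probabilities,
    # then map them to booleans via the comparison s < 2.
    rnd = random.Random(seed)
    s = rnd.choices([1, 2], weights=probs, k=size)
    return [v < 2 for v in s]

N = 100000
probs = [0.0001, 1 - 0.0001]
b = internal_boolean(N, probs)
frac_true = sum(b) / N   # should be near probs[0]
```

An `rle`-style compression of the result, as in `internal_boolean_rle`, could be sketched with `itertools.groupby` over `b`.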

Continue reading

Obtaining data · Fitting with a pre-determined distribution · The effects of sample size · Goodness-of-fit · Assuming Poisson distribution · Assuming NB distribution

The package MASS provides a function, fitdistr, to fit observations to a discrete distribution using maximum likelihood.

Obtaining data

We first need to generate some data to fit. The rnegbin(n, mu, theta) function can be used to generate n samples of a negative binomial with mean mu and variance mu + mu^2 / theta.
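The relationship Var = mu + mu^2 / theta behind `rnegbin` can be checked numerically. This Python sketch generates negative binomial samples with NumPy (which uses the (n, p) parameterization, with n playing the role of theta) and recovers (mu, theta) by the method of moments, a simpler stand-in for fitdistr's maximum likelihood fit:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, mu = 2.0, 10.0

# Convert (mu, theta) to numpy's (n, p): n = theta, p = theta / (theta + mu),
# which gives mean = mu and variance = mu + mu^2 / theta.
p = theta / (theta + mu)
x = rng.negative_binomial(theta, p, size=200000)

# Method-of-moments estimates: invert Var = mu + mu^2 / theta.
mu_hat = x.mean()
var_hat = x.var()
theta_hat = mu_hat**2 / (var_hat - mu_hat)
```

With a large sample, `mu_hat` and `theta_hat` land close to the true (10, 2); fitdistr's MLE would do the same job with better statistical efficiency.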

Continue reading


Trang Tran


Student

USA