Motivation

Let's look at the time it takes to calculate all pairwise correlations for \(n\) variables, with \(m = 200\) samples.

```
##     n           dt (s)
## 1e+02     1.433043e+00
## 1e+03     1.359290e+02
## 2e+03     5.371534e+02
## 1e+05     1.230446e+06
```

Given the timing above, and the extrapolated timing for \(10^{5}\) genes, which is roughly the order of the number of genes/transcripts in a transcriptomic profile, it would take roughly 14 days (\(1.23 \times 10^{6}\) seconds) to compute all pairwise correlations.
The example is taken from Chapter 17 of ???. Let \(V = (X,Y,Z)\) be represented by a directed graph with edges \(X \to Y\), \(X \to Z\), and \(Y \to Z\).
Discrete random variables

Let \(V = (X,Y,Z)\) have the following joint distribution:
\[ \renewcommand{\vector}[1]{\mathbf{#1}} \newcommand{\matrix}[1]{\mathbf{#1}} \newcommand{\E}[1]{\mathbb{E}{\left(#1\right)}} \begin{align} X &\sim Bernoulli(1/2) \\ Y|X=x &\sim Bernoulli\left(\frac{e^{4x-2}}{1 + e^{4x-2}}\right) \\ Z|X=x, Y=y &\sim Bernoulli\left(\frac{e^{2(x+y)-2}}{1 + e^{2(x+y)-2}}\right) \end{align} \]
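The joint distribution can be simulated empirically by drawing each variable conditional on its parents in the graph. A minimal R sketch (the seed and sample size are assumptions, not from the original):

```r
# Simulate the joint distribution of V = (X, Y, Z) defined above.
set.seed(42)   # assumed seed, for reproducibility
n <- 1e5       # assumed sample size

expit <- function(u) exp(u) / (1 + exp(u))  # inverse logit

x <- rbinom(n, size = 1, prob = 1/2)
y <- rbinom(n, size = 1, prob = expit(4 * x - 2))
z <- rbinom(n, size = 1, prob = expit(2 * (x + y) - 2))

# Empirical check: P(Y = 1 | X = 1) should be close to expit(2)
mean(y[x == 1])
```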
Many high-level programming languages allow their users the luxury of extending an existing matrix or vector in place. The question is: how expensive is that luxury?
The two functions below both return an \(m \times n\) matrix by calling a random generator \(n\) times. The first preallocates the whole matrix with zeros and fills in the values column by column; the second grows the result by appending one column at a time.
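The two approaches can be sketched as follows (function names and dimensions are assumptions; the original code is not shown):

```r
m <- 200; n <- 500   # assumed dimensions

# Preallocate the full matrix, then fill it column by column.
fill_prealloc <- function(m, n) {
  res <- matrix(0, nrow = m, ncol = n)
  for (j in seq_len(n)) res[, j] <- rnorm(m)
  res
}

# Grow the result by appending one column at a time;
# each cbind copies everything accumulated so far (quadratic cost).
fill_grow <- function(m, n) {
  res <- NULL
  for (j in seq_len(n)) res <- cbind(res, rnorm(m))
  res
}

system.time(fill_prealloc(m, n))
system.time(fill_grow(m, n))   # expected to be noticeably slower
```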
The data.table package supports a powerful syntax to select rows and columns.
Selecting a single column

```r
library(data.table)
data("iris")
iris = iris[sample.int(nrow(iris), size = 10, replace = FALSE), ]
DT = data.table(iris)
DT
```

```
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##  1:          4.9         3.0          1.4         0.2    setosa
##  2:          4.6         3.6          1.0         0.2    setosa
##  3:          7.2         3.0          5.8         1.6 virginica
##  4:          5.4         3.4          1.5         0.4    setosa
##  5:          6.7         3.1          5.6         2.4 virginica
## ...
```
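A single column can be selected either as a plain vector or as a one-column data.table, a standard data.table idiom sketched below:

```r
library(data.table)
DT <- data.table(iris)

v  <- DT[, Sepal.Length]      # unquoted column name -> plain numeric vector
d1 <- DT[, .(Sepal.Length)]   # .() (alias for list()) -> one-column data.table
d2 <- DT[, "Sepal.Length"]    # quoted column name -> one-column data.table

class(v)    # "numeric"
class(d1)   # "data.table" "data.frame"
```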
This note explores the use of the data.table package to calculate pairwise correlations between columns, using the iris data set as an example.
```r
library(data.table)
DT = data.table(iris)
```

The iris data is now data.table-ized:
```
##      Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
##   1:          5.1         3.5          1.4         0.2    setosa
##   2:          4.9         3.0          1.4         0.2    setosa
##   3:          4.7         3.2          1.3         0.2    setosa
##   4:          4.6         3.1          1.5         0.2    setosa
##   5:          5.0         3.6          1.4         0.2    setosa
##  ---
## 146:          6.7         3.0          5.2         2.3 virginica
## 147:          6.3         2.5          5.0         1.9 virginica
## 148:          6.5         3.0          5.2         2.0 virginica
## 149:          6.2         3.4          5.4         2.3 virginica
## 150:          5.9         3.0          5.1         1.8 virginica
```
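One straightforward way to get all pairwise correlations between the numeric columns is to build the correlation matrix and flatten its upper triangle into long form. A sketch (variable names here are assumptions, not the note's original code):

```r
library(data.table)
DT <- data.table(iris)

# Keep only the numeric columns
num_cols <- names(DT)[sapply(DT, is.numeric)]
cmat <- cor(as.matrix(DT[, .SD, .SDcols = num_cols]))

# Flatten into long pairwise form, keeping each unordered pair once
idx <- which(upper.tri(cmat), arr.ind = TRUE)
pairs <- data.table(var1 = rownames(cmat)[idx[, 1]],
                    var2 = colnames(cmat)[idx[, 2]],
                    cor  = cmat[idx])
pairs
```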
.Internal(sample()) requires exactly four arguments, in this order: n, size, replace, and prob.
If prob is not NULL, the first argument has to be an integer. To achieve output equivalent to that of sample, we need to map the sampled integers back to the desired values.
```r
internal_boolean <- function(size, replace, prob) {
  s = .Internal(sample(2, size, replace, prob))
  return(s < 2)
}

internal_boolean_rle <- function(size, replace, prob) {
  s = .Internal(sample(2, size, replace, prob))
  return(rle(s < 2))
}

N = 100000
probs = c(0.0001, 1 - 0.0001)
```
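As a quick sanity check (a sketch, with an assumed seed; the function is restated here so the snippet is self-contained), the mapped output is a logical vector whose TRUE rate tracks the first probability:

```r
# Map .Internal(sample()) output (integers in 1..2) back to logicals.
internal_boolean <- function(size, replace, prob) {
  s <- .Internal(sample(2, size, replace, prob))
  s < 2   # 1 -> TRUE (drawn with prob[1]), 2 -> FALSE (prob[2])
}

set.seed(1)   # assumed seed
N <- 100000
probs <- c(0.0001, 1 - 0.0001)
b <- internal_boolean(N, TRUE, probs)

is.logical(b)   # TRUE
mean(b)         # close to probs[1]
```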
The MASS package provides a function, fitdistr, to fit observations to a given discrete distribution by maximum likelihood.
Obtaining data

We first need to generate some data to fit. The rnegbin(n, mu, theta) function can be used to generate n samples from a negative binomial distribution with mean mu and variance mu + mu^2 / theta.
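Putting the two together, generation and fitting might look like this (the seed, sample size, and parameter values mu = 10, theta = 2 are assumptions for illustration):

```r
library(MASS)

set.seed(7)     # assumed seed
n     <- 5000   # assumed sample size
mu    <- 10     # assumed mean
theta <- 2      # assumed dispersion

# Generate negative binomial samples: mean mu, variance mu + mu^2 / theta
x <- rnegbin(n, mu = mu, theta = theta)

# Fit by maximum likelihood under two assumed families
fit_pois <- fitdistr(x, densfun = "Poisson")
fit_nb   <- fitdistr(x, densfun = "negative binomial")

fit_nb$estimate   # "size" (theta) and "mu" estimates
```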