Parallel correlation calculation with Spark

November 15, 2017

Correlation calculation is an embarrassingly parallel task, as explored in a previous post. import numpy as np import pandas as pd import time from scipy.stats import pearsonr from pyspark import SparkContext, SparkConf from scipy.sparse import coo_matrix ## The measurement (input data) is specified in a matrix ## samples x variables m = 150 n = 1000 measurements = np.random.rand(m*n).reshape((m,n)) nThreads = [1,2,4,6,8,10,12,14,16] dt = np.zeros(len(nThreads)) for i in range(len(nThreads)): ## Parameters NMACHINES = nThreads[i] NPARTITIONS = NMACHINES*4 conf = (SparkConf() .

Trang Tran

Student

USA

Transforming normal to uniform distribution

Mar 3, 2021

Problem Given a random variable $X \sim \mathcal{N}(\mu,\sigma^2)$, find a transformation $f: X \rightarrow Y$ such that $Y \sim Uniform(a,b)$. Solution Let $\Phi_Z(\cdot)$ be the cumulative distribution function of the standardized variable $Z$ defined below. \[ \begin{eqnarray} Z \equiv \frac{X - \mu}{\sigma};\quad Z &\sim& \mathcal{N}(0,1) \\ \Phi_{Z}\left(\frac{X-\mu}{\sigma}\right) &\sim& Uniform(0,1) \\ (b-a) \Phi_{Z}\left(\frac{X-\mu}{\sigma}\right) &\sim& Uniform(0,b-a) \\ a + (b-a) \Phi_{Z}\left(\frac{X-\mu}{\sigma}\right) &\sim& Uniform(a,b) \end{eqnarray} \] In conclusion, $Y \equiv a + (b-a) \Phi_Z\left(\frac{X-\mu}{\sigma}\right)$. Computational demonstration norm2unif = function(x, mu = 0, sigma = 1, min = 0, max = 1, use.
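The transformation can also be sketched in Python using only the standard library. This is a minimal illustration, not the post's R function: the name `norm2unif` and the first three parameters mirror the excerpt, while `a`, `b`, the seed, and the test distribution $\mathcal{N}(5, 2^2)$ are assumptions chosen for the example.

```python
import random
from statistics import NormalDist

def norm2unif(x, mu=0.0, sigma=1.0, a=0.0, b=1.0):
    """Map a draw x from N(mu, sigma^2) to Uniform(a, b)."""
    z = (x - mu) / sigma        # standardize: Z ~ N(0, 1)
    u = NormalDist().cdf(z)     # standard normal CDF: U ~ Uniform(0, 1)
    return a + (b - a) * u      # rescale to Uniform(a, b)

# Assumed example: 10,000 draws from N(5, 2^2) mapped to Uniform(-1, 3).
rng = random.Random(0)
draws = [rng.gauss(5.0, 2.0) for _ in range(10_000)]
unif = [norm2unif(x, mu=5.0, sigma=2.0, a=-1.0, b=3.0) for x in draws]
sample_mean = sum(unif) / len(unif)   # should be close to (a + b) / 2
```

A draw equal to the mean $\mu$ maps to the midpoint $(a+b)/2$, which is a quick sanity check on the implementation.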

A tabulated list of Markdown editors

Dec 12, 2020

Sources https://www.linuxjournal.com/content/mark-text-vs-typora-best-markdown-editor-linux

Speeding up random sampling of an array in R

Nov 11, 2020

Problem statement Given a set $S = \{s_1, s_2, \dots, s_n\}$, one would like to sample a subset $X \subset S$ of size $m$. If this operation needs to be repeated a very large number of times $k$, what is the most efficient way? set_S = c(1:100) microbenchmark::microbenchmark(sample(set_S, size = 50), times = 10) ## Unit: microseconds ## expr min lq mean median uq max neval ## sample(set_S, size = 50) 5.
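For readers who want to reproduce the baseline outside R, the repeated-sampling setup can be sketched in Python with the standard library. This only restates the problem (drawing $k$ subsets of size $m$ without replacement); it is not the speed-up discussed in the post, and the value $k = 1000$ and the seed are assumptions.

```python
import random

rng = random.Random(42)
set_S = list(range(1, 101))   # S = {1, ..., 100}, as in the R example
m, k = 50, 1000               # subset size and number of repetitions (k is assumed)

# Baseline: one independent call per repetition, sampling without replacement.
samples = [rng.sample(set_S, m) for _ in range(k)]
```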

PCA, SVD and Eigen decomposition

Oct 10, 2020

Principal component analysis (PCA) is a popular method for dimensionality reduction, and has been invented independently many times in different fields, resulting in various definitions. This note attempts to unify two of the most popular definitions of principal components and illustrate how PCA can be computed under each. Alternative definitions PCA from the SVD of the centered matrix The singular value decomposition (SVD) of an $n\times d$ matrix $X$ has the form

Sweep vs Matrix multiplication

Apr 4, 2020

Sweeping along an axis can be represented by matrix multiplication. Given the matrix $A$ and diagonal matrix $D$, $DA$ is equivalent to multiplying each row $i$ of $A$ by $d_{ii}$, and $AD$ is equivalent to multiplying each column $j$ of $A$ by $d_{jj}$. A = matrix(runif(50000),ncol=100) w = apply(A, 1, norm, '2') all(abs(sweep(A,1, w, '/') - (diag(1/w) %*% A) ) < .Machine$double.eps) ## [1] TRUE It is reasonable to expect that the sweeping operation on individual row/column vectors will be more efficient than the equivalent matrix operation, because no additional memory is required to store the off-diagonal entries of $D$.
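The same equivalence can be checked in plain Python, without any matrix library. The sketch below is an illustration under assumed sizes (a 3 x 4 matrix and a fixed seed), not a translation of the R benchmark: it scales each row by its Euclidean norm and compares the result with the explicit product $DA$, where $D = \mathrm{diag}(1/w)$.

```python
import math
import random

rng = random.Random(1)
A = [[rng.random() for _ in range(4)] for _ in range(3)]   # small 3 x 4 matrix
w = [math.sqrt(sum(v * v for v in row)) for row in A]      # Euclidean norm of each row

# "Sweep": divide every entry of row i by w[i].
swept = [[v / w[i] for v in row] for i, row in enumerate(A)]

# Equivalent matrix product D A, with D = diag(1 / w) stored densely.
n, p = len(A), len(A[0])
D = [[1.0 / w[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
DA = [[sum(D[i][k] * A[k][j] for k in range(n)) for j in range(p)] for i in range(n)]

# Largest entrywise difference between the two results.
max_diff = max(abs(s - d) for rs, rd in zip(swept, DA) for s, d in zip(rs, rd))
```

The dense `D` holds $n^2$ entries of which only $n$ are non-zero, which is the extra storage the sweep avoids.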

SVD in different languages

Feb 2, 2020

Defining utility functions burd = colorRampPalette(colors = c("blue", "white", "red"))(n = 499) blues = colorRampPalette(colors = c('#deebf7', '#08306b'))(n = 256) plot.matrix = function(m, col = burd, asp=1) { m %>% apply(MARGIN = 2, rev) %>% t() %>% image(useRaster = TRUE, axes = FALSE, col = col, asp = asp) } parse_timing_output = function(output_raw) { sapply(output_raw, function(x) { str = stringr::str_split(x,":\\s+")[[1]] return(as.numeric(str[2])) }) } An arbitrary matrix sin2d = function(a, b) { sin((a/ 500 - b / 15) * pi) } start = proc.

Multi-core parallel computing in Julia

Feb 2, 2020

As of this writing, Julia supports three types of concurrency: Coroutines, Multi-Threading, and Multi-Core or Distributed Processing. This post will explore multicore parallelization in Julia. Using multiple cores in julia If more than one core is to be used in Julia, this must be specified, either when starting Julia, using -p <n_cpus>, for example julia -p 8 # to use 8 cores or by adding processors in an interactive session

Affinity propagation - step by step

Oct 10, 2019

This article will walk you through a step-by-step implementation of affinity propagation, a message-passing clustering algorithm by Frey and Dueck [@Frey:2007:Clustering]. Step-by-step Input data Given a similarity matrix S = rbind(c(1.0, 0.8, 0.7, 0.2, 0.5), c(0.8, 1.0, 0.75, 0.3, 0.3), c(0.7, 0.75, 1., -0.1, 0.4), c(0.2, 0.3, -0.1, 1.0, 0.8), c(0.5, 0.3, 0.4, 0.8, 1.0)) %>% set_colnames(c('A', 'B', 'C', 'D', 'E')) %>% set_rownames(c('A', 'B', 'C', 'D', 'E')) image(S, col = cm.
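As a compact companion to the step-by-step walk-through, the message-passing updates can be sketched in plain Python. This is an illustrative version of the standard responsibility and availability updates, not the post's R code: the function name `affinity_propagation`, the damping factor of 0.5, and the fixed count of 200 iterations are assumptions, and no convergence check is performed.

```python
def affinity_propagation(S, damping=0.5, n_iter=200):
    """Return, for each point i, the index of its chosen exemplar."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]   # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]   # availabilities a(i, k)
    for _ in range(n_iter):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(k, k) = sum over i' != k of max(0, r(i', k))
        # a(i, k) = min(0, r(k, k) + sum over i' not in {i, k} of max(0, r(i', k)))
        for i in range(n):
            for k in range(n):
                if i == k:
                    new_a = sum(max(0.0, R[j][k]) for j in range(n) if j != k)
                else:
                    support = sum(max(0.0, R[j][k]) for j in range(n) if j not in (i, k))
                    new_a = min(0.0, R[k][k] + support)
                A[i][k] = damping * A[i][k] + (1 - damping) * new_a
    # Point i picks the k that maximizes a(i, k) + r(i, k).
    return [max(range(n), key=lambda k: A[i][k] + R[i][k]) for i in range(n)]

# The similarity matrix from the text (rows and columns A to E).
S = [[1.0, 0.8, 0.7, 0.2, 0.5],
     [0.8, 1.0, 0.75, 0.3, 0.3],
     [0.7, 0.75, 1.0, -0.1, 0.4],
     [0.2, 0.3, -0.1, 1.0, 0.8],
     [0.5, 0.3, 0.4, 0.8, 1.0]]
exemplars = affinity_propagation(S)
```

In this sketch the diagonal of `S` plays the role of the preference. With every diagonal entry at 1.0, the largest value in the matrix, each point ends up as its own exemplar; lowering the diagonal generally yields fewer clusters.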

Configure Nginx as a reverse proxy for Rstudio server

Apr 4, 2019

There is already a guide on how to set up Nginx as a reverse proxy for Rstudio server here. This guide attempts to go further, by making sure that Rstudio server is accessible via https. This guide was tested on Ubuntu 16.04 LTS and Ubuntu 20.04, so make sure you adapt the commands to your system. It assumes that your machine already has Nginx and Rstudio server up and running. After any change in the configuration, you may restart the servers using these commands.

Fast SVD

Apr 4, 2019

Singular value decomposition is an expensive operation. For rectangular matrices with significantly different dimensions, i.e. very “fat” or “thin” matrices, there is a trick to make the computation cheaper. This trick is implemented in fast.svd() of the R package corpcor. Calculate SVD The singular value decomposition of a matrix $M$ of size $m \times n$ is \[ M = UDV^T \] It follows that \[ \begin{align} MM^T &= (UDV^T)(UDV^T)^T \\ &= (UDV^T)V(UD)^T \\ &= UD (V^TV) (UD)^T \\ &= UD(UD)^T \quad (V\text{ is orthogonal}) \\ &= UDD^TU^T \\ \end{align} \] Thus the eigendecomposition of $MM^T$ gives $U$ and $D^2$.
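One way to complete the picture, sketched here as an assumption rather than taken from the package: for a “fat” matrix ($m \le n$) with a thin SVD in which $U$ is $m \times m$ and orthogonal and $D$ is $m \times m$ with non-zero diagonal entries, $V$ can be recovered from $U$ and $D$ without decomposing the large $n \times n$ matrix $M^TM$.

\[ \begin{align} U^TM &= U^TUDV^T \\ &= DV^T \quad (U\text{ is orthogonal}) \\ V^T &= D^{-1}U^TM \\ V &= M^TUD^{-1} \quad (D\text{ is diagonal, so } D^{-T} = D^{-1}) \end{align} \]

Only the $m \times m$ matrix $MM^T$ needs to be decomposed, which is where the saving comes from when $m \ll n$. For a “thin” matrix the roles of $U$ and $V$ are swapped and $M^TM$ is decomposed instead.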