This task is embarrassingly parallel, as explored in a previous post.

    import numpy as np
    import pandas as pd
    import time
    from scipy.stats import pearsonr
    from pyspark import SparkContext, SparkConf
    from scipy.sparse import coo_matrix

    ## The measurement (input data) is specified in a matrix
    ## samples x variables
    m = 150
    n = 1000
    measurements = np.random.rand(m*n).reshape((m,n))

    nThreads = [1,2,4,6,8,10,12,14,16]
    dt = np.zeros(len(nThreads))
    for i in range(len(nThreads)):
        ## Parameters
        NMACHINES = nThreads[i]
        NPARTITIONS = NMACHINES*4
        conf = (SparkConf() .
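To make "embarrassingly parallel" concrete, here is a minimal sketch. It assumes the workload is pairwise Pearson correlation between the variables (columns) of the measurement matrix, which the `pearsonr` import suggests but this excerpt does not confirm. The helper names (`pearson`, `correlate_pair`, `pairwise_serial`, `pairwise_parallel`) are hypothetical, and a standard-library thread pool stands in for Spark so that the example runs anywhere.

```python
# Hypothetical sketch: the workload (pairwise Pearson correlation) is an
# assumption inferred from the imports, not something the excerpt confirms.
import math
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlate_pair(columns, pair):
    """One work unit: it reads only its own two columns and shares no state."""
    i, j = pair
    return (i, j, pearson(columns[i], columns[j]))


def pairwise_serial(columns):
    """Process every column pair one after another."""
    pairs = combinations(range(len(columns)), 2)
    return [correlate_pair(columns, p) for p in pairs]


def pairwise_parallel(columns, n_workers=4):
    """Hand the same work units to a pool of workers; map preserves order."""
    pairs = list(combinations(range(len(columns)), 2))
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda p: correlate_pair(columns, p), pairs))


random.seed(0)
m, n = 20, 6  # samples x variables, far smaller than the post's 150 x 1000
columns = [[random.random() for _ in range(m)] for _ in range(n)]

serial = pairwise_serial(columns)
parallel = pairwise_parallel(columns)
```

Because no work unit depends on another, the serial and parallel runs produce identical results, and the pairs can be split across any number of workers or partitions. Threads in CPython will not speed up this pure-Python arithmetic because of the global interpreter lock, so the sketch illustrates the independence of the work units rather than the timings the post measures.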


Trang Tran


Student

USA