1. A Nonparametric Density Implementation in Spark

One of my previous blog posts covered nonparametric density estimation; there I presented some Matlab code. An advantage of this Spark implementation is that the estimation is fully parallel, since we only use built-in Spark procedures. Let $X_1, \dots, X_N$ be a random sample drawn from some distribution with an unknown density $f$. The key is to use data.cartesian(random_grid), which creates pairs $(X_i, g_j)$, where the $g_j$ form a predefined grid. Then, using map together with the Epanechnikov kernel $K(u) = \tfrac{3}{4}(1 - u^2)$ for $|u| \le 1$ (and $0$ otherwise), we get the contributions $\tfrac{1}{Nh} K\big(\tfrac{X_i - g_j}{h}\big)$. The final estimate $\hat{f}(g_j) = \tfrac{1}{Nh} \sum_{i=1}^{N} K\big(\tfrac{X_i - g_j}{h}\big)$ is then evaluated using reduceByKey.
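To make the pairing step concrete, here is a minimal sketch of the same cartesian/map/reduceByKey pattern on toy data; the distance-based "kernel" and the variable names are purely illustrative, and a running SparkContext sc is assumed:
# three sample points and a two-point grid
toy_data = sc.parallelize([1.0, 2.0, 4.0])
toy_grid = sc.parallelize([0.0, 3.0])
# cartesian() yields one (sample, grid point) pair per combination
pairs = toy_data.cartesian(toy_grid)  # e.g. (1.0, 0.0), (1.0, 3.0), ...
# map() turns each pair into (grid point, contribution);
# here a dummy "kernel" just measures the distance for illustration
contribs = pairs.map(lambda p: (p[1], abs(p[0] - p[1])))
# reduceByKey() sums the contributions per grid point
print(contribs.reduceByKey(lambda a, b: a + b).collect())
# -> [(0.0, 7.0), (3.0, 4.0)] (order may vary)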
### A Spark function to derive a nonparametric kernel density
import matplotlib.pyplot as plt
from numpy import *
##1.0 Simulated Data
N = 15000
mu, sigma = 2, 3  # mean and standard deviation of the simulated sample
rdd = sc.parallelize(random.normal(mu, sigma, N))
##2.0 The Function
def spark_density(data, Nout, bw):
    #2.1 Epanechnikov kernel, already scaled by the bandwidth b
    def epan_kernel(x, y, b):
        u = true_divide(x - y, b)
        return max(0, true_divide(1, b) * true_divide(3, 4) * (1 - u**2))
    #derive the minimum and maximum used for the interpolation grid
    mini = data.takeOrdered(1, lambda x: x)[0]
    maxi = data.takeOrdered(1, lambda x: -x)[0]
    #create an interpolation grid (in fact NOT random this time)
    random_grid = sc.parallelize(linspace(mini, maxi, num=Nout))
    Nin = data.count()
    #compute the kernel contribution K((X_i - g_j)/bw)/Nin for every (sample, grid point) pair
    kernl = data.cartesian(random_grid).map(
        lambda x: (float(x[1]), true_divide(epan_kernel(x[0], x[1], bw), Nin)))
    #sum up the contributions per grid point
    return kernl.reduceByKey(lambda y, x: y + x)
##3.0 Results
density = spark_density(rdd, 128, 0.8).collect()
dens = array(density).transpose()
#Plot the estimate
plt.plot(dens[0], dens[1], 'bo')
#plot the true density for comparison
axis2 = linspace(-10, 10, num=128)
plt.plot(axis2, 1 / (sigma * sqrt(2 * pi)) * exp(-(axis2 - mu)**2 / (2 * sigma**2)), linewidth=2, color='r')
plt.show()
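As a quick sanity check (a minimal sketch, not part of the original recipe), the collected (grid point, estimate) pairs can be sorted by grid point and integrated with the trapezoidal rule; a reasonable density estimate should integrate to roughly one:
#sort the collected (grid point, estimate) pairs by grid point
dens_sorted = array(sorted(density, key=lambda p: p[0])).transpose()
#trapezoidal rule over the grid; the value should be close to 1
print("integral of the estimate:", trapz(dens_sorted[1], dens_sorted[0]))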
Stupid question: can I just use R's `density` for this?
Sure, that is possible if you use R. In fact, R's `density` function comes with some nice additional features such as an automatic bandwidth choice. However, R has problems with big datasets, and this is where Spark comes into play: the algorithm shown above is scalable and able to handle very large amounts of data (if your Spark cluster is powerful enough). A simple rule-of-thumb bandwidth can even be computed with Spark itself, as sketched below.
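For readers who miss R's automatic choice, a data-driven bandwidth can be derived directly from the RDD. The sketch below uses Silverman's normal-reference rule; the helper name silverman_bandwidth is my own and not part of the original post:
def silverman_bandwidth(data):
    #Silverman's normal-reference rule: h = 1.06 * sigma_hat * n^(-1/5)
    n = data.count()
    sigma_hat = data.stdev()
    return 1.06 * sigma_hat * n ** (-0.2)

#plug the data-driven bandwidth into the estimator from above
bw = silverman_bandwidth(rdd)
density = spark_density(rdd, 128, bw).collect()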