Non-Linear Classification Methods in Spark

In a previous post I covered how to apply classical linear estimators such as support vector machines or logistic regression to a non-linear dataset using the kernel method. This article can be considered a second part: while the previous article covered the theoretical foundation, this one is about implementation issues and examples.

1. Model fitting

Again, consider multivariate data given by a pair (y,x), where the explanatory variable x \in \mathbb{R}^p and the group coding y \in \{-1,1\}. In the following we assume an i.i.d. sample of size N given by (y_1,x_1),\dots,(y_N,x_N). Starting where the previous post ended, our aim is to minimize

(1)   \begin{equation*} \min_{\mathbf{\alpha}} L(\mathbf{y}, \mathbf{K} \mathbf{\alpha}) + \lambda \mathbf{\alpha}^T \mathbf{K} \mathbf{\alpha}. \end{equation*}


Here \mathbf{y}=(y_1,\dots,y_N),\; \mathbf{\alpha}=(\alpha_1,\dots,\alpha_N) and \lambda is a penalty parameter. \mathbf{K}_{ij} = K(x_i,x_j),\; i,j=1,\dots,N is an N \times N matrix derived from a kernel function; a kernel can be understood as a function that measures a certain relation between x_i and x_j (e.g. spatial proximity). An important role is played by the choice of L: it determines which method (e.g. SVM, logit) is used. In general the problem lacks a closed-form solution, therefore we approximate the solution using the Newton-Raphson algorithm.
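To make the kernel matrix \mathbf{K} concrete, the following is a minimal NumPy sketch that builds it with a radial (Gaussian) kernel, the same kernel used in the implementation section below; the helper names (radial_kernel, gram_matrix) and the bandwidth sigma are chosen here only for illustration.

import numpy as np

def radial_kernel(x, y, sigma):
    # Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def gram_matrix(X, sigma=0.5):
    # X has shape (N, p); returns the N x N matrix K with K[i, j] = K(x_i, x_j)
    N = X.shape[0]
    K = np.empty((N, N))
    for i in range(N):
        for j in range(N):
            K[i, j] = radial_kernel(X[i], X[j], sigma)
    return K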

2. Newton-Raphson algorithm

For a given loss function L(\mathbf{y}, \mathbf{K}\mathbf{\alpha}) let the gradient of the error function be given by \nabla E(\mathbf{\alpha}) = \frac{\partial \left( L(\mathbf{y}, \mathbf{K}\mathbf{\alpha}) + \lambda \mathbf{\alpha}^T \mathbf{K}\mathbf{\alpha} \right)}{\partial \mathbf{\alpha}} and the Hessian by H(\mathbf{\alpha}) = \frac{\partial^2 \left( L(\mathbf{y}, \mathbf{K}\mathbf{\alpha}) + \lambda \mathbf{\alpha}^T \mathbf{K}\mathbf{\alpha} \right)}{\partial \mathbf{\alpha}\, \partial \mathbf{\alpha}^T}. The Newton-Raphson algorithm then starts with some initial estimate \mathbf{\alpha}^{(0)} and updates \mathbf{\alpha}^{(m+1)} = \mathbf{\alpha}^{(m)} - H^{-1}(\mathbf{\alpha}^{(m)}) \nabla E(\mathbf{\alpha}^{(m)}) until convergence.
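As a plain NumPy sketch of this update rule (not the Spark optimizer; the gradient and Hessian callables are assumed to be supplied by the concrete loss):

import numpy as np

def newton_raphson(grad, hess, alpha0, tol=1e-8, max_iter=100):
    # Iterate alpha <- alpha - H(alpha)^{-1} grad(alpha) until the step is small.
    alpha = np.asarray(alpha0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(alpha), grad(alpha))  # solves H * step = grad
        alpha = alpha - step
        if np.linalg.norm(step) < tol:
            break
    return alpha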

2.1 Examples

2.1.1 Ridge Regression

To construct a ridge regression using (1) we have to choose the squared loss function L(y, x)= (y-x)^2 and thus

(2)   \begin{equation*} \min_{\mathbf{\alpha}} (\mathbf{y} - \mathbf{K}\mathbf{\alpha})^T (\mathbf{y} - \mathbf{K}\mathbf{\alpha}) + \lambda \mathbf{\alpha}^T \mathbf{K} \mathbf{\alpha}. \end{equation*}


Applying Newton-Raphson, we get \nabla E(\mathbf{\alpha}) = -2\mathbf{K}^T(\mathbf{y} - \mathbf{K}\mathbf{\alpha}) + 2\lambda\mathbf{K}\mathbf{\alpha} and H = 2\mathbf{K}^T\mathbf{K} + 2\lambda\mathbf{K}. Starting with \mathbf{\alpha}^{(0)}=(0,\dots,0)

(3)   \begin{equation*} \mathbf{\alpha}^{(1)} = \mathbf{\alpha}^{(0)} - \left( 2\mathbf{K}^T\mathbf{K} + 2\lambda\mathbf{K} \right)^{-1} \left( -2\mathbf{K}^T(\mathbf{y} - \mathbf{K}\mathbf{\alpha}^{(0)}) + 2\lambda\mathbf{K}\mathbf{\alpha}^{(0)} \right) = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y}. \end{equation*}


Indeed, for the ridge regression case we can also derive the optimal solution analytically by simply taking the derivatives \nabla E(\mathbf{\alpha}) and equating them to zero, which gives the same solution as the Newton-Raphson algorithm in one step. Fitted values are thus given by \hat{\mathbf{y}} = \mathbf{K}\mathbf{\alpha}^{(1)} = (\mathbf{I} + \lambda\mathbf{K}^{-1})^{-1}\mathbf{y}.
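A minimal NumPy sketch of this one-step solution, assuming the kernel matrix K has already been built (for example with the gram_matrix helper sketched above):

import numpy as np

def fit_kernel_ridge(K, y, lamb):
    # Closed form / one Newton-Raphson step: alpha = (K + lambda I)^{-1} y
    N = K.shape[0]
    alpha = np.linalg.solve(K + lamb * np.eye(N), y)
    fitted = K @ alpha  # equals (I + lambda K^{-1})^{-1} y
    return alpha, fitted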

2.1.2 Logistic Regression

To construct a logistic regression, L has to be chosen as L(y,x) = \log(1 + \exp(-yx)). This problem lacks a closed-form solution, therefore we approximate the solution using the Newton-Raphson algorithm. Recall that we aim to minimize

    \[ \min_{\mathbf{\alpha}} \sum_{i=1}^N \log\!\left(1 + \exp(-y_i (\mathbf{K}\mathbf{\alpha})_i)\right) + \lambda \mathbf{\alpha}^T \mathbf{K} \mathbf{\alpha}. \]


Therefore \nabla E(\mathbf{\alpha})_{j} = -\sum_{i=1}^N \frac{\exp(-y_i (\mathbf{K}\mathbf{\alpha})_i)\, y_i \mathbf{K}_{ij}}{1 + \exp(-y_i (\mathbf{K}\mathbf{\alpha})_i)} + 2\lambda \sum_{k=1}^N \mathbf{K}_{jk}\mathbf{\alpha}_k and the Hessian is given by

    \[ H_{ij} = \frac{\partial \nabla E(\mathbf{\alpha})_i}{\partial \mathbf{\alpha}_j}, \]

where the penalty term contributes 2\lambda \mathbf{K}_{ij} to each entry.
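For illustration, a NumPy sketch of the penalized objective and its gradient, assuming the labels are coded as y_i \in \{-1,1\}; the explicit Hessian is omitted, since the quasi-Newton methods discussed next only need the gradient.

import numpy as np

def logit_objective(alpha, K, y, lamb):
    # sum_i log(1 + exp(-y_i (K alpha)_i)) + lambda * alpha^T K alpha
    f = K @ alpha
    return np.sum(np.log1p(np.exp(-y * f))) + lamb * alpha @ K @ alpha

def logit_gradient(alpha, K, y, lamb):
    # Gradient of the penalized loss with respect to alpha
    f = K @ alpha
    s = 1.0 / (1.0 + np.exp(y * f))   # = exp(-y*f) / (1 + exp(-y*f))
    return -K.T @ (y * s) + 2 * lamb * (K @ alpha)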

2.2 Computation

A computational problem with Newton-Raphson is deriving the Hessian, since N \times N cells have to be computed. A commonly used strategy is thus to approximate the Hessian, for example by a secant-type update of the form H(\alpha^{(m)}) \approx \frac{\nabla E(\alpha^{(m)}) - \nabla E(\alpha^{(m-1)})}{\alpha^{(m)} - \alpha^{(m-1)}}. Such methods are called quasi-Newton methods. In the Spark implementation below, for example, the logistic regression is fitted with the L-BFGS method (LogisticRegressionWithLBFGS), while the SVM and ridge regression fits rely on stochastic gradient descent, which does not use the Hessian at all and even approximates the gradient of the error function.
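As a local (non-Spark) illustration of the quasi-Newton idea, the kernel logistic objective and gradient sketched above can be handed to SciPy's L-BFGS implementation; this is only a sketch of the optimization principle, not what MLlib does internally.

import numpy as np
from scipy.optimize import minimize

def fit_kernel_logit_lbfgs(K, y, lamb):
    # Minimize the penalized kernel logistic loss with L-BFGS; only the
    # gradient is supplied, the Hessian is approximated internally.
    N = K.shape[0]
    res = minimize(logit_objective, np.zeros(N), args=(K, y, lamb),
                   jac=logit_gradient, method="L-BFGS-B")
    return res.x  # estimated alpha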

3. Prediction

Suppose we have a model \alpha^* fitted using a training set of pairs (y_1,x_1),\dots,(y_N,x_N). Consider another set of data x^{'}_1, \dots, x^{'}_{N'}, often called the test sample. In order to reach a decision about which group these observations belong to we use \hat{y}^{'}_j = \sum_i K(x_i, x^{'}_j)\, \alpha^{*}_i. This also gives an intuitive explanation of how prediction with the kernel method works: if an observed test point x^{'}_1 is close, in the kernel sense, to the training point x_1, then these two observations likely belong to the same group.
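A NumPy sketch of this decision rule, assuming a fitted coefficient vector alpha_star, the training inputs, and the radial kernel from the sketch above; for \{-1,1\}-coded labels the sign of the score is taken as the predicted group.

import numpy as np

def predict(alpha_star, X_train, X_test, sigma):
    # score_j = sum_i K(x_i, x'_j) * alpha*_i ; classify by the sign of the score
    scores = np.array([
        sum(radial_kernel(x_i, x_j, sigma) * a_i
            for x_i, a_i in zip(X_train, alpha_star))
        for x_j in X_test
    ])
    return np.sign(scores)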

4. Implementation

Figure: 200 random samples simulated from the model described below.

The procedure is implemented in PySpark using the Spark functions SVMWithSGD, LogisticRegressionWithLBFGS and LinearRegressionWithSGD. Out of curiosity a Lasso variant using LassoWithSGD was fitted as well; strictly speaking this model does not match the ones described above, because the penalty term is different (an L1 norm). However, using the Lasso in the kernel context has an interesting effect: the Lasso will not discard certain features but certain observations. To test the implementation, train and test data are simulated using the following model: y \sim B(1, 0.5) and, with \phi \sim \mathcal{N}(0, 2\pi), x = (0.5(1 + y)\cos(\phi) + \epsilon,\; 0.5(1 + y)\sin(\phi) + \epsilon') where \epsilon, \epsilon' \sim \mathcal{N}(0, 0.1). The whole simulation is a bit long, therefore it is available as a Zeppelin notebook via GitHub. For those who have no access to a Zeppelin installation there also exists a docker-compose setup including the notebook. As an example the code for the logit method is given below; for the other linear methods the procedure is comparable.

import numpy as np
import matplotlib.pyplot as plt

def radial_kernel(x,y,sigma):
    # Gaussian (RBF) kernel between two observations
    return np.exp(-sum((x-y)**2)/(2*sigma**2))


def construct_K(Y,X,X_1,sigma):
    # Distribute the observations and compute the kernel between every pair
    # (x_i, x'_j); the kernel values are keyed by the row index of X.
    sp_X=sc.parallelize(np.transpose(X)).zipWithIndex()
    sp_X_1=sc.parallelize(np.transpose(X_1)).zipWithIndex()
    sp_Y=sc.parallelize(Y).zipWithIndex().map(lambda xy: (xy[1], xy[0]))
    grid=sp_X.cartesian(sp_X_1)
    K=grid.map(lambda xy: (xy[0][1], radial_kernel(xy[0][0], xy[1][0], sigma)))
    return [sp_Y, K]
    
def construct_labeled(Y,K):
    # Collect the kernel values of each row into a feature vector, attach the
    # group label and keep the row index to restore the original order later.
    def add_element(acc,x):
        if type(acc[1]) == list:
            return (min(acc[0],x[0]), acc[1] + [x[1]]  )
        else:
            return (min(acc[0],x[0]), [acc[1]] + [x[1]]  )
    jnd=Y.join(K).reduceByKey(lambda acc, x : add_element(acc,x) )
    labeled=jnd.map(lambda yx: LabeledPoint(yx[1][0], yx[1][1]))
    order=jnd.map(lambda yx: yx[0])
    return [labeled, order]

##Simulate the training sample
N=500
Y= np.random.randint(0,2,N)
degree=np.random.normal(0,1,N)*2*np.pi
X= [0+ (0.5 + Y*0.5)* np.cos(degree)+ np.random.normal(0,2,N)*0.05, 0 + (0.5 + Y*0.5)*np.sin(degree)+ np.random.normal(0,2,N)*0.05   ]

plt.scatter(X[0], X[1], c=Y)
plt.show()

#Example Logistic Regression

from pyspark.mllib.regression import LabeledPoint

Y_K=construct_K(Y,X,X,0.1)
l_train=construct_labeled(Y_K[0], Y_K[1])[0]


# Train the model and evaluate it on the training data
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel

LogitModel = LogisticRegressionWithLBFGS.train(l_train)
labelsAndPreds = l_train.map(lambda p: (p.label, LogitModel.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(l_train.count())
print("Training Error = " + str(trainErr))

#Construct Test Sample and apply the model
##Generate data
#Simulation
N=200

Y_test=np.random.randint(0,2,N)
degree=np.random.normal(0,1,N)*2*np.pi
X_test=[0+ (0.5 + Y_test*0.5)* np.cos(degree)+ np.random.normal(0,2,N)*0.05, 0 + (0.5 + Y_test*0.5)*np.sin(degree)+ np.random.normal(0,2,N)*0.05]

#plot data
plt.scatter(X_test[0], X_test[1], c=Y_test)
plt.show()

Y_K_test=construct_K(Y_test,X_test,X,0.1)
l_test=construct_labeled(Y_K_test[0], Y_K_test[1])

##Logit
labelsAndPreds = l_test[0].map(lambda p: (p.label, LogitModel.predict(p.features)))
testErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(l_test[0].count())
print("Prediction Error (Logit)= " + str(testErr))

#plot predictions
preds=labelsAndPreds.map(lambda lp: lp[1]).collect()
sort_order=l_test[1].collect()
pred_sorted = [x for _,x in sorted(zip(sort_order,preds))]

plt.scatter(X_test[0], X_test[1], c=pred_sorted)
plt.show()
