This post is about how to speed up the computation of kernel density estimators using the FFT (Fast Fourier Transform). Let $X_1, \dots, X_n$ be a random sample drawn from an unknown distribution with density $f$. Recall that the kernel density estimator with kernel $K$ and bandwidth $h$ is given by

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right). \qquad (1)$$

In the following we use the notation from a previous article, where the convolution theorem for the Fourier transform was introduced. The discrete version of the convolution theorem applies to two functions defined at evenly spaced points; their convolution is given by:

(2)
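The discrete convolution theorem can be checked numerically. The following is a minimal sketch, assuming two arbitrary sequences `u` and `v` of length `M` (these names are illustrative and not taken from the original post): the circular convolution computed directly with a double loop agrees with the inverse FFT of the product of the two FFTs.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 16                      # number of evenly spaced points (assumed for illustration)
u = rng.normal(size=M)      # first sequence
v = rng.normal(size=M)      # second sequence

# Direct circular convolution: (u * v)[k] = sum_j u[j] * v[(k - j) mod M]
direct = np.array(
    [sum(u[j] * v[(k - j) % M] for j in range(M)) for k in range(M)]
)

# Convolution theorem: multiply the transforms, then transform back
via_fft = np.fft.ifft(np.fft.fft(u) * np.fft.fft(v)).real

max_abs_diff = np.max(np.abs(direct - via_fft))
print(max_abs_diff)  # difference at the level of floating point rounding
```

The direct version needs on the order of `M**2` multiplications, while the FFT route needs on the order of `M log M`, which is where the speed-up for the density estimator comes from.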

To speed up the kernel computation we will use a particular feature given by:

(3)

Since the discrete convolution theorem requires functions observed at evenly spaced points we create a fine grid of length where we want to evaluate the density.

(4)

where is the Dirac delta function. To derive the estimate for all points the computer has to handle operations. Following [1], let the discrete Fourier transform of be denoted by . The Fourier transform of (4) for is then, using convolution and translation, given by

(5)

where is the Fourier transform of the data. The result corresponding to (4) can then be obtained by applying the inverse FFT. All these steps then involve only operations. is derived using a histogram of the observed data with binning boundaries defined by , followed by the fast Fourier transform.

To implement the method, the fft function that comes with the numpy package was chosen. The numpy fft() function returns the DFT with the frequencies ordered starting from 0. Therefore fftshift() is needed to swap the two halves of the output vector, so that the output of fftshift(fft()) is centered around 0. In this example the computation using the FFT was around 5 times faster.
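The reordering done by fftshift() can be made visible on the frequency grid itself. This is a small sketch that uses `np.fft.fftfreq`, which does not appear in the original code and is only used here to label the output positions of fft().

```python
import numpy as np

n = 8
freqs = np.fft.fftfreq(n)           # frequency of each fft() output position
shifted = np.fft.fftshift(freqs)    # the two halves swapped

print(freqs)    # starts at 0, negative frequencies come last
print(shifted)  # monotonically increasing, zero frequency in the middle
```

After the shift the values are sorted, which is the layout one needs when the result is plotted against a symmetric grid such as the one from -5 to 5 used below.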

```python
import timeit

import matplotlib.pyplot as plt
import numpy as np

# Simulation
N = 500
X = np.random.normal(0, 1.5, N)
plt.scatter(np.linspace(0, N, N), X, c="blue")
plt.xlabel("sample")
plt.ylabel("X")
plt.show()

# Presetting
Nout = 2**7  # number of grid points


def epan_kernel(u, b):
    """Epanechnikov kernel with bandwidth b, evaluated at u."""
    u = u / b
    return max(0, 1.0 / b * 3.0 / 4 * (1 - u**2))


# Usual estimation
def dens(s, X, h=1):
    """Kernel contributions of all observations in X at the point s."""
    # bandwidth 1, the same value that is used in the FFT version below
    return [epan_kernel(s - x, h) for x in X]


start = timeit.default_timer()
grid = np.linspace(-5, 5, num=Nout)
density = np.array([(1.0 / X.size) * sum(dens(y, X)) for y in grid])
stop = timeit.default_timer()
print("Time: ", stop - start)
plt.plot(grid, density)

# Estimation using FFT
start = timeit.default_timer()
to = np.linspace(-5, 5, Nout + 1)  # bin boundaries of the histogram
t = grid
kernel = [epan_kernel(x, 1) for x in t]
kernel_ft = np.fft.fft(kernel)
hist, bins = np.histogram(X, bins=to)
density_tmp = kernel_ft.flatten() * np.fft.fft(hist)
density_fft = (1.0 / X.size) * np.fft.fftshift(np.fft.ifft(density_tmp).real)
stop = timeit.default_timer()
print("Time: ", stop - start)
plt.plot(t, density_fft)
plt.show()

# Euclidean distance between the two estimates
print(np.sqrt(sum((density - density_fft) ** 2)))
```

[1] B. W. Silverman, “Algorithm AS 176: Kernel density estimation using the fast Fourier transform,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 31, iss. 1, pp. 93–99, 1982.

[Bibtex]


```
@article{Silverman1982,
ISSN = {00359254, 14679876},
URL = {http://www.jstor.org/stable/2347084},
author = {B. W. Silverman},
journal = {Journal of the Royal Statistical Society. Series C (Applied Statistics)},
number = {1},
pages = {93--99},
publisher = {[Wiley, Royal Statistical Society]},
title = {Algorithm AS 176: Kernel Density Estimation Using the Fast Fourier Transform},
volume = {31},
year = {1982}
}
```

Again consider multivariate data given by a pair where the explanatory variable and the group coding . In the following we assume an iid. sample of size given by . Starting where the previous post ends our aim is to minimize

(1)

here and is a certain penalty term. is a matrix derived from a kernel function; a kernel can be understood as a function that measures a certain relation between and (e.g. spatial location). An important role is played by the choice of , which determines the method that is used (e.g. SVM, logit). In general the problem lacks a closed-form solution, so we approximate the solution using the Newton-Raphson algorithm.

For a given loss function let the gradient error function be given by and the Hessian . The Newton-Raphson algorithm is then to start with some initial estimate and update until convergence.

To construct a ridge regression using we have to choose the squared loss function and thus

(2)

using Newton-Raphson notice we get and . Starting with

(3)

Indeed, for the ridge regression case we can also derive the optimal solution analytically by simply taking the derivatives and setting them to zero, which gives the same solution as the Newton-Raphson algorithm after one step. Fitted values are thus given by .
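The one-step property can be checked numerically. The sketch below is only an illustration under stated assumptions: the objective is taken to be $\|y - K\alpha\|^2 + \lambda \alpha^\top K \alpha$ (the exact form of the penalty in the original equations is not recoverable here), the kernel is a radial kernel on six one-dimensional points, and the names `rbf_kernel_matrix`, `lam` and `alpha` are hypothetical.

```python
import numpy as np


def rbf_kernel_matrix(x, sigma):
    """Radial kernel matrix K[i, j] = exp(-(x_i - x_j)^2 / (2 sigma^2))."""
    d = x[:, None] - x[None, :]
    return np.exp(-(d**2) / (2 * sigma**2))


x = np.linspace(0.0, 1.0, 6)
y = np.sin(2 * np.pi * x)
lam = 0.1
K = rbf_kernel_matrix(x, sigma=0.1)


def gradient(alpha):
    # derivative of ||y - K alpha||^2 + lam * alpha' K alpha (assumed objective)
    return -2 * K @ (y - K @ alpha) + 2 * lam * K @ alpha


# The Hessian of a quadratic objective does not depend on alpha
hessian = 2 * K @ K + 2 * lam * K

# One Newton-Raphson step starting from zero
alpha0 = np.zeros_like(y)
alpha_newton = alpha0 - np.linalg.solve(hessian, gradient(alpha0))

# Closed form obtained by setting the gradient to zero
alpha_closed = np.linalg.solve(K + lam * np.eye(len(x)), y)

fitted = K @ alpha_newton
print(np.max(np.abs(alpha_newton - alpha_closed)))
```

Because the objective is quadratic, its second-order Taylor expansion is exact, so a single Newton-Raphson step lands on the minimizer regardless of the starting value.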

To construct a logistic regression, has to be chosen as . This again lacks a closed-form solution, so we approximate the solution using the Newton-Raphson algorithm. Recall that we aim to minimize

Therefore and the Hessian is given by

.

A computational problem when using Newton-Raphson is deriving the Hessian, since cells have to be computed. A commonly used strategy is thus to approximate the Hessian, for example using . Such methods are called quasi-Newton methods. The Spark function used below to fit the logistic regression, for example, makes use of the L-BFGS method. The SVM and ridge regression fits rely on stochastic gradient descent, which does not use the Hessian at all and even approximates the gradient error function.

Suppose we have a model fitted using a training set of pairs . Consider another set of data , often called the test sample. In order to decide to which group these observations belong we use . This also gives an intuitive explanation of how prediction with the kernel method works: if an observation in the test data is close, in the kernel sense, to an observation in the training data, then these two observations likely belong to the same group.

The procedure is implemented in pySpark using the Spark functions *SVMWithSGD, LogisticRegressionWithLBFGS, LinearRegressionWithSGD*. Out of curiosity a Lasso implementation using *LassoWithSGD* was also done. Strictly speaking this model does not fit the ones described above, because the penalty term looks different (L1 norm). However, using Lasso in the kernel context has an interesting effect, since Lasso will then discard not certain features but certain observations. To test the implementation, training and test data are simulated using the following model: and with , where . The whole simulation is a bit long, therefore it is available as a Zeppelin notebook via GitHub. For those who have no access to a Zeppelin installation there also exists a docker-compose setup including the notebook. As an example the code for the logit method is given below; for the other linear methods the procedure is comparable.

```python
import matplotlib.pyplot as plt
import numpy as np
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# `sc` is the SparkContext; it is not created here and is assumed to be
# provided by the notebook session.


def radial_kernel(x, y, sigma):
    return np.exp(-sum((x - y) ** 2) / (2 * sigma**2))


def construct_K(Y, X, X_1, lamb):
    sp_X = sc.parallelize(np.transpose(X)).zipWithIndex()
    sp_X_1 = sc.parallelize(np.transpose(X_1)).zipWithIndex()
    # (index, label)
    sp_Y = sc.parallelize(Y).zipWithIndex().map(lambda yi: (yi[1], yi[0]))
    grid = sp_X.cartesian(sp_X_1)
    # (row index, kernel value between a point of X and a point of X_1)
    K = grid.map(lambda p: (p[0][1], radial_kernel(p[0][0], p[1][0], lamb)))
    return [sp_Y, K]


def construct_labeled(Y, K):
    def as_list(v):
        return v if isinstance(v, list) else [v]

    def add_element(acc, x):
        # keep the label and collect the kernel values of one row in a list
        return (min(acc[0], x[0]), as_list(acc[1]) + as_list(x[1]))

    jnd = Y.join(K).reduceByKey(add_element)
    labeled = jnd.map(lambda kv: LabeledPoint(kv[1][0], kv[1][1]))
    order = jnd.map(lambda kv: kv[0])
    return [labeled, order]


## Simulate the training sample
N = 500
Y = np.random.randint(0, 2, N)
degree = np.random.normal(0, 1, N) * 2 * np.pi
X = [
    0 + (0.5 + Y * 0.5) * np.cos(degree) + np.random.normal(0, 2, N) * 0.05,
    0 + (0.5 + Y * 0.5) * np.sin(degree) + np.random.normal(0, 2, N) * 0.05,
]
plt.scatter(X[0], X[1], c=Y)
plt.show()

# Example: logistic regression
Y_K = construct_K(Y, X, X, 0.1)
l_train = construct_labeled(Y_K[0], Y_K[1])[0]

# Evaluating the model on training data
LogitModel = LogisticRegressionWithLBFGS.train(l_train)
labelsAndPreds = l_train.map(lambda p: (p.label, LogitModel.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(l_train.count())
print("Training Error = " + str(trainErr))

# Construct the test sample and apply the model
N = 200
Y_test = np.random.randint(0, 2, N)
degree = np.random.normal(0, 1, N) * 2 * np.pi
X_test = [
    0 + (0.5 + Y_test * 0.5) * np.cos(degree) + np.random.normal(0, 2, N) * 0.05,
    0 + (0.5 + Y_test * 0.5) * np.sin(degree) + np.random.normal(0, 2, N) * 0.05,
]

# plot data
plt.scatter(X_test[0], X_test[1], c=Y_test)
plt.show()

Y_K_test = construct_K(Y_test, X_test, X, 0.1)
l_test = construct_labeled(Y_K_test[0], Y_K_test[1])

## Logit
labelsAndPreds = l_test[0].map(lambda p: (p.label, LogitModel.predict(p.features)))
# the error rate is taken relative to the size of the test sample
testErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(l_test[0].count())
print("Prediction Error (Logit)= " + str(testErr))

# plot predictions
preds = labelsAndPreds.map(lambda lp: lp[1]).collect()
sort_order = l_test[1].collect()
pred_sorted = [x for _, x in sorted(zip(sort_order, preds))]
plt.scatter(X_test[0], X_test[1], c=pred_sorted)
plt.show()
```

In a previous article I explained how to get Spark running on an OrangePi to create a toy computing cluster. If you look at that article you may agree that this was a really painful setup process. For quite a while now I have worked with Kubernetes, which makes the deployment process a lot easier using container technology. A big advantage is that Kubernetes automatically distributes workloads, for example Spark workers, across different machines. Loosely speaking, a container is close to the concept of a VM, but where each VM comes with an entire OS, containers share these components with the host system. Unfortunately, prior to January 2017 OrangePi images were based on the Linux kernel < 3.4, which lacked features required to run containers on the H3 chipset. Since newer images are based on kernel >= 4.14, it is now possible to install Docker or Kubernetes, for example using k3s, which uses containerd (the core of Docker) as its container technology.

To set up the Pi you have to follow these steps:

- Download the Armbian Bionic image
- Install the image on an SD card (for example with Etcher or see Section 3.1 here)
- Put the SD card into the Pi and then connect the device

To connect to the device from the host, for people using Windows 10 I recommend installing the Linux Bash or Git Bash to follow the next steps.

- Login as root :
`ssh root@ip_of_your_master_device`

- The password is
`1234`

You are then forced to change the password. I changed mine to “orangepi”.

The next step is to change the hostname. This is necessary because k3s will later identify the nodes by their hostname. I labeled the master “masterpi” while the nodes were named “nodepi1”, “nodepi2” and so on. To change the hostname type `echo "masterpi" > /etc/hostname`. In addition you have to replace the “orangepi” entries with “masterpi” in `/etc/hosts` and then reboot:

`sudo nano /etc/hosts`

`sudo reboot`

Next you have to install the latest k3s Kubernetes cluster on the masterpi:

`curl -sfL https://get.k3s.io | sh -`

This may take a while. k3s is now starting and begins to download the required images from the Kubernetes registry. To check if the systemd service is running type `sudo systemctl status k3s`

To add more nodes to your cluster you need to know the token of your master node. To get this token, execute at the master

`sudo cat /var/lib/rancher/k3s/server/node-token`

A token might look like *K10ca1b47907be6d5cf91e6e7a29d1d52c9b36c087ca35da3cee2757a9a3507ed5a::node:24af5a7878e575c3f36566a0011395f3*.

Then prepare a second (or third or whatever) device as described above. At each node you have to define the following environment variables based on your actual setup

`export K3S_URL="https://ip_of_your_master_device:6443"`

`export K3S_TOKEN="K10ca1b47907be6d5cf91e6e7a29d1d52c9b36c087ca35da3cee2757a9a3507ed5a::node:24af5a7878e575c3f36566a0011395f3"`

`curl -sfL https://get.k3s.io | sh -`

You can now join your node to the cluster this way:

`sudo k3s agent --server ${K3S_URL} --token ${K3S_TOKEN}`

Well done, you have successfully deployed a mini Kubernetes cluster using OrangePi hardware. You can now check via `kubectl get nodes` whether all nodes are connected successfully to the cluster.

For an easier deployment process we will now install kubectl at our host pc and connect it to our cluster. If you are a Windows user you can download the binary, as a Linux (Ubuntu, Debian) user you can install kubectl via

```
sudo apt-get update && sudo apt-get install -y apt-transport-https
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubectl
```

To connect to your cluster you first have to define the cluster

`kubectl config set-cluster mycluster --server=https://ip-of-your-master-device:6443 --insecure-skip-tls-verify`

`kubectl config set-context mycontext --cluster=mycluster --user=root`

Next you have to log in to the master and type `kubectl config view` to see the admin password. On the host we will now add a user using this password

`kubectl config set-credentials root --username=admin --password=fcc25ce80120dad6f1384a8bea223331`

`kubectl config use-context mycontext`

`kubectl apply -f https://raw.githubusercontent.com/heikowagner/thebigdatablog/master/deployment.yaml`


and check via `kubectl get po`

`ip_of_your_master_device`

The technique we use is commonly known as Monte Carlo Tree Search. Each round of Monte Carlo tree search consists of four steps:

- *Selection*: start from root *R* and select successive child nodes until a leaf node *L* is reached. The root is the current game state and a leaf is any node from which no simulation (playout) has yet been initiated.
- *Expansion*: unless *L* ends the game decisively (e.g. win/loss/draw) for either player, create one (or more) child nodes and choose node *C* from one of them. Child nodes are any valid moves from the game position defined by *L*. To decide which node *C* to choose we use UCB1, which was described in the “The Multiarmed Bandit Problem” article.
- *Simulation*: complete one random playout from node *C*.
- *Backpropagation*: use the result of the playout to update information in the nodes on the path from *C* to *R*.
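The choice of the child node via UCB1 can be sketched as follows. This is not the code of the game itself (which is written in JavaScript), only a minimal Python illustration; the function name `ucb1_select` and the representation of a child as a pair `(sum of rewards, number of visits)` are assumptions made for the example.

```python
import math


def ucb1_select(children, total_visits):
    """Return the index of the child with the highest UCB1 score.

    children: list of (reward_sum, visits) pairs, one per child node
    total_visits: number of playouts that went through the parent node
    """
    # A child that was never visited is tried first
    for i, (_, visits) in enumerate(children):
        if visits == 0:
            return i
    scores = [
        reward_sum / visits + math.sqrt(2 * math.log(total_visits) / visits)
        for reward_sum, visits in children
    ]
    return max(range(len(children)), key=scores.__getitem__)


# The rarely visited second child is preferred although its average is lower
print(ucb1_select([(60, 100), (0, 1)], total_visits=101))
```

The second term of the score shrinks as a child is visited more often, so moves that look bad after a few playouts still get revisited occasionally.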

The whole code to implement a tic-tac-toe game (less than 200 lines) is presented above. Feel free to modify the code (in the “Vue” tab), for example the utility function: changing *u={win:1,lose:-1,draw:0}* to *u={win:10,lose:-1,draw:0}* will make the AI greedy. As a result the AI will no longer play well and will make much riskier moves. Indeed, for this kind of AI the only way to implement what politicians often call an “ethical AI” is through the utility function.

In the 1920s people started to describe games using math. Since then game theory has become an important technique in economics and nowadays, with artificial intelligence becoming more important, even in computer science. We want to follow the usual notation but only give a very short review of the most important parts. For a deeper introduction to the topic I recommend A Primer in Game Theory by Robert Gibbons. To describe a game we first need a description of the players playing the game; in particular let be the set of players. The second important piece of information is which rules apply to the game. This is described by which is a set of all possible sequences (finite or infinite), is the initial history and the terminal history. To describe the histories in between we introduce the concept of actions. At each stage of the game the players, say , have to choose an action from the set which is the set of actions or strategies available to player . Accordingly, with . Then a history at stage is described by with being the action profile at stage 0. Accordingly for , , if for all positive . We also need to determine which player's turn it is at a certain stage. Let be the player function that determines who is next for the respective sequence, . Finally we need to determine the outcomes of the game by , the set of utility functions . With this information we can now describe a perfect-information extensive-form game as .

Tic-Tac-Toe is a sequential game for two players, who take turns marking the spaces in a $3 \times 3$ grid with either X or O. The player who succeeds in placing three of their marks in a horizontal, vertical, or diagonal row wins the game. If we consider a board with the nine positions numbered as follows:

1 | 2 | 3 |

4 | 5 | 6 |

7 | 8 | 9 |

We can define the Game by where 1 is the Player who plays and 2 the player who plays . The set of all sequences is then given by where is a possible move of Player 1 where the second field is marked and a possible move after by Player 2 and so on. Accordingly for example , . Let (draw, lost) a history where Player 1 wins the game, a possible payout function one can for example define .

In the following we want to “teach” a computer how to play a game. In particular this requires letting the computer know which moves are allowed at a particular stage. As the reader might recognize, this information is included in . Even though for simple games like Tic-Tac-Toe it is not difficult for a computer to write down (and therefore implement) in particular (in the case of Tic-Tac-Toe we are dealing with 255,168 games), for more complicated games like chess or Go it turns out that (at the current level of computing power) this task is impossible. This information is for example required to let the computer play the best strategy. However, this is not what we want to achieve today. At the moment we only want to construct a computer that understands and plays by the rules without a certain strategy (or more precisely: with a random strategy).

To let the computer “know” the allowed actions at stage we introduce a new function which maps a given history to all possible actions, . For example . We will make another simplification here. For many games, including Tic-Tac-Toe, chess or Go, the knowledge of the history is not necessary to make the next move. The only necessary information is the current state of the game. Games where the history is unknown are called “games with complete information” in contrast to “games with perfect information” which describes games where the history is known to each player at each stage. In the following we present a JavaScript example where the computer plays Tic-Tac-Toe against itself by randomly choosing cells. The “gamerules()” function maps with “f”, with “reward” and with “turn”.

At first glance the outcome of running 10,000 random games was surprising. In the table below we see a clear first-mover advantage. Intuitively I expected the chances of winning the game to be equal; however, on second thought this result is not surprising because the first mover gets 5 moves compared to 4 moves for the second player. In particular, according to Wikipedia, there are 255,168 possible games: in 131,184 of them the first player wins, in 77,904 the second player wins and 46,080 end in a draw. In the table we can see that this differs from our estimate. The reason lies in the very special rules of Tic-Tac-Toe: the game can end early if one player completes a row before the last turn. There are 1,440 possible games ending in a win on the fifth move, 5,328 ending in a win on the sixth move, 47,952 ending in a win on the seventh move, 72,576 ending in a win on the eighth move and 81,792 ending in a win on the ninth move.
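These counts can be reproduced by brute force, since the game tree is small. The following sketch is written in Python rather than in the JavaScript of the game code and uses its own, hypothetical helper names (`winner`, `count_games`); it walks through every legal sequence of moves and tallies how and when each game ends.

```python
from collections import Counter

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 1 or 2 if that player has completed a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def count_games(board, player, depth, tally):
    """Depth-first walk over all games; depth is the number of moves made."""
    w = winner(board)
    if w is not None:
        tally[("win", w, depth)] += 1
        return
    if depth == 9:
        tally[("draw", None, depth)] += 1
        return
    for cell in range(9):
        if board[cell] is None:
            board[cell] = player
            count_games(board, 3 - player, depth + 1, tally)
            board[cell] = None


tally = Counter()
count_games([None] * 9, 1, 0, tally)

total = sum(tally.values())
first_wins = sum(v for (kind, w, _), v in tally.items() if kind == "win" and w == 1)
second_wins = sum(v for (kind, w, _), v in tally.items() if kind == "win" and w == 2)
draws = tally[("draw", None, 9)]
print(total, first_wins, second_wins, draws)
```

Note that this counts every complete game once, which corresponds to choosing among all possible games with equal probability. That is not the same as choosing each move at random, which is why the simulated frequencies differ.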

Since we play the game sequentially, a path where the game ends earlier will be reached more often than a path where the game ends later. To see this we take a look at a history that ends after turns and a path with a history that ends after turns. Let’s say we reach the path of each history after turns 2 times. Since for this is the final node, this will count two times for . For there are at least two further histories to reach in the next turn and therefore the chance to reach is , which means that we will reach only half as often as . A quick Google search also turned up some results (here, here) that support my results.

Outcome | Estimated probability (10,000 random games) | Share of all possible games (Wikipedia) |
---|---|---|
Player 1 wins | 0.5859 | 0.5141 |
Player 2 wins | 0.2874 | 0.3053 |
Draw | 0.1267 | 0.1806 |
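The simulated column can also be computed exactly by weighting every move with its probability instead of counting games. The sketch below is an illustration in Python with hypothetical names (`outcome_probs`, boards stored as tuples with 0 for an empty cell); it assumes that both players pick uniformly among the free cells, as in the random game above.

```python
from fractions import Fraction
from functools import lru_cache

LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def winner(board):
    """Return 1 or 2 if that player has completed a line, else 0."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0


@lru_cache(maxsize=None)
def outcome_probs(board, player):
    """(P(player 1 wins), P(player 2 wins), P(draw)) under uniform random moves."""
    w = winner(board)
    if w == 1:
        return (Fraction(1), Fraction(0), Fraction(0))
    if w == 2:
        return (Fraction(0), Fraction(1), Fraction(0))
    free = [i for i, v in enumerate(board) if v == 0]
    if not free:
        return (Fraction(0), Fraction(0), Fraction(1))
    total = [Fraction(0), Fraction(0), Fraction(0)]
    for i in free:
        nxt = board[:i] + (player,) + board[i + 1:]
        sub = outcome_probs(nxt, 3 - player)
        # each free cell is chosen with probability 1 / len(free)
        total = [t + s / len(free) for t, s in zip(total, sub)]
    return tuple(total)


p_first, p_second, p_draw = outcome_probs((0,) * 9, 1)
print(float(p_first), float(p_second), float(p_draw))
```

Short games get a larger weight here because fewer random choices are needed to reach them, which is the effect described in the paragraph above the table.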

```javascript
/* --Heiko Wagner 2019 */
class Matrix {
  constructor(data, ncols) {
    // reshape the flat data array into rows of length ncols
    this.ncols = ncols;
    this.matrix = [];
    for (var m = 0; m < data.length / ncols; m++) {
      var row = [];
      for (var n = 1; n <= ncols; n++) {
        row.push(data[n + m * ncols - 1]);
      }
      this.matrix.push(row);
    }
  }
  transpose() {
    return new Matrix(
      this.matrix.map((_, c) => this.matrix.map(r => r[c])).flatMap(x => x),
      this.ncols
    );
  }
  trace() {
    return this.matrix
      .map((x, i) => x[i])
      .reduce((a, b) => (a === null || b == null) ? null : a + b);
  }
  rowsums() {
    return this.matrix.map((x) =>
      x.reduce((a, b) => (a === null || b == null) ? null : a + b)
    );
  }
  flip() {
    var nrows = this.matrix.length;
    // copy each row before reversing so that the state itself is not changed
    return new Matrix(
      this.matrix.map((x) => x.slice().reverse()).flatMap(x => x),
      nrows
    );
  }
}

function getRandomInt(max) {
  return Math.floor(Math.random() * Math.floor(max));
}

function gamerules(state, turn, player) {
  // check for a win:
  // a player has won if any row sum, column sum, trace or flip.trace is 3 or 0
  var has_won = [
    state.rowsums(),
    state.transpose().rowsums(),
    state.trace(),
    state.flip().trace(),
  ].flatMap(x => x);
  var reward = null;
  if (has_won.includes(0)) {
    // 0 has won
    reward = player == 0 ? 1 : -1;
  }
  if (has_won.includes(3)) {
    // 1 has won
    reward = player == 1 ? 1 : -1;
  }
  // collect all states that can be reached by marking a free field
  var state_init = state.matrix.flatMap((x) => x);
  var f = [];
  for (var l = 0; l < state_init.length; l++) {
    var state_l = state_init.slice(0);
    if (state_init[l] == null) {
      state_l[l] = turn;
      f.push(new Matrix(state_l, 3));
    }
  }
  // check if no free fields are left ==> draw
  if (f.length == 0 && reward == null) {
    reward = 0;
  }
  return { f: f, reward: reward, turn: (turn + 1) % 2 };
}

// run through all states of a game until a reward is reached:
// play a random game starting from state with gamerules
function random_game(state) {
  // play initial turn
  var choices = gamerules(state, 1);
  while (choices.reward === null) {
    var K = choices.f.length;
    var decision = getRandomInt(K);
    choices = gamerules(choices.f[decision], choices.turn, 0);
  }
  return choices;
}

// Initial state of the game
var state = [null, null, null, null, null, null, null, null, null];
var test = new Matrix(state, 3);

// Example of \mathcal{H}:
// gamerules(test, 1)

// Play a single random game
random_game(test);

// Play 10000 random games and make a histogram
var M = 10000;
var win = 0;
var loss = 0;
var draw = 0;
for (var i = 0; i < M; i++) {
  var outcome = random_game(test).reward;
  switch (outcome) {
    case 1: win++; break;
    case -1: loss++; break;
    case 0: draw++;
  }
}
console.log("prop of player 1 winning: " + loss / M);
console.log("prop of player 2 winning: " + win / M);
console.log("prop of draw: " + draw / M);
```

Consider the following problem: A gambler enters a casino with $K$ slot machines. Each slot machine pays out rewards according to a different, unknown distribution, i.e. we are facing a set of unknown distributions with associated expected values $\mu_1, \dots, \mu_K$ and variances $\sigma_1^2, \dots, \sigma_K^2$.

In each turn $t$ the gambler can pull the lever of one slot machine and observes the associated reward $r_t$. The question is which strategy the gambler should follow to maximize his earnings (or minimize his losses in case he has to pay a fee to pull a lever).

If the distributions were known, one would simply always play the machine with the highest expected reward. The average reward following this strategy would then be $\mu^* = \max_{i} \mu_i$.

Since the probabilities are unknown, a naive solution would be, for example, to play several times, say $m$ times, at the first machine, then several times at the second and so on, to get a decent idea of which machine is the best one, and then keep playing that machine. However, there are certain drawbacks with this approach. First of all one will play at least $m$ times at an inferior machine, and secondly one can end up choosing the wrong machine in the end. The chance of ending up with the wrong machine can be decreased by raising the number of turns played at each machine, but then one will play even more often at an inferior machine. To measure the performance of a certain strategy we introduce the concept of *total expected regret*, defined by

$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} r_t\right],$$

where $\mu^*$ is the highest expected reward among the machines, $r_t$ is the reward received in turn $t$ and $T$ is the number of turns.

[1] states that the regret grows at least logarithmically in the number of turns. Therefore an algorithm is said to solve the multi-armed bandit problem if it can match this lower bound, i.e. if its regret is $O(\log T)$ after $T$ turns. For the naive solutions we can thus derive the following. Play each machine with uniform random probability:

Play each machine times, play infinitely the machine giving the highest payoff:

where is the average winning of machine at time .

In this blog post we will discuss the Upper Confidence Bounds (UCB1) algorithm proposed by [2]. UCB1 is the simplest algorithm of the UCB family. The idea of the algorithm is to initially try each lever once and to record the reward gained from each machine as well as the number of times the machine has been played. At each turn $t$ select the machine

$$j_t = \arg\max_{j} \left( \bar{x}_j + \sqrt{\frac{2 \ln t}{n_j}} \right),$$

which depends on the average winning $\bar{x}_j$ of machine $j$ as well as the number of times $n_j$ the machine has been played, and the trial number $t$.

[2] show that for UCB1

Hence UCB1 achieves the optimal regret and is said to solve the multi-armed bandit problem.
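The difference between linear and logarithmic regret can be illustrated with a small simulation. The sketch below is written in Python and is independent of the JavaScript implementation that follows; the Bernoulli machines with success probabilities 0.2, 0.5 and 0.8, the function names and the use of the regret implied by the pull counts (instead of the realized rewards) are all assumptions made for the illustration.

```python
import math
import random


def play(means, T, policy, rng):
    """Play T turns on Bernoulli machines and return the regret implied by the pulls."""
    K = len(means)
    counts = [0] * K   # how often each machine was played
    sums = [0.0] * K   # total reward collected from each machine
    for t in range(1, T + 1):
        arm = policy(t, counts, sums, rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    best = max(means)
    # expected loss given the number of pulls of each inferior machine
    return sum((best - m) * n for m, n in zip(means, counts))


def uniform_policy(t, counts, sums, rng):
    """Naive strategy: pick a machine uniformly at random."""
    return rng.randrange(len(counts))


def ucb1_policy(t, counts, sums, rng):
    """UCB1: try every machine once, then maximize mean plus confidence term."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [s / n + math.sqrt(2 * math.log(t) / n) for s, n in zip(sums, counts)]
    return max(range(len(counts)), key=scores.__getitem__)


rng = random.Random(0)
means = [0.2, 0.5, 0.8]
regret_uniform = play(means, 5000, uniform_policy, rng)
regret_ucb1 = play(means, 5000, ucb1_policy, rng)
print(regret_uniform, regret_ucb1)
```

For the uniform strategy every inferior machine is played about a third of the time, so the regret grows proportionally to the number of turns, while UCB1 plays inferior machines less and less often.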

An advantage of UCB1 is that the algorithm is easy to implement. In the following we present an implementation of UCB1 using JavaScript. An advantage of JavaScript is that it treats functions as first-class values, so we can pass functions as arguments to other functions. In terms of the multi-armed bandit algorithm this allows us to pass entire sampling functions to the algorithm.

```javascript
/*
--Bandit.js
  -This program computes an optimal lever and the average return of the multiarmed bandit problem
--Input
  f=[f_1(x),...,f_K(x)] -An array of size K containing functions returning a certain reward r,
                         e.g. drawn from a binomial or a normal distribution with different means
  T -Number of runs after which the algorithm terminates
--Output
  sum \mu -the average revenue
  x_out
--Heiko Wagner 2019
*/
var bandit = function(f, T) {
  // pull each lever once
  var x_out = f.map((x) => [x(), 1]);
  var a;
  for (var t = 0; t < T; t++) {
    // total number of pulls so far; avoids Math.log(0) in the first round
    var n = f.length + t;
    // upper confidence bound of each lever
    var j = x_out.map((x) => x[0] / x[1] + Math.sqrt((2 * Math.log(n)) / x[1]));
    // determine the position with the highest value
    a = j.reduce((iMax, x, i, arr) => (x > arr[iMax] ? i : iMax), 0);
    // pull maximum lever
    x_out[a] = [x_out[a][0] + f[a](), x_out[a][1] + 1];
  }
  // average over all pulls, including the initial pull of each lever
  return [x_out.map((x) => x[0]).reduce((a, b) => a + b) / (T + f.length), x_out];
};

// Example
var K = 10;
var f = [];
for (let k = 0; k < K; k++) {
  f.push(() => Math.random() * k);
}
bandit(f, 10000);
```

Suppose now the gambler not only has to choose the machine, but first has to choose one out of $K'$ casinos. The question here is which casino is the best. From a theoretical point of view this setup is not very interesting because it can be reformulated as a classical one-stage multiarmed bandit problem. However, from an implementation point of view, especially if we add more stages with a complicated, possibly unknown structure, we face a demanding problem. Here the functional programming approach of JavaScript plays to its strengths. In the following we will thus present an example solving the multiarmed bandit problem with two stages.

```javascript
// Let's define a Casino class
class Casino {
  constructor(K, m) {
    this.K = K;
    this.m = m;
  }
  levers() {
    var f = [];
    for (let k = 0; k < this.K; k++) {
      const scale = k * this.m;
      f.push(() => Math.random() * scale);
    }
    return f;
  }
}

var K_dash = 5;
var K = 10;

// To build a two stage bandit problem, each lever of the outer bandit
// runs a complete bandit inside one casino
var f_2 = [];
for (let k = 1; k <= K_dash; k++) {
  f_2.push(() => bandit(new Casino(K, k).levers(), 1000)[0]);
}
bandit(f_2, 1000);
```

[1] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in applied mathematics, vol. 6, iss. 1, pp. 4-22, 1985.

[Bibtex]


```
@article{LAI19854,
title = "Asymptotically efficient adaptive allocation rules",
journal = "Advances in Applied Mathematics",
volume = "6",
number = "1",
pages = "4 - 22",
year = "1985",
issn = "0196-8858",
doi = "https://doi.org/10.1016/0196-8858(85)90002-8",
url = "http://www.sciencedirect.com/science/article/pii/0196885885900028",
author = "T.L Lai and Herbert Robbins"
}
```

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine learning, vol. 47, iss. 2, p. 235–256, 2002.

[Bibtex]


```
@Article{Auer2002,
author="Auer, Peter
and Cesa-Bianchi, Nicol{\`o}
and Fischer, Paul",
title="Finite-time Analysis of the Multiarmed Bandit Problem",
journal="Machine Learning",
year="2002",
month="May",
day="01",
volume="47",
number="2",
pages="235--256",
abstract="Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the search for a balance between exploring the environment to find profitable actions while taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is the loss due to the fact that the globally optimal policy is not followed all the times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first ones to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies which asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.",
issn="1573-0565",
doi="10.1023/A:1013689704352",
url="https://doi.org/10.1023/A:1013689704352"
}
```

To explain the registration problem I will start with an example. Figure 1 shows the pinch force dataset. To collect the data, a group of 20 subjects were asked to press a button as hard as they could after hearing a sound signal. The pressure was then recorded every 2 milliseconds, resulting in 151 observations. Since the reaction times of the subjects differ, we can clearly see some shift in the curves reflecting the pressure. The problem with this kind of shifted data is that even very simple statistical measures, like the mean or the variance, are no longer meaningful. To see this, take a look at the blue curve, which is the sample mean. It is visible that the mean curve does not reflect the shape of the sample curves and, even worse, the highest point of this curve lies below the peak of even the smallest sample curve. This curve is therefore a bad measure of the mean pressure.

In the case of the pinch force data, an obvious way to fix this problem is to align the curves at certain landmarks, for example the peaks of each curve. This landmark registration is covered for example by [1, 2], [3] or [4]. However, landmark registration has certain drawbacks. For more complex problems, defining the landmarks becomes ugly very fast, especially when working with more than one spatial dimension. It is also not clear how to choose the landmarks. Consider for example curves where some are wider than others; in this case inflection points are sometimes used as landmarks as well. This leads to methods which rely on minimizing a distance between the registered functions and a template, for example one curve out of the sample. See [5], [6], [7], or [8] for more insights. This strategy works very well in many situations but also has severe problems. Consider for example a sample of curves where some curves have one peak while others have two. A registration method that minimizes a distance between the curves and a one-peaked template will then tend to pinch the curves with two peaks, see [6].
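The effect of the shifts on the sample mean, and the repair by a single-landmark registration, can be sketched on simulated data. This is only an illustration: the bump-shaped curves, the names `make_curves` and `align_at_peak`, and all parameter values are assumptions and have nothing to do with the real pinch force data.

```python
import numpy as np

def make_curves(n_curves=20, n_points=151, seed=0):
    """Simulate bump-shaped curves whose peaks are randomly shifted in time."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 0.3, n_points)            # 151 points, 2 ms apart
    shifts = rng.uniform(0.08, 0.22, n_curves)     # hypothetical peak locations
    heights = rng.uniform(8.0, 12.0, n_curves)     # hypothetical peak heights
    z = (t[None, :] - shifts[:, None]) / 0.01
    return t, heights[:, None] * np.exp(-0.5 * z ** 2)

def align_at_peak(curves):
    """Landmark registration with one landmark: move every peak to the grid centre.

    np.roll wraps around the ends; that is harmless here because the
    simulated curves are practically zero near both ends.
    """
    centre = curves.shape[1] // 2
    aligned = np.empty_like(curves)
    for i, c in enumerate(curves):
        aligned[i] = np.roll(c, centre - int(np.argmax(c)))
    return aligned

t, curves = make_curves()
raw_mean = curves.mean(axis=0)                      # cross-sectional mean of shifted curves
registered_mean = align_at_peak(curves).mean(axis=0)

print("lowest individual peak :", curves.max(axis=1).min())
print("peak of the raw mean   :", raw_mean.max())
print("peak of registered mean:", registered_mean.max())
```

After aligning the peaks, the peak of the mean curve equals the average of the individual peak heights, whereas the mean of the unregistered curves is smeared out and much flatter than any single curve.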

An alternative approach was developed by [9] and [10], where registration is considered as a tool for statistical analysis. Whenever the random functions possess “bounded shape variation”, there exists a finite and warping functions such that with probability 1

(1)

for some basis functions and individually different coefficients . An advantage of this way of looking at the registration problem is that it allows curves to be registered which have a more complex structure than the curves displayed in Figure 1. Traditional registration procedures can be understood as a registration with and are troubled by curves as displayed in Figure 3. Decomposition (1) is unfortunately not unique if . There will then exist different sets of basis functions and such that

(2)

The corresponding spaces and spanned by and , respectively, may be structurally very different from each other. As an example, consider continuous periodic functions with period length equal to 1, and assume that in each period every curve just possesses one local maximum and one minimum. Registration is driven by the succession of local extrema (shape features) in each of the functions . For any continuous function one can determine locations and heights of all isolated local extrema in the interior of . This means that for all there exists an open neighborhood of such that either for all or for all . If let , and let denote the corresponding -dimensional vector of heights of local extrema (including starting and end points). When analyzing such functions on the interval , periodicity just means that , . If each of the curves just has one maximum and one minimum, then , and by . It is indeed simple to construct a 3-dimensional space analytically. For example, let , , , and . Quite obviously, for any there exists a unique element with and . We can thus conclude that there are unique warping functions and unique coefficients such that

(3)

Note that the functions have their local extrema at different locations, depending on and . Registration to therefore does not lead to an alignment of shape features. But is not the only possible candidate space. Consider the space of all polynomials of order 5 satisfying the constraints as well as . This is again a three-dimensional space of functions with identical starting and end points, while the generates functions with one local maximum and one minimum in the interior of . There thus exists a set of warping functions such that . The two spaces and are not identical. As a matter of fact, one can construct arbitrarily many candidate spaces by pre-choosing an arbitrary set of basis functions, as long as there exists a with . A central question is which space can be considered the “best”. [10] answered this question by selecting the linear subspace where the least amount of warping is necessary.

[1] F. L. Bookstein, Morphometric tools for landmark data: geometry and biology, Cambridge University Press, 1997.

[Bibtex]

```
@book{bookstein1997morphometric,
title={Morphometric Tools for Landmark Data: Geometry and Biology},
author={Bookstein, F.L.},
isbn={9780521585989},
lccn={lc91039063},
series={Geometry and Biology},
url={http://books.google.co.in/books?id=amwT1ddIDwAC},
year={1997},
publisher={Cambridge University Press}
}
```

[2] F. L. Bookstein, The measurement of biological shape and shape change, Springer, 1978.

[Bibtex]

```
@book{bookstein1998,
AUTHOR = "Bookstein, F.L.",
TITLE = "The Measurement of Biological Shape and Shape Change",
PUBLISHER = "Springer",
YEAR = "1978",
BIBSOURCE = "http://www.visionbib.com/bibliography/describe448.html#TT52072"}
```

[3] A. Kneip and T. Gasser, “Statistical tools to analyze data representing a sample of curves,” The annals of statistics, vol. 20, iss. 3, p. 1266–1305, 1992.

[Bibtex]

```
@article{kneip1992,
ajournal = "Ann. Statist.",
author = "Kneip, Alois and Gasser, Theo",
doi = "10.1214/aos/1176348769",
journal = "The Annals of Statistics",
month = "09",
number = "3",
pages = "1266--1305",
publisher = "The Institute of Mathematical Statistics",
title = "Statistical Tools to Analyze Data Representing a Sample of Curves",
url = "http://dx.doi.org/10.1214/aos/1176348769",
volume = "20",
year = "1992"
}
```

[4] T. Gasser and A. Kneip, “Searching for structure in curve sample,” Journal of the american statistical association, vol. 90, iss. 432, pp. 1179-1188, 1995.

[Bibtex]

```
@article{gasser:95,
ISSN = {01621459},
URL = {http://www.jstor.org/stable/2291510},
abstract = {The shape of a regression curve can to a large extent be characterized by the succession of structural features like extrema, inflection points, and so on. When analyzing a sample of regression curves, it is often important to know at an early stage of data analysis which structural features are occurring consistently in each curve of the sample. Such a definition is usually not easy due to substantial interindividual variation both in the x and the y axis and due to the influence of noise. A method is proposed for identifying typical features without relying on an a priori specified functional model for the curves. The approach is based on the frequencies of occurrence of structural features, as, for example, maxima in the curve sample along the x axis. Important tools are nonparametric regression and differentiation and kernel density estimation. Apart from a theoretical foundation, the usefulness of the method is documented by application to two interesting biomedical areas: growth and development, and neurophysiology.},
author = {Gasser, Theo and Kneip, Alois },
journal = {Journal of the American Statistical Association},
number = {432},
pages = {1179-1188},
publisher = {Taylor & Francis, Ltd.},
title = {Searching for Structure in Curve Sample},
volume = {90},
year = {1995}
}
```

[5] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” Acoustics, speech and signal processing, ieee transactions on, vol. 26, iss. 1, p. 43–49, 1978.

[Bibtex]

```
@article{sakoe1978,
abstract = {This paper reports on an optimum dynamic progxamming (DP) based time-normalization algorithm for spoken word recognition. First, a general principle of time-normalization is given using time-warping function. Then, two time-normalized distance definitions, called symmetric and asymmetric forms, are derived from the principle. These two forms are compared with each other through theoretical discussions and experimental studies. The symmetric form algorithm superiority is established. A new technique, called slope constraint, is successfully introduced, in which the warping function slope is restricted so as to improve discrimination between words in different categories. The effective slope constraint characteristic is qualitatively analyzed, and the optimum slope constraint condition is determined through experiments. The optimized algorithm is then extensively subjected to experimental comparison with various DP-algorithms, previously applied to spoken word recognition by different research groups. The experiment shows that the present algorithm gives no more than about two-thirds errors, even compared to the best conventional algorithm.},
author = {Sakoe, H. and Chiba, S. },
booktitle = {Acoustics, Speech and Signal Processing, IEEE Transactions on},
citeulike-article-id = {3496861},
journal = {Acoustics, Speech and Signal Processing, IEEE Transactions on},
keywords = {dtw, litreview, thesis},
number = {1},
pages = {43--49},
posted-at = {2008-11-08 22:11:03},
priority = {0},
title = {Dynamic programming algorithm optimization for spoken word recognition},
url = {http://ieeexplore.ieee.org/xpls/abs\_all.jsp?arnumber=1163055},
volume = {26},
year = {1978}
}
```

[6] J. O. Ramsay and X. Li, “Curve registration,” Journal of the royal statistical society: series b (statistical methodology), vol. 60, iss. 2, p. 351–363, 1998.

[Bibtex]

```
@article {Ramsay19982,
author = {Ramsay, J. O. and Li, Xiaochun},
title = {Curve registration},
journal = {Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
volume = {60},
number = {2},
publisher = {Blackwell Publishers Ltd.},
issn = {1467-9868},
url = {http://dx.doi.org/10.1111/1467-9868.00129},
doi = {10.1111/1467-9868.00129},
pages = {351--363},
keywords = {Dynamic time warping, Geometric Brownian motion, Monotone functions, Spline, Stochastic time, Time warping},
year = {1998},
}
```

[7] J. O. Ramsay, “Estimating smooth monotone functions,” Journal of the royal statistical society: series b (statistical methodology), vol. 60, iss. 2, p. 365–375, 1998.

[Bibtex]

```
@article {Ramsay1998,
author = {Ramsay, J. O.},
title = {Estimating smooth monotone functions},
journal = {Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
volume = {60},
number = {2},
publisher = {Blackwell Publishers Ltd.},
issn = {1467-9868},
url = {http://dx.doi.org/10.1111/1467-9868.00130},
doi = {10.1111/1467-9868.00130},
pages = {365--375},
keywords = {Convex functions, Density estimation, Generalized additive model, Linear differential equation, Monotonicity, Nonparametric regression, Regression spline, Spline smoothing},
year = {1998},
}
```

[8] A. Kneip, X. Li, K. B. MacGibbon, and J. O. Ramsay, “Curve registration by local regression,” Canadian journal of statistics, vol. 28, iss. 1, p. 19–29, 2000.

[Bibtex]

```
@article{KneipGib2000,
abstract = {{Functional data analysis involves the extension of familiar statistical procedures such as principal-components analysis, linear modelling and canonical correlation analysis to data where the raw observation is a function x, (t). An essential preliminary to a functional data analysis is often the registration or alignment of salient curve features by suitable monotone transformations hi(t). In effect, this conceptualizes variation among functions as being composed of two aspects: phase and amplitude. Registration aims to remove phase variation as a preliminary to statistical analyses of amplitude variation. A local nonlinear regression technique is described for identifying the smooth monotone transformations hi, and is illustrated by analyses of simulated and actual data.}},
address = {Facult\'{e} des sciences \'{e}conomiqu.es, sociales etpolitiques Universit\'{e} catholique de Louvain, Place Montesquieu 4 B-1348 Louvain-la-Neuve, Belgium; no e-mail address available 700 North Alabama Street, Indianapolis, IN 46204, USA; D\'{e}p. de math\'{e}matiques, Universit\'{e} du Qu\'{e}bec \`{a} Montr\'{e}al C. P. 8888 Succursale centre-ville, Montr\'{e}al (Quebec), Canada H3C 3P8; Dept. of Psychology, McGill University 1205 avenue Docteur-Penfield, Montreal (Quebec), Canada H3A 1B1},
author = {Kneip, A. and Li, X. and MacGibbon, K. B. and Ramsay, J. O.},
citeulike-article-id = {6101184},
citeulike-linkout-0 = {http://dx.doi.org/10.2307/3315251.n},
citeulike-linkout-1 = {http://www3.interscience.wiley.com/cgi-bin/abstract/122439952/ABSTRACT},
doi = {10.2307/3315251.n},
issn = {1708-945X},
journal = {Canadian Journal of Statistics},
keywords = {alignment},
number = {1},
pages = {19--29},
posted-at = {2009-11-12 12:34:54},
priority = {2},
title = {{Curve registration by local regression}},
url = {http://dx.doi.org/10.2307/3315251.n},
volume = {28},
year = {2000}
}
```

[9] A. Kneip and J. O. Ramsay, “Combining registration and fitting for functional models,” Journal of the american statistical association, vol. 103, iss. 483, pp. 1155-1165, 2008.

[Bibtex]

```
@ARTICLE{Kneip2008,
title = {Combining Registration and Fitting for Functional Models},
author = {Kneip, Alois and Ramsay, James O},
year = {2008},
journal = {Journal of the American Statistical Association},
volume = {103},
number = {483},
pages = {1155-1165},
url = {http://EconPapers.repec.org/RePEc:bes:jnlasa:v:103:i:483:y:2008:p:1155-1165}
}
```

[10] H. Wagner and A. Kneip, “Nonparametric registration to low-dimensional function spaces,” Computational statistics & data analysis, 2019.

[Bibtex]

```
@article{WAGNER2019,
title = "Nonparametric registration to low-dimensional function spaces",
journal = "Computational Statistics & Data Analysis",
year = "2019",
issn = "0167-9473",
doi = "https://doi.org/10.1016/j.csda.2019.03.004",
url = "http://www.sciencedirect.com/science/article/pii/S0167947319300714",
author = "Heiko Wagner and Alois Kneip",
keywords = "Amplitude variation, Genes, Dimension reduction, Functional data analysis, Functional principal components, Low dimensional linear function spaces, Phase variation, Registration, Time warping",
abstract = "Registration aims to decompose amplitude and phase variation of samples of curves. Phase variation is captured by warping functions which monotonically transform the domains. Resulting registered curves should then only exhibit amplitude variation. Most existing methods assume that all sample functions exhibit a typical sequence of shape features like peaks or valleys, and registration focuses on aligning these features. A more general perspective is adopted which goes beyond feature alignment. A registration method is introduced where warping functions are defined in such a way that the resulting registered curves span a low dimensional linear function space. The approach may be used as a tool for analyzing any type of functional data satisfying a structural regularity condition called bounded shape variation. Problems of identifiability are discussed in detail, and connections to established registration procedures are analyzed. The method is applied to real and simulated data."
}
```

This blog post is about Support Vector Machines (SVM), but not only about SVMs. SVMs belong to the class of classification algorithms and are used to separate two or more groups. In its pure form an SVM is a linear separator, meaning that SVMs can only separate groups using a straight line. However, ANY linear classifier can be transformed into a nonlinear classifier, and SVMs are an excellent way to explain how this can be done. For a deeper introduction to the topic I recommend Tibshirani (2009), where one can find a more detailed description including a derivation of the complete Lagrangian.

The general idea of SVM is to separate two (or more) groups using a straight line (see Figure 1). However, in general there exist infinitely many lines that fulfill the task. So which one is the “correct” one? The SVM answers this question by choosing the line (or hyperplane if we suppose more than two features) which is farthest away (the distance is denoted by ) from the nearest points within each group.
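The selection rule can be made concrete with a small sketch that computes, for a given candidate line, the distance to the nearest point. The function name `margin`, the parametrisation of the line by a normal vector `beta` and an offset `beta0`, and the toy data are assumptions made only for this illustration.

```python
import numpy as np

def margin(X, y, beta, beta0):
    """Smallest signed distance of the points to the line x @ beta + beta0 = 0.

    The value is positive only if every point lies on the side indicated
    by its label y in {-1, +1}, i.e. if the line separates the groups.
    """
    beta = np.asarray(beta, dtype=float)
    return np.min(y * (X @ beta + beta0) / np.linalg.norm(beta))

# Two groups on a line: labels -1 on the left, +1 on the right (toy data).
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y = np.array([-1, -1, 1, 1])

m_centre = margin(X, y, [1.0, 0.0], -2.0)   # vertical line x1 = 2
m_left = margin(X, y, [1.0, 0.0], -1.5)     # vertical line x1 = 1.5
print(m_centre, m_left)
```

Both lines separate the groups, but the line through the middle keeps a distance of 1 to the nearest points while the other one only keeps 0.5, so the first one is the one an SVM would prefer.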

Suppose multivariate data are given by a pair where the explanatory variable and the group coding . In the following we assume an i.i.d. sample of size given by .

Any separating hyperplane (which is a line if ) can therefore be described such that there exists some and

(1)

Note that with the (non restrictive) condition that . Our task to separate the groups is covered by the minimization problem

(2)

In most cases the assumption that there exists a hyperplane that perfectly separates the data points is unrealistic. Usually some points will lie on the other side of the hyperplane. In that case (2) will not have a solution. The idea of SVM is now to introduce a parameter to fix this issue. In particular we modify (1) such that we require

(3)

The corresponding minimization problem is then given by

(4)

Both (2) and (4) are convex optimization problems and can be solved for example using the Lagrange minimization technique. A solution for (4) always has the form

(5)

where iff (3) is met with “” and else; these are then called support vectors.

The method described so far can only handle data where the groups can be separated using some hyperplane (line). However, in many cases the data are not suited to be separated by a linear method. See Figure 2 for example, where the groups are arranged in two circles with different radii. Any attempt to separate the groups using a line will thus fail. However, any linear method can be transformed into a non-linear method by projecting the data into a higher dimensional space. The projected data may then be separable by some linear method. In the case of Figure 2 a suitable projection is for example given by , which maps the data onto a cone where the data can be separated using a hyperplane, as can be seen in Figure 3. However, there are some drawbacks to this method. First of all, will in general be unknown. In our simple example we were lucky to find a suitable , but for a more complicated data structure finding a suitable turns out to be very hard. Secondly, depending on the dataset, the dimension of needed to guarantee the existence of a separating hyperplane can become quite large and even infinite. Before we deal with these issues using kernels, we will first have a look at the modified minimization problem (4) using instead of .
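The projection idea can be tried out on simulated circle data. The concrete map used below, which appends the radius as a third coordinate, is an assumption chosen because it puts the points on a cone; the name `phi`, the radii 1 and 3 and the separating height 2 are likewise only illustrative.

```python
import numpy as np

def phi(X):
    """Hypothetical feature map: append the radius, lifting the plane onto a cone."""
    r = np.sqrt((X ** 2).sum(axis=1))
    return np.column_stack([X, r])

rng = np.random.default_rng(1)
n = 200
inner = np.arange(n) < n // 2                       # first half: inner circle
angles = rng.uniform(0.0, 2.0 * np.pi, n)
radii = np.where(inner, 1.0, 3.0)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
X += rng.normal(0.0, 0.1, X.shape)                  # small measurement noise
y = np.where(inner, -1, 1)

Z = phi(X)                                          # data in three dimensions
# In the lifted space the horizontal plane z3 = 2 is a separating hyperplane.
predicted = np.sign(Z[:, 2] - 2.0)
print("misclassified points:", int((predicted != y).sum()))
```

No straight line in the original two coordinates separates the circles, but after the lift a single linear rule on the third coordinate does.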

Let the nonlinear solution function be given by . To overcome the need for the constant we introduce some penalty and rewrite (4) as

(6)

The notation means that only the positive part of is taken into account. At this point I would like to establish the connection to other linear methods. Since we consider SVMs we set the loss function to . Different loss functions will lead to different methods, for example using will correspond to logistic regression. Therefore the following strategy can be used to extend linear methods to deliver non-linear solutions.
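To compare the two loss functions mentioned here, the following sketch writes them down in their usual textbook form for labels in {-1, +1}. The function names and the ridge-type penalty `lam / 2 * ||beta||^2` are assumptions; the exact scaling of the penalty in (6) may differ.

```python
import numpy as np

def hinge_loss(y, f):
    """SVM loss: positive part of 1 - y * f."""
    return np.maximum(0.0, 1.0 - y * f)

def logistic_loss(y, f):
    """Binomial deviance log(1 + exp(-y * f)), which leads to logistic regression."""
    return np.logaddexp(0.0, -y * f)

def penalized_risk(loss, y, f, beta, lam):
    """Sum of the losses plus a ridge-type penalty (one common form, assumed here)."""
    return loss(y, f).sum() + 0.5 * lam * np.dot(beta, beta)

y = np.array([1.0, -1.0, 1.0])
f = np.array([2.0, 0.5, 0.0])      # values of the solution function at three points
beta = np.array([1.0, 1.0])

print(hinge_loss(y, f))            # zero for the first point, which is far on the correct side
print(logistic_loss(y, f))         # strictly positive everywhere
print(penalized_risk(hinge_loss, y, f, beta, lam=2.0))
```

Swapping `hinge_loss` for `logistic_loss` in `penalized_risk` changes the method but not the overall structure of the problem, which is the point made in the paragraph above.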

From (5) we already know, that a nonlinear solution function will have the form

(7)

We can verify that knowledge about is in fact not required to formulate the solution. We only need to know something about the inner product . While could be a high dimensional complicated function, the inner product is just a number. So if we had some magical procedure that gives us this number, solving the nonlinear SVM would be straightforward. This leads us to think about kernels.

We already got in touch with kernels when estimating densities or a regression. In this application, kernels are a way to reduce the infinite-dimensional problem to a finite-dimensional optimization problem, because the complexity of the optimization problem depends only on the dimensionality of the input space and not on that of the feature space. To see this, instead of looking at some particular function , we consider the whole space of functions generated by the linear span of where is just one element in this space. To model this space in practice, popular kernels are
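A small numerical check illustrates why the inner product is all that is needed. For the degree-2 polynomial kernel in two dimensions the feature map can still be written down explicitly, so both sides can be compared. The names `phi_quadratic` and `poly_kernel` are made up for this sketch, and the six-dimensional map is one standard choice, not necessarily the one used in this post.

```python
import numpy as np

def phi_quadratic(x):
    """Explicit degree-2 feature map for x in R^2 (six coordinates)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel of degree 2: (1 + <x, z>)^2."""
    return (1.0 + np.dot(x, z)) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])

lhs = np.dot(phi_quadratic(x), phi_quadratic(z))   # inner product in feature space
rhs = poly_kernel(x, z)                            # same number, computed in input space
print(lhs, rhs)
```

The kernel evaluation never forms the six-dimensional vectors; for higher degrees or the radial kernel the feature space grows or becomes infinite-dimensional, while the cost of the kernel evaluation stays the same.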

- linear kernel:
- polynomial kernel:
- radial kernel:
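The three kernels from the list can be sketched as follows. The formulas are the common textbook parametrisations (with hypothetical parameter names `degree`, `c` and `gamma`) and may be scaled differently from the ones intended in this post. The functions return the full matrix of kernel values between the rows of two data matrices.

```python
import numpy as np

def linear_kernel(X, Z):
    """K(x, z) = <x, z>."""
    return X @ Z.T

def polynomial_kernel(X, Z, degree=2, c=1.0):
    """K(x, z) = (c + <x, z>)^degree."""
    return (c + X @ Z.T) ** degree

def radial_kernel(X, Z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = ((X ** 2).sum(axis=1)[:, None]
               + (Z ** 2).sum(axis=1)[None, :]
               - 2.0 * (X @ Z.T))
    # Clip tiny negative values caused by floating point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))

K = radial_kernel(X, X)
print(K.shape)
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
```

The matrix `K` is symmetric with ones on the diagonal, and its eigenvalues are non-negative up to rounding, which is the property a valid kernel needs.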

(8)

where are the orthonormal eigenfunctions of and eigenvalues , which ensure that the generated space is a Hilbert space. Then for suitable and for some we can represent

(9)

Thus we can write (6) in its general form as

(10)

According to (9) we can write down (10) using matrix notation as

(11)

This finite dimensional problem can now be solved using standard methods.

In a previous article I presented an implementation of a kernel density estimation using pyspark. It is not difficult to modify the algorithm to estimate a kernel regression. Suppose that there exists some function , for instance a temperature curve which measures the temperature during a day. In practice, such functions will often not be directly observed, but one will have to deal with discrete, noisy observations contaminated with some error. The purpose of kernel regression is then to estimate the underlying true function.

In the following I will only consider a simple, standard error model: For design points

there are noisy observations such that

(1)

for i.i.d. zero mean error terms with finite variance and .

To estimate I stick to the Nadaraya-Watson-Estimator given by

where is a kernel as described here; in the implementation an Epanechnikov kernel is used again. The reader might recognize that is just the density (scaled by ) already implemented here, while is just a weighted version of it. The implementation should therefore be very similar, leading to the following algorithm:

```
### A Spark function to derive a non-parametric kernel regression
import matplotlib.pyplot as plt
import numpy as np

## 1.0 Simulated data (sc is the SparkContext provided by the pyspark shell)
T = 1500
time = np.sort(np.random.uniform(0, 1, T))  # sorting is not required, it only makes plotting easier
true = np.sin(time / (0.05 * np.pi))
X = true + np.random.normal(0, 0.1, T)
data = sc.parallelize(list(zip(time.tolist(), X.tolist())))
plt.plot(time, true, 'bo')
plt.plot(time, X, 'bo')
plt.show()

## 2.0 The function
def spark_regression(data, Nout, bw):
    # 2.1 Kernel function
    def epan_kernel(x, y, b):
        u = (x - y) / b
        return max(0.0, 1.0 / b * 3.0 / 4.0 * (1 - u**2))

    # derive the minimum and maximum used for the interpolation grid
    mini = data.map(lambda x: x[0]).takeOrdered(1, lambda x: x)[0]
    maxi = data.map(lambda x: x[0]).takeOrdered(1, lambda x: -1 * x)[0]
    # create an interpolation grid (in fact this time it's random)
    random_grid = sc.parallelize(np.random.uniform(mini, maxi, Nout).tolist())

    # compute y_i * K(x - xi) and K(x - xi) for every (observation, grid point) pair
    def weights(x):
        k = epan_kernel(x[0][0], x[1], bw)
        return (x[1], (x[0][1] * k, k))

    # added optional filter() on the kernel weight; does anyone know if this improves performance?
    sums = (data.cartesian(random_grid)
                .map(weights)
                .filter(lambda kv: kv[1][1] > 0)
                .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    # weighted sum divided by the sum of weights
    return sums.mapValues(lambda s: s[0] / s[1])

## 3.0 Results
fitted = spark_regression(data, 128, 0.05).collect()
fit = np.array(fitted).transpose()
plt.plot(time, true, color='r')
plt.plot(fit[0], fit[1], 'bo')
plt.show()
```
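For small data sets the same estimator can be cross-checked without Spark. The following is a minimal NumPy sketch of the Nadaraya-Watson estimator on the same kind of simulated data; it is not the Spark implementation, and the names `nadaraya_watson`, `t_obs`, `y_obs` and the evaluation grid are assumptions made for this illustration.

```python
import numpy as np

def epan_kernel(u):
    """Epanechnikov kernel, zero outside [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def nadaraya_watson(t_obs, y_obs, grid, bw):
    """Kernel-weighted mean of y_obs at each grid point."""
    weights = epan_kernel((grid[:, None] - t_obs[None, :]) / bw) / bw
    denom = weights.sum(axis=1)          # the (unnormalised) density part
    numer = weights @ y_obs              # the weighted version of it
    # Grid points without any observation inside the bandwidth get NaN.
    return np.divide(numer, denom, out=np.full_like(numer, np.nan), where=denom > 0)

rng = np.random.default_rng(0)
T = 1500
t_obs = np.sort(rng.uniform(0.0, 1.0, T))
y_obs = np.sin(t_obs / (0.05 * np.pi)) + rng.normal(0.0, 0.1, T)

grid = np.linspace(0.1, 0.9, 128)        # stay away from the boundaries
fit = nadaraya_watson(t_obs, y_obs, grid, bw=0.05)
max_err = np.max(np.abs(fit - np.sin(grid / (0.05 * np.pi))))
print("largest absolute error on the grid:", max_err)
```

The grid deliberately avoids the boundary region, where a local-constant estimator is known to be biased; the Spark version above evaluates on the full observed range.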

```
### A Spark function to derive a non-parametric kernel density
import matplotlib.pyplot as plt
import numpy as np

## 1.0 Simulated data (sc is the SparkContext provided by the pyspark shell)
N = 15000
mu, sigma = 2, 3  # mean and standard deviation
rdd = sc.parallelize(np.random.normal(mu, sigma, N).tolist())

## 2.0 The function
def spark_density(data, Nout, bw):
    # 2.1 Kernel function
    def epan_kernel(x, y, b):
        u = (x - y) / b
        return max(0.0, 1.0 / b * 3.0 / 4.0 * (1 - u**2))

    # derive the minimum and maximum used for the interpolation grid
    mini = data.takeOrdered(1, lambda x: x)[0]
    maxi = data.takeOrdered(1, lambda x: -1 * x)[0]
    # create an interpolation grid (in fact NOT random this time)
    random_grid = sc.parallelize(np.linspace(mini, maxi, num=Nout).tolist())
    Nin = data.count()
    # compute the K(x - xi) matrix, already scaled by 1/Nin
    kernl = data.cartesian(random_grid).map(
        lambda x: (x[1], epan_kernel(x[0], x[1], bw) / Nin))
    # sum up over the observations for every grid point
    return kernl.reduceByKey(lambda y, x: y + x)

## 3.0 Results
density = spark_density(rdd, 128, 0.8).collect()
dens = np.array(density).transpose()
# plot the estimate
plt.plot(dens[0], dens[1], 'bo')
axis2 = np.linspace(-10, 10, num=128)
# plot the true density
plt.plot(axis2, 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(axis2 - mu)**2 / (2 * sigma**2)), linewidth=2, color='r')
plt.show()
```