Kernel Regression using the Fast Fourier Transform

1. Setup

In a previous post it was shown how to speed up the computation of a kernel density using the Fast Fourier Transform. Conceptually, a kernel density estimate is not that far away from a kernel regression, so this post covers using the FFT to speed up the computation of a kernel regression. A popular estimator for the regression function E(Y|X=x)=m(x), given observed points (x_1, y_1),…,(x_T, y_T), is the Nadaraya–Watson estimator with kernel function K() and bandwidth h:

(1)   \begin{equation*}\widehat{m}_{h}(x)={\frac {\sum _{i=1}^{T}K_{h}(x-x_{i})y_{i}}{\sum _{j=1}^{T}K_{h}(x-x_{j})}}.\end{equation*}

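For reference, evaluating (1) directly costs \mathcal{O}(T) operations per evaluation point. A minimal sketch with an Epanechnikov kernel (the function name and setup are my own, not part of the original post):

import numpy as np

def nadaraya_watson(x, X, Y, h):
    # direct evaluation of (1) at a single point x: O(T) per point
    w = np.maximum(0, 0.75 * (1 - ((x - X) / h) ** 2)) / h  # Epanechnikov weights
    return np.dot(w, Y) / np.sum(w)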
The FFT requires a grid with T'=2^{k}, k \in \mathbb{N} design points. A meaningful choice in terms of asymptotics is k=\left \lceil{\frac{\log(T)}{\log(2)}}\right \rceil, which gives T \leq T' < 2T, so T and T' are of the same order of magnitude (e.g., for T=500 one gets k=9 and T'=512). In a first step we have to interpolate y=(y_1,\dots,y_T), x=(x_1,\dots,x_T) onto an equidistant grid with T' points via

(2)   \begin{equation*}y^I=y_{a}+\left(y_{b}-y_{a}\right){\frac {x^I-x_{a}}{x_{b}-x_{a}}}{\text{ at the point }}\left(x^I,y^I\right).\end{equation*}

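In NumPy, this linear interpolation onto the grid is a one-liner. A small sketch with made-up data (np.interp expects the x-values to be increasing and implements exactly (2)):

import numpy as np

x = np.sort(np.random.uniform(0, 1, 500))   # hypothetical design points
y = np.sin(2 * np.pi * x)                   # hypothetical observations
grid = np.linspace(x.min(), x.max(), 512)   # equidistant grid with T' = 512 points
y_interp = np.interp(grid, x, y)            # linear interpolation as in (2)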
Since the interpolated data lie on an equidistant grid x^I=(x^I_1, \dots, x^I_{T'}), the denominator of (1) is asymptotically constant at interior points and (1) simplifies to

(3)   \begin{equation*}\widehat{m}^I_{h}(x)= \frac{\max(x^I)-\min(x^I)}{T'} \sum _{i=1}^{T'}K_{h}(x-x^I_{i})y^I_{i}.\end{equation*}

Let the discrete Fourier transform of \hat{m} be denoted by \tilde{m}. Using the convolution and translation properties of the Fourier transform, the transform of (3) for s \in (x^I_1, \dots, x^I_{T'}) is given by

(4)   \begin{equation*}\tilde{m}_h(s)= \frac{\max(x^I)-\min(x^I)}{T'} \tilde{K}_{h}(s) \sum_{i=1}^{T'} y^I_{i} e^{-\mathrm{i} s x^I_{i}}=\frac{\max(x^I)-\min(x^I)}{T'}\, \tilde{K}_{h}(s)\, \tilde{y}(s)\end{equation*}


where \tilde{y}(s)= \sum_{i=1}^{T'} y^I_{i} e^{-\mathrm{i} s x^I_{i}} is the discrete Fourier transform of the interpolated data, as computed by the FFT. \hat{m}^I can then be obtained by applying the inverse FFT to (4). All these steps involve only \mathcal{O}(T' \log(T')) operations instead of the \mathcal{O}(T'T) operations needed to compute (3) directly. Sorting, which may be required prior to the linear interpolation, has a computational complexity of \mathcal{O}(T \log(T)); this is of the same order of magnitude as computing the FFT-based estimator itself and thus does not affect the overall computational complexity.
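The identity behind (4) is the convolution theorem for the DFT: a pointwise product in the frequency domain corresponds to a circular convolution in the original domain. A quick numerical check (a sketch, not part of the original post):

import numpy as np

rng = np.random.default_rng(0)
N = 8
a, b = rng.normal(size=N), rng.normal(size=N)

# direct circular convolution: c[n] = sum_m a[m] * b[(n - m) mod N]
c_direct = np.array([sum(a[m] * b[(n - m) % N] for m in range(N)) for n in range(N)])
c_fft = np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

print(np.allclose(c_direct, c_fft))  # True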

2. Asymptotic Considerations

An important question is whether the interpolation degrades the asymptotic properties of the estimator. The regression setup can equivalently be described as follows: for T design points x_1,\dots,x_T, which w.l.o.g. lie in [0,1], there are noisy observations y_{l} such that

(5)   \begin{eqnarray*}y_{l}=m(x_l)+\epsilon_{l},\quad l=1,\dots,T,\end{eqnarray*}


for i.i.d. zero-mean error terms \epsilon_{l} with finite variance \sigma^2>0 and E(\epsilon_{l}^4)<\infty. Following [1], the Nadaraya–Watson estimator at an interior point x has a mean squared error of

    \[\operatorname {E} ((m(x)-{\hat {m}}(x))^2) \approx \underbrace {h^{4}B^{2}} _{={\text{Bias}}^{2}}+\underbrace {{\frac {1}{Th}}V}_{={\text{Variance}}}\]

while, using a simple Taylor expansion, for neighbouring points x_a,x_b the error introduced by the linear interpolation is bounded by

    \[\operatorname {E} (|m^I(x)-m(x)|)\leq \operatorname {E} (C(x_{b}-x_{a})^{2}) = \mathcal{O}_p( T^{-2} )\quad {\text{where}}\quad C={\frac {1}{8}}\max _{r\in [x_{a},x_{b}]}|m''(r)|.\]

Putting it all together and using an appropriate bandwidth of order h^*= \mathcal{O}(T^{-1/5})=\mathcal{O}(T'^{-1/5}), we can decompose

    \[\operatorname {E} ((m(x)-{\hat {m}^I}(x))^2)=\operatorname {E} ((\underbrace {m(x)-m^I(x)}_{\text{interpolation error}}+\underbrace {m^I(x)-{\hat {m}^I}(x)}_{\text{estimation error}} )^2) \]

and some straightforward calculations show that the interpolation error is dominated by the kernel regression error (see the sketch below). A detailed proof and an extension allowing for multivariate X can be found in [2].
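Spelling out the orders under h^* \sim T^{-1/5} makes this explicit (a sketch; the cross term is bounded via Cauchy–Schwarz):

    \[\operatorname {E} ((m(x)-{\hat {m}^I}(x))^2)=\underbrace{\mathcal{O}(T^{-4})}_{\text{interpolation}}+\underbrace{\mathcal{O}(T^{-12/5})}_{\text{cross term}}+\underbrace{\mathcal{O}(T^{-4/5})}_{\text{estimation}}=\mathcal{O}(T^{-4/5}).\]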

3. Python Code

To implement the method, the fft function that comes with the numpy package was chosen. numpy's fft() returns the DFT coefficients ordered from frequency zero upwards, i.e., the positive frequencies first and then the negative ones. fftshift() swaps the two halves of the output vector, so the output of fftshift(fft()) is ordered from -\pi to \pi with the zero-frequency component in the middle. In the code below, fftshift() is applied after the inverse transform to undo the circular shift that arises because the kernel is centered at the midpoint of the grid.

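A quick illustration of this ordering on a hypothetical 8-point grid (np.fft.fftfreq returns the frequencies in fft() order):

import numpy as np

freqs = np.fft.fftfreq(8)
print(freqs)                   # [ 0.     0.125  0.25   0.375 -0.5   -0.375 -0.25  -0.125]
print(np.fft.fftshift(freqs))  # [-0.5   -0.375 -0.25  -0.125  0.     0.125  0.25   0.375]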
import numpy as np
import matplotlib.pyplot as plt
import timeit
from scipy.interpolate import interp1d

# Simulation: cubic trend plus Gaussian noise
T = 500
X = np.linspace(-5, 5, T)
Y = X**3 + np.random.normal(0, 10, T)
plt.scatter(X, Y, c="blue")

plt.xlabel('X')
plt.ylabel('Y')

# Presetting: grid size T' = 2^k with k = ceil(log(T)/log(2))
Tout = int(2 ** np.ceil(np.log(T) / np.log(2)))
print(Tout)

# Interpolate Y linearly onto an equidistant grid with Tout points
f = interp1d(X, Y)
grid = np.linspace(-5, 5, num=Tout)
Ynew = f(grid)

def epan_kernel(u, b):
    # Epanechnikov kernel with bandwidth b, evaluated at a scalar u
    u = u / b
    return max(0, 1. / b * 3. / 4 * (1 - u ** 2))


# Estimation using FFT
start = timeit.default_timer()

t = grid
h = Tout ** (-1 / 5)                     # bandwidth of order T'^(-1/5)
kernel = [epan_kernel(x, h) for x in t]  # kernel centered at the grid midpoint
kernel_ft = np.fft.fft(kernel)
func_tmp = kernel_ft * np.fft.fft(Ynew)  # convolution theorem: pointwise product
# inverse FFT, undo the circular shift with fftshift, rescale by the grid range
m_fft = (max(X) - min(X)) * np.fft.fftshift(np.fft.ifft(func_tmp).real) / Tout
plt.plot(t, m_fft)

stop = timeit.default_timer()
print('Time: ', stop - start)

plt.show()
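For comparison, here is a rough timing of the direct \mathcal{O}(T'T) evaluation of the Nadaraya–Watson estimator on the same grid, reusing epan_kernel, X, Y, grid, and h from the script above (nw_direct is a hypothetical helper, just a sketch):

def nw_direct(x, X, Y, h):
    # direct evaluation of (1) at a single point x
    num = sum(epan_kernel(x - xi, h) * yi for xi, yi in zip(X, Y))
    den = sum(epan_kernel(x - xi, h) for xi in X)
    return num / den if den > 0 else 0.0

start = timeit.default_timer()
m_direct = [nw_direct(x, X, Y, h) for x in grid]
stop = timeit.default_timer()
print('Direct time: ', stop - start)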

References

[1] J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications, London: Chapman & Hall, 1996.
[2] M. P. Wand, "Fast computation of multivariate kernel estimators," Journal of Computational and Graphical Statistics, vol. 3, no. 4, pp. 433–445, 1994.
