Kernel Density Estimation

KDE Image

Kernel Density Estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.

Kernel density estimates are closely related to histograms, but they are also smooth and continuous.
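To make the idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate written from scratch. The function name `gaussian_kde`, the synthetic sample, and the bandwidth of 0.7 are illustrative assumptions, not part of any library used later in this post.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at the points in `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled distance between every grid point and every sample,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at each scaled distance
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and divide by the bandwidth so the result integrates to 1
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
samples = rng.normal(loc=8.0, scale=2.0, size=200)  # synthetic data
grid = np.linspace(-5.0, 21.0, 500)
density = gaussian_kde(samples, grid, bandwidth=0.7)

# Riemann-sum check that the estimate behaves like a probability density
area = density.sum() * (grid[1] - grid[0])
```

Each sample contributes one small bell curve centred on itself, and the estimate is the average of those curves. That is why the result is smooth, whereas a histogram is a step function.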

Good explanations for Kernel Density Estimates can be found here:

The effect of bandwidth on Kernel Density Estimates

The kernel bandwidth plays the same role as the histogram's bin width: it controls how much the estimate is smoothed. Picking a good bandwidth matters because a value that is too small gives a noisy, spiky estimate, while a value that is too large smooths away real structure. Here is example code to illustrate the effect of the bandwidth:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from sklearn.neighbors import KernelDensity

# Evaluation grid: 50 points covering the hours of the day
hours = np.linspace(0, 23, 50)
# Synthetic bimodal sample: 100 draws around 8h and 100 draws around 18h
freq = np.concatenate([norm(8, 2.).rvs(100), norm(18, 1.).rvs(100)])

# Plot the kernel density estimates
fig, ax = plt.subplots(1, 5, sharey=True, figsize=(18, 3))

for i, bw in enumerate([0.2, 0.5, 1.0, 2.0, 5.0]):
    # sklearn expects 2-D input of shape (n_samples, n_features)
    kde_skl = KernelDensity(bandwidth=bw)
    kde_skl.fit(freq[:, np.newaxis])
    # score_samples returns the log-density, so exponentiate it
    density = np.exp(kde_skl.score_samples(hours[:, np.newaxis]))
    ax[i].plot(hours, density, color='red', alpha=0.5, lw=3)
    ax[i].set_title('sklearn, bw={0}'.format(bw))

  • bandwidths 0.2, 0.5 – undersmoothed, too much detail
  • bandwidths 1.0, 2.0, 5.0 – oversmoothed, too little detail

KDE Image

Use scikit-learn grid search with cross-validation to find the optimal bandwidth for KDE

from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20) # 20-fold cross-validation
grid.fit(freq[:, None])
print(grid.best_params_)


{'bandwidth': 0.72068965517241379}
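For context on what the grid search optimizes: when no scoring argument is given, GridSearchCV uses the estimator's own score method, which for KernelDensity is the total log-likelihood of the held-out data. The NumPy-only sketch below mimics that procedure under simplifying assumptions (Gaussian kernel, 5 folds, synthetic data). The helpers `log_likelihood` and `cv_score` are hypothetical names for illustration, and this is not scikit-learn's implementation.

```python
import numpy as np

def log_likelihood(train, test, bandwidth):
    """Total log-likelihood of `test` under a Gaussian KDE built from `train`."""
    z = (test[:, None] - train[None, :]) / bandwidth
    # Log of the standard normal kernel for every (test, train) pair
    log_k = -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)
    # Log-sum-exp over the training points, written to avoid underflow
    m = log_k.max(axis=1, keepdims=True)
    log_sum = m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1))
    log_density = log_sum - np.log(len(train) * bandwidth)
    return log_density.sum()

def cv_score(data, bandwidth, n_folds=5):
    """Mean held-out log-likelihood over `n_folds` contiguous folds."""
    folds = np.array_split(data, n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        scores.append(log_likelihood(train, test, bandwidth))
    return np.mean(scores)

rng = np.random.default_rng(0)
# Synthetic bimodal sample, shuffled so every fold sees both modes
data = np.concatenate([rng.normal(8, 2.0, 100), rng.normal(18, 1.0, 100)])
rng.shuffle(data)

bandwidths = np.linspace(0.1, 2.0, 20)
scores = [cv_score(data, bw) for bw in bandwidths]
best_bw = bandwidths[int(np.argmax(scores))]
print(best_bw)
```

A bandwidth that is too small tends to score poorly because held-out points fall between the narrow training kernels. One that is too large spreads probability mass over regions with no data.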

Generate Histogram and KDE with optimal bandwidth

kde = grid.best_estimator_
pdf = np.exp(kde.score_samples(hours[:, None]))

fig, ax = plt.subplots()
ax.plot(hours, pdf, linewidth=3, alpha=0.5, label='bw=%.2f' % kde.bandwidth)
ax.hist(freq, 30, fc='gray', histtype='stepfilled', alpha=0.3, density=True)
ax.legend(loc='upper left')
ax.set_title('optimum bandwidth')


Applying KDE to MRT station Twitter data

I am using the MRT Station Tweet data again. This time I plot a histogram of tweets for each station, binning the data in 15-minute intervals (96 bins per day). Then I calculate the optimal bandwidth using a grid search with cross-validation.
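The binning step can be sketched as follows. The tweet times here are synthetic stand-ins expressed as hours since midnight, not the real MRT data, and the normalization shown is only one plausible choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic tweet times in hours since midnight (assumption, not real data)
tweet_hours = np.concatenate([rng.normal(8, 2.0, 300),
                              rng.normal(18, 1.0, 300)]) % 24

# 96 bins of 15 minutes each need 97 edges between 0 and 24 hours
edges = np.linspace(0.0, 24.0, 97)
counts, _ = np.histogram(tweet_hours, bins=edges)

# Normalize so the bar areas sum to 1, making the histogram comparable to a density
bin_width = edges[1] - edges[0]  # 0.25 hours
normalized = counts / (counts.sum() * bin_width)
```

With the counts normalized this way, the histogram and a kernel density estimate of the same data can share one y-axis.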

Python file:

Number of tweets for each station, plotted as a histogram with 96 bins (one bin for every 15 minutes) together with a kernel density estimate at the optimal bandwidth. The data has been normalized.