Data Visualization and Data Reduction

2021-08-30 Visualization Data Analysis

Introduction

In this notebook I will be trying out some dimensionality reduction techniques which can be used for modelling and/or data visualization. Dimensionality reduction occupies a central position in many fields. In essence, the goal is to change the representation of a data set, originally expressed with a large number of variables, into a low-dimensional description. The main difference between data reduction and data visualization is that some data visualization techniques cannot be applied to unseen data, so they are not useful in the modelling part. Data visualization is still an important aspect of every modelling task, as discovering patterns at an early stage helps guide the next steps of the analysis. If the categories are well separated by the visualization method, a machine learning model is likely to be able to map an unseen data point to its class. Given the right prediction algorithm, we can then expect to achieve high accuracy.

The methods can be either linear or non-linear. Non-linear methods are often used since data points seldom live on linear manifolds. However, non-linear methods can be very sensitive to their hyper-parameters. We also want the methods to preserve local structure, because in many applications distances between points that are far apart are meaningless and therefore need not be preserved. We should keep in mind that since we often have many knobs to tune, we can easily fall into the trap of over-tuning until we see what we wanted to see. This is especially true for the non-linear methods.

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import KernelPCA, PCA

from sklearn.manifold import MDS, Isomap

from sklearn.metrics.pairwise import euclidean_distances

from scipy.spatial import distance_matrix

import seaborn as sns

from pandas.api.types import is_numeric_dtype

import matplotlib.patches as mpatches

from scipy.linalg import eigh, svd

Ghoul data

We will work with a very basic data set just to capture the main ideas. In real life, however, a lot of data tweaking and munging needs to be done beforehand, which will probably be more time consuming than the actual modelling part. We are given data on creatures and the goal is to classify the creature type, which can be Ghost, Ghoul, or Goblin (https://www.kaggle.com/c/ghouls-goblins-and-ghosts-boo/data). We have 4 numerical features: bone_length, rotting_flesh, hair_length, has_soul (percentage of soul in the creature), and 1 non-ordinal categorical variable: color (dominant color of the creature):

creatures_train = pd.read_csv("data/creatures_train.csv")

creatures_train.head()

sns.pairplot(creatures_train, hue="type", height=2)



creatures_train_centered = creatures_train.copy()



for col in creatures_train_centered.columns:

    if is_numeric_dtype(creatures_train_centered[col]):

        creatures_train_centered[col] = \

            (creatures_train[col] - creatures_train[col].mean())/np.std(creatures_train[col])



creatures_train_centered.head()



data_numbers = creatures_train_centered[['bone_length', 'rotting_flesh', 

                                         'hair_length', 'has_soul']]



numerical_variables = ['bone_length', 'rotting_flesh', 

                       'hair_length', 'has_soul']

categorical_variables = ['color']



# map each creature to a color

color_map = np.array(['blue'] * creatures_train.shape[0],dtype=object )

color_map[creatures_train['type'] == 'Goblin'] = 'orange'

color_map[creatures_train['type'] == 'Ghost'] = 'green'

[figure: pairplot of the numerical features, colored by creature type]

def plot_embedding(embedding, color_map, alpha = 1):

    plt.scatter(embedding[:, 0],embedding[:, 1], c=color_map, s=40, cmap='viridis', alpha = alpha)

    pop_a = mpatches.Patch(color='blue', label='Ghoul')

    pop_b = mpatches.Patch(color='orange', label='Goblin')

    pop_c = mpatches.Patch(color='green', label='Ghost')

    plt.legend(handles=[pop_a,pop_b,pop_c])



 

In the following we let \(X\) be an \(n \times p\) matrix, where \(n\) is the number of data points and \(p\) is the number of features.

PCA

Principal Component Analysis, or PCA, is probably the most widely used embedding. PCA transforms the data into a new coordinate system with an orthogonal linear transformation such that the first coordinate has the greatest variance, the second coordinate has the second greatest variance, and so on. The main advantage is that it is quite simple and fast, but the disadvantage is that it can only capture linear structures, so non-linear information will be lost. PCA can both be used for visualization and modelling.

When performing PCA it is important to normalize the data, since PCA looks for directions of maximal variance. If one feature has a much higher variance than the others simply because of its scale, it will dominate and skew the decomposition.
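As a quick illustration (a minimal sketch; multiplying bone_length by 100 is just an arbitrary way to fake a feature on a much larger scale), we can look at the explained variance ratios with and without the inflated feature:

# Hypothetical check: inflate one feature's scale and compare the explained variance ratios
raw = creatures_train[numerical_variables].copy()
inflated = raw.copy()
inflated['bone_length'] = inflated['bone_length'] * 100  # artificially large scale

print(PCA(n_components=4).fit(raw).explained_variance_ratio_)
print(PCA(n_components=4).fit(inflated).explained_variance_ratio_)  # first component dominated by bone_length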

Let \(h_w(x)\) denote the orthogonal projection onto a direction \(w \in R^p\). The empirical variance of the projection is:

\[ var(h_w) = \frac{1}{n} \sum_{i} h_w(x_i)^2 = \frac{1}{n} \sum_{i} \frac{(x_i^Tw)^2}{||w||^2} = \frac{1}{n} \frac{w^T X^T X w}{w^Tw} \]

where the last term is the Rayleigh quotient. We want to find the i-th principal direction \(w_i\) such that:

\[ w_i = \underset{w \perp {w_1, \dots w_{i-1}}}{\operatorname{arg max}} var(h_w) \quad s.t. \quad ||w|| = 1 \]

The Lagrangian form is given by:

\[ L(w,\lambda) = \frac{1}{n} w^T X^T X w - \lambda (w^Tw - 1) \]

Taking the derivative with respect to \(w\) and setting it to zero yields (absorbing the factor \(1/n\) into \(\lambda\))

\[ X^T X w_i =\lambda w_i \]

This is a well studied eigenvalue problem.

Once we have found the directions \(w_i\), we project our data onto them as \(\tilde{X} = XW\), where \(W\) is a matrix having the \(w_i\) as columns. Each column in \(\tilde{X}\) is called a principal component, the first column being principal component 1 or PC1, the next column principal component 2 or PC2, and so on. The principal components are uncorrelated: \(Cov(w_i^T x, w_j^T x) = w_i^TCw_j = \lambda_j w_i^Tw_j = 0\) (here \(x\) is a \(p \times 1\) vector representing a data point). The data has already been normalized in the code above.

The procedure is fairly straightforward to code. We will only use 2 coordinates here, but the optimal number of principal directions could be found with a scree plot (see the sketch after the code below):

C = (1.0/data_numbers.shape[0])*data_numbers.T.dot(data_numbers)



lambda_, alpha = eigh(C)

print(lambda_)



# Do orthogonal projection

coef = np.fliplr(alpha[:,-2:])  # we have to flip the eigenvector matrix as we want the vector corresponding to the highest eigenvalue

X_projected = data_numbers.dot(coef)
[0.51569022 0.63469455 0.97799092 1.87162431]
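As an aside, the scree plot mentioned above is just the eigenvalues sorted in decreasing order plotted against the component index; a minimal sketch using the eigenvalues we just computed:

# Scree plot: eigenvalues in decreasing order
eigvals_sorted = np.flipud(lambda_)
plt.plot(range(1, len(eigvals_sorted) + 1), eigvals_sorted, 'o-')
plt.xlabel("principal component")
plt.ylabel("eigenvalue")
plt.title("Scree plot")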

We can check that the projected coordinates are uncorrelated.

np.corrcoef(X_projected.T)
array([[1.00000000e+00, 7.69730711e-17],
       [7.69730711e-17, 1.00000000e+00]])

Indeed, the off-diagonal correlation is numerically zero. Very cool. Let's plot. Instead of a simple scatter plot, we create a so-called biplot, which allows us to see the effect of each variable on each group.

def myBiplot(score,coeff,labels, color, alpha = 1):

    """

    Function that plots a PCA biplot

    """



    if not isinstance(score, np.ndarray):

        score = np.array(score)



    xs = score[:,0]

    ys = score[:,1]

    n = coeff.shape[0]



    plt.figure(figsize=(8,5))

    plt.subplot(1, 1,  1)



    

    scalex = 1.0/(xs.max() - xs.min())

    scaley = 1.0/(ys.max() - ys.min())

    points = plt.scatter(xs * scalex, ys * scaley, c = color, alpha = alpha)



    for i in range(n):

        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r', alpha = 0.7)

        plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'black', ha = 'center', va = 'center')



    pop_a = mpatches.Patch(color='blue', label='Ghoul')

    pop_b = mpatches.Patch(color='orange', label='Goblin')

    pop_c = mpatches.Patch(color='green', label='Ghost')



    plt.legend(handles=[pop_a,pop_b, pop_c])



    plt.xlim([-1,1])

    plt.ylim([-1,1])

    plt.xlabel("PC1")

    plt.ylabel("PC2")

    plt.grid()
myBiplot(X_projected, coef, labels = list(data_numbers.columns), color = color_map, alpha = 0.5)

[figure: biplot of the first two principal components]

We get a nice visual separation, especially between Ghouls and Ghosts. We can see that Ghouls are more associated with high values of hair_length, has_soul and bone_length. Also, we see that the first principal component is highly correlated with hair_length, has_soul and bone_length, as the loadings are greater than 0.5. The second principal component is negatively correlated with rotting_flesh and bone_length.

How does our method compare to the sklearn implementation?

pca = PCA(n_components=2, svd_solver='full')

X_projected_pca_scipy = pca.fit_transform(data_numbers)

pca.singular_values_
array([26.350951  , 19.04821856])
myBiplot(X_projected_pca_scipy,np.transpose(pca.components_[0:2, :]), labels = numerical_variables, color = color_map, alpha = 0.5)

[figure: biplot using scikit-learn's PCA]

This is the same picture, just upside down. This is because the eigenvector signs are irrelevant: if \(u\) is a unit eigenvector then \(-u\) is a unit eigenvector as well.

Remark: sklearn uses an SVD of \(X\), which is another way of obtaining the PCA scores. That is, we can perform an SVD on \(X\) to obtain \(X = USV^T\). From this equation we can obtain the covariance matrix as:

\[C = \frac{X^TX}{n-1} = \frac{VSU^TUSV^T}{n-1} = \frac{VS^2V^T}{n-1}\]

so the eigenvalues of \(C\) can be obtained as \(\lambda_i = s_i^2/(n-1)\), where \(s_i\) is the corresponding singular value of \(X\). The principal components are then obtained as \(XV = USV^TV = US\).
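As a quick sanity check (a minimal sketch reusing the objects above; note that our covariance matrix was scaled by \(1/n\) rather than \(1/(n-1)\), so we compare against \(s_i^2/n\)):

# The top eigenvalues of C = (1/n) X^T X should match s_i^2 / n
n = data_numbers.shape[0]
print(np.flipud(lambda_)[:2])          # eigenvalues from our decomposition (largest first)
print(pca.singular_values_ ** 2 / n)   # squared singular values from sklearn, rescaled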

KPCA

Kernel PCA (KPCA) is another dimensionality reduction technique, closely related to PCA. Namely, KPCA uses a kernel to map the features into a reproducing kernel Hilbert space, which is possibly infinite-dimensional. That way we can get a nonlinear dimensionality reduction. The main advantage is that we can capture non-linear data structures. Also, kernels can be defined between abstract objects such as time series, strings, and graphs, allowing more general structure. The disadvantage is that it is sensitive to the kernel choice and its hyper-parameters. Note that if a linear kernel is used, we simply get PCA back. KPCA can be used both for visualization and modelling.

Like in PCA, we work with normalized data.

Now we work with a feature map \(\phi: \mathcal{X} \mapsto \mathcal{H}\) which maps our data living in the data space \(\mathcal{X}\) to functions living in a reproducing kernel Hilbert space \(\mathcal{H}\), and it is in the space \(\mathcal{H}\) where we find a lower dimensional representation of our data.

For kernel PCA we have something very similar to PCA:

\[var(h_f) = \frac{1}{n} \sum_i h_f(x_i)^2 = \frac{1}{n} \sum_i \frac{<\phi(x_i), f>_H^2}{||f||_H^2} = \frac{1}{n} \sum_i \frac{f(x_i)^2}{||f||_H^2} \]

and we want to solve

\[f_i = \underset{f \perp {f_1, \dots f_{i-1}}}{\operatorname{arg max}} var(h_f) \quad s.t. \quad ||f||_H = 1\]

Using the representer theorem, we can write the function as \(f_i = \sum_j \alpha_{ij} K(x_j, \cdot)\). The objective becomes

\[\alpha_i = \underset{\alpha}{\operatorname{arg max}} \alpha^T K^2 \alpha\]

such that \(\alpha_i^T K \alpha_j = 0, \quad j = 1, \dots, i-1\) (orthogonality) and \(\alpha_i^T K \alpha_i = 1\). We can turn this into an eigenvalue problem by a change of variables \(\beta = K^{1/2}\alpha\) where \(K^{1/2} = U\Sigma^{1/2}U^T\). The problem now becomes:

\[\beta_i = \underset{\beta}{\operatorname{arg max}} \beta^T K \beta\]

such that \(\beta_i^T \beta_j = 0, \quad j = 1, \dots, i-1\) (orthogonality) and \(\beta_i^T \beta_i = 1\). The solution is \(\beta_i = u_i\), where \(u_i\) is the eigenvector corresponding to the \(i\)-th largest eigenvalue, and hence \(\alpha_i = K^{-1/2}u_i = U\Sigma^{-1/2}U^Tu_i = \frac{1}{\sqrt{\lambda_i}}u_i\).

Note that if we take a linear kernel, that is, \(f_w(x) = w^Tx\) with the norm \(||f_w||_H = ||w||\) we have:

\[var(h_w) = \frac{1}{n} \sum_i \frac{f(x_i)^2}{||f||_H^2} = \frac{1}{n} \sum_i \frac{(x_i^Tw)^2}{||w||^2}\]

so a KPCA with a linear kernel is simply a PCA.
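We can convince ourselves of this numerically with a minimal sketch using the sklearn classes imported above (the comparison is on absolute values since the eigenvector signs are arbitrary):

# KPCA with a linear kernel should reproduce the PCA scores (up to sign)
lin_kpca = KernelPCA(n_components=2, kernel='linear').fit_transform(data_numbers)
plain_pca = PCA(n_components=2).fit_transform(data_numbers)
print(np.allclose(np.abs(lin_kpca), np.abs(plain_pca), atol=1e-6))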

We could choose from many kernels, but for simplicity we will use the radial basis function kernel (RBF kernel). First we write the functions we need. Note that we only use the 4 numerical features.

An important remark is that we have obtained orthogonal functions in \(\mathcal{H}\); if we would like to find a representation back in the data space \(\mathcal{X}\), we would have to invert the feature map. This is called the pre-image problem and it is generally hard to solve. Thus, what is instead done in practice to reduce the computational complexity is to simply use the \(\alpha\) values obtained for the visualization and/or the modelling part.



def fit( K, nr_eigen, center = True):

    """

    :param nr_eigen: Number of eigenvalues

    :param center: Should kernel matrix be centered?

    """



    if center:

        K = center_kernel_matrix(K)



    alphas_ = obtain_alphas(K,nr_eigen)

    nr_eigen = alphas_.shape[1]



    return K, alphas_





def center_kernel_matrix(K):

    """

    :param K: The kernel matrix we want to center

    :return: centered Kernel matrix

    """

    one_l_prime = np.ones(K.shape) / K.shape[1]

    one_l = np.ones(K.shape) / K.shape[1]

    K = K \

        - np.dot(one_l_prime, K) \

        - np.dot(K, one_l) \

        + one_l_prime.dot(K).dot(one_l)

    return K



def rbf_kernel(X, c):

    """

    :param c: parameter for the RBF Kernel function

    :return: Kernel matrix

    """

    # Compute squared euclidean distances between all samples, 

    # store values in a matrix

    sqdist_X = euclidean_distances(X, X, squared=True)

    K = np.exp(-sqdist_X / c)

    return K





def obtain_alphas(K, n):

    """

    :param n: number of components used 

    :return: returns the first n eigenvectors of the K matrix

    """



    lambda_, alpha = eigh(K, eigvals=(K.shape[0]-n,K.shape[0]-1))



    alpha_n = alpha / np.sqrt(lambda_)



    # Order eigenvalues and eigenvectors in descending order

    lambda_ = np.flipud(lambda_)

    alpha_n = np.fliplr(alpha_n)







    return alpha_n 
plt.figure(figsize=(20, 5))



for i, rbf_constant in enumerate([0.5, 1, 2, 4]):

    plt.subplot(1, 4, i + 1)

    

    K = rbf_kernel(data_numbers, rbf_constant)

    K, alphas = fit(K, 2)



    plot_embedding(alphas, color_map)

    plt.title(f'rbf constant = {rbf_constant}')

[figure: KPCA embeddings for rbf constants 0.5, 1, 2 and 4]

Again we get some distinction between the types. Let's try the sklearn KernelPCA.

transformer = KernelPCA(n_components = 2, kernel='rbf', fit_inverse_transform=True, gamma = 1/2)

X_kpca = transformer.fit_transform(data_numbers)

plt.figure(figsize=(20, 5))

plt.subplot(1, 2, 1)

plot_embedding(transformer.alphas_, color_map)

[figure: scikit-learn KernelPCA embedding]

This is exactly the same figure. We could in theory include categorical variables by using a kernel for categorical data, for example the Aitchison and Aitken kernel function. That is, we define a kernel \(k_{num}\) between the numerical features and a different kernel \(k_{cat}\) between the categorical features, and then perform KPCA on the product kernel \(k = k_{num}k_{cat}\), as sketched below.
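A minimal sketch of that idea, assuming one common parameterization of the Aitchison and Aitken kernel (the helper name and the smoothing value lam = 0.5 are just illustrative choices, and we reuse the rbf_kernel, fit and plot_embedding functions from above):

def aitchison_aitken_kernel(x, lam):
    """One common form: 1 - lam if two categories match, lam / (c - 1) otherwise."""
    x = np.asarray(x)
    c = len(np.unique(x))                       # number of categories
    same = x[:, None] == x[None, :]
    return np.where(same, 1.0 - lam, lam / (c - 1))

K_num = rbf_kernel(data_numbers, 2)             # kernel on the numerical features
K_cat = aitchison_aitken_kernel(creatures_train['color'], lam=0.5)

K_mixed, alphas_mixed = fit(K_num * K_cat, 2)   # KPCA on the product kernel
plot_embedding(alphas_mixed, color_map)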

Multidimensional Scaling

The classical MDS

Multidimensional scaling (MDS) is a data visualization technique which tries to preserve distances between samples globally, using only a table of distances between them. That is, we are given distances \(d_{ij}\) and we want to find embeddings \(y_i \in R^p\) for each data point such that

\[ \sum_{i <j} (||y_i - y_j|| - d_{ij})^2\]

is minimized. We can derive the classical algorithm as follows. Assume that \(x_i \in R^q\) and that we want an embedding of dimension \(p < q\):

We know that \(d_{ij}^2 =||x_i - x_j||^2 = x_i^Tx_i + x_j^Tx_j - 2x_i^Tx_j\). Let \(K = XX^T\) be the Gram matrix (linear kernel). We can rewrite the previous equation as \(d_{ij}^2 = k_{ii} + k_{jj} -2k_{ij}\). Assuming that the data is centered, \(\sum_{i = 1}^n x_{im} = 0\) for all \(m\) (any center location could be used, as if \(x\) is a solution then \(x + c\) is a solution as well), we obtain:

\[\sum_{i = 1}^n k_{ij} = \sum_{i = 1}^n \sum_{m = 1}^q x_{im} x_{jm} = \sum_{m = 1}^q x_{jm} \sum_{i = 1}^n x_{im} = 0\]

We also have

\[ \sum_{i = 1}^n d_{ij}^2 = trace(K) + nk_{jj}, \quad \sum_{j = 1}^n d_{ij}^2 = trace(K) + nk_{ii}, \quad \sum_{i = 1}^n \sum_{j = 1}^n d_{ij}^2 = 2n \, trace(K) \]

Rearranging the previous equations and plugging them into \(d_{ij}^2 = k_{ii} + k_{jj} -2k_{ij}\) gives

\[ k_{ij} = -\frac{1}{2} \Big(d_{ij}^2 - \frac{1}{n} \sum_{l = 1}^n d_{lj}^2 - \frac{1}{n} \sum_{l = 1}^n d_{il}^2 + \frac{1}{n^2} \sum_{l = 1}^n \sum_{m = 1}^n d_{lm}^2 \Big)\]

which can be written in a matrix form (try by writing it down for a 3x3 matrix):

\[K = -1/2 H D^{(2)} H\]

where \(H = I - \frac{1}{n} ee^T\) and \(e\) is a vector of ones. The solution is then given by the eigen-decomposition of \(K\), \(K = V\Lambda V^T\), and \(X = V \Lambda^{1/2}\). For dimensionality reduction, we simply pick the top \(p\) eigenvectors and eigenvalues and calculate the embedding as \(\tilde{X} = V_p\Lambda_p^{1/2}\). The disadvantage is that it does not support out-of-sample transformation (we would have to redo the embedding for each new sample). Note that we can use whichever distance matrix we like, so we are not restricted to the Euclidean distance.

The metric MDS

The metric MDS is a generalization of the classical MDS which allows for a variety of loss functions. The most widely used loss function is called stress; see the wiki for more information.

The Isomap

Isomap generalizes the metric MDS: the idea is to perform MDS not in the input space but on the geodesic distances of the nonlinear data manifold. The first step is to find the nearest neighbours of each data point in the high-dimensional data space. This generates a graph where each node is a data point, the edges connect nearest neighbours, and the weight of an edge is the input-space distance. The second step is to calculate the geodesic pairwise distances between all points using a shortest-path algorithm, for example Dijkstra's algorithm or Floyd's algorithm. Finally, the MDS embedding is computed from these geodesic distances. The disadvantage of Isomap is the potential for short-circuits, in which a single noisy data point provides a bridge between two regions of data space that should be far apart in the low-dimensional representation.

The disadvantage of these methods is that they are generally not used on unseen data, although it can be done.

def mds(D, p):

    """

    Calculate multidimensional scaling

    :param D: The distance matrix where each entry is squared

    :param p: Embedding dimension

    """

    # Double-center the squared distances: K = -1/2 * H D H



    n = D.shape[0]

    H = np.identity(n) - (1.0/n) * np.ones((n,n))



    K = -0.5 * np.dot(H,D).dot(H)



    lamda, V = eigh(K)



    # Order eigenvalues and eigenvectors in descending order

    lamda = np.flipud(lamda)

    V = np.fliplr(V)



    # classical MDS embedding: X = V_p Lambda_p^{1/2}
    embedding = np.dot(V[:, :p], np.diag(np.sqrt(lamda[:p])))



    return embedding
# using the  Minkowski p-norm 

plt.figure(figsize=(20, 5))



for i, p in enumerate([1, 2, 3]):

    plt.subplot(1, 3, i+1)

   

    D = distance_matrix(data_numbers, data_numbers, p=p)

    D_sq = np.power(D, 2)



    embedding = mds(D_sq, 2)

    

    plot_embedding(embedding, color_map)

    plt.title(f'Minkowski {p}-norm')

[figure: classical MDS embeddings for Minkowski 1-, 2- and 3-norms]

Note that \(p=2\) results in the same embedding as the PCA (up to sign). This happens because with the Euclidean (\(p=2\)) distance, the doubly centered matrix \(K\) is exactly the Gram matrix of a linear kernel.

Comparing this to sklearn's MDS.

embedding = MDS(n_components=2)

X_transformed = embedding.fit_transform(data_numbers)
plot_embedding(X_transformed, color_map)

[figure: scikit-learn metric MDS embedding]

This is not the same embedding, as sklearn uses the metric MDS, which minimizes a stress objective with the SMACOF (Scaling by MAjorizing a COmplicated Function) algorithm. A small comparison of the two is sketched below.
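As a small check, we can compare the raw stress (the sum of squared differences between the original and the embedding distances, which is what sklearn stores in MDS.stress_) of the two solutions; a minimal sketch, recomputing the classical embedding for the Euclidean distances:

def raw_stress(D, Y):
    """Sum of squared differences between original and embedding distances (each pair counted once)."""
    D_emb = euclidean_distances(Y)
    return np.sum((D - D_emb) ** 2) / 2

D_eucl = euclidean_distances(data_numbers)
classical_embedding = mds(np.power(D_eucl, 2), 2)

print("classical MDS stress:", raw_stress(D_eucl, classical_embedding))
print("SMACOF stress:       ", embedding.stress_)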

We can also look at the Isomap embeddings using sklearn.

embedding_isomap = Isomap(n_components=2, n_neighbors=40)

X_transformed_isomap = embedding_isomap.fit_transform(data_numbers)
plot_embedding(X_transformed_isomap, color_map)

[figure: Isomap embedding with 40 neighbours]
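For completeness, the three Isomap steps described above can be chained together by hand, reusing the mds function from before; a minimal sketch (assuming the 40-neighbour graph is connected, and accepting that the result may differ slightly from sklearn's implementation details):

from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Step 1: neighbourhood graph, edge weights are input-space distances
knn_graph = kneighbors_graph(data_numbers, n_neighbors=40, mode='distance')

# Step 2: geodesic distances via shortest paths (Dijkstra)
D_geo = shortest_path(knn_graph, method='D', directed=False)

# Step 3: classical MDS on the squared geodesic distances
embedding_isomap_scratch = mds(np.power(D_geo, 2), 2)
plot_embedding(embedding_isomap_scratch, color_map)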

Correspondence Analysis

Correspondence Analysis (CA) is a method that helps us find a connection between correlation and contingency. It is best suited for categorical data, but numerical data can be made categorical with binning strategies. The wiki page on CA does a good job explaining the methodology.

We start by calculating the contingency table \(C\).

We compute the row and column weights, \(w_m = \frac{1}{n_c} C 1\) and \(w_n = \frac{1}{n_c} 1^T C\), where \(n_c = \sum_i \sum_j C_{ij}\) is the total count.

The weights are transformed into matrices, \(W_m = diag(1/\sqrt{w_m})\) and \(W_n = diag(1/\sqrt{w_n})\). With \(P = C/n_c\) denoting the correspondence matrix, the standardized residuals are

\[ S = W_m(P-w_m w_n)W_n\]

To calculate the coordinates for CA we perform an SVD of the standardized residuals, \(S = U \Sigma V^T\). The left singular vectors correspond to the categories in the rows of the table, and the right singular vectors correspond to the columns. The squared singular values give us the variance (inertia) of each dimension. The principal coordinates of the rows are \(F_m = W_m U \Sigma\) and the principal coordinates of the columns are \(F_n = W_n V \Sigma\).

We can also plot the data in standard coordinates, which are defined as \(G_m = W_mU\) and \(G_n = W_n V\).

Note that when we plot the data in the same space using the principal coordinates (\(F\)), we can only interpret the distance between row points or the distance between column points (intra distance). We cannot interpret the distance between row and column points (inter distance). To interpret the distance between column points and row points, we plot the decomposition in an asymmetric fashion: the row (or column) points are plotted in standard coordinates and the column (or row) profiles are plotted in principal coordinates (\(G\) and \(F\)). If we plot everything using the standard coordinates (\(G\)), the intra distances will be exaggerated.

def CA(data,row, col):

    """

    :param row: name of row variable

    :param col: name of col variable

    """

    ""

    C = pd.crosstab(data[row],data[col], margins = False)

    print(C)



    col_var = list(C.columns)

    row_var = list(C.index)



    n_c = np.sum(np.sum(C))  # sum of all cell values in C



    w_m = np.array(C).dot(np.ones((C.shape[1], 1))) / n_c  # row weights (row masses)



    w_n = np.ones((1, C.shape[0])).dot(np.array(C)) / n_c  # column weights (column masses)



    W_m = np.diag(1.0/np.sqrt(w_m[:,0]))

    W_n = np.diag(1.0/np.sqrt(w_n[0,:]))



    P = (1/n_c) * C



    # standarized residuals

    S = W_m.dot(P - np.dot(w_m, w_n)).dot(W_n)





    U, Sigma, Vt = svd(S, full_matrices=False)



    F_m = W_m.dot(U).dot(np.diag(Sigma))

    F_n = W_n.dot(Vt.T).dot(np.diag(Sigma))



    G_m = W_m.dot(U)

    G_n = W_n.dot(Vt.T)



    plt.scatter(G_m[:,0], G_m[:,1], c = 'blue')

    for i, name in enumerate(row_var):

        plt.text(G_m[i,0], G_m[i,1], name)



    # plot the column (color) categories using principal coordinates

    plt.scatter(F_n[:,0], F_n[:,1], c = 'red')

    for i, name in enumerate(col_var):

        plt.text(F_n[i,0], F_n[i,1], name)





    return F_m, F_n, G_m, G_n

We plot the first two dimensions of each type and each color.

F_m, F_n, G_m, G_n = CA(creatures_train,'type', 'color')
color   black  blood  blue  clear  green  white
type                                           
Ghost      14      6     6     32     15     44
Ghoul      14      4     6     42     13     50
Goblin     13      2     7     46     14     43

[figure: CA plot of creature types and colors]

This is pretty anti-climactic, but we could say that blood corresponds mostly with Ghost. I don't think these results are so surprising, as the color distributions of the types are very similar if we look at the contingency table. If we had more than one categorical variable we could use Multiple Correspondence Analysis (MCA), which is pretty similar to CA and straightforward to implement, as sketched below.
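A minimal sketch of the indicator-matrix route to MCA (the helper name mca_indicator is made up, and applying it to type and color here is purely illustrative): one-hot encode the categorical variables and run the same SVD recipe as in the CA function on the resulting indicator matrix.

def mca_indicator(data, cat_cols):
    """MCA as CA applied to the one-hot indicator matrix (illustrative sketch)."""
    dummies = pd.get_dummies(data[cat_cols])
    Z = dummies.to_numpy(dtype=float)
    P = Z / Z.sum()                                   # correspondence matrix
    w_m = P.sum(axis=1, keepdims=True)                # row masses
    w_n = P.sum(axis=0, keepdims=True)                # column masses
    W_m = np.diag(1.0 / np.sqrt(w_m[:, 0]))
    W_n = np.diag(1.0 / np.sqrt(w_n[0, :]))
    S = W_m.dot(P - w_m.dot(w_n)).dot(W_n)            # standardized residuals
    U, Sigma, Vt = svd(S, full_matrices=False)
    F_m = W_m.dot(U).dot(np.diag(Sigma))              # row principal coordinates
    F_n = W_n.dot(Vt.T).dot(np.diag(Sigma))           # column principal coordinates
    return F_m, F_n, list(dummies.columns)

F_rows, F_cols, cat_names = mca_indicator(creatures_train, ['type', 'color'])
plt.scatter(F_cols[:, 0], F_cols[:, 1], c='red')
for i, name in enumerate(cat_names):
    plt.text(F_cols[i, 0], F_cols[i, 1], name)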

Diffusion map

Diffusion map (see also An Introduction to Diffusion Maps) is a data visualization and reduction technique. Similar to Isomap, it relies on a graph-based algorithm, interpreting a weighted graph as a notion of geometry.

The diffusion map is generally not used on unseen data, but it can be done.

The connectivity between two data points, \(x\) and \(y\) is defined as the probability of jumping from \(x\) to \(y\) in one step of a random walk on a graph. The connectivity is expressed in terms of a kernel \(k(x,y)\)

The Gaussian kernel (rbf kernel)

\[k(x,y) = \exp(-\frac{||x-y||^2}{\alpha})\]

is popular as it gives a prior to the local geometry of \(X\) using the euclidean norm. By tweaking the parameter \(\alpha\) we are essentially defining the size of the neighbourhood area.

Then \(p(x,y)\) is defined as

\[p(x,y) = \frac{1}{d_X}k(x,y)\]

where \(d_X = \sum_{y}k(x,y)\) is a normalizing constant. Note that \(p(x,y)\) is no longer symmetric, but it defines a Markov chain, and closeness between two points is now measured by how probable it is to transit from \(x\) to \(y\) in \(t\) steps. Using this idea, we define the diffusion distance between two points via their \(t\)-step transition probabilities:

\[ D_t(X_i, X_j) = \sum_u | p_t(X_i, u) - p_t(X_j, u)| ^2\]

where \(t\) denotes the number of steps. As the walk goes on it reveals the geometric structure of the data and the main contributors to the diffusion distance are paths along that structure.

Finally, to visualize the data manifold, we want to find points \(Y_i\) and \(Y_j\) in an \(r\)-dimensional Euclidean space which preserve the diffusion distance between \(X_i\) and \(X_j\) on the manifold, that is,

\[ || Y_i - Y_j ||_2 = D_t(X_i, X_j) \]

To find the data points \(Y_i\), we simply pick the top \(r\) eigenvalues and right eigenvectors of \(P\) and set \(Y_i = [\lambda_1^t v_1[i], \dots, \lambda_r^t v_r[i]]\), where \(v_k[i]\) is the \(i\)-th component of the \(k\)-th right eigenvector. A remark is that in practice we work with \(P' = D^{-1/2} K D^{-1/2}\), which is symmetric and has the same eigenvalues as \(P\); see An Introduction to Diffusion Maps for a better explanation.

The code implementation is pretty straightforward; it yet again boils down to an eigenvalue decomposition:

def DiffusionMap(X, alpha, t):

    """



    :param X: data matrix

    :param alpha: Gaussian kernel parameter

    :param t: number of diffusion steps (the eigenvalue scaling lambda**t is not applied below)

    """





    sqdist_X = euclidean_distances(X, X, squared=True)

    K = np.exp(-sqdist_X / alpha)







    d_x = np.sum(K,axis= 0)

    D_inv = np.diag(1/d_x)



    P = np.dot(D_inv, K)  # transition matrix of the random walk



    D_sq_inv = np.diag((d_x) ** -0.5)





    # P' = D^{-1/2} K D^{-1/2} is symmetric and has the same eigenvalues as P
    P_prime = np.dot(D_sq_inv, np.dot(K, D_sq_inv))



    lamda, U = eigh(P_prime)



    idx = lamda.argsort()[::-1]

    lamda = lamda[idx]

    U = U[:,idx]

    # print(lamda)



    # right eigenvectors of P; the lambda**t scaling from the derivation is omitted here
    coord = np.dot(D_sq_inv, U)



    return coord, lamda, U
embedding_diffusion, lamda, U = DiffusionMap(data_numbers, 1.5, 1)

embedding_diffusion.shape
(371, 371)
plot_embedding(embedding_diffusion[:, 1:], color_map, alpha = 0.7)

[figure: diffusion map embedding]

TSNE

t-SNE is another data visualization technique. It maps the data points in such a way that similar points are grouped together with high probability, and distant points stay distant with high probability. Therefore, t-SNE aims to preserve local structure. The main idea is that the similarity between a point \(x_i\) and all other points \(x_j\) is measured with a conditional Gaussian, \(p_{j|i}\), that is, the probability that \(x_i\) would pick \(x_j\) as a neighbour:

\[p_{j|i} = \frac{\exp (-||x_i - x_j||^2/2\sigma_i^2)}{\sum_{k \neq i}\exp (-||x_i - x_k||^2/2\sigma_i^2)}\]

and we set \(p_{i|i}\) to zero. We will explain the \(\sigma_i\) parameter later on. Then, to introduce symmetry, we work with the probability

\[ p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \]

where \(n\) is the number of data points. This definition ensures that each data point contributes to the cost function, since \(\sum_j p_{ij} > 1/(2n)\).

The similarity in the low dimensional map is defined using a Cauchy distribution.

\[q_{ij} = \frac{(1+ ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1+ ||y_k - y_l||^2)^{-1}}\]

To measure the faithfulness with which \(q_{ij}\) models \(p_{ij}\), the Kullback-Leibler divergence is used.

\[C = \sum_i KL(P_i || Q_i) = \sum_i \sum_j p_{ji} \log \frac{p_{ji}}{q_{ji}}\]

We are interested in finding the best \(y_i\) to represent \(x_i\) in low-dimension. Therefore, we take the gradient with respect to \(y_i\) and we get (see the derivation in the paper)

\[ \frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}\]

and finally we use gradient descent, with a momentum term to speed up the optimization and to avoid poor local minima:

\[ y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)(y^{(t-1)} - y^{(t-2)}) \]

where \(\eta\) is the learning rate and \(\alpha\) is the momentum at iteration \(t\).

The low dimensional embedding is initialized with a normal distribution.

It is important to note that the cost function is non-convex, so several optimization parameters need to be chosen (including the random initial low-dimensional embedding) and the solution depends on them. Therefore, each time the algorithm is run, the outcome won't be the same. That being said, a local optimum of a cost function that accurately captures a good representation is often preferable to the global optimum of a cost function that fails to capture important aspects.

What about \(\sigma_i\)? It is not likely that there is a single value of \(\sigma_i\) that is optimal for all data points, because the density of the data is likely to vary. In dense regions a smaller value of \(\sigma_i\) is usually more appropriate than in sparser regions. \(\sigma_i\) is found by introducing a parameter called perplexity:

\[ perp(p_i) = 2^{H(p_i)}\]

where

\[ H(p_i) = - \sum_{j} p_{j|i} \log_2 p_{j|i} \]

We can find \(\sigma_i\) with a binary search.

The perplexity can be interpreted as a smooth measure of the effective number of neighbors. The performance of t-SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.

For large data sets it would be costly to store the probabilities if each data point considered all other data points as neighbours. To make the computations feasible we can start by choosing a desired number of neighbours and creating a neighbourhood graph for all of the data points. Although this is computationally intensive, it is only done once. This is very similar to Isomap and the diffusion map. Finally, \(p_{j|i}\) is defined as the fraction of random walks starting at landmark point \(x_i\) that terminate at landmark point \(x_j\).

def sigma_binary_search(perplexity, sqdist, n_steps = 100, tolerance = 1e-5) -> np.array:

    """

    Perform a binary search to find the sigmas, parameterized by the precision var = 1/(2*sigma^2). Note, we return the conditional probability matrix P.



    :param perplexity: perplexity

    :param sqdist: array (n_samples, n_neighbours), Square distances of data points

    :param n_steps: Number of binary search steps

    :param tolerance: Tolerance for the entropy difference

    """



    n_samples = sqdist.shape[0]

    n_neighbours = sqdist.shape[1]



    desired_entropy = np.log(perplexity)



    P = np.zeros((n_samples, n_neighbours), dtype=np.float64)



    # Perform binary search for each data point

    for i in range(n_samples):

        # var = 1/(2sigma^2)

        var_min = -np.Inf

        var_max = np.Inf

        var = 1.0



        # start binary search

        for k in range(n_steps):

            

            sum_p_i = 0.0  # the sum of p_{j|i}



            for j in range(n_neighbours):

                if j != i:

                    P[i, j] = np.exp(-sqdist[i, j] * var)

                    sum_p_i += P[i, j]



            

            sum_disti_Pi = 0.0

            for j in range(n_neighbours):

                P[i, j] /= sum_p_i

                sum_disti_Pi += sqdist[i, j] * P[i, j]





            entropy = np.log(sum_p_i)*1.0 + var * sum_disti_Pi

            entropy_diff = entropy - desired_entropy



            if np.abs(entropy_diff) <= tolerance:

                break



            # perform binary search

            if entropy_diff > 0.0:

                var_min = var

                if var_max == np.Inf:

                    var *= 2.0

                else:

                    var = (var + var_max) / 2.0

            else:

                var_max = var

                if var_min == -np.Inf:

                    var /= 2.0

                else:

                    var = (var + var_min) / 2.0



    return P
def tsne(X, perplexity, nr_steps, eta = 100, n_steps = 100, tolerance = 1e-5):

    """



    Some of the code is from the author of t-sne https://lvdmaaten.github.io/tsne/

    

    :param X: data matrix (n_samples, n_features)

    :param perplexity: perplexity

    :param nr_steps: Number of gradient descent steps

    :param eta: Learning rate

    :param n_steps: Number of binary search steps (passed on to sigma_binary_search)

    :param tolerance: Tolerance for the binary search

    """



    n = X.shape[0]



    sqdist_X = euclidean_distances(X, X, squared=True)

    P = sigma_binary_search(perplexity, sqdist_X, n_steps, tolerance)



    P = (P + P.T)/(2.0 * n)

    P = np.maximum(P, 1e-12)



    # initalize Y

    Y = np.random.randn(n, 2)

    gradient = np.zeros((n, 2))  

    iY_1 = np.zeros((n, 2))





    # start gradient descent

    for t in range(nr_steps):



        Q = np.zeros((n,n))



        sqdist_Y = euclidean_distances(Y, Y, squared=True)

        for i in range(n):

            for j in range(n):

                Q[i,j] = 1./(1. + sqdist_Y[i,j])



        Q[range(n), range(n)] = 0  # set diagonal to 0

        num = Q.copy()

        Q = Q / np.sum(Q)

        Q = np.maximum(Q, 1e-12)





        # Compute gradient

        PQ = P - Q

        for i in range(n):

            tmp = np.tile(PQ[:, i] * num[:, i], (2, 1))  # (p_ij - q_ij)(1 + ||y_i - y_j||^2)^-1 part

            gradient[i, :] = np.sum(tmp.T * (Y[i, :] - Y), 0)



        # Perform the update

        if t < 250:

            momentum = 0.5

        else:

            momentum = 0.8



        iY_2 = iY_1.copy()

        iY_1 = Y.copy()

        Y = Y - eta*gradient + momentum * (iY_1 - iY_2)





        # if (t + 1) % 10 == 0:

        #     C = np.sum(P * np.log(P / Q))

        #     print("Iteration %d: error is %f" % (t + 1, C))



    return Y
Y = tsne(data_numbers, 30, 1000, eta = 200)
plot_embedding(Y, color_map)

[figure: t-SNE embedding from the implementation above]

The code seems to be working, but it is quite slow. If one wants to use t-SNE for data visualization, there is an implementation in the sklearn library which is significantly faster.

from sklearn.manifold import TSNE

embedding_tsne = TSNE(n_components=2, learning_rate=200, perplexity=30, early_exaggeration=12).fit_transform(data_numbers)
plot_embedding(embedding_tsne, color_map)

[figure: scikit-learn t-SNE embedding]
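Since the result is fairly robust to the perplexity (typical values between 5 and 50, as mentioned above), a quick sweep over a few perplexities with the sklearn implementation is an easy sanity check; a minimal sketch:

plt.figure(figsize=(20, 5))

for i, perp in enumerate([5, 15, 30, 50]):
    plt.subplot(1, 4, i + 1)
    emb = TSNE(n_components=2, perplexity=perp, learning_rate=200).fit_transform(data_numbers)
    plot_embedding(emb, color_map)
    plt.title(f'perplexity = {perp}')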