Sammon Mapping - Data Visualisation

Cluster analysis, or clustering, is a form of unsupervised learning in which the objective is to group the feature vectors of length d that represent the data according to their relative distances in d-dimensional space.

When d = 2 it is easy to visualise the clustering, but most real-world problems involve more than two features, so clustering has to take place in a high-dimensional space. It is often useful, before embarking on a comprehensive cluster analysis, to attempt to visualise the data in a lower-dimensional space, especially if the distances between points in the lower-dimensional space correspond to the dissimilarities between points in the original space.

Sammon's mapping [1] seeks to find a configuration of points in 2D whose interpoint distances correspond to dissimilarities in the high-dimensional space and is an example of multi-dimensional scaling.

In general, no configuration exists for which dij = δij for every pair of points, so a sum-of-squared-error criterion is minimised instead, where δij is the Euclidean distance between points i and j in the original d-dimensional space and dij is the distance between the corresponding image points.
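
As an illustration, the criterion given in Sammon's paper [1], usually called Sammon's stress, is E = (1 / sum of δij) * sum of (δij - dij)^2 / δij, with both sums taken over all pairs i < j. The following is a minimal sketch of that calculation in plain Python. The function names euclidean and sammon_stress are hypothetical, chosen here for illustration, and the sketch assumes that no two original points coincide, so that every δij is greater than zero.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sammon_stress(original, image):
    """Sammon's stress between original points and their image points.

    Assumption: no two original points coincide, so every delta is > 0.
    """
    total_delta = 0.0      # normalising constant: sum of delta_ij over i < j
    weighted_error = 0.0   # sum of (delta_ij - d_ij)^2 / delta_ij over i < j
    for i, j in combinations(range(len(original)), 2):
        delta = euclidean(original[i], original[j])  # distance in the original space
        d = euclidean(image[i], image[j])            # distance between image points
        total_delta += delta
        weighted_error += (delta - d) ** 2 / delta
    return weighted_error / total_delta
```

Dividing each squared error by δij gives small original distances more weight, so local structure tends to be preserved. Points that already lie in a plane and are mapped to the same planar coordinates give a stress of zero.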

An optimal configuration of the 2D image points is then found through gradient descent [2]. The resulting visualisation of the image points allows the amount of overlap between the different classes of feature vectors to be estimated, and hence gives an indication of how difficult it would be to train a neural network to classify the data in the original d-dimensional space.
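
The gradient descent step can be sketched as follows. This is a minimal illustration rather than the routine from [2] or the update rule from [1]: it assumes Sammon's stress as the criterion, a fixed step size, a random initial configuration and distinct original points. The function name sammon_map and the parameters n_iter, step and seed are hypothetical, and the step size would normally need tuning to the scale of the data.

```python
import math
import random


def sammon_map(points, n_iter=500, step=0.3, seed=0):
    """Plain gradient descent on Sammon's stress (illustrative sketch).

    Returns the 2D image points and the stress recorded before the first
    update and after every update.
    """
    rng = random.Random(seed)
    n = len(points)
    # Distances in the original space (assumed non-zero for i != j).
    delta = [[math.dist(p, q) for q in points] for p in points]
    c = sum(delta[i][j] for i in range(n) for j in range(i + 1, n))
    # Random initial configuration of image points in 2D.
    image = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]

    def stress(img):
        return sum((delta[i][j] - math.dist(img[i], img[j])) ** 2 / delta[i][j]
                   for i in range(n) for j in range(i + 1, n)) / c

    history = [stress(image)]
    for _ in range(n_iter):
        grads = []
        for i in range(n):
            g = [0.0, 0.0]
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(image[i], image[j]), 1e-12)  # guard against d = 0
                # Coefficient of (y_i - y_j) in the partial derivative dE/dy_i.
                coeff = -2.0 * (delta[i][j] - d) / (delta[i][j] * d * c)
                g[0] += coeff * (image[i][0] - image[j][0])
                g[1] += coeff * (image[i][1] - image[j][1])
            grads.append(g)
        # Update all image points together, using gradients from the same configuration.
        for i in range(n):
            image[i][0] -= step * grads[i][0]
            image[i][1] -= step * grads[i][1]
        history.append(stress(image))
    return image, history
```

Plotting the returned image points, coloured by class label where labels are available, gives the kind of visualisation described above. Gradient descent may settle in a local minimum, so it can be worth repeating the run with several values of seed and keeping the configuration with the lowest final stress.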


[1] J. Sammon (1969), A non-linear mapping for data structure analysis. IEEE Transactions on Computers, 18(5):401-409.

[2] W. Press, S. Teukolsky, W. Vetterling, B. Flannery (1995), Numerical Recipes in C. Cambridge University Press.