k-Means Clustering - Grouping Data Vectors
The k-means algorithm is an example of an iterative clustering algorithm where the objective of a clustering algorithm is to
find a partitioning of the patterns into k groups or clusters, such that the patterns in each cluster are more similar to each
other than to patterns in different clusters [1].
For a given number of clusters, the clustering algorithm aims to minimise the sum of squared errors between each pattern in each cluster, and the centre of that cluster. Thus the objective of a square-error clustering is to find a partition containing k clusters which minimises this error [2]. The error function measures the total sqared error incurred in representing a much larger number of input patterns by only k cluster centres. A simple example is shown in the figure above for nine two-dimensional patterns, where k = 2.
For the k-means algorithm, the aim is to minimise this error starting from an initial partitioning of the data set and then iteratively adjust the cluster centres until an optimal partitioning is achieved. The algorithm proceeds as follows:
1. Select an initial set of "cluster centres",
2. Assign each of the input patterns to its closest cluster centre,
3. Compute new cluster centres as the centroids of the k clusters, thus minimising the error function,
4. If the position of any cluster centre changes, return to step 2; otherwise, stop.
Different initial choices for values of of cluster centres can result in the algorithm converging to different solutions. When the data points are well clustered, however, initial choices which may appear to be unrealistic still lead to the desired partitioning.
[1] L. Tarassenko (1998), A guide to neural computing applications. London, New York: Arnold.
[2] C.M. Bishop (1995), Neural Networks for Pattern Recognition. Oxford, New York: Clarendon Press; Oxford University Press. 482.