
Sunday, 9 October 2011

Exploring a clustering technique using mathematical group theory

Step I: Identification of ICM

[Figure: a scatter plot of the data]

The x-axis, identified by the base marker, represents numerical values from 1 to 100 (integers). The y-axis, identified by the decimal marker, represents numerical values from 0 to 1 (decimal, 2 significant digits).

From all the data, two points are selected at random, marked in the image by a black rectangle around them. They are labelled as the initial cluster markers (ICM).
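As a sketch of this step, two points can be drawn at random from the data. The values below are the ten rows of the table further down, stored as (base, decimal) pairs:

```python
import random

# The (base, decimal) pairs from the table below.
data = [(4, 0.71), (13, 0.61), (14, 0.98), (18, 0.03), (11, 0.68),
        (37, 0.26), (30, 0.04), (21, 0.22), (37, 0.48), (47, 0.94)]

# Step I: pick two distinct points at random as the initial cluster markers.
icm1, icm2 = random.sample(data, 2)
```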

Step II: Geometric tagging

The Euclidean distance between each point in the scatter plot and each ICM is calculated. In this case there are 48 points (after excluding the ICM): x1, x2, …, x48. The distances from x1 to ICM1 and to ICM2 are calculated, then from x2 to ICM1 and to ICM2, and so on up to x48.

If the distance between some xi and ICMa is less than the distance between xi and ICMb, then xi is grouped with ICMa and tagged xi(a). This gives rise to a clear demarcation: the data set is now split between the points xi(a) and xj(b), and the two sections are denoted S(a) and S(b).
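The tagging step can be sketched as follows (the function names are ours, not the author's):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tag(points, icm1, icm2):
    """Step II: assign every point to the nearer of the two ICMs."""
    s_a, s_b = [], []
    for p in points:
        if euclidean(p, icm1) <= euclidean(p, icm2):
            s_a.append(p)   # tagged with ICM1: section S(a)
        else:
            s_b.append(p)   # tagged with ICM2: section S(b)
    return s_a, s_b
```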

Step III: Averaging

Before any average is computed, it must be decided which value is to be averaged. Three cases are presented below.

  1. x-value: if the x-value of each data point in a cluster is averaged, the computed average lies on the vertical line x = µ (one such line per cluster)

  2. y-value: if the y-value of each data point in a cluster is averaged, the computed average lies on the horizontal line y = µ (one such line per cluster)

  3. Alien value: consider the following table.

Base marker (1-50)   Decimal marker (0-1)   Anonymous marker (1-100)
4                    0.71                   20
13                   0.61                   38
14                   0.98                   7
18                   0.03                   4
11                   0.68                   55
37                   0.26                   67
30                   0.04                   46
21                   0.22                   57
37                   0.48                   90
47                   0.94                   43

The x- and y-axes represent the values in the first two columns, while the third column shows a set of values not considered in the construction of the scatter plot. Since each row of the table corresponds to a point in the plot, each point is also associated with a certain “anonymous value”.

These can be considered in the averaging process, whereby all the “anonymous” values of all the points in each cluster are averaged separately:

S(a): Aa, the average of the anonymous values in cluster a

S(b): Ab, the average of the anonymous values in cluster b

These new points, Aa and Ab, are plotted in their respective clusters.
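Under the third case, the averaging can be sketched like this (the dictionary mapping each point to its anonymous value is our assumption about how the table would be stored):

```python
def average_anonymous(cluster, anon):
    """Step III, case 3: average the anonymous-marker values of a cluster.

    `anon` maps each (base, decimal) point to its anonymous value,
    as in the table above.
    """
    values = [anon[p] for p in cluster]
    return sum(values) / len(values)

# First two rows of the table treated as one cluster:
anon = {(4, 0.71): 20, (13, 0.61): 38}
average_anonymous([(4, 0.71), (13, 0.61)], anon)  # (20 + 38) / 2 = 29.0
```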

Step IV: Convergence

Aa and Ab become the new ICM, and steps II and III are repeated.

The iterations stop when the new Aa and Ab converge with the Aa and Ab from the previous iteration.
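Putting steps II-IV together, here is a minimal two-cluster sketch. It uses the x- and y-value averages (cases 1 and 2) as the new markers, since those stay in the plane of the plot, and it assumes both clusters remain non-empty throughout:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(cluster):
    # Cases 1 and 2: average the x-values and y-values separately.
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n,
            sum(p[1] for p in cluster) / n)

def two_clusters(points, tol=1e-9, seed=0):
    """Iterate steps II-III until the markers stop moving (step IV)."""
    rng = random.Random(seed)
    icm1, icm2 = rng.sample(points, 2)               # step I
    while True:
        s_a, s_b = [], []
        for p in points:                             # step II: tagging
            if euclidean(p, icm1) <= euclidean(p, icm2):
                s_a.append(p)
            else:
                s_b.append(p)
        new1, new2 = centroid(s_a), centroid(s_b)    # step III: averaging
        if euclidean(new1, icm1) < tol and euclidean(new2, icm2) < tol:
            return s_a, s_b                          # step IV: converged
        icm1, icm2 = new1, new2                      # new ICM; repeat
```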

Result

After the values have converged, the two sections S(a) and S(b) presented by the machine are two distinct groups of data, whose emergence signals that the machine has “learnt”. If a very large database is supplied as input to the machine, there is an advantage as well as a disadvantage.

  • Advantage: the machine learns better

  • Disadvantage: the data may not converge at all due to an improper choice of clusters


In order to prevent the second possibility, a suitable number of clusters has to be determined for n, the number of data-points. As a rule of thumb,

k = (n/2)^(1/2) = √(n/2)

Here, ‘k’ is the number of clusters.

If a dataset of 50 points is input, then k = 5 clusters are suggested. Once the iterations have been performed and convergence has been attained in five separate clusters, the centroids of each cluster (or, the average value of the data-points in each cluster) can now be used to arrive at 2 super-clusters.
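The rule of thumb can be computed directly (rounding to the nearest integer is our choice; the source gives only the formula):

```python
import math

def suggested_k(n):
    """Rule of thumb: k = sqrt(n / 2) clusters for n data points."""
    return round(math.sqrt(n / 2))

# For the 50-point example: sqrt(50 / 2) = sqrt(25) = 5 clusters.
```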
