
Sunday, 9 October 2011

Exploring a clustering technique using mathematical group theory

Step I: Identification of ICM

[Figure: a scatter plot of the data]

The x-axis, identified by the base marker, represents numerical values from 1 to 100 (integers). The y-axis, identified by the decimal marker, represents numerical values from 0 to 1 (decimal, 2 significant digits).

From all the data, two points are selected at random, marked in the image by a black rectangle around them. They are labelled as the initial cluster markers (ICM).
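As a sketch of this step, two points can be drawn at random from the data. The values below are the ten rows of the table further down, stored as (base, decimal) pairs:

```python
import random

# The (base, decimal) pairs from the table below.
data = [(4, 0.71), (13, 0.61), (14, 0.98), (18, 0.03), (11, 0.68),
        (37, 0.26), (30, 0.04), (21, 0.22), (37, 0.48), (47, 0.94)]

# Step I: pick two distinct points at random as the initial cluster markers.
icm1, icm2 = random.sample(data, 2)
```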

Step II: Geometric tagging

The Euclidean distance between each point in the scatter plot and each ICM is calculated. In this case there are 48 points (after excluding the ICM): x1, x2, …, x48. The distances from x1 to ICM1 and to ICM2 are calculated, then from x2 to ICM1 and to ICM2, and so on up to x48.

If the distance between some xi and ICMa is less than the distance between xi and ICMb, then xi is grouped with ICMa and tagged xi(a). This gives rise to a clear demarcation: the data set is now split between the points xi(a) and xj(b), and the two sections are denoted S(a) and S(b).
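The tagging step can be sketched as follows (the function names are ours, not the author's):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tag(points, icm1, icm2):
    """Step II: assign every point to the nearer of the two ICMs."""
    s_a, s_b = [], []
    for p in points:
        if euclidean(p, icm1) <= euclidean(p, icm2):
            s_a.append(p)   # tagged with ICM1: section S(a)
        else:
            s_b.append(p)   # tagged with ICM2: section S(b)
    return s_a, s_b
```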

Step III: Averaging

Before any average is computed, it must be decided which value is to be averaged. Three cases are presented below.

  1. x-value: if the x-value of each data point in a cluster is averaged, the computed average lies on the vertical line x = µ (one such line per cluster)

  2. y-value: if the y-value of each data point in a cluster is averaged, the computed average lies on the horizontal line y = µ (one such line per cluster)

  3. Alien value: consider the following table.

Base marker (1-50)   Decimal marker (0-1)   Anonymous marker (1-100)
4                    0.71                   20
13                   0.61                   38
14                   0.98                   7
18                   0.03                   4
11                   0.68                   55
37                   0.26                   67
30                   0.04                   46
21                   0.22                   57
37                   0.48                   90
47                   0.94                   43

The x- and y-axes represent the values in the first two columns, while the third column shows a set of values not considered in the construction of the scatter plot. Since each row of the table corresponds to a point in the plot, each point is also associated with a certain “anonymous value”.

These can be considered in the averaging process, whereby all the “anonymous” values of all the points in each cluster are averaged separately:

S(a): Aa, the average of the anonymous values in cluster a

S(b): Ab, the average of the anonymous values in cluster b

These new points, Aa and Ab, are plotted in their respective clusters.
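Under the third case, the averaging can be sketched like this (the dictionary mapping each point to its anonymous value is our assumption about how the table would be stored):

```python
def average_anonymous(cluster, anon):
    """Step III, case 3: average the anonymous-marker values of a cluster.

    `anon` maps each (base, decimal) point to its anonymous value,
    as in the table above.
    """
    values = [anon[p] for p in cluster]
    return sum(values) / len(values)

# First two rows of the table treated as one cluster:
anon = {(4, 0.71): 20, (13, 0.61): 38}
average_anonymous([(4, 0.71), (13, 0.61)], anon)  # (20 + 38) / 2 = 29.0
```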

Step IV: Convergence

Aa and Ab become the new ICM, and steps II and III are repeated.

The iterations stop when the new Aa and Ab converge with the Aa and Ab from the previous iteration.
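Putting steps II-IV together, here is a minimal two-cluster sketch. It uses the x- and y-value averages (cases 1 and 2) as the new markers, since those stay in the plane of the plot, and it assumes both clusters remain non-empty throughout:

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(cluster):
    # Cases 1 and 2: average the x-values and y-values separately.
    n = len(cluster)
    return (sum(p[0] for p in cluster) / n,
            sum(p[1] for p in cluster) / n)

def two_clusters(points, tol=1e-9, seed=0):
    """Iterate steps II-III until the markers stop moving (step IV)."""
    rng = random.Random(seed)
    icm1, icm2 = rng.sample(points, 2)               # step I
    while True:
        s_a, s_b = [], []
        for p in points:                             # step II: tagging
            if euclidean(p, icm1) <= euclidean(p, icm2):
                s_a.append(p)
            else:
                s_b.append(p)
        new1, new2 = centroid(s_a), centroid(s_b)    # step III: averaging
        if euclidean(new1, icm1) < tol and euclidean(new2, icm2) < tol:
            return s_a, s_b                          # step IV: converged
        icm1, icm2 = new1, new2                      # new ICM; repeat
```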

Result

After the values have converged, the two sections S(a) and S(b) presented by the machine are two distinct groups of data, whose emergence signals that the machine has “learnt”. If a very large database is supplied as input to the machine, there is an advantage as well as a disadvantage.

  • Advantage: the machine learns better

  • Disadvantage: the data may not converge at all due to an improper choice of clusters


In order to prevent the second possibility, a suitable number of clusters has to be determined for n, the number of data-points. As a rule of thumb,

k = (n/2)^(1/2) = √(n/2)

Here, ‘k’ is the number of clusters.

If a dataset of 50 points is input, then k = 5 clusters are suggested. Once the iterations have been performed and convergence has been attained in five separate clusters, the centroids of each cluster (or, the average value of the data-points in each cluster) can now be used to arrive at 2 super-clusters.
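The rule of thumb can be computed directly (rounding to the nearest integer is our choice; the source gives only the formula):

```python
import math

def suggested_k(n):
    """Rule of thumb: k = sqrt(n / 2) clusters for n data points."""
    return round(math.sqrt(n / 2))

# For the 50-point example: sqrt(50 / 2) = sqrt(25) = 5 clusters.
```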
