## Machine Learning : k-Means Clustering I

The **k-Means Clustering** finds centers of clusters and groups input samples around the clusters.

k-Means Clustering is a **partitioning method** which partitions data into k mutually exclusive clusters, and returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering, k-means clustering operates on actual observations (rather than the larger set of dissimilarity measures), and creates a **single level of clusters**. The distinctions mean that k-means clustering is often more suitable than **hierarchical clustering** for large amounts of data.

The following description for the steps is from wiki - K-means_clustering.

**Step 1**

k initial "means" (in this case k=3) are randomly generated within the data domain.

**Step 2**

k clusters are created by associating every observation with the nearest mean. The partitions here represent the Voronoi diagram generated by the means.

**Step 3**

The centroid of each of the k clusters becomes the new mean.

**Step 4**

Steps 2 and 3 are repeated until convergence has been reached.

In OpenCV, we use **cv.KMeans2(()** and it's defined like this:

cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]])

The parameters are:

**data**: Data for clustering.**nclusters(K) K**: Number of clusters to split the set by.**criteria**: The algorithm termination criteria, that is, the maximum number of iterations and/or the desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster centers moves by less than criteria.epsilon on some iteration, the algorithm stops.**attempts**: Flag to specify the number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness (see the last function parameter).**flags**:- KMEANS_RANDOM_CENTERS - selects random initial centers in each attempt.
- KMEANS_PP_CENTERS - uses kmeans++ center initialization by Arthur and Vassilvitskii.
- KMEANS_USE_INITIAL_LABELS - during the first (and possibly the only) attempt, use the user-supplied labels instead of computing them from the initial centers. For the second and further attempts, use the random or semi-random centers. Use one of KMEANS_*_CENTERS flag to specify the exact method.

**centers**: Output matrix of the cluster centers, one row per each cluster center.

In this section, we'll play with a data set which has only one feature. The data is one-dimensional, and clustering can be decided by one parameter.

We generated two groups of random numbers in two separated ranges, each group with 25 numbers as shown in the picture below:

The code:

import numpy as np import cv2 from matplotlib import pyplot as plt x = np.random.randint(25,100,25) # 25 randoms in (25,100) y = np.random.randint(175,250,25) # 25 randoms in (175,250) z = np.hstack((x,y)) # z.shape : (50,) z = z.reshape((50,1)) # reshape to a column vector z = np.float32(z) plt.hist(z,256,[0,256]) plt.show()

Now, we're going to apply **cv2.kmeans()** function to the data with the following parameters:

# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 ) criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0) # Set flags flags = cv2.KMEANS_RANDOM_CENTERS # Apply KMeans compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags)

In the code, the **criteria** means whenever 10 iterations of algorithm is ran, or an accuracy of epsilon = 1.0 is reached, stop the algorithm and return the answer.

Here are the meanings of the return values from **cv2.kmeans()** function:

**compactness**: It is the sum of squared distance from each point to their corresponding centers.**labels**: This is the label array where each element marked '0','1',.....**centers**: This is array of centers of clusters.

For the current case, it gives us centers as:

centers: [[ 72.08000183] [ 217.03999329]]

Here is the final code for 1-D data:

import numpy as np import cv2 from matplotlib import pyplot as plt x = np.random.randint(25,100,25) # 25 randoms in (25,100) y = np.random.randint(175,250,25) # 25 randoms in (175,250) z = np.hstack((x,y)) # z.shape : (50,) z = z.reshape((50,1)) # reshape to a column vector z = np.float32(z) # Define criteria = ( type, max_iter = 10 , epsilon = 1.0 ) criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0) # Set flags flags = cv2.KMEANS_RANDOM_CENTERS # Apply KMeans compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags) print('centers: %s' %centers) A = z[labels==0] B = z[labels==1] # initial plot plt.subplot(211) plt.hist(z,256,[0,256]) # Now plot 'A' in red, 'B' in blue, 'centers' in yellow plt.subplot(212) plt.hist(A,256,[0,256],color = 'r') plt.hist(B,256,[0,256],color = 'b') plt.hist(centers,32,[0,256],color = 'y') plt.show()

We can see the data has been clustered: A's centroid is 72 and B's centroid is 217 which was one of the outputs from the **cv2.kmeans()** function.

Ph.D. / Golden Gate Ave, San Francisco / Seoul National Univ / Carnegie Mellon / UC Berkeley / DevOps / Deep Learning / Visualization