{"id":635,"date":"2017-05-26T23:00:03","date_gmt":"2017-05-27T03:00:03","guid":{"rendered":"http:\/\/aristotle2digital.blogwyrm.com\/?p=635"},"modified":"2021-11-25T20:10:47","modified_gmt":"2021-11-26T01:10:47","slug":"k-means-data-clustering","status":"publish","type":"post","link":"https:\/\/aristotle2digital.blogwyrm.com\/?p=635","title":{"rendered":"K-Means Data Clustering"},"content":{"rendered":"<p><style> td { text-align: center !important; }<\/style><\/p>\n<p>This month\u2019s column is inspired by James McCaffrey\u2019s <em>Test Run<\/em> article <em><a href=\"https:\/\/msdn.microsoft.com\/en-us\/magazine\/mt185575.aspx\">K-Means++ Data Clustering<\/a><\/em>, published in MSDN in August 2015.\u00a0 McCaffrey\u2019s piece is really two articles rolled into one and is interesting not only for the data clustering algorithm that it presents but also for the fascinating light it sheds on the human facility to visually analyze data (not that McCaffrey comments on this point \u2013 it\u2019s just there for the careful observer to note).\u00a0 But before getting to the more philosophical aspects, it\u2019s necessary to discuss what K-means data clustering is all about.<\/p>\n<p>To explain what data clustering entails, I will use the very same data that McCaffrey presents in his article as the test set for mine.\u00a0 He gives a set of ordered pairs of height (in inches) and weight (in pounds) for 20 people in some sampled population.\u00a0 The table listing these points<\/p>\n<table>\n<tbody>\n<tr>\n<th width=\"4\"><strong>Height<\/strong><br \/><strong>(in)<\/strong><\/th>\n<th width=\"4\"><strong>Weight<\/strong><br \/><strong>(lbs)<\/strong><\/th>\n<\/tr>\n<tr>\n<td width=\"4\">65.0<\/td>\n<td width=\"4\">220.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">73.0<\/td>\n<td width=\"4\">160.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">59.0<\/td>\n<td width=\"4\">110.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">61.0<\/td>\n<td width=\"4\">120.0<\/td>\n<\/tr>\n<tr>\n<td 
width=\"4\">75.0<\/td>\n<td width=\"4\">150.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">67.0<\/td>\n<td width=\"4\">240.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">68.0<\/td>\n<td width=\"4\">230.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">70.0<\/td>\n<td width=\"4\">220.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">62.0<\/td>\n<td width=\"4\">130.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">66.0<\/td>\n<td width=\"4\">210.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">77.0<\/td>\n<td width=\"4\">190.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">75.0<\/td>\n<td width=\"4\">180.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">74.0<\/td>\n<td width=\"4\">170.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">70.0<\/td>\n<td width=\"4\">210.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">61.0<\/td>\n<td width=\"4\">110.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">58.0<\/td>\n<td width=\"4\">100.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">66.0<\/td>\n<td width=\"4\">230.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">59.0<\/td>\n<td width=\"4\">120.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">68.0<\/td>\n<td width=\"4\">210.0<\/td>\n<\/tr>\n<tr>\n<td width=\"4\">61.0<\/td>\n<td width=\"4\">130.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>doesn\u2019t reveal anything special at a glance; nor should it be expected that it would.\u00a0 As every good data scientist knows, the best (human) approach is to plot the data so that the eye can pick out patterns that the mind may fail to perceive in the tabulated points.\u00a0 The resulting plot<\/p>\n<p><a href=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-659\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated.png\" alt=\"\" width=\"857\" height=\"577\" srcset=\"https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated.png 857w, 
https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated-300x202.png 300w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated-768x517.png 768w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated-810x545.png 810w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated-146x97.png 146w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-updated-250x167.png 250w\" sizes=\"auto, (max-width: 857px) 100vw, 857px\" \/><\/a><\/p>\n<p>reveals that there are three distinct clusters in the data. McCaffrey doesn\u2019t show a plot of these data, but it seems that he omitted it because he wanted to focus on how to teach the computer to find these three groups using K-means clustering without the benefit of eyes and a brain, which it obviously doesn\u2019t have.<\/p>\n<p>The principle behind K-means clustering is quite simple:\u00a0 a group of points forms a cluster if all of its points are closer to the center of the group than they are to the centers of the other groups.<\/p>\n<p>The implementation is a bit trickier.\u00a0 Distance plays a key role in this algorithm, but whereas the human eye can take in the entire plot at one go and can distinguish the center of each cluster intuitively, the computer must process the points one at a time.\u00a0 This leads to two problems:\u00a0 1) how to numerically calculate distances between points and 2) how to find the centers of the clusters from which these distances are measured.\u00a0 Let\u2019s deal with each of these.<\/p>\n<p>The fact that the units for the x-axis are different from those on the y-axis can lead to some problems identifying the clusters \u2013 for both the human and the machine.\u00a0 Consider the following plot of the same data but with the x- and y-axes given the same extent 
from 50 to 200 in their respective units:<\/p>\n<p><a href=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-633\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale.png\" alt=\"\" width=\"857\" height=\"594\" srcset=\"https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale.png 857w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale-300x208.png 300w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale-768x532.png 768w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-common-scale-810x561.png 810w\" sizes=\"auto, (max-width: 857px) 100vw, 857px\" \/><\/a><\/p>\n<p>It isn\u2019t at all clear that there are three distinct clusters as opposed to two.\u00a0 A better approach is to normalize the data in some fashion.\u00a0 I prefer using the Z-score, where the data are normalized according to<\/p>\n<p>\\[ Z_x = \\frac{ x - \\langle x \\rangle }{\\sigma_x} \\; , \\]<\/p>\n<p>where $\\langle x \\rangle$ is the mean over all the values and $\\sigma_x$ is the corresponding standard deviation.\u00a0 Applying the normalization to the height and the weight gives<a href=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-632\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize.png\" alt=\"\" width=\"857\" height=\"549\" srcset=\"https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize.png 857w, 
https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize-300x192.png 300w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize-768x492.png 768w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-normalize-810x519.png 810w\" sizes=\"auto, (max-width: 857px) 100vw, 857px\" \/><\/a><\/p>\n<p>Once the data are normalized, the distance between any two points is determined by the usual Euclidean norm.<\/p>\n<p>The second step is to determine the centers of each cluster, which is essentially a chicken-and-egg problem.\u00a0 On one hand, one needs the centers to determine which point belongs to which cluster.\u00a0 On the other hand, one needs to have the points clustered in order to find their centers.\u00a0 The K-means approach is to guess the centers and then to iterate to find a self-consistent solution.<\/p>\n<p>I implemented the K-means clustering code in a Jupyter notebook using Python 2.7 and numpy.\u00a0 I defined two helper functions.<\/p>\n<p>The first determines the cluster into which each point falls based on the current guess of the centers.\u00a0 It loops over all points and computes each point\u2019s squared distance from each of the k centers.\u00a0 Each point is assigned to the closest center, and the index of that center is recorded in a list, which is returned.<\/p>\n<div class=\"myQuoteDiv\">\n<pre>import numpy as np\n\ndef find_cluster(curr_means, data):\n    # assign each point to the index of its nearest center\n    num_pts      = len(data)\n    k            = len(curr_means)\n    cluster_list = np.zeros(num_pts, dtype=int)\n    dist         = np.zeros(k)\n    for i in range(num_pts):\n        for m in range(k):\n            l       = data[i] - curr_means[m]\n            dist[m] = l.dot(l)  # squared Euclidean distance\n        cluster_list[i] = np.argmin(dist)\n    return cluster_list\n<\/pre>\n<\/div>\n<p>The second helper function updates the position of the centers by selecting, from the list of all points, those that belong to the 
current cluster.\u00a0 This function depends heavily on the slicing and numerical functions built into numpy.\u00a0 (Note that it assumes every cluster keeps at least one point; an empty cluster would produce an undefined mean.)<\/p>\n<div class=\"myQuoteDiv\">\n<pre>def update_means(k, cluster_list, curr_data):\n    # recompute each center as the mean of the points assigned to it\n    updated_means = np.zeros((k, 2))\n    for i in range(k):\n        # pick out the points currently assigned to cluster i\n        clustered_data     = curr_data[np.where(cluster_list == i)]\n        # new center = (mean height, mean weight) of those points\n        updated_means[i,:] = np.array([np.mean(clustered_data[:,0]), np.mean(clustered_data[:,1])])\n    return updated_means\n<\/pre>\n<\/div>\n<p>For this implementation, I chose three clusters (k = 3) and seeded the guess for the centers by randomly selecting from the list of the initial points.\u00a0 Subsequent updates moved the centers off the tabulated points.\u00a0 The update process was iterated until no changes were observed in the positions of the cluster centers.\u00a0 The final cluster assignments were then used to color-code the points based on the cluster to which they belong.<\/p>\n<p>The first run of the algorithm was successful, with the resulting plot indicating that the process identified the clusters much as the human eye would.<\/p>\n<p><a href=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-631\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1.png\" alt=\"\" width=\"857\" height=\"597\" srcset=\"https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1.png 857w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1-300x209.png 300w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1-768x535.png 768w, https:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-1-810x564.png 810w\" sizes=\"auto, (max-width: 857px) 100vw, 857px\" \/><\/a><\/p>\n<p>Subsequent runs 
show that about 70-80% of the time, the algorithm correctly identifies the clusters.\u00a0 The other 20-30% of the time, the algorithm gets \u2018stuck\u2019 and misidentifies the points in a way that tends to split the small cluster in the bottom left into two pieces (although not always in the same way).<\/p>\n<p><a href=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-3.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-630\" src=\"http:\/\/aristotle2digital.blogwyrm.com\/wp-content\/uploads\/2017\/05\/Height-v-Weight-run-3.png\" alt=\"\" width=\"857\" height=\"568\" \/><\/a><\/p>\n<p>The \u2018++\u2019 nomenclature featured in McCaffrey\u2019s article (K-means++ clustering) refers to a more sophisticated way of selecting the initial cluster centers that minimizes the chances that the algorithm will get confused.\u00a0 This new type of seeding and some numerical experiments with a larger data set (in both number of points and dimensionality) will be the subject of next month\u2019s column.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This month\u2019s column is inspired by James McCaffrey\u2019s Test Run article K-Means++ Data Clustering, published in MSDN in August 2015.\u00a0 \u00a0McCaffrey\u2019s piece is really two articles rolled into one and&#8230; <a class=\"read-more-button\" href=\"https:\/\/aristotle2digital.blogwyrm.com\/?p=635\">Read more 
&gt;<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-635","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts\/635","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=635"}],"version-history":[{"count":0,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=\/wp\/v2\/posts\/635\/revisions"}],"wp:attachment":[{"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=635"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=635"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/aristotle2digital.blogwyrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=635"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}