Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

424 lines
14 KiB

  1. # K-Means Algorithm
  2. The general idea of clustering is to group data with similar traits. The main benefit of this is the ability to extract information from new data because you know what it is most similar to, thus giving you valuable insight. In the field of machine learning, clustering is considered unsupervised learning because it requires no labels on the data -- the algorithm auto assigns clusters, and you infer behavior off of those clusters.
  3. Clustering has many applications such as image segmentation, preference predictions, compression, model fitting.
  4. Although you can trace the concept of k-means clustering back to 1967 with a paper by Hugo Steinhaus, James MacQueen was the first to coin the term k-means in 1956. MacQueen's paper title "Some Methods For Classification and Analysis of Multivariate Observations" goes over the k-means process that segments an N-dimensional population into k sets. Note: when we refer to k in the algorithm, that is the number of sets that we are dividing the population.
  5. A great deal of MacQueen's article discusses optimality for the k-means algorithm, which is an important area to discuss, especially when considering the time at which the article got published. Back in 1967, computers were very slow and expensive. Although we had proofs that can guarantee that we could find an optimal solution, they were an NP-Hard problem. This is critical because NP-Hard problems are problems that are exponential to solve.
  6. Although the k-means algorithm did not guarantee the optimal solution, there was a subset of problems that it did guarantee an optimal solution-- the specifics of these problems got discussed later in the article. Nerveless, since this algorithm wasn't computationally expensive and generally gave good results, it was a huge breakthrough at the time.
  7. This algorithm contains four major segments:
  8. ## Step 1:
  9. Pick k random points as cluster centers called centroids.
  10. ## Step 2:
  11. Assign each point to the nearest cluster by calculating its distance to each centroid.
  12. ## Step 3:
  13. Find the new cluster center by taking the average of the assigned points.
  14. ## Step 4:
  15. Repeat steps 2 and 3 until no cluster assignments change.
  16. # Python Implementation
  17. Implementing this in python is rather straight forward. Given data, cluster it into k sets and return the cluster assignments and cluster values.
  18. ```python
  19. import sys
  20. import numpy as np
  21. def distToClust(val, clusterCenter):
  22. """
  23. Distance measure to cluster, can change
  24. this to be different types of distances
  25. """
  26. return np.linalg.norm(val-clusterCenter)
  27. def closestCenter(val, clusters):
  28. """
  29. Finds the cluster closest to the presented
  30. value
  31. """
  32. curMin = sys.maxsize
  33. curIndex = 0
  34. for k in range(0, len(clusters)):
  35. d = distToClust(val, clusters[k])
  36. if d < curMin:
  37. curIndex = k
  38. curMin = d
  39. return curIndex
  40. def kmeansAlgo(k, data):
  41. """
  42. k: number of clusters
  43. data: nxd numpy matrix where n is number of elements
  44. and d is the number of dimensions in the data.
  45. return: tuple of assignments and clusters where clusters
  46. is a list of clusteroids and assignments maps each value
  47. to a single cluster
  48. """
  49. n = data.shape[0] # length of data to cluster
  50. d = data.shape[1] # dimensionality of data
  51. # maps each element in data to a cluster
  52. assignments = np.zeros(n, dtype=np.int)
  53. clusters = []
  54. for i in range(0, k):
  55. clusters.append(data[i])
  56. reAssigned = True
  57. generations = 0
  58. while reAssigned:
  59. reAssigned = False
  60. # assign clusters
  61. for i in range(0, n):
  62. c = closestCenter(data[i], clusters)
  63. if c != assignments[i]:
  64. reAssigned = True
  65. assignments[i] = c
  66. # re-compute centers
  67. clusterValues = []
  68. for _ in range(0, k):
  69. clusterValues.append([])
  70. for i in range(0, n):
  71. clusterValues[assignments[i]].append(data[i])
  72. for i in range(0, k):
  73. clusters[i] = np.average(clusterValues[i], axis=0)
  74. generations = generations + 1
  75. print("Clustering took " + str(generations) + " generations")
  76. return assignments, clusters
  77. ```
  78. ```python
  79. ```
  80. # Image Segmentation
  81. Using our k-means algorithm, we can cluster the pixels in an image together.
  82. ## Clustering on Color
  83. When we cluster the pixels of an image based on color, we map pixels with a similar color to the same cluster. Since an image is a three-dimensional matrix, we can do this just fine. When clustering our data, the input is going just to be the list of pixel values. Note: as far as the k-means algorithm is concerned, there is no coordinates, just a list of pixels. The RGB values of the pixels get clustered together.
  84. To make things run a bit faster, we are going to be using the k-means implementation from Sklearn.
  85. ```python
  86. def segmentImgClrRGB(imgFilename, k):
  87. #1. Load the image
  88. img = cv2.imread(imgFilename)
  89. h = img.shape[0]
  90. w = img.shape[1]
  91. img.shape = (img.shape[0] * img.shape[1], 3)
  92. #5. Run k-means on the vectorized reponses X to get a vector of labels (the clusters);
  93. #
  94. kmeans = KMeans(n_clusters=k, random_state=0).fit(img).labels_
  95. #6. Reshape the label results of k-means so that it has the same size as the input image
  96. # Return the label image which we call idx
  97. kmeans.shape = (h, w)
  98. return kmeans
  99. ```
  100. After we have our pixel assignment, we want some useful ways to display it. In this algorithm, we are coloring in each pixel with the assignments' clustered center. IE: if our algorithm ran with three clusters, the new image would only have three colors in it.
  101. ```python
  102. import skimage
  103. from sklearn.cluster import KMeans
  104. from numpy import linalg as LA
  105. def colorClustering(idx, imgFilename, k):
  106. img = cv2.imread(imgFilename)
  107. clusterValues = []
  108. for _ in range(0, k):
  109. clusterValues.append([])
  110. for r in range(0, idx.shape[0]):
  111. for c in range(0, idx.shape[1]):
  112. clusterValues[idx[r][c]].append(img[r][c])
  113. imgC = np.copy(img)
  114. clusterAverages = []
  115. for i in range(0, k):
  116. # print(len(clusterValues[i])/(idx.shape[1]*idx.shape[0]))
  117. clusterAverages.append(np.average(clusterValues[i], axis=0))
  118. for r in range(0, idx.shape[0]):
  119. for c in range(0, idx.shape[1]):
  120. imgC[r][c] = clusterAverages[idx[r][c]]
  121. return imgC
  122. ```
  123. Next, we need a way of printing the images; I usually use matplotlib. I'm also displaying the image that we are going to be using for the clustering.
  124. ```python
  125. import cv2
  126. import matplotlib.pyplot as plt
  127. def printI(img):
  128. rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
  129. plt.imshow(rgb)
  130. printI(cv2.imread("bg.jpg"))
  131. ```
  132. ![Original image](media/k-means/output_9_0.png)
  133. ```python
  134. idx = segmentImgClrRGB("bg.jpg", 5)
  135. res = colorClustering(idx, "bg.jpg", 5)
  136. printI(res)
  137. ```
  138. ![clustered image](media/k-means/output_10_1.png)
  139. Just like you can segment with RGB values, you can also try different color spaces such as HSV. Playing around with the distance measure would also give you different results.
  140. ## Texture Clustering
  141. Similarly to color segmentation, you can cluster based on texture detection. The premise of this method is that you generate a bunch of filters of different shapes, sizes, and scales. When you apply the filter bank to a pixel, it has a similar response as pixels in regions with a similar texture.
  142. The first step is to construct a filter bank.
  143. ```python
  144. '''
  145. The Leung-Malik (LM) Filter Bank, implementation in python
  146. T. Leung and J. Malik. Representing and recognizing the visual appearance of
  147. materials using three-dimensional textons. International Journal of Computer
  148. Vision, 43(1):29-44, June 2001.
  149. Reference: http://www.robots.ox.ac.uk/~vgg/research/texclass/filters.html
  150. '''
  151. def gaussian1d(sigma, mean, x, ord):
  152. x = np.array(x)
  153. x_ = x - mean
  154. var = sigma**2
  155. # Gaussian Function
  156. g1 = (1/np.sqrt(2*np.pi*var))*(np.exp((-1*x_*x_)/(2*var)))
  157. if ord == 0:
  158. g = g1
  159. return g
  160. elif ord == 1:
  161. g = -g1*((x_)/(var))
  162. return g
  163. else:
  164. g = g1*(((x_*x_) - var)/(var**2))
  165. return g
  166. def gaussian2d(sup, scales):
  167. var = scales * scales
  168. shape = (sup,sup)
  169. n,m = [(i - 1)/2 for i in shape]
  170. x,y = np.ogrid[-m:m+1,-n:n+1]
  171. g = (1/np.sqrt(2*np.pi*var))*np.exp( -(x*x + y*y) / (2*var) )
  172. return g
  173. def log2d(sup, scales):
  174. var = scales * scales
  175. shape = (sup,sup)
  176. n,m = [(i - 1)/2 for i in shape]
  177. x,y = np.ogrid[-m:m+1,-n:n+1]
  178. g = (1/np.sqrt(2*np.pi*var))*np.exp( -(x*x + y*y) / (2*var) )
  179. h = g*((x*x + y*y) - var)/(var**2)
  180. return h
  181. def makefilter(scale, phasex, phasey, pts, sup):
  182. gx = gaussian1d(3*scale, 0, pts[0,...], phasex)
  183. gy = gaussian1d(scale, 0, pts[1,...], phasey)
  184. image = gx*gy
  185. image = np.reshape(image,(sup,sup))
  186. return image
  187. def makeLMfilters():
  188. sup = 49
  189. scalex = np.sqrt(2) * np.array([1,2,3])
  190. norient = 6
  191. nrotinv = 12
  192. nbar = len(scalex)*norient
  193. nedge = len(scalex)*norient
  194. nf = nbar+nedge+nrotinv
  195. F = np.zeros([sup,sup,nf])
  196. hsup = (sup - 1)/2
  197. x = [np.arange(-hsup,hsup+1)]
  198. y = [np.arange(-hsup,hsup+1)]
  199. [x,y] = np.meshgrid(x,y)
  200. orgpts = [x.flatten(), y.flatten()]
  201. orgpts = np.array(orgpts)
  202. count = 0
  203. for scale in range(len(scalex)):
  204. for orient in range(norient):
  205. angle = (np.pi * orient)/norient
  206. c = np.cos(angle)
  207. s = np.sin(angle)
  208. rotpts = [[c+0,-s+0],[s+0,c+0]]
  209. rotpts = np.array(rotpts)
  210. rotpts = np.dot(rotpts,orgpts)
  211. F[:,:,count] = makefilter(scalex[scale], 0, 1, rotpts, sup)
  212. F[:,:,count+nedge] = makefilter(scalex[scale], 0, 2, rotpts, sup)
  213. count = count + 1
  214. count = nbar+nedge
  215. scales = np.sqrt(2) * np.array([1,2,3,4])
  216. for i in range(len(scales)):
  217. F[:,:,count] = gaussian2d(sup, scales[i])
  218. count = count + 1
  219. for i in range(len(scales)):
  220. F[:,:,count] = log2d(sup, scales[i])
  221. count = count + 1
  222. for i in range(len(scales)):
  223. F[:,:,count] = log2d(sup, 3*scales[i])
  224. count = count + 1
  225. return F
  226. ```
  227. Now that we have our filters, we are going to display them to verify that they are what we expect.
  228. ```python
  229. def displayFilterBank():
  230. fig, axs = plt.subplots(6, 8, figsize=(20,20))
  231. F = makeLMfilters()
  232. for row in range(0, 6):
  233. for col in range(0, 8):
  234. f = F[:, :, row * 6 + col]
  235. axs[row][col].imshow(f, cmap='hot', interpolation='nearest')
  236. axs[row][col].grid(True)
  237. # plt.show()
  238. plt.savefig('filters.png')
  239. displayFilterBank()
  240. ```
  241. ![texture clustered image](media/k-means/output_15_0.png)
  242. With the filter bank, we can start clustering based on the responses from each filter response. When we clustered with color, we had three dimensions; with our filter bank, we now have 48 dimensions.
  243. ```python
  244. def generateGrayScale(image):
  245. w = np.array([[[ 0.25, 0.5, 0.25]]])
  246. gray2 = cv2.convertScaleAbs(np.sum(image*w, axis=2))
  247. return gray2
  248. def segmentImg(imgFilename, clusters):
  249. #1. Load and display the image from which you want to segment the foreground from the background
  250. # Make sure to convert your image to grayscale after loading
  251. img = cv2.imread(imgFilename)
  252. h = img.shape[0]
  253. w = img.shape[1]
  254. gImg = generateGrayScale(img)
  255. printI(gImg)
  256. #2. Create an overcomplete bank of filters F
  257. F = makeLMfilters()
  258. s = F.shape
  259. #3. Convolve the input image with every filter in the bank of filters
  260. responses = [None] * s[2]
  261. for k in range(0, s[2]):
  262. fb = F[:, :,k]
  263. response = cv2.filter2D(gImg,-1, fb)
  264. responses[k] = response
  265. #4.Take the absolute values of the responses and
  266. # reshape the reponse tensor into a matrix of size [row*cols, num_filters]
  267. a = np.zeros((h*w, s[2]), dtype=np.float)
  268. for r in range(0, h):
  269. for c in range(0, w):
  270. for f in range(0, s[2]):
  271. t = abs(responses[f][r][c])
  272. a[r * w + c][f] = abs(responses[f][r][c])
  273. #5. Run k-means on the vectorized reponses X to get a vector of labels (the clusters);
  274. kmeans = KMeans(n_clusters=clusters, random_state=0).fit(a).labels_
  275. #6. Reshape the label results of k-means so that it has the same size as the input image
  276. # Return the label image which we call idx
  277. kmeans.shape = (h, w)
  278. return kmeans
  279. ```
  280. ```python
  281. idx1 = segmentImg("bg.jpg",6)
  282. res1 = colorClustering(idx1, "bg.jpg", 6)
  283. printI(res1)
  284. ```
  285. ![png](output_18_0.png)
  286. With the texture clustering, you can see that part of the ocean was in the same cluster as part of the sky or beach. This color mixing is to be expected because they got clustered on texture and not color. When doing texture detection, a common thing to pick up is blurriness; this makes texture clustering useful for things like background removal when your foreground is in focus, and the background is blurred.
  287. # Takeaways
  288. Due to the k-means algorithm not always converting on an optimum answer and being profoundly affected by outlines, it is rarely used by itself. However, it frequently used to bootstrap and influence other algorithms.
  289. As the MacQueen stated initially in his article, "there is no feasible, general method which always yield an optimal partition." Due to the nature of NP-Hard problems, this fact is unlikely to change anytime soon. More recently, people have rebooted k-means to include a beam search to avoid converging on local maxima-- this process is called "k-means++" and Vassilvitskii outlines it in his 2007 paper titled "K-means++: the advantages of careful seeding."
  290. Another major question in this field of research is: how do we choose k? Newer work with k-means and other clustering techniques look into how we can automatically select a k using the elbow technique or other techniques like GVF.
  291. K-means has had a lasting effect on the field of machine learning. Most textbooks and AI classes cover k-means clustering as a starting point when teaching people about unsupervised learning. Moreover, algorithms to this day are still using k-means as a tool behind the scenes to pre-process data before it gets fed to the next step in the data pipeline.