Word embeddings have stolen all my attention for the last week.

At a very high level, embeddings allow you to reduce the dimensionality of something into a smaller vector that conveys positional meaning in that latent space. When considering something like a word, this is very useful because it enables you to use the vectorized version of the word in a secondary model. Since words that have similar meanings group together, it makes training faster. Computerphile has a fantastic video of this on YouTube.

<youtube src="gQddtTdmG_8" />

Google has open-sourced its tool for word embeddings called [word2vec](https://code.google.com/archive/p/word2vec/). Google also made their model trained on 100 billion words publicly [available](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing). Note: unzipped, the model is about 3 GB.

Using [Gensim](https://radimrehurek.com/gensim/) (a Python text data science library), we can load Google's pre-trained model.
```python
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
```
One of the coolest things that you can do with vectorized words is query for words that are similar to a word that you provided. Gensim provides a straightforward function that gives you the ten most similar words, along with similarity scores.

```python
model.most_similar("hello")
# output:
# [('hi', 0.654898464679718),
#  ('goodbye', 0.639905571937561),
#  ('howdy', 0.6310957074165344),
#  ('goodnight', 0.5920578241348267),
#  ('greeting', 0.5855878591537476),
#  ('Hello', 0.5842196941375732),
#  ("g'day", 0.5754077434539795),
#  ('See_ya', 0.5688871145248413),
#  ('ya_doin', 0.5643119812011719),
#  ('greet', 0.5636603832244873)]
```
```python
model.most_similar("cat")
# output:
# [('cats', 0.8099379539489746),
#  ('dog', 0.7609456777572632),
#  ('kitten', 0.7464985251426697),
#  ('feline', 0.7326233983039856),
#  ('beagle', 0.7150583267211914),
#  ('puppy', 0.7075453996658325),
#  ('pup', 0.6934291124343872),
#  ('pet', 0.6891531348228455),
#  ('felines', 0.6755931377410889),
#  ('chihuahua', 0.6709762215614319)]
```
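
The scores that `most_similar` returns are cosine similarities between word vectors. To make that concrete, here is a minimal sketch of the same ranking idea using made-up 3-dimensional vectors. The `toy_vectors` values and the `toy_most_similar` helper are hypothetical and exist only for illustration; the real model uses 300 dimensions and a far more optimized implementation.

```python
import math

# Hypothetical 3-dimensional "embeddings"; the values are invented for illustration.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.05],
    "dog": [0.8, 0.3, 0.2],
    "computer": [0.1, 0.05, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, best first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

ranking = toy_most_similar("cat", toy_vectors)
print([w for w, _ in ranking])  # ['kitten', 'dog', 'computer']
```

With these invented numbers, "kitten" and "dog" land close to "cat" while "computer" points in a very different direction, which mirrors the pattern in the real output above.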
Using Gensim's `most_similar` method, we can write an elementary function that transforms a sentence into a new sentence using similar words.

```python
def transformSentence(sentence):
    outputSentence = ""
    for word in sentence.split(" "):
        try:
            # Swap in the closest neighbor of each word.
            outputSentence += model.most_similar(word)[0][0] + " "
        except Exception:
            # The lookup failed (e.g. the word is not in the vocabulary); keep the word.
            outputSentence += word + " "
    return outputSentence

print(transformSentence("hello world"))
# output:
# hi globe
```
```python
print(transformSentence("look mom no hands"))
# output:
# looks Mom No hand
```

```python
print(transformSentence("The general idea of clustering is to group data with similar traits"))
# output:
# This gen_eral concept of Clustering was to groups Data wtih similiar trait
```
On elementary examples like "hello world," it did OK and transformed it into "hi globe." However, with more complex examples, it falls apart because it puts in nonsensical words or doesn't match verbs and nouns. It is common for this algorithm to cluster the plural of a noun very close to its singular; e.g., "cats" is conceptually similar to "cat."

We can write a quick heuristic to try to filter these things out.
```python
def removeFromString(string, chars):
    for c in chars:
        string = string.replace(c, "")
    return string

def transformSentenceWithHeuristic(sentence):
    outputSentence = ""
    for word in sentence.split(" "):
        try:
            changed = False
            for w, _ in model.most_similar(word):
                clean = removeFromString(w, [' ', '_']).lower()
                # Skip multi-word phrases and candidates contained in the original word.
                if clean not in word.lower() and "_" not in w:
                    outputSentence += w + " "
                    changed = True
                    break
            # No candidate survived the filter: keep the original word.
            outputSentence = outputSentence if changed else outputSentence + word + " "
        except Exception:
            outputSentence += word + " "
    return outputSentence

print(transformSentenceWithHeuristic("The general idea of clustering is to group data with similar traits."))
# output:
# This manager concept of clusters was to groups datasets wtih similiar traits.
```
```python
print(transformSentenceWithHeuristic("Sit down and grab a drink because it is time that we talk about the LSD trip that is the 1981 movie Shock Treatment."))
# output:
# Relax up and grabbing a drinks but that was day it I talking abut this hallucinogenic trips it was this 1981 film Fever Treatment.
```
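
The filtering step can be tried without loading the 3 GB model by handing it a hard-coded candidate list. Note that the check in `transformSentenceWithHeuristic` only rejects a candidate when it is contained in the original word, so "cats" still passes for "cat" (and "groups" for "group", as the output shows). The sketch below is a hypothetical variant that rejects containment in either direction; `pickReplacement` is not part of the original code, and the candidate list is copied from the `most_similar("cat")` output earlier in the post.

```python
def pickReplacement(word, candidates):
    """Return the first candidate that is not a trivial variant of `word`.

    `candidates` is a list of (word, score) pairs, best first, in the same
    shape that most_similar returns. Falls back to `word` itself.
    """
    original = word.lower()
    for candidate, _ in candidates:
        if "_" in candidate:
            continue  # skip multi-word phrases such as 'See_ya'
        clean = candidate.lower()
        # Reject plurals, capitalizations, and stems: anything where one
        # string contains the other.
        if clean in original or original in clean:
            continue
        return candidate
    return word

catCandidates = [('cats', 0.8099379539489746),
                 ('dog', 0.7609456777572632),
                 ('kitten', 0.7464985251426697)]
print(pickReplacement("cat", catCandidates))  # dog
```

A substring test is still crude: it would also throw away a legitimate neighbor such as "category" for "cat", so treat it as a starting point rather than a real lemmatizer.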
The output is not total garbage, but it isn't great. However, it is not nearly as bad as the output of a GAN that has only just started training on character data alone. Most people end up feeding these vectors into a secondary learning algorithm rather than using them directly.

To get the vector associated with a word, you can index the main model object like a normal Python dictionary. The result is a length-300 vector whose numbers represent the "meaning" of the word.
```python
print(model["cat"].shape)
print(model["cat"])
# output:
# (300,)
# [ 0.0123291 0.20410156 -0.28515625 0.21679688 0.11816406 0.08300781
# 0.04980469 -0.00952148 0.22070312 -0.12597656 0.08056641 -0.5859375
# -0.00445557 -0.296875 -0.01312256 -0.08349609 0.05053711 0.15136719
# -0.44921875 -0.0135498 0.21484375 -0.14746094 0.22460938 -0.125
# -0.09716797 0.24902344 -0.2890625 0.36523438 0.41210938 -0.0859375
# -0.07861328 -0.19726562 -0.09082031 -0.14160156 -0.10253906 0.13085938
# -0.00346375 0.07226562 0.04418945 0.34570312 0.07470703 -0.11230469
# 0.06738281 0.11230469 0.01977539 -0.12353516 0.20996094 -0.07226562
# -0.02783203 0.05541992 -0.33398438 0.08544922 0.34375 0.13964844
# 0.04931641 -0.13476562 0.16308594 -0.37304688 0.39648438 0.10693359
# 0.22167969 0.21289062 -0.08984375 0.20703125 0.08935547 -0.08251953
# 0.05957031 0.10205078 -0.19238281 -0.09082031 0.4921875 0.03955078
# -0.07080078 -0.0019989 -0.23046875 0.25585938 0.08984375 -0.10644531
# 0.00105286 -0.05883789 0.05102539 -0.0291748 0.19335938 -0.14160156
# -0.33398438 0.08154297 -0.27539062 0.10058594 -0.10449219 -0.12353516
# -0.140625 0.03491211 -0.11767578 -0.1796875 -0.21484375 -0.23828125
# 0.08447266 -0.07519531 -0.25976562 -0.21289062 -0.22363281 -0.09716797
# 0.11572266 0.15429688 0.07373047 -0.27539062 0.14257812 -0.0201416
# 0.10009766 -0.19042969 -0.09375 0.14160156 0.17089844 0.3125
# -0.16699219 -0.08691406 -0.05004883 -0.24902344 -0.20800781 -0.09423828
# -0.12255859 -0.09472656 -0.390625 -0.06640625 -0.31640625 0.10986328
# -0.00156403 0.04345703 0.15625 -0.18945312 -0.03491211 0.03393555
# -0.14453125 0.01611328 -0.14160156 -0.02392578 0.01501465 0.07568359
# 0.10742188 0.12695312 0.10693359 -0.01184082 -0.24023438 0.0291748
# 0.16210938 0.19921875 -0.28125 0.16699219 -0.11621094 -0.25585938
# 0.38671875 -0.06640625 -0.4609375 -0.06176758 -0.14453125 -0.11621094
# 0.05688477 0.03588867 -0.10693359 0.18847656 -0.16699219 -0.01794434
# 0.10986328 -0.12353516 -0.16308594 -0.14453125 0.12890625 0.11523438
# 0.13671875 0.05688477 -0.08105469 -0.06152344 -0.06689453 0.27929688
# -0.19628906 0.07226562 0.12304688 -0.20996094 -0.22070312 0.21386719
# -0.1484375 -0.05932617 0.05224609 0.06445312 -0.02636719 0.13183594
# 0.19433594 0.27148438 0.18652344 0.140625 0.06542969 -0.14453125
# 0.05029297 0.08837891 0.12255859 0.26757812 0.0534668 -0.32226562
# -0.20703125 0.18164062 0.04418945 -0.22167969 -0.13769531 -0.04174805
# -0.00286865 0.04077148 0.07275391 -0.08300781 0.08398438 -0.3359375
# -0.40039062 0.01757812 -0.18652344 -0.0480957 -0.19140625 0.10107422
# 0.09277344 -0.30664062 -0.19921875 -0.0168457 0.12207031 0.14648438
# -0.12890625 -0.23535156 -0.05371094 -0.06640625 0.06884766 -0.03637695
# 0.2109375 -0.06005859 0.19335938 0.05151367 -0.05322266 0.02893066
# -0.27539062 0.08447266 0.328125 0.01818848 0.01495361 0.04711914
# 0.37695312 -0.21875 -0.03393555 0.01116943 0.36914062 0.02160645
# 0.03466797 0.07275391 0.16015625 -0.16503906 -0.296875 0.15039062
# -0.29101562 0.13964844 0.00448608 0.171875 -0.21972656 0.09326172
# -0.19042969 0.01599121 -0.09228516 0.15722656 -0.14160156 -0.0534668
# 0.03613281 0.23632812 -0.15136719 -0.00689697 -0.27148438 -0.07128906
# -0.16503906 0.18457031 -0.08398438 0.18554688 0.11669922 0.02758789
# -0.04760742 0.17871094 0.06542969 -0.03540039 0.22949219 0.02697754
# -0.09765625 0.26953125 0.08349609 -0.13085938 -0.10107422 -0.00738525
# 0.07128906 0.14941406 -0.20605469 0.18066406 -0.15820312 0.05932617
# 0.28710938 -0.04663086 0.15136719 0.4921875 -0.27539062 0.05615234]
```
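
As one illustration of feeding these vectors into a secondary model, a common baseline is to average the word vectors of a sentence into a single fixed-length feature vector. The sketch below is not something this post does: it assumes a plain dictionary of made-up 4-dimensional vectors (`toy_vectors`) in place of the real model, and `sentence_vector` is a hypothetical helper.

```python
import numpy as np

# Made-up 4-dimensional vectors standing in for the 300-dimensional model.
toy_vectors = {
    "cats": np.array([0.9, 0.1, 0.0, 0.2]),
    "chase": np.array([0.1, 0.8, 0.3, 0.0]),
    "mice": np.array([0.7, 0.2, 0.1, 0.4]),
}

def sentence_vector(sentence, vectors, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    if not known:
        return np.zeros(dim)  # no known words: fall back to the zero vector
    return np.mean(known, axis=0)

# "purple" has no vector here, so only three words contribute to the mean.
features = sentence_vector("Cats chase purple mice", toy_vectors, dim=4)
print(features.shape)  # (4,)
print(features)
```

With the real model, the same membership test (`w in model`) and lookup (`model[w]`) should work in place of the dictionary, and the resulting 300-number vector could be handed to any classifier that expects fixed-length input.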
## Visualization

One cool thing that the vectorized versions of words enable us to do is visualize similarities between concepts. We can construct a correlation matrix (here, a matrix of pairwise similarity scores) and display it as a heatmap.
```python
import numpy as np

def createCorrelationMatrix(words):
    l = len(words)
    # np.float was removed in NumPy 1.24; the built-in float is equivalent.
    matrix = np.empty((l, l), float)
    for r in range(0, l):
        for c in range(0, l):
            matrix[r][c] = model.similarity(words[r], words[c])
    return matrix

testMatrix = ["cat", "dog", "computer"]
print(createCorrelationMatrix(testMatrix))
# output:
# [[1. 0.76094574 0.17324439]
#  [0.76094574 0.99999994 0.12194333]
#  [0.17324439 0.12194333 1. ]]
```
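
Since `model.similarity` is a cosine similarity, the same kind of matrix can also be computed in one step: scale every vector to unit length, then multiply the stacked vectors by their own transpose. The sketch below uses invented 3-dimensional vectors rather than the real model, and `cosine_matrix` is a hypothetical helper.

```python
import numpy as np

def cosine_matrix(vectors):
    """Pairwise cosine similarities for the rows of a 2-D array."""
    vectors = np.asarray(vectors, dtype=float)
    # Scale every row to unit length so a dot product equals the cosine.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# Invented vectors for illustration; the rows play the roles of
# "cat", "dog", and "computer".
toy = [[0.9, 0.8, 0.1],
       [0.8, 0.3, 0.2],
       [0.1, 0.05, 0.95]]
sim = cosine_matrix(toy)
print(np.round(sim, 2))
```

With the real model, the input would be something like `[model[w] for w in words]`. For three words the double loop is perfectly fine; the matrix form mostly pays off when comparing hundreds of words at once.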
```python
from matplotlib import pyplot as plt

def displayMap(a):
    # Draw the matrix as a grid of colored cells.
    plt.imshow(a, cmap='hot', interpolation='nearest')
    plt.show()

displayMap(createCorrelationMatrix(testMatrix))
```
![png](media/word-embeddings/output_11_0.png)

Examining this heatmap, we can see that "dog" and "cat" are more similar to each other than either is to "computer"; however, the heatmap isn't the best. We can use matplotlib to jazz up the heatmap and make it more exciting.
```python
import numpy as np
import matplotlib.ticker
from matplotlib import pyplot as plt

def heatmap(data, row_labels, col_labels, ax=None):
    # Fall back to the current axes only when none is passed in.
    if ax is None:
        ax = plt.gca()
    im = ax.imshow(data, cmap="YlGn")

    # Create colorbar
    cbar = ax.figure.colorbar(im, ax=ax)
    cbar.ax.set_ylabel("Correlation", rotation=-90, va="bottom")

    # We want to show all ticks...
    ax.set_xticks(np.arange(data.shape[1]))
    ax.set_yticks(np.arange(data.shape[0]))
    # ... and label them with the respective list entries.
    ax.set_xticklabels(col_labels)
    ax.set_yticklabels(row_labels)

    # Let the horizontal axes labeling appear on top.
    ax.tick_params(top=True, bottom=False,
                   labeltop=True, labelbottom=False)

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=-30, ha="right",
             rotation_mode="anchor")

    # Turn spines off and create white grid.
    for edge, spine in ax.spines.items():
        spine.set_visible(False)

    ax.set_xticks(np.arange(data.shape[1]+1)-.5, minor=True)
    ax.set_yticks(np.arange(data.shape[0]+1)-.5, minor=True)
    ax.grid(which="minor", color="w", linestyle='-', linewidth=3)
    ax.tick_params(which="minor", bottom=False, left=False)

    return im, cbar

def annotate_heatmap(im, data=None, valfmt="{x:.2f}",
                     textcolors=("black", "white"),
                     threshold=None, **textkw):
    if not isinstance(data, (list, np.ndarray)):
        data = im.get_array()

    # Normalize the threshold to the image's color range.
    if threshold is not None:
        threshold = im.norm(threshold)
    else:
        threshold = im.norm(data.max())/2.

    # Set default alignment to center, but allow it to be
    # overwritten by textkw.
    kw = dict(horizontalalignment="center",
              verticalalignment="center")
    kw.update(textkw)

    # Get the formatter in case a string is supplied
    if isinstance(valfmt, str):
        valfmt = matplotlib.ticker.StrMethodFormatter(valfmt)

    # Loop over the data and create a `Text` for each "pixel".
    # Change the text's color depending on the data.
    texts = []
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            kw.update(color=textcolors[int(im.norm(data[i, j]) > threshold)])
            # Pass kw through so the alignment and color actually apply.
            text = im.axes.text(j, i, valfmt(data[i, j], None), **kw)
            texts.append(text)

    return texts

def plotWordCorrelations(words):
    fig, ax = plt.subplots(figsize=(10,10))
    matrix = createCorrelationMatrix(words)
    im, cbar = heatmap(matrix, words, words, ax=ax)
    texts = annotate_heatmap(im, valfmt="{x:.2f}")
    fig.tight_layout()
    # Save before show(): after the window closes the figure may be empty.
    plt.savefig(str(len(words)) + '.png')
    plt.show()

plotWordCorrelations(["cat", "dog", "computer"])
```
![png](media/word-embeddings/output_14_1.png)

```python
plotWordCorrelations(["good", "bad", "salty", "candy", "santa", "christmas"])
```

![png](media/word-embeddings/output_15_1.png)
The annotated version of the correlation matrix gives us more insight into the similarities of words. You might be thinking, "wow, Santa and Christmas are related, who would have known!" However, embeddings are already used for more abstract things like people, graphs, books, etc. In these more advanced examples, seeing the correlation between objects would be more insightful. Moreover, these measures of correlation are more useful to machine learning algorithms since they can use them to learn concepts faster.

Visualizing the entire dataset directly would be infeasible since it has 300 dimensions! On a graph, we usually only plot up to three dimensions. To visualize this model, we need to reduce the dimensionality.

I planned on doing this with t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a standard method for visualizing high-dimensional data.
```python
from sklearn.manifold import TSNE  # final reduction
import numpy as np  # array handling

def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []  # positions in vector space
    labels = []  # keep track of words to label our data again later
    # gensim 4.x lists the vocabulary in index_to_key;
    # on gensim 3.x iterate over model.vocab instead.
    for word in model.index_to_key:
        vectors.append(model[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(model)
```
![out of swap space](media/word-embeddings/swap.png)

When I was halfway through running this, my computer ran out of swap space (space on the HDD that the computer uses when it is out of RAM). In a future blog post, I may create my own embedding that is lower in dimensionality and see if I can use this visualization method.
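
One workaround, not tried in this post, would be to run t-SNE on a random sample of the vocabulary instead of every word. The back-of-the-envelope sketch below shows why the full run is so heavy and how small a sample is by comparison. It assumes the 3,000,000-entry vocabulary size listed on the word2vec project page and 4-byte floats; `embedding_bytes` and `sample_vocabulary` are hypothetical helpers, and `toy_vocab` stands in for the real word list.

```python
import random

def embedding_bytes(num_words, dims, bytes_per_value=4):
    """Size of a dense float32 matrix with one row per word."""
    return num_words * dims * bytes_per_value

def sample_vocabulary(words, sample_size, seed=0):
    """Pick a reproducible random subset of the vocabulary."""
    rng = random.Random(seed)
    words = list(words)
    return rng.sample(words, min(sample_size, len(words)))

full = embedding_bytes(3_000_000, 300)  # assumed size of the full vocabulary
small = embedding_bytes(10_000, 300)    # a 10,000-word sample
print(f"full matrix:   {full / 1e9:.1f} GB")   # full matrix:   3.6 GB
print(f"sample matrix: {small / 1e6:.1f} MB")  # sample matrix: 12.0 MB

toy_vocab = [f"word{i}" for i in range(100)]  # stand-in for the real vocabulary
subset = sample_vocabulary(toy_vocab, 10)
print(len(subset))  # 10
```

The full figure counts only one copy of the vectors; `reduce_dimensions` builds a second copy in a Python list before t-SNE allocates anything of its own. Gensim's `load_word2vec_format` also accepts a `limit` argument that loads only the first N vectors, which may be an even simpler way to shrink the problem.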