Let's get right into this post! Before you read it, be sure to check out my original post on [word embeddings](https://jrtechs.net/data-science/word-embeddings). In this post, we will be using data from this blog to create and visualize a word embedding.

To recap, let's first examine a small example using the [Gensim](https://pypi.org/project/gensim/) Word2Vec model:
```python
from gensim.models import Word2Vec

# defines some dummy data
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"], ["man", "say", "dam"]]

# creates and trains model
model = Word2Vec(min_count=1, size=10)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

# saves the model so that you can load it later with Word2Vec.load
model.save("basic-word2vec.model")
```
With this word2vec model, we can do things like find the most similar words to a given word and retrieve a word's vector.
```python
model.wv.most_similar("dog")

[('woof', 0.3232297897338867),
 ('dam', 0.14384251832962036),
 ('man', 0.11316978931427002),
 ('cat', -0.06251632422208786),
 ('say', -0.1781214326620102),
 ('meow', -0.21009384095668793)]
```
```python
print(model.wv.get_vector("dog"))

[ 0.04777663  0.01543251 -0.04632503  0.03601828 -0.00572644  0.00553683
 -0.04476452 -0.0274465   0.0047655   0.00508591]
```
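The KeyedVectors object also supports a few other quick queries; here is a small illustrative sketch using the dummy vocabulary (the exact value will differ from run to run):

```python
# cosine similarity between two word vectors
print(model.wv.similarity("cat", "dog"))
```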
Using t-SNE, we can reduce the dimensionality of the word vectors so that we can visualize them in two dimensions.
```python
from sklearn.decomposition import IncrementalPCA  # initial reduction
from sklearn.manifold import TSNE                 # final reduction
import numpy as np

def reduce_dimensions(model):
    num_dimensions = 2  # final num dimensions (2D, 3D, etc)

    vectors = []  # positions in vector space
    labels = []   # keep track of words to label our data again later
    for word in model.wv.vocab:
        vectors.append(model.wv[word])
        labels.append(word)

    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(model)
```
```python
print(labels)
print(x_vals)
print(y_vals)

['cat' 'say' 'meow' 'dog' 'woof' 'man' 'dam']
[-29.594002, -45.996586, 20.368856, 53.92877, -12.437127, 3.9659712, 37.524284]
[60.112713, 11.891685, 70.019325, 31.70431, -26.423267, 21.79772, -16.517805]
```
For simplicity, we use a basic matplotlib scatter plot to visualize the embedding. Since it gets messy to display every label on larger models, we only annotate a randomly chosen subset of points. In this first example, however, we can label everything.
```python
import matplotlib.pyplot as plt
import random

def plot_with_matplotlib(x_vals, y_vals, labels, num_to_label):
    plt.figure(figsize=(5, 5))
    plt.scatter(x_vals, y_vals)
    plt.title("Embedding Space")

    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, num_to_label)
    for i in selected_indices:
        plt.annotate(labels[i], (x_vals[i], y_vals[i]))
    plt.savefig('ex.png')

plot_with_matplotlib(x_vals, y_vals, labels, 7)
```
![png](media/word-embeddings/output_9_0.png)
Since my blog is written in markdown files, we need to strip all the code blocks out of each post so that our model does not get fooled by things that are not "English."
```python
def process_file(fileName):
    result = ""
    inCodeBlock = False
    with open(fileName) as file:
        for line in file:
            if line.startswith("```"):
                inCodeBlock = not inCodeBlock
            elif inCodeBlock:
                pass
            else:
                for word in line.split():
                    if "http" not in word and "media/" not in word:
                        result = result + " " + word
    return result

print(process_file("data/ackermann-function-written-in-java.md"))

"The Ackermann function is a classic example of a function that is not primitive recursive – you cannot solve it using loops like Fibonacci. In other words, you have to use recursion to solve for values of the Ackermann function. For more information on the Ackermann function [click"
```
Although this script takes us most of the way there by removing the code blocks, we still want to tokenize the text, lowercase it, and strip out punctuation. The Gensim library has a utility function, `simple_preprocess`, that does that for us.
```python
from gensim import utils

print(utils.simple_preprocess(process_file("data/ackermann-function-written-in-java.md")))

['the', 'ackermann', 'function', 'is', 'classic', 'example', 'of', 'function', 'that', 'is', 'not', 'primitive', 'recursive', 'you', 'cannot', 'solve', 'it', 'using', 'loops', 'like', 'fibonacci', 'in', 'other', 'words', 'you', 'have', 'to', 'use', 'recursion', 'to', 'solve', 'for', 'values', 'of', 'the', 'ackermann', 'function', 'for', 'more', 'information', 'on', 'the', 'ackermann', 'function', 'click']
```
Loading everything into memory and feeding it in as one giant list like the dummy example, although possible, is neither feasible nor efficient for large data sets. In Gensim, it is typical to create a corpus: a collection of documents used for training. To generate the corpus from my prior blog posts, I process each post and store the cleaned version in a file.
```python
import os

file = open("jrtechs.cor", "w+")
for file_name in os.listdir("data"):
    file.write(process_file("data/" + file_name) + "\n")
file.close()
```
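As a quick, purely illustrative sanity check, we can confirm that each cleaned post ended up on its own line of the corpus file:

```python
# count the lines in the corpus file; each line should be one blog post
with open("jrtechs.cor") as corpus_file:
    print(sum(1 for _ in corpus_file))
```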
To read our text-based corpus, we create an iterable that yields each line of the text file. In our case, each line is an entire blog post.
```python
class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = "jrtechs.cor"
        for line in open(corpus_path):
            # assume there's one document per line, tokens separated by whitespace
            yield utils.simple_preprocess(line)
```
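Before training, we can verify the iterator behaves as expected by peeking at the first few tokens of the first document (a quick illustrative check):

```python
# grab the first document yielded by the corpus iterator
first_post = next(iter(MyCorpus()))
print(first_post[:10])
```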
Now we can create a new model using our custom corpus. We will also want to increase the embedding size and the number of epochs we train for.
```python
sentences = MyCorpus()
model = Word2Vec(min_count=1, size=20, sentences=sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=500)
model.save("jrtechs-word2vec-500.model")
```
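Because the model is saved to disk, it can be reloaded later without retraining; a short sketch using the file name from above:

```python
from gensim.models import Word2Vec

# load the previously saved model instead of retraining from scratch
model = Word2Vec.load("jrtechs-word2vec-500.model")
```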
With our model trained, we can run word similarity algorithms.
```python
model.wv.most_similar("however")

[('that', 0.7954616546630859),
 ('also', 0.7654829025268555),
 ('but', 0.7392309904098511),
 ('since', 0.7380496859550476),
 ('although', 0.7112778425216675),
 ('why', 0.7025969624519348),
 ('encouragement', 0.6704230308532715),
 ('because', 0.6692429184913635),
 ('faster', 0.6633850932121277),
 ('so', 0.6504907011985779)]
```
```python
model.wv.most_similar("method")

[('function', 0.8110014200210571),
 ('grouping', 0.7134513854980469),
 ('import', 0.7052735090255737),
 ('select', 0.6987707614898682),
 ('max', 0.6856316328048706),
 ('authentication', 0.6762576103210449),
 ('pseudo', 0.663935124874115),
 ('namespace', 0.6549093723297119),
 ('matrix', 0.6515613198280334),
 ('zero', 0.6360148191452026)]
```
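Beyond `most_similar`, we can also ask which word fits least in a group or measure the similarity between two specific words. A brief illustrative sketch using words that appear in the outputs above (results depend on the trained model):

```python
# which word fits least with the others?
print(model.wv.doesnt_match(["function", "method", "import", "encouragement"]))

# cosine similarity between two specific words
print(model.wv.similarity("method", "function"))
```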
```python
x_vals, y_vals, labels = reduce_dimensions(model)
```
If you run the t-SNE algorithm on models trained with different parameters, you will get various shapes in your output. I noticed that for my data, training for five epochs gave a curvy distribution, whereas training for over 100 epochs produced a single cluster shape (a sketch for reproducing the comparison follows the plots below).
```python
plot_with_matplotlib(x_vals, y_vals, labels, 10)
```
![embedding visualization](media/word-embeddings/ex1.png)

![embedding visualization](media/word-embeddings/output_24_0.png)
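To reproduce that epoch comparison, one could train a second model on the same corpus for only Gensim's default number of epochs and project it with the same helpers; this is just a sketch under those assumptions:

```python
# train a second model on the same corpus with the default number of epochs
model_few = Word2Vec(min_count=1, size=20, sentences=MyCorpus())

# reduce and plot it exactly like the 500-epoch model
x_vals_few, y_vals_few, labels_few = reduce_dimensions(model_few)
plot_with_matplotlib(x_vals_few, y_vals_few, labels_few, 10)
```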
If we zoom in on the embedding space, we can get a better idea of which words cluster close to each other. This visualization is not perfect because we are trying to project a 20-dimensional embedding down onto a two-dimensional graph.
```python
import matplotlib.pyplot as plt
import random

def plot_with_matplotlib(x_vals, y_vals, labels, xmin=3, xmax=7, ymin=3, ymax=7):
    plt.figure(figsize=(5, 5))
    plt.scatter(x_vals, y_vals)
    plt.title("Embedding Space")
    plt.xlim(xmin, xmax)
    plt.ylim(ymin, ymax)
    for x, y, l in zip(x_vals, y_vals, labels):
        plt.annotate(l, (x, y))
    plt.savefig('smallex.png')

plot_with_matplotlib(x_vals, y_vals, labels)
```
![embedding visualization](media/word-embeddings/smallex.png)
The results I got from this limited data set were impressive. The Google News model explored last time was far more accurate; however, it was quite large, and I was unable to run the t-SNE algorithm on it without crashing my computer. Training your own word embedding model is useful because words can mean different things in different applications and contexts. Right now, there is a significant push toward leveraging pre-trained models: they let you take Google's word2vec model trained on 100 billion words and tweak it with your own data to better fit your application. Maybe I'll make that part three of this blog post series on embeddings.
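As a point of reference, loading Google's pre-trained word2vec vectors in Gensim looks roughly like this (a sketch; the binary file must be downloaded separately and is several gigabytes):

```python
from gensim.models import KeyedVectors

# load the pre-trained Google News vectors (300-dimensional)
google_vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(google_vectors.most_similar("blog"))
```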