{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get right into it! But before you do, be sure to check out my original post on [word embeddings](https://jrtechs.net/data-science/word-embeddings). In this post we will use data from this blog to create and visualize a word embedding.\n",
    "\n",
    "To recap, let's first look at a small example using the [Gensim](https://pypi.org/project/gensim/) Word2Vec model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 156,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import Word2Vec\n",
    "\n",
    "# define some dummy data\n",
    "sentences = [[\"cat\", \"say\", \"meow\"], [\"dog\", \"say\", \"woof\"], [\"man\", \"say\", \"dam\"]]\n",
    "\n",
    "# create and train the model (gensim 3.x API; `size` became `vector_size` in gensim 4)\n",
    "model = Word2Vec(min_count=1, size=10)\n",
    "model.build_vocab(sentences)\n",
    "model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)\n",
    "\n",
    "# save the model so that you can load it later with Word2Vec.load\n",
    "model.save(\"basic-word2vec.model\")"
   ]
  },
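  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As the comment above notes, the saved model can be loaded back from disk. A minimal sketch of that round trip, assuming the file from the previous cell is in the working directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reload the model we just saved; the returned object is a full Word2Vec model\n",
    "model = Word2Vec.load(\"basic-word2vec.model\")"
   ]
  },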
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this word2vec model we can do things like query the most similar words and get the vector representation of a word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('woof', 0.3232297897338867),\n",
       " ('dam', 0.14384251832962036),\n",
       " ('man', 0.11316978931427002),\n",
       " ('cat', -0.06251632422208786),\n",
       " ('say', -0.1781214326620102),\n",
       " ('meow', -0.21009384095668793)]"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.wv.most_similar(\"dog\")"
   ]
  },
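  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Along the same lines, `wv.similarity` scores how similar any specific pair of words is:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cosine similarity between the embeddings of two words\n",
    "model.wv.similarity(\"cat\", \"dog\")"
   ]
  },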
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 0.04777663  0.01543251 -0.04632503  0.03601828 -0.00572644  0.00553683\n",
      " -0.04476452 -0.0274465   0.0047655   0.00508591]\n"
     ]
    }
   ],
   "source": [
    "print(model.wv.get_vector(\"dog\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using t-SNE, we can reduce the dimensionality of the word vectors down to something we can visualize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 158,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import IncrementalPCA  # initial reduction (unused in this simplified version)\n",
    "from sklearn.manifold import TSNE  # final reduction\n",
    "import numpy as np\n",
    "\n",
    "def reduce_dimensions(model):\n",
    "    num_dimensions = 2  # final number of dimensions (2D, 3D, etc.)\n",
    "\n",
    "    vectors = []  # positions in vector space\n",
    "    labels = []   # keep track of words to label our data again later\n",
    "    for word in model.wv.vocab:  # gensim 3.x API; use model.wv.key_to_index in gensim 4+\n",
    "        vectors.append(model.wv[word])\n",
    "        labels.append(word)\n",
    "\n",
    "    # convert both lists into numpy arrays for reduction\n",
    "    vectors = np.asarray(vectors)\n",
    "    labels = np.asarray(labels)\n",
    "\n",
    "    # reduce using t-SNE\n",
    "    tsne = TSNE(n_components=num_dimensions, random_state=0)\n",
    "    vectors = tsne.fit_transform(vectors)\n",
    "\n",
    "    x_vals = [v[0] for v in vectors]\n",
    "    y_vals = [v[1] for v in vectors]\n",
    "    return x_vals, y_vals, labels\n",
    "\n",
    "\n",
    "x_vals, y_vals, labels = reduce_dimensions(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['cat' 'say' 'meow' 'dog' 'woof' 'man' 'dam']\n",
      "[-29.594002, -45.996586, 20.368856, 53.92877, -12.437127, 3.9659712, 37.524284]\n",
      "[60.112713, 11.891685, 70.019325, 31.70431, -26.423267, 21.79772, -16.517805]\n"
     ]
    }
   ],
   "source": [
    "print(labels)\n",
    "print(x_vals)\n",
    "print(y_vals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For simplicity, we use a basic matplotlib scatter plot to visualize the embedding. Since it gets messy to draw every label on larger models, we only annotate a random subset of the points; on this initial example, however, we can label everything."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 161,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "<base64 PNG data omitted: 'Embedding Space' scatter plot with all seven words annotated>",
      "text/plain": [
       "<Figure size 360x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import random\n",
    "\n",
    "def plot_with_matplotlib(x_vals, y_vals, labels, num_to_label):\n",
    "    plt.figure(figsize=(5, 5))\n",
    "    plt.scatter(x_vals, y_vals)\n",
    "    plt.title(\"Embedding Space\")\n",
    "    indices = list(range(len(labels)))\n",
    "    selected_indices = random.sample(indices, num_to_label)\n",
    "    for i in selected_indices:\n",
    "        plt.annotate(labels[i], (x_vals[i], y_vals[i]))\n",
    "    plt.savefig('ex.png')\n",
    "\n",
    "plot_with_matplotlib(x_vals, y_vals, labels, 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since my blog is written in markdown files, we need a way to strip all the code blocks out of each post so that our model does not get fooled by tokens that are not \"English.\""
   ]
  },
  195. {
  196. "cell_type": "code",
  197. "execution_count": 118,
  198. "metadata": {},
  199. "outputs": [
  200. {
  201. "name": "stdout",
  202. "output_type": "stream",
  203. "text": [
  204. " The Ackermann function is a classic example of a function that is not primitive recursive – you cannot solve it using loops like Fibonacci. In other words, you have to use recursion to solve for values of the Ackermann function. For more information on the Ackermann function [click\n"
  205. ]
  206. }
  207. ],
  208. "source": [
  209. "def process_file(fileName):\n",
  210. " result = \"\"\n",
  211. " tempResult = \"\"\n",
  212. " inCodeBlock = False\n",
  213. "\n",
  214. " with open(fileName) as file:\n",
  215. " for line in file:\n",
  216. " if line.startswith(\"```\"):\n",
  217. " inCodeBlock = not inCodeBlock\n",
  218. " elif inCodeBlock:\n",
  219. " pass\n",
  220. " else:\n",
  221. " for word in line.split():\n",
  222. " if \"http\" not in word and \"media/\"not in word:\n",
  223. " result = result + \" \" + word\n",
  224. " return result\n",
  225. "\n",
  226. "print(process_file(\"data/ackermann-function-written-in-java.md\"))"
  227. ]
  228. },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although this script takes us most of the way there by removing the code blocks, we still want to lowercase the text, tokenize it, and strip out punctuation. The gensim library has a function, `simple_preprocess`, that already does that for us. (Note that it does not remove stop words like \"the\" and \"is\", as you can see in the output below.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the', 'ackermann', 'function', 'is', 'classic', 'example', 'of', 'function', 'that', 'is', 'not', 'primitive', 'recursive', 'you', 'cannot', 'solve', 'it', 'using', 'loops', 'like', 'fibonacci', 'in', 'other', 'words', 'you', 'have', 'to', 'use', 'recursion', 'to', 'solve', 'for', 'values', 'of', 'the', 'ackermann', 'function', 'for', 'more', 'information', 'on', 'the', 'ackermann', 'function', 'click']\n"
     ]
    }
   ],
   "source": [
    "from gensim import utils\n",
    "\n",
    "print(utils.simple_preprocess(process_file(\"data/ackermann-function-written-in-java.md\")))"
   ]
  },
  253. },
  254. {
  255. "cell_type": "markdown",
  256. "metadata": {},
  257. "source": [
  258. "Loading everything in as memory and feeding it into a large list like the dummy example, although possible, is not feasable or efficient for large data sets. In Gensim it is typical to create a Corpus: a corpus is a collection of documents used for training. To create the corpus baised on my prior blog posts, I need to process each blog post and store the cleaned version in a file."
  259. ]
  260. },
  261. {
  262. "cell_type": "code",
  263. "execution_count": 49,
  264. "metadata": {},
  265. "outputs": [],
  266. "source": [
  267. "import os\n",
  268. "file = open(\"jrtechs.cor\", \"w+\")\n",
  269. "for file_name in os.listdir(\"data\"):\n",
  270. " file.write(process_file(\"data/\" + file_name) + \"\\n\")\n",
  271. "file.close()"
  272. ]
  273. },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read our plain-text corpus back, we create an iterable that simply yields each line of the file. In our case each line is an entire blog post."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "class MyCorpus(object):\n",
    "    \"\"\"An iterator that yields sentences (lists of str).\"\"\"\n",
    "\n",
    "    def __iter__(self):\n",
    "        corpus_path = \"jrtechs.cor\"\n",
    "        for line in open(corpus_path):\n",
    "            # assume there's one document per line, tokens separated by whitespace\n",
    "            yield utils.simple_preprocess(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a new model using our custom corpus. The main things we will want to tune are the embedding size and the number of epochs we train the model for."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = MyCorpus()\n",
    "\n",
    "# passing sentences= already trains once during construction;\n",
    "# the explicit train() call below continues training for 500 more epochs\n",
    "model = Word2Vec(min_count=1, size=20, sentences=sentences)\n",
    "model.train(sentences, total_examples=model.corpus_count, epochs=500)\n",
    "model.save(\"jrtechs-word2vec-500.model\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('that', 0.7954616546630859),\n",
       " ('also', 0.7654829025268555),\n",
       " ('but', 0.7392309904098511),\n",
       " ('since', 0.7380496859550476),\n",
       " ('although', 0.7112778425216675),\n",
       " ('why', 0.7025969624519348),\n",
       " ('encouragement', 0.6704230308532715),\n",
       " ('because', 0.6692429184913635),\n",
       " ('faster', 0.6633850932121277),\n",
       " ('so', 0.6504907011985779)]"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.wv.most_similar(\"however\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 139,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('function', 0.8110014200210571),\n",
       " ('grouping', 0.7134513854980469),\n",
       " ('import', 0.7052735090255737),\n",
       " ('select', 0.6987707614898682),\n",
       " ('max', 0.6856316328048706),\n",
       " ('authentication', 0.6762576103210449),\n",
       " ('pseudo', 0.663935124874115),\n",
       " ('namespace', 0.6549093723297119),\n",
       " ('matrix', 0.6515613198280334),\n",
       " ('zero', 0.6360148191452026)]"
      ]
     },
     "execution_count": 139,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.wv.most_similar(\"method\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 140,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_vals, y_vals, labels = reduce_dimensions(model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you run the t-SNE algorithm on models trained with different parameters, you will get different shapes in your output. I noticed that for my data, training for 5 epochs produced a curvy shape, whereas training for over 100 epochs produced a single cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 155,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "<base64 PNG data omitted: 'Embedding Space' scatter plot with 10 randomly labeled points>",
      "text/plain": [
       "<Figure size 360x360 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "plot_with_matplotlib(x_vals, y_vals, labels, 10)"
   ]
  },
  413. },
  414. {
  415. "cell_type": "markdown",
  416. "metadata": {},
  417. "source": [
  418. "If we zoom in on the embedding space, we can get a better idea of what clusters close to each other. This visualization of course is not perfect because we are trying to visualize a 15 dimensions down into a two dimensional graph. "
  419. ]
  420. },
  421. {
  422. "cell_type": "code",
  423. "execution_count": 152,
  424. "metadata": {},
  425. "outputs": [
  426. {
  427. "data": {
  428. "image/png": "iVBORw0KGgoAAAANSUhEUgAAJzAAACWCCAYAAAAyIseWAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy8li6FKAAAgAElEQVR4nOzOMQGAAAwDMKh/z+OshHIkCvLe3QMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPxD1gEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKCyDgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJV1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAqKwDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAZR0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAq6wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBZBwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgMo6AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVNYBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACgsg4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACVdQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKisAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQGUdAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABQWQcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIDKOgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFTWAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAoLIOAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlXUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACorAMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEBlHQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACrrAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAUFkHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAyjoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABU1gEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKCyDgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJV1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAqKwDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAZR0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAq6wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBZBwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgMo6AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVNYBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACgsg4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACVdQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKisAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQGUdAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABQWQcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIDKOgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFTWAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAoLIOAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlXUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACorAMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEBlHQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACrrAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAUFkHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAyjoAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABU1gEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKCyDgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAJV1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAqKwDAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAZR0AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAq6wAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFBZBwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAgMo6AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVNYBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACgsg4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACVdQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAK
isAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQGUdAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAKusAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABQWQcAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIDKOgAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAFTWAQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAoLIOAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAlXUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACorAMAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEBlHQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  429. "text/plain": [
  430. "<Figure size 360x360 with 1 Axes>"
  431. ]
  432. },
  433. "metadata": {
  434. "needs_background": "light"
  435. },
  436. "output_type": "display_data"
  437. }
  438. ],
  439. "source": [
  440. "import matplotlib.pyplot as plt\n",
  441. "import random\n",
  442. "\n",
  443. "def plot_with_matplotlib(x_vals, y_vals, labels, xmin=3, xmax=7, ymin=3, ymax=7):\n",
  444. " plt.figure(figsize=(5, 5))\n",
  445. " plt.scatter(x_vals, y_vals)\n",
  446. " plt.title(\"Embedding Space\")\n",
  447. " plt.xlim(xmin, xmax)\n",
  448. " plt.ylim(ymin, ymax)\n",
  449. " \n",
  450. " for x, y, l in zip(x_vals, y_vals, labels):\n",
  451. " plt.annotate(l, (x, y))\n",
  452. " plt.savefig('smallex.png')\n",
  453. " \n",
  454. "plot_with_matplotlib(x_vals, y_vals, labels)"
  455. ]
  456. },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results I got on my limited data set were impressive. The Google News dataset explored last time was far more accurate; however, that model was so large that I was unable to run the t-SNE algorithm on it without crashing my computer. Training your own word embedding model is useful because words can mean different things in different applications and contexts. Right now there is a large push toward leveraging pre-trained models: you can take Google's word2vec model, trained on 100 billion words, and tweak it with your own data to better fit your application. Maybe I'll make that part three of this blog post series on embeddings."
   ]
  },
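  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a teaser, loading a pre-trained model in Gensim only takes a couple of lines. This is a minimal sketch, assuming you have downloaded the Google News vectors (GoogleNews-vectors-negative300.bin) into the working directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import KeyedVectors\n",
    "\n",
    "# load the pre-trained vectors; limit= caps how many words are read\n",
    "# so the full 3-million-word vocabulary doesn't exhaust memory\n",
    "pretrained = KeyedVectors.load_word2vec_format(\n",
    "    \"GoogleNews-vectors-negative300.bin\", binary=True, limit=500000)\n",
    "\n",
    "# query it the same way as the wv attribute of our own models\n",
    "pretrained.most_similar(\"dog\")"
   ]
  }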
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}