edit changes to steam graphs blog post

5 years ago · 7df50ac5c8
--- a/blogContent/posts/data-science/node2vec-with-steam-data.md
+++ b/blogContent/posts/data-science/node2vec-with-steam-data.md
@@ -1,5 +1,5 @@
 Graph algorithms!!!
-Working with graphs can be a great deal of fun, but, sometimes we just want some cold hard vectors to do some good old fashion machine learning.
+Working with graphs can be a great deal of fun, but sometimes we just want some cold hard vectors to do some good old-fashioned machine learning.
 This post looks at the famous [node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) algorithm used to quantize graph data.
 The example I'm giving in this blog post uses data from my recently resurrected [steam graph project](https://jrtechs.net/projects/steam-friends-graph). 
@@ -22,20 +22,20 @@ Visualization of graph data brings us to our topic today:
 # Node2Vec
 Node2Vec is an embedding algorithm inspired by [word2vec](https://jrtechs.net/data-science/word-embeddings).
-The goal of this algorithm is to covert every node in a graph into a vectorized output where points close in the latent space correspond to related nodes.
+This algorithm aims to covert every node in a graph into a vectorized output where points close in the latent space correspond to related nodes.
 Simplifying a few things: node2vec uses the notion of a biased random walk through the graph.
-Two parameters -P and Q- defines whether we want to favor a BFS vs. a DFS traversal. A BFS search will give us a better local view of the network where a DFS traversal will provide us with a more global view of the graph. Applying some maths to the random walker outputs, distance in this embedding space is correlated to the probability that two nodes co-occur on the same random walk over the network. 
+Two hyperparameters(P and Q) defines whether we want to favor a BFS vs. a DFS traversal. A BFS search will give us a better local view of the network, where a DFS traversal will provide us with a more global view of the graph. After applying some maths to the random walker outputs, distance in this embedding space is correlated to the probability that two nodes co-occur on the same random walk over the network. 
-If you have the time, I urge you to watch [Jure Leskovec's](https://scholar.google.com/citations?user=Q_kKkIUAAAAJ&hl=en) lectures on graph learning on Youtube:
+If you have the time, I urge you to watch [Jure Leskovec's](https://scholar.google.com/citations?user=Q_kKkIUAAAAJ&hl=en) Stanford lectures on graph learning on Youtube:
 <youtube src="YrhBZUtgG4E" />
-Additionally, if you want to dive deeper into graph learning, I suggest that you dig through the Stanford [CS-224 course page](https://github.com/jrtechs/cs224w-notes).
+Additionally, if you want to dive deeper into graph learning, I suggest that you dig through the Stanford [CS-224 course page on Github](https://github.com/jrtechs/cs224w-notes).
 # Python Node2Vec Code
 I'm using a simple implementation of Node2Vec that I found on GitHub: [aditya-grover/node2vec](https://github.com/aditya-grover/node2vec).
-I'm using this package because it is a faithful implementation to the original paper and doesn't require you to install a lot of dependencies. This was written in Python 2, so to get Python 3 support, you will need to merge in changes from someone's fork because the maintainer is not reviewing any of the pull requests.
+I'm using this package because it is a faithful implementation of the original paper and doesn't require you to install too many dependencies. This project was written in Python 2, so to get Python 3 support, you will need to merge in changes from someone's fork because the maintainer is not reviewing any of the pull requests.
 ```
 git clone https://github.com/aditya-grover/node2vec
@@ -47,17 +47,17 @@ git pull python3 updated_requirements
 Pip makes installing the dependencies easy using a requirements file.
-```python
+```Python
 !pip install -r node2vec/requirements.txt
 ```
 Before we run this algorithm, we need to generate some data.
 An example edge list is in the [git repo](https://github.com/aditya-grover/node2vec/blob/master/graph/karate.edgelist). 
-Using the JanusGraph database I'm using for the [Steam graphs project](https://github.com/jrtechs/SteamFriendsGraph), I generated an edge list for a single player's network.
-The code is straightforward; first, I pull the steam ids of the people that I want in my embedding graph.
-Second, I pull all the friend connections for the people that I want in the graph, and I filter out any players that are not already in the network.
-Finally, I take all the friend relationships and generate the edge pairs and save it to a file. If you are not familiar with data streaming in Java, I suggest that you check out my [last blog post](https://jrtechs.net/java/fun-with-functional-java).
+Using the JanusGraph database I'm using for the [Steam graphs project](https://github.com/jrtechs/SteamFriendsGraph), I generated an edge list of a single player's network.
+The code is straightforward; first, I pull the steam ids of the people we want in our embedding graph.
+Second, I pull all the friend connections for the people we want in the graph and filter out any players that are not already in the network.
+Finally, I take all the friend relationships and generate the edge pairs and save them to a file. If you are not familiar with Java's data streaming, I suggest that you check out my [last blog post on functional programming in Java](https://jrtechs.net/java/fun-with-functional-java).
 ```java
 Set<String> importantEdges = graph
@@ -92,7 +92,6 @@ List edges = edgeList.keySet()
 WrappedFileWriter.writeToFileFromList(edges, outFile);
 ```
 Using the edge list generated in the prior script, I feed it into the node2vec program.
@@ -108,7 +107,7 @@ output:
 ```
 Once we have our embedding file, we can load it into Python to do machine learning or visualization.
-The output of the node2vec algorithm is a sequence of lines where each line starts with the node label, and the rest of the line is the embedding vector.
+The node2vec algorithm's output is a sequence of lines where each line starts with the node label, and the rest of the line is the embedding vector.
 ```python
 labels=[]
@@ -121,9 +120,9 @@ with open("output/jrtechs2.emd") as fp:
        labels.append(line.split()[0])        
 ```
-Right now, I am interested in visualizing the output. However, that is impractical since it has 50 dimensions! Using the TSNE method, we can reduce the dimensionality so that we can visualize it.
+Right now, I am interested in visualizing the output. However, that is impractical since it has 50 dimensions! Using the TSNE method, we can reduce the dimensionality so that we can visualize it. Alternatively, we could use another algorithm like Principal Component Analysis (PCA).
-```python
+```Python
 from sklearn.decomposition import IncrementalPCA    # inital reduction
 from sklearn.manifold import TSNE                   # final reduction
 import numpy as np                      
@@ -146,7 +145,7 @@ def reduce_dimensions(labels, vectors, num_dimensions=2):
 x_vals, y_vals, labels = reduce_dimensions(labels, vectors)
 ```
-Before we visualize our data, we will want to grab the name of the steam users because right now, all we have is their steam id, which is just a unique number.
+Before we visualize our data, we will want to grab the steam users' name because right now, all we have is their steam id, which is just a unique number.
 Back in our JanusGraph, we can quickly export the steam ids mapped to the players' names.
 ```java
@@ -173,7 +172,7 @@ with open("friendsMap.map") as fp:
        name_map[line.split()[0]] = line.split()[1]
 ```
-```python
+```Python
 name_map
    {'76561198188400721': 'jrtechs',
@@ -183,9 +182,9 @@ name_map
     '76561198121369685': 'DataFrogman'}
 ```
-Using the output from the TSNE dimensionality reduction, we can view all the nodes on a single plot. To make the graph look more delightful, we only label a fraction of the nodes.
+Using the TSNE dimensionality reduction output, we can view all the nodes on a single plot. To make the graph look more delightful, we only label a fraction of the nodes.
-```python
+```Python
 import matplotlib.pyplot as plt
 import random
@@ -205,8 +204,9 @@ plot_with_matplotlib(x_vals, y_vals, labels, 12)
 ![algorithm output showing embedding](media/steamNode2vec/output_9_0.png)
 This graph may not look exciting, but I assure you that it is.
-Just eyeballing it, I can notice that my high school friends and college friends are in different regions of the graph.
+I can notice that my high school friends and college friends are in different graphical regions, just from eyeballing it.
 Moving forward with this, I plan on incorporating game data.
 With the graph data vectorized in this fashion, it becomes possible to start employing classification and link prediction algorithms.
-Some examples for the steam network could be a friend or game recommendation system, or even a community detection algorithm.
+Some steam network examples could be a friend or game recommendation system, or even a community detection algorithm.
 Although this post only went over a shallow encoder, future work may use a Graph Convolutional Neural Network (GCN) to incorporate user features.