Graph algorithms!!!

Working with graphs can be a great deal of fun, but sometimes we just want some cold, hard vectors to do some good old-fashioned machine learning.
This post looks at the famous [node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) algorithm, which embeds graph data into vectors.
The example in this blog post uses data from my recently resurrected [steam graph project](https://jrtechs.net/projects/steam-friends-graph).
If you live under a rock, [Steam](https://store.steampowered.com/) is a platform where users can purchase, manage, and play games with friends.
Although there is a ton of data within the Steam network, I am only interested in the graphs formed by connecting users, friends, and games.
My updated visualization of a friendship network looks like this:



I'm also working on visualizations that show both friends and their games.



The issue I'm currently running into is that these graphs quickly become egregiously large. Players on Steam frequently have 100+ friends and own over 50 games; one person I indexed somehow had over 400 games!
The size of the graph balloons exponentially with traversal depth.
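To get a feel for the growth, here is a quick back-of-envelope sketch. The 100-friend average is an assumed figure for illustration only:

```python
# Rough back-of-envelope numbers: if the average player has ~100 friends
# (an assumed figure for illustration), a crawl of depth d touches on the
# order of 100**d profiles before any de-duplication.
avg_friends = 100

for depth in range(1, 4):
    print(f"depth {depth}: ~{avg_friends ** depth:,} profiles")
```

Even a depth-3 crawl is already in the millions of profiles.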
The sheer scale of the graph data makes it challenging to visualize concisely.
The problem of visualizing graph data brings us to today's topic:
# Node2Vec
Node2Vec is an embedding algorithm inspired by [word2vec](https://jrtechs.net/data-science/word-embeddings).
The goal of the algorithm is to convert every node in a graph into a vector such that points close together in the latent space correspond to related nodes.
Simplifying a few things: node2vec uses the notion of a biased random walk through the graph.
Two parameters, p and q, define whether we favor a BFS-like or a DFS-like traversal. A BFS-like walk gives us a better local view of the network, while a DFS-like walk provides a more global view of the graph. Applying some math to the random walker's output, distance in the embedding space becomes correlated with the probability that two nodes co-occur on the same random walk over the network.
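The biased walk itself is simple enough to sketch in a few lines. This is an illustrative toy version, not the optimized alias-sampling implementation from the paper, and it ignores edge weights; the function and graph names are mine:

```python
import random

# Minimal sketch of a node2vec-style biased walk on an unweighted graph.
# `graph` is an adjacency dict. Small p makes returning to the previous
# node likely (BFS-like, local); small q pushes the walk outward (DFS-like).
def biased_walk(graph, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:              # step back to where we came from
                weights.append(1 / p)
            elif nxt in graph[prev]:     # stays local to the previous node
                weights.append(1)
            else:                        # moves outward in the graph
                weights.append(1 / q)
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

graph = {"a": ["b", "c"], "b": ["a", "c", "d"],
         "c": ["a", "b"], "d": ["b"]}
print(biased_walk(graph, "a", 6))
```

node2vec then feeds many such walks into word2vec's skip-gram model, treating each walk as a "sentence" of node IDs.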
If you have the time, I urge you to watch [Jure Leskovec's](https://scholar.google.com/citations?user=Q_kKkIUAAAAJ&hl=en) lectures on graph learning on YouTube:

<youtube src="YrhBZUtgG4E" />

Additionally, if you want to dive deeper into graph learning, I suggest digging through the Stanford [CS-224 course page](https://github.com/jrtechs/cs224w-notes).
# Python Node2Vec Code
I'm using a simple implementation of node2vec that I found on GitHub: [aditya-grover/node2vec](https://github.com/aditya-grover/node2vec).
I chose this package because it is a faithful implementation of the original paper and doesn't require installing many dependencies. It was written in Python 2, so to get Python 3 support, you will need to merge in changes from someone's fork, because the maintainer is not reviewing any of the pull requests.
```
git clone https://github.com/aditya-grover/node2vec
cd node2vec
git remote add python3 https://github.com/mcwehner/node2vec
git pull python3 master
git pull python3 updated_requirements
```
Pip makes installing the dependencies easy using a requirements file.

```python
!pip install -r node2vec/requirements.txt
```
Before we run the algorithm, we need to generate some data.
An example edge list is in the [git repo](https://github.com/aditya-grover/node2vec/blob/master/graph/karate.edgelist).
Using the JanusGraph database from the [Steam graphs project](https://github.com/jrtechs/SteamFriendsGraph), I generated an edge list for a single player's network.
The code is straightforward: first, I pull the Steam IDs of the people I want in my embedding graph.
Second, I pull all the friend connections for those people, filtering out any players that are not already in the network.
Finally, I take all the friend relationships, generate the edge pairs, and save them to a file. If you are not familiar with streams in Java, I suggest checking out my [last blog post](https://jrtechs.net/java/fun-with-functional-java).
```java
Set<String> importantEdges = graph
        .getPlayer(baseID)
        .getFriends()
        .parallelStream()
        .map(Player::getId)
        .collect(Collectors.toSet());

Map<String, Set<String>> edgeList = new HashMap<String, Set<String>>()
{{
    importantEdges.forEach(f ->
        put(f,
            graph.getPlayer(f)
                .getFriends()
                .parallelStream()
                .map(Player::getId)
                .filter(importantEdges::contains)
                .collect(Collectors.toSet()))
    );
}};

List<String> edges = edgeList.keySet()
        .parallelStream()
        .map(k ->
            edgeList.get(k)
                .stream()
                .map(k2 -> k + " " + k2)
                .collect(Collectors.toList()))
        .flatMap(Collection::stream)
        .collect(Collectors.toList());

WrappedFileWriter.writeToFileFromList(edges, outFile);
```
Using the edge list generated by the prior script, I feed it into the node2vec program.

```python
!python node2vec/src/main.py --input jrtechs.edgelist --output output/jrtechs2.emd --num-walks=40 --dimensions=50
```

Output:

```
Walk iteration:
1 / 40
...
40 / 40
```
Once we have our embedding file, we can load it into Python for machine learning or visualization.
The output of the node2vec algorithm is a sequence of lines where each line starts with the node label and the rest of the line is the embedding vector.

```python
labels = []
vectors = []

with open("output/jrtechs2.emd") as fp:
    for line in fp:
        tokens = line.split()
        labels.append(tokens[0])                      # node label (steam id)
        vectors.append(list(map(float, tokens[1:])))  # embedding vector
```
Right now, I am interested in visualizing the output. However, that is impractical to do directly, since the embedding has 50 dimensions! Using the t-SNE method, we can reduce the dimensionality to something we can plot.
```python
from sklearn.manifold import TSNE  # dimensionality reduction
import numpy as np

def reduce_dimensions(labels, vectors, num_dimensions=2):
    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(labels, vectors)
```
Before we visualize the data, we want to grab the names of the Steam users, because right now all we have are their Steam IDs, which are just unique numbers.
Back in JanusGraph, we can quickly export the Steam IDs mapped to the players' names.
```java
Player player = graph.getPlayer(id);

List<String> names = new ArrayList<String>()
{{
    add(id + " " + player.getName());
    addAll(
        player.getFriends()
            .stream()
            .map(p -> p.getId() + " " + p.getName())
            .collect(Collectors.toList())
    );
}};

WrappedFileWriter.writeToFileFromList(names, "friendsMap.map");
```
We then create a map in Python linking each Steam ID to the player's name.

```python
name_map = {}

with open("friendsMap.map") as fp:
    for line in fp:
        tokens = line.split()
        name_map[tokens[0]] = tokens[1]  # note: assumes names contain no spaces
```
```python
name_map

{'76561198188400721': 'jrtechs',
 '76561198049526995': 'Noosh',
 ...
 '76561198065642391': 'Therefore',
 '76561198121369685': 'DataFrogman'}
```
Using the output of the t-SNE dimensionality reduction, we can view all the nodes on a single plot. To keep the graph legible, we only label a fraction of the nodes.

```python
import matplotlib.pyplot as plt
import random

def plot_with_matplotlib(x_vals, y_vals, labels, num_to_label):
    plt.figure(figsize=(5, 5))
    plt.scatter(x_vals, y_vals)
    plt.title("Embedding Space")

    # annotate a random sample of points with player names
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, num_to_label)
    for i in selected_indices:
        plt.annotate(name_map[labels[i]], (x_vals[i], y_vals[i]))

    plt.savefig('ex.png')

plot_with_matplotlib(x_vals, y_vals, labels, 12)
```
|  | |||||
This graph may not look exciting, but I assure you that it is.
Just eyeballing it, I can see that my high school friends and my college friends land in different regions of the graph.
Moving forward, I plan on incorporating game data.
With the graph data vectorized in this fashion, it becomes possible to employ classification and link prediction algorithms.
Some examples for the Steam network would be a friend or game recommendation system, or even community detection.
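As a taste of what that could look like, here is a hedged sketch of a nearest-neighbor friend recommender over the embedding space. In practice `labels` and `vectors` would come from parsing the `.emd` file as above; the tiny vectors and names here are made up for illustration, and the function name is mine:

```python
import numpy as np

# Sketch: recommend candidate friends by cosine similarity in the
# embedding space. Nodes that co-occurred on many random walks end up
# close together, so high similarity suggests a plausible missing link.
def recommend(target, labels, vectors, top_n=2):
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    idx = labels.index(target)
    sims = vecs @ vecs[idx]            # cosine similarity to the target
    order = np.argsort(-sims)          # most similar first
    return [labels[i] for i in order if labels[i] != target][:top_n]

labels = ["alice", "bob", "carol", "dave"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(recommend("alice", labels, vectors))  # → ['bob', 'dave']
```

A real recommender would also filter out existing friends before ranking, but the core idea is just nearest neighbors in the latent space.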