Graph algorithms!!!

Working with graphs can be a great deal of fun, but sometimes we just want some cold, hard vectors to do some good old-fashioned machine learning.
This post looks at the famous [node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) algorithm, which embeds graph data into vectors.
The example in this blog post uses data from my recently resurrected [steam graph project](https://jrtechs.net/projects/steam-friends-graph).
If you live under a rock, [Steam](https://store.steampowered.com/) is a platform where users can purchase, manage, and play games with friends.
Although there is a ton of data within the Steam network, I am only interested in the graphs formed by connecting users, friends, and games.
My updated visualization of a friendship network looks like this:



I'm also working on visualizations that show both friends and their games.



The issue I'm currently running into is that these graphs quickly become egregiously large. Players on Steam frequently have 100+ friends and own over 50 games; one person I indexed somehow had over 400 games!
The size of the graph balloons exponentially with traversal depth.
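To get a feel for the growth, here is a quick back-of-envelope sketch. The 100-friend average is an assumed figure for illustration only:

```python
# Rough back-of-envelope numbers: if the average player has ~100 friends
# (an assumed figure for illustration), a crawl of depth d touches on the
# order of 100**d profiles before any de-duplication.
avg_friends = 100

for depth in range(1, 4):
    print(f"depth {depth}: ~{avg_friends ** depth:,} profiles")
```

Even a depth-3 crawl is already in the millions of profiles.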
The sheer scale of the graph data makes it challenging to visualize concisely.
The problem of visualizing graph data brings us to today's topic:
# Node2Vec
Node2Vec is an embedding algorithm inspired by [word2vec](https://jrtechs.net/data-science/word-embeddings).
The goal of the algorithm is to convert every node in a graph into a vector such that points close together in the latent space correspond to related nodes.
Simplifying a few things: node2vec uses the notion of a biased random walk through the graph.
Two parameters, p and q, define whether we favor a BFS-like or a DFS-like traversal. A BFS-like walk gives us a better local view of the network, while a DFS-like walk provides a more global view of the graph. Applying some math to the random walker's output, distance in the embedding space becomes correlated with the probability that two nodes co-occur on the same random walk over the network.
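The biased walk itself is simple enough to sketch in a few lines. This is an illustrative toy version, not the optimized alias-sampling implementation from the paper, and it ignores edge weights; the function and graph names are mine:

```python
import random

# Minimal sketch of a node2vec-style biased walk on an unweighted graph.
# `graph` is an adjacency dict. Small p makes returning to the previous
# node likely (BFS-like, local); small q pushes the walk outward (DFS-like).
def biased_walk(graph, start, length, p=1.0, q=1.0):
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if len(walk) == 1:
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:              # step back to where we came from
                weights.append(1 / p)
            elif nxt in graph[prev]:     # stays local to the previous node
                weights.append(1)
            else:                        # moves outward in the graph
                weights.append(1 / q)
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

graph = {"a": ["b", "c"], "b": ["a", "c", "d"],
         "c": ["a", "b"], "d": ["b"]}
print(biased_walk(graph, "a", 6))
```

node2vec then feeds many such walks into word2vec's skip-gram model, treating each walk as a "sentence" of node IDs.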
If you have the time, I urge you to watch [Jure Leskovec's](https://scholar.google.com/citations?user=Q_kKkIUAAAAJ&hl=en) lectures on graph learning on YouTube:

<youtube src="YrhBZUtgG4E" />

Additionally, if you want to dive deeper into graph learning, I suggest digging through the Stanford [CS-224 course page](https://github.com/jrtechs/cs224w-notes).
# Python Node2Vec Code
I'm using a simple implementation of node2vec that I found on GitHub: [aditya-grover/node2vec](https://github.com/aditya-grover/node2vec).
I chose this package because it is a faithful implementation of the original paper and doesn't require installing many dependencies. It was written in Python 2, so to get Python 3 support, you will need to merge in changes from someone's fork, because the maintainer is not reviewing any of the pull requests.
```
git clone https://github.com/aditya-grover/node2vec
cd node2vec
git remote add python3 https://github.com/mcwehner/node2vec
git pull python3 master
git pull python3 updated_requirements
```
Pip makes installing the dependencies easy using a requirements file.

```python
!pip install -r node2vec/requirements.txt
```
Before we run the algorithm, we need to generate some data.
An example edge list is in the [git repo](https://github.com/aditya-grover/node2vec/blob/master/graph/karate.edgelist).
Using the JanusGraph database from the [Steam graphs project](https://github.com/jrtechs/SteamFriendsGraph), I generated an edge list for a single player's network.
The code is straightforward: first, I pull the Steam IDs of the people I want in my embedding graph.
Second, I pull all the friend connections for those people, filtering out any players that are not already in the network.
Finally, I take all the friend relationships, generate the edge pairs, and save them to a file. If you are not familiar with streams in Java, I suggest checking out my [last blog post](https://jrtechs.net/java/fun-with-functional-java).
```java
Set<String> importantEdges = graph
        .getPlayer(baseID)
        .getFriends()
        .parallelStream()
        .map(Player::getId)
        .collect(Collectors.toSet());

Map<String, Set<String>> edgeList = new HashMap<String, Set<String>>()
{{
    importantEdges.forEach(f ->
        put(f,
            graph.getPlayer(f)
                .getFriends()
                .parallelStream()
                .map(Player::getId)
                .filter(importantEdges::contains)
                .collect(Collectors.toSet()))
    );
}};

List<String> edges = edgeList.keySet()
        .parallelStream()
        .map(k ->
            edgeList.get(k)
                .stream()
                .map(k2 -> k + " " + k2)
                .collect(Collectors.toList()))
        .flatMap(Collection::stream)
        .collect(Collectors.toList());

WrappedFileWriter.writeToFileFromList(edges, outFile);
```
Using the edge list generated by the prior script, I feed it into the node2vec program.

```python
!python node2vec/src/main.py --input jrtechs.edgelist --output output/jrtechs2.emd --num-walks=40 --dimensions=50
```

Output:

```
Walk iteration:
1 / 40
...
40 / 40
```
Once we have our embedding file, we can load it into Python for machine learning or visualization.
The output of the node2vec algorithm is a sequence of lines where each line starts with the node label and the rest of the line is the embedding vector.

```python
labels = []
vectors = []

with open("output/jrtechs2.emd") as fp:
    for line in fp:
        tokens = line.split()
        labels.append(tokens[0])                      # node label (steam id)
        vectors.append(list(map(float, tokens[1:])))  # embedding vector
```
Right now, I am interested in visualizing the output. However, that is impractical to do directly, since the embedding has 50 dimensions! Using the t-SNE method, we can reduce the dimensionality to something we can plot.
```python
from sklearn.manifold import TSNE  # dimensionality reduction
import numpy as np

def reduce_dimensions(labels, vectors, num_dimensions=2):
    # convert both lists into numpy arrays for reduction
    vectors = np.asarray(vectors)
    labels = np.asarray(labels)

    # reduce using t-SNE
    tsne = TSNE(n_components=num_dimensions, random_state=0)
    vectors = tsne.fit_transform(vectors)

    x_vals = [v[0] for v in vectors]
    y_vals = [v[1] for v in vectors]
    return x_vals, y_vals, labels

x_vals, y_vals, labels = reduce_dimensions(labels, vectors)
```
Before we visualize the data, we want to grab the names of the Steam users, because right now all we have are their Steam IDs, which are just unique numbers.
Back in JanusGraph, we can quickly export the Steam IDs mapped to the players' names.
```java
Player player = graph.getPlayer(id);

List<String> names = new ArrayList<String>()
{{
    add(id + " " + player.getName());
    addAll(
        player.getFriends()
            .stream()
            .map(p -> p.getId() + " " + p.getName())
            .collect(Collectors.toList())
    );
}};

WrappedFileWriter.writeToFileFromList(names, "friendsMap.map");
```
We then create a map in Python linking each Steam ID to the player's name.

```python
name_map = {}

with open("friendsMap.map") as fp:
    for line in fp:
        tokens = line.split()
        name_map[tokens[0]] = tokens[1]  # note: assumes names contain no spaces
```
```python
name_map

{'76561198188400721': 'jrtechs',
 '76561198049526995': 'Noosh',
 ...
 '76561198065642391': 'Therefore',
 '76561198121369685': 'DataFrogman'}
```
Using the output of the t-SNE dimensionality reduction, we can view all the nodes on a single plot. To keep the graph legible, we only label a fraction of the nodes.

```python
import matplotlib.pyplot as plt
import random

def plot_with_matplotlib(x_vals, y_vals, labels, num_to_label):
    plt.figure(figsize=(5, 5))
    plt.scatter(x_vals, y_vals)
    plt.title("Embedding Space")

    # annotate a random sample of points with player names
    indices = list(range(len(labels)))
    selected_indices = random.sample(indices, num_to_label)
    for i in selected_indices:
        plt.annotate(name_map[labels[i]], (x_vals[i], y_vals[i]))

    plt.savefig('ex.png')

plot_with_matplotlib(x_vals, y_vals, labels, 12)
```
|  | |||||
This graph may not look exciting, but I assure you that it is.
Just eyeballing it, I can see that my high school friends and my college friends land in different regions of the graph.
Moving forward, I plan on incorporating game data.
With the graph data vectorized in this fashion, it becomes possible to employ classification and link prediction algorithms.
Some examples for the Steam network would be a friend or game recommendation system, or even community detection.
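As a taste of what that could look like, here is a hedged sketch of a nearest-neighbor friend recommender over the embedding space. In practice `labels` and `vectors` would come from parsing the `.emd` file as above; the tiny vectors and names here are made up for illustration, and the function name is mine:

```python
import numpy as np

# Sketch: recommend candidate friends by cosine similarity in the
# embedding space. Nodes that co-occurred on many random walks end up
# close together, so high similarity suggests a plausible missing link.
def recommend(target, labels, vectors, top_n=2):
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit vectors
    idx = labels.index(target)
    sims = vecs @ vecs[idx]            # cosine similarity to the target
    order = np.argsort(-sims)          # most similar first
    return [labels[i] for i in order if labels[i] != target][:top_n]

labels = ["alice", "bob", "carol", "dave"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(recommend("alice", labels, vectors))  # → ['bob', 'dave']
```

A real recommender would also filter out existing friends before ranking, but the core idea is just nearest neighbors in the latent space.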