Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 

20 KiB

Last week I scrapped a bunch of data from the Steam API using my Steam Graph Project. This project captures steam users, their friends, and the games that they own. Using the Janus-Graph traversal object, I use the Gremlin graph query language to pull this data. Since I am storing the hours played in a game as a property on the relationship between a player and a game node, I had to make a "join" statement to get the hours property with the game information in a single query.

Object o = graph.con.getTraversal()
    .V()
    .hasLabel(Game.KEY_DB)
    .match(
            __.as("c").values(Game.KEY_STEAM_GAME_ID).as("gameID"),
            __.as("c").values(Game.KEY_GAME_NAME).as("gameName"),
            __.as("c").inE(Game.KEY_RELATIONSHIP).values(Game.KEY_PLAY_TIME).as("time")
    ).select("gameID", "time", "gameName").toList();
WrappedFileWriter.writeToFile(new Gson().toJson(o).toLowerCase(), "games.json");

Using the game indexing property on the players, I noted that I only ended up wholly indexing the games of 481 players after 8 hours.

graph.con.getTraversal()
    .V()
    .hasLabel(SteamGraph.KEY_PLAYER)
    .has(SteamGraph.KEY_CRAWLED_GAME_STATUS, 1)
    .count().next()

We now transition to Python and Matlptlib to visualize the data exported from our JanusGraph Query as a JSON object. The dependencies for this notebook can get installed using pip.

!pip install pandas
!pip install matplotlib
    Collecting pandas
      Downloading pandas-1.0.5-cp38-cp38-manylinux1_x86_64.whl (10.0 MB)
         |████████████████████████████████| 10.0 MB 4.3 MB/s eta 0:00:01
    [?25hCollecting pytz>=2017.2
      Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
         |████████████████████████████████| 510 kB 2.9 MB/s eta 0:00:01
    [?25hRequirement already satisfied: numpy>=1.13.3 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (1.18.5)
    Requirement already satisfied: python-dateutil>=2.6.1 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (2.8.1)
    Requirement already satisfied: six>=1.5 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
    Installing collected packages: pytz, pandas
    Successfully installed pandas-1.0.5 pytz-2020.1

The first thing we are doing is importing our JSON data as a pandas data frame. Pandas is an open-source data analysis and manipulation tool. I enjoy pandas because it has native integration with matplotlib and supports operations like aggregations and groupings.

import matplotlib.pyplot as plt
import pandas as pd

games_df = pd.read_json('games.json')
games_df

gameid time gamename
0 210770 243 sanctum 2
1 210770 31 sanctum 2
2 210770 276 sanctum 2
3 210770 147 sanctum 2
4 210770 52 sanctum 2
... ... ... ...
36212 9800 9 death to spies
36213 445220 0 avorion
36214 445220 25509 avorion
36215 445220 763 avorion
36216 445220 3175 avorion

36217 rows × 3 columns

Using the built-in matplotlib wrapper function, we can graph a histogram of the number of hours played in a game.

ax = games_df.hist(column='time', bins=20, range=(0, 4000))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

png

Notice that the vast majority of the games are rarely ever played, however, it is skewed to the right with a lot of outliers. We can change the scale to make it easier to view using the range parameter.

ax = games_df.hist(column='time', bins=20, range=(0, 100))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

png

If we remove games that have never been played, the distribution looks more reasonable.

ax = games_df.hist(column='time', bins=20, range=(2, 100))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

png

Although histograms are useful, viewing the CDF is often more helpful since it is easier to extract numerical information.

ax = games_df.hist(column='time',density=True, range=(0, 2000),  histtype='step',cumulative=True)
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

png

According to this graph, about 80% of people on steam who own a game, play it under 4 hours. Nearly half of all downloaded or purchased steam games go un-played. This data is a neat example of the legendary 80/20 principle -- aka the Pareto principle. The Pareto principle states that roughly 80% of the effects come from 20% of the causes. IE: 20% of software bugs result in 80% of debugging time.

As mentioned earlier, the time in owned game distribution is heavily skewed to the right.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time'], vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Minutes")
ax.set_yticks([])
plt.show()

png

When zooming in on the distribution, we see that nearly half of all the purchased games go un-opened.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
ax.set_xlim([0, 10])
plt.show()

png

Viewing the aggregate pool of hours in particular game data is insightful; however, comparing different games against each other is more interesting. In pandas, after we create a grouping on a column, we can aggregate it into metrics such as max, min, mean, etc. I am also sorting the data I get by count since we are more interested in "popular" games.

stats_df = (games_df.groupby("gamename")
                    .agg({'time': ['count', "min", 'max', 'mean']})
                    .sort_values(by=('time', 'count')))
stats_df

time
count min max mean
gamename
龙魂时刻 1 14 14 14.000000
gryphon knight epic 1 0 0 0.000000
growing pains 1 0 0 0.000000
shoppy mart: steam edition 1 0 0 0.000000
ground pounders 1 0 0 0.000000
... ... ... ... ...
payday 2 102 0 84023 5115.813725
team fortress 2 105 7 304090 25291.180952
unturned 107 0 16974 1339.757009
garry's mod 121 0 311103 20890.314050
counter-strike: global offensive 129 0 506638 46356.209302

9235 rows × 4 columns

To prevent one-off esoteric games that I don't have a lot of data for, throwing off metrics, I am disregarding any games that I have less than ten values for.

stats_df = stats_df[stats_df[('time', 'count')] > 10]
stats_df

time
count min max mean
gamename
serious sam hd: the second encounter 11 0 329 57.909091
grim fandango remastered 11 0 248 35.000000
evga precision x1 11 0 21766 2498.181818
f.e.a.r. 2: project origin 11 0 292 43.272727
transistor 11 0 972 298.727273
... ... ... ... ...
payday 2 102 0 84023 5115.813725
team fortress 2 105 7 304090 25291.180952
unturned 107 0 16974 1339.757009
garry's mod 121 0 311103 20890.314050
counter-strike: global offensive 129 0 506638 46356.209302

701 rows × 4 columns

We see that the average, the playtime per player per game, is about 5 hours. However, as noted before, most purchased games go un-played.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(stats_df[('time', 'mean')]/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Mean Game Play in Hours")
ax.set_xlim([0, 40])
ax.set_yticks([])
plt.show()

png

I had a hunch that more popular games got played more; however, this dataset is still too small the verify this hunch.

stats_df.plot.scatter(x=('time', 'count'), y=('time', 'mean'))

png

We can create a new filtered data frame that only contains the result of a single game to graph it. 
cc_df = games_df[games_df['gamename'] == "counter-strike: global offensive"]
cc_df

gameid time gamename
13196 730 742 counter-strike: global offensive
13197 730 16019 counter-strike: global offensive
13198 730 1781 counter-strike: global offensive
13199 730 0 counter-strike: global offensive
13200 730 0 counter-strike: global offensive
... ... ... ...
13320 730 3867 counter-strike: global offensive
13321 730 174176 counter-strike: global offensive
13322 730 186988 counter-strike: global offensive
13323 730 103341 counter-strike: global offensive
13324 730 10483 counter-strike: global offensive

129 rows × 3 columns

It is shocking how many hours certain people play in Counter-Strike. The highest number in the dataset was 8,444 hours or 352 days!

ax = plt.gca()
ax.set_title('Game Play Distribution for Counter-Strike')
ax.boxplot(cc_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()

png

Viewing the distribution for a different game like Unturned, yields a vastly different distribution than Counter-Strike. I believe the key difference is that Counter-Strike gets played competitively, where Unturned is a more leisurely game. Competitive gamers likely skew the distribution of Counter-Strike to be very high.

u_df = games_df[games_df['gamename'] == "unturned"]
u_df

gameid time gamename
167 304930 140 unturned
168 304930 723 unturned
169 304930 1002 unturned
170 304930 1002 unturned
171 304930 0 unturned
... ... ... ...
269 304930 97 unturned
270 304930 768 unturned
271 304930 1570 unturned
272 304930 23 unturned
273 304930 115 unturned

107 rows × 3 columns

ax = plt.gca()
ax.set_title('Game Play Distribution for Unturned')
ax.boxplot(u_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()

png

Next, I made a data frame just containing the raw data points of games that had an aggregate count of over 80. For the crawl sample size that I did, having a count of 80 would make the game "popular." Since we only have 485 players indexed, having over 80 entries implies that over 17% of people indexed had the game. It is easy to verify that the games returned were very popular by glancing at the results.

df1 = games_df[games_df['gamename'].map(games_df['gamename'].value_counts()) > 80]
df1['time'] = df1['time']/60
df1

gameid time gamename
167 304930 2.333333 unturned
168 304930 12.050000 unturned
169 304930 16.700000 unturned
170 304930 16.700000 unturned
171 304930 0.000000 unturned
... ... ... ...
22682 578080 51.883333 playerunknown's battlegrounds
22683 578080 47.616667 playerunknown's battlegrounds
22684 578080 30.650000 playerunknown's battlegrounds
22685 578080 170.083333 playerunknown's battlegrounds
22686 578080 399.950000 playerunknown's battlegrounds

1099 rows × 3 columns

ax = df1.boxplot(column=["time"], by='gamename', notch=True, vert=False)
fig = ax.get_figure()
fig.suptitle('')
ax.set_title('Play-time Distribution')
plt.xlabel("Hours Played")
ax.set_xlim([0, 2000])
plt.ylabel("Game")
plt.savefig("playTimes.png", dpi=300, bbox_inches = "tight")

png

Overall it is fascinating to see how the distributions for different games vary. In the future, I will re-run some of these analytics with even more data and possibly put them on my website as an interactive graph.