jrtechs-NodeJSBlog

Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

20 KiB

Raw Blame History

Last week I scrapped a bunch of data from the Steam API using my Steam Graph Project. This project captures steam users, their friends, and the games that they own. Using the Janus-Graph traversal object, I use the Gremlin graph query language to pull this data. Since I am storing the hours played in a game as a property on the relationship between a player and a game node, I had to make a "join" statement to get the hours property with the game information in a single query.

Object o = graph.con.getTraversal()
    .V()
    .hasLabel(Game.KEY_DB)
    .match(
            __.as("c").values(Game.KEY_STEAM_GAME_ID).as("gameID"),
            __.as("c").values(Game.KEY_GAME_NAME).as("gameName"),
            __.as("c").inE(Game.KEY_RELATIONSHIP).values(Game.KEY_PLAY_TIME).as("time")
    ).select("gameID", "time", "gameName").toList();
WrappedFileWriter.writeToFile(new Gson().toJson(o).toLowerCase(), "games.json");

Using the game indexing property on the players, I noted that I only ended up wholly indexing the games of 481 players after 8 hours.

graph.con.getTraversal()
    .V()
    .hasLabel(SteamGraph.KEY_PLAYER)
    .has(SteamGraph.KEY_CRAWLED_GAME_STATUS, 1)
    .count().next()

We now transition to Python and Matlptlib to visualize the data exported from our JanusGraph Query as a JSON object. The dependencies for this notebook can get installed using pip.

!pip install pandas
!pip install matplotlib

    Collecting pandas
      Downloading pandas-1.0.5-cp38-cp38-manylinux1_x86_64.whl (10.0 MB)
    [K     |████████████████████████████████| 10.0 MB 4.3 MB/s eta 0:00:01
    [?25hCollecting pytz>=2017.2
      Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
    [K     |████████████████████████████████| 510 kB 2.9 MB/s eta 0:00:01
    [?25hRequirement already satisfied: numpy>=1.13.3 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (1.18.5)
    Requirement already satisfied: python-dateutil>=2.6.1 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (2.8.1)
    Requirement already satisfied: six>=1.5 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
    Installing collected packages: pytz, pandas
    Successfully installed pandas-1.0.5 pytz-2020.1

The first thing we are doing is importing our JSON data as a pandas data frame. Pandas is an open-source data analysis and manipulation tool. I enjoy pandas because it has native integration with matplotlib and supports operations like aggregations and groupings.

import matplotlib.pyplot as plt
import pandas as pd

games_df = pd.read_json('games.json')
games_df

	gameid	time	gamename
0	210770	243	sanctum 2
1	210770	31	sanctum 2
2	210770	276	sanctum 2
3	210770	147	sanctum 2
4	210770	52	sanctum 2
...	...	...	...
36212	9800	9	death to spies
36213	445220	0	avorion
36214	445220	25509	avorion
36215	445220	763	avorion
36216	445220	3175	avorion

36217 rows × 3 columns

Using the built-in matplotlib wrapper function, we can graph a histogram of the number of hours played in a game.

ax = games_df.hist(column='time', bins=20, range=(0, 4000))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

Notice that the vast majority of the games are rarely ever played, however, it is skewed to the right with a lot of outliers. We can change the scale to make it easier to view using the range parameter.

ax = games_df.hist(column='time', bins=20, range=(0, 100))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

If we remove games that have never been played, the distribution looks more reasonable.

ax = games_df.hist(column='time', bins=20, range=(2, 100))
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

Although histograms are useful, viewing the CDF is often more helpful since it is easier to extract numerical information.

ax = games_df.hist(column='time',density=True, range=(0, 2000),  histtype='step',cumulative=True)
ax=ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")

According to this graph, about 80% of people on steam who own a game, play it under 4 hours. Nearly half of all downloaded or purchased steam games go un-played. This data is a neat example of the legendary 80/20 principle -- aka the Pareto principle. The Pareto principle states that roughly 80% of the effects come from 20% of the causes. IE: 20% of software bugs result in 80% of debugging time.

As mentioned earlier, the time in owned game distribution is heavily skewed to the right.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time'], vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Minutes")
ax.set_yticks([])
plt.show()

When zooming in on the distribution, we see that nearly half of all the purchased games go un-opened.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
ax.set_xlim([0, 10])
plt.show()

Viewing the aggregate pool of hours in particular game data is insightful; however, comparing different games against each other is more interesting. In pandas, after we create a grouping on a column, we can aggregate it into metrics such as max, min, mean, etc. I am also sorting the data I get by count since we are more interested in "popular" games.

stats_df = (games_df.groupby("gamename")
                    .agg({'time': ['count', "min", 'max', 'mean']})
                    .sort_values(by=('time', 'count')))
stats_df

	time
	count	min	max	mean
gamename
龙魂时刻	1	14	14	14.000000
gryphon knight epic	1	0	0	0.000000
growing pains	1	0	0	0.000000
shoppy mart: steam edition	1	0	0	0.000000
ground pounders	1	0	0	0.000000
...	...	...	...	...
payday 2	102	0	84023	5115.813725
team fortress 2	105	7	304090	25291.180952
unturned	107	0	16974	1339.757009
garry's mod	121	0	311103	20890.314050
counter-strike: global offensive	129	0	506638	46356.209302

9235 rows × 4 columns

To prevent one-off esoteric games that I don't have a lot of data for, throwing off metrics, I am disregarding any games that I have less than ten values for.

stats_df = stats_df[stats_df[('time', 'count')] > 10]
stats_df

	time
	count	min	max	mean
gamename
serious sam hd: the second encounter	11	0	329	57.909091
grim fandango remastered	11	0	248	35.000000
evga precision x1	11	0	21766	2498.181818
f.e.a.r. 2: project origin	11	0	292	43.272727
transistor	11	0	972	298.727273
...	...	...	...	...
payday 2	102	0	84023	5115.813725
team fortress 2	105	7	304090	25291.180952
unturned	107	0	16974	1339.757009
garry's mod	121	0	311103	20890.314050
counter-strike: global offensive	129	0	506638	46356.209302

701 rows × 4 columns

We see that the average, the playtime per player per game, is about 5 hours. However, as noted before, most purchased games go un-played.

ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(stats_df[('time', 'mean')]/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Mean Game Play in Hours")
ax.set_xlim([0, 40])
ax.set_yticks([])
plt.show()

I had a hunch that more popular games got played more; however, this dataset is still too small the verify this hunch.

stats_df.plot.scatter(x=('time', 'count'), y=('time', 'mean'))

We can create a new filtered data frame that only contains the result of a single game to graph it.

cc_df = games_df[games_df['gamename'] == "counter-strike: global offensive"]
cc_df

	gameid	time	gamename
13196	730	742	counter-strike: global offensive
13197	730	16019	counter-strike: global offensive
13198	730	1781	counter-strike: global offensive
13199	730	0	counter-strike: global offensive
13200	730	0	counter-strike: global offensive
...	...	...	...
13320	730	3867	counter-strike: global offensive
13321	730	174176	counter-strike: global offensive
13322	730	186988	counter-strike: global offensive
13323	730	103341	counter-strike: global offensive
13324	730	10483	counter-strike: global offensive

129 rows × 3 columns

It is shocking how many hours certain people play in Counter-Strike. The highest number in the dataset was 8,444 hours or 352 days!

ax = plt.gca()
ax.set_title('Game Play Distribution for Counter-Strike')
ax.boxplot(cc_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()

Viewing the distribution for a different game like Unturned, yields a vastly different distribution than Counter-Strike. I believe the key difference is that Counter-Strike gets played competitively, where Unturned is a more leisurely game. Competitive gamers likely skew the distribution of Counter-Strike to be very high.

u_df = games_df[games_df['gamename'] == "unturned"]
u_df

	gameid	time	gamename
167	304930	140	unturned
168	304930	723	unturned
169	304930	1002	unturned
170	304930	1002	unturned
171	304930	0	unturned
...	...	...	...
269	304930	97	unturned
270	304930	768	unturned
271	304930	1570	unturned
272	304930	23	unturned
273	304930	115	unturned

107 rows × 3 columns

ax = plt.gca()
ax.set_title('Game Play Distribution for Unturned')
ax.boxplot(u_df['time']/60, vert=False,manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()

Next, I made a data frame just containing the raw data points of games that had an aggregate count of over 80. For the crawl sample size that I did, having a count of 80 would make the game "popular." Since we only have 485 players indexed, having over 80 entries implies that over 17% of people indexed had the game. It is easy to verify that the games returned were very popular by glancing at the results.

df1 = games_df[games_df['gamename'].map(games_df['gamename'].value_counts()) > 80]
df1['time'] = df1['time']/60
df1

	gameid	time	gamename
167	304930	2.333333	unturned
168	304930	12.050000	unturned
169	304930	16.700000	unturned
170	304930	16.700000	unturned
171	304930	0.000000	unturned
...	...	...	...
22682	578080	51.883333	playerunknown's battlegrounds
22683	578080	47.616667	playerunknown's battlegrounds
22684	578080	30.650000	playerunknown's battlegrounds
22685	578080	170.083333	playerunknown's battlegrounds
22686	578080	399.950000	playerunknown's battlegrounds

1099 rows × 3 columns

ax = df1.boxplot(column=["time"], by='gamename', notch=True, vert=False)
fig = ax.get_figure()
fig.suptitle('')
ax.set_title('Play-time Distribution')
plt.xlabel("Hours Played")
ax.set_xlim([0, 2000])
plt.ylabel("Game")
plt.savefig("playTimes.png", dpi=300, bbox_inches = "tight")

Overall it is fascinating to see how the distributions for different games vary. In the future, I will re-run some of these analytics with even more data and possibly put them on my website as an interactive graph.

20 KiB Raw Blame History

20 KiB

Raw Blame History