diff --git a/blogContent/headerImages/playTimes.png b/blogContent/headerImages/playTimes.png
new file mode 100644
index 0000000..fd4256f
Binary files /dev/null and b/blogContent/headerImages/playTimes.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_11_1.png b/blogContent/posts/data-science/media/steamGames/output_11_1.png
new file mode 100644
index 0000000..1c2e3ed
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_11_1.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_13_1.png b/blogContent/posts/data-science/media/steamGames/output_13_1.png
new file mode 100644
index 0000000..8b1cbb0
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_13_1.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_15_1.png b/blogContent/posts/data-science/media/steamGames/output_15_1.png
new file mode 100644
index 0000000..8b86e94
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_15_1.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_17_0.png b/blogContent/posts/data-science/media/steamGames/output_17_0.png
new file mode 100644
index 0000000..8265d56
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_17_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_19_0.png b/blogContent/posts/data-science/media/steamGames/output_19_0.png
new file mode 100644
index 0000000..b422f60
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_19_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_25_0.png b/blogContent/posts/data-science/media/steamGames/output_25_0.png
new file mode 100644
index 0000000..487c5df
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_25_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_27_1.png b/blogContent/posts/data-science/media/steamGames/output_27_1.png
new file mode 100644
index 0000000..da75984
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_27_1.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_31_0.png b/blogContent/posts/data-science/media/steamGames/output_31_0.png
new file mode 100644
index 0000000..d3bbc94
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_31_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_34_0.png b/blogContent/posts/data-science/media/steamGames/output_34_0.png
new file mode 100644
index 0000000..24ec6be
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_34_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_37_0.png b/blogContent/posts/data-science/media/steamGames/output_37_0.png
new file mode 100644
index 0000000..985dc41
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_37_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_38_0.png b/blogContent/posts/data-science/media/steamGames/output_38_0.png
new file mode 100644
index 0000000..c515f84
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_38_0.png differ
diff --git a/blogContent/posts/data-science/media/steamGames/output_9_1.png b/blogContent/posts/data-science/media/steamGames/output_9_1.png
new file mode 100644
index 0000000..28d566c
Binary files /dev/null and b/blogContent/posts/data-science/media/steamGames/output_9_1.png differ
diff --git a/blogContent/posts/data-science/time-spent-in-steam-games.md b/blogContent/posts/data-science/time-spent-in-steam-games.md
new file mode 100644
index 0000000..6087528
--- /dev/null
+++ b/blogContent/posts/data-science/time-spent-in-steam-games.md
@@ -0,0 +1,900 @@
+Last week I scraped a bunch of data from the Steam API using my [Steam Graph Project](https://github.com/jrtechs/SteamFriendsGraph).
+This project captures Steam users, their friends, and the games that they own.
+Using the JanusGraph traversal object, I use the Gremlin graph query language to pull this data.
+Since I am storing the time played in a game as a property on the relationship between a player and a game node, I had to make a "join" statement to get the play-time property together with the game information in a single query.
+
+```java
+// For every game vertex, bind its ID, its name, and the play time
+// stored on each incoming player-to-game edge, then select all three.
+Object o = graph.con.getTraversal()
+        .V()
+        .hasLabel(Game.KEY_DB)
+        .match(
+                __.as("c").values(Game.KEY_STEAM_GAME_ID).as("gameID"),
+                __.as("c").values(Game.KEY_GAME_NAME).as("gameName"),
+                __.as("c").inE(Game.KEY_RELATIONSHIP).values(Game.KEY_PLAY_TIME).as("time")
+        ).select("gameID", "time", "gameName").toList();
+// Lowercasing the whole JSON string also lowercases the keys and game names.
+WrappedFileWriter.writeToFile(new Gson().toJson(o).toLowerCase(), "games.json");
+```
+
+Using the game-indexing status property on the player nodes, I found that after 8 hours of crawling I had fully indexed the games of only 481 players.
+
+```java
+// Count the players whose game lists have been completely crawled.
+graph.con.getTraversal()
+        .V()
+        .hasLabel(SteamGraph.KEY_PLAYER)
+        .has(SteamGraph.KEY_CRAWLED_GAME_STATUS, 1)
+        .count().next()
+```
+
+We now transition to Python and Matplotlib to visualize the data exported from our JanusGraph query as a JSON object.
+The dependencies for this [notebook](https://github.com/jrtechs/RandomScripts/tree/master/notebooks) can be installed using pip.
+
+
+```python
+!pip install pandas
+!pip install matplotlib
+```
+
+```
+    Collecting pandas
+      Downloading pandas-1.0.5-cp38-cp38-manylinux1_x86_64.whl (10.0 MB)
+    Collecting pytz>=2017.2
+      Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
+    Requirement already satisfied: numpy>=1.13.3 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (1.18.5)
+    Requirement already satisfied: python-dateutil>=2.6.1 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (2.8.1)
+    Requirement already satisfied: six>=1.5 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
+    Installing collected packages: pytz, pandas
+    Successfully installed pandas-1.0.5 pytz-2020.1
+```
+
+The first thing we do is import our JSON data as a pandas data frame.
+Pandas is an open-source data analysis and manipulation tool.
+I enjoy pandas because it has native integration with matplotlib and supports operations like aggregations and groupings.
+
+
+```python
+import matplotlib.pyplot as plt
+import pandas as pd
+
+games_df = pd.read_json('games.json')
+games_df
+```
+
+| | gameid | time | gamename |
+|---|---|---|---|
+| 0 | 210770 | 243 | sanctum 2 |
+| 1 | 210770 | 31 | sanctum 2 |
+| 2 | 210770 | 276 | sanctum 2 |
+| 3 | 210770 | 147 | sanctum 2 |
+| 4 | 210770 | 52 | sanctum 2 |
+| ... | ... | ... | ... |
+| 36212 | 9800 | 9 | death to spies |
+| 36213 | 445220 | 0 | avorion |
+| 36214 | 445220 | 25509 | avorion |
+| 36215 | 445220 | 763 | avorion |
+| 36216 | 445220 | 3175 | avorion |
+
+36217 rows × 3 columns
+
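For `pd.read_json` to produce these three columns, the exported file has to be something pandas can read as a list of flat records. The sketch below assumes that shape (the raw layout of `games.json` is not shown in this post) and reuses a few rows from the table; `sample_json` and `sample_df` are illustrative names only. The lowercase keys and game names follow from the `toLowerCase()` call in the export step.

```python
from io import StringIO

import pandas as pd

# Assumed layout: a JSON array of flat records with lowercase keys,
# mirroring the columns of games_df. The real games.json may differ.
sample_json = """[
  {"gameid": 210770, "time": 243, "gamename": "sanctum 2"},
  {"gameid": 210770, "time": 31, "gamename": "sanctum 2"},
  {"gameid": 445220, "time": 0, "gamename": "avorion"}
]"""

# Wrap the string in StringIO so pandas treats it as file-like input.
sample_df = pd.read_json(StringIO(sample_json))

# Each record becomes one row; the keys become the column names.
print(sample_df.columns.tolist())
print(len(sample_df))
```

Every player-game ownership edge becomes its own row, which is why the same `gameid` repeats many times in the full data frame.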
+
+Using pandas' built-in matplotlib wrapper, we can graph a histogram of the number of minutes played per owned game.
+
+
+```python
+ax = games_df.hist(column='time', bins=20, range=(0, 4000))
+ax = ax[0][0]
+ax.set_title("Game Play Distribution")
+ax.set_xlabel("Minutes Played")
+ax.set_ylabel("Frequency")
+```
+
+![png](media/steamGames/output_9_1.png)
+
+
+Notice that the vast majority of games are rarely, if ever, played; the distribution is skewed to the right with a lot of outliers.
+We can change the scale to make it easier to view using the range parameter.
+
+
+```python
+ax = games_df.hist(column='time', bins=20, range=(0, 100))
+ax = ax[0][0]
+ax.set_title("Game Play Distribution")
+ax.set_xlabel("Minutes Played")
+ax.set_ylabel("Frequency")
+```
+
+
+![png](media/steamGames/output_11_1.png)
+
+
+If we remove games that have never been played by starting the range at 2 minutes, the distribution looks more reasonable.
+
+
+```python
+ax = games_df.hist(column='time', bins=20, range=(2, 100))
+ax = ax[0][0]
+ax.set_title("Game Play Distribution")
+ax.set_xlabel("Minutes Played")
+ax.set_ylabel("Frequency")
+```
+
+![png](media/steamGames/output_13_1.png)
+
+
+Although histograms are useful, viewing the CDF is often more helpful since it is easier to extract numerical information.
+
+
+```python
+ax = games_df.hist(column='time', density=True, range=(0, 2000), histtype='step', cumulative=True)
+ax = ax[0][0]
+ax.set_title("Game Play Distribution")
+ax.set_xlabel("Minutes Played")
+ax.set_ylabel("Frequency")
+```
+
+![png](media/steamGames/output_15_1.png)
+
+
+According to this graph, in about 80% of cases someone on Steam who owns a game has played it for under 4 hours. Nearly half of all downloaded or purchased Steam games go unplayed. This data is a neat example of the legendary 80/20 principle, also known as the Pareto principle, which states that roughly 80% of the effects come from 20% of the causes. For example, 20% of software bugs result in 80% of debugging time.
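The percentages read off the CDF can also be computed directly instead of being estimated by eye. This is a minimal sketch on a small made-up sample (the values are hypothetical, not taken from the crawl); with the real data, `times` would be `games_df['time']`.

```python
import pandas as pd

# Hypothetical play times in minutes; a stand-in for games_df['time'].
times = pd.Series([0, 0, 0, 0, 0, 12, 45, 130, 600, 9000])

# Share of owned games that were never launched.
unplayed_share = (times == 0).mean()

# Share of owned games played for under 4 hours (240 minutes).
under_four_hours = (times < 240).mean()

# Play time below which 80% of the entries fall (linear interpolation).
p80 = times.quantile(0.8)

print(unplayed_share, under_four_hours, p80)
```

Taking the mean of a boolean mask gives the fraction of `True` values, which is exactly the height of the empirical CDF at that threshold.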
+
+As mentioned earlier, the distribution of time played per owned game is heavily skewed to the right.
+
+
+```python
+ax = plt.gca()
+ax.set_title('Game Play Distribution')
+ax.boxplot(games_df['time'], vert=False, manage_ticks=False, notch=True)
+plt.xlabel("Game Play in Minutes")
+ax.set_yticks([])
+plt.show()
+```
+
+
+![png](media/steamGames/output_17_0.png)
+
+
+When zooming in on the distribution, we see that nearly half of all purchased games go unopened.
+
+```python
+ax = plt.gca()
+ax.set_title('Game Play Distribution')
+ax.boxplot(games_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
+plt.xlabel("Game Play in Hours")
+ax.set_yticks([])
+ax.set_xlim([0, 10])
+plt.show()
+```
+
+
+![png](media/steamGames/output_19_0.png)
+
+
+Viewing the aggregate pool of play times across all games is insightful; however, comparing different games against each other is more interesting.
+In pandas, after we create a grouping on a column, we can aggregate it into metrics such as max, min, mean, etc.
+I also sort the result by count since we are more interested in "popular" games.
+
+
+```python
+stats_df = (games_df.groupby("gamename")
+            .agg({'time': ['count', "min", 'max', 'mean']})
+            .sort_values(by=('time', 'count')))
+stats_df
+```
+
+| gamename | time (count) | time (min) | time (max) | time (mean) |
+|---|---|---|---|---|
+| 龙魂时刻 | 1 | 14 | 14 | 14.000000 |
+| gryphon knight epic | 1 | 0 | 0 | 0.000000 |
+| growing pains | 1 | 0 | 0 | 0.000000 |
+| shoppy mart: steam edition | 1 | 0 | 0 | 0.000000 |
+| ground pounders | 1 | 0 | 0 | 0.000000 |
+| ... | ... | ... | ... | ... |
+| payday 2 | 102 | 0 | 84023 | 5115.813725 |
+| team fortress 2 | 105 | 7 | 304090 | 25291.180952 |
+| unturned | 107 | 0 | 16974 | 1339.757009 |
+| garry's mod | 121 | 0 | 311103 | 20890.314050 |
+| counter-strike: global offensive | 129 | 0 | 506638 | 46356.209302 |
+
+9235 rows × 4 columns
+
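The dictionary-style `agg` call yields two-level column labels such as `('time', 'count')`, which is why the later cells index with tuples. As an alternative sketch (not what this notebook uses), pandas named aggregation produces flat column names instead. The sample rows and the names `sample_df` and `flat_stats` are hypothetical.

```python
import pandas as pd

# Hypothetical sample with the same columns as games_df.
sample_df = pd.DataFrame({
    "gamename": ["unturned", "unturned", "avorion", "unturned", "avorion"],
    "time": [140, 723, 0, 1002, 763],
})

# Named aggregation: keyword=(column, function) gives flat column names.
flat_stats = (sample_df.groupby("gamename")
              .agg(count=("time", "count"),
                   min=("time", "min"),
                   max=("time", "max"),
                   mean=("time", "mean"))
              .sort_values(by="count"))

print(flat_stats)
```

With flat names, a filter can be written as `flat_stats[flat_stats["count"] > 10]` rather than with a tuple key.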
+
+To keep one-off esoteric games with little data from throwing off the metrics, I disregard any game that has ten or fewer entries.
+
+
+```python
+stats_df = stats_df[stats_df[('time', 'count')] > 10]
+stats_df
+```
+
+| gamename | time (count) | time (min) | time (max) | time (mean) |
+|---|---|---|---|---|
+| serious sam hd: the second encounter | 11 | 0 | 329 | 57.909091 |
+| grim fandango remastered | 11 | 0 | 248 | 35.000000 |
+| evga precision x1 | 11 | 0 | 21766 | 2498.181818 |
+| f.e.a.r. 2: project origin | 11 | 0 | 292 | 43.272727 |
+| transistor | 11 | 0 | 972 | 298.727273 |
+| ... | ... | ... | ... | ... |
+| payday 2 | 102 | 0 | 84023 | 5115.813725 |
+| team fortress 2 | 105 | 7 | 304090 | 25291.180952 |
+| unturned | 107 | 0 | 16974 | 1339.757009 |
+| garry's mod | 121 | 0 | 311103 | 20890.314050 |
+| counter-strike: global offensive | 129 | 0 | 506638 | 46356.209302 |
+
+701 rows × 4 columns
+
+
+We see that the average play time per player per game is about 5 hours. However, as noted before, a large share of purchased games go unplayed.
+
+
+```python
+ax = plt.gca()
+ax.set_title('Game Play Distribution')
+ax.boxplot(stats_df[('time', 'mean')] / 60, vert=False, manage_ticks=False, notch=True)
+plt.xlabel("Mean Game Play in Hours")
+ax.set_xlim([0, 40])
+ax.set_yticks([])
+plt.show()
+```
+
+
+![png](media/steamGames/output_25_0.png)
+
+
+I had a hunch that more popular games got played more; however, this dataset is still too small to verify this hunch.
+
+```python
+stats_df.plot.scatter(x=('time', 'count'), y=('time', 'mean'))
+```
+
+![png](media/steamGames/output_27_1.png)
+
+We can create a new filtered data frame that contains only the entries for a single game in order to graph it.
+
+
+```python
+cc_df = games_df[games_df['gamename'] == "counter-strike: global offensive"]
+cc_df
+```
+
+| | gameid | time | gamename |
+|---|---|---|---|
+| 13196 | 730 | 742 | counter-strike: global offensive |
+| 13197 | 730 | 16019 | counter-strike: global offensive |
+| 13198 | 730 | 1781 | counter-strike: global offensive |
+| 13199 | 730 | 0 | counter-strike: global offensive |
+| 13200 | 730 | 0 | counter-strike: global offensive |
+| ... | ... | ... | ... |
+| 13320 | 730 | 3867 | counter-strike: global offensive |
+| 13321 | 730 | 174176 | counter-strike: global offensive |
+| 13322 | 730 | 186988 | counter-strike: global offensive |
+| 13323 | 730 | 103341 | counter-strike: global offensive |
+| 13324 | 730 | 10483 | counter-strike: global offensive |
+
+129 rows × 3 columns
+
+
+It is shocking how many hours certain people have put into Counter-Strike. The highest value in the dataset was 8,444 hours, or about 352 days!
+
+```python
+ax = plt.gca()
+ax.set_title('Game Play Distribution for Counter-Strike')
+ax.boxplot(cc_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
+plt.xlabel("Game Play in Hours")
+ax.set_yticks([])
+plt.show()
+```
+
+![png](media/steamGames/output_31_0.png)
+
+
+Viewing a different game like Unturned yields a vastly different distribution than Counter-Strike. I believe the key difference is that Counter-Strike gets played competitively, whereas Unturned is a more leisurely game. Competitive gamers likely skew the distribution of Counter-Strike to be very high.
+
+```python
+u_df = games_df[games_df['gamename'] == "unturned"]
+u_df
+```
+
+| | gameid | time | gamename |
+|---|---|---|---|
+| 167 | 304930 | 140 | unturned |
+| 168 | 304930 | 723 | unturned |
+| 169 | 304930 | 1002 | unturned |
+| 170 | 304930 | 1002 | unturned |
+| 171 | 304930 | 0 | unturned |
+| ... | ... | ... | ... |
+| 269 | 304930 | 97 | unturned |
+| 270 | 304930 | 768 | unturned |
+| 271 | 304930 | 1570 | unturned |
+| 272 | 304930 | 23 | unturned |
+| 273 | 304930 | 115 | unturned |
+
+107 rows × 3 columns
+
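The contrast between the two games also shows up in simple summary numbers. The sketch below uses only the ten rows displayed for each game in the two tables (not the full 129 and 107 rows), so the figures are illustrative rather than the real medians; `cs_shown` and `unturned_shown` are names made up for this example.

```python
import pandas as pd

# Only the rows displayed in the two tables, in minutes played.
cs_shown = pd.Series([742, 16019, 1781, 0, 0, 3867, 174176, 186988, 103341, 10483])
unturned_shown = pd.Series([140, 723, 1002, 1002, 0, 97, 768, 1570, 23, 115])

# The median is far less sensitive to extreme outliers than the mean.
cs_median_hours = cs_shown.median() / 60
unturned_median_hours = unturned_shown.median() / 60

print(round(cs_median_hours, 1), round(unturned_median_hours, 1))
```

On the full data, the same two lines would use `cc_df['time']` and `u_df['time']` in place of the hand-copied series.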
+
+
+```python
+ax = plt.gca()
+ax.set_title('Game Play Distribution for Unturned')
+ax.boxplot(u_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
+plt.xlabel("Game Play in Hours")
+ax.set_yticks([])
+plt.show()
+```
+
+![png](media/steamGames/output_34_0.png)
+
+
+Next, I made a data frame containing just the raw data points of games that had an aggregate count of over 80. For the size of the crawl that I did, a count of 80 makes a game "popular." Since we only have 481 players indexed, having over 80 entries implies that roughly 17% or more of the indexed players own the game. It is easy to verify that the games returned were very popular by glancing at the results.
+
+
+```python
+# Keep rows whose game appears more than 80 times; copy so the
+# column assignment below does not write into a view of games_df.
+df1 = games_df[games_df['gamename'].map(games_df['gamename'].value_counts()) > 80].copy()
+df1['time'] = df1['time'] / 60  # convert minutes to hours
+df1
+```
+
+| | gameid | time | gamename |
+|---|---|---|---|
+| 167 | 304930 | 2.333333 | unturned |
+| 168 | 304930 | 12.050000 | unturned |
+| 169 | 304930 | 16.700000 | unturned |
+| 170 | 304930 | 16.700000 | unturned |
+| 171 | 304930 | 0.000000 | unturned |
+| ... | ... | ... | ... |
+| 22682 | 578080 | 51.883333 | playerunknown's battlegrounds |
+| 22683 | 578080 | 47.616667 | playerunknown's battlegrounds |
+| 22684 | 578080 | 30.650000 | playerunknown's battlegrounds |
+| 22685 | 578080 | 170.083333 | playerunknown's battlegrounds |
+| 22686 | 578080 | 399.950000 | playerunknown's battlegrounds |
+
+1099 rows × 3 columns
+
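The popularity threshold is easy to sanity-check with a little arithmetic. This sketch assumes the 481 fully indexed players reported by the earlier count query as the denominator, and it also shows `groupby(...).transform('count')` as one alternative to the `value_counts` mapping. The tiny sample frame and the threshold of 2 are made up for illustration.

```python
import pandas as pd

# Share of indexed players implied by the threshold. 481 is the count
# from the earlier Gremlin query and is assumed to be the denominator.
indexed_players = 481
threshold = 80
share = threshold / indexed_players
print(f"{share:.1%}")

# Hypothetical sample; keep only games with more than min_entries rows.
sample_df = pd.DataFrame({
    "gamename": ["a", "a", "a", "b", "c", "c"],
    "time": [60, 120, 0, 30, 90, 600],
})
min_entries = 2

# transform('count') repeats each game's row count on every one of its rows.
counts = sample_df.groupby("gamename")["time"].transform("count")
popular_df = sample_df[counts > min_entries].copy()
popular_df["time"] = popular_df["time"] / 60  # minutes to hours
print(popular_df)
```

Both approaches build a per-row count that lines up with the original index, so the boolean mask can be applied directly to the raw rows.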
+
+
+```python
+# One horizontal box per game, using the hours computed above.
+ax = df1.boxplot(column=["time"], by='gamename', notch=True, vert=False)
+fig = ax.get_figure()
+fig.suptitle('')
+ax.set_title('Play-time Distribution')
+plt.xlabel("Hours Played")
+ax.set_xlim([0, 2000])
+plt.ylabel("Game")
+plt.savefig("playTimes.png", dpi=300, bbox_inches="tight")
+```
+
+![png](media/steamGames/output_38_0.png)
+
+Overall, it is fascinating to see how the distributions for different games vary. In the future, I will re-run some of these analyses with even more data and possibly put them on my website as an interactive graph.