Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net


Last week I scraped a bunch of data from the Steam API using my [Steam Graph Project](https://github.com/jrtechs/SteamFriendsGraph).
This project captures Steam users, their friends, and the games that they own.
Using the JanusGraph traversal object, I use the Gremlin graph query language to pull this data.
Since I am storing the play time for a game as a property on the relationship between a player and a game node, I had to make a "join" statement to get the play time property with the game information in a single query.

```java
Object o = graph.con.getTraversal()
        .V()
        .hasLabel(Game.KEY_DB)
        .match(
                __.as("c").values(Game.KEY_STEAM_GAME_ID).as("gameID"),
                __.as("c").values(Game.KEY_GAME_NAME).as("gameName"),
                __.as("c").inE(Game.KEY_RELATIONSHIP).values(Game.KEY_PLAY_TIME).as("time")
        ).select("gameID", "time", "gameName").toList();
WrappedFileWriter.writeToFile(new Gson().toJson(o).toLowerCase(), "games.json");
```

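For orientation, the export should be a JSON array with one object per player-to-game relationship. Below is a minimal sketch of that shape using only Python's standard library. The three sample records reuse values from the data frame preview later in this post, and the lowercase keys follow from the `toLowerCase()` call in the Java snippet. Whether the ids and times are serialized as JSON numbers in the real file is an assumption, and the variable names are purely illustrative.

```python
import json

# Illustrative records only: values reused from the data frame preview in this post.
# Assumption: ids and times are serialized as JSON numbers.
sample_export = """
[
  {"gameid": 210770, "time": 243, "gamename": "sanctum 2"},
  {"gameid": 210770, "time": 31, "gamename": "sanctum 2"},
  {"gameid": 445220, "time": 25509, "gamename": "avorion"}
]
"""

records = json.loads(sample_export)

# Each record is one (player, game) ownership edge; "time" is minutes played.
total_minutes = sum(record["time"] for record in records)
games_seen = sorted({record["gamename"] for record in records})

print(total_minutes)
print(games_seen)
```

The same game id shows up once per owner, which is why later sections group rows by game name before comparing games.
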
Using the game indexing property on the players, I found that I had fully indexed the games of only 481 players after 8 hours.

```java
graph.con.getTraversal()
        .V()
        .hasLabel(SteamGraph.KEY_PLAYER)
        .has(SteamGraph.KEY_CRAWLED_GAME_STATUS, 1)
        .count().next();
```

We now transition to Python and Matplotlib to visualize the data exported from our JanusGraph query as a JSON object.
The dependencies for this [notebook](https://github.com/jrtechs/RandomScripts/tree/master/notebooks) can be installed using pip.

```python
!pip install pandas
!pip install matplotlib
```

```
Collecting pandas
  Downloading pandas-1.0.5-cp38-cp38-manylinux1_x86_64.whl (10.0 MB)
Collecting pytz>=2017.2
  Downloading pytz-2020.1-py2.py3-none-any.whl (510 kB)
Requirement already satisfied: numpy>=1.13.3 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (1.18.5)
Requirement already satisfied: python-dateutil>=2.6.1 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from pandas) (2.8.1)
Requirement already satisfied: six>=1.5 in /home/jeff/Documents/python/ml/lib/python3.8/site-packages (from python-dateutil>=2.6.1->pandas) (1.15.0)
Installing collected packages: pytz, pandas
Successfully installed pandas-1.0.5 pytz-2020.1
```

The first step is to import our JSON data as a pandas data frame.
Pandas is an open-source data analysis and manipulation tool.
I enjoy pandas because it has native integration with matplotlib and supports operations like aggregations and groupings.

```python
import matplotlib.pyplot as plt
import pandas as pd

games_df = pd.read_json('games.json')
games_df
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th></th><th>gameid</th><th>time</th><th>gamename</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>210770</td><td>243</td><td>sanctum 2</td></tr>
<tr><th>1</th><td>210770</td><td>31</td><td>sanctum 2</td></tr>
<tr><th>2</th><td>210770</td><td>276</td><td>sanctum 2</td></tr>
<tr><th>3</th><td>210770</td><td>147</td><td>sanctum 2</td></tr>
<tr><th>4</th><td>210770</td><td>52</td><td>sanctum 2</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>
<tr><th>36212</th><td>9800</td><td>9</td><td>death to spies</td></tr>
<tr><th>36213</th><td>445220</td><td>0</td><td>avorion</td></tr>
<tr><th>36214</th><td>445220</td><td>25509</td><td>avorion</td></tr>
<tr><th>36215</th><td>445220</td><td>763</td><td>avorion</td></tr>
<tr><th>36216</th><td>445220</td><td>3175</td><td>avorion</td></tr>
</tbody>
</table>
<p>36217 rows × 3 columns</p>
</div>

Using the built-in matplotlib wrapper function, we can graph a histogram of the number of minutes played in a game.

```python
ax = games_df.hist(column='time', bins=20, range=(0, 4000))
ax = ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")
```

![png](media/steamGames/output_9_1.png)

Notice that the vast majority of games are rarely played; the distribution is skewed to the right with a lot of outliers.
We can change the scale to make it easier to view using the range parameter.

```python
ax = games_df.hist(column='time', bins=20, range=(0, 100))
ax = ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")
```

![png](media/steamGames/output_11_1.png)

If we remove games that have never been played, the distribution looks more reasonable.

```python
ax = games_df.hist(column='time', bins=20, range=(2, 100))
ax = ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")
```

![png](media/steamGames/output_13_1.png)

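Note that `range=(2, 100)` only hides the unplayed games from the plot, and it also leaves out entries with exactly one minute of play. As a rough sketch of an alternative, the rows can be filtered with a boolean mask before plotting. The `demo_df` frame and its values are made up for illustration; with the real data, the same mask would be applied to `games_df`.

```python
import pandas as pd

# Made-up playtimes in minutes, standing in for games_df['time'].
demo_df = pd.DataFrame({"time": [0, 0, 1, 5, 42, 300]})

# Keep only rows where the game has been launched at least once.
played_df = demo_df[demo_df["time"] > 0]

print(len(demo_df), len(played_df))
```

Filtering first keeps the one-minute entries and makes the "played at least once" rule explicit, independent of the histogram's axis limits.
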
Although histograms are useful, viewing the cumulative distribution function (CDF) is often more helpful since it is easier to extract numerical information.

```python
ax = games_df.hist(column='time', density=True, range=(0, 2000), histtype='step', cumulative=True)
ax = ax[0][0]
ax.set_title("Game Play Distribution")
ax.set_xlabel("Minutes Played")
ax.set_ylabel("Frequency")
```

![png](media/steamGames/output_15_1.png)

According to this graph, about 80% of people on Steam who own a game play it for under 4 hours. Nearly half of all downloaded or purchased Steam games go unplayed. This data is a neat example of the legendary 80/20 principle, also known as the Pareto principle. The Pareto principle states that roughly 80% of the effects come from 20% of the causes. For example, 20% of software bugs result in 80% of debugging time.

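The values read off the CDF can also be computed directly, since the empirical CDF at a threshold is just the share of entries at or below it. Here is a minimal sketch on made-up playtimes that I contrived to echo the percentages above. The helper `fraction_at_or_below` is a hypothetical name of mine, and with the real data the input would be `games_df['time']`.

```python
import pandas as pd

# Made-up playtimes in minutes; the real input would be games_df['time'].
minutes = pd.Series([0, 0, 0, 0, 0, 30, 90, 200, 600, 5000])


def fraction_at_or_below(values: pd.Series, threshold: float) -> float:
    """Empirical CDF: share of entries whose value is <= threshold."""
    return float((values <= threshold).mean())


never_played = fraction_at_or_below(minutes, 0)
within_four_hours = fraction_at_or_below(minutes, 4 * 60)

print(never_played, within_four_hours)
```

Taking the mean of a boolean mask works because `True` counts as 1 and `False` as 0.
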
As mentioned earlier, the distribution of time played per owned game is heavily skewed to the right.

```python
ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time'], vert=False, manage_ticks=False, notch=True)
plt.xlabel("Game Play in Minutes")
ax.set_yticks([])
plt.show()
```

![png](media/steamGames/output_17_0.png)

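The right skew in the box plot can also be checked numerically: in a right-skewed sample, the long tail pulls the mean well above the median. The snippet below is a small sketch on made-up playtimes, not the real dataset; the variable names are my own.

```python
import pandas as pd

# Made-up playtimes in minutes with one heavy outlier.
minutes = pd.Series([0, 0, 0, 10, 20, 30, 60, 120, 240, 30000])

mean_minutes = minutes.mean()
median_minutes = minutes.median()
sample_skewness = minutes.skew()  # positive for a right-skewed sample

# The long right tail pulls the mean far above the median.
is_right_skewed = mean_minutes > median_minutes

print(mean_minutes, median_minutes, is_right_skewed)
```

With the real data, the same three calls on `games_df['time']` would give a quick numeric summary to go alongside the plot.
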
When zooming in on the distribution, we see that nearly half of all the purchased games go unopened.

```python
ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(games_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
ax.set_xlim([0, 10])
plt.show()
```

![png](media/steamGames/output_19_0.png)

Viewing the aggregate pool of play time across all games is insightful; however, comparing individual games against each other is more interesting.
In pandas, after we create a grouping on a column, we can aggregate it into metrics such as max, min, mean, etc.
I am also sorting the results by count since we are more interested in "popular" games.

```python
stats_df = (games_df.groupby("gamename")
            .agg({'time': ['count', 'min', 'max', 'mean']})
            .sort_values(by=('time', 'count')))
stats_df
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead tr th { text-align: left; }
.dataframe thead tr:last-of-type th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr><th></th><th colspan="4" halign="left">time</th></tr>
<tr><th></th><th>count</th><th>min</th><th>max</th><th>mean</th></tr>
<tr><th>gamename</th><th></th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>龙魂时刻</th><td>1</td><td>14</td><td>14</td><td>14.000000</td></tr>
<tr><th>gryphon knight epic</th><td>1</td><td>0</td><td>0</td><td>0.000000</td></tr>
<tr><th>growing pains</th><td>1</td><td>0</td><td>0</td><td>0.000000</td></tr>
<tr><th>shoppy mart: steam edition</th><td>1</td><td>0</td><td>0</td><td>0.000000</td></tr>
<tr><th>ground pounders</th><td>1</td><td>0</td><td>0</td><td>0.000000</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>payday 2</th><td>102</td><td>0</td><td>84023</td><td>5115.813725</td></tr>
<tr><th>team fortress 2</th><td>105</td><td>7</td><td>304090</td><td>25291.180952</td></tr>
<tr><th>unturned</th><td>107</td><td>0</td><td>16974</td><td>1339.757009</td></tr>
<tr><th>garry's mod</th><td>121</td><td>0</td><td>311103</td><td>20890.314050</td></tr>
<tr><th>counter-strike: global offensive</th><td>129</td><td>0</td><td>506638</td><td>46356.209302</td></tr>
</tbody>
</table>
<p>9235 rows × 4 columns</p>
</div>

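The dictionary form of `agg` used above produces two-level column labels such as `('time', 'count')`. A possible alternative is pandas named aggregation, which yields flat column names instead. This is only a sketch on made-up rows: `demo_df`, `flat_stats_df`, and the output column names are my own choices for illustration and are not used elsewhere in this post.

```python
import pandas as pd

# Made-up rows in the same shape as games_df (time in minutes).
demo_df = pd.DataFrame({
    "gamename": ["a", "a", "a", "b", "b", "c"],
    "time": [0, 60, 120, 30, 90, 15],
})

# Named aggregation: each keyword becomes a flat output column,
# defined as a (source column, aggregation function) pair.
flat_stats_df = (
    demo_df.groupby("gamename")
    .agg(
        count=("time", "count"),
        min_time=("time", "min"),
        max_time=("time", "max"),
        mean_time=("time", "mean"),
    )
    .sort_values(by="count")
)

print(flat_stats_df)
```

With flat names, later selections read as `flat_stats_df['mean_time']` rather than `stats_df[('time', 'mean')]`. The rest of this post keeps the two-level labels.
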
To prevent one-off esoteric games that I don't have much data for from throwing off the metrics, I am disregarding any game with ten or fewer values.

```python
stats_df = stats_df[stats_df[('time', 'count')] > 10]
stats_df
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead tr th { text-align: left; }
.dataframe thead tr:last-of-type th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr><th></th><th colspan="4" halign="left">time</th></tr>
<tr><th></th><th>count</th><th>min</th><th>max</th><th>mean</th></tr>
<tr><th>gamename</th><th></th><th></th><th></th><th></th></tr>
</thead>
<tbody>
<tr><th>serious sam hd: the second encounter</th><td>11</td><td>0</td><td>329</td><td>57.909091</td></tr>
<tr><th>grim fandango remastered</th><td>11</td><td>0</td><td>248</td><td>35.000000</td></tr>
<tr><th>evga precision x1</th><td>11</td><td>0</td><td>21766</td><td>2498.181818</td></tr>
<tr><th>f.e.a.r. 2: project origin</th><td>11</td><td>0</td><td>292</td><td>43.272727</td></tr>
<tr><th>transistor</th><td>11</td><td>0</td><td>972</td><td>298.727273</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>payday 2</th><td>102</td><td>0</td><td>84023</td><td>5115.813725</td></tr>
<tr><th>team fortress 2</th><td>105</td><td>7</td><td>304090</td><td>25291.180952</td></tr>
<tr><th>unturned</th><td>107</td><td>0</td><td>16974</td><td>1339.757009</td></tr>
<tr><th>garry's mod</th><td>121</td><td>0</td><td>311103</td><td>20890.314050</td></tr>
<tr><th>counter-strike: global offensive</th><td>129</td><td>0</td><td>506638</td><td>46356.209302</td></tr>
</tbody>
</table>
<p>701 rows × 4 columns</p>
</div>

We see that the average playtime per player per game is about 5 hours. However, as noted before, nearly half of purchased games go unplayed.

```python
ax = plt.gca()
ax.set_title('Game Play Distribution')
ax.boxplot(stats_df[('time', 'mean')] / 60, vert=False, manage_ticks=False, notch=True)
plt.xlabel("Mean Game Play in Hours")
ax.set_xlim([0, 40])
ax.set_yticks([])
plt.show()
```

![png](media/steamGames/output_25_0.png)

I had a hunch that more popular games got played more; however, this dataset is still too small to verify this hunch.

```python
stats_df.plot.scatter(x=('time', 'count'), y=('time', 'mean'))
```

![png](media/steamGames/output_27_1.png)

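One way to put a number on this hunch, once more data is available, would be a correlation coefficient between owner count and mean playtime. The sketch below uses made-up per-game statistics and computes both a Pearson correlation and a Spearman rank correlation (Pearson applied to ranks). The frame `demo_stats` and its values are hypothetical, and neither coefficient says anything about causation.

```python
import pandas as pd

# Made-up per-game statistics: number of owners and mean minutes played.
demo_stats = pd.DataFrame({
    "count": [11, 20, 35, 60, 129],
    "mean": [40.0, 35.0, 300.0, 1300.0, 46000.0],
})

# Pearson correlation measures linear association.
pearson_r = demo_stats["count"].corr(demo_stats["mean"])

# Spearman correlation is the Pearson correlation of the ranks,
# which makes it less sensitive to a few extreme playtimes.
spearman_r = demo_stats["count"].rank().corr(demo_stats["mean"].rank())

print(round(pearson_r, 3), round(spearman_r, 3))
```

On the real data, the same calls would take `stats_df[('time', 'count')]` and `stats_df[('time', 'mean')]` as inputs.
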
We can create a new filtered data frame that only contains the results of a single game to graph it.

```python
cc_df = games_df[games_df['gamename'] == "counter-strike: global offensive"]
cc_df
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th></th><th>gameid</th><th>time</th><th>gamename</th></tr>
</thead>
<tbody>
<tr><th>13196</th><td>730</td><td>742</td><td>counter-strike: global offensive</td></tr>
<tr><th>13197</th><td>730</td><td>16019</td><td>counter-strike: global offensive</td></tr>
<tr><th>13198</th><td>730</td><td>1781</td><td>counter-strike: global offensive</td></tr>
<tr><th>13199</th><td>730</td><td>0</td><td>counter-strike: global offensive</td></tr>
<tr><th>13200</th><td>730</td><td>0</td><td>counter-strike: global offensive</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>
<tr><th>13320</th><td>730</td><td>3867</td><td>counter-strike: global offensive</td></tr>
<tr><th>13321</th><td>730</td><td>174176</td><td>counter-strike: global offensive</td></tr>
<tr><th>13322</th><td>730</td><td>186988</td><td>counter-strike: global offensive</td></tr>
<tr><th>13323</th><td>730</td><td>103341</td><td>counter-strike: global offensive</td></tr>
<tr><th>13324</th><td>730</td><td>10483</td><td>counter-strike: global offensive</td></tr>
</tbody>
</table>
<p>129 rows × 3 columns</p>
</div>

It is shocking how many hours certain people play Counter-Strike. The highest number in the dataset was 8,444 hours, or 352 days!

```python
ax = plt.gca()
ax.set_title('Game Play Distribution for Counter-Strike')
ax.boxplot(cc_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()
```

![png](media/steamGames/output_31_0.png)

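For reference, the 8,444 hour figure comes from converting the maximum Counter-Strike playtime in the stats table, which is recorded in minutes. A quick sketch of the arithmetic, with constant and variable names of my own choosing:

```python
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24

# Maximum Counter-Strike playtime in minutes, from the stats table in this post.
max_minutes = 506638

max_hours = max_minutes / MINUTES_PER_HOUR
max_days = max_hours / HOURS_PER_DAY

print(round(max_hours), round(max_days))  # 8444 352
```
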
Viewing the distribution for a different game like Unturned yields a vastly different distribution than Counter-Strike. I believe the key difference is that Counter-Strike gets played competitively, whereas Unturned is a more leisurely game. Competitive gamers likely skew the distribution of Counter-Strike to be very high.

```python
u_df = games_df[games_df['gamename'] == "unturned"]
u_df
```

<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th></th><th>gameid</th><th>time</th><th>gamename</th></tr>
</thead>
<tbody>
<tr><th>167</th><td>304930</td><td>140</td><td>unturned</td></tr>
<tr><th>168</th><td>304930</td><td>723</td><td>unturned</td></tr>
<tr><th>169</th><td>304930</td><td>1002</td><td>unturned</td></tr>
<tr><th>170</th><td>304930</td><td>1002</td><td>unturned</td></tr>
<tr><th>171</th><td>304930</td><td>0</td><td>unturned</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>
<tr><th>269</th><td>304930</td><td>97</td><td>unturned</td></tr>
<tr><th>270</th><td>304930</td><td>768</td><td>unturned</td></tr>
<tr><th>271</th><td>304930</td><td>1570</td><td>unturned</td></tr>
<tr><th>272</th><td>304930</td><td>23</td><td>unturned</td></tr>
<tr><th>273</th><td>304930</td><td>115</td><td>unturned</td></tr>
</tbody>
</table>
<p>107 rows × 3 columns</p>
</div>

```python
ax = plt.gca()
ax.set_title('Game Play Distribution for Unturned')
ax.boxplot(u_df['time'] / 60, vert=False, manage_ticks=False, notch=True)
plt.xlabel("Game Play in Hours")
ax.set_yticks([])
plt.show()
```

![png](media/steamGames/output_34_0.png)

Next, I made a data frame containing just the raw data points of games that had an aggregate count of over 80. For the sample size of my crawl, a count of 80 makes a game "popular." Since we only have 485 players indexed, having over 80 entries implies that more than 16% of the people indexed had the game. It is easy to verify that the games returned were very popular by glancing at the results.

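```python
# Copy the filtered rows so the unit conversion does not write into a slice of games_df.
df1 = games_df[games_df['gamename'].map(games_df['gamename'].value_counts()) > 80].copy()
df1['time'] = df1['time'] / 60  # convert minutes to hours
df1
```
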
<div>
<style scoped>
.dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th { vertical-align: top; }
.dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;"><th></th><th>gameid</th><th>time</th><th>gamename</th></tr>
</thead>
<tbody>
<tr><th>167</th><td>304930</td><td>2.333333</td><td>unturned</td></tr>
<tr><th>168</th><td>304930</td><td>12.050000</td><td>unturned</td></tr>
<tr><th>169</th><td>304930</td><td>16.700000</td><td>unturned</td></tr>
<tr><th>170</th><td>304930</td><td>16.700000</td><td>unturned</td></tr>
<tr><th>171</th><td>304930</td><td>0.000000</td><td>unturned</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>
<tr><th>22682</th><td>578080</td><td>51.883333</td><td>playerunknown's battlegrounds</td></tr>
<tr><th>22683</th><td>578080</td><td>47.616667</td><td>playerunknown's battlegrounds</td></tr>
<tr><th>22684</th><td>578080</td><td>30.650000</td><td>playerunknown's battlegrounds</td></tr>
<tr><th>22685</th><td>578080</td><td>170.083333</td><td>playerunknown's battlegrounds</td></tr>
<tr><th>22686</th><td>578080</td><td>399.950000</td><td>playerunknown's battlegrounds</td></tr>
</tbody>
</table>
<p>1099 rows × 3 columns</p>
</div>

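The `map` plus `value_counts` filter above can also be written with `groupby(...).filter(...)`, which some readers may find easier to follow. The sketch below runs on made-up rows with a small stand-in threshold. `demo_df`, `popular_df`, `MIN_OWNERS`, and the `hours` column are illustrative names of mine, not part of the notebook.

```python
import pandas as pd

# Made-up rows in the same shape as games_df (time in minutes).
demo_df = pd.DataFrame({
    "gamename": ["a", "a", "a", "b", "b", "c"],
    "time": [0, 60, 120, 30, 90, 15],
})

MIN_OWNERS = 2  # stand-in for the threshold of 80 used in this post

# Keep every raw row that belongs to a game with more than MIN_OWNERS entries.
popular_df = demo_df.groupby("gamename").filter(lambda group: len(group) > MIN_OWNERS)

# assign() returns a new frame, so the original minutes column stays untouched.
popular_df = popular_df.assign(hours=popular_df["time"] / 60)

print(popular_df)
```

Keeping the hours in a separate column is a design choice for this sketch; it avoids overwriting `time` and leaves the unit of each column unambiguous.
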
```python
ax = df1.boxplot(column=["time"], by='gamename', notch=True, vert=False)
fig = ax.get_figure()
fig.suptitle('')
ax.set_title('Play-time Distribution')
plt.xlabel("Hours Played")
ax.set_xlim([0, 2000])
plt.ylabel("Game")
plt.savefig("playTimes.png", dpi=300, bbox_inches="tight")
```

![png](media/steamGames/output_38_0.png)

Overall, it is fascinating to see how the distributions for different games vary. In the future, I will re-run some of these analytics with even more data and possibly put them on my website as an interactive graph.