I worked on this project during Dr. Homans's RIT CSCI-331 class.

# Introduction

This project explores the beautiful and frustrating ways in which we
can use AI to develop systems that solve problems. Asteroids is a
perfect example of a fun AI learning problem because it is
difficult for humans to play and has open-source frameworks that can
emulate the environment. Using the OpenAI Gym framework, we developed
different AI agents to play Asteroids using various heuristics and ML
techniques. We then created a testbed to run experiments that
determine statistically whether our custom agents outperform the
random agent.

# Methods and Results

Three agents were developed to play Asteroids. This report is broken
into segments where each agent is explained and its performance is
analyzed.

# Random Agent

The random agent simply takes a random action drawn from the action
space. The resulting agent randomly spins around and shoots
asteroids. Although this agent is easy to implement, it is
ineffective because moving spastically causes it to crash into
asteroids. We use the random agent as our baseline to assess whether
our other agents are better than random
key smashing -- which is my strategy for playing Smash.

```python
"""
ACTION_MEANING = {
    0: "NOOP",
    1: "FIRE",
    2: "UP",
    3: "RIGHT",
    4: "LEFT",
    5: "DOWN",
    6: "UPRIGHT",
    7: "UPLEFT",
    8: "DOWNRIGHT",
    9: "DOWNLEFT",
    10: "UPFIRE",
    11: "RIGHTFIRE",
    12: "LEFTFIRE",
    13: "DOWNFIRE",
    14: "UPRIGHTFIRE",
    15: "UPLEFTFIRE",
    16: "DOWNRIGHTFIRE",
    17: "DOWNLEFTFIRE",
}
"""
def act(self, observation, reward, done):
    # Ignore the observation and sample uniformly from the action space
    return self.action_space.sample()
```

## Test on the Environment Seed

It is always important to know how randomness affects the results of
your experiment. In this agent, there are two sources of
randomness: the seed given to the Gym environment and
the random function used to select an action. By
default, the seed of the Gym library is set to zero. This is useful
for testing because if your agent is deterministic, you will always
get the same results. We can seed the environment with the current
time to add more randomness. However, this raises the question: to what
extent does the added randomness change the scores of the game?
Certain seeds in the Gym environment may make the game much
easier or harder to play, thus altering the distribution of scores.

A test was devised to compare the scores of the environment with both a
fixed seed and a time-based seed. 300 trials of the random agent were run
in each type of seeded environment.

![Seed Effect](media/asteroids/randomSeed.png)

```
Random Agent Time Seed:
    mean:1005.6333333333333
    max:3220.0
    min:110.0
    sd:478.32548077178114
    median:980.0
    n:300
Random Agent Fixed Seed:
    mean:1049.3666666666666
    max:3320.0
    min:110.0
    sd:485.90321281323327
    median:1080.0
    n:300
```

What is astonishing is that both distributions are nearly identical in
every way. Although the means are slightly different, there is
no apparent difference between the distributions of scores. One
might expect that having more randomness would at least change the
variance of the scores, but that did not happen.

```
Random agent vs Random fixed seed
F_onewayResult(
    statistic=1.2300971733588375,
    pvalue=0.2678339696597312
)
```

With such a high p-value, we cannot reject the null hypothesis that
the two distributions have the same mean. This is a useful
result because it allows us to run future experiments
knowing that fixing the seed does not, on average, have a
statistically significant impact on the performance of a random agent.
However, this finding does not help us understand the impact that the
seed has on a fully deterministic agent. It is still possible that a
fully deterministic system will have varying scores on different
environment seeds.

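
The F statistic reported above can be reproduced from the summary statistics alone. The sketch below is a hand-rolled one-way ANOVA, not the code used for the experiment. It assumes the printed `sd` values are population standard deviations (dividing by n, which is NumPy's default), and `f_from_summary` is a name made up for illustration. On the raw score arrays, `scipy.stats.f_oneway` returns both the statistic and the p-value.

```python
def f_from_summary(means, sds, ns):
    """One-way ANOVA F statistic from per-group mean, population SD, and size."""
    total_n = sum(ns)
    k = len(means)
    grand_mean = sum(m * n for m, n in zip(means, ns)) / total_n
    # Between-group sum of squares: spread of the group means around the grand mean
    ss_between = sum(n * (m - grand_mean) ** 2 for m, n in zip(means, ns))
    # Within-group sum of squares: population variance times group size
    ss_within = sum(n * sd ** 2 for sd, n in zip(sds, ns))
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (total_n - k)
    return ms_between / ms_within


# Summary statistics copied from the output above (time seed, fixed seed)
f_seed = f_from_summary(
    means=[1005.6333333333333, 1049.3666666666666],
    sds=[478.32548077178114, 485.90321281323327],
    ns=[300, 300],
)
print(round(f_seed, 2))  # 1.23
```

An F value this close to 1 means the gap between the two group means is about as large as the noise within each group, which is why the p-value is high.
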
# Reflex Agent

Our reflex agent observes the environment and decides what to do based
on a simple rule set. This section is broken into three parts:
feature extraction, reflex rules, and performance.

## Feature Extraction

The largest part of this agent was devoted to parsing the environment
into a more usable form. Feature extraction for this project was
rather difficult since the environment is given as a pixel array and
the screen alternates between drawing the asteroids and drawing the
player. To achieve the best performance with a minimal amount of
algorithmic engineering, the reflex agent parses three things from the
environment: the player's position, the player's direction, and the
closest asteroid.

### 1: Player Position

Finding the position of the player was relatively easy since we only
had to scan the environment for pixels of a certain RGB value. To
account for the flashing environment, the position is stored in a
field of the agent class so that it persists between
action loops. The position is only updated when the player is
observed in a new frame.

```python
AGENT_RGB = [240, 128, 128]
```

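
A minimal sketch of that scan, assuming the observation is a height x width grid of RGB triples and that the player's position is taken as the centroid of the matching pixels. The centroid choice and the names `find_player` and `PositionTracker` are illustrative, not taken from the project code.

```python
AGENT_RGB = (240, 128, 128)


def find_player(frame):
    """Return the (row, col) centroid of player-coloured pixels, or None."""
    row_sum, col_sum, count = 0, 0, 0
    for r, line in enumerate(frame):
        for c, pixel in enumerate(line):
            if tuple(pixel) == AGENT_RGB:
                row_sum, col_sum, count = row_sum + r, col_sum + c, count + 1
    if count == 0:
        return None  # the player is not drawn in this frame
    return row_sum / count, col_sum / count


class PositionTracker:
    """Remembers the last position so frames without the player are harmless."""

    def __init__(self):
        self.position = None

    def update(self, frame):
        found = find_player(frame)
        if found is not None:
            self.position = found
        return self.position
```

Calling `update` on a frame that only contains asteroids returns the previously stored position unchanged.
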
### 2: Player Direction

Detecting the direction of the player would be difficult if we
were only going off the RGB values of the player. When the
player is upright, it is straightforward, but when the player is sideways
things get much harder.

```python
action_sequence = [3, 3, 3, 3, 3, 0, 0, 0]

class Agent(object):
    def __init__(self, action_space):
        self.action_space = action_space

    # Defines how the agent should act
    def act(self, observation, reward, done):
        # Play back the scripted actions one at a time, then idle (NOOP)
        if len(action_sequence) > 0:
            return action_sequence.pop(0)
        return 0
```

![Starting Position](media/asteroids/starting.png)
![4 Turns Right](media/asteroids/4right.png)
![5 Turns Right](media/asteroids/5right.png)

We created a basic script to observe what the player does when given a
specific sequence of actions. I was pleased to notice that exactly 5
turns to the left or right corresponded to a rotation of exactly 90
degrees. By keeping track of our rotation according to the actions that
we have taken, we can precisely track our current direction
without parsing the horrendous pixel array when the player is
sideways.

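
One way to do that bookkeeping is sketched below. It assumes the ship starts facing up and that action codes 3/11 turn right and 4/12 turn left, per the action table above. Counting turns as integers avoids accumulating floating-point error. The class name `DirectionTracker` is hypothetical.

```python
import math

STEPS_PER_REVOLUTION = 20   # 5 turns = 90 degrees, so 20 turns = 360 degrees
RIGHT_ACTIONS = {3, 11}     # RIGHT, RIGHTFIRE
LEFT_ACTIONS = {4, 12}      # LEFT, LEFTFIRE


class DirectionTracker:
    def __init__(self, start_steps=5):
        # 5 steps of 18 degrees = 90 degrees, i.e. facing up (assumed start)
        self.steps = start_steps

    def record(self, action):
        """Update the heading for an action that is about to be sent."""
        if action in LEFT_ACTIONS:
            self.steps = (self.steps + 1) % STEPS_PER_REVOLUTION
        elif action in RIGHT_ACTIONS:
            self.steps = (self.steps - 1) % STEPS_PER_REVOLUTION

    @property
    def radians(self):
        return self.steps * 2 * math.pi / STEPS_PER_REVOLUTION
```

Each step is pi/10 radians, the same increment the reflex agent passes to its direction update.
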
### 3: Position of Closest Asteroid

Asteroids were detected as any pixel that was neither empty (0, 0, 0)
nor the player (240, 128, 128). Using a single pass through
the environment matrix, we were able to find the asteroid pixel closest to
the latest known position of the player.

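
A sketch of that single pass, assuming plain Euclidean distance with no screen wrap-around. The project code may measure distance differently, and `closest_asteroid` is an illustrative name.

```python
EMPTY_RGB = (0, 0, 0)
AGENT_RGB = (240, 128, 128)


def closest_asteroid(frame, player_row, player_col):
    """Return (row, col) of the nearest asteroid pixel, or None if there is none."""
    best, best_dist = None, float("inf")
    for r, line in enumerate(frame):
        for c, pixel in enumerate(line):
            if tuple(pixel) in (EMPTY_RGB, AGENT_RGB):
                continue  # background or the player itself
            # Squared distance is enough for comparing which pixel is closer
            dist = (r - player_row) ** 2 + (c - player_col) ** 2
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best
```
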
## Agent Reflex

Based on my actual strategy for Asteroids, this agent stays in the
middle of the screen and shoots at the closest asteroid.

```python
def act(self, observation, reward, done):
    observation = np.array(observation)
    self.updateState(observation)

    # Angle from the player to the closest asteroid
    dirOfAstroid = math.atan2(self.closestRow - self.row,
                              self.closestCol - self.col)
    dirOfAstroid = self.deWarpAngle(dirOfAstroid)

    # Alternate between firing and turning on successive steps
    self.shotLast = not self.shotLast
    if self.shotLast:
        return 1  # fire

    if self.currentDirection - dirOfAstroid < 0:
        self.updateDirection(math.pi / 10)
        if self.shotLast:  # never true here: shotLast is False past the return above
            return 12  # left fire
        return 4  # left
    else:
        self.updateDirection(-1 * math.pi / 10)
        return 3  # right
```

Despite being a simple agent, it performs well since it can shoot asteroids before they hit it.

## Results of Reflex Agent

In this trial, 200 runs of both the random agent and the reflex agent
were observed with the seed of the environment set to the current
time. The seed was randomized in this scenario because the reflex
agent is fully deterministic and would otherwise perform identically
in each trial.

![histogram](media/asteroids/reflexPerformance.png)

The histogram shows that the reflex agent on average performs
significantly better than the random agent. What is fascinating to
note is that even though the agent's actions are deterministic, the
seed of the environment created a large amount of variance in the
scores observed. It is arguably misleading to report only a single
score as an agent's performance because the environment seed has
a large impact on even a non-random agent's scores.

```
Reflex Agent:
    mean:2385.25
    max:8110.0
    min:530.0
    sd:1066.217115553863
    median:2250.0
    n:200
Random Agent:
    mean:976.15
    max:2030.0
    min:110.0
    sd:425.2712987023695
    median:980.0
    n:200
```

One interesting thing about comparing the two distributions is
that the reflex agent has a much larger standard deviation in its
scores than the random agent. It is also interesting to note that the
reflex agent's worst performance (530) was well above the
random agent's worst performance (110). Also, the best performance of the
reflex agent (8110) shatters the best performance of the random agent (2030).

```
Random agent vs reflex
F_onewayResult(
    statistic=299.86689786081956,
    pvalue=1.777062051091977e-50
)
```

Since we took a large sample of two hundred trials per agent, and the
populations were very different, we got a p-value of nearly zero
(1.77e-50). With a p-value like this, we can confidently reject the
null hypothesis that the two populations have the same mean, and the
summary statistics show that it is the reflex agent that outperforms
the random agent.

# Genetic Algorithm

Genetic algorithms employ the same tactics used in natural selection
to find an optimal solution to an optimization problem. They are often
used in high-dimensional problems where the
optimal solutions are not apparent, and are commonly
used to tune the hyper-parameters of a program. However, the
algorithm can be used in any scenario where you have a function that
measures how good a solution is.

In the scenario of Asteroids, we can employ genetic algorithms to find
the sequence of moves that achieves the highest score
possible. The chromosome is the sequence of actions
to loop through, and the fitness function is simply the score that the
agent achieves.

## Algorithm Implementation

The actual implementation of the genetic algorithm agent was pretty
straightforward: the agent simply loops through a sequence of actions,
where each action represents a gene on the chromosome.

```python
class Agent(object):
    """Very Basic GA Agent"""
    def __init__(self, action_space, chromosome):
        self.action_space = action_space
        self.chromosome = chromosome
        self.index = 0

    # Advance to the next gene, wrapping around at the end of the chromosome
    def act(self, observation, reward, done):
        if self.index >= len(self.chromosome) - 1:
            self.index = 0
        else:
            self.index = self.index + 1
        return self.chromosome[self.index]
```

Rather than using a library, a simple home-brewed genetic algorithm
was created from scratch. The basic algorithm is essentially a loop
that runs the functions necessary to advance each generation.
Each generation can be broken into a few steps:

- selection: removes the worst-performing chromosomes
- mating: uses crossover to create new chromosomes
- mutation: adds randomness to the chromosomes
- fitness: evaluates the performance of each chromosome

In roughly 100 lines of Python, a basic genetic algorithm was crafted.

```python
from random import choice, random, randrange

from matplotlib import pyplot

AVAILABLE_COMMANDS = [0, 1, 2, 3, 4]

# calculatePerformance(chromosome) is defined elsewhere (not shown); it is
# expected to play one game with the chromosome and return the score.


def generateRandomChromosome(chromosomeLength):
    chrom = []
    for i in range(0, chromosomeLength):
        chrom.append(choice(AVAILABLE_COMMANDS))
    return chrom


"""
creates a random population
"""
def createPopulation(populationSize, chromosomeLength):
    pop = []
    for i in range(0, populationSize):
        pop.append((0, generateRandomChromosome(chromosomeLength)))
    return pop


"""
computes fitness of population and sorts the array based
on fitness
"""
def computeFitness(population):
    for i in range(0, len(population)):
        population[i] = (calculatePerformance(population[i][1]), population[i][1])
    population.sort(key=lambda tup: tup[0], reverse=True)  # sorts population in place


"""
kills the weakest portion of the population
"""
def selection(population, keep):
    # The population is sorted best-first, so drop everything after `keep`
    del population[keep:]


"""
Uses crossover to mate two chromosomes together.
"""
def mateBois(chrom1, chrom2):
    pivotPoint = randrange(len(chrom1))
    bb = []
    # Genes before the pivot come from the first parent
    for i in range(0, pivotPoint):
        bb.append(chrom1[i])
    # Genes from the pivot onward come from the second parent
    for i in range(pivotPoint, len(chrom2)):
        bb.append(chrom2[i])
    return (0, bb)


"""
brings population back up to desired size of population
using crossover mating
"""
def mating(population, populationSize):
    newBlood = populationSize - len(population)
    newbies = []
    for i in range(0, newBlood):
        newbies.append(mateBois(choice(population)[1],
                                choice(population)[1]))
    population.extend(newbies)


"""
Randomly mutates x chromosomes -- excluding best chromosome
"""
def mutation(population, mutationRate):
    changes = random() * mutationRate * len(population) * len(population[0][1])
    for i in range(0, int(changes)):
        ind = randrange(len(population) - 1) + 1  # index 0 (the best) is never mutated
        chrom = randrange(len(population[0][1]))
        population[ind][1][chrom] = choice(AVAILABLE_COMMANDS)


"""
Computes average score of population
"""
def computeAverageScore(population):
    total = 0.0
    for c in population:
        total = total + c[0]
    return total / len(population)


def runGeneration(population, populationSize, keep, mutationRate):
    selection(population, keep)
    mating(population, populationSize)
    mutation(population, mutationRate)
    computeFitness(population)


"""
Runs the genetic algorithm
"""
def runGeneticAlgorithm(populationSize, maxGenerations,
                        chromosomeLength, keep, mutationRate):
    population = createPopulation(populationSize, chromosomeLength)

    best = []
    average = []
    generations = range(1, maxGenerations + 1)

    for i in range(1, maxGenerations + 1):
        print("Generation: " + str(i))
        runGeneration(population, populationSize, keep, mutationRate)
        a = computeAverageScore(population)
        average.append(a)
        best.append(population[0][0])
        print("Best Score: " + str(population[0][0]))
        print("Average Score: " + str(a))
        print("Best chromosome: " + str(population[0][1]))
        print()

    pyplot.plot(generations, best, color='g', label='Best')
    pyplot.plot(generations, average, color='orange', label='Average')
    pyplot.xlabel("Generations")
    pyplot.ylabel("Score")
    pyplot.title("Training GA Algorithm")
    pyplot.legend()
    pyplot.show()
```

## Results

![training](media/asteroids/GA50.png)
![training](media/asteroids/GA200.png)

```
Generation: 200
Best Score: 8090.0
Average Score: 2492.6666666666665
Best chromosome: [1, 4, 1, 4, 4, 1, 0, 4, 2, 4, 1, 3, 2, 0, 2, 0, 0, 1, 3, 0, 1, 0, 4, 0, 1, 4, 1, 2, 0, 1, 3, 1, 3, 1, 3, 1, 0, 4, 4, 1, 3, 4, 1, 1, 2, 0, 4, 3, 3, 0]
```

It is impressive that a simple genetic algorithm can learn
to perform well when the seed is fixed. Compared to the
random agent, which had a max score of 3320 with a fixed seed, the
optimized genetic algorithm beat the random agent's best
performance by a factor of roughly 2.4.

Since we trained an optimized set of actions to achieve a high
score on a specific seed, what would happen if we randomized the seed?
A test was conducted to compare the GA agent trained for 200
generations against the random agent. For both agents, the seed was
randomized by setting it to the current time.

![200 Trials GA Random Seed](media/asteroids/GAvsRandom.png)

```
GA Performance Trained on Fixed Seed:
    mean:2257.9
    max:5600.0
    min:530.0
    sd:1018.4363455808125
    median:2020.0
    n:200
```

```
Random Random Seed:
    mean:1079.45
    max:2800.0
    min:110.0
    sd:498.9340612746338
    median:1080.0
    n:200
```

```
F_onewayResult(
    statistic=214.87432376234608,
    pvalue=3.289638100969386e-39
)
```

As expected, the GA agent did not perform as well on random seeds as
it did on the fixed seed that it was trained on. However, the GA was
able to find an action sequence that statistically beat the random
agent, as shown by the score distributions above and the extremely
small p-value. Although luck played a part in the agent scoring 8k on
seed zero, the skill that it learned was
somewhat applicable to other seeds. Replaying the video of the
agent shows that it just slowly drifts around the screen and shoots at
asteroids in front of it. This is a major advantage over the random
agent, which tends to move very fast and rotate
spastically.

## Future Work

This algorithm was more or less a last-minute hack to see if I could
make a cool video of a high-scoring Asteroids agent. Future agents
using genetic algorithms could incorporate reflexes to dynamically
respond to the environment. Based on the direction of the asteroids
nearest to the player, the agent could select a different chromosome
of actions to execute. This could potentially yield scores above ten
thousand if trained and implemented correctly. Future training
should also randomize the seed so that the skills
learned transfer better to other random environments.

# Deep Q-Learning Agent

## Introduction:

The inspiration behind attempting a reinforcement learning agent for
this problem is the original DQN paper from
DeepMind, “Playing Atari with Deep Reinforcement Learning.” This paper
showed the potential of Deep Q-learning on
a variety of simulated Atari games using one standardized architecture
across all of them. Reinforcement learning has always been of interest
to us, and having the opportunity to spend time learning about it and
applying it in a class setting was exciting, even if it is currently
outside the scope of the class. It has been an exciting challenge to
read through and implement a research paper to try to get similar results.

Deep Q-Learning is an extension of the standard Q-Learning algorithm
in which a neural network is used to approximate the optimal
action-value function, Q\*(s,a). The action-value function gives the
maximum expected return achievable after taking action a in state s
and acting optimally from then on. This
works because Q\* obeys the Bellman equation, which
states that if the optimal value Q\*(s',a') of the next state s' is
known for every action a', then the optimal choice is the action a’
that maximizes the expected value of r + γQ\*(s',a'), where γ is the
discount factor. Thus, the reinforcement
learning part comes in the form of a neural network approximating the
optimal action-value function by using the Bellman equation
as an iterative update at every time step.

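
As a concrete illustration of that update, the sketch below computes the Bellman target for a single transition. It is a plain-Python stand-in, not the network code from this project: `q_next` is an assumed list of predicted Q-values for the next state, `td_target` and `squared_td_error` are hypothetical helper names, and the squared-error loss is an assumption since the report does not name the loss used. The discount factor of 0.99 matches the value given later in this report.

```python
GAMMA = 0.99  # discount factor on future rewards


def td_target(reward, q_next, done, gamma=GAMMA):
    """Bellman target r + gamma * max_a' Q(s', a') for one transition."""
    if done:
        return reward  # no future reward after a terminal state
    return reward + gamma * max(q_next)


def squared_td_error(q_pred, target):
    """Squared difference between the predicted Q-value and its target."""
    return (q_pred - target) ** 2


target = td_target(reward=10.0, q_next=[1.0, 4.0, 2.5], done=False)
loss = squared_td_error(q_pred=12.0, target=target)
```

Here the best next-state value is 4.0, so the target is 10 + 0.99 * 4 = 13.96, and training nudges the predicted 12.0 toward it.
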
## Agent architecture:

The network is a basic convolutional network
with 2 conv layers, a fully connected layer, and then an output layer
of 14 units, each representing an individual action. The first
layer consists of 16 8x8 filters with a stride of 4, while the second
has 32 4x4 filters with a stride of 2. Following this layer,
the feature maps are flattened into a 1-D vector of size
12,672 that is passed through a fully connected layer of 256 nodes.
All layers except the output layer are activated using the ReLU function.

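
The flattened size can be checked with the usual output-size formula for an unpadded convolution, floor((n - k) / s) + 1 per spatial dimension. The sketch below assumes no padding and square kernels; `conv_out` and `flattened_size` are illustrative helper names. Under those assumptions, 12,672 corresponds to a 192x160 input frame.

```python
def conv_out(size, kernel, stride):
    """Output length of one spatial dimension for an unpadded convolution."""
    return (size - kernel) // stride + 1


def flattened_size(height, width):
    # Layer 1: 16 filters, 8x8, stride 4; layer 2: 32 filters, 4x4, stride 2
    h = conv_out(conv_out(height, 8, 4), 4, 2)
    w = conv_out(conv_out(width, 8, 4), 4, 2)
    return 32 * h * w  # 32 feature maps come out of the second layer


print(flattened_size(192, 160))  # 12672
```

For comparison, the same two layers applied to a 96x80 frame would flatten to 2,560 values.
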
The optimization algorithm of choice was the Adam optimizer, using a
learning rate of .0001 and the default betas of (.9, .999). The
discount factor, or gamma, applied to future expected rewards was set
to .99, and the probability of taking a random action at each step
was linearly annealed from 1.0 down to a fixed .1 over the first one
million frames seen.

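
The annealing schedule described above can be written as a small function. This is a sketch of the schedule only, with illustrative names (`epsilon_at`, `choose_action`); how the project code counts frames is not shown in this report.

```python
import random

EPS_START, EPS_END, ANNEAL_FRAMES = 1.0, 0.1, 1_000_000


def epsilon_at(frame):
    """Probability of taking a random action after `frame` frames have been seen."""
    fraction = min(frame / ANNEAL_FRAMES, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)


def choose_action(q_values, frame, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon_at(frame):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Halfway through annealing (500,000 frames) the agent still acts randomly 55% of the time; past one million frames it stays at 10%.
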
![Layer code](media/asteroids/code.png)

## Experience Replay:

One of the main ideas in the original paper that significantly
helped the training of this network is the Replay
Buffer. To break the
temporal correlation between sequential frames, which would otherwise
bias the training of the network toward certain chains of situations, a
historical buffer of transitions is kept and mini-batches are sampled
from it to train on at each time step. Every time an action is taken, a tuple
consisting of the current state, the action taken, the reward gained, and
the subsequent state (s, a, r, s’) is stored in the buffer. At
every training step, a mini-batch is sampled from the buffer and used
to train the network. This allows the network to be trained on
non-correlated transitions and hopefully to generalize better to the
environment rather than becoming biased toward a string of similar
actions.

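
A minimal replay buffer along these lines can be built on `collections.deque`. This is a sketch rather than the project's implementation: the class and method names are hypothetical, and the capacity and batch size in the usage lines are arbitrary.

```python
import random
from collections import deque


class ReplayBuffer:
    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch of stored transitions."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Usage: push five transitions into a buffer that only holds three
buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.push(f"s{step}", step % 2, 0.0, f"s{step + 1}")
batch = buffer.sample(2)
```

After the loop only the three most recent transitions remain, so the sampled batch can never contain `s0` or `s1`.
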
## Preprocessing:

One of the first issues that had to be tackled was the
high dimensionality of the input image and the duplicated storage of
that information in the Replay Buffer. Each observation from
the environment is a (210, 160, 3) array holding the
RGB pixels of the frame. To save time and computation, we needed to
preprocess the observations and reduce their
dimensionality, since a single frame stack (of which
there are two per transition) has shape (4, 3, 210, 160), or 403,200
input features.

![Before Processing](media/asteroids/dqn_before.png)

First, each image is converted to grayscale, and the
score/lives section at the top of the screen is cropped out since
it is irrelevant to the network’s vision. The resulting (4,
192, 160) stack is then downsampled by taking every other pixel, giving (4,
96, 80). This reduces the input from 403,200 features to only
30,720, a substantial reduction in the calculations needed while
retaining enough input information for the network.

![After Processing](media/asteroids/dqn_after.png)

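
A sketch of that pipeline for a single frame, written with plain lists so it runs without dependencies (with NumPy the same steps are slices on an array). The grayscale formula (mean of the three channels) and the assumption that all 18 cropped rows come from the top are guesses for illustration; the report only gives the resulting shapes.

```python
CROP_TOP = 18  # 210 - 192 rows removed (assumed to all come from the top)


def preprocess(frame):
    """Turn a 210x160 RGB frame into a 96x80 grayscale frame."""
    # 1. Grayscale: average the R, G and B channels of every pixel
    gray = [[sum(pixel) / 3 for pixel in row] for row in frame]
    # 2. Crop the score/lives strip
    cropped = gray[CROP_TOP:]
    # 3. Downsample: keep every other row and every other column
    return [row[::2] for row in cropped[::2]]


frame = [[(0, 0, 0)] * 160 for _ in range(210)]
small = preprocess(frame)
print(len(small), len(small[0]))  # 96 80
```

Four such frames stacked together give the 4 * 96 * 80 = 30,720 input features mentioned above.
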
## Training:

Training for the bot was conducted by modifying the main function so
that a new game started immediately after the previous one finished,
which made continuous training of the agent easier. Between games, all
the environment parameters were reset and the temporary attributes of
the agent (i.e., current state/next state) were flushed. For the first
four frames of a game, the bot just gathered a stack of frames. After
that, at every time step the next state was compiled, the transition
tuple was pushed onto the buffer, and a training step was run for the
agent. For the
training step, a random batch was drawn from the replay buffer and
used to calculate the loss between actual and expected
Q-values. This loss was used to calculate the gradients for
backpropagation through the network.

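
The frame-gathering step can be sketched with a fixed-length deque. The names here (`FrameStack`, `add`, `reset`) are illustrative, and frames are represented by simple placeholder strings rather than real images.

```python
from collections import deque

STACK_SIZE = 4


class FrameStack:
    def __init__(self):
        self.frames = deque(maxlen=STACK_SIZE)

    def add(self, frame):
        """Add a frame; return the stacked state once four frames are held."""
        self.frames.append(frame)
        if len(self.frames) < STACK_SIZE:
            return None  # still gathering the first frames of the game
        return tuple(self.frames)

    def reset(self):
        self.frames.clear()  # flush between games


stack = FrameStack()
states = [stack.add(f"frame{i}") for i in range(6)]
```

The first three calls return `None`; from the fourth frame on, each call returns the four most recent frames, which serves as the state for that time step.
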
## Outcome:

Unfortunately, the result of 48 hours of continuous training, 950
games played, and roughly 1.3 million frames of game footage seen was
that the agent converged to a suicidal policy that resulted in
consistently garbage performance.

![Reward](media/asteroids/reward.png)

The model reached the fixed 90% model-action chance around
episode 700, which is exactly where the agent starts to go awry. The
strange part is that, since the random action chance is linearly
annealed over the first million frames, an agent that had been
following a garbage policy all along would have been expected to show
rewards steadily decreasing over time as the network took more
control.

![Real Trendline](media/asteroids/reward_trendline.png)

Up until that point, the reward trendline showed a
steady rise with the number of episodes. Extending this out to
10,000 episodes (approximately 10 million frames seen, the same amount
of training the original DeepMind paper used for its bots), the
projected score is in the realm of 2,400 to 2,500, which closely
matches the well-tuned reflex agent and the GA agent on a random
seed.

![Endgame Trendline](media/asteroids/reward_projection.png)

It would’ve been exciting to see how the model compared to
our reflex agent had it been able to train consistently up until the
end.

## Limitations:

There were a fair number of limitations in the
execution and training of this model that possibly contributed to the
slow and unstable training of the network. The differences from the
original paper's algorithm are that the optimizer
used was Adam instead of RMSProp and that the replay
buffer only held the previous 50k frames, not the
past one million. It is possible that the smaller replay buffer
was to blame, as the model was continuously fed sub-optimal
transitions from its past 50,000 frames, which caused it to diverge so
heavily near the end.

One issue in preprocessing that might've led the bot astray is
not taking the pixelwise max of sequential frames
so that each frame includes both the asteroids and the player.
Since the Atari (and by extension, this environment simulation)
doesn't render the asteroids and the player sprite in the same
frame, it is possible that the network was unable to extract any
coherent connection between the alternating frames.

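
For reference, the pixelwise max of two consecutive grayscale frames is short to write. This sketch uses nested lists and a hypothetical `max_frames` helper; it shows the technique that was skipped, not code from the project.

```python
def max_frames(previous, current):
    """Combine two same-sized frames by taking the brighter pixel at each position."""
    return [
        [max(a, b) for a, b in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


asteroid_frame = [[0, 180, 0], [0, 0, 0]]  # only asteroids drawn
player_frame = [[0, 0, 0], [0, 0, 150]]    # only the player drawn
combined = max_frames(asteroid_frame, player_frame)
print(combined)  # [[0, 180, 0], [0, 0, 150]]
```

The combined frame contains both the asteroid and the player, which is what the network would need to relate the two.
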
Regarding optimizations built on the DQN algorithm after the original
DeepMind paper, we did not use separate policy and target networks in
training. In the original algorithm, convergence toward the target is
unstable because the weights that define the target shift continuously
during training. It is hard for the network to converge to something
that changes at every time step, which leads to very noisy and unstable
training. One optimization that has been proposed for DQN is to have a
policy network and a target network. At every time step, the policy
network’s weights are updated with the calculated gradients while the
target network is held fixed for a number of steps. This lets the
target stay still for a while as the policy network converges
toward it, and leads to more stable and guided training.

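
The bookkeeping for that scheme is small. The sketch below stands in plain dictionaries for network weights and uses an arbitrary sync interval and learning rate; `TargetSync`, `step`, and `sync_every` are hypothetical names, and a real implementation would copy framework tensors instead.

```python
import copy


class TargetSync:
    """Keeps a frozen copy of the policy weights, refreshed every `sync_every` steps."""

    def __init__(self, policy_weights, sync_every):
        self.policy = policy_weights
        self.target = copy.deepcopy(policy_weights)
        self.sync_every = sync_every
        self.steps = 0

    def step(self, gradients, lr=0.1):
        # The policy network is updated on every step
        for name, grad in gradients.items():
            self.policy[name] -= lr * grad
        self.steps += 1
        # The target network only changes on sync steps
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.policy)


nets = TargetSync({"w": 1.0}, sync_every=3)
history = []
for _ in range(3):
    nets.step({"w": 1.0})
    history.append((nets.policy["w"], nets.target["w"]))
```

In this run the target weight stays at 1.0 for the first two steps and only catches up with the policy weight on the third.
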
Perhaps the largest limitation was the computational power
available for training. The network was trained on a single GTX1060ti GPU,
which meant that even single episodes took a few minutes to complete. It
would’ve taken an incredibly long time to reach 10 million seen frames,
as even 1.3 million took approximately 48 hours. It’s probable
that our implementation is inefficient in its calculations; however, it
is a well-known limitation of RL that it is time- and
computation-intensive.

## Deep Q Conclusions:

This was a fun agent and algorithm to implement, even if at present it
has produced little to nothing in terms of performance. The plan
is to continue testing and training the agent, even after the
deadline. Reinforcement learning is complicated and hard to debug,
but it is also an exciting challenge because of its potential
for solving problems.

# Conclusion

This project demonstrated how fun it can be to train AI agents to play
video games. Although none of our agents are earth-shatteringly
amazing, we were able to use statistical measures to determine that
the reflex and GA agents outperformed the random agent. The GA agent
and the convolutional neural network show promise, and future
work could drastically improve their results.