| I worked on this project during Dr. Homans's RIT CSCI-331 class. | |||||
| # Introduction | |||||
This project explores the beautiful and frustrating ways in which we
can use AI to develop systems that solve problems. Asteroids is a
great learning problem for AI because it is difficult for humans to
play and has open-source frameworks that can emulate the environment.
Using the OpenAI Gym framework, we developed several AI agents that
play Asteroids using various heuristics and ML techniques. We then
created a testbed to run experiments that determine statistically
whether our custom agents outperform the random agent.
| # Methods and Results | |||||
| Three agents were developed to play Asteroids. This report is broken | |||||
| into segments where each agent is explained and its performance is | |||||
| analyzed. | |||||
| # Random Agent | |||||
The random agent simply takes a random action from the action
space. The resulting agent randomly spins around and shoots at
asteroids. Although this random agent is easy to implement, it is
ineffective because moving spastically causes the ship to crash into
asteroids. Using this agent as a baseline, we can assess whether our
other agents are better than random key smashing -- which is my
strategy for playing Smash.
| ```python | |||||
| """ | |||||
| ACTION_MEANING = { | |||||
| 0: "NOOP", | |||||
| 1: "FIRE", | |||||
| 2: "UP", | |||||
| 3: "RIGHT", | |||||
| 4: "LEFT", | |||||
| 5: "DOWN", | |||||
| 6: "UPRIGHT", | |||||
| 7: "UPLEFT", | |||||
| 8: "DOWNRIGHT", | |||||
| 9: "DOWNLEFT", | |||||
| 10: "UPFIRE", | |||||
| 11: "RIGHTFIRE", | |||||
| 12: "LEFTFIRE", | |||||
| 13: "DOWNFIRE", | |||||
| 14: "UPRIGHTFIRE", | |||||
| 15: "UPLEFTFIRE", | |||||
| 16: "DOWNRIGHTFIRE", | |||||
| 17: "DOWNLEFTFIRE", | |||||
| } | |||||
| """ | |||||
def act(self, observation, reward, done):
    # Ignore the observation and sample uniformly from the action space
    return self.action_space.sample()
| ``` | |||||
| ## Test on the Environment Seed | |||||
It is always important to know how randomness affects the results of
your experiment. This agent has two sources of randomness: the seed
given to the Gym environment and the random function used to select
actions. By default, the seed of the Gym environment is set to zero,
which is useful for testing because a deterministic agent will always
produce the same results. We can instead seed the environment with the
current time to add more randomness. However, this raises the
question: to what extent does the added randomness change the scores
of the game? Certain seeds may make the game much easier or harder to
play, altering the distribution of scores.
A test was devised to compare scores under a fixed seed and a
time-based seed. 300 trials of the random agent were run in each type
of seeded environment, as sketched below.
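For reference, here is a minimal sketch of the two setups, assuming the classic Gym API where the environment exposes `env.seed()` and the Asteroids environment id is `Asteroids-v0`:
```python
import time
import gym

env = gym.make("Asteroids-v0")

env.seed(0)                   # fixed seed: the same asteroid layout every run
# env.seed(int(time.time()))  # time-based seed: a different layout per run
observation = env.reset()
```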
|  | |||||
| ``` | |||||
| Random Agent Time Seed: | |||||
| mean:1005.6333333333333 | |||||
| max:3220.0 | |||||
| min:110.0 | |||||
| sd:478.32548077178114 | |||||
median:980.0
n:300
| Random Agent Fixed Seed: | |||||
| mean:1049.3666666666666 | |||||
| max:3320.0 | |||||
| min:110.0 | |||||
| sd:485.90321281323327 | |||||
| median:1080.0 | |||||
| n:300 | |||||
| ``` | |||||
What is astonishing is that both distributions are nearly identical.
Although the means differ slightly, there appears to be no meaningful
difference between the score distributions. One might expect the added
randomness to at least change the variance of the scores, but that is
not the case.
| ``` | |||||
| Random agent vs Random fixed seed | |||||
| F_onewayResult( | |||||
| statistic=1.2300971733588375, | |||||
| pvalue=0.2678339696597312 | |||||
| ) | |||||
| ``` | |||||
With such a high p-value we cannot reject the null hypothesis that the
two distributions have the same mean. This is a useful conclusion
because it lets us run future experiments knowing that, on average, a
specific seed will not have a statistically significant impact on the
performance of a random agent. However, this finding does not tell us
how the seed affects a fully deterministic agent. It is still possible
that a fully deterministic agent will score differently on different
environment seeds.
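The one-way ANOVA above can be reproduced with SciPy; a minimal sketch, assuming `time_seed_scores` and `fixed_seed_scores` are the two lists of 300 final scores collected above (hypothetical variable names):
```python
from scipy.stats import f_oneway

# Compare the two score distributions with a one-way ANOVA F-test
result = f_oneway(time_seed_scores, fixed_seed_scores)
print(result.statistic, result.pvalue)
```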
| # Reflex Agent | |||||
| Our reflex agent observes the environment and decides what to do based | |||||
| on a simple rule set. The reflex agent is broken into three sections: | |||||
| feature extraction, reflex rules, and performance. | |||||
| ## Feature Extraction | |||||
The largest part of this agent was devoted to parsing the environment
into a more usable form. Feature extraction for this project was
rather difficult because the environment is given as a raw pixel array
and the screen alternates between rendering the asteroids and the
player. To achieve the best performance with a minimal amount of
algorithmic engineering, the reflex agent extracts three things from
the environment: the player's position, the player's direction, and
the closest asteroid.
| ### 1: Player Position | |||||
Finding the position of the player is relatively easy since you only
have to scan the frame for pixels of a certain RGB value. To account
for the flashing environment, the position is stored in fields of the
class so that it persists between action loops, and it is only updated
when the player is actually observed.
| ```python | |||||
| AGENT_RGB = [240, 128, 128] | |||||
| ``` | |||||
| ### 2: Player Direction | |||||
Detecting the direction of the player is difficult if you only go off
the RGB values of the sprite. When the player is upright the sprite is
straightforward to recognize, but when the player is rotated sideways
the pixel pattern becomes much harder to interpret.
| ```python | |||||
# Script used to observe how the player responds to a fixed action sequence
action_sequence = [3, 3, 3, 3, 3, 0, 0, 0]

class Agent(object):
    def __init__(self, action_space):
        self.action_space = action_space

    # Defines how the agent should act
    def act(self, observation, reward, done):
        if len(action_sequence) > 0:
            return action_sequence.pop(0)
        return 0
| ``` | |||||
|  | |||||
|  | |||||
|  | |||||
We created a basic script to observe what the player does when given a
specific sequence of actions. I was pleased to find that exactly 5
turns to the left or right corresponds to a perfect 90 degrees. By
tracking our rotation according to the actions we have taken, we can
precisely maintain the current heading without parsing the horrendous
pixel array when the player is sideways, as sketched below.
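A minimal sketch of that bookkeeping, assuming a left turn (action 4) adds π/10 to the heading and a right turn (action 3) subtracts it, matching the updateDirection calls in the reflex code below:
```python
import math

TURN_STEP = math.pi / 10  # 5 turns == 90 degrees

def trackDirection(current_direction, action):
    # Hypothetical helper: update the heading purely from the action taken
    if action == 4:        # LEFT
        current_direction += TURN_STEP
    elif action == 3:      # RIGHT
        current_direction -= TURN_STEP
    return current_direction
```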
| ### 3: Position of Closest Asteroid | |||||
Asteroids are detected as any pixel that is neither empty (0, 0, 0)
nor the player (240, 128, 128). Using a single pass through the
environment matrix, we detect the asteroid closest to the last known
position of the player.
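A vectorized sketch of that pass, with hypothetical names; it treats any pixel that is neither background-colored nor player-colored as part of an asteroid and returns the nearest one:
```python
import numpy as np

EMPTY_RGB = [0, 0, 0]
AGENT_RGB = [240, 128, 128]

def findClosestAsteroid(observation, player_row, player_col):
    not_empty = np.any(observation != EMPTY_RGB, axis=-1)
    not_player = np.any(observation != AGENT_RGB, axis=-1)
    rows, cols = np.nonzero(not_empty & not_player)
    if len(rows) == 0:
        return None
    # Squared distance is enough for picking the nearest pixel
    dists = (rows - player_row) ** 2 + (cols - player_col) ** 2
    nearest = np.argmin(dists)
    return rows[nearest], cols[nearest]
```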
| ## Agent Reflex | |||||
| Based on my actual strategy for asteroids, this agent stays in the | |||||
| middle of the screen and shoots at the closest asteroid to it. | |||||
| ```python | |||||
def act(self, observation, reward, done):
    observation = np.array(observation)
    self.updateState(observation)
    # Angle from the player to the closest asteroid
    dirOfAstroid = math.atan2(self.closestRow - self.row, self.closestCol - self.col)
    dirOfAstroid = self.deWarpAngle(dirOfAstroid)
    # Alternate between firing and turning toward the asteroid
    self.shotLast = not self.shotLast
    if self.shotLast:
        return 1  # fire
    if self.currentDirection - dirOfAstroid < 0:
        self.updateDirection(math.pi/10)
        return 4  # left
    else:
        self.updateDirection(-1*math.pi/10)
        return 3  # right
| ``` | |||||
Despite being a simple agent, it performs well since it can shoot asteroids before they hit it.
| ## Results of Reflex Agent | |||||
| In this trial, 200 tests of both the random agent and the reflex agent | |||||
| were observed while setting the seed of the environment to the current | |||||
| time. The seed was randomly set in this scenario since the reflex | |||||
| agent is fully deterministic and would perform identically in each trial | |||||
| otherwise. | |||||
|  | |||||
The histogram shows that the reflex agent performs significantly
better than the random agent on average. What is fascinating is that
even though the agent's actions are deterministic, the seed of the
environment creates a large amount of variance in the observed scores.
It is arguably misleading to report a single score as an agent's
performance, because the environment seed has a large impact even on a
deterministic agent's scores.
| ``` | |||||
| Reflex Agent: | |||||
| mean:2385.25 | |||||
| max:8110.0 | |||||
| min:530.0 | |||||
| sd:1066.217115553863 | |||||
| median:2250.0 | |||||
| n:200 | |||||
| Random Agent: | |||||
| mean:976.15 | |||||
| max:2030.0 | |||||
| min:110.0 | |||||
| sd:425.2712987023695 | |||||
| median:980.0 | |||||
| n:200 | |||||
| ``` | |||||
| One thing that is interesting about comparing the two distributions is | |||||
| that the reflex agent has a much larger standard deviation in its | |||||
| scores than the random agent. It is also interesting to note that the | |||||
| reflex agent's worst performance was significantly better than the | |||||
| random agent's worst performance. Also, the best performance of the | |||||
| reflex agent shatters the best performance of the random agent. | |||||
| ``` | |||||
| Random agent vs reflex | |||||
| F_onewayResult( | |||||
| statistic=299.86689786081956, | |||||
| pvalue=1.777062051091977e-50 | |||||
| ) | |||||
| ``` | |||||
Since we used such a large sample size of two hundred, and the
populations were so different, we obtained a p-value of nearly zero
(1.77e-50). With a p-value like this, we can say with nearly 100%
confidence (with rounding) that these two populations are different
and that the reflex agent outperforms the random agent.
| # Genetic Algorithm | |||||
Genetic algorithms employ the same tactics used in natural selection
to find an optimal solution to an optimization problem. They are often
used in high-dimensional problems where the optimal solutions are not
apparent, and are commonly used to tune the hyper-parameters of a
program. However, a genetic algorithm can be used in any scenario
where you have a function that measures how good a solution is.
In the scenario of Asteroids, we can employ a genetic algorithm to
find the sequence of moves that achieves the highest score possible.
The chromosome is naturally defined as the sequence of actions to loop
through, and the fitness function is simply the score that the agent
achieves.
| ## Algorithm Implementation | |||||
The actual implementation of the genetic algorithm was
straightforward: the agent simply loops through a sequence of actions,
where each action represents a gene on the chromosome.
| ```python | |||||
| class Agent(object): | |||||
| """Very Basic GA Agent""" | |||||
| def __init__(self, action_space, chromosome): | |||||
| self.action_space = action_space | |||||
| self.chromosome = chromosome | |||||
| self.index = 0 | |||||
    # Cycle through the chromosome, returning one action per step
| def act(self, observation, reward, done): | |||||
| if self.index >= len(self.chromosome)-1: | |||||
| self.index = 0 | |||||
| else: | |||||
| self.index = self.index + 1 | |||||
| return self.chromosome[self.index] | |||||
| ``` | |||||
Rather than using a library, a simple genetic algorithm was
home-brewed from scratch. The algorithm is essentially a loop that
runs the functions needed to produce each generation. Each generation
can be broken into a few steps:
- selection: removes the worst-performing chromosomes
- mating: uses crossover to create new chromosomes
- mutation: adds randomness to the chromosomes
- fitness: evaluates the performance of each chromosome
In roughly 100 lines of Python, a basic genetic algorithm was crafted.
| ```python | |||||
from random import choice, random, randrange
from matplotlib import pyplot

# Subset of the action space used as genes (NOOP, FIRE, UP, RIGHT, LEFT)
AVAILABLE_COMMANDS = [0, 1, 2, 3, 4]
| def generateRandomChromosome(chromosomeLength): | |||||
| chrom = [] | |||||
| for i in range(0, chromosomeLength): | |||||
| chrom.append(choice(AVAILABLE_COMMANDS)) | |||||
| return chrom | |||||
| """ | |||||
| creates a random population | |||||
| """ | |||||
| def createPopulation(populationSize, chromosomeLength): | |||||
| pop = [] | |||||
| for i in range(0, populationSize): | |||||
| pop.append((0,generateRandomChromosome(chromosomeLength))) | |||||
| return pop | |||||
| """ | |||||
| computes fitness of population and sorts the array based | |||||
| on fitness | |||||
| """ | |||||
| def computeFitness(population): | |||||
| for i in range(0, len(population)): | |||||
| population[i] = (calculatePerformance(population[i][1]), population[i][1]) | |||||
| population.sort(key=lambda tup: tup[0], reverse=True) # sorts population in place | |||||
| """ | |||||
| kills the weakest portion of the population | |||||
| """ | |||||
| def selection(population, keep): | |||||
| origSize = len(population) | |||||
| for i in range(keep, origSize): | |||||
| population.remove(population[keep]) | |||||
| """ | |||||
| Uses crossover to mate two chromosomes together. | |||||
| """ | |||||
def mateBois(chrom1, chrom2):
    pivotPoint = randrange(len(chrom1))
    bb = []
    for i in range(0, pivotPoint):
        bb.append(chrom1[i])
    for i in range(pivotPoint, len(chrom2)):
        bb.append(chrom2[i])  # take the tail from the second parent
    return (0, bb)
| """ | |||||
| brings population back up to desired size of population | |||||
| using crossover mating | |||||
| """ | |||||
| def mating(population, populationSize): | |||||
| newBlood = populationSize - len(population) | |||||
| newbies = [] | |||||
| for i in range(0, newBlood): | |||||
| newbies.append(mateBois(choice(population)[1], | |||||
| choice(population)[1])) | |||||
| population.extend(newbies) | |||||
| """ | |||||
| Randomly mutates x chromosomes -- excluding best chromosome | |||||
| """ | |||||
| def mutation(population, mutationRate): | |||||
| changes = random() * mutationRate * len(population) * len(population[0][1]) | |||||
| for i in range(0, int(changes)): | |||||
| ind = randrange(len(population) -1) + 1 | |||||
| chrom = randrange(len(population[0][1])) | |||||
| population[ind][1][chrom] = choice(AVAILABLE_COMMANDS) | |||||
| """ | |||||
| Computes average score of population | |||||
| """ | |||||
| def computeAverageScore(population): | |||||
| total = 0.0 | |||||
| for c in population: | |||||
| total = total + c[0] | |||||
| return total/len(population) | |||||
| def runGeneration(population, populationSize, keep, mutationRate): | |||||
| selection(population, keep) | |||||
| mating(population, populationSize) | |||||
| mutation(population, mutationRate) | |||||
| computeFitness(population) | |||||
| """ | |||||
| Runs the genetic algorithm | |||||
| """ | |||||
| def runGeneticAlgorithm(populationSize, maxGenerations, | |||||
| chromosomeLength, keep, mutationRate): | |||||
| population = createPopulation(populationSize, chromosomeLength) | |||||
| best = [] | |||||
| average = [] | |||||
| generations = range(1, maxGenerations + 1) | |||||
| for i in range(1, maxGenerations + 1): | |||||
| print("Generation: " + str(i)) | |||||
| runGeneration(population, populationSize, keep, mutationRate) | |||||
| a = computeAverageScore(population) | |||||
| average.append(a) | |||||
| best.append(population[0][0]) | |||||
| print("Best Score: " + str(population[0][0])) | |||||
| print("Average Score: " + str(a)) | |||||
| print("Best chromosome: " + str(population[0][1])) | |||||
| print() | |||||
| pyplot.plot(generations, best, color='g', label='Best') | |||||
| pyplot.plot(generations, average, color='orange', label='Average') | |||||
| pyplot.xlabel("Generations") | |||||
| pyplot.ylabel("Score") | |||||
| pyplot.title("Training GA Algorithm") | |||||
| pyplot.legend() | |||||
| pyplot.show() | |||||
| ``` | |||||
| ## Results | |||||
|  | |||||
|  | |||||
| ``` | |||||
| Generation: 200 | |||||
| Best Score: 8090.0 | |||||
| Average Score: 2492.6666666666665 | |||||
| Best chromosome: [1, 4, 1, 4, 4, 1, 0, 4, 2, 4, 1, 3, 2, 0, 2, 0, 0, 1, 3, 0, 1, 0, 4, 0, 1, 4, 1, 2, 0, 1, 3, 1, 3, 1, 3, 1, 0, 4, 4, 1, 3, 4, 1, 1, 2, 0, 4, 3, 3, 0] | |||||
| ``` | |||||
It is impressive that a simple genetic algorithm can learn to perform
well when the seed is fixed. Compared to the random agent, which had a
max score of 3320 with a fixed seed, the optimized genetic algorithm
shattered the random agent's best performance by a factor of roughly
2.5.
| Since we trained an optimized set of actions to take to achieve a high | |||||
| score on a specific seed, what would happen if we randomized the seed? | |||||
| A test was conducted to compare the trained GA agent with 200 | |||||
| generations against the random agent. For both agents, the seed was | |||||
| randomized by setting it to the current time. | |||||
|  | |||||
| ``` | |||||
| GA Performance Trained on Fixed Seed: | |||||
| mean:2257.9 | |||||
| max:5600.0 | |||||
| min:530.0 | |||||
| sd:1018.4363455808125 | |||||
| median:2020.0 | |||||
| n:200 | |||||
| ``` | |||||
| ``` | |||||
Random Agent, Random Seed:
| mean:1079.45 | |||||
| max:2800.0 | |||||
| min:110.0 | |||||
| sd:498.9340612746338 | |||||
| median:1080.0 | |||||
| n:200 | |||||
| ``` | |||||
| ``` | |||||
| F_onewayResult( | |||||
| statistic=214.87432376234608, | |||||
| pvalue=3.289638100969386e-39 | |||||
| ) | |||||
| ``` | |||||
As expected, the GA agent did not perform as well on random seeds as
it did on the fixed seed it was trained on. However, the GA was able
to find an action sequence that statistically beats the random agent,
as shown by the score distributions above and the extremely small
p-value. Although luck played a part in the agent reaching a score of
8k on the seed of zero, the behavior it learned transferred reasonably
well to other seeds. Replaying a video of the agent, it just slowly
drifts around the screen and shoots at the asteroids in front of it.
This is a major advantage over the random agent, which tends to move
very fast and rotate spastically.
| ## Future Work | |||||
This algorithm was more or less a last-minute hack to see if I could
make a cool video of a high-scoring Asteroids agent. Future agents
using genetic algorithms could incorporate reflexes to respond
dynamically to the environment: based on where the nearest asteroids
are relative to the player, the agent could select a different
chromosome of actions to execute. This could potentially yield scores
above ten thousand if trained and implemented correctly. Future
training should also randomize the seed so that the learned skills
transfer as well as possible to other environments.
| # Deep Q-Learning Agent | |||||
| ## Introduction: | |||||
The inspiration behind attempting a reinforcement learning agent for
this problem is the original DQN paper from DeepMind, “Playing Atari
with Deep Reinforcement Learning.” That paper showed the potential of
the Deep Q-learning methodology on a variety of simulated Atari games
using one standardized architecture across all of them. Reinforcement
learning has always been of interest, and having the opportunity to
spend time learning about it while applying it in a class setting was
exciting, even if it is currently out of the scope of the class. It
has been an exciting challenge to read through and implement a
research paper and try to reproduce similar results.
Deep Q-Learning is an extension of the standard Q-Learning algorithm
in which a neural network is used to approximate the optimal
action-value function, Q\*(s, a). The action-value function outputs
the expected maximum return for taking an action in a given state and
following the policy thereafter. This works because the optimal Q
function satisfies the Bellman equation: if the optimal values of the
next state s' are known for every action a', then the optimal value of
the current state-action pair is the expected value of
r + γ·max_a' Q\*(s', a'). The reinforcement learning part comes in the
form of a neural network that approximates the optimal action-value
function by using this Bellman identity as an iterative update at
every time step.
| ## Agent architecture: | |||||
| The basis of the network architecture is a basic convolutional network | |||||
| with 2 conv layers, a fully connected layer, and then an output layer | |||||
| of 14 classes with each representing an individual action. The first | |||||
| layer consists of 16 8x8 filters and takes a stride 4 while the second | |||||
| has 32 4x4 filters and only takes a stride of 2. Following this layer, | |||||
| the filters are compressed into a 1-D representation vector of size | |||||
| 12,672 that’s passed through a fully connected layer of 256 nodes. | |||||
| All layers sans the output layer are activated using the ReLU function. | |||||
| The optimization algorithm of choice was the Adam optimizer, using a | |||||
| learning rate equal to .0001 and default betas of [.9, .99]. The | |||||
| discount factor, or gamma, related to future expected rewards was set | |||||
| at .99 and the probability of taking a random action per action step | |||||
| was linearly annealed from 1.0 down to a fixed .1 after one million | |||||
| seen frames. | |||||
|  | |||||
| ## Experience Replay: | |||||
One of the main ideas from the original paper that significantly
helped the training of this network is the Replay Buffer used during
training. To break the temporal correlation between sequential frames
and avoid biasing the network toward particular chains of situations,
a historical buffer of transitions is used to sample mini-batches at
each training step. Every time an action is taken, a tuple consisting
of the current state, the action taken, the reward gained, and the
subsequent state (s, a, r, s') is stored in the buffer. At every
training step, a mini-batch is sampled from the buffer and used to
train the network. This lets the network train on uncorrelated
transitions and hopefully generalize to the environment rather than
overfit to a string of similar actions. A minimal sketch of such a
buffer is shown below.
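The sketch uses a bounded deque; the 50k capacity matches the size used in our training (see Limitations):
```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```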
| ## Preprocessing: | |||||
One of the first issues to tackle was the high dimensionality of the
input image and how that information is duplicated when stored in the
Replay Buffer. Each observation from the environment is a
(210, 160, 3) matrix of RGB pixels. To save time and remain
computationally efficient, we needed to preprocess and reduce the
dimensionality of the observations, since a single frame stack (of
which there are two per transition) consists of (4, 3, 210, 160), or
roughly 403,000, input features.
|  | |||||
First, the images are converted to grayscale and the score/lives
section at the top of the screen is cropped out since it is irrelevant
to the network's vision. The resulting (4, 192, 160) stack is then
downsampled by taking every other pixel to (4, 96, 80), reducing the
input from roughly 403,000 features to only 30,720 - a substantial
reduction in the calculations needed while maintaining strong input
information for the network.
|  | |||||
| ## Training: | |||||
Training was conducted by modifying the main function so that a new
game starts immediately after the previous one finishes, making
continuous training of the agent easier. All environment parameters
were reset and the temporary attributes of the agent (i.e. current
state/next state) were flushed between games. For the first four
frames of a game, the bot simply gathers a stack of frames. After
that, at every step the next state is compiled, the transition tuple
is pushed onto the buffer, and a training step is performed. For the
training step, a random batch is sampled from the replay buffer and
used to compute the loss between the predicted and target Q-values,
which is then backpropagated through the network.
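A sketch of a single training step under the setup described above (one network, no separate target network); the Huber loss and the batch size of 32 are assumptions, not details from our run:
```python
import torch
import torch.nn.functional as F

GAMMA = 0.99
BATCH_SIZE = 32  # assumed; the exact batch size is not stated above

def train_step(model, optimizer, buffer):
    if len(buffer) < BATCH_SIZE:
        return
    batch = buffer.sample(BATCH_SIZE)
    states = torch.stack([torch.as_tensor(t.state, dtype=torch.float32) for t in batch])
    actions = torch.tensor([t.action for t in batch])
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(t.next_state, dtype=torch.float32) for t in batch])

    # Predicted Q-values for the actions that were actually taken
    q_pred = model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bellman targets: r + gamma * max_a' Q(s', a')
    with torch.no_grad():
        q_next = model(next_states).max(dim=1).values
    target = rewards + GAMMA * q_next

    loss = F.smooth_l1_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```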
| ## Outcome: | |||||
Unfortunately, the result of 48 hours of continuous training, 950
games played, and roughly 1.3 million frames of game footage was that
the agent converged to a suicidal policy with consistently garbage
performance.
|  | |||||
The model transitioned to the fixed 90% model-action chance around
episode 700, which is exactly where the agent starts to go awry. The
strange part is that since the random action chance is linearly
annealed over the first million frames, if the agent had been
following a garbage policy all along, we would have expected the
rewards to steadily decrease over time as the network takes more
control.
|  | |||||
Up until that point, the reward trendline projected a steady rise with
the number of episodes. Extrapolating out to 10,000 episodes
(approximately 10 million frames seen, the same amount of training as
the original DeepMind paper), the projected score is in the realm of
2,400 to 2,500 - which matches up closely with the well-tuned reflex
agent and the GA agent on a random seed.
|  | |||||
| It would’ve been exciting to see how the model compared to | |||||
| our reflex agent had it been able to train consistently up until the | |||||
| end. | |||||
| ## Limitations: | |||||
There were a fair number of limitations in the execution and training
of this model that may have contributed to the slow and unstable
training of the network. The algorithm differs from the original paper
in that the optimizer used was Adam instead of RMSProp, and the replay
buffer only held the previous 50k frames rather than the past one
million. It is possible that the weaker replay buffer was to blame, as
the model may have been continuously fed a sub-optimal set of
transitions from its last 50,000 frames, causing it to diverge so
heavily near the end.
One preprocessing issue that might have led the bot astray is not
taking the pixel-wise max over sequential frames so that each frame
includes both the asteroids and the player. Since the Atari (and by
extension this environment simulation) does not render the asteroids
and the player sprite in the same frame, it is possible that the
network was unable to extract any coherent connection between the
alternating frames.
Regarding optimizations built on top of the original DeepMind DQN
algorithm, we did not use separate policy and target networks during
training. In the original algorithm, convergence is unstable because
the network that produces the targets is the same network whose
weights shift at every update; it is hard to converge to something
that is continually moving, which leads to very noisy and unstable
training. One proposed optimization is to maintain a policy network
and a target network: at every timestep the policy network's weights
are updated with the calculated gradients, while the target network is
held fixed for a number of steps. This keeps the target still while
the policy network converges toward it, leading to more stable and
guided training.
Perhaps the largest limitation was the computational power available
for training. The network was trained on a single GTX1060ti GPU, so
single episodes took a few minutes to complete. It would have taken an
incredibly long time to hit 10 million frames seen, as even 1.3
million took approximately 48 hours. It is probable that our
implementation is inefficient in its calculations; however, it is a
well-known limitation of RL that it is time- and compute-intensive.
| ## Deep Q Conclusions: | |||||
This was a fun agent and algorithm to implement, even if at present it
has given little to nothing back in terms of performance. The plan is
to continue testing and training the agent, even after the deadline.
Reinforcement learning is complicated and hard to debug, but it is
similarly an exciting challenge because of its potential for solving
difficult problems.
| # Conclusion | |||||
This project demonstrated how fun it can be to train AI agents to play
video games. Although none of our agents are earth-shatteringly
amazing, we were able to use statistical measures to determine that
the reflex and GA agents outperform the random agent. The GA agent and
the convolutional neural network show promise, and future work could
drastically improve their results.