I worked on this project during Dr. Homans's RIT CSCI-331 class.
# Introduction
This project explores the beautiful and frustrating ways in which we can use AI to build systems that solve problems. Asteroids is a perfect AI learning problem because it is difficult for humans to play and has open-source frameworks that emulate the environment. Using the OpenAI Gym framework, we developed several agents that play Asteroids using various heuristics and ML techniques. We then created a testbed to run experiments that determine statistically whether our custom agents outperform a random agent.
# Methods and Results
Four agents were developed to play Asteroids: a random baseline, a reflex agent, a genetic algorithm agent, and a Deep Q-Learning agent. This report is broken into sections where each agent is explained and its performance is analyzed.
# Random Agent
The random agent simply takes a random action from the action space. The resulting agent randomly spins around and shoots at asteroids. Although this random agent is easy to implement, it is ineffective because moving spastically causes it to crash into asteroids. Using this agent as the baseline for performance, we can assess whether our other agents are better than random key smashing -- which is my strategy for playing Smash.
```python
"""
ACTION_MEANING = {
    0: "NOOP",
    1: "FIRE",
    2: "UP",
    3: "RIGHT",
    4: "LEFT",
    5: "DOWN",
    6: "UPRIGHT",
    7: "UPLEFT",
    8: "DOWNRIGHT",
    9: "DOWNLEFT",
    10: "UPFIRE",
    11: "RIGHTFIRE",
    12: "LEFTFIRE",
    13: "DOWNFIRE",
    14: "UPRIGHTFIRE",
    15: "UPLEFTFIRE",
    16: "DOWNRIGHTFIRE",
    17: "DOWNLEFTFIRE",
}
"""
def act(self, observation, reward, done):
    return self.action_space.sample()
```
## Test on the Environment Seed
It is always important to know how randomness affects the results of your experiment. This agent has two sources of randomness: the seed given to the Gym environment and the random function used to select an action. By default, the seed of the Gym environment is set to zero, which is useful for testing because a deterministic agent will always get the same results. We can seed the environment with the current time to add more randomness. However, this begs the question: to what extent does the added randomness change the scores of the game? Certain seeds in the Gym environment may make the game much easier or harder to play, thus altering the distribution of scores.
A test was devised to compare the scores of the environment under both a fixed seed and a time-based seed. 300 trials of the random agent were run in each type of seeded environment.
![Seed Effect](media/asteroids/randomSeed.png)
```
Random Agent Time Seed:
mean:1005.6333333333333
max:3220.0
min:110.0
sd:478.32548077178114
median:980.0
n:300

Random Agent Fixed Seed:
mean:1049.3666666666666
max:3320.0
min:110.0
sd:485.90321281323327
median:1080.0
n:300
```
What is astonishing is that the two distributions are nearly identical in every way. Although the means differ slightly, there is no apparent difference between the score distributions. One might expect the added randomness to at least change the variance of the scores, but that did not happen.
```
Random agent vs Random fixed seed
F_onewayResult(
    statistic=1.2300971733588375,
    pvalue=0.2678339696597312
)
```
With such a high p-value, we cannot reject the null hypothesis that the two distributions are the same; in other words, there is no evidence that they differ. This is a powerful conclusion because it lets us run future experiments knowing that, on average, a specific seed will not have a statistically significant impact on the performance of a random agent. However, this finding does not tell us what impact the seed has on a fully deterministic agent. It is still possible that a fully deterministic agent will score differently on different environment seeds.
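For illustration, the seed comparison above could be scripted roughly as follows; the environment id, the helper name, and the exact trial loop are assumptions rather than the project's exact testbed code.
```python
import time
import gym
from scipy.stats import f_oneway

def run_episode(env, seed):
    """Run one full game with a random agent and return the final score."""
    env.seed(seed)
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(env.action_space.sample())
        total += reward
    return total

env = gym.make("Asteroids-v0")
fixed = [run_episode(env, seed=0) for _ in range(300)]
timed = [run_episode(env, seed=int(time.time())) for _ in range(300)]
# one-way ANOVA: a large p-value means we cannot conclude the seeds change the scores
print(f_oneway(fixed, timed))
```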
# Reflex Agent
Our reflex agent observes the environment and decides what to do based on a simple rule set. The reflex agent is broken into three sections: feature extraction, reflex rules, and performance.
## Feature Extraction
The largest part of this agent was devoted to parsing the environment into a more usable form. Feature extraction for this project was rather difficult since the environment is given as a pixel array and the screen alternates between rendering the asteroids and the player. To achieve good performance with a minimal amount of algorithmic engineering, the reflex agent parses three things from the environment: the player's position, the player's direction, and the closest asteroid.
### 1: Player Position
Finding the position of the player was relatively easy since you only have to scan the environment for pixels of a certain RGB value. To account for the flashing environment, the position is stored in a field of the class so that it persists between action loops. The position of the player is only updated when a new player sprite is observed.
```python
AGENT_RGB = [240, 128, 128]
```
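As a rough illustration of that scan, something like the following NumPy pass can recover the player's position; the helper name is ours and the real implementation may differ.
```python
import numpy as np

AGENT_RGB = [240, 128, 128]

def find_player(observation):
    """Return the (row, col) of the player sprite, or None if it is not drawn this frame."""
    mask = np.all(observation == AGENT_RGB, axis=-1)  # pixels matching the player's color
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None  # player not rendered this frame; keep the last known position
    return int(rows.mean()), int(cols.mean())
```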
### 2: Player Direction
Detecting the direction of the player is difficult if you only go off the RGB values of the player sprite. When the player is upright it is straightforward, but when the player is rotated sideways things get much harder.
```python
action_sequence = [3, 3, 3, 3, 3, 0, 0, 0]

class Agent(object):
    def __init__(self, action_space):
        self.action_space = action_space

    # Defines how the agent should act
    def act(self, observation, reward, done):
        if len(action_sequence) > 0:
            action = action_sequence[0]
            action_sequence.remove(action)
            return action
        return 0
```
![Starting Position](media/asteroids/starting.png)
![4 Turns Right](media/asteroids/4right.png)
![5 Turns Right](media/asteroids/5right.png)
We created a basic script to observe what the player does when given a specific sequence of actions. I was pleased to find that exactly 5 turns to the left or right corresponds to a perfect 90 degrees. By keeping track of our rotation according to the actions we have taken, we can precisely track our current direction without parsing the horrendous pixel array when the player is sideways.
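A minimal sketch of that bookkeeping is shown below; the class and method names are illustrative, but the pi/10 step per turn matches the updateDirection(math.pi/10) calls in the reflex agent later on.
```python
import math

TURN_ANGLE = math.pi / 10  # 5 turns = 90 degrees, so one turn is 18 degrees

class DirectionTracker:
    def __init__(self):
        self.current_direction = math.pi / 2  # assumed starting orientation: ship pointing up

    def update_direction(self, delta):
        """Add a turn (+ for left, - for right) and keep the angle in (-pi, pi]."""
        self.current_direction += delta
        self.current_direction = math.atan2(math.sin(self.current_direction),
                                            math.cos(self.current_direction))
```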
### 3: Position of Closest Asteroid
Asteroids were detected as any pixel that was not empty (0, 0, 0) and not the player (240, 128, 128). Using a single pass through the environment matrix, we detect the asteroid closest to the last known position of the player.
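A single-pass scan in that spirit could look roughly like this; the squared-pixel distance metric and helper name are assumptions.
```python
import numpy as np

EMPTY_RGB = [0, 0, 0]
AGENT_RGB = [240, 128, 128]

def find_closest_asteroid(observation, player_row, player_col):
    """Return the (row, col) of the asteroid pixel closest to the player, or None."""
    not_empty = np.any(observation != EMPTY_RGB, axis=-1)
    not_player = np.any(observation != AGENT_RGB, axis=-1)
    rows, cols = np.nonzero(not_empty & not_player)
    if rows.size == 0:
        return None
    distances = (rows - player_row) ** 2 + (cols - player_col) ** 2
    closest = int(np.argmin(distances))
    return int(rows[closest]), int(cols[closest])
```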
## Agent Reflex
Based on my actual strategy for Asteroids, this agent stays in the middle of the screen and shoots at the closest asteroid to it.
```python
def act(self, observation, reward, done):
    observation = np.array(observation)
    self.updateState(observation)
    dirOfAstroid = math.atan2(self.closestRow - self.row, self.closestCol - self.col)
    dirOfAstroid = self.deWarpAngle(dirOfAstroid)
    self.shotLast = not self.shotLast
    if self.shotLast:
        return 1  # fire
    if self.currentDirection - dirOfAstroid < 0:
        self.updateDirection(math.pi/10)
        if self.shotLast:
            return 12  # left fire
        return 4  # left
    else:
        self.updateDirection(-1*math.pi/10)
        return 3  # right
```
Despite being a simple agent, it performs well since it can shoot asteroids before they hit it.
## Results of Reflex Agent
In this trial, 200 runs of both the random agent and the reflex agent were observed while setting the seed of the environment to the current time. The seed was randomized in this scenario because the reflex agent is fully deterministic and would otherwise perform identically in every trial.
![histogram](media/asteroids/reflexPerformance.png)
The histogram shows that the reflex agent on average performs significantly better than the random agent. What is fascinating is that even though the agent's actions are deterministic, the seed of the environment creates a large amount of variance in the observed scores. It is arguably misleading to report a single score as an agent's performance because the environment seed has a large impact even on a non-random agent's scores.
```
Reflex Agent:
mean:2385.25
max:8110.0
min:530.0
sd:1066.217115553863
median:2250.0
n:200

Random Agent:
mean:976.15
max:2030.0
min:110.0
sd:425.2712987023695
median:980.0
n:200
```
One interesting thing about comparing the two distributions is that the reflex agent has a much larger standard deviation in its scores than the random agent. It is also notable that the reflex agent's worst performance was significantly better than the random agent's worst performance, and the best performance of the reflex agent shatters the best performance of the random agent.
```
Random agent vs reflex
F_onewayResult(
    statistic=299.86689786081956,
    pvalue=1.777062051091977e-50
)
```
Since we took such a large sample size of two hundred, and the populations were significantly different, we got a p-value of nearly zero (1.77e-50). With a p-value like this, we can say with nearly 100% confidence (with rounding) that these two populations are different and that the reflex agent outperforms the random agent.
# Genetic Algorithm
Genetic algorithms employ the same tactics used in natural selection to find an optimal solution to an optimization problem. They are often used in high-dimensional problems where the optimal solutions are not apparent, and they are commonly used to tune the hyper-parameters of a program. However, the algorithm can be used in any scenario where you have a function that defines how good a solution is.
In the scenario of Asteroids, we can employ a genetic algorithm to find the sequence of moves that achieves the highest score possible. The chromosomes are well defined as the sequence of actions to loop through, and the fitness function is simply the score that the agent achieves.
## Algorithm Implementation
The actual implementation of the genetic algorithm was pretty straightforward: the agent simply loops through a sequence of actions where each action represents a gene on the chromosome.
```python
class Agent(object):
    """Very Basic GA Agent"""

    def __init__(self, action_space, chromosome):
        self.action_space = action_space
        self.chromosome = chromosome
        self.index = 0

    # loop through the chromosome, returning one action per call
    def act(self, observation, reward, done):
        if self.index >= len(self.chromosome) - 1:
            self.index = 0
        else:
            self.index = self.index + 1
        return self.chromosome[self.index]
```
Rather than using a library, a simple home-brewed genetic algorithm was written from scratch. The basic algorithm is a loop that runs the functions needed to iterate through each generation. Each generation can be broken into a few steps:
- selection: removes the worst-performing chromosomes
- mating: uses crossover to create new chromosomes
- mutation: adds randomness to the chromosomes
- fitness: evaluates the performance of each chromosome

In roughly 100 lines of Python, a basic genetic algorithm was crafted.
```python
from random import choice, randrange, random

from matplotlib import pyplot

AVAILABLE_COMMANDS = [0, 1, 2, 3, 4]

def generateRandomChromosome(chromosomeLength):
    chrom = []
    for i in range(0, chromosomeLength):
        chrom.append(choice(AVAILABLE_COMMANDS))
    return chrom

"""
creates a random population
"""
def createPopulation(populationSize, chromosomeLength):
    pop = []
    for i in range(0, populationSize):
        pop.append((0, generateRandomChromosome(chromosomeLength)))
    return pop

"""
computes fitness of population and sorts the array based
on fitness
"""
def computeFitness(population):
    for i in range(0, len(population)):
        # calculatePerformance runs the chromosome in the gym environment and returns its score
        population[i] = (calculatePerformance(population[i][1]), population[i][1])
    population.sort(key=lambda tup: tup[0], reverse=True)  # sorts population in place

"""
kills the weakest portion of the population
"""
def selection(population, keep):
    origSize = len(population)
    for i in range(keep, origSize):
        population.remove(population[keep])

"""
Uses crossover to mate two chromosomes together.
"""
def mateBois(chrom1, chrom2):
    pivotPoint = randrange(len(chrom1))
    bb = []
    for i in range(0, pivotPoint):
        bb.append(chrom1[i])
    for i in range(pivotPoint, len(chrom2)):
        bb.append(chrom2[i])
    return (0, bb)

"""
brings population back up to desired size of population
using crossover mating
"""
def mating(population, populationSize):
    newBlood = populationSize - len(population)
    newbies = []
    for i in range(0, newBlood):
        newbies.append(mateBois(choice(population)[1],
                                choice(population)[1]))
    population.extend(newbies)

"""
Randomly mutates x chromosomes -- excluding best chromosome
"""
def mutation(population, mutationRate):
    changes = random() * mutationRate * len(population) * len(population[0][1])
    for i in range(0, int(changes)):
        ind = randrange(len(population) - 1) + 1
        chrom = randrange(len(population[0][1]))
        population[ind][1][chrom] = choice(AVAILABLE_COMMANDS)

"""
Computes average score of population
"""
def computeAverageScore(population):
    total = 0.0
    for c in population:
        total = total + c[0]
    return total / len(population)

def runGeneration(population, populationSize, keep, mutationRate):
    selection(population, keep)
    mating(population, populationSize)
    mutation(population, mutationRate)
    computeFitness(population)

"""
Runs the genetic algorithm
"""
def runGeneticAlgorithm(populationSize, maxGenerations,
                        chromosomeLength, keep, mutationRate):
    population = createPopulation(populationSize, chromosomeLength)
    best = []
    average = []
    generations = range(1, maxGenerations + 1)
    for i in range(1, maxGenerations + 1):
        print("Generation: " + str(i))
        runGeneration(population, populationSize, keep, mutationRate)
        a = computeAverageScore(population)
        average.append(a)
        best.append(population[0][0])
        print("Best Score: " + str(population[0][0]))
        print("Average Score: " + str(a))
        print("Best chromosome: " + str(population[0][1]))
        print()
    pyplot.plot(generations, best, color='g', label='Best')
    pyplot.plot(generations, average, color='orange', label='Average')
    pyplot.xlabel("Generations")
    pyplot.ylabel("Score")
    pyplot.title("Training GA Algorithm")
    pyplot.legend()
    pyplot.show()
```
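The code above relies on a calculatePerformance fitness function that is not shown. A plausible sketch, assuming the Gym environment and the GA Agent class from earlier, would run the chromosome through one game and return the score:
```python
import gym

def calculatePerformance(chromosome):
    """Fitness of a chromosome: the score it earns over one game of Asteroids."""
    env = gym.make("Asteroids-v0")
    env.seed(0)  # training was done against the fixed default seed
    agent = Agent(env.action_space, chromosome)
    observation = env.reset()
    reward, done, total = 0.0, False, 0.0
    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, _ = env.step(action)
        total += reward
    env.close()
    return total
```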
## Results
![training](media/asteroids/GA50.png)
![training](media/asteroids/GA200.png)
```
Generation: 200
Best Score: 8090.0
Average Score: 2492.6666666666665
Best chromosome: [1, 4, 1, 4, 4, 1, 0, 4, 2, 4, 1, 3, 2, 0, 2, 0, 0, 1, 3, 0, 1, 0, 4, 0, 1, 4, 1, 2, 0, 1, 3, 1, 3, 1, 3, 1, 0, 4, 4, 1, 3, 4, 1, 1, 2, 0, 4, 3, 3, 0]
```
It is impressive that a simple genetic algorithm can learn to perform well when the seed is fixed. Compared to the random agent, which had a max score of 3320 with a fixed seed, the optimized genetic algorithm shattered the random agent's best performance by a factor of roughly 2.5.
Since we trained an optimized set of actions to achieve a high score on a specific seed, what would happen if we randomized the seed? A test was conducted to compare the GA agent trained for 200 generations against the random agent. For both agents, the seed was randomized by setting it to the current time.
![200 Trials GA Random Seed](media/asteroids/GAvsRandom.png)
```
GA Performance Trained on Fixed Seed:
mean:2257.9
max:5600.0
min:530.0
sd:1018.4363455808125
median:2020.0
n:200
```
```
Random Agent, Random Seed:
mean:1079.45
max:2800.0
min:110.0
sd:498.9340612746338
median:1080.0
n:200
```
```
F_onewayResult(
    statistic=214.87432376234608,
    pvalue=3.289638100969386e-39
)
```
As expected, the GA agent did not perform as well on random seeds as it did on the fixed seed it was trained on. However, the GA was able to find an action sequence that statistically beats the random agent, as seen in the score distributions above and the extremely small p-value. Although luck was part of getting the agent to a score of 8k on the seed of zero, the skill it learned was somewhat applicable to other seeds. Replaying a video of the agent, it just slowly drifts around the screen and shoots at the asteroids in front of it. This is a major advantage over the random agent, which tends to move very fast and rotate spastically.
## Future Work
This algorithm was more or less a last-minute hack to see if I could make a cool video of a high-scoring Asteroids agent. Future agents using genetic algorithms would incorporate reflexes to dynamically respond to the environment. Based on which direction asteroids are in proximity to the player, the agent could select a different chromosome of actions to execute. This could potentially yield scores above ten thousand if trained and implemented correctly. Future training should also incorporate randomness in the seed so that the skills learned transfer better to other random environments.
# Deep Q-Learning Agent
## Introduction:
The inspiration behind attempting a reinforcement learning agent for this problem is the original DQN paper from DeepMind, “Playing Atari with Deep Reinforcement Learning.” That paper showed the potential of applying the Deep Q-Learning methodology to a variety of simulated Atari games using one standardized architecture across all of them. Reinforcement learning has always been of interest, and having the opportunity to spend time learning about it while applying it in a class setting was exciting, even if it is out of the scope of the class presently. It has been an exciting challenge to read through and implement a research paper to get similar results.
Deep Q-Learning is an extension of the standard Q-Learning algorithm in which a neural network is used to approximate the optimal action-value function, Q\*(s,a). The action-value function outputs the expected maximum reward given a state and a policy mapping states to actions or distributions over actions. Logically, this works because the optimal Q function obeys the Bellman equation identity: if the optimal action-values for the next state s' are known, then the optimal value of the current action follows by maximizing the expected value of r + γQ\*(s',a') over the next action a'. Thus, the reinforcement learning part comes in the form of a neural network approximating the optimal action-value function by using the Bellman equation identity as an iterative update at every time step.
## Agent architecture:
The basis of the network architecture is a basic convolutional network with 2 conv layers, a fully connected layer, and an output layer of 14 classes, each representing an individual action. The first layer consists of 16 8x8 filters with a stride of 4, while the second has 32 4x4 filters and a stride of 2. Following this layer, the filters are flattened into a 1-D representation vector of size 12,672 that is passed through a fully connected layer of 256 nodes. All layers except the output layer are activated using the ReLU function.
The optimization algorithm of choice was the Adam optimizer with a learning rate of .0001 and default betas of [.9, .99]. The discount factor, or gamma, applied to future expected rewards was set at .99, and the probability of taking a random action at each step was linearly annealed from 1.0 down to a fixed .1 over the first one million frames seen.
![Layer code](media/asteroids/code.png)
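Since the layer code is only shown as an image above, here is a hedged PyTorch sketch of a network matching that description. The input shape here is an assumption, so the flattened size is computed dynamically instead of being hard-coded.
```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Two conv layers, a 256-node fully connected layer, and one output per action."""

    def __init__(self, in_shape=(4, 96, 80), num_actions=14):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_shape[0], 16, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Flatten(),
        )
        # infer the flattened size from a dummy pass so the sketch stays self-consistent
        with torch.no_grad():
            flat = self.features(torch.zeros(1, *in_shape)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(flat, 256),
            nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, x):
        return self.head(self.features(x))

net = DQN()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)  # learning rate from the report
```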
## Experience Replay:
One of the main points in the original paper that significantly helped the training of this network is the introduction of a replay buffer used during training. To break the temporal correlation between sequential frames and avoid biasing the training of the network toward certain chains of situations, a historical buffer of transitions is used to sample mini-batches to train on at each time step. Every time an action is taken, a tuple consisting of the current state, the action taken, the reward gained, and the subsequent state (s, a, r, s') is stored in the buffer. At every training step, a mini-batch is sampled from the buffer and used to train the network. This allows the network to be trained on non-correlated transitions and hopefully generalize to the environment rather than being biased toward a string of similar actions.
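A minimal sketch of such a buffer is shown below; the 50k capacity matches the limitation noted later, while the class and field names are illustrative.
```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s') transitions sampled uniformly for training."""

    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```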
## Preprocessing:
One of the first issues that had to be tackled was the high dimensionality of the input image and how that information was duplicated in the replay buffer. Each observation given by the environment is a (210, 160, 3) matrix representing the RGB pixels of the frame. To be time and computationally efficient, we needed to preprocess and reduce the dimensionality of the observations, since a single frame stack (of which there are two per transition) consists of (4, 3, 210, 160), or roughly 403,000 input features that would have to be dealt with.
![Before Processing](media/asteroids/dqn_before.png)
First, images are converted to grayscale and the reward/number-of-lives section at the top of the screen is cropped out since it is irrelevant to the network's vision. The resulting (4, 192, 160) stack is then downsampled by taking every other pixel, giving (4, 96, 80) and reducing the input from roughly 403,000 features to only 30,720 -- a substantial reduction in the required computation while maintaining strong input information for the network.
![After Processing](media/asteroids/dqn_after.png)
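A sketch of that preprocessing is shown below; the exact crop offset is an assumption (the report only states that the result is 192 rows tall before downsampling).
```python
import numpy as np

def preprocess(frame):
    """Turn one (210, 160, 3) RGB frame into a (96, 80) grayscale image."""
    gray = frame.mean(axis=2).astype(np.uint8)  # collapse the RGB channels to grayscale
    cropped = gray[18:]                         # assumed crop of the score/lives strip -> 192 rows
    return cropped[::2, ::2]                    # keep every other pixel -> (96, 80)

def stack_frames(frames):
    """Stack the four most recent preprocessed frames into one (4, 96, 80) state."""
    return np.stack([preprocess(f) for f in frames], axis=0)
```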
## Training:
Training for the bot was conducted by modifying the main function to allow games to immediately restart after one finished, making continuous training of the agent easier. All the environment parameters were reset and the temporary attributes of the agent (i.e. current state/next state) were flushed. For the first four frames of a game, the bot just gathers a stack of frames. After that, at every time step the next state is compiled, the transition tuple is pushed onto the buffer, and a training step is run for the agent. For the training step, a random batch is grabbed from the replay buffer and used to calculate the loss between the actual and expected Q-values, which is then used to calculate the gradients for backpropagation through the network.
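A sketch of that training step, reusing the DQN and ReplayBuffer sketches above, is given below; the Huber loss and the batch size of 32 are assumptions, since the report does not state them.
```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor from the report

def train_step(net, optimizer, replay_buffer, batch_size=32):
    """One update: regress Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    if len(replay_buffer) < batch_size:
        return
    batch = replay_buffer.sample(batch_size)
    states = torch.stack([torch.as_tensor(t.state, dtype=torch.float32) for t in batch])
    actions = torch.tensor([t.action for t in batch], dtype=torch.int64)
    rewards = torch.tensor([t.reward for t in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(t.next_state, dtype=torch.float32) for t in batch])
    dones = torch.tensor([t.done for t in batch], dtype=torch.float32)

    # Q-values the network currently predicts for the actions that were actually taken
    q_values = net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # Bellman targets; terminal transitions contribute only their reward
    with torch.no_grad():
        targets = rewards + GAMMA * net(next_states).max(dim=1).values * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```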
## Outcome:
Unfortunately, the result of 48 hours of continuous training, 950 games played, and roughly 1.3 million frames of game footage seen was that the agent converged to a suicidal policy that produced consistently garbage performance.
![Reward](media/asteroids/reward.png)
The model transitioned to the fixed 90% model-action chance (the annealed random-action probability hitting its floor of .1) around episode 700, which is exactly where the agent starts to go awry. The strange part is that, since the random-action chance is linearly annealed over the first million frames, if the agent had been following a garbage policy all along we would expect the rewards to steadily decrease over time as the network takes more control.
![Real Trendline](media/asteroids/reward_trendline.png)
Up until that point, the reward trendline projected a steady rise with the number of episodes. Extrapolating this out to 10,000 episodes (approximately 10 million frames seen, the same amount of training as in the original DeepMind paper), the projected score is in the range of 2,400 to 2,500 -- which matches up closely with the well-tuned reflex agent and the GA agent on a random seed.
![Endgame Trendline](media/asteroids/reward_projection.png)
It would have been exciting to see how the model compared to our reflex agent had it been able to train consistently until the end.
## Limitations:
There were a fair number of limitations present in the execution and training of this model that possibly contributed to the slow and unstable training of the network. The differences from the original paper's algorithm are that the optimizer used was Adam instead of RMSProp and that the replay buffer only held the previous 50k frames rather than the past one million. It is possible that the weaker replay buffer was to blame, as the model could have been continuously fed a sub-optimal set of transitions from its past 50,000 frames that caused it to diverge so heavily near the end.
One issue in preprocessing that might have led the bot astray is not using the pixelwise max over sequential frames so that each processed frame includes both the asteroids and the player. Since the Atari (and by extension, this environment simulation) doesn't render the asteroids and the player sprite in the same frame, it is possible that the network was unable to extract any coherent connection between the alternating frames.
Regarding optimizations built on top of the original DeepMind DQN algorithm, we did not use separate policy and target networks in training. In the original algorithm, converging to the target is unstable because the network producing the targets is the same network whose weights shift continuously during training. It is hard for the network to converge to something that changes at every time step, and this leads to very noisy and unstable training. One optimization that has been proposed for DQN is to have a policy network and a target network: at every time step, the policy network's weights are updated with the calculated gradients, while the target network is held fixed for a number of steps. This lets the target stay still for a while as the network converges toward it, leading to more stable and guided training.
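A rough sketch of that fix, reusing the DQN class from the earlier sketch, is shown below; the update interval is an assumed value. The only other change would be computing the Bellman targets with target_net instead of the policy network in the training step.
```python
import copy

policy_net = DQN()
target_net = copy.deepcopy(policy_net)  # frozen copy used only to compute targets
target_net.eval()

TARGET_UPDATE_EVERY = 1000  # assumed number of steps the target network is held fixed

def maybe_sync_target(step):
    """Copy the policy weights into the target network every TARGET_UPDATE_EVERY steps."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```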
Perhaps the largest limitation in training was the computational power available. The network was trained on a single GTX1060ti GPU, which led to single episodes taking a few minutes to complete. It would have taken an incredibly long time to hit 10 million seen frames, as even 1.3 million took approximately 48 hours. It is probable that our implementation is inefficient in its calculations; however, it is a well-known limitation of RL that it is time- and computation-intensive.
## Deep Q Conclusions:
This was a fun agent and algorithm to implement, even if at present it has given little to no results in terms of performance. The plan is to continue testing and training the agent, even after the deadline. Reinforcement learning is a complicated and hard-to-debug environment, but similarly an exciting challenge due to its potential for solving and overcoming problems.
# Conclusion
This project demonstrated how fun it can be to train AI agents to play video games. Although none of our agents are earth-shatteringly amazing, we were able to use statistical measures to determine that the reflex and GA agents outperformed the random agent. The GA agent and the convolutional neural network show a lot of promise, and future work could drastically improve their results.