|
|
|
|
|
I worked on this project during Dr. Homans's RIT CSCI-331 class.
|
|
|
|
|
|
|
# Introduction
|
|
|
|
|
|
|
This project explores the beautiful and frustrating ways in which we can use AI to develop systems that solve problems. Asteroids is a great learning problem for AI because it is difficult for humans to play and has open-source frameworks that can emulate the environment. Using the OpenAI Gym framework, we developed several AI agents that play Asteroids using various heuristics and ML techniques. We then created a testbed to run experiments that determine statistically whether our custom agents outperform the random agent.
|
|
|
|
|
|
|
|
|
|
|
# Methods and Results
|
|
|
|
|
|
|
Three agents were developed to play Asteroids. This report is broken into segments where each agent is explained and its performance is analyzed.
|
|
|
|
|
|
|
# Random Agent
|
|
|
|
|
|
|
The random agent simply takes a random action from the action space. The resulting agent randomly spins around and shoots at asteroids. Although this random agent is easy to implement, it is ineffective because moving spastically causes you to crash into asteroids. Using this agent as the baseline, we can assess whether our other agents are better than random key smashing -- which is my strategy for playing Smash.
|
|
|
|
|
|
|
|
|
|
|
```python
"""
ACTION_MEANING = {
    0: "NOOP",
    1: "FIRE",
    2: "UP",
    3: "RIGHT",
    4: "LEFT",
    5: "DOWN",
    6: "UPRIGHT",
    7: "UPLEFT",
    8: "DOWNRIGHT",
    9: "DOWNLEFT",
    10: "UPFIRE",
    11: "RIGHTFIRE",
    12: "LEFTFIRE",
    13: "DOWNFIRE",
    14: "UPRIGHTFIRE",
    15: "UPLEFTFIRE",
    16: "DOWNRIGHTFIRE",
    17: "DOWNLEFTFIRE",
}
"""
def act(self, observation, reward, done):
    return self.action_space.sample()
```
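
For context, the testbed runs an agent for full episodes and records the final score. A minimal evaluation loop might look like the sketch below (classic Gym API; the environment name `Asteroids-v0` and the episode handling are assumptions, not our exact testbed code).

```python
import gym

# Minimal evaluation-loop sketch (classic Gym API assumed).
def run_episode(agent, env_name="Asteroids-v0", seed=0):
    env = gym.make(env_name)
    env.seed(seed)
    observation, reward, done, total = env.reset(), 0.0, False, 0.0
    while not done:
        action = agent.act(observation, reward, done)
        observation, reward, done, _ = env.step(action)
        total += reward
    env.close()
    return total
```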
|
|
|
|
|
|
|
## Test on the Environment Seed
|
|
|
|
|
|
|
It is always important to know how randomness affects the results of your experiment. For this agent there are two sources of randomness: the seed given to the Gym environment and the random function used to select an action. By default, the seed of the Gym library is set to zero. This is useful for testing because if your agent is deterministic, you will always get the same results. We can seed the environment with the current time to add more randomness. However, this begs the question: to what extent does the added randomness change the scores of the game? Certain seeds in the Gym environment may make the game much easier or harder to play, thus altering the distribution of the score.
|
|
|
|
|
|
|
|
|
|
|
A test was devised to compare the scores of the environment with both a fixed seed and a time-based seed. 300 trials of the random agent were run in each type of seeded environment.
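
As a sketch, seeding the environment either way is a one-liner with the classic Gym API (the environment name below is an assumption):

```python
import time
import gym

env = gym.make("Asteroids-v0")
env.seed(0)                  # fixed seed: the default, reproducible asteroid field
env.seed(int(time.time()))   # time-based seed: varies the field between runs
```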
|
|
|
|
|
|
|
![Seed Effect](media/asteroids/randomSeed.png)
|
|
|
|
|
|
|
|
|
|
|
```
Random Agent Time Seed:
mean:1005.6333333333333
max:3220.0
min:110.0
sd:478.32548077178114
median:980.0
n:300

Random Agent Fixed Seed:
mean:1049.3666666666666
max:3320.0
min:110.0
sd:485.90321281323327
median:1080.0
n:300
```
|
|
|
|
|
|
|
What is astonishing is that the two distributions are nearly identical in every way. Although the means are slightly different, there appears to be no meaningful difference between the distributions of scores. One might expect that the added randomness would at least change the variance of the scores, but that did not happen.
|
|
|
|
|
|
|
```
Random agent vs Random fixed seed
F_onewayResult(
    statistic=1.2300971733588375,
    pvalue=0.2678339696597312
)
```
|
|
|
|
|
|
|
With such a high p-value, we cannot reject the null hypothesis that these two distributions have the same mean; there is no evidence that the seeding strategy changes the scores. This is a powerful conclusion because it allows us to run future experiments knowing that a specific seed, on average, will not have a statistically significant impact on the performance of a random agent. However, this finding does not tell us about the impact that the seed has on a fully deterministic agent. It is still possible that a fully deterministic agent will score differently on different environment seeds.
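
For reference, the comparisons reported throughout this write-up use a one-way ANOVA, which SciPy provides directly; the scores below are hypothetical placeholders, not our recorded data.

```python
from scipy.stats import f_oneway

# Hypothetical placeholder scores standing in for the two 300-trial runs.
time_seed_scores = [980.0, 1110.0, 760.0, 1320.0, 890.0]
fixed_seed_scores = [1080.0, 940.0, 1210.0, 870.0, 1150.0]

result = f_oneway(time_seed_scores, fixed_seed_scores)
print(result.statistic, result.pvalue)
```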
|
|
|
|
|
|
|
|
|
|
|
# Reflex Agent
|
|
|
|
|
|
|
Our reflex agent observes the environment and decides what to do based on a simple rule set. The reflex agent is broken into three sections: feature extraction, reflex rules, and performance.
|
|
|
|
|
|
|
## Feature Extraction
|
|
|
|
|
|
|
The largest part of this agent was devoted to parsing the environment into a more usable form. Feature extraction for this project was rather difficult since the environment was given as a pixel array and the screen alternated between rendering the asteroids and the player on successive frames. To achieve the best performance with minimal algorithmic engineering, this reflex agent parsed three things from the environment: the player's position, the player's direction, and the closest asteroid.
|
|
|
|
|
|
|
|
|
|
|
### 1: Player Position
|
|
|
|
|
|
|
Finding the position of the player was relatively easy since you only have to scan the environment for pixels of certain RGB values. To account for the flashing environment, the position is stored in fields of the class so that it persists between action loops. The position of the player is only updated when a new player pixel is observed.
|
|
|
|
|
|
|
```python
AGENT_RGB = [240, 128, 128]
```
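
A minimal sketch of that pixel scan is shown below; the helper name and the NumPy approach are illustrative, not our exact implementation.

```python
import numpy as np

AGENT_RGB = [240, 128, 128]

# Return the (row, col) of a pixel matching the player's color, or None if the
# player is not drawn on this frame (in which case the last known position is kept).
def find_player(observation):
    matches = np.argwhere(np.all(np.array(observation) == AGENT_RGB, axis=-1))
    if len(matches) == 0:
        return None
    return tuple(matches[0])
```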
|
|
|
|
|
|
|
### 2: Player Direction
|
|
|
|
|
|
|
Detecting the direction of the player is difficult if you only go off the RGB values of the player sprite. When the player is upright it is straightforward, but when the player is rotated sideways things get much harder.
|
|
|
|
|
|
|
|
|
|
|
```python
action_sequence = [3, 3, 3, 3, 3, 0, 0, 0]

class Agent(object):
    def __init__(self, action_space):
        self.action_space = action_space

    # Defines how the agent should act
    def act(self, observation, reward, done):
        if len(action_sequence) > 0:
            action = action_sequence[0]
            action_sequence.remove(action)
            return action
        return 0
```
|
|
|
|
|
|
|
![Starting Position](media/asteroids/starting.png)
|
|
|
|
|
|
|
|
|
|
|
![4 Turns Right](media/asteroids/4right.png)
|
|
|
|
|
|
|
|
|
|
|
![5 Turns Right](media/asteroids/5right.png)
|
|
|
|
|
|
|
|
|
|
|
We created a basic script to observe what the player does when given a specific sequence of actions. I was pleased to find that exactly 5 turns to the left or right corresponds to a perfect 90 degrees. By keeping track of our current rotation according to the actions we have taken, we can precisely track our current rotational direction without parsing the horrendous pixel array when the player is sideways.
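
A sketch of that bookkeeping is below, assuming each LEFT/RIGHT action rotates the ship by pi/10 radians (five turns = 90 degrees); the class name is illustrative and the action codes follow the action table above.

```python
import math

class DirectionTracker:
    TURN = math.pi / 10  # five turns of pi/10 add up to 90 degrees

    def __init__(self, direction=math.pi / 2):  # assume the ship starts facing up
        self.direction = direction

    def update(self, action):
        if action in (4, 12):    # LEFT / LEFTFIRE
            self.direction += self.TURN
        elif action in (3, 11):  # RIGHT / RIGHTFIRE
            self.direction -= self.TURN
        # keep the angle in [-pi, pi) for easy comparison with atan2 output
        self.direction = (self.direction + math.pi) % (2 * math.pi) - math.pi
```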
|
|
|
|
|
|
|
|
|
|
|
### 3: Position of Closest Asteroid
|
|
|
|
|
|
|
Asteroids were detected as any pixel that was not empty ([0, 0, 0]) and not the player ([240, 128, 128]). Using a single pass through the environment matrix, we were able to detect the asteroid closest to the latest known position of the player.
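
A sketch of that scan using NumPy is shown below; the helper name is illustrative and this vectorized form stands in for the single-pass loop described above.

```python
import numpy as np

EMPTY_RGB = [0, 0, 0]
AGENT_RGB = [240, 128, 128]

# Any pixel that is neither background nor the player counts as an asteroid;
# return the coordinates of the one nearest the player's last known position.
def closest_asteroid(observation, player_row, player_col):
    obs = np.array(observation)
    is_asteroid = ~(np.all(obs == EMPTY_RGB, axis=-1) | np.all(obs == AGENT_RGB, axis=-1))
    coords = np.argwhere(is_asteroid)
    if len(coords) == 0:
        return None
    dists = np.hypot(coords[:, 0] - player_row, coords[:, 1] - player_col)
    return tuple(coords[np.argmin(dists)])
```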
|
|
|
|
|
|
|
## Agent Reflex
|
|
|
|
|
|
|
Based on my actual strategy for Asteroids, this agent stays in the middle of the screen and shoots at the closest asteroid.
|
|
|
|
|
|
|
```python
def act(self, observation, reward, done):
    observation = np.array(observation)
    self.updateState(observation)
    dirOfAstroid = math.atan2(self.closestRow - self.row, self.closestCol - self.col)

    dirOfAstroid = self.deWarpAngle(dirOfAstroid)

    self.shotLast = not self.shotLast
    if self.shotLast:
        return 1  # fire
    if self.currentDirection - dirOfAstroid < 0:
        self.updateDirection(math.pi/10)
        if self.shotLast:
            return 12  # left fire
        return 4  # left
    else:
        self.updateDirection(-1*math.pi/10)
        return 3  # right
```
|
|
|
|
|
|
|
Despite being a simple agent, it performs well since it can shoot asteroids before they hit it.
|
|
|
|
|
|
|
## Results of Reflex Agent
|
|
|
|
|
|
|
In this trial, 200 runs of both the random agent and the reflex agent were observed while setting the seed of the environment to the current time. The seed was randomized in this scenario because the reflex agent is fully deterministic and would otherwise perform identically in every trial.
|
|
|
|
|
|
|
![histogram](media/asteroids/reflexPerformance.png)
|
|
|
|
|
|
|
|
|
|
|
The histogram shows that the reflex agent performs significantly better than the random agent on average. What is fascinating is that even though the agent's actions are deterministic, the seed of the environment created a large amount of variance in the observed scores. It is arguably misleading to report only a single score as an agent's performance because the environment seed has a large impact on even a non-random agent's scores.
|
|
|
|
|
|
|
```
Reflex Agent:
mean:2385.25
max:8110.0
min:530.0
sd:1066.217115553863
median:2250.0
n:200

Random Agent:
mean:976.15
max:2030.0
min:110.0
sd:425.2712987023695
median:980.0
n:200
```
|
|
|
|
|
|
|
One interesting thing about comparing the two distributions is that the reflex agent has a much larger standard deviation in its scores than the random agent. It is also notable that the reflex agent's worst performance was significantly better than the random agent's worst performance, and the best performance of the reflex agent shatters the best performance of the random agent.
|
|
|
|
|
|
|
```
Random agent vs reflex
F_onewayResult(
    statistic=299.86689786081956,
    pvalue=1.777062051091977e-50
)
```
|
|
|
|
|
|
|
Since we took such a large sample size (two hundred) and the populations were very different, we got a p-value of nearly zero (1.77e-50). With a p-value like this, we can say with nearly 100% confidence (with rounding) that these two populations are different and that the reflex agent outperforms the random agent.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
# Genetic Algorithm
|
|
|
|
|
|
|
Genetic algorithms employ the same tactics used in natural selection to find an optimal solution to an optimization problem. They are often used in high-dimensional problems where the optimal solutions are not apparent, and they are commonly used to tune the hyper-parameters of a program. However, this approach can be used in any scenario where you have a function that defines how good a solution is.
|
|
|
|
|
|
|
In the case of Asteroids, we can employ a genetic algorithm to find the sequence of moves that achieves the highest score possible. The chromosomes are naturally defined as the sequence of actions to loop through, and the fitness function is simply the score that the agent achieves.
|
|
|
|
|
|
|
## Algorithm Implementation
|
|
|
|
|
|
|
The actual implementation of the genetic algorithm was pretty straightforward: the agent simply loops through a sequence of actions, where each action represents a gene on the chromosome.
|
|
|
|
|
|
|
|
|
|
|
```python
class Agent(object):
    """Very Basic GA Agent"""
    def __init__(self, action_space, chromosome):
        self.action_space = action_space
        self.chromosome = chromosome
        self.index = 0

    # Step through the chromosome, wrapping back to the start when it ends
    def act(self, observation, reward, done):
        if self.index >= len(self.chromosome) - 1:
            self.index = 0
        else:
            self.index = self.index + 1
        return self.chromosome[self.index]
```
|
|
|
|
|
|
|
|
|
|
|
Rather than using a library, a simple home-brewed genetic algorithm was written from scratch. The algorithm runs a loop in which each iteration performs the functions needed for one generation. Each generation can be broken into a few steps:
|
|
|
|
|
|
|
- selection: removes the worst-performing chromosomes
- mating: uses crossover to create new chromosomes
- mutation: adds randomness to the chromosomes
- fitness: evaluates the performance of each chromosome
|
|
|
|
|
|
|
In roughly 100 lines of Python, a basic genetic algorithm was crafted.
|
|
|
|
|
|
|
```python
from random import choice, random, randrange
from matplotlib import pyplot

AVAILABLE_COMMANDS = [0, 1, 2, 3, 4]


def generateRandomChromosome(chromosomeLength):
    chrom = []
    for i in range(0, chromosomeLength):
        chrom.append(choice(AVAILABLE_COMMANDS))
    return chrom


"""
creates a random population
"""
def createPopulation(populationSize, chromosomeLength):
    pop = []
    for i in range(0, populationSize):
        pop.append((0, generateRandomChromosome(chromosomeLength)))
    return pop


"""
computes fitness of population and sorts the array based
on fitness
"""
def computeFitness(population):
    for i in range(0, len(population)):
        # calculatePerformance (defined elsewhere) runs the chromosome in the
        # environment and returns its score
        population[i] = (calculatePerformance(population[i][1]), population[i][1])
    population.sort(key=lambda tup: tup[0], reverse=True)  # sorts population in place


"""
kills the weakest portion of the population
"""
def selection(population, keep):
    origSize = len(population)
    for i in range(keep, origSize):
        population.remove(population[keep])


"""
Uses crossover to mate two chromosomes together.
"""
def mateBois(chrom1, chrom2):
    pivotPoint = randrange(len(chrom1))
    bb = []
    for i in range(0, pivotPoint):
        bb.append(chrom1[i])
    for i in range(pivotPoint, len(chrom2)):
        bb.append(chrom2[i])  # take the tail from the second parent
    return (0, bb)


"""
brings population back up to desired size of population
using crossover mating
"""
def mating(population, populationSize):
    newBlood = populationSize - len(population)

    newbies = []
    for i in range(0, newBlood):
        newbies.append(mateBois(choice(population)[1],
                                choice(population)[1]))
    population.extend(newbies)


"""
Randomly mutates x chromosomes -- excluding best chromosome
"""
def mutation(population, mutationRate):
    changes = random() * mutationRate * len(population) * len(population[0][1])
    for i in range(0, int(changes)):
        ind = randrange(len(population) - 1) + 1
        chrom = randrange(len(population[0][1]))
        population[ind][1][chrom] = choice(AVAILABLE_COMMANDS)


"""
Computes average score of population
"""
def computeAverageScore(population):
    total = 0.0
    for c in population:
        total = total + c[0]
    return total / len(population)


def runGeneration(population, populationSize, keep, mutationRate):
    selection(population, keep)
    mating(population, populationSize)
    mutation(population, mutationRate)
    computeFitness(population)


"""
Runs the genetic algorithm
"""
def runGeneticAlgorithm(populationSize, maxGenerations,
                        chromosomeLength, keep, mutationRate):
    population = createPopulation(populationSize, chromosomeLength)

    best = []
    average = []
    generations = range(1, maxGenerations + 1)

    for i in range(1, maxGenerations + 1):
        print("Generation: " + str(i))
        runGeneration(population, populationSize, keep, mutationRate)

        a = computeAverageScore(population)
        average.append(a)
        best.append(population[0][0])

        print("Best Score: " + str(population[0][0]))
        print("Average Score: " + str(a))
        print("Best chromosome: " + str(population[0][1]))
        print()

    pyplot.plot(generations, best, color='g', label='Best')
    pyplot.plot(generations, average, color='orange', label='Average')

    pyplot.xlabel("Generations")
    pyplot.ylabel("Score")
    pyplot.title("Training GA Algorithm")
    pyplot.legend()
    pyplot.show()
```
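
For reference, a training run like the one shown below could be launched with a call such as the following; the parameter values here are assumptions, except the chromosome length of 50, which matches the best chromosome printed in the results.

```python
# Hypothetical invocation; only chromosomeLength=50 is taken from the results below.
runGeneticAlgorithm(populationSize=30, maxGenerations=200,
                    chromosomeLength=50, keep=10, mutationRate=0.05)
```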
|
|
|
|
|
|
|
|
|
|
|
## Results
|
|
|
|
|
|
|
![training](media/asteroids/GA50.png)
|
|
|
|
|
|
|
|
|
|
|
![training](media/asteroids/GA200.png)
|
|
|
|
|
|
|
|
|
|
|
```
Generation: 200
Best Score: 8090.0
Average Score: 2492.6666666666665
Best chromosome: [1, 4, 1, 4, 4, 1, 0, 4, 2, 4, 1, 3, 2, 0, 2, 0, 0, 1, 3, 0, 1, 0, 4, 0, 1, 4, 1, 2, 0, 1, 3, 1, 3, 1, 3, 1, 0, 4, 4, 1, 3, 4, 1, 1, 2, 0, 4, 3, 3, 0]
```
|
|
|
|
|
|
|
It is impressive that a simple genetic algorithm can learn to perform well when the seed is fixed. Compared to the random agent, which had a max score of 3320 with a fixed seed, the optimized genetic algorithm shattered the random agent's best performance by a factor of roughly 2.5.
|
|
|
|
|
|
|
Since we trained an optimized sequence of actions to achieve a high score on a specific seed, what would happen if we randomized the seed? A test was conducted comparing the GA agent trained for 200 generations against the random agent. For both agents, the seed was randomized by setting it to the current time.
|
|
|
|
|
|
|
![200 Trials GA Random Seed](media/asteroids/GAvsRandom.png)
|
|
|
|
|
|
|
|
|
|
|
```
GA Performance Trained on Fixed Seed:
mean:2257.9
max:5600.0
min:530.0
sd:1018.4363455808125
median:2020.0
n:200
```
|
|
|
|
|
|
|
|
|
|
|
```
Random Agent (Random Seed):
mean:1079.45
max:2800.0
min:110.0
sd:498.9340612746338
median:1080.0
n:200
```
|
|
|
|
|
|
|
```
F_onewayResult(
    statistic=214.87432376234608,
    pvalue=3.289638100969386e-39
)
```
|
|
|
|
|
|
|
As expected, the GA agent did not perform as well on random seeds as it did on the fixed seed it was trained on. However, the GA was able to find an action sequence that statistically beat the random agent, as shown by the score distributions above and the extremely small p-value. Although luck played a part in the agent reaching a score of about 8000 on the seed of zero, the skill it learned was somewhat transferable to other seeds. Replaying video of the agent, it just slowly drifts around the screen and shoots at asteroids in front of it. This is a major advantage over the random agent, which tends to move very fast and rotate spastically.
|
|
|
|
|
|
|
## Future Work
|
|
|
|
|
|
|
This algorithm was more or less a last-minute hack to see if I could make a cool video of a high-scoring Asteroids agent. Future agents using genetic algorithms would incorporate reflexes to respond dynamically to the environment. Based on the direction of the asteroids in proximity to the player, the agent could select a different chromosome of actions to execute. This could potentially yield scores above ten thousand if trained and implemented correctly. Future training should also incorporate randomness into the seed so that the skills learned transfer as well as possible to other random environments.
|
|
|
|
|
|
|
# Deep Q-Learning Agent
|
|
|
|
|
|
|
## Introduction:
|
|
|
|
|
|
|
|
|
|
|
The inspiration behind attempting a reinforcement learning agent for this problem is the original DQN paper from DeepMind, “Playing Atari with Deep Reinforcement Learning.” That paper showed the potential of the Deep Q-learning methodology on a variety of simulated Atari games using one standardized architecture across all of them. Reinforcement learning has always been of interest to me, and having the opportunity to spend time learning about it while applying it in a class setting was exciting, even if it is outside the present scope of the class. It has been an exciting challenge to read through and implement a research paper and work toward similar results.
|
|
|
|
|
|
|
Deep Q-Learning is an extension of the standard Q-Learning algorithm in which a neural network is used to approximate the optimal action-value function, Q\*(s, a). The action-value function outputs the maximum expected return achievable from a given state and action under a policy mapping states to actions (or distributions over actions). This works because the Q function follows the Bellman equation identity: if the optimal value of the next state s' is known for every action a', then the optimal value of the current state-action pair follows by maximizing the expected reward r + γ max_a' Q\*(s', a'). Thus, the reinforcement learning comes in the form of a neural network approximating the optimal action-value function by using the Bellman identity as an iterative update at every time step.
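
As a concrete PyTorch-style sketch of that iterative update: the temporal-difference target for a batch of transitions is r + γ·max_a' Q(s', a'), and the network is trained to move Q(s, a) toward it. The function below is illustrative, not our exact training code.

```python
import torch
import torch.nn.functional as F

# states/next_states: float tensors of stacked frames; actions: int64; dones: 0/1 floats.
def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
    targets = rewards + gamma * next_q * (1.0 - dones)
    return F.mse_loss(q_values, targets)
```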
|
|
|
|
|
|
|
|
|
|
|
## Agent architecture:
|
|
|
|
|
|
|
The network architecture is a basic convolutional network with two conv layers, a fully connected layer, and an output layer of 14 classes, each representing an individual action. The first conv layer consists of 16 8x8 filters with a stride of 4, while the second has 32 4x4 filters with a stride of 2. Following these layers, the filters are flattened into a 1-D representation vector of size 12,672 that is passed through a fully connected layer of 256 nodes.
|
|
|
|
|
|
|
All layers except the output layer are activated using the ReLU function. The optimization algorithm of choice was the Adam optimizer with a learning rate of 0.0001 and betas of [0.9, 0.99]. The discount factor (gamma) for future expected rewards was set to 0.99, and the probability of taking a random action per step was linearly annealed from 1.0 down to a fixed 0.1 over the first one million seen frames.
|
|
|
|
|
|
|
![Layer code](media/asteroids/code.png)
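
A PyTorch-style sketch of a network matching this description is shown below. Note that a flattened size of 12,672 corresponds to a 4-frame stack of 192x160 inputs given these filter sizes and strides; with a different crop or downsampling, the linear layer's input size would change. The module name and structure are illustrative.

```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=14):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(12672, 256), nn.ReLU(),  # 32 * 22 * 18 for a 4x192x160 input
            nn.Linear(256, n_actions),
        )

    def forward(self, x):
        return self.head(self.conv(x))
```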
|
|
|
|
|
|
|
|
|
|
|
## Experience Replay:
|
|
|
|
|
|
|
One of the main points in the original paper that significantly helped the training of this network is the introduction of a replay buffer used during training. To break the temporal correlation between sequential frames and avoid biasing the training toward particular chains of situations, a historical buffer of transitions is used to sample mini-batches to train on at each time step. Every time an action is taken, a tuple consisting of the current state, the action taken, the reward gained, and the subsequent state (s, a, r, s’) is stored in the buffer. At every training step, a mini-batch is sampled from the buffer and used to train the network. This lets the network train on non-correlated transitions and, hopefully, generalize to the environment rather than becoming biased toward a string of similar actions.
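
A minimal replay buffer sketch is shown below; the 50,000-transition capacity matches the limitation discussed later, and the class and method names are illustrative rather than our exact implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the far end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return zip(*batch)  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```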
|
|
|
|
|
|
|
## Preprocessing:
|
|
|
|
|
|
|
One of the first issues that had to be tackled was the high dimensionality of the input image and how that information was duplicated and stored in the replay buffer. Each observation given by the environment is a (210, 160, 3) matrix representing the RGB pixels of the frame. To save time and be computationally efficient, we needed to preprocess the observations and reduce their dimensionality, since a single frame stack (of which there are two per transition) consists of (4, 3, 210, 160), or roughly 403,000, input features that would otherwise have to be dealt with.
|
|
|
|
|
|
|
![Before Processing](media/asteroids/dqn_before.png)
|
|
|
|
|
|
|
First, each image is converted to grayscale and the reward/lives section at the top of the screen is cut out since it is irrelevant to the network’s vision. The resulting (4, 192, 160) matrix is then downsampled by taking every other pixel to (4, 96, 80), reducing the roughly 403,000 input features to only 30,720 -- a substantial reduction in the calculations needed while maintaining strong input information for the network.
|
|
|
|
|
|
|
![After Processing](media/asteroids/dqn_after.png)
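
A sketch of that preprocessing step is below; the crop offset of 18 rows is an assumption chosen so that 210 - 18 = 192, matching the sizes above.

```python
import numpy as np

def preprocess(frame):
    gray = np.mean(frame, axis=-1).astype(np.uint8)  # (210, 160, 3) -> (210, 160)
    cropped = gray[18:, :]                           # drop the score/lives banner -> (192, 160)
    return cropped[::2, ::2]                         # take every other pixel -> (96, 80)
```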
|
|
|
|
|
|
|
|
|
|
|
## Training:
|
|
|
|
|
|
|
Training was conducted by modifying the main function so that a new game starts immediately after one finishes, making continuous training of the agent easier. Between games, all the environment parameters were reset and the temporary attributes of the agent (i.e., current state/next state) were flushed. For the first four frames of a game, the bot just gathered a stack of frames. After that, at every step the next state was compiled and the transition tuple pushed onto the buffer, followed by a training step for the agent. For the training step, a random batch was grabbed from the replay buffer and used to calculate the loss between actual and expected Q-values, which in turn provided the gradients for backpropagation through the network.
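
Tying the earlier sketches together, a single training step might look like the following; the batch size and tensor handling are assumptions, and `dqn_loss` refers to the sketch above rather than our exact code.

```python
import torch

def train_step(q_net, optimizer, buffer, batch_size=32, gamma=0.99):
    if len(buffer) < batch_size:
        return  # wait until enough transitions have been collected
    states, actions, rewards, next_states, dones = buffer.sample(batch_size)
    loss = dqn_loss(
        q_net,
        torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in states]),
        torch.as_tensor(actions, dtype=torch.int64),
        torch.as_tensor(rewards, dtype=torch.float32),
        torch.stack([torch.as_tensor(s, dtype=torch.float32) for s in next_states]),
        torch.as_tensor(dones, dtype=torch.float32),
        gamma,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```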
|
|
|
|
|
|
|
|
|
|
|
## Outcome:
|
|
|
|
|
|
|
Unfortunately, the result of 48 hours of continuous training, 950 games played, and roughly 1.3 million frames of game footage seen was that the agent converged to a suicidal policy that resulted in consistently garbage performance.
|
|
|
|
|
|
|
![Reward](media/asteroids/reward.png)
|
|
|
|
|
|
|
|
|
|
|
The model reached the fixed 90% model-action chance (random-action probability of 0.1) around episode 700, which is exactly where the agent starts to go awry. The strange part is that since the random-action chance is linearly annealed over the first million frames, if the agent had been following a garbage policy all along, we would expect the rewards to steadily decrease over time as the network takes more control.
|
|
|
|
|
|
|
![Real Trendline](media/asteroids/reward_trendline.png)
|
|
|
|
|
|
|
|
|
|
|
Up until that point, the reward trendline projected a steady rise with the number of episodes. Extrapolating out to 10,000 episodes (approximately 10 million frames seen, the same amount of training as in the original DeepMind paper), the projected score is in the range of 2,400 to 2,500, which closely matches the well-tuned reflex agent and the GA agent on a random seed.
|
|
|
|
|
|
|
![Endgame Trendline](media/asteroids/reward_projection.png)
|
|
|
|
|
|
|
|
|
|
|
It would’ve been exciting to see how the model compared to our reflex agent had it been able to train consistently to the end.
|
|
|
|
|
|
|
## Limitations:
|
|
|
|
|
|
|
A fair number of limitations in the execution and training of this model likely contributed to the slow and unstable training of the network. Differences from the original paper are that the optimizer used was Adam instead of RMSProp and that the replay buffer only held the previous 50k frames, not the past one million. It is possible that the weaker replay buffer was to blame: the model may have been continuously fed a sub-optimal policy within its past 50,000 frames, causing it to diverge so heavily near the end.
|
|
|
|
|
|
|
One preprocessing issue that might have led the bot astray is not using the pixelwise max of sequential frames so that each processed frame includes both the asteroids and the player. Since the Atari (and, by extension, this environment simulation) does not render the asteroids and the player sprite in the same frame, it is possible that the network was unable to extract any coherent connection between the alternating frames.
|
|
|
|
|
|
|
Regarding optimizations built on top of the original DeepMind DQN algorithm, we did not use separate policy and target networks in training. In the original algorithm, the attempt to converge to the target policy is unstable because the target's weights shift continuously during training. It is hard for the network to converge to something that changes at every time step, which leads to very noisy and unstable training. One proposed optimization for DQN is to maintain a policy network and a target network: at every timestep, the policy network's weights are updated with the calculated gradients, while the target network is held fixed for a number of steps. This lets the target stay still for a few time steps while the network converges toward it, leading to more stable and guided training.
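
A sketch of that policy/target split: the target network's weights are frozen for a fixed number of steps and then copied over from the policy network (the sync interval here is an assumption).

```python
TARGET_SYNC_EVERY = 1_000  # assumed interval, in training steps

def maybe_sync_target(step, policy_net, target_net):
    if step % TARGET_SYNC_EVERY == 0:
        target_net.load_state_dict(policy_net.state_dict())
```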
|
|
|
|
|
|
|
Perhaps the largest limitation in training was the available computational power. The network was trained on a single GTX1060ti GPU, which meant single episodes took a few minutes to complete. It would have taken an incredibly long time to hit 10 million seen frames, as even 1.3 million took approximately 48 hours. It is probable that our implementation is inefficient in its calculations; however, it is a well-known limitation of RL that it is time- and compute-intensive.
|
|
|
|
|
|
|
## Deep Q Conclusions:
|
|
|
|
|
|
|
This was a fun agent and algorithm to implement, even if at present it has given back little to nothing in terms of performance. The plan is to continue testing and training the agent, even after the deadline. Reinforcement learning is a complicated and hard-to-debug setting, but similarly an exciting challenge because of its potential for solving and overcoming problems.
|
|
|
|
|
|
|
|
|
|
|
# Conclusion
|
|
|
|
|
|
|
This project demonstrated how fun it can be to train AI agents to play video games. Although none of our agents is earth-shatteringly amazing, we were able to use statistical measures to determine that the reflex and GA agents outperform the random agent. The GA agent and the convolutional neural network show real promise, and future work could drastically improve their results.