In this blog post I examine how antivirus programs currently employ
machine learning and then go into the security vulnerabilities that it
brings.

# ML in the Antivirus Industry

Malware detection falls into two broad categories: static and dynamic
analysis. Static analysis examines a program without actually running
its code. It looks at things like file fingerprints, hashes, memory
artifacts, packer detection, and the disassembly recovered through
reverse engineering. Static analysis is best known for signature
matching: looking up a file's hash in a database of known viruses.
Signature-based detection is easy to fool with simple obfuscation,
since changing even one byte of a file changes its hash. Dynamic
analysis runs the program in a sandbox and monitors every action that
it takes. If the program acts suspiciously, it is likely a virus.
Suspicious behavior typically includes things like registry edits and
network requests to known-bad host names.
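
To show why hash signatures are so brittle, here is a minimal sketch of
a signature lookup. The "sample" is just a placeholder byte string and
the one-entry database is made up for illustration; a real product
would use a far larger database and more than one kind of signature.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical "known bad" sample; the bytes are only a placeholder.
known_sample = b"pretend this is a malicious binary"

# Toy signature database holding the hash of that one sample.
signature_db = {sha256_hex(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Signature check: flag the file only if its hash is in the database."""
    return sha256_hex(data) in signature_db


# Appending a single byte produces a completely different hash.
padded_sample = known_sample + b"\x00"

print(is_known_malware(known_sample))   # True
print(is_known_malware(padded_sample))  # False
```

The second file differs from the first by one byte, yet the lookup no
longer matches it.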

Malware detection is very difficult, but probably not for the reasons
you think. The issue isn't writing programs that can detect these
static or dynamic properties of viruses; that is the easy part. It is
also relatively easy to come up with a general rule set for what makes
a program dangerous. You can also easily blacklist suspicious domains,
block malicious activity, and implement a signature-based malware
detection program.
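
As a rough illustration of how simple a hand-written rule set can be,
here is a small scoring sketch. The rule names, weights, and threshold
are all invented for this example and are not taken from any real
antivirus product.

```python
# Hypothetical rules: each observed behavior adds its weight to the score.
RULES = {
    "edits_registry_run_key": 3,
    "contacts_blacklisted_domain": 4,
    "is_packed": 2,
    "disables_security_tool": 5,
}

# Assumed cutoff: scores at or above this value are treated as suspicious.
THRESHOLD = 5


def risk_score(behaviors):
    """Sum the weights of every rule that appears in the observed behaviors."""
    return sum(weight for rule, weight in RULES.items() if rule in behaviors)


def is_suspicious(behaviors):
    """Apply the threshold to the rule-based score."""
    return risk_score(behaviors) >= THRESHOLD


print(is_suspicious({"is_packed"}))                                 # False
print(is_suspicious({"is_packed", "contacts_blacklisted_domain"}))  # True
```

Writing a handful of rules like this is easy. Keeping thousands of them
accurate by hand is the hard part.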

The real problem is that there are hundreds of thousands of malware
applications, and more are created every day. Not only are there tons
of pesky malware applications, there is also an absurd number of normal
programs that we don't want antivirus applications to block. It is
impossible for a small team of malware researchers to create a
definitive set of heuristics that correctly identifies all malware
programs. This is where we turn to the field of machine learning.
Humans are very bad with big data, but computers love big data. Most
antivirus companies use machine learning, and it has been a large
success so far because it has dramatically improved our ability to
detect zero-day viruses.

## Interesting Examples

### Cylance

[Cylance](https://www.cylance.com) uses supervised learning and static analysis to classify files as malware.
The product extracts a list of attributes from each file and compares them against the attributes of known viruses.
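
To give a feel for the general idea (this is not Cylance's actual model or feature set), here is a toy supervised classifier.
The two attributes (byte entropy and printable-byte ratio), the nearest-centroid classifier, and the training samples are all assumptions I picked to keep the sketch small.
High entropy alone does not make a file malicious; the labels here are placeholder data.

```python
import math
from collections import Counter


def extract_features(data: bytes):
    """Two toy static attributes of a non-empty file:
    byte entropy (0 to 8 bits) and the fraction of printable ASCII bytes."""
    total = len(data)
    counts = Counter(data)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    printable = sum(1 for b in data if 32 <= b < 127) / total
    return [entropy, printable]


def centroid(vectors):
    """Average a list of equal-length feature vectors column by column."""
    return [sum(col) / len(col) for col in zip(*vectors)]


def train(samples):
    """samples is a list of (bytes, label) pairs; returns one centroid per label."""
    by_label = {}
    for data, label in samples:
        by_label.setdefault(label, []).append(extract_features(data))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(model, data: bytes):
    """Return the label whose centroid is closest to the file's features."""
    feats = extract_features(data)

    def squared_distance(label):
        return sum((f - c) ** 2 for f, c in zip(feats, model[label]))

    return min(model, key=squared_distance)


# Placeholder training data: readable text vs. high-entropy byte blobs.
training_samples = [
    (b"hello world, this is a plain text config file\n" * 4, "benign"),
    (b"name=example\nversion=1.0\nenabled=true\n" * 4, "benign"),
    (bytes(range(256)), "malware"),
    (bytes((i * 7 + 3) % 256 for i in range(512)), "malware"),
]
model = train(training_samples)

print(classify(model, b"just another readable log line\n" * 8))  # benign
print(classify(model, bytes(reversed(range(256)))))              # malware
```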

### Malwarebytes Anomalous

[Anomalous](https://blog.malwarebytes.com/detections/machinelearning-anomalous-100/) is a machine learning application that simply flags files that look different from its training set of known normal files.
It does not attempt to learn what makes a virus a virus, but rather what makes a normal program a normal program.
It alerts you about anything that does not look like a normal program, since it could be a virus.
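
The same "learn what normal looks like" idea can be sketched with a simple z-score check.
This is only an illustration of anomaly detection in general, not how Anomalous works internally.
The two features (entropy and import count), the sample values, and the cutoff of 3 standard deviations are all assumptions.

```python
import statistics


def fit_normal_profile(normal_vectors):
    """Learn a (mean, standard deviation) pair per feature from normal files only."""
    columns = list(zip(*normal_vectors))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def is_anomalous(profile, vector, max_z=3.0):
    """Flag a file if any feature lies more than max_z deviations from normal."""
    for value, (mean, std) in zip(vector, profile):
        if std == 0:
            # Every normal file had the same value, so any difference stands out.
            if value != mean:
                return True
            continue
        if abs(value - mean) / std > max_z:
            return True
    return False


# Hypothetical feature vectors of known normal programs: [entropy, import count].
normal_files = [
    [4.1, 120],
    [4.5, 98],
    [3.9, 143],
    [4.3, 110],
    [4.0, 131],
]
profile = fit_normal_profile(normal_files)

print(is_anomalous(profile, [4.2, 115]))  # False: looks like the training set
print(is_anomalous(profile, [7.9, 3]))    # True: far outside the normal range
```

Note that no virus samples were needed for training, which is the appeal of this approach.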

### Kaspersky

Kaspersky appears to have done a ton of research into using machine
learning for malware detection. I would highly recommend that you read
their [white
paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
on this subject. 

# Why is this a problem?

It turns out that machine learning systems can be easily fooled by
other machine learning algorithms. A classic example of this is image
classification. It is easy to use neural networks or genetic
algorithms to generate examples that fool a machine learning
application: you learn the weights of the target model and then make
slight tweaks to your input until it is misclassified.

![](media/AISaftey/AdversarialExample.png)
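
Here is a minimal sketch of that weight-based trick against a tiny
logistic classifier. The weights, bias, input, and step size are all
made up, and I am assuming the attacker already knows the weights. The
approach is in the spirit of the fast gradient sign method: nudge every
input value a small amount in the direction that lowers the score.

```python
import math

# Hypothetical weights of an already-trained linear classifier.
WEIGHTS = [2.0, -1.0, 0.5, 3.0]
BIAS = -1.0


def predict(x):
    """Probability that input x belongs to the positive class."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def adversarial(x, epsilon):
    """The gradient of the logit with respect to x[i] is WEIGHTS[i], so
    stepping each value by epsilon against that sign lowers the score."""
    return [xi - epsilon * sign(w) for xi, w in zip(x, WEIGHTS)]


original = [0.6, 0.2, 0.4, 0.3]
tweaked = adversarial(original, epsilon=0.2)

print(round(predict(original), 2))  # 0.75, classified as positive
print(round(predict(tweaked), 2))   # 0.45, now classified as negative
```

No single input value moved by more than 0.2, but the classification
flipped.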

Since virus generation is a non-differentiable problem, people often
use genetic algorithms as the adversary that fools the antivirus. In
other words, you don't want to attempt to calculate the derivative
between two versions of a virus for gradient descent. Viruses are also
high-dimensional problems, so most calculus-based approaches would be
inefficient at traversing the search space to find the global minimum.
If you want to learn more about genetic algorithms, check out my
[recent blog
post](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
on them.

# Fooling Antivirus Software

## Genetic Algorithms

There are two major approaches that people have used to generate
antivirus-resistant malware with genetic algorithms. The first
approach is to slowly make polymorphic changes to the virus in order
to fool the malware detection. One of the interesting things about
this approach is that you need some way of verifying that the
polymorphic changes you apply to the virus don't break its
"virus capabilities".

Another approach is to represent a virus as a set of properties.
These properties cover everything from the port of attack to the
payloads and obfuscation parameters. The genetic algorithm simply
tweaks the properties of the virus until it finds a configuration
that evades the antivirus program.
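
The search loop behind the second approach can be sketched on a purely
abstract problem. In this toy version a "configuration" is just a list
of bits, and the "detector" is a stand-in function that counts how many
bits match a pattern it reacts to. None of this corresponds to a real
virus or antivirus; the pattern, population size, and mutation rate
are all assumptions.

```python
import random

N_PROPERTIES = 12

# Hypothetical property values that the toy detector reacts to.
FLAGGED = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1]


def detector_score(config):
    """Toy stand-in for a detector: number of properties matching FLAGGED.
    A score of 0 means this configuration is not flagged at all."""
    return sum(1 for c, f in zip(config, FLAGGED) if c == f)


def evolve(rng, pop_size=20, generations=40, mutation_rate=0.1):
    """Minimize detector_score with selection, crossover, and mutation.
    Returns the best configuration and the best score per generation."""
    population = [
        [rng.randint(0, 1) for _ in range(N_PROPERTIES)] for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=detector_score)
        history.append(detector_score(population[0]))
        if history[-1] == 0:
            break
        # Elitism: the better half survives unchanged into the next round.
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, N_PROPERTIES)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return min(population, key=detector_score), history


best, history = evolve(random.Random(0))
print(detector_score(best), history)
```

Because the best half of each generation is carried over unchanged,
the best score can only stay the same or improve from one generation
to the next.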

## Reinforcement Learning

A research group at [Endgame](https://www.endgame.com/) recently gave
a [Def Con](https://www.defcon.org/) talk where they presented a
framework which uses reinforcement learning to evade static virus
detection. 

![Reinforcement Learning Diagram](media/AISaftey/Reinforcement_learning_diagram.png)

At a high level, the AI plays a "game" against the antivirus in which
the agent can make functionality-preserving mutations to the virus.
The agent is rewarded when the mutated virus goes undetected by the
antivirus. Over time the AI learns which types of actions result in
getting detected by the antivirus and which do not. This framework
can be found on [GitHub](https://github.com/endgameinc/gym-malware).
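
To illustrate the reward loop, here is a heavily simplified,
bandit-style sketch. It does not use the gym-malware API, and it is
not the algorithm from the talk. The action names, the hidden success
probabilities, and the epsilon-greedy agent are all assumptions. The
"environment" is a toy function that randomly decides whether an
abstract action slipped past a pretend detector.

```python
import random

# Hypothetical actions with hidden chances of getting past the toy detector.
EVASION_PROBABILITY = {"action_a": 0.1, "action_b": 0.8, "action_c": 0.3}


def toy_environment(action, rng):
    """Reward of 1.0 if the toy detector misses the sample, else 0.0."""
    return 1.0 if rng.random() < EVASION_PROBABILITY[action] else 0.0


def train_agent(rng, episodes=2000, epsilon=0.1):
    """Epsilon-greedy agent that keeps a running average reward per action."""
    actions = list(EVASION_PROBABILITY)
    values = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)           # explore a random action
        else:
            action = max(actions, key=values.get)  # exploit the best estimate
        reward = toy_environment(action, rng)
        counts[action] += 1
        # Incremental update of the running mean for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = train_agent(random.Random(0))
print(max(values, key=values.get))
```

After enough rounds, the agent's estimates should line up with the
hidden probabilities, so it ends up preferring the action that the toy
detector misses most often.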

# Takeaways

Machine learning is great, but it needs to be properly defended. As
we start to use machine learning more and more, a large portion of the
cyber security field may shift its focus away from securing systems to
securing big data applications. 

# Resources

- [Blog post on Genetic Algorithms](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
- [Kaspersky White Paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
- [Windows Defender Use of ML](https://www.microsoft.com/security/blog/2015/11/16/windows-defender-rise-of-the-machine-learning/)
- [Machine Learning Malware Models via Reinforcement Learning (Paper)](https://arxiv.org/abs/1801.08917)
- [Evolvable Malware (Paper)](http://homepage.divms.uiowa.edu/~mshafiq/files/evolvable-malware.pdf)