In this blog post I examine how antivirus programs currently employ machine learning and then dig into the
security vulnerabilities that it introduces.
# ML in the Antivirus Industry
Malware detection falls into two broad categories: static and dynamic analysis.
Static analysis examines the program without actually running the code.
It looks at things like file fingerprints, hashes, reverse engineering, memory artifacts, packer detection, and debugging.
Static analysis is best known for looking up the hash of a file against a database of known viruses.
It is super easy to fool signature-based malware detection using simple obfuscation methods.
Dynamic analysis is a technique where you run the program in a sandbox and monitor all the actions that it takes.
If you notice that the program is acting suspiciously, it is likely a virus.
Suspicious behavior typically includes things like registry edits and API calls to bad host names.
Antivirus detection is very difficult, but probably not for the reasons you think.
The issue isn't writing programs which can detect these static or dynamic properties of viruses; that is the easy part.
It is also relatively easy to determine a general rule set for what makes a program dangerous.
You can also easily blacklist suspicious domains, block malicious activity, and implement a signature-based malware detection program.
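To make the point about signatures concrete, here is a minimal sketch of a hash-based lookup and how a one-byte change slips past it. The sample bytes, `SIGNATURES`, and `is_flagged` are all made up for illustration; a real engine's database and matching logic will look different.

```python
import hashlib

# Hypothetical "known bad" sample; a harmless byte string stands in for a real binary.
KNOWN_BAD_SAMPLE = b"print('pretend this is malicious')"

# Toy signature database: SHA-256 digests of known-bad files.
SIGNATURES = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_flagged(file_bytes: bytes) -> bool:
    """Return True when the file's SHA-256 digest is in the signature database."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURES


# The exact sample is caught by the lookup.
original_flagged = is_flagged(KNOWN_BAD_SAMPLE)

# Appending a single harmless byte changes the digest, so the lookup misses.
modified_flagged = is_flagged(KNOWN_BAD_SAMPLE + b"\n")

print(original_flagged, modified_flagged)
```

The lookup only matches byte-for-byte identical files, which is why trivial changes are enough to defeat it.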
The real problem is that there are hundreds of thousands of malware applications and more are created every day.
Not only are there tons of pesky malware applications, there is an absurd number of normal programs which we don't want antivirus software to block.
It is impossible for a small team of malware researchers to create a definitive set of heuristics which can correctly identify all malware programs.
This is where we turn to the field of machine learning.
Humans are very bad with big data, but computers love big data.
Most antivirus companies use machine learning, and it has been a large success so far because it has dramatically improved our ability to detect zero-day viruses.
## Interesting Examples
### Kaspersky
Kaspersky appears to have done a ton of research into using machine learning for malware detection.
I would highly recommend that you read their [white paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf) on this subject.
It turns out that machine learning systems can be easily fooled by using other machine learning algorithms.
A classic example of this is image classification.
It is easy to use neural networks or genetic algorithms to generate examples which fool a machine learning model:
you learn the model's weights and then make slight tweaks to your input to produce a false classification.
![](media/AISaftey/AdversarialExample.png)
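The "slight tweaks" idea can be sketched on something much smaller than an image classifier. Below is a toy logistic regression with weights I made up (`WEIGHTS`, `BIAS`), plus a fast-gradient-sign style step that nudges each input feature against its weight. It assumes the attacker already knows the weights, and it is an illustration rather than how any particular attack is implemented.

```python
import math

# Hypothetical weights of an already-trained logistic regression classifier.
WEIGHTS = [2.0, -1.5, 0.5]
BIAS = -0.25


def score(x):
    """Probability the toy classifier assigns to the positive class."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def perturb(x, epsilon):
    """Take one gradient-sign step that lowers the positive-class score.

    For a linear model the gradient of the logit with respect to the input
    is just the weight vector, so each feature moves against its weight.
    """
    return [xi - epsilon * sign(w) for xi, w in zip(x, WEIGHTS)]


sample = [0.6, 0.2, 0.4]
adversarial = perturb(sample, epsilon=0.3)

print(round(score(sample), 3), round(score(adversarial), 3))
```

Each feature moves by at most `epsilon`, yet the score crosses the 0.5 decision boundary, which is the same effect the image above shows at a much larger scale.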
Since virus generation is a non-differentiable problem, people often use genetic algorithms as the adversary that tries to fool the antivirus.
In other words, you don't want to attempt to calculate the derivative between two versions of a virus for gradient descent.
Since viruses are high-dimensional, it turns out that most calculus-based approaches would actually be inefficient at traversing the search space to find the global minimum.
If you want to learn more about genetic algorithms, check out my [recent blog post](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm) on it.
There are two major approaches which people have used to generate antivirus-resistant malware with genetic algorithms.
The first approach is to slowly make polymorphic changes to the virus in order to fool the malware detection.
One of the interesting things about this approach is that you need some way of verifying that the polymorphic changes you apply to the virus don't break its "virus capabilities".
Another approach is to represent a virus as a set of properties.
These properties include everything from the port of attack to the payloads and obfuscation parameters.
The genetic algorithm would simply tweak the properties of the virus until it found a configuration which evaded the antivirus program.
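The search loop behind the property-based approach can be sketched in the abstract. Everything here is a stand-in that I invented for illustration: the "properties" are plain integers from 0 to 9, and `detection_score` is a rule that counts matches against a fixed `SECRET_PROFILE`. Nothing touches real software or a real detector.

```python
import random

# Each candidate is a list of abstract integer properties (0-9).
NUM_PROPERTIES = 6
SECRET_PROFILE = [3, 7, 1, 8, 2, 5]  # hypothetical pattern the toy detector keys on


def detection_score(properties):
    """Toy detector: count how many properties match its known profile."""
    return sum(1 for p, s in zip(properties, SECRET_PROFILE) if p == s)


def mutate(properties, rng, rate=0.2):
    """Randomly re-roll each property with probability `rate`."""
    return [rng.randrange(10) if rng.random() < rate else p for p in properties]


def crossover(a, b, rng):
    """Single-point crossover between two parents."""
    point = rng.randrange(1, NUM_PROPERTIES)
    return a[:point] + b[point:]


def evolve(start, generations=50, population_size=20, seed=0):
    """Search for the property configuration with the lowest detection score."""
    rng = random.Random(seed)
    population = [mutate(start, rng) for _ in range(population_size)]
    for _ in range(generations):
        # Lower score is fitter: keep the least-detected half as parents.
        population.sort(key=detection_score)
        if detection_score(population[0]) == 0:
            break
        parents = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=detection_score)


best = evolve(start=SECRET_PROFILE)
print(best, detection_score(best))
```

Notice that the loop never needs a gradient. It only needs to ask the detector for a score, which is why genetic algorithms are a natural fit for this kind of black-box search.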
## Reinforcement Learning
A research group at [Endgame](https://www.endgame.com/) recently gave a [Def Con](https://www.defcon.org/) talk where they presented a framework which uses reinforcement learning to evade static virus detection.
At a high level, the AI plays a "game" against the antivirus where the agent can make functionality-preserving mutations to the virus.
The reward for the agent is its ability to not get detected by the antivirus.
Over time the AI will learn which types of actions result in getting detected by the antivirus.
This framework can be found on [GitHub](https://github.com/endgameinc/gym-malware).
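To give a feel for the "game", here is a tiny epsilon-greedy agent playing against a made-up detector. The action names and detection probabilities are hypothetical, and this does not use or mirror gym-malware's actual API. It is only a sketch of the reward loop described above.

```python
import random

# Hypothetical actions and the chance that a made-up detector catches each one.
DETECTION_PROBABILITY = {"mutation_a": 0.9, "mutation_b": 0.5, "mutation_c": 0.1}


def play_round(action, rng):
    """Return reward 1.0 if the toy detector misses, else 0.0."""
    detected = rng.random() < DETECTION_PROBABILITY[action]
    return 0.0 if detected else 1.0


def train(episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent that keeps a running average reward per action."""
    rng = random.Random(seed)
    actions = list(DETECTION_PROBABILITY)
    value = {a: 0.0 for a in actions}
    count = {a: 0 for a in actions}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(actions)  # explore a random action
        else:
            action = max(actions, key=value.get)  # exploit the best estimate
        reward = play_round(action, rng)
        count[action] += 1
        # Incremental mean update of the estimated value for this action.
        value[action] += (reward - value[action]) / count[action]
    return value


learned = train()
best_action = max(learned, key=learned.get)
print(best_action)
```

After enough rounds the agent's value estimates should line up with how often each action goes undetected, so it ends up preferring the action the toy detector is worst at catching.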
# Takeaways
Machine learning is great, but it needs to be properly defended.
As we start to use machine learning more and more, a large portion of the cyber security field may shift its focus away from securing systems to securing big data applications.
The complete code for the genetic algorithm and the fancy JavaScript graphs can be found in my [Random Scripts GitHub Repository](https://github.com/jrtechs/RandomScripts).
In the future I may package this into an [npm](https://www.npmjs.com/) package.
R is a programming language designed for statistical analysis and graphics.
Because R has been around since 1992, it has developed a large community and has over [13 thousand packages](https://cran.r-project.org/web/packages/) publicly available.
What is really cool about R is that it is an open source [GNU](http://www.gnu.org/) project.
# R Syntax and Paradigms
The syntax of R is C-esque with its use of curly braces.
The type system of R is similar to Python's in that it infers what type you are using.
This "lazy" type system allows for "faster" development since you don't have to worry about declaring types; however, this laziness makes it harder to debug and read your code.
The type system of R is rather strange and distinctly different from most other languages.
For starters, integers are represented as vectors of length 1.
These things may feel weird at first, but R's type system is one of the things that make it a great tool for manipulating data.
![R Arrays Start at 1](media/r/arrays.jpg)
Did I mention that arrays start at 1?
Technically, what we refer to as an array in Java is really a vector in R.
Arrays in R are data objects which can store data in more than two dimensions.
Since R tries to follow mathematical notation, indexing starts at 1, just like in linear algebra.
Using zero-based indexing makes sense for languages like C because the index is used to get at a particular memory location from a pointer.
<youtubesrc="s3FozVfd7q4"/>
I don't have the time to go over the basic syntax of R in a single blog post; however, I feel that this YouTube video does a pretty good job.
# R Markdown
One of my favorite aspects of R is its markdown language called Rmd.
Rmd is essentially markdown which can have embedded R scripts that run in it.
The Rmd file is compiled down to a markdown file which is converted to either a PDF, HTML file, or a slide show using pandoc.
You can provide options for the pandoc render using a YAML header in the Rmd file.
This is an amazing tool for creating reports and writing research papers.
The documents which you create are reproducible since you can share the source code for them.
If the data which you are using changes, you simply have to recompile the document to get an updated view.
You no longer have to re-generate a dozen graphs and update figures and statistics across your document.
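As a rough idea of what the YAML header mentioned above looks like, here is a small one that renders to HTML with a table of contents. The title and author are placeholders I made up; `html_document` and `toc` are standard R Markdown output options, but check the R Markdown docs for the full list that applies to your setup.

```yaml
---
title: "Example Report"   # placeholder title
author: "Your Name"       # placeholder author
output:
  html_document:
    toc: true             # include a table of contents
---
```

Swapping `html_document` for `pdf_document` should give you a PDF instead, assuming you have a LaTeX installation available.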