detection, and debugging. Static analysis is largely known for
looking up the hash of a file against a database of known viruses. It
is super easy to fool signature-based malware detection using simple
obfuscation methods. Dynamic analysis is a technique where you run the
program in a sandbox and monitor all the actions that it takes. If you
notice that the program is acting suspiciously, it is likely a virus.
Suspicious behavior typically includes things like registry edits and
API calls to bad host names.

Antivirus detection is very difficult, but probably not for the
reasons you think. The issue isn't writing programs which can detect
these static or dynamic properties of viruses; that is the easy part.
It is also relatively easy to determine a general rule set for what
makes a program dangerous. You can also easily blacklist suspicious
domains, block malicious activity, and implement a signature-based
malware detection program.

The real problem is that there are hundreds of thousands of malware
applications and more are created every day. Not only are there tons
of pesky malware applications, there is an absurd number of normal
programs which we don't want antivirus applications to block. It is
impossible for a small team of malware researchers to create a
definitive set of heuristics which can correctly identify all malware
programs. This is where we turn to the field of machine learning.
Humans are very bad with big data, but computers love big data. Most
antivirus companies use machine learning, and it has been a large
success so far because it has dramatically improved our ability to
detect zero-day viruses.
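To make the signature-based approach described above concrete, here is a minimal sketch of a hash lookup. The "database" is a hypothetical in-memory set holding the SHA-256 digest of a placeholder byte string; a real product would load its signatures from a much larger feed. It also shows why this kind of check is so easy to fool: changing a single byte produces a completely different digest.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
# The byte string below is a harmless placeholder, not a real sample.
KNOWN_BAD_HASHES = {
    hashlib.sha256(b"pretend this is a malicious binary").hexdigest(),
}


def is_known_malware(file_bytes: bytes) -> bool:
    """Return True when the file's SHA-256 digest is in the signature set."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES


original = b"pretend this is a malicious binary"
# Appending one byte changes the digest, so the signature no longer
# matches even though the content is nearly identical.
tweaked = original + b"\x00"

print(is_known_malware(original))  # True
print(is_known_malware(tweaked))   # False
```

This is why exact-match signatures only catch samples that have already been seen byte for byte.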
## Interesting Examples
### Kaspersky
Kaspersky appears to have done a ton of research into using machine
learning for malware detection. I would highly recommend that you read
their [white paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
on this subject.

It turns out that machine learning systems can be easily fooled by
using other machine learning algorithms. A classic example of this is
with image classification. It is easy to use neural networks or
genetic algorithms to generate examples which fool a machine learning
application: the attacker learns the weights of the target model and
then makes slight tweaks to the input to produce a false
classification.
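The idea of "slight tweaks to the input" is easiest to see on a toy model. The sketch below assumes a tiny linear classifier whose weights are made up for illustration and are already known to the attacker. Each feature is nudged by at most `epsilon` in the direction that lowers the score, which is a simplified version of the sign-based perturbation used against image classifiers.

```python
# Toy linear classifier: score = w . x + b, class 1 when score > 0.
# The weights are invented for illustration; the attacker is assumed
# to have learned or approximated them.
WEIGHTS = [0.9, -0.4, 0.3, 0.7]
BIAS = -0.5


def score(x):
    """Weighted sum of the features plus the bias."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def classify(x):
    """Return 1 when the score is positive, otherwise 0."""
    return 1 if score(x) > 0 else 0


def adversarial_tweak(x, epsilon):
    """Move every feature by epsilon in the direction that lowers the score."""
    return [xi - epsilon * (1 if w > 0 else -1) for w, xi in zip(WEIGHTS, x)]


sample = [0.5, 0.2, 0.4, 0.3]
tweaked = adversarial_tweak(sample, epsilon=0.15)

print(classify(sample))   # 1
print(classify(tweaked))  # 0
```

No feature moved by more than 0.15, yet the label flipped, because the small changes all push the score the same way.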


Since virus generation is a non-differentiable problem, people often
use genetic algorithms as the adversary that tries to fool the
antivirus. In other words, you don't want to attempt to calculate the
derivative between two versions of a virus for gradient descent. Since
viruses are high dimensional problems, it turns out that most
calculus-based implementations would actually be inefficient at
traversing the search space to find the global minimum. If you want to
learn more about genetic algorithms, check out my [recent blog
post](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
on it.
There are two major approaches which people have used to generate
antivirus resistant malware with genetic algorithms. The first
approach is to slowly make polymorphic changes to the virus in order
to fool the malware detection. One of the interesting things about
this approach is that you need some way of verifying that the
polymorphic changes that you apply to the virus don't break its
"virus capabilities".

Another approach is to represent a virus as a set of properties.
These properties are everything from the port of attack, the payloads,
obfuscation parameters, etc. The genetic algorithm would simply tweak
the properties of the virus until it found a configuration which
evaded the antivirus program.
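The property-based search above is really just a genetic algorithm over a list of values. Here is a purely abstract sketch: the "properties" are meaningless integers, and `mock_detector_score` is a made-up stand-in that counts how many arbitrary rules a configuration trips. None of the names or rules model a real antivirus; they only give the search something to minimize.

```python
import random

# Hypothetical rules: property i is "flagged" when its value is in set i.
FLAGGED_VALUES = [
    {0, 1, 2, 3, 4, 5, 6},
    {3, 4, 5, 6, 7, 8, 9},
    {0, 1, 2, 5, 7, 8, 9},
    {1, 2, 3, 4, 5, 6, 7},
    {0, 2, 4, 5, 6, 8, 9},
    {0, 1, 3, 4, 6, 7, 9},
]
NUM_PROPERTIES = len(FLAGGED_VALUES)
VALUES_PER_PROPERTY = 10


def mock_detector_score(genome):
    """Count how many made-up rules this configuration trips (lower is better)."""
    return sum(value in flagged for value, flagged in zip(genome, FLAGGED_VALUES))


def random_genome(rng):
    return [rng.randrange(VALUES_PER_PROPERTY) for _ in range(NUM_PROPERTIES)]


def crossover(rng, a, b):
    """Single-point crossover between two parents."""
    cut = rng.randrange(1, NUM_PROPERTIES)
    return a[:cut] + b[cut:]


def mutate(rng, genome, rate=0.1):
    """Randomly re-roll each property with probability `rate`."""
    return [rng.randrange(VALUES_PER_PROPERTY) if rng.random() < rate else g
            for g in genome]


def evolve(seed=0, population_size=20, generations=30):
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=mock_detector_score)
        if mock_detector_score(population[0]) == 0:
            break  # found a configuration that trips no rules
        parents = population[: population_size // 2]  # keep the fittest half
        children = [
            mutate(rng, crossover(rng, rng.choice(parents), rng.choice(parents)))
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return min(population, key=mock_detector_score)


best = evolve()
print(best, mock_detector_score(best))
```

Because the fittest half survives each generation, the best score never gets worse. The algorithm never needs a derivative, only a score it can compare.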
## Reinforcement Learning
A research group at [Endgame](https://www.endgame.com/) recently gave
a [Def Con](https://www.defcon.org/) talk where they presented a
framework which uses reinforcement learning to evade static virus
detection. At a high level, the AI plays a "game" against the
antivirus where the agent can make functionality-preserving mutations
to the virus. The reward for the agent is its ability to not get
detected by the antivirus. Over time the AI will learn which types of
actions will result in getting detected by the antivirus. This
framework can be found on
[GitHub](https://github.com/endgameinc/gym-malware).
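To give a feel for the "game" loop, here is a toy sketch using tabular Q-learning. This is not the gym-malware API and not Endgame's method; I swapped in the simplest reinforcement learning algorithm I could. The actions are abstract placeholders, and the mock environment just tracks a made-up suspicion score. The agent is rewarded once the score drops below a threshold.

```python
import random

# Hypothetical, abstract actions and their hidden effect on the mock
# detector's suspicion score. Nothing here models a real product.
ACTIONS = ["mutation_a", "mutation_b", "mutation_c"]
EFFECT = {"mutation_a": -2, "mutation_b": +1, "mutation_c": 0}
START_SCORE, THRESHOLD, MAX_STEPS = 6, 3, 10


def step(score, action):
    """Apply an action; return (new_score, reward, done)."""
    new_score = max(0, min(9, score + EFFECT[action]))
    undetected = new_score < THRESHOLD
    return new_score, (1.0 if undetected else -0.1), undetected


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(10) for a in ACTIONS}
    for _ in range(episodes):
        score = START_SCORE
        for _ in range(MAX_STEPS):
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(score, a)])  # exploit
            new_score, reward, done = step(score, action)
            future = 0.0 if done else max(q[(new_score, a)] for a in ACTIONS)
            # Standard Q-learning update toward reward + discounted future value.
            q[(score, action)] += alpha * (reward + gamma * future - q[(score, action)])
            score = new_score
            if done:
                break
    return q


q_table = train()
best_first_move = max(ACTIONS, key=lambda a: q_table[(START_SCORE, a)])
print(best_first_move)
```

The agent is never told what each action does. It only sees the reward, which is the same position the agent in the talk is in when it plays against a black-box antivirus.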
# Takeaways
Machine learning is great, but it needs to be properly defended. As
we start to use machine learning more and more, a large portion of the
cyber security field may shift its focus away from securing systems to
securing big data applications.

The complete code for the genetic algorithm and the fancy JavaScript
graphs can be found in my [Random Scripts GitHub
Repository](https://github.com/jrtechs/RandomScripts). In the future I
may package this into an [npm](https://www.npmjs.com/) package.

R is a programming language designed for statistical analysis and
graphics. Since R has been around since 1992, it has developed a large
community and has over [13 thousand
packages](https://cran.r-project.org/web/packages/) publicly
available. What is really cool about R is that it is an open source
[GNU](http://www.gnu.org/) project.
# R Syntax and Paradigms
The syntax of R is C-esque with its use of curly braces. The type
system of R is similar to Python's in that it can infer what type you
are using. This "lazy" type system allows for "faster" development
since you don't have to worry about declaring types; however, this
laziness makes it harder to debug and read your code. The type system
of R is rather strange and distinctly different from most other
languages. For starters, integers are represented as vectors of
length 1. These things may feel weird at first, but R's type system is
one of the things that make it a great tool for manipulating data.


Did I mention that arrays start at 1? Technically, what we refer to
as an array in Java is really a vector in R. Arrays in R are data
objects which can store data in more than two dimensions. Since R
tries to follow mathematical notation, indexing starts at 1, just like
in linear algebra. Zero-based indexing makes sense for languages like
C because the index is used as an offset from a pointer to reach a
particular memory location.
<youtube src="s3FozVfd7q4" />
I don't have the time to go over the basic syntax of R in a single
blog post; however, I feel that this YouTube video does a pretty good
job.
# R Markdown
One of my favorite aspects of R is its markdown language called Rmd.
Rmd is essentially markdown which can have embedded R scripts run in
it. The Rmd file is compiled down to a markdown file which is
converted to either a PDF, HTML file, or a slide show using pandoc.
You can provide options for the pandoc render using a YAML header in
the Rmd file. This is an amazing tool for creating reports and writing
research papers. The documents which you create are reproducible since
you can share their source code. If the data which you are using
changes, you simply have to recompile the document to get an updated
view. You no longer have to re-generate a dozen graphs and update
figures and statistics across your document.