
In this blog post I examine the ways in which antivirus programs currently employ machine learning and then go into the security vulnerabilities that it brings.
# ML in the Antivirus Industry

Malware detection falls into two broad categories: static and dynamic analysis. Static analysis examines a program without actually running its code. It looks at things like file fingerprints, hashes, reverse engineering, memory artifacts, packer detection, and debugging. Static analysis is best known for looking up the hash of a file against a database of known viruses. It is very easy to fool signature-based malware detection using simple obfuscation methods. Dynamic analysis is a technique where you run the program in a sandbox and monitor all the actions that it takes. If the program acts suspiciously, it is likely a virus. Suspicious behavior typically includes things like registry edits and API calls to bad host names.
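
To make the signature lookup concrete, here is a minimal sketch of hash-based detection. The sample bytes and the `SIGNATURE_DB` set are made up for illustration; a real product would ship a far larger database and may fingerprint files differently.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
KNOWN_BAD_SAMPLE = b"pretend these bytes are a known malicious file"
SIGNATURE_DB = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_known_malware(file_bytes: bytes) -> bool:
    """Return True when the file's SHA-256 digest is in the signature database."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURE_DB


# The exact sample is caught by the lookup.
original_flagged = is_known_malware(KNOWN_BAD_SAMPLE)

# Appending a single padding byte changes the digest, so the lookup misses.
modified_flagged = is_known_malware(KNOWN_BAD_SAMPLE + b"\x00")

print(original_flagged, modified_flagged)  # True False
```

Any change to the file, however small, produces a different digest, which is why an exact-match lookup on its own is so brittle.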
Antivirus detection is very difficult, but probably not for the reasons you think. The issue isn't writing programs which can detect these static or dynamic properties of viruses; that is the easy part. It is also relatively easy to determine a general rule set for what makes a program dangerous. You can also easily blacklist suspicious domains, block malicious activity, and implement a signature-based malware detection program.

The real problem is that there are hundreds of thousands of malware applications and more are created every day. Not only are there tons of pesky malware applications, there is also an absurd number of normal programs which we don't want antivirus applications to block. It is impossible for a small team of malware researchers to create a definitive set of heuristics which can correctly identify all malware programs. This is where we turn to the field of machine learning. Humans are very bad with big data, but computers love big data. Most antivirus companies use machine learning, and it has been a large success so far because it has dramatically improved our ability to detect zero-day viruses.
## Interesting Examples

### Cylance

[Cylance](https://www.cylance.com) uses supervised learning and static analysis to classify files as malware. The product extracts a list of attributes from each file and compares them against the attributes of known viruses.
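
As a rough illustration of supervised classification on static attributes, here is a nearest-centroid classifier. This is not Cylance's actual model, which I have not seen. The three features (byte entropy, suspicious import count, packed flag) and every number in `TRAINING` are assumptions picked for the example.

```python
from math import dist
from statistics import mean

# Hypothetical static features per file:
# (byte entropy, suspicious import count, packed flag)
TRAINING = [
    ((7.8, 12.0, 1.0), "malware"),
    ((7.5, 9.0, 1.0), "malware"),
    ((7.9, 15.0, 0.0), "malware"),
    ((5.1, 1.0, 0.0), "benign"),
    ((4.8, 0.0, 0.0), "benign"),
    ((5.6, 2.0, 0.0), "benign"),
]


def train_centroids(samples):
    """Average the feature vectors of each label into one centroid per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(features, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))


CENTROIDS = train_centroids(TRAINING)
print(classify((7.6, 11.0, 1.0), CENTROIDS))  # malware
print(classify((5.0, 1.0, 0.0), CENTROIDS))   # benign
```

The key point is that the labels come from files already known to be malicious or benign, which is what makes this supervised.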
### MalwareBytes Anomalous

[Anomalous](https://blog.malwarebytes.com/detections/machinelearning-anomalous-100/) is a machine learning application which simply flags files that appear different from its training set of known normal files. It does not attempt to learn what makes a virus a virus, but rather what makes a normal program a normal program. It alerts you about anything which does not look like a normal program, since it could be a virus.
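
The "learn what normal looks like" idea can be sketched with a simple z-score check. This is only a toy version of anomaly detection, not how Malwarebytes implements it. The two features and the sample values in `NORMAL_FILES` are assumptions.

```python
from statistics import mean, stdev

# Hypothetical features of known-normal files: (byte entropy, import count)
NORMAL_FILES = [(5.0, 20.0), (5.2, 22.0), (4.9, 18.0), (5.1, 21.0), (4.8, 19.0)]


def fit_profile(samples):
    """Learn the mean and standard deviation of each feature from normal files."""
    return [(mean(column), stdev(column)) for column in zip(*samples)]


def is_anomalous(features, profile, threshold=3.0):
    """Flag a file when any feature lies more than `threshold` deviations from normal.

    Assumes every feature has a non-zero standard deviation in the training set.
    """
    return any(
        abs(value - mu) / sigma > threshold
        for value, (mu, sigma) in zip(features, profile)
    )


PROFILE = fit_profile(NORMAL_FILES)
print(is_anomalous((5.05, 20.5), PROFILE))  # False: looks like the training set
print(is_anomalous((7.9, 3.0), PROFILE))    # True: far outside the normal range
```

Notice that no malware samples were needed for training. The trade-off is that an unusual but harmless program gets flagged too.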
### Kaspersky

Kaspersky appears to have done a ton of research into using machine learning for malware detection. I would highly recommend that you read their [whitepaper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf) on this subject.
# Why is this a problem?

It turns out that machine learning systems can be easily fooled by using other machine learning algorithms. A classic example of this is image classification. It is easy to use neural networks or genetic algorithms to generate examples which fool a machine learning application by learning its weights and then making slight tweaks to the input to produce a false classification.
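
Here is a minimal sketch of that "learn the weights, then nudge the input" idea on a tiny linear classifier. The weights, bias, and input are invented for the example, and a real image classifier has far more parameters. For a linear model the gradient of the score with respect to the input is just the weight vector, so this is the sign-of-gradient step used in fast gradient sign style attacks.

```python
# Hypothetical linear classifier: score = w . x + b, "flagged" when score > 0.
WEIGHTS = [0.9, -0.4, 0.7]
BIAS = -0.5


def score(x):
    """Weighted sum of the input features plus the bias."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def is_flagged(x):
    return score(x) > 0


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def adversarial_step(x, epsilon):
    """Move every feature by at most epsilon in the direction that lowers the score."""
    return [xi - epsilon * sign(w) for xi, w in zip(x, WEIGHTS)]


original = [0.6, 0.2, 0.3]
perturbed = adversarial_step(original, epsilon=0.1)

print(is_flagged(original))   # True
print(is_flagged(perturbed))  # False
```

Each feature moved by only 0.1, yet the classification flipped. That is the same effect as the slightly altered image in the figure below.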
![](media/AISaftey/AdversarialExample.png)
Since virus generation is a non-differentiable problem, people often use genetic algorithms as the adversary to fool the antivirus. In other words, you don't want to attempt to calculate the derivative between two versions of a virus for gradient descent. Since viruses are high-dimensional problems, it turns out that most calculus-based approaches would be inefficient at traversing the search space to find the global minimum. If you want to learn more about genetic algorithms, check out my [recent blog post](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm) on it.
# Fooling Antivirus Software

## Genetic Algorithms

There are two major approaches which people have used to generate antivirus-resistant malware with genetic algorithms. The first approach is to slowly make polymorphic changes to the virus in order to fool the malware detection. One of the interesting things about this approach is that you need some way of verifying that the polymorphic changes you apply to the virus don't break its "virus capabilities".

Another approach is to represent a virus as a set of properties. These properties include everything from the port of attack and the payloads to obfuscation parameters. The genetic algorithm simply tweaks the properties of the virus until it finds a configuration which evades the antivirus program.
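
The property-tweaking loop can be sketched on a purely abstract toy problem. Here a "configuration" is just a list of bits and the detector is a stand-in function I made up. `HIDDEN_RULES`, the scoring rule, and all parameters are assumptions, and nothing here touches a real file or a real antivirus. I also assume the detector returns a score instead of a yes/no answer, which gives the genetic algorithm something to minimize.

```python
import random

# Stand-in for the pattern a detector has learned (hypothetical).
HIDDEN_RULES = [1, 0, 1, 1, 0, 1, 0, 0]


def detector_score(config):
    """Toy detector: fraction of properties that match its learned pattern."""
    return sum(c == r for c, r in zip(config, HIDDEN_RULES)) / len(HIDDEN_RULES)


def mutate(config, rng, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in config]


def crossover(a, b, rng):
    """Single-point crossover: the head of one parent plus the tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(rng, population_size=20, generations=40):
    """Keep the least-detected half each generation and breed replacements."""
    population = [
        [rng.randint(0, 1) for _ in HIDDEN_RULES] for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=detector_score)
        survivors = population[: population_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=detector_score)


best = evolve(random.Random(0))
print(best, detector_score(best))
```

The algorithm never looks inside `detector_score`. It only needs to query it, which is why this works without any derivatives.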
## Reinforcement Learning

A research group at [Endgame](https://www.endgame.com/) recently gave a [Def Con](https://www.defcon.org/) talk where they presented a framework which uses reinforcement learning to evade static virus detection.
![Reinforcement Learning Diagram](media/AISaftey/Reinforcement_learning_diagram.png)
At a high level, the AI plays a "game" against the antivirus where the agent can make functionality-preserving mutations to the virus. The reward for the agent is its ability to not get detected by the antivirus. Over time the agent learns which types of actions result in getting detected by the antivirus. This framework can be found on [GitHub](https://github.com/endgameinc/gym-malware).
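
To give a feel for the reward loop, here is a heavily simplified sketch: an epsilon-greedy multi-armed bandit, which is the one-step special case of reinforcement learning. It does not use the gym-malware API. The action names and the evasion probabilities in `ACTIONS` are invented, and the "detector" is just a random draw.

```python
import random

# Hypothetical actions and the chance that a simulated detector misses after each.
ACTIONS = {"mutation_a": 0.1, "mutation_b": 0.6, "mutation_c": 0.3}


def play(action, rng):
    """Toy environment: reward 1.0 if the simulated detector misses, else 0.0."""
    return 1.0 if rng.random() < ACTIONS[action] else 0.0


def train(rng, episodes=5000, epsilon=0.1):
    """Estimate each action's average reward with epsilon-greedy exploration."""
    value = {action: 0.0 for action in ACTIONS}
    count = {action: 0 for action in ACTIONS}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(list(ACTIONS))  # explore a random action
        else:
            action = max(value, key=value.get)  # exploit the best estimate so far
        reward = play(action, rng)
        count[action] += 1
        # Incremental update of the running average reward for this action.
        value[action] += (reward - value[action]) / count[action]
    return value


values = train(random.Random(0))
print(max(values, key=values.get), values)
```

The agent is never told the probabilities in `ACTIONS`. It only sees rewards, and its value estimates drift toward the action that evades most often. The real framework deals with sequences of mutations and file state, which a bandit ignores.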
# Takeaways

Machine learning is great, but it needs to be properly defended. As we start to use machine learning more and more, a large portion of the cyber security field may shift its focus away from securing systems to securing big data applications.
# Resources

- [Blog post on Genetic Algorithms](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
- [Kaspersky White Paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
- [Windows Defender Use of ML](https://www.microsoft.com/security/blog/2015/11/16/windows-defender-rise-of-the-machine-learning/)
- [Machine Learning Malware Models via Reinforcement Learning (Paper)](https://arxiv.org/abs/1801.08917)
- [Evolvable Malware (Paper)](http://homepage.divms.uiowa.edu/~mshafiq/files/evolvable-malware.pdf)