In this blog post I want to examine how antivirus programs currently employ machine learning, and then go into the potential pitfalls that ML brings.

# ML in the Antivirus Industry

Most current malware detection falls into two broad categories: static and dynamic analysis.
Static analysis looks at the program without actually running the code.
It typically looks at things like the file fingerprint, virus scanning, reverse engineering, memory artifacts, packer detection, and debugging.
Static analysis also encompasses looking up the file's hash in a database of known viruses.
However, it is very easy to fool signature-based malware detection using simple obfuscation methods.
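
To make the hash lookup and its weakness concrete, here is a minimal sketch in Python. The sample bytes and the `SIGNATURE_DB` set are made up for illustration; a real product would query a much larger database and use far more than a single digest.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of files already known to be malicious.
KNOWN_BAD_SAMPLE = b"pretend this is the body of a known virus"
SIGNATURE_DB = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_known_malware(file_bytes: bytes) -> bool:
    """Return True if the file's SHA-256 digest is in the signature database."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURE_DB


# The original sample is caught...
original_hit = is_known_malware(KNOWN_BAD_SAMPLE)
# ...but appending a single junk byte changes the digest, so the lookup misses.
padded_hit = is_known_malware(KNOWN_BAD_SAMPLE + b"\x00")
print(original_hit, padded_hit)  # True False
```

The padded file would behave the same when run, yet an exact-match signature no longer recognises it.
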
Dynamic analysis is a technique where you run the program in a sandbox and monitor all the actions that it takes.
If you notice that the program is acting suspiciously (e.g. changing the registry or making suspicious API calls), it is likely a virus.
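
As a rough sketch of that idea, the snippet below scores a log of actions recorded by a sandbox. The action names, weights, and threshold are all assumptions I picked for the example, not values taken from any real product.

```python
# Hypothetical weights for behaviours a sandbox might record (illustrative only).
SUSPICIOUS_ACTIONS = {
    "registry_write": 2,
    "process_injection": 5,
    "disable_security_service": 5,
    "network_connect": 1,
}
ALERT_THRESHOLD = 6  # assumed cut-off


def suspicion_score(action_log):
    """Sum the weights of every logged action that appears in the rule table."""
    return sum(SUSPICIOUS_ACTIONS.get(action, 0) for action in action_log)


def looks_malicious(action_log):
    """Flag the program when its total score reaches the threshold."""
    return suspicion_score(action_log) >= ALERT_THRESHOLD


text_editor_log = ["file_read", "file_write", "network_connect"]
dropper_log = ["registry_write", "process_injection", "network_connect"]
print(looks_malicious(text_editor_log), looks_malicious(dropper_log))  # False True
```
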

Antivirus detection is very difficult, but probably not for the reasons you think.
The issue isn't writing programs which can detect these static or dynamic properties of viruses; that is the easy part.
It is also relatively easy to determine a general rule set for what makes a virus a virus.
You can easily blacklist suspicious domains and determine that certain file fingerprints, hashes, and behaviours are virus-like.
The real problem is that there are hundreds of thousands of malware applications and more are created every day.
Not only are there tons of pesky malware applications, there is also an absurd number of normal programs which we don't want antivirus applications to block.
It is impossible for a small team of malware researchers to create a definitive set of heuristics which can correctly identify all malware programs.

This is where we turn to the field of Machine Learning.
Humans are bad with big data, but computers absolutely love big data.
Most antivirus companies use machine learning, and it has been a large success so far because it has dramatically improved our ability to detect zero-day viruses.

## Interesting Examples

### Cylance

[Cylance](https://www.cylance.com) uses supervised learning and static analysis to classify files as malware. The product extracts a list of attributes from the file, which it can then compare against those of known viruses.
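
I don't know what Cylance's model actually looks like, so the following is only a toy sketch of the general idea: pull a couple of static attributes out of each file, learn from labelled examples, and classify new files by comparing their attributes. The two features (byte entropy and share of printable bytes), the nearest-centroid classifier, and the synthetic training files are all my own assumptions.

```python
import math
from collections import Counter


def extract_features(data: bytes):
    """Two toy static features: byte entropy (bits per byte) and share of printable ASCII."""
    counts = Counter(data)
    total = len(data)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    printable = sum(1 for b in data if 32 <= b < 127) / total
    return (entropy, printable)


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return tuple(sum(col) / len(col) for col in zip(*vectors))


def train(labelled_samples):
    """Supervised step: compute one centroid per label from (bytes, label) pairs."""
    by_label = {}
    for data, label in labelled_samples:
        by_label.setdefault(label, []).append(extract_features(data))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(model, data: bytes):
    """Predict the label whose centroid is closest to the file's feature vector."""
    features = extract_features(data)
    return min(model, key=lambda label: math.dist(features, model[label]))


# Synthetic training set: readable text stands in for benign files,
# uniformly distributed bytes stand in for packed or encrypted malware.
training_set = [
    (b"hello world, this is a plain configuration file\n" * 20, "benign"),
    (b"int main(void) { return 0; } /* ordinary source */\n" * 20, "benign"),
    (bytes(range(256)) * 4, "malware"),
    (bytes(reversed(range(256))) * 4, "malware"),
]
model = train(training_set)

print(classify(model, b"just another readable text document\n" * 10))  # benign
print(classify(model, bytes((i * 7) % 256 for i in range(1024))))       # malware
```

A real classifier would use many more attributes and a much stronger model, but the shape is the same: features in, label out, learned from examples rather than hand-written rules.
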

### Malwarebytes Anomalous

[Anomalous](https://blog.malwarebytes.com/detections/machinelearning-anomalous-100/) is a machine learning application which simply flags files that appear different from its training set of known normal files.
It does not attempt to learn what makes a virus a virus, but what makes a normal program a normal program.
Anything that does not look like a normal program triggers an alert, since it is probably a virus.
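
Here is a minimal sketch of that "learn what normal looks like" approach. It is not how Anomalous is implemented; I am assuming a single feature (byte entropy), a simple z-score test, and three made-up benign files just to show that no malware samples are needed for training.

```python
import math
import statistics
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def fit_normal_profile(benign_files):
    """Learn the mean and standard deviation of entropy over known-normal files only."""
    values = [byte_entropy(f) for f in benign_files]
    return statistics.mean(values), statistics.stdev(values)


def is_anomalous(profile, data: bytes, max_z: float = 3.0) -> bool:
    """Flag a file whose entropy lies more than max_z standard deviations from normal."""
    mean, stdev = profile
    return abs(byte_entropy(data) - mean) / stdev > max_z


# Hypothetical training set containing only normal-looking files.
benign_files = [
    b"the quick brown fox jumps over the lazy dog\n" * 10,
    b"configuration = enabled\nlogging = verbose\n" * 10,
    b"0123456789 abcdef 0123456789 abcdef\n" * 10,
]
profile = fit_normal_profile(benign_files)

print(is_anomalous(profile, b"the lazy dog jumps over the quick brown fox\n" * 10))  # False
print(is_anomalous(profile, bytes(range(256)) * 4))                                  # True
```

The trade-off is that anything unusual gets flagged, including legitimate programs that simply don't resemble the training set.
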

### Kaspersky

Kaspersky appears to have done a ton of research into using machine learning for malware detection.
I would highly recommend that you read their [white paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf) on this subject.

# Why is this a problem?

It turns out that machine learning systems can be easily fooled by using [Generative Adversarial Networks](https://en.wikipedia.org/wiki/Generative_adversarial_network). Essentially what this boils down to is that you have two