In this blog post I want to examine how antivirus programs currently employ machine learning, and then go into the potential pitfalls that ML brings.
Most current malware detection falls into two broad categories: static and dynamic analysis. Static analysis looks at the program without actually running the code. It typically covers things like file fingerprinting, virus scanning, reverse engineering, memory artifacts, packer detection, and debugging. Static analysis also includes looking up the hash of a file in a database of known viruses. However, it is very easy to fool signature-based malware detection with simple obfuscation methods. Dynamic analysis is a technique where you run the program in a sandbox and monitor every action it takes. If you notice the program acting suspiciously (for example, changing the registry or making suspicious API calls), it is likely a virus.
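To make the hash lookup concrete, here is a minimal sketch of signature-based detection and of why it is so easy to fool. The sample bytes and the `SIGNATURE_DB` set are made up for illustration; a real product would ship a far larger database.

```python
import hashlib

# Hypothetical "known bad" sample and signature database, for illustration only.
KNOWN_BAD_SAMPLE = b"pretend this is a malicious payload"
SIGNATURE_DB = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_known_malware(data: bytes) -> bool:
    """Return True if the SHA-256 digest of the file matches a known signature."""
    return hashlib.sha256(data).hexdigest() in SIGNATURE_DB


# The unmodified sample matches its signature...
original_hit = is_known_malware(KNOWN_BAD_SAMPLE)
# ...but appending a single padding byte changes the digest and evades the lookup.
padded_hit = is_known_malware(KNOWN_BAD_SAMPLE + b"\x00")

print(original_hit, padded_hit)  # True False
```

The padded file behaves the same as the original in this toy case, yet the lookup no longer recognises it. This is the weakness of exact-match signatures.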
Malware detection is very difficult, but probably not for the reasons you think. The issue isn't writing programs that can detect these static or dynamic properties of viruses; that is the easy part. It is also relatively easy to come up with a general rule set for what makes a virus a virus. You can easily blacklist suspicious domains and decide that certain file fingerprints, hashes, and behaviours are virus-like.
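A hand-written rule set of this kind might look something like the sketch below. The domains, weights, threshold, and the `SandboxReport` structure are all hypothetical; only the Windows API names are real.

```python
from dataclasses import dataclass, field

# Hypothetical rule set: the domains, weights, and threshold are invented for illustration.
BLOCKED_DOMAINS = {"evil.example", "c2.example"}
SUSPICIOUS_API_CALLS = {
    "WriteProcessMemory": 3,
    "CreateRemoteThread": 3,
    "RegSetValueEx": 1,
}
ALERT_THRESHOLD = 4


@dataclass
class SandboxReport:
    """What a (hypothetical) sandbox observed while running one program."""
    contacted_domains: set = field(default_factory=set)
    api_calls: list = field(default_factory=list)


def suspicion_score(report: SandboxReport) -> int:
    """Add up the weights of every rule the program triggered."""
    score = 5 * len(report.contacted_domains & BLOCKED_DOMAINS)
    score += sum(SUSPICIOUS_API_CALLS.get(call, 0) for call in report.api_calls)
    return score


def looks_malicious(report: SandboxReport) -> bool:
    return suspicion_score(report) >= ALERT_THRESHOLD


installer = SandboxReport(api_calls=["RegSetValueEx"])
injector = SandboxReport({"c2.example"}, ["WriteProcessMemory", "CreateRemoteThread"])
print(looks_malicious(installer), looks_malicious(injector))  # False True
```

Writing a handful of rules like these is straightforward. Keeping the weights and threshold right is the hard part.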
The real problem is that there are hundreds of thousands of malware applications and more are created every day.
Not only are there tons of pesky malware applications, there is also an absurd number of normal programs which we don't want antivirus applications to block.
It is impossible for a small team of malware researchers to create a definitive set of heuristics which can correctly identify all malware programs.
This is where we turn to the field of machine learning. Humans are bad with big data, but computers absolutely love it. Many antivirus companies now use machine learning, and it has been a big success so far because it has dramatically improved our ability to detect zero-day viruses.
Cylance uses supervised learning and static analysis to classify files as malware. The product extracts a list of attributes from each file, which it can then compare against those of known viruses.
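To give a feel for the general idea (this is not Cylance's actual method, which I don't know in detail), here is a toy supervised classifier over two static attributes. The attributes, the nearest-centroid model, and the synthetic training data are all my own assumptions for illustration.

```python
import math
from collections import Counter


def extract_features(data: bytes) -> list:
    """Two toy static attributes: byte entropy (bits per byte) and printable-ASCII ratio."""
    if not data:
        return [0.0, 0.0]
    n = len(data)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(data).values())
    printable = sum(1 for b in data if 32 <= b < 127) / n
    return [entropy, printable]


def train_centroids(samples):
    """samples is a list of (file bytes, label); returns {label: mean feature vector}."""
    sums, counts = {}, Counter()
    for data, label in samples:
        feats = extract_features(data)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(data: bytes, centroids) -> str:
    """Assign the label whose centroid is closest to the file's feature vector."""
    feats = extract_features(data)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))


# Synthetic training set: text-like files stand in for benign programs and
# high-entropy blobs stand in for packed malware. Real labels need real samples.
TRAINING = [
    (b"The quick brown fox jumps over the lazy dog. " * 10, "benign"),
    (b"config: name=value; path=/usr/local/bin\n" * 10, "benign"),
    (bytes(range(256)) * 4, "malware"),
    (bytes((i * 7 + 3) % 256 for i in range(1024)), "malware"),
]
centroids = train_centroids(TRAINING)

print(classify(b"hello world, this is plain text", centroids))  # benign
print(classify(bytes(range(255, -1, -1)), centroids))  # malware
```

High entropy on its own does not mean a file is malicious (compressed archives are high entropy too), so a real system would use many more attributes and far more training data.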
Anomalous is a machine learning application which simply flags files that look different from its training set of known normal files. It does not attempt to learn what makes a virus a virus, but rather what makes a normal program a normal program. It alerts you about anything that does not look like a normal program, since that is probably a virus.
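The same idea can be sketched with a very simple statistical profile: learn the mean and spread of each feature from normal files only, then flag anything that falls too far outside. The feature vectors (entropy, number of imports), the z-score model, and the threshold are hypothetical and are not how Anomalous itself works.

```python
import statistics


def fit_normal_profile(normal_vectors):
    """Learn per-feature mean and standard deviation from known-normal files only."""
    columns = list(zip(*normal_vectors))
    means = [statistics.mean(col) for col in columns]
    # Fall back to a tiny value so a constant feature never causes division by zero.
    stdevs = [statistics.pstdev(col) or 1e-9 for col in columns]
    return means, stdevs


def is_anomalous(vector, profile, max_z=3.0):
    """Flag the file if any feature is more than max_z standard deviations from normal."""
    means, stdevs = profile
    return any(abs(v - m) / s > max_z for v, m, s in zip(vector, means, stdevs))


# Hypothetical feature vectors for normal programs: [byte entropy, number of imports].
NORMAL_FILES = [[4.1, 120], [4.5, 98], [3.9, 110], [4.3, 105], [4.0, 131]]
profile = fit_normal_profile(NORMAL_FILES)

print(is_anomalous([4.2, 115], profile))  # False
print(is_anomalous([7.9, 3], profile))    # True
```

Notice that no malware samples are needed for training. The trade-off is that any unusual but harmless program gets flagged too.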
Kaspersky appears to have done a ton of research into using machine learning for malware detection. I would highly recommend that you read their white paper on this subject.
It turns out that machine learning systems can be fooled fairly easily using Generative Adversarial Networks (GANs). Essentially, what this boils down to is that you have two