Personal blog written from scratch using Node.js, Bootstrap, and MySQL. https://jrtechs.net

In this blog post I examine the ways in which antivirus programs
currently employ machine learning and then go into the security
vulnerabilities that it brings.
# ML in the Antivirus Industry

Malware detection falls into two broad categories: static and dynamic
analysis. Static analysis examines the program without actually
running the code. It looks at things like file fingerprints, hashes,
reverse engineering, memory artifacts, packer detection, and
debugging. Static analysis is best known for looking up a file's hash
in a database of known viruses. It is very easy to fool
signature-based malware detection with simple obfuscation methods.

Dynamic analysis is a technique where you run the program in a sandbox
and monitor all the actions that it takes. If you notice that the
program is acting suspiciously, it is likely a virus. Suspicious
behavior typically includes things like registry edits and API calls
to bad host names.
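
To make the weakness of hash lookups concrete, here is a minimal sketch of signature-based detection. The sample bytes and the one-entry `SIGNATURE_DB` are made up for illustration; a real product's database and matching logic are far more involved. The point is only that changing a single byte produces a completely different digest, so the lookup misses.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
KNOWN_BAD_SAMPLE = b"pretend this is a malicious payload"
SIGNATURE_DB = {hashlib.sha256(KNOWN_BAD_SAMPLE).hexdigest()}


def is_known_malware(file_bytes: bytes) -> bool:
    """Return True if the file's SHA-256 digest is in the signature database."""
    return hashlib.sha256(file_bytes).hexdigest() in SIGNATURE_DB


# The exact sample is caught...
original_detected = is_known_malware(KNOWN_BAD_SAMPLE)
# ...but appending one padding byte changes the digest, so the lookup misses.
padded_detected = is_known_malware(KNOWN_BAD_SAMPLE + b"\x00")

print(original_detected, padded_detected)  # True False
```
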

Malware detection is very difficult, but probably not for the reasons
you think. The issue isn't writing programs which can detect these
static or dynamic properties of viruses; that is the easy part. It is
also relatively easy to determine a general rule set for what makes a
program dangerous. You can also easily blacklist suspicious domains,
block malicious activity, and implement a signature-based malware
detection program.

The real problem is that there are hundreds of thousands of malware
applications and more are created every day. Not only are there tons
of pesky malware applications, there is also an absurd number of
normal programs which we don't want antivirus applications to block.
It is impossible for a small team of malware researchers to create a
definitive set of heuristics which can correctly identify all malware
programs. This is where we turn to the field of machine learning.
Humans are very bad with big data, but computers love big data. Most
antivirus companies use machine learning, and it has been a large
success so far because it has dramatically improved our ability to
detect zero-day viruses.
## Interesting Examples

### Cylance

[Cylance](https://www.cylance.com) uses supervised learning and static analysis to classify files as malware.
The product extracts a list of attributes from each file, which it can then compare against those of known viruses.
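
As a rough illustration of what "pulling attributes from a file" can look like, the sketch below computes three simple static features from raw bytes. These particular features (`size`, `entropy`, `printable_ratio`) are my own picks for the example and are not taken from Cylance's product; a supervised classifier would be trained on vectors like these together with malware/benign labels.

```python
import math
from collections import Counter


def extract_features(file_bytes: bytes) -> dict:
    """Compute a few simple static attributes without executing the file."""
    size = len(file_bytes)
    if size == 0:
        return {"size": 0, "entropy": 0.0, "printable_ratio": 0.0}
    counts = Counter(file_bytes)
    # Shannon entropy in bits per byte (0.0 to 8.0).
    entropy = -sum((c / size) * math.log2(c / size) for c in counts.values())
    # Fraction of bytes that are printable ASCII characters.
    printable = sum(1 for b in file_bytes if 32 <= b < 127)
    return {"size": size, "entropy": entropy, "printable_ratio": printable / size}


text_like = extract_features(b"hello world " * 100)
uniform_bytes = extract_features(bytes(range(256)) * 4)

print(text_like)
print(uniform_bytes)
```

Plain text scores low on entropy and high on printable ratio, while data that uses every byte value equally often reaches the 8 bits per byte maximum.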
### MalwareBytes Anomalous

[Anomalous](https://blog.malwarebytes.com/detections/machinelearning-anomalous-100/) is a machine learning application which simply flags files that appear different from its training set of known normal files.
It does not attempt to learn what makes a virus a virus, but rather what makes a normal program a normal program.
It alerts you about anything which does not look like a normal program, since it could be a virus.
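
The idea of modelling only normal programs can be sketched with a very simple anomaly detector. This is not how Anomalous works internally (I don't know its model); it is a per-feature z-score check, and the feature values in `NORMAL_SAMPLES` are invented for the example.

```python
from statistics import mean, stdev

# Hypothetical (entropy, import count) pairs measured on known-normal programs.
NORMAL_SAMPLES = [
    (4.8, 120), (5.1, 135), (5.0, 110), (4.9, 128), (5.2, 140), (5.0, 125),
]


def fit_profile(samples):
    """Record the per-feature mean and standard deviation of normal programs."""
    columns = list(zip(*samples))
    return [(mean(col), stdev(col)) for col in columns]


def is_anomalous(profile, sample, threshold=3.0):
    """Flag a sample if any feature is over `threshold` deviations from normal."""
    return any(
        abs(x - mu) / sigma > threshold for x, (mu, sigma) in zip(sample, profile)
    )


profile = fit_profile(NORMAL_SAMPLES)
typical = is_anomalous(profile, (5.0, 130))  # looks like the training set
outlier = is_anomalous(profile, (7.9, 3))    # far outside the normal range

print(typical, outlier)  # False True
```

Note that the detector never sees a single virus during training; it only knows what "normal" looks like.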
### Kaspersky

Kaspersky appears to have done a ton of research into using machine
learning for malware detection. I would highly recommend that you read
their [white paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
on this subject.
# Why is this a problem?

It turns out that machine learning systems can be easily fooled by
other machine learning algorithms. A classic example of this is image
classification. It is easy to use neural networks or genetic
algorithms to generate examples which fool a machine learning
application: you learn the weights of the model and then make slight
tweaks to your input until it gives a false classification.

![Adversarial example](media/AISaftey/AdversarialExample.png)
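
Here is a minimal sketch of that "slight tweak" idea on a differentiable model. It assumes the attacker already knows the weights of a tiny logistic-regression classifier; `WEIGHTS`, `BIAS`, and the input `x` are invented numbers, and a real image classifier has millions of parameters rather than four. Each feature is nudged by at most `epsilon` in the direction that lowers the score.

```python
import math

# Hypothetical weights of a tiny logistic-regression classifier.
WEIGHTS = [0.9, -0.6, 0.4, -0.8]
BIAS = 0.1


def score(x):
    """Probability that the input belongs to the positive class."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def adversarial(x, epsilon):
    """Shift every feature by epsilon against the sign of its weight.

    For a linear model the gradient of the logit with respect to the input is
    the weight vector itself, so this is a one-step sign-of-gradient attack.
    """
    return [xi - epsilon * math.copysign(1.0, w) for xi, w in zip(x, WEIGHTS)]


x = [0.6, 0.4, 0.5, 0.3]
x_adv = adversarial(x, epsilon=0.15)

# The original lands above 0.5 and the tweaked copy lands below it.
print(round(score(x), 3), round(score(x_adv), 3))
```
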

Since virus generation is a non-differentiable problem, people often
use genetic algorithms as the adversary that fools the antivirus. In
other words, you don't want to attempt to calculate the derivative
between two versions of a virus for gradient descent. Since viruses
are also high-dimensional problems, most calculus-based approaches
would be inefficient at traversing the search space to find the global
minimum. If you want to learn more about genetic algorithms, check out
my [recent blog post](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
on it.
# Fooling Antivirus Software

## Genetic Algorithms

There are two major approaches which people have used to generate
antivirus-resistant malware with genetic algorithms. The first
approach is to slowly make polymorphic changes to the virus in order
to fool the malware detection. One of the interesting things about
this approach is that you need some way of verifying that the
polymorphic changes you apply to the virus don't break its
"virus capabilities".

Another approach is to represent a virus as a set of properties.
These properties are everything from the port of attack and the
payloads to the obfuscation parameters. The genetic algorithm simply
tweaks the properties of the virus until it finds a configuration
which evades the antivirus program.
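
The property-based approach can be sketched as a search over a small configuration space. Everything here is a toy: the `OPTIONS`, the `PENALTY` table, and `detector_score` are stand-ins I made up so the loop has something to optimize, and they do not describe any real malware or antivirus. The structure (selection, crossover, mutation) is the part that matters.

```python
import random

rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical property space: each gene picks one option for one property.
OPTIONS = {
    "port": [21, 80, 443, 8080],
    "payload": ["p1", "p2", "p3"],
    "obfuscation": [0, 1, 2, 3],
}
KEYS = list(OPTIONS)

# Made-up suspicion scores; a real detector would be a black box.
PENALTY = {
    "port": {21: 0.5, 80: 0.1, 443: 0.1, 8080: 0.4},
    "payload": {"p1": 0.4, "p2": 0.2, "p3": 0.3},
    "obfuscation": {0: 0.3, 1: 0.2, 2: 0.0, 3: 0.1},
}


def detector_score(config):
    """Stand-in for an antivirus: higher means more suspicious."""
    return sum(PENALTY[k][config[k]] for k in KEYS)


def random_config():
    return {k: rng.choice(OPTIONS[k]) for k in KEYS}


def crossover(a, b):
    """Child takes each property from one parent at random."""
    return {k: rng.choice((a[k], b[k])) for k in KEYS}


def mutate(config, rate=0.2):
    """Occasionally re-roll a property."""
    return {
        k: rng.choice(OPTIONS[k]) if rng.random() < rate else v
        for k, v in config.items()
    }


def evolve(generations=20, size=12):
    population = [random_config() for _ in range(size)]
    history = []  # best (lowest) score seen at the start of each generation
    for _ in range(generations):
        population.sort(key=detector_score)  # least suspicious first
        history.append(detector_score(population[0]))
        parents = population[: size // 2]  # the better half survives unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return min(population, key=detector_score), history


best, history = evolve()
print(best, detector_score(best))
```

Because the better half of each generation survives unchanged, the best score can never get worse from one generation to the next.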
## Reinforcement Learning

A research group at [Endgame](https://www.endgame.com/) recently gave
a [Def Con](https://www.defcon.org/) talk where they presented a
framework which uses reinforcement learning to evade static virus
detection.

![Reinforcement Learning Diagram](media/AISaftey/Reinforcement_learning_diagram.png)

At a high level, the AI plays a "game" against the antivirus where the
agent can make functionality-preserving mutations to the virus. The
reward for the agent is its ability to not get detected by the
antivirus. Over time the AI learns which types of actions result in
getting detected by the antivirus. This framework can be found on
[GitHub](https://github.com/endgameinc/gym-malware).
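
To give a feel for the reward loop, here is a heavily simplified sketch. It does not use the gym-malware API; it is a stateless, bandit-style agent with an epsilon-greedy policy, and the action names and `EVASION_CHANCE` values are invented. The agent only ever sees a reward of 1.0 (not detected) or 0.0 (detected) and has to work out which action pays off.

```python
import random

rng = random.Random(42)  # fixed seed for repeatability

# Hypothetical chance that the stand-in antivirus misses the sample after each
# mutation. The agent never sees these numbers; it only observes rewards.
EVASION_CHANCE = {"mutation_a": 0.05, "mutation_b": 0.10, "mutation_c": 0.90}
ACTIONS = list(EVASION_CHANCE)


def play_round(action):
    """Environment step: reward 1.0 if the sample goes undetected, else 0.0."""
    return 1.0 if rng.random() < EVASION_CHANCE[action] else 0.0


def train(steps=2000, epsilon=0.1):
    value = {a: 0.0 for a in ACTIONS}  # running estimate of each action's reward
    count = {a: 0 for a in ACTIONS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)          # explore a random action
        else:
            action = max(ACTIONS, key=value.get)  # exploit the best estimate
        reward = play_round(action)
        count[action] += 1
        # Incremental mean of the rewards observed for this action.
        value[action] += (reward - value[action]) / count[action]
    return value, count


value, count = train()
best_action = max(ACTIONS, key=value.get)
print(best_action, value)
```

A full reinforcement learning setup would also track state (the current version of the file) so the agent can chain several mutations together; this sketch leaves that out to keep the reward loop visible.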
# Takeaways

Machine learning is great, but it needs to be properly defended. As
we start to use machine learning more and more, a large portion of the
cyber security field may shift its focus away from securing systems to
securing big data applications.
# Resources

- [Blog post on Genetic Algorithms](https://jrtechs.net/data-science/lets-build-a-genetic-algorithm)
- [Kaspersky White Paper](https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf)
- [Windows Defender Use of ML](https://www.microsoft.com/security/blog/2015/11/16/windows-defender-rise-of-the-machine-learning/)
- [Machine Learning Malware Models via Reinforcement Learning (Paper)](https://arxiv.org/abs/1801.08917)
- [Evolvable Malware (Paper)](http://homepage.divms.uiowa.edu/~mshafiq/files/evolvable-malware.pdf)