Title: Integrating Semantics to Boost Classification Accuracy: A Case
Study in Polymorphic Shellcode Attribution Analysis
Speaker: Deguang Kong, University of Science and Technology of China
(joint work with H. Xi, S. Zhu, Y. Jhi, P. Liu)
Abstract:
Machine learning techniques are widely used to solve practical
problems in many contexts (e.g., image annotation, video
categorization). However, classification error rates that are high,
or at least not very low, greatly hinder applications with rigorous
accuracy requirements. For example, in the information security
field, shellcode attribution analysis tolerates only an extremely low
false positive rate and a very low false negative rate, and the
inherent variability of polymorphic shellcode further complicates
statistical analysis. One promising
solution to this challenge is to integrate domain knowledge to boost
classification accuracy. In the context of shellcode analysis, we
present a case study that addresses two basic questions in
semantics-aware machine learning: a) What kind of knowledge can be
used? b) How can that knowledge be integrated into the
classification process? We use static data-flow analysis
(including static taint analysis) techniques to extract semantic
characteristics from the shellcode. Different kinds of graphical
models (e.g., Hidden Markov Model, Conditional Random Field;
Mixture of Markov Models, Markov Random Field) can then be used to
model shellcode of the same type for attribution analysis. The initial
experimental results show that our approach performs far better than
purely statistical approaches, which also refutes the conclusion that
modeling polymorphic shellcode is infeasible (J. Song, CCS 2007).
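
To make the attribution step concrete, the following is a minimal
sketch, not the method presented in the talk. It assumes that static
analysis has already reduced each shellcode sample to a sequence of
semantic tokens, and it swaps in a plain first-order Markov chain per
family (simpler than the HMM or CRF named above) with add-one
smoothing. The token names, the family names "engine_a" and
"engine_b", and the training sequences are all hypothetical.

```python
import math
from collections import Counter


class MarkovChain:
    """First-order Markov chain over symbolic tokens, add-one smoothed."""

    def __init__(self, vocab):
        self.vocab_size = len(set(vocab))
        self.start = Counter()   # counts of first tokens
        self.trans = {}          # token -> Counter of next tokens
        self.n_seqs = 0

    def fit(self, sequences):
        for seq in sequences:
            if not seq:
                continue
            self.n_seqs += 1
            self.start[seq[0]] += 1
            for a, b in zip(seq, seq[1:]):
                self.trans.setdefault(a, Counter())[b] += 1
        return self

    def log_likelihood(self, seq):
        # Smoothing keeps unseen starts and transitions at non-zero probability.
        v = self.vocab_size
        ll = math.log((self.start[seq[0]] + 1) / (self.n_seqs + v))
        for a, b in zip(seq, seq[1:]):
            row = self.trans.get(a, Counter())
            ll += math.log((row[b] + 1) / (sum(row.values()) + v))
        return ll


def attribute(models, seq):
    """Return the family whose model gives seq the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(seq))


# Hypothetical semantic tokens, as a data-flow analysis might emit them.
VOCAB = ["getpc", "load_key", "xor_mem", "add_mem",
         "inc_ptr", "loop", "jmp_payload"]

# Hypothetical training data: token sequences for two made-up families.
TRAINING = {
    "engine_a": [
        ["getpc", "load_key", "xor_mem", "inc_ptr", "loop", "jmp_payload"],
        ["getpc", "load_key", "xor_mem", "inc_ptr", "xor_mem", "inc_ptr",
         "loop", "jmp_payload"],
    ],
    "engine_b": [
        ["load_key", "getpc", "add_mem", "inc_ptr", "loop", "jmp_payload"],
        ["load_key", "getpc", "add_mem", "add_mem", "inc_ptr", "loop",
         "jmp_payload"],
    ],
}

models = {name: MarkovChain(VOCAB).fit(seqs) for name, seqs in TRAINING.items()}

sample = ["getpc", "load_key", "xor_mem", "inc_ptr", "loop", "jmp_payload"]
print(attribute(models, sample))
```

The point of the sketch is that the models see semantic tokens rather
than raw bytes, so byte-level polymorphism that preserves the
decoder's data flow would leave the token sequence, and therefore the
attribution, unchanged.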