Title: Integrating Semantics to Boost Classification Accuracy: A Case Study in Polymorphic Shellcode Attribution Analysis

Speaker: Deguang Kong, Uni. of Sci. & Tec. of China (joint work with H. Xi, S. Zhu, Y. Jhi, P. Liu)

Abstract: Machine learning techniques have been widely used to solve practical problems in many contexts (e.g., image annotation, video categorization). However, classification error rates that are high (or at least not very low) greatly hinder applications with rigorous accuracy requirements. For example, in the information security field, shellcode attribution analysis tolerates only an extremely low false positive rate and a very low false negative rate, and the inherent variability of polymorphic shellcode further complicates statistical analysis. One promising solution to this challenge is to integrate domain knowledge to boost classification accuracy. In the context of shellcode analysis, we present a case study that addresses two basic questions in semantics-aware machine learning: a) What kind of knowledge can be used? b) How can that knowledge be integrated into the classification process? We use static data-flow analysis (including static taint analysis) to extract semantic characteristics from the shellcode; different kinds of graphical models (e.g., Hidden Markov Model, Conditional Random Field, Mixture of Markov Models, Markov Random Field) can then be used to model shellcode of the same type for attribution analysis. Initial experimental results show that our approach performs far better than purely statistical approaches, which also refutes the earlier conclusion that modeling polymorphic shellcode is infeasible (J. Song, CCS 2007).
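
The two-stage idea in the abstract (extract semantic features, then fit one sequence model per shellcode type and attribute a sample to the best-scoring model) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the token names (LOAD, XOR, ...), the family labels engine_a and engine_b, and the hand-written training sequences are hypothetical stand-ins for the output of static data-flow analysis, and a plain first-order Markov chain with add-one smoothing is used in place of the richer graphical models named in the abstract.

```python
import math
from collections import defaultdict


class MarkovChain:
    """First-order Markov chain over symbolic tokens with add-one smoothing."""

    def __init__(self, vocabulary):
        self.vocab_size = len(set(vocabulary))
        # counts[prev][cur] = number of times `cur` followed `prev` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for prev, cur in zip(seq, seq[1:]):
                self.counts[prev][cur] += 1
        return self

    def log_likelihood(self, seq):
        """Sum of smoothed log transition probabilities along `seq`."""
        total = 0.0
        for prev, cur in zip(seq, seq[1:]):
            row = self.counts.get(prev, {})
            row_total = sum(row.values())
            prob = (row.get(cur, 0) + 1) / (row_total + self.vocab_size)
            total += math.log(prob)
        return total


def attribute(models, seq):
    """Return the label of the model that assigns `seq` the highest likelihood."""
    return max(models, key=lambda label: models[label].log_likelihood(seq))


# Hypothetical semantic tokens; in the abstract's setting these would come
# from static data-flow analysis rather than being written by hand.
VOCAB = ["LOAD", "XOR", "ADD", "STORE", "LOOP", "JMP"]

# Hypothetical training data: two made-up decoder families.
TRAINING = {
    "engine_a": [
        ["LOAD", "XOR", "STORE", "LOOP", "LOAD", "XOR", "STORE", "LOOP"],
        ["JMP", "LOAD", "XOR", "STORE", "LOOP"],
    ],
    "engine_b": [
        ["LOAD", "ADD", "STORE", "LOOP", "LOAD", "ADD", "STORE", "LOOP"],
        ["JMP", "LOAD", "ADD", "STORE", "LOOP"],
    ],
}

MODELS = {label: MarkovChain(VOCAB).fit(seqs) for label, seqs in TRAINING.items()}

if __name__ == "__main__":
    sample = ["LOAD", "XOR", "STORE", "LOOP"]
    print(attribute(MODELS, sample))
```

The point of the sketch is only the division of labor: the models operate on semantic tokens rather than raw bytes, so byte-level polymorphism that preserves the underlying data flow would leave the token sequence, and hence the attribution, unchanged.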