Characterizing Malware Samples in the SOREL-20M dataset by Concept Learning

Bachelor's Thesis

Tomáš Bisták

Supervisor: doc. RNDr. Martin Homola, PhD.

Annotation

Concept learning is an explainable AI method that has already been applied to malware characterization, e.g. on the EMBER dataset. This work will focus on the larger SOREL-20M dataset. The main aim is to compare the obtained characterizations, the detected noise level, and related properties with the earlier results, and to cross-validate the overall applicability of the methodology.

Aim

  1. Preprocess SOREL-20M into a format suitable for ontology-based concept learning
  2. Apply concept learning (e.g. using DL-Learner) to the dataset to obtain malware sample characterizations in the form of structured concept descriptions
  3. Compare the results with previous works relying on the EMBER dataset

Plan

August - September 2023

  • study the theoretical foundations of description logics and OWL
  • get an overview of the DL-Learner framework

October - November 2023

  • get familiar with the data provided in the SOREL-20M dataset and the structure of PE files
  • determine whether this dataset is compatible with the developed ontology
  • write a script to transform the information about individual files available in SOREL-20M into an OWL knowledge base
  • use the created script to prepare one small knowledge base for calibration purposes
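
The transformation step above could look roughly like the following minimal sketch. It is only an illustration under stated assumptions: the record layout (`sha256`, `is_malware`, `sections`), the `pe:` namespace, and the class and property names are hypothetical placeholders, not the actual SOREL-20M schema or the ontology's real vocabulary. The labels are kept apart from the knowledge base so that they can later be supplied to the learner as positive and negative examples.

```python
# Hypothetical namespace and vocabulary; the real ontology's IRIs will differ.
PREFIXES = (
    "@prefix pe: <http://example.org/pe#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n"
)


def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal.

    Minimal on purpose: control characters are not handled here.
    """
    return text.replace("\\", "\\\\").replace('"', '\\"')


def sample_to_turtle(record):
    """Serialize one sample record (a plain dict) as Turtle triples."""
    sha = record["sha256"]
    file_iri = f"pe:file_{sha}"
    lines = [f"{file_iri} a pe:PEFile ."]
    for index, section in enumerate(record.get("sections", [])):
        section_iri = f"pe:section_{sha}_{index}"
        name = escape(section["name"])
        entropy = section["entropy"]
        lines.append(f"{file_iri} pe:hasSection {section_iri} .")
        lines.append(f"{section_iri} a pe:Section .")
        lines.append(f'{section_iri} pe:sectionName "{name}" .')
        lines.append(f'{section_iri} pe:entropy "{entropy}"^^xsd:double .')
    return "\n".join(lines)


def build_knowledge_base(records):
    """Return the Turtle text plus the positive and negative example IDs."""
    body = "\n".join(sample_to_turtle(record) for record in records)
    positives = [r["sha256"] for r in records if r["is_malware"]]
    negatives = [r["sha256"] for r in records if not r["is_malware"]]
    return PREFIXES + body + "\n", positives, negatives
```

A production script would more likely rely on an RDF or OWL library for serialization; plain string formatting is used here only to keep the sketch free of dependencies.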

December 2023 - January 2024

  • learn how to work with DL-Learner and explore its capabilities
  • run initial experiments with DL-Learner

February 2024

  • study the implementation of DL-Learner in more detail
  • try to improve the algorithms whose performance we will later investigate
  • identify the hyperparameters worth calibrating

March 2024

  • perform calibration experiments with the corrected version of DL-Learner
  • prepare more fractional datasets of various sizes
  • validate the choice of hyperparameter values on newly prepared datasets
  • decide on the structure of the thesis
  • write the introduction, theoretical background, and other overview chapters
  • reconsider the proposed changes (not corrections) to the refinement operator and look for new ways to improve its efficiency
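
Preparing fractional datasets of various sizes, as planned above, could be sketched as follows. This is a minimal illustration, not the procedure the thesis will necessarily use: the function name `fractional_dataset`, its parameters, and the assumption that samples are identified by lists of hashes are all hypothetical. A fixed seed keeps each fractional dataset reproducible across calibration and validation runs.

```python
import random


def fractional_dataset(malware, benign, size, malware_ratio=0.5, seed=0):
    """Draw a reproducible subset with a fixed malware-to-benign ratio.

    `malware` and `benign` are lists of sample identifiers (hypothetical
    input format); `size` is the total number of samples to draw.
    """
    rng = random.Random(seed)  # local generator, global state is untouched
    n_malware = round(size * malware_ratio)
    n_benign = size - n_malware
    if n_malware > len(malware) or n_benign > len(benign):
        raise ValueError("not enough samples for the requested size")
    # Sampling without replacement, separately for each class.
    return rng.sample(malware, n_malware), rng.sample(benign, n_benign)
```

For example, calling the function with sizes 200, 400, and 800 and distinct seeds would yield calibration and validation sets that share the same class balance.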

April 2024

  • rerun the calibration with the potentially improved version of the refinement operator (only if we conclude that we may achieve better results)
  • prepare fractional datasets targeting a specific category of malware and/or type of PE file
  • perform calibration experiments on the more focused datasets and validate the hyperparameter values
  • compare the results of all experiments
  • assess the overall performance of all the tested versions of DL-Learner
  • write the remainder of the thesis

May 2024

  • incorporate the feedback from the supervisor into the thesis

Literature

  1. Rudolph, S. (2011). Foundations of description logics. Reasoning Web International Summer School.
  2. Schreiber, G. & Raimond, Y. (2014). RDF 1.1 Primer. W3C. https://www.w3.org/TR/rdf11-primer/.
  3. Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P. F., & Rudolph, S. (2009). OWL 2 Web Ontology Language Primer. W3C.
  4. Anderson, H.S. & Roth, P. (2018). EMBER: An open dataset for training static PE malware machine learning models. ArXiv, abs/1804.04637.
  5. Harang, R. & Rudd, E.M. (2020). SOREL-20M: A large scale benchmark dataset for malicious PE detection. ArXiv, abs/2012.07634.
  6. Bühmann, L., Lehmann, J., & Westphal, P. (2016). DL-Learner — A framework for inductive learning on the Semantic Web. Journal of Web Semantics, 39.
  7. Lehmann, J., & Hitzler, P. (2009). Concept learning in description logics using refinement operators. Machine Learning, 78, 203-250.
  8. Lehmann, J., Auer, S., Bühmann, L., & Tramp, S. (2011). Class expression learning for ontology engineering. Journal of Web Semantics, 9, 71-81.
  9. Tran, A.C., Dietrich, J., Guesgen, H.W., & Marsland, S.R. (2012). An Approach to Parallel Class Expression Learning. International Web Rule Symposium.
  10. Tran, A.C., Dietrich, J., Guesgen, H.W., & Marsland, S.R. (2012). Two-way Parallel Class Expression Learning. Asian Conference on Machine Learning.
  11. Švec, P., Balogh, Š., & Homola, M. (2021). Experimental Evaluation of Description Logic Concept Learning Algorithms for Static Malware Detection. ICISSP.
  12. Švec, P., Balogh, Š., Homola, M., & Kľuka, J. (2022). Knowledge-Based Dataset for Training PE Malware Detection Models. ArXiv, abs/2301.00153.

Versions