Characterizing Malware Samples in the SOREL-20M dataset by Concept Learning

Annotation

Concept learning is an explainable AI method that has already been applied to malware characterization, e.g. on the EMBER dataset. This work focuses on the larger SOREL-20M dataset. The main aim is to compare the resulting characterizations, the detected noise level, and similar properties with the earlier results, and to cross-validate the overall applicability of the methodology.

Aim

Preprocess SOREL-20M into a format suitable for application of concept learning based on an ontology
Apply concept learning (e.g. using DL-Learner) to the dataset to obtain malware sample characterizations in the form of structured concept descriptions
Compare the results with previous works relying on the EMBER dataset
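
To make the second aim more concrete, the following toy Python sketch shows how a candidate concept description can be scored against labelled samples. Everything in it is an illustrative assumption: the feature names, the four samples, and the restriction to plain conjunctions of atomic features. DL-Learner itself works over OWL class expressions with a reasoner and is not used here.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical samples, each described by a set of atomic features.
SAMPLES: Dict[str, Set[str]] = {
    "s1": {"HasPackedSection", "ImportsNetworkApi"},
    "s2": {"HasPackedSection", "ImportsNetworkApi", "HasDebugInfo"},
    "s3": {"HasDebugInfo"},
    "s4": {"ImportsNetworkApi", "HasDebugInfo"},
}
POSITIVE: Set[str] = {"s1", "s2"}  # assumed malware samples
NEGATIVE: Set[str] = {"s3", "s4"}  # assumed benign samples


def covers(concept: FrozenSet[str], features: Set[str]) -> bool:
    """A conjunction covers a sample if the sample has every conjunct."""
    return concept <= features


def accuracy(
    concept: FrozenSet[str],
    samples: Dict[str, Set[str]],
    positive: Set[str],
    negative: Set[str],
) -> float:
    """Fraction of examples the concept classifies correctly."""
    true_pos = sum(covers(concept, samples[s]) for s in positive)
    true_neg = sum(not covers(concept, samples[s]) for s in negative)
    return (true_pos + true_neg) / (len(positive) + len(negative))


if __name__ == "__main__":
    for conjuncts in ({"HasPackedSection"}, {"ImportsNetworkApi"}):
        concept = frozenset(conjuncts)
        print(sorted(concept), accuracy(concept, SAMPLES, POSITIVE, NEGATIVE))
```

A concept learner searches the space of such descriptions for ones with high accuracy; the resulting description doubles as a human-readable characterization of the positive samples.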

Plan

August - September 2022

study the theoretical foundations of description logics and OWL
get an overview of the DL-Learner framework

October - November 2022

get familiar with the data provided in the SOREL-20M dataset and the structure of PE files
determine whether the aforementioned dataset is compatible with the developed ontology
write a script to transform the information about individual files available in SOREL-20M into an OWL knowledge base
use the created script to prepare one small knowledge base for calibration purposes
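
The transformation step could look roughly like the sketch below, which serializes one sample as Turtle assertions using only the Python standard library. The namespace `http://example.org/pe#`, the class name `Malware`, and the property name `sectionCount` are hypothetical placeholders; the actual script would use the IRIs defined by the ontology and the fields actually present in the dataset.

```python
from typing import Dict, Iterable

# Hypothetical namespace; the real ontology defines its own IRIs.
BASE = "http://example.org/pe#"
OWL = "http://www.w3.org/2002/07/owl#"


def sample_to_turtle(
    sample_id: str,
    class_names: Iterable[str],
    int_values: Dict[str, int],
) -> str:
    """Serialize one sample as a Turtle statement block.

    class_names become class assertions; int_values become data property
    assertions (a bare integer in Turtle is typed as xsd:integer).
    """
    subject = f"<{BASE}{sample_id}>"
    parts = [f"a <{OWL}NamedIndividual>"]
    parts += [f"a <{BASE}{name}>" for name in sorted(class_names)]
    parts += [f"<{BASE}{prop}> {value}" for prop, value in sorted(int_values.items())]
    return subject + " " + " ;\n    ".join(parts) + " ."


if __name__ == "__main__":
    # Placeholder identifier and values, for illustration only.
    print(sample_to_turtle("sample0001", ["Malware"], {"sectionCount": 5}))
```

Emitting plain Turtle keeps the sketch dependency-free; a production script would more likely rely on an RDF or OWL library to handle escaping and datatypes.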

December 2022 - January 2023

learn how to work with DL-Learner and explore its capabilities
run initial experiments with DL-Learner

February 2023

study the implementation of DL-Learner in more detail
try to improve the algorithms whose performance we will later investigate
identify the hyperparameters worth calibrating

March 2023

perform calibration experiments with the corrected version of DL-Learner
prepare more fractional datasets of various sizes
validate the choice of hyperparameter values on newly prepared datasets
decide on the structure of the thesis
write the introduction, theoretical background, and other overview chapters
reconsider the proposed changes (not corrections) to the refinement operator and seek new ways to improve its efficiency
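
One possible way to prepare the fractional datasets mentioned above is stratified random sampling, sketched below in Python. Preserving the malware/benign ratio of the full set is an assumption made for this illustration, not a requirement stated in the plan, and the function name and the label format (sample ID mapped to a boolean) are hypothetical.

```python
import random
from typing import Dict, List


def fractional_dataset(labels: Dict[str, bool], size: int, seed: int = 0) -> List[str]:
    """Draw `size` sample IDs, approximately keeping the malware/benign ratio.

    labels maps a sample ID to True (malware) or False (benign).
    A fixed seed makes the selection reproducible.
    """
    if not 0 <= size <= len(labels):
        raise ValueError("size must be between 0 and the number of samples")
    rng = random.Random(seed)
    # Sort the IDs so the result does not depend on dictionary ordering.
    malware = sorted(s for s, is_malware in labels.items() if is_malware)
    benign = sorted(s for s, is_malware in labels.items() if not is_malware)
    n_malware = round(size * len(malware) / len(labels)) if labels else 0
    n_benign = size - n_malware
    return rng.sample(malware, n_malware) + rng.sample(benign, n_benign)


if __name__ == "__main__":
    # Synthetic labels for illustration: 30 malware and 70 benign samples.
    labels = {f"m{i}": True for i in range(30)}
    labels.update({f"b{i}": False for i in range(70)})
    for size in (10, 20, 50):
        subset = fractional_dataset(labels, size, seed=42)
        print(size, sum(labels[s] for s in subset))
```

Varying `size` yields datasets of different sizes, and varying `seed` yields independent draws of the same size for validating hyperparameter choices.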

April 2023

run the calibration again, this time with the potentially improved version of the refinement operator (only if we conclude that we may achieve better results)
prepare fractional datasets targeting a specific category of malware and/or type of PE file
perform calibration experiments on the more focused datasets and validate the hyperparameter values
compare the results of all experiments
assess the overall performance of all the tested versions of DL-Learner
write the remainder of the thesis

May 2023

incorporate the feedback from the supervisor into the thesis

Literature

Rudolph, S. (2011). Foundations of description logics. Reasoning Web International Summer School.
Schreiber, G. & Raimond, Y. (2014). RDF 1.1 Primer. W3C. https://www.w3.org/TR/rdf11-primer/.
Hitzler, P., Krötzsch, M., Parsia, B., Patel-Schneider, P. F., & Rudolph, S. (2009). OWL 2 Web Ontology Language Primer. W3C.
Anderson, H.S. & Roth, P. (2018). EMBER: An open dataset for training static PE malware machine learning models. ArXiv, abs/1804.04637.
Harang, R. & Rudd, E.M. (2020). SOREL-20M: A large scale benchmark dataset for malicious PE detection. ArXiv, abs/2012.07634.
Bühmann, L., Lehmann, J., & Westphal, P. (2016). DL-Learner — A framework for inductive learning on the Semantic Web. Journal of Web Semantics, 39.
Lehmann, J., & Hitzler, P. (2009). Concept learning in description logics using refinement operators. Machine Learning, 78, 203-250.
Lehmann, J., Auer, S., Bühmann, L., & Tramp, S. (2011). Class expression learning for ontology engineering. Journal of Web Semantics, 9, 71-81.
Tran, A.C., Dietrich, J., Guesgen, H.W., & Marsland, S.R. (2012). An Approach to Parallel Class Expression Learning. International Web Rule Symposium.
Tran, A.C., Dietrich, J., Guesgen, H.W., & Marsland, S.R. (2012). Two-way Parallel Class Expression Learning. Asian Conference on Machine Learning.
Švec, P., Balogh, Š., & Homola, M. (2021). Experimental Evaluation of Description Logic Concept Learning Algorithms for Static Malware Detection. ICISSP.
Švec, P., Balogh, Š., Homola, M., & Kľuka, J. (2022). Knowledge-Based Dataset for Training PE Malware Detection Models. ArXiv, abs/2301.00153.

Versions

Theoretical Background - as of April 11, 2023 (Introduction and Learning in Description Logics subject to change)
Current Thesis - as of May 3, 2023
Current Thesis - as of May 10, 2023