Weekly Updates

Week 1: 13.2. - 19.2.

This week we:
  • explored ways to improve the quality of the fractional datasets used for hyperparameter tuning (i.e. sampling methods and measures of representativeness).
    • outcomes:
      • The sample size should probably be increased (from the current 1000 to about 5000-10000).
      • Balancing the number of executables and DLLs (within both the malware and the benign software categories of samples) would lead to less biased learning (a sampling sketch follows this list). Maybe it would be even more beneficial to prepare separate datasets for executables and DLLs.
  • studied and considered various techniques of hyperparameter tuning (grid search, random search, hill climbing, Bayesian optimization).
    • outcomes:
      • We will try to employ Bayesian optimization after closely investigating its applicability to our setting.
  • enriched the ontology with additional information to guide the search (learning algorithm) towards more meaningful class expressions. More specifically, we added:
    • class disjointness assertions,
    • cardinality restrictions on object properties.
  • optimized the learning algorithm by:
    • implementing disjunction and conjunction simplification of refinements based on the laws of absorption, idempotence, domination, and identity (a sketch of these laws follows this list). This makes the refinement operator incomplete, a problem we have so far addressed only partially and will have to tackle later.
    • improving the checks for whether a refinement is equivalent to the bottom concept.
    • narrowing the search space via better configuration (especially by tweaking the startClass and useHasValueConstructor options).
    • adding an option to specify object properties for which cardinality restriction expressions should be considered.
    • changing the way max cardinality restrictions are handled so that only meaningful refinements are generated (leveraging the additional knowledge about object properties in the ontology).
    • correcting the heuristic used during the search.
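
A minimal sketch of the balanced sampling considered above, assuming a hypothetical Sample record (the field names are ours, not the actual schema of our datasets): an equally sized random subset is drawn from each (malware/benign, EXE/DLL) stratum, which balances executables and DLLs within both categories.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.stream.Collectors;

    public class BalancedSampler {

        // Hypothetical record; the fields are our assumption, not the dataset's schema.
        record Sample(String sha256, boolean isMalware, boolean isDll) {}

        // Draws an equally sized random subset from each (malware/benign, EXE/DLL)
        // stratum, so EXEs and DLLs are balanced within both sample categories.
        static List<Sample> balancedSubset(List<Sample> all, int perStratum, long seed) {
            Random rnd = new Random(seed);
            Map<String, List<Sample>> strata = all.stream()
                    .collect(Collectors.groupingBy(s -> s.isMalware() + "/" + s.isDll()));
            List<Sample> subset = new ArrayList<>();
            for (List<Sample> stratum : strata.values()) {
                List<Sample> shuffled = new ArrayList<>(stratum);
                Collections.shuffle(shuffled, rnd);
                subset.addAll(shuffled.subList(0, Math.min(perStratum, shuffled.size())));
            }
            return subset;
        }
    }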
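
The four simplification laws, sketched on a flat conjunction of disjunctions of atomic concepts (a simplified stand-in for DL-Learner's class expressions, not its actual API); an empty result denotes the top concept.

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class CnfSimplifier {

        static final String TOP = "owl:Thing", BOTTOM = "owl:Nothing";

        // Each inner list is one disjunct of the conjunction.
        static Set<Set<String>> simplify(List<List<String>> conjunction) {
            Set<Set<String>> conjuncts = new LinkedHashSet<>();    // idempotence: C and C == C
            for (List<String> disjunction : conjunction) {
                Set<String> d = new LinkedHashSet<>(disjunction);  // idempotence: C or C == C
                if (d.contains(TOP)) continue;  // domination (C or Top == Top), identity (C and Top == C)
                d.remove(BOTTOM);               // identity: C or Bottom == C
                if (d.isEmpty())                // domination: C and Bottom == Bottom
                    return Set.of(Set.of(BOTTOM));
                conjuncts.add(d);
            }
            // Absorption, generalized: C and (C or D) == C, so any disjunct that
            // strictly contains another disjunct is dropped.
            List<Set<String>> snapshot = new ArrayList<>(conjuncts);
            conjuncts.removeIf(d -> snapshot.stream()
                    .anyMatch(o -> o.size() < d.size() && d.containsAll(o)));
            return conjuncts;
        }
    }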

Week 2: 20.2. - 26.2.

This week we:
  • updated the original implementation of the refinement operator by
    • redefining one of the performed operations to ensure that the operator generates downward refinements. More specifically, concepts of the form "max n r.D" are now refined to "max n r.E", where D is subsumed by E (previously, the rule searched for a concept E subsumed by D); see the sketch after this list.
    • introducing refinements of the form "C and D", for a given concept C and any top-concept refinement D, only if C is not itself an intersection (conjunction), to align the implementation with the theoretical definition proposed in the paper.
    • including only the negations of the most specific classes in top refinements, with the same rationale as for the previous change.
  • slightly redesigned the heuristic guiding the search (in the OCEL algorithm) so that it calculates the length bonus in a more meaningful way: first and foremost, cumulatively and with respect to the length metric in use. We also made the following changes to the bonus rules (a sketch follows this list):
    • The bonus for the top concept within an existential quantification or an at-least restriction was set to 1/2 of the class length, whereas the top concept enclosed in a universal quantification and the bottom concept in an at-most restriction are each worth a bonus equal to the class length.
    • Negations are treated the same as before; however, we decided to drop, for now, the bonus intended to offset the length of a data property value constraint, since we do not yet consider these types of concepts in the learning process.
  • improved the implementation of the closed world reasoner methods for checking whether an individual belongs to the extension of a concept and for retrieving all such individuals from the domain of the current interpretation. To be more precise:
    • The reasoner now correctly handles the cases in which the concept in question contains cardinality restrictions.
  • revised another of the four learning algorithms whose capabilities we plan to examine, namely CELOE (up to this point, we had only been dealing with OCEL).
    • outcomes:
      • No changes were required.
  • extended the implementation of OCEL and CELOE to provide a more thorough summary of the solutions (or solution candidates) found, including their test accuracies and the exact times at which they were discovered.
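
A sketch of the corrected refinement direction for "max n r.D" (the names below are ours, not DL-Learner's API). The constructor is antitone in its filler: enlarging the filler D to a superclass E means more r-successors are counted against the bound n, so "max n r.E" is subsumed by "max n r.D" whenever D is subsumed by E, which is exactly what a downward operator needs.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    public class MaxCardinalityRefiner {

        // Stand-in for the reasoner's subsumption check.
        interface Subsumption {
            boolean isSubsumedBy(String sub, String sup);
        }

        // Returns fillers E with D subsumed by E; each yields the downward
        // refinement "max n r.E" of "max n r.D".
        static List<String> downwardFillers(String d, Collection<String> classes, Subsumption reasoner) {
            List<String> fillers = new ArrayList<>();
            for (String e : classes)
                if (!e.equals(d) && reasoner.isSubsumedBy(d, e))
                    fillers.add(e);
            return fillers;
        }
    }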
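
The redesigned bonus rules, sketched on a minimal stand-in expression tree. The types, and the assumption that the accumulated bonus is subtracted from the node's length penalty, are ours; the rule values are the ones listed above.

    public class LengthBonus {

        static final double CLASS_LENGTH = 1.0;  // length of a named class under the metric (assumed value)

        // Tiny stand-in expression tree; only the constructs named in the rules appear.
        interface Expr {}
        record Named(String iri) implements Expr {}
        record Some(String role, Expr filler) implements Expr {}            // existential / at-least
        record Only(String role, Expr filler) implements Expr {}            // universal quantification
        record MaxCard(String role, int n, Expr filler) implements Expr {}  // at-most restriction

        static boolean isTop(Expr e)    { return e instanceof Named c && c.iri().equals("owl:Thing"); }
        static boolean isBottom(Expr e) { return e instanceof Named c && c.iri().equals("owl:Nothing"); }

        // Cumulative: bonuses of all nested subexpressions are summed; we assume the
        // heuristic subtracts the total from the node's length penalty.
        static double bonus(Expr e) {
            if (e instanceof Some s)
                return (isTop(s.filler()) ? CLASS_LENGTH / 2 : 0) + bonus(s.filler());
            if (e instanceof Only o)
                return (isTop(o.filler()) ? CLASS_LENGTH : 0) + bonus(o.filler());
            if (e instanceof MaxCard m)
                return (isBottom(m.filler()) ? CLASS_LENGTH : 0) + bonus(m.filler());
            return 0;
        }
    }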

Week 3: 27.2. - 5.3.

This week we:
  • integrated the two parallel concept learning algorithms we intend to utilize (ParCEL and ParCELEx) into our version of DL-Learner.
  • corrected and improved the implementation of the parallel algorithms by
    • reversing the order of nodes in the search tree, and thus polling the first node instead of the last one when seeking the currently best description (this should result in quicker traversals, since ascending iterators of ConcurrentSkipListSet are, according to the documentation, faster; see the sketch after this list).
    • allowing them to reach refinements of the same length as the original expression.
    • letting them also consider expressions in which disjunctions occur within a universal quantification or a qualified number restriction (we will further investigate the benefits of this change).
  • discovered that each worker has to query its own reasoner when using logical inference to verify subsumption assertions; otherwise, concurrent access to a reasoner's internal variables results in unpredictable behaviour.
  • implemented a concurrent closed world reasoner to serve the multiple workers used by the parallel algorithms more efficiently than multiple instances of the built-in closed world reasoner. More specifically, our concurrent version (a pooling sketch follows this list)
    • materializes the knowledge base only once and shares this materialization among all requests that retrieve the individuals in the extension of a concept or test which individuals belong to that extension.
    • maintains a pool of base reasoners (Pellet) to handle the other requests, mainly those requiring more complex reasoning.
  • added support for some configuration options of the refinement operator when refining upwards (especially those we are interested in).
  • provided the user with more details about the quality of the learned concept by evaluating it on the test set and reporting its accuracy, recall, specificity, FP rate, etc.
  • inspired by OCEL's implementation, accelerated the calculation of accuracy by keeping track of the coverage of the individual descriptions in the search tree and exploiting the fact that the refinement operator proceeds in a downward fashion (see the sketch after this list).
  • implemented a more sophisticated refinement simplification, but eventually decided to discontinue our endeavor to narrow the search space by this means.
  • inspired by the implementation of the parallel algorithms, slightly modified the behavior of OCEL and CELOE so that they ask the refinement operator for longer refinements sooner (the appropriateness of this change will also be tested later).
  • started performing experiments to determine which of the proposed enhancements to include in the final version(s) of DL-Learner.
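
The node-ordering change, illustrated with a toy node type and made-up descriptions (not DL-Learner's classes): reversing the comparator lets the search take the best node with pollFirst(), relying on the ConcurrentSkipListSet Javadoc's note that ascending ordered views are likely to be faster than descending ones.

    import java.util.Comparator;
    import java.util.concurrent.ConcurrentSkipListSet;

    public class SearchTreeOrder {

        record Node(double score, String description) {}

        public static void main(String[] args) {
            // Reversed comparator: the best node sorts first, so the search can use
            // pollFirst() and ascending iteration instead of descending access.
            Comparator<Node> bestFirst =
                    Comparator.comparingDouble(Node::score).reversed()
                              .thenComparing(Node::description);   // tie-breaker keeps nodes distinct
            ConcurrentSkipListSet<Node> tree = new ConcurrentSkipListSet<>(bestFirst);
            tree.add(new Node(0.80, "Exe and (has_section some Section)"));
            tree.add(new Node(0.95, "Exe"));
            System.out.println(tree.pollFirst());  // was: pollLast() under the natural order
        }
    }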
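
The pooling scheme behind the concurrent closed world reasoner, reduced to a generic sketch (the type parameter and method names are ours; the real class wraps the shared materialization plus OWL API/Pellet base reasoners): each request borrows one base reasoner from a blocking queue, so no two workers ever touch the same reasoner's internal state concurrently.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.function.Function;
    import java.util.function.Supplier;

    public class ReasonerPool<R> {

        private final BlockingQueue<R> idle;

        public ReasonerPool(int size, Supplier<R> reasonerFactory) {
            idle = new ArrayBlockingQueue<>(size);
            for (int i = 0; i < size; i++)
                idle.add(reasonerFactory.get());
        }

        // Borrows a base reasoner for a single request; a worker blocks until one
        // is free, which serializes access to each reasoner's internal variables.
        public <T> T withReasoner(Function<R, T> request) throws InterruptedException {
            R reasoner = idle.take();
            try {
                return request.apply(reasoner);
            } finally {
                idle.put(reasoner);
            }
        }
    }

A worker then wraps each complex query in withReasoner(...), while extension and instance-check requests go against the shared one-off materialization.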
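
The coverage-based accuracy speed-up, sketched with string stand-ins for concepts and individuals: because the operator refines downward, a child concept can only cover a subset of its parent's individuals, so instance checks are restricted to the parent's cached coverage.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.function.BiPredicate;

    public class CoveragePropagation {

        // hasInstance is a stand-in for the closed-world instance check.
        static Set<String> childCoverage(String childConcept,
                                         Set<String> parentCoverage,
                                         BiPredicate<String, String> hasInstance) {
            Set<String> covered = new HashSet<>();
            for (String individual : parentCoverage)   // examples the parent misses are never re-tested
                if (hasInstance.test(childConcept, individual))
                    covered.add(individual);
            return covered;
        }

        // Accuracy from the cached coverage: covered positives plus excluded negatives.
        static double accuracy(Set<String> covered, Set<String> positives, Set<String> negatives) {
            long tp = positives.stream().filter(covered::contains).count();
            long tn = negatives.stream().filter(i -> !covered.contains(i)).count();
            return (double) (tp + tn) / (positives.size() + negatives.size());
        }
    }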

Week 4: 6.3. - 12.3.

This week we:
  • concluded that although asking for longer refinements sooner may allow the algorithms to reach more complex refinements earlier, it may also result in an unnecessarily large search tree; we therefore decided not to include this change in the final version of any of the four algorithms.
  • altered the way the information on covered positive and negative examples is stored in each tree node in order to reduce memory usage (a possible representation is sketched after this list).
  • removed the additional information on coverage from the nodes of the search tree generated by CELOE, since the original representation caused the algorithm to stop with an out-of-memory error, whereas the updated one could not deliver the desired speed-up in accuracy calculation.
  • improved the "some-only-constraint" compliance checks, so that at-least restrictions are treated as existential quantifications and expressions containing a universal quantification are no longer included in top-concept refinements.
  • added support for a custom numeric-value splitter used to determine reasonable values for restrictions on numeric data properties (a splitter sketch follows this list).
  • prepared multiple fractional datasets targeted at EXEs and DLLs, respectively, as well as datasets containing a mixture of samples from these two classes.
  • selected the newly introduced features that will be part of our final version of DL-Learner.
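
One compact coverage representation consistent with the change above (our sketch, not necessarily the exact encoding used): examples receive fixed indices once, and each node stores a bit vector instead of a set of individual objects.

    import java.util.BitSet;
    import java.util.List;
    import java.util.Set;

    public class CoverageBits {

        // One bit per example, in a fixed global order of the examples.
        static BitSet encode(List<String> orderedExamples, Set<String> coveredExamples) {
            BitSet bits = new BitSet(orderedExamples.size());
            for (int i = 0; i < orderedExamples.size(); i++)
                if (coveredExamples.contains(orderedExamples.get(i)))
                    bits.set(i);
            return bits;
        }

        // Fast count of covered positives, assuming positives occupy indices [0, numPositives).
        static int coveredPositives(BitSet bits, int numPositives) {
            return bits.get(0, numPositives).cardinality();
        }
    }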
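
A sketch of one plausible splitting scheme for the numeric-value splitter mentioned above (the midpoint strategy is our illustration, not necessarily the splitter we implemented): candidate thresholds are the midpoints between consecutive distinct observed values of a data property, thinned out to at most a given number of splits.

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.TreeSet;

    public class MidpointSplitter {

        static List<Double> splits(Collection<Double> observedValues, int maxSplits) {
            List<Double> sorted = new ArrayList<>(new TreeSet<>(observedValues)); // distinct + sorted
            List<Double> midpoints = new ArrayList<>();
            for (int i = 0; i + 1 < sorted.size(); i++)
                midpoints.add((sorted.get(i) + sorted.get(i + 1)) / 2.0);
            if (midpoints.size() <= maxSplits)
                return midpoints;
            List<Double> thinned = new ArrayList<>();
            for (int i = 0; i < maxSplits; i++)            // pick evenly spread midpoints
                thinned.add(midpoints.get(i * midpoints.size() / maxSplits));
            return thinned;
        }
    }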

Week 5: 13.3. - 19.3.

This week we:
  • further compacted the representation of coverage, so that the search tree finally fits into memory.
  • sped up the retrieval of the numbers of covered positive and negative examples.
  • decided to use a different version of ParCELEx (one of the parallel algorithms) in our experiments, because the one we had been experimenting with before was unable to handle large search trees.
  • improved the ParCEL algorithm (one of the parallel ones) by
    • letting it discard concepts that are unpromising because all the positive examples they cover are already covered by the partial definitions found so far (see the sketch after this list).
    • instructing it to consider a given concept to be a partial definition only if it covers more positive examples than negative ones.
  • improved how we prevent the refinement operator from generating concepts of the form "(r only C) and (r only D)", as they are logically equivalent to "r only (C and D)".
  • rebalanced the datasets containing both EXEs and DLLs. There is now an equal share of EXEs and DLLs among the malware samples as well as among the benign software samples.
  • commenced the calibration process.
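
The two ParCEL rules from above, written out as predicates over example sets (the method names and string stand-ins for individuals are ours).

    import java.util.Set;

    public class ParcelChecks {

        // A candidate is unpromising when every positive it covers is already
        // covered by the partial definitions accepted so far.
        static boolean isUnpromising(Set<String> coveredPositives, Set<String> positivesCoveredSoFar) {
            return positivesCoveredSoFar.containsAll(coveredPositives);
        }

        // The added necessary condition: a concept qualifies as a partial
        // definition only if it covers more positives than negatives.
        static boolean coversMorePositives(Set<String> coveredPositives, Set<String> coveredNegatives) {
            return coveredPositives.size() > coveredNegatives.size();
        }
    }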

Week 6: 20.3. - 26.3.

This week we:
  • finished the calibration process with OCEL and CELOE on the mixed dataset.
  • were working on the chapters dealing with the following (the contents still need to be discussed with the supervisor):
    • introduction,
    • description logics (without exact definitions and examples),
    • concept learning - overview and brief description,
    • PE format and SOREL-20M dataset.

Week 7: 27.3. - 2.4.

This week we:
  • started the calibration process with ParCEL and ParCELEx on the mixed dataset.
  • were working on our paper for ŠVK (we essentially started the week before by writing the chapters mentioned in Week 6).

Week 8: 3.4. - 9.4.

This week we:
  • halted the ParCEL and ParCELEx calibration, waiting for the servers to become fully available.
  • finished our paper for ŠVK.
  • were modifying and extending the ŠVK paper for the needs of our Bachelor's thesis (mainly its theoretical background).

Week 9: 10.4. - 16.4.

This week we:
  • performed experiments with OCEL and CELOE on the mixed datasets prepared for validation.
  • started the calibration process with OCEL and CELOE on the EXE-targeted dataset.
  • added examples to the section on description logics.
  • created an image showing a simplified structure of the PE Malware Ontology.

Week 10: 17.4. - 23.4.

This week we:
  • finished the calibration process with OCEL and CELOE on the EXE-targeted dataset.
  • calibrated OCEL and CELOE on the DLL-targeted dataset.
  • resumed the calibration process with ParCEL and ParCELEx on the mixed dataset.

Week 11: 24.4. - 30.4.

This week we:
  • finished the calibration process with ParCEL and ParCELEx on the mixed dataset.
  • partially validated ParCEL and ParCELEx on the mixed dataset.
  • started the calibration process with ParCEL and ParCELEx on the EXE- and DLL-targeted datasets.
  • extended and reworked the chapters on
    • learning in description logics (definition of the refinement operator, description of algorithms),
    • implementation changes (concurrent closed-world reasoner, accuracy calculation, ParCEL/ParCELEx modifications),
    • experiments (EXE and DLL test cases, ParCEL and ParCELEx configuration).

Week 12: 1.5. - 7.5.

This week we:
  • finished all calibration and validation experiments.
  • essentially finished the chapters on learning in description logics and experiments.
  • started extending and enhancing the chapter on results.

Week 13: 8.5. - 14.5.

This week we:
  • almost finished the chapter on results.
  • wrote the majority of the discussion and conclusion chapters.