- CrowdStrike researchers have developed a next-gen method to train byte-based Transformer blocks that help models “understand” malware files rather than rely on detecting the presence of markers
- During testing, Binary Transformers significantly outperformed traditionally trained models in differentiating between benign and malicious code samples
- The results demonstrate the potential of using Binary Transformers to improve malware detection and classification
In recent years, Transformer models have been the backbone of the revolution within the artificial intelligence sector. They are the basis of large language models (LLMs) and are responsible for LLMs’ ability to understand and generate text of human-like quality. Transformers are able to learn long-range interactions between words and sentences, allowing them to retain high-level concepts and insights from their training data.
CrowdStrike researchers have been working on models that understand malware files in depth instead of merely inspecting them for the presence of certain markers; after all, files are structured in a manner largely similar to that of sentences. This next-generation, first-in-class approach is trained on vast amounts of malware data in its most basic representation, raw bytes, allowing maximum adaptability. This has so far led to exciting results, improving malware classification performance and helping us stay one step ahead of the adversary.
We call this innovative new approach to training “Binary Transformers.” It is the starting point for our wider research on foundational models for malware analysis and detection.
Model Architecture
Inspired by approaches in the literature, we opt for a token-free technique that operates directly on file bytes as input, instead of disassembling files into a string representation and then mapping the resulting tokens through a fixed vocabulary. Focusing on binary information as the most basic representation of files or events reduces complexity and allows us to learn from cases where disassembly is not possible. In addition, when trained on a large enough variety of file types, this approach promises to yield a classifier that can operate across file types.
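As a concrete illustration of what token-free input means in practice, the minimal sketch below (our own simplified example, not the production pipeline) reads a file's raw bytes and uses each byte value directly as a token ID; the padding token and length cutoff are assumptions made for the sketch.

```python
import numpy as np

PAD_ID = 256  # hypothetical padding token, one ID beyond the 256 possible byte values

def bytes_to_tokens(path, max_len=512_000):
    """Load a file and return its raw bytes as a fixed-length array of token IDs."""
    with open(path, "rb") as f:
        raw = f.read(max_len)
    tokens = np.frombuffer(raw, dtype=np.uint8).astype(np.int64)  # byte values 0-255
    if len(tokens) < max_len:
        # Pad short files so every sample has the same length
        tokens = np.pad(tokens, (0, max_len - len(tokens)), constant_values=PAD_ID)
    return tokens
```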
Our model first learns an embedding of these byte values, which is then downsampled along the sequence dimension using a strided convolutional layer. This deliberately exploits locality (i.e., adjacent bytes often carry shared meaning). The output of this layer is fed into a standard attention-based Transformer. Given the binary input data modality, we term this approach Binary Transformer.
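The sketch below shows this front end in PyTorch; the embedding size, convolution stride, vocabulary of 257 IDs (256 byte values plus the padding token above), and encoder depth are illustrative choices rather than the configuration we use in production.

```python
import torch
import torch.nn as nn

class ByteFrontend(nn.Module):
    """Byte embedding followed by a strided 1-D convolution that shortens
    the sequence while exploiting the locality of adjacent bytes."""
    def __init__(self, vocab_size=257, d_model=256, stride=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # The kernel spans `stride` adjacent byte embeddings; the stride
        # shortens the sequence by the same factor.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=stride, stride=stride)

    def forward(self, token_ids):            # (batch, seq_len)
        x = self.embed(token_ids)            # (batch, seq_len, d_model)
        x = self.conv(x.transpose(1, 2))     # (batch, d_model, seq_len // stride)
        return x.transpose(1, 2)             # (batch, seq_len // stride, d_model)

# The reduced sequence is fed into a standard attention-based Transformer encoder.
frontend = ByteFrontend()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=6,
)
tokens = torch.randint(0, 256, (2, 4096))    # two dummy byte sequences
features = encoder(frontend(tokens))         # shape: (2, 512, 256)
```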
With byte values as input, the Binary Transformer operates on long sequences of several hundred thousand elements even for medium-length files. Since the attention mechanism calculates weights for each byte pair, the memory requirement would generally scale quadratically with the number of bytes. A further tweak to the standard Transformer architecture is therefore the use of Shifted Window attention: the byte sequence is split into local areas (“bags”), whose members first attend to each other internally, after which attention weights are computed across bags.
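The sketch below shows one plausible reading of this two-stage scheme: full attention inside each bag, followed by attention across mean-pooled bag summaries. The bag size, pooling choice, and layer sizes are our assumptions for illustration and not the exact Shifted Window implementation used in the model.

```python
import torch
import torch.nn as nn

class BagAttention(nn.Module):
    """Two-stage attention sketch: members of a fixed-size "bag" of adjacent
    positions attend to each other first, then attention is computed across
    bag summaries. Local attention cost scales with bag_size**2 per bag
    rather than seq_len**2 for the whole sequence."""
    def __init__(self, d_model=256, nhead=8, bag_size=128):
        super().__init__()
        self.bag_size = bag_size
        self.local_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):   # (batch, seq_len, d_model); seq_len divisible by bag_size
        b, n, d = x.shape
        bags = x.reshape(b * n // self.bag_size, self.bag_size, d)
        # Stage 1: attention only between members of the same bag.
        local, _ = self.local_attn(bags, bags, bags)
        # Stage 2: attention across bags, using mean-pooled bag summaries.
        summaries = local.mean(dim=1).reshape(b, n // self.bag_size, d)
        across, _ = self.global_attn(summaries, summaries, summaries)
        return local.reshape(b, n, d), across

per_byte, per_bag = BagAttention()(torch.randn(2, 1024, 256))  # dummy reduced byte features
```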
One strategy to further limit the length of the byte sequence is to subsample contiguous file sections at training time. We do this with a file’s inherent structure in mind, to ensure the model is always presented with a representative subset of the file.
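A simplified version of such structure-aware subsampling might look like the following, assuming the section boundaries have already been parsed from the file's own format headers; the window size and one-window-per-section policy are illustrative choices.

```python
import numpy as np

def subsample_sections(tokens, section_bounds, window=16_384, rng=None):
    """Draw one contiguous window of token IDs from each file section so that
    every section is represented in the subsampled sequence.
    `section_bounds` is a list of (start, end) byte offsets taken from the
    file's structure (e.g., its section table)."""
    rng = rng or np.random.default_rng()
    pieces = []
    for start, end in section_bounds:
        length = end - start
        if length <= window:
            pieces.append(tokens[start:end])              # keep short sections whole
        else:
            offset = start + rng.integers(0, length - window)
            pieces.append(tokens[offset:offset + window]) # random contiguous window
    return np.concatenate(pieces)
```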
Large transformer-based model architectures that use byte data directly as input have already been demonstrated successfully for image and audio classification.
Model Performance
We’ve tested our modeling approach on a dataset of benign and malicious shellcode samples. As can be seen from the ROC curve below, our new Binary Transformer-trained model (orange and green lines) considerably outperforms a standard tree-based model trained on the same dataset. Note that the ROC plot is cropped and uses logarithmic scales on both axes.
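For readers who want to reproduce this style of comparison, the sketch below plots ROC curves for two classifiers on logarithmic axes using scikit-learn and matplotlib; the scores are synthetic placeholders standing in for model outputs, not our measured results.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)                  # 0 = benign, 1 = malicious
transformer_scores = labels + rng.normal(0, 0.6, 10_000)  # placeholder model outputs
tree_scores = labels + rng.normal(0, 0.9, 10_000)

for name, scores in [("Binary Transformer", transformer_scores),
                     ("Tree-based baseline", tree_scores)]:
    fpr, tpr, _ = roc_curve(labels, scores)
    plt.plot(fpr, tpr, label=name)

plt.xscale("log")                                         # logarithmic axes, as in the figure
plt.yscale("log")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```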
We achieved this performance with less than a day of training on a single NVIDIA H100 GPU. In comparison, tree-based models usually require a few hours of training on CPU hardware. Where tree-based models usually have a memory footprint of less than 10 megabytes, our Binary Transformer model so far requires several tens of megabytes. With its performance proven through testing, we are now conducting research into model distillation and compression in order to ensure a lightweight sensor application.
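As one example of the direction this work can take, the sketch below shows a generic knowledge-distillation objective that pulls a small student model toward the larger Binary Transformer teacher; the temperature and loss weighting are illustrative, and this is not necessarily the compression approach that will ship in the sensor.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend the usual label loss with a KL term that matches the student's
    softened predictions to the (frozen) teacher's softened predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft
```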