Malware as images

A CNN trained on the Malimg dataset hits 98% accuracy classifying malware into 25 families. Each sample is a grayscale image generated directly from the raw bytes of a Windows PE binary, one pixel per byte, brightness proportional to value. The classifier never reads a single instruction, never parses a PE header, never executes the binary in a sandbox. It looks at a picture and decides what family the malware belongs to. For a red teamer, that should immediately raise a question: if the classifier only sees pixels, what happens when you change the pixels without breaking the payload?

This is entry 21 in the AI Red Teaming series. The previous entry covered training and evaluation, the metrics and methods used to measure how well a model performs. This entry puts that material to work: it applies CNNs to a real security problem and examines the dataset that makes it possible.

From binary to image

The technique behind the Malimg dataset, first proposed by Nataraj et al. at the University of California, Santa Barbara in 2011, is mechanically simple. Take a PE binary, read every byte, and treat each byte as a pixel intensity value. A byte of 0x00 becomes a black pixel, 0xFF becomes white, and everything between maps to the corresponding shade of grey. Arrange these pixels into rows of a fixed width (the original paper chooses the width from the file size), and the binary becomes a grayscale image whose height grows with the file size.
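The conversion described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used to build Malimg: the function name `bytes_to_image`, the default width of 256, and the choice to zero-pad the final row are all assumptions made for the example.

```python
import numpy as np

def bytes_to_image(data: bytes, width: int = 256) -> np.ndarray:
    """Map each byte to one grayscale pixel and wrap the stream into rows of `width`."""
    if not data:
        raise ValueError("cannot convert an empty byte string")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-len(pixels) // width)  # ceiling division
    # Assumption: pad the last, partially filled row with zero bytes (black pixels).
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[:len(pixels)] = pixels
    return padded.reshape(height, width)

# Toy stand-in for the bytes of a PE file; a real sample would be read with open(path, "rb").
sample = bytes(range(256)) * 3 + b"\xff" * 10
img = bytes_to_image(sample, width=64)
print(img.shape)  # (13, 64)
```

The array can be handed to any image library as an 8-bit single-channel image. Doubling the width halves the height, which is why the same binary can look quite different depending on the width chosen.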

The conversion is lossless at native resolution. Every byte in the original binary is encoded in the image, which means the binary can be reconstructed exactly from the image, provided the image has not been resized and any padding added to fill the final row is stripped. But the real value of the technique is not reconstruction. It is that malware binaries from the same family produce images with visibly similar texture patterns. The .text section, the .data section, padding bytes, and encoded resources all create distinct visual regions. Two samples from the Fakerean family, for instance, share recognisable structural patterns that a human analyst can spot and that a CNN can learn to recognise automatically.
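The round trip can be checked directly. The sketch below assumes the original file length is stored alongside the image, so that any zero padding on the final row can be dropped; `image_to_bytes` is a hypothetical helper name, and the input is a toy byte string rather than a real PE file.

```python
import numpy as np

def image_to_bytes(img: np.ndarray, original_length: int) -> bytes:
    """Flatten the image row by row and drop padding added to the final row."""
    return img.astype(np.uint8).tobytes()[:original_length]

data = b"MZ" + bytes(range(200))  # toy stand-in for a binary, 202 bytes long
width = 16
height = -(-len(data) // width)   # ceiling division

# Forward direction: bytes -> zero-padded buffer -> 2D image.
buf = np.zeros(height * width, dtype=np.uint8)
buf[:len(data)] = np.frombuffer(data, dtype=np.uint8)
img = buf.reshape(height, width)

# Reverse direction: the image gives back exactly the bytes that went in.
restored = image_to_bytes(img, len(data))
print(restored == data)
```

The guarantee only holds for the native-resolution image. Once the image is resized to a fixed input shape for a CNN, interpolation blends neighbouring bytes and the original binary can no longer be recovered from it.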

The Malimg dataset contains 9,339 images across 25 malware families. The families range from dialers (Adialer.C) to worms (Allaple.A, Allaple.L) to backdoors and trojan downloaders (Rbot!gen, Swizzor.gen!E). Each family has its own folder, and the number of samples per family varies considerably, which is where the adversarial story begins.

Class distribution is a target map

Downloading and unpacking the dataset is straightforward:

wget https://www.kaggle.com/api/v1/datasets/download/ikrambenabd/malimg-original -O malimg.zip
unzip malimg.zip

The first step before training any classifier is to understand what the data actually looks like. The following code computes and plots the class distribution:

import os
import matplotlib.pyplot as plt
import seaborn as sns

DATA_BASE_PATH = "./malimg_paper_dataset_imgs/"

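# compute class distribution: one sub-directory per family, one image per sample
dist = {}
for mlw_class in sorted(os.listdir(DATA_BASE_PATH)):
    mlw_dir = os.path.join(DATA_BASE_PATH, mlw_class)
    if not os.path.isdir(mlw_dir):
        continue  # skip stray files (archives, READMEs) that are not family folders
    dist[mlw_class] = len(os.listdir(mlw_dir))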

# plot
classes = list(dist.keys())
frequencies = list(dist.values())

plt.figure(figsize=(10, 8))
sns.barplot(y=classes, x=frequencies, orient='h')
plt.title("Malware class distribution")
plt.xlabel("Number of samples")
plt.ylabel("Malware family")
plt.tight_layout()
plt.show()

The resulting plot reveals an imbalanced dataset. Allaple.A and Allaple.L dominate with well over a thousand samples each, while families like Skintrim.N and Wintrim.BX have fewer than a hundred. From a defender’s perspective, this imbalance is a data quality issue that might skew the model towards predicting well-represented classes. From a red teamer’s perspective, it is a map of where the model is weakest.
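The same reading of the plot can be made numerically. The sketch below ranks families by sample count and reports the ratio between the largest and smallest class. The function name, the threshold of 100, and the counts in `toy_counts` are illustrative assumptions, not Malimg figures; in practice the `dist` dictionary built from the dataset folders would be passed in instead.

```python
def summarise_imbalance(counts: dict[str, int], minority_threshold: int = 100) -> dict:
    """Rank classes from rarest to most common and flag those under the threshold."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1])
    largest = max(counts.values())
    smallest = min(counts.values())
    return {
        "ranked": ranked,
        "imbalance_ratio": largest / smallest,
        "minority": [name for name, n in ranked if n < minority_threshold],
    }

# Made-up counts for illustration only.
toy_counts = {"family_a": 1200, "family_b": 300, "family_c": 60}
summary = summarise_imbalance(toy_counts)
print(summary["minority"], summary["imbalance_ratio"])
```

The families at the top of the ranked list are the ones worth examining first, from either side of the engagement.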

Research by Wang et al. (2021) on adversarial training under class imbalance demonstrated that models trained on imbalanced datasets suffer disproportionately poor performance on underrepresented classes, and that this performance gap widens further under adversarial attack. The mechanism is straightforward: the model has seen fewer examples of the minority class, which means its learned decision boundary for that class is less well-defined and easier to cross with a small perturbation. If you are trying to evade a malware classifier and you have a choice of which family your payload resembles, targeting an underrepresented class gives you a thinner boundary to push against.
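One way to see this gap in a trained model is to look at recall per class rather than overall accuracy. The following is a minimal sketch, assuming integer class labels and a list of predictions from some classifier; the labels and predictions shown are invented for the example and do not come from a real model.

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes: int) -> np.ndarray:
    """Fraction of each class's samples that were predicted correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recall = np.full(num_classes, np.nan)  # NaN marks classes absent from y_true
    for c in range(num_classes):
        mask = y_true == c
        if mask.any():
            recall[c] = np.mean(y_pred[mask] == c)
    return recall

# Invented example: class 0 is well represented, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0]

recall = per_class_recall(y_true, y_pred, num_classes=3)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(accuracy, recall)
```

In this toy case the overall accuracy is 0.8 while the two rare classes sit at 0.5 recall each, and both of their errors fall into the majority class. A single headline accuracy figure hides exactly that pattern.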

This is why exploratory data analysis is not a formality. For the red teamer, the class distribution plot is reconnaissance. It tells you which families the model knows well and which ones it is guessing at.

The constraint that makes this interesting

Adversarial attacks against image classifiers in the computer vision domain are well understood at this point in the series. FGSM, PGD, and Carlini-Wagner can all craft perturbations that cause a CNN to misclassify an image while keeping the changes imperceptible to a human. But malware image classification introduces a constraint that does not exist in the standard image domain, and it fundamentally changes the attack surface.

Every pixel in a malware image is a byte in a functional binary. Changing a pixel changes a byte in an executable. If that byte sits in the .text section, you have modified an instruction. If it sits in the PE header, you may have corrupted the file’s ability to load at all. The perturbation must produce an image that the classifier misclassifies and a binary that still executes its intended payload. Grosse et al. (2017) formalised this constraint in their work on adversarial perturbations against deep neural networks for malware classification, demonstrating that the loose condition of visual imperceptibility in computer vision is replaced by the strict condition of functional equivalence.

This is the tension that makes malware image classification a genuinely different adversarial problem from classifying cats and dogs. The attacker cannot perturb freely across the entire input space. They are limited to modifying bytes in regions of the binary that do not affect execution: padding bytes, unused sections, appended data after the last section, or slack space within section alignment boundaries. Suciu et al. (2018) showed that even with these constraints, targeted byte modifications to PE files could evade MalConv, a byte-level CNN classifier, and Koch and Begoli (2024) demonstrated that appending just 10,000 gradient-optimised bytes to a binary achieved a 60% evasion rate against the same architecture.
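The restriction to safe regions can be expressed as a boolean mask over byte offsets. The sketch below is a simplified, FGSM-style sign step applied only to appended bytes. It is not the method of the papers cited above: the gradient here is random noise standing in for a real model gradient, and the helper names and step size are assumptions made for illustration.

```python
import numpy as np

def append_region_mask(original_len: int, appended_len: int) -> np.ndarray:
    """True only for bytes appended after the original binary."""
    mask = np.zeros(original_len + appended_len, dtype=bool)
    mask[original_len:] = True
    return mask

def masked_sign_step(x: np.ndarray, grad: np.ndarray, mask: np.ndarray, step: int = 8) -> np.ndarray:
    """Move byte values along the gradient sign, but only where mask is True."""
    moved = x.astype(np.int16) + step * np.sign(grad).astype(np.int16)
    moved = np.clip(moved, 0, 255).astype(np.uint8)
    return np.where(mask, moved, x)  # bytes outside the mask are left untouched

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=512, dtype=np.uint8)  # stand-in for a binary
appended_len = 128

x = np.concatenate([original, np.zeros(appended_len, dtype=np.uint8)])
grad = rng.standard_normal(x.shape)  # placeholder for a loss gradient from a real model
mask = append_region_mask(len(original), appended_len)

adv = masked_sign_step(x, grad, mask)
print(np.array_equal(adv[:len(original)], original))
```

Whatever the optimiser does, the first 512 bytes come out identical to the input, so this step cannot alter existing code or headers. A working attack would iterate the step with gradients from the target model and confirm that the modified file still loads and runs.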

The image representation makes this visually intuitive. Zero-filled padding and slack regions appear as distinct blocks of black pixels in the grayscale image, and appended data forms its own band at the bottom. These are the regions where an attacker has freedom to manipulate pixel values without risking the binary’s functionality. The structural sections of the image, the densely textured areas corresponding to code and data, are the regions where perturbation is dangerous.
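Candidate regions of this kind can be located by scanning the byte stream for long runs of zeros. The sketch below is a rough heuristic under stated assumptions: `zero_runs` is a hypothetical helper, the minimum run length of 64 is arbitrary, and a run of zeros is only a candidate, since zero-initialised data that the program reads at runtime looks the same.

```python
import numpy as np

def zero_runs(data: bytes, min_len: int = 64) -> list[tuple[int, int]]:
    """Return (offset, length) for each run of zero bytes at least min_len long."""
    arr = np.frombuffer(data, dtype=np.uint8)
    # Pad with False on both sides so runs touching either end are still detected.
    is_zero = np.concatenate(([False], arr == 0, [False]))
    edges = np.flatnonzero(is_zero[1:] != is_zero[:-1])
    starts, ends = edges[::2], edges[1::2]  # edges alternate: run start, run end
    return [(int(s), int(e - s)) for s, e in zip(starts, ends) if e - s >= min_len]

# Toy buffer: 10 non-zero bytes, 100 zeros, 5 non-zero bytes, 20 zeros.
blob = b"\x90" * 10 + b"\x00" * 100 + b"\xcc" * 5 + b"\x00" * 20
print(zero_runs(blob))              # [(10, 100)]
print(zero_runs(blob, min_len=16))  # [(10, 100), (115, 20)]
```

Dividing an offset by the image width gives the row where the corresponding black block starts, which links each run back to what is visible in the picture.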
