Malicious documents often contain images of fake program prompts that are designed to convince a user to perform an action, such as disabling Microsoft Office’s read-only mode (Protected View) and enabling macros. Some forms of trickery are more effective than others, often involving an appeal to a sense of urgency or authority. We see threat actors repeatedly reusing a small collection of cues and deceits in their social engineering images, likely because they have proven effective over time.
Since threat actors often reuse or only slightly tweak the social engineering images in their malware campaigns, they leave visual signatures of their activity. In this article, we describe how to track and detect malware families distributed in campaigns involving visually similar malicious documents using perceptual hash algorithms. We have also released a script, graph_similar_document_images.py, that demonstrates this technique.
Top Social Engineering Images
From a sample of 250 malicious documents detected in 2019, we identified 32 distinct social engineering images. Table 1 shows the products and organisations that were most frequently imitated.
| Social Engineering Image | Variants |
| Generic “protected” document | |
| RSA SecurID | |
Table 1 – Top imitated products and companies in social engineering images.
The most common deceits were claiming that the document was “protected” (13 variants), followed by claiming that the document was incompatible with the version of the software used to open it (10 variants).
Thwarting Non-Comparable Cryptographic Hash Functions
Hashes are one of the most common types of atomic indicators used in threat intelligence. Cryptographic hash functions such as SHA and MD5 are commonly used to provide some level of assurance of data integrity because they are designed to be deterministic, meaning the same input data will result in the same hash value. This property allows these functions to be used to create indicators (i.e. hash values) that identify known malicious files. A second property of these hash functions is that they are intentionally not comparable. Although this is desirable in the design of strong hash functions, it’s an obstacle for tracking social engineering images because the slightest modification to an image will result in a completely different value (this is known as the avalanche effect). Threat actors can deliberately change the hash values of their social engineering images with trivial effort, meaning that tracking the reuse of images using hash values derived from the contents of files isn’t a robust solution.
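The avalanche effect is easy to demonstrate with Python's standard hashlib module. The sketch below uses short illustrative byte strings rather than real JPEG data: changing a single byte of the input produces an entirely different MD5 value, with roughly half of the 128 hash bits flipping.

```python
import hashlib

# Two inputs differing by a single byte, standing in for an image that a
# threat actor has modified by one pixel (illustrative data, not a real JPEG).
original = b"social engineering image bytes"
modified = b"social engineering image byteZ"

md5_a = hashlib.md5(original).hexdigest()
md5_b = hashlib.md5(modified).hexdigest()

print(md5_a)
print(md5_b)

# Count how many of the 128 hash bits flipped due to the one-byte change.
flipped = bin(int(md5_a, 16) ^ int(md5_b, 16)).count("1")
print(f"{flipped} of 128 bits differ")
```

Because the two digests share no usable relationship, there is no way to tell from the hash values alone that the underlying inputs were nearly identical.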
Example – QakBot Campaign, 2019
Figures 1-3 show how a threat actor programmatically modified social engineering images to thwart tracking using non-comparable hash functions in a campaign from October 2019 that delivered QakBot, a credential-stealing worm. The threat actor modified each social engineering image by inserting blue ovals (highlighted in red) in random locations, meaning that the images—and the documents which contained them—generate unique MD5 values (Figure 4).
Figure 1 – JPEG image extracted from a malicious Word document used in a campaign delivering QakBot.
Figure 2 – JPEG extracted from a second malicious Word document used in the same campaign.
Figure 3 – Edited JPEG from a third sample showing inserted ovals (light grey).
Figure 4 – Unique MD5 values of visually similar images used in the campaign.
Although the visual differences between the images are minor, the modifications drastically change the byte-level composition of the files due to how JPEG files are encoded, as shown in Figure 5.
Figure 5 – Byte visualisations of the extracted images from Figures 1 and 2 generated with binvis.io.
Perceptual Hash Algorithms
We can overcome the weaknesses of non-comparable hash functions by instead computing the hash values of social engineering images using perceptual hash algorithms, which are comparable. One of the simplest perceptual hash algorithms is the Average Hash, which produces an 8-byte (64-bit) hash value. To demonstrate how it works, let’s use Johannes Buchner’s Python library, ImageHash, to compute the Average Hash of one of the social engineering images from the QakBot campaign.
>>> from PIL import Image
>>> import imagehash
>>> a = imagehash.average_hash(Image.open('sample (1).jpg'))
To calculate the Average Hash, you first need to prepare an image by resizing it to 8×8 pixels, then reduce the number of colours by converting it to grayscale. Next you calculate the mean colour value of the image. Finally, for each pixel’s colour value you set a bit to 1 if it’s above the mean, and 0 if it’s below the mean. We can print out a grid to show how this works.
>>> print(a.hash)
[[False False False False False False False False]
[False False False False False False False False]
[ True True False False False False False False]
[ True True False True False False True True]
[ True True False True True True True False]
[ True True False True True True False False]
[False True False False False False False False]
[False False False False False False False False]]
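The steps above can be sketched in pure Python. This is a minimal illustration, not the ImageHash implementation: it assumes the resize-to-8×8 and grayscale conversion have already been applied (ImageHash performs those with PIL), so it starts from a plain 8×8 grid of grayscale values, and the toy grid below is invented for demonstration.

```python
def average_hash_bits(pixels):
    """Compute Average Hash bits from an 8x8 grid of grayscale values (0-255)."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the mean, else 0.
    return [1 if p > mean else 0 for p in flat]

def bits_to_hex(bits):
    """Pack the 64 bits into a 16-character hexadecimal string, row by row."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return f"{value:016x}"

# Toy example: a bright 4x4 square in the top-left corner of a dark image.
grid = [[255 if (x < 4 and y < 4) else 0 for x in range(8)] for y in range(8)]
print(bits_to_hex(average_hash_bits(grid)))  # -> f0f0f0f000000000
```

The first four rows each encode as 11110000 (0xf0) because only their left half is brighter than the mean, which is exactly the row-by-row packing described next.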
The hash value 0000c0d3dedc4000 is the hexadecimal representation of each row’s bits from top to bottom. Because each bit corresponds to a region of the image, you can measure how similar a pair of Average Hashes are by computing the distance between the two hashes using a string metric such as the Hamming distance. Let’s compute the hash of another sample from the QakBot campaign and then calculate their distance.
>>> b = imagehash.average_hash(Image.open('sample (2).jpg'))
>>> a - b
0
Despite having different MD5 values, the Average Hash values of the two images match.
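Under the hood, the comparison is just a bitwise XOR followed by a population count. A minimal sketch, operating on hex-encoded hash strings like the one above (the second, one-bit-different hash is hypothetical):

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex-encoded hash values."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

# The hash from this article against itself, and against a hypothetical
# hash that differs in a single bit.
print(hamming_distance("0000c0d3dedc4000", "0000c0d3dedc4000"))  # -> 0
print(hamming_distance("0000c0d3dedc4000", "0000c0d3dedc4001"))  # -> 1
```

A distance of 0 means the images hash identically; small non-zero distances indicate near-duplicates, such as the oval-stamped QakBot images.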
Applying the Approach with graph_similar_document_images.py
We can use perceptual hashing algorithms and string metrics to identify visually similar malicious documents. As part of this research, we’ve released a script called graph_similar_document_images.py that applies this approach to social engineering images as a detection and identification technique.
The script works by first converting documents into Office Open XML (OOXML) format using LibreOffice so that embedded images can be reliably extracted. This step is necessary because most malicious documents we see use the older Compound File Binary File Format (CFBF). The script then extracts any embedded images, computes their Average Hash values, and calculates the distance between each pair of image hashes. Finally, if a pair of hashes meets the similarity threshold (87.5% by default, i.e. at most 8 of the 64 bits differing), the script graphs the images to create an image hash similarity graph, as shown in Figures 6 and 7.
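The core pairwise comparison can be sketched as follows. This is a simplified illustration of the approach rather than the released script's code, and the filenames and hash values are hypothetical; note how the 87.5% similarity threshold translates to a maximum Hamming distance of 8 bits on a 64-bit hash.

```python
from itertools import combinations

def hamming(a, b):
    """Number of differing bits between two hex-encoded hash values."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")

def similar_pairs(hashes, threshold=0.875):
    """Yield pairs of image names whose 64-bit Average Hashes meet the
    similarity threshold; 87.5% allows up to 8 of 64 bits to differ."""
    max_distance = int(64 * (1 - threshold))
    for (name_a, h_a), (name_b, h_b) in combinations(hashes.items(), 2):
        if hamming(h_a, h_b) <= max_distance:
            yield name_a, name_b  # an edge in the similarity graph

# Hypothetical image hashes: two near-duplicates and one unrelated image.
hashes = {
    "sample1.jpg": "0000c0d3dedc4000",
    "sample2.jpg": "0000c0d3dedc4001",  # 1 bit away from sample1
    "sample3.jpg": "ffffffffffffffff",
}
print(list(similar_pairs(hashes)))  # -> [('sample1.jpg', 'sample2.jpg')]
```

Each yielded pair corresponds to an edge in the similarity graph, so clusters of reused social engineering images appear as connected components.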
Figure 6 – Image hash similarity graph generated by graph_similar_document_images.py on a sample set of malicious documents.
Figure 7 – Image hash similarity graph of visually similar social engineering images used in the QakBot campaign.
The script also has a detection mode that identifies images that are visually similar to a blacklist of known-bad image hashes (Figure 8).
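Detect mode boils down to comparing each extracted image hash against known-bad hashes with the same distance threshold. A minimal sketch, assuming a hypothetical blacklist mapping hash values to signature labels:

```python
def hamming(a, b):
    """Number of differing bits between two hex-encoded hash values."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")

# Hypothetical blacklist of known-bad Average Hash values and labels.
blacklist = {
    "0000c0d3dedc4000": "QakBot protected-document lure",
}

def detect(image_hash, blacklist, max_distance=8):
    """Return the label of the closest blacklisted hash within the
    distance threshold, or None if no signature matches."""
    matches = [
        (hamming(image_hash, bad), label)
        for bad, label in blacklist.items()
        if hamming(image_hash, bad) <= max_distance
    ]
    return min(matches)[1] if matches else None

print(detect("0000c0d3dedc4001", blacklist))  # -> QakBot protected-document lure
print(detect("ffffffffffffffff", blacklist))  # -> None
```

Because matching is distance-based rather than exact, randomly perturbed variants like the oval-stamped QakBot images still trigger the signature.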
Figure 8 – CSV results from graph_similar_document_images.py in detect mode showing matches with social engineering image signatures.
If the malware family or families distributed in a campaign involving malicious documents are known, it’s also possible to associate those families with image hashes. As a result, image hashes become useful indicators for tracking malware campaign and family activity (Table 2).
| Image Hash | Associated Malware Family |
Table 2 – Malware families associated with image hashes computed from malicious documents.