Finding Vulnerability Variants at Scale

Filed by Franco Belman (0xFBFBFBFB) on October 15, 2024

While performing a security audit, I discovered a file format vulnerability that took me down an unexpected rabbit hole. The bug was fairly straightforward but what made it interesting was its origin and its variants found across numerous popular projects.

In this post, I’ll guide you through the process used to locate this vulnerability. Then, I’ll explain the method developed to identify its variants at scale in projects such as Chromium, Electron, and WINE among others.

Discovery Phase

jpeg-recompress

During a security audit of a large project, I found a standalone binary named jpeg-recompress. Its usage instructions gave me insight into what it does: the program reads a JPEG file as input and outputs a compressed version of the image, aiming to maintain the image’s visual quality while decreasing its file size.

Figure: jpeg-recompress help message

To obtain more information about jpeg-recompress, I searched GitHub using a small snippet of the program’s help message and found its source.

Figure: GitHub search resulting in jpeg-recompress repo

JPEG-Archive

With access to the source code for jpeg-recompress, I familiarized myself with its features and functionality. By reviewing the repo’s documentation I found that this program is part of a software suite titled JPEG-Archive.

JPEG-Archive contains various utilities used to handle JPEG files. The tool’s abilities include JPEG compression, comparison, and hashing. This project was written using C and depends on the image processing library MozJPEG.

libjpeg, libjpeg-turbo, and MozJPEG

MozJPEG and libjpeg-turbo are forks of the libjpeg library. libjpeg is a widely used image processing library written in C, designed for reading and writing JPEG image files.

Both MozJPEG and libjpeg-turbo enhance libjpeg, but they do so in different ways. Libjpeg-turbo prioritizes speed, optimizing the compression and decompression process. In contrast, MozJPEG focuses on achieving higher visual quality and smaller file sizes after image compression.

MozJPEG and libjpeg-turbo are drop-in replacements for libjpeg, as they all have compatible APIs. This compatibility allows any of the three libraries to be used interchangeably when compiling a program.

Fuzzing Phase

Having access to the jpeg-recompress source code, I compiled it with instrumentation using afl-cc. I then used AFL++ to fuzz the instrumented binary, using JPEG files from the go-fuzz-corpus as input. The first crash occurred within one minute of fuzzing.

Figure: AFL++ actively fuzzing and displaying crash count

To investigate the origin of the crash, I executed jpeg-recompress with the crashing JPEG as input and used GDB for debugging. GDB indicated that the cause of the crash was a segmentation fault.

The resulting stack trace started at the main function, followed by a call to decodeJpeg. Within decodeJpeg, the memcpy function is invoked, which is the last function called before the crash occurs.

Figure: Stack trace of the crash

Triaging Phase

Finding Crash Origin

The function decodeJpeg from the stack trace is located in the file util.c. This function’s purpose is to decompress a JPEG and store its bitmap data in a buffer. On line 119, we see the memcpy call that appeared in the stack trace and caused the crash.

Figure: Function decodeJpeg

Steps Leading to Crash

By examining the code for the decodeJpeg function, we can identify the key steps leading up to the crash:

  1. L104-105: width and height variables are assigned the values of cinfo.output_width and cinfo.output_height. These values coincide with the input JPEG’s dimensions which are defined in its file-header.
  2. L108: row_stride is defined as the product of the image’s width and cinfo.output_components. The value of cinfo.output_components is determined by the color space of the JPEG being processed and can contain one of three values: 1 for grayscale, 3 for RGB, or 4 for CMYK.
  3. L113: row_stride is multiplied by the image’s height to determine the image buffer size, which is then allocated using malloc.
  4. L117-121: We enter a while loop with the aim of populating the image buffer with the contents of the JPEG being processed. This is achieved by using the jpeg_read_scanlines function to write one row of the image’s data into buffer. Then, the contents of buffer are added to the image buffer on each iteration of the loop with memcpy.

Figure: Process in while loop that transfers JPEG’s decompressed data to image buffer

Vulnerability Analysis

By reviewing the process that preceded the crash, two vulnerabilities were identified: an integer overflow and a heap buffer overflow.

Integer Overflow

An integer overflow occurs when the value of an integer exceeds its limit. This causes its value to wrap around and become a very small or negative number.

Line 113 of util.c reveals that the size of the image buffer is generated by multiplying row_stride and height. row_stride is defined by multiplying width and cinfo.output_components.

*image = malloc(row_stride * (*height));

Knowing this, we can redefine the code as:

*image = malloc(cinfo.output_components * (*width) * (*height));

In this scenario we have control over width and height which are retrieved directly from the JPEG’s header. These values are two bytes in size within the header so they can range from 0x0000 to 0xFFFF.
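To make the header fields concrete, here is a minimal sketch of how the two 16-bit dimensions could be read from a JPEG’s start-of-frame segment. This is not code from jpeg-recompress or libjpeg. The names read_be16 and jpeg_header_dimensions are hypothetical, and the parser is deliberately simplified: it only recognizes the baseline and progressive frame markers (0xC0 and 0xC2) and assumes every segment before them carries a length field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reads a big-endian 16-bit value, the byte order JPEG uses for header fields. */
static uint16_t read_be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: walks the marker segments that follow the SOI marker
 * until it reaches a start-of-frame segment, then extracts the dimensions.
 * Segment layout: 0xFF, marker, length (2 bytes), precision (1 byte),
 * height (2 bytes), width (2 bytes), component count (1 byte).
 */
bool jpeg_header_dimensions(const uint8_t *data, size_t len,
                            uint16_t *width, uint16_t *height) {
    if (len < 4 || data[0] != 0xFF || data[1] != 0xD8) {
        return false; /* missing SOI marker */
    }
    size_t pos = 2;
    while (pos + 4 <= len) {
        if (data[pos] != 0xFF) {
            return false; /* not positioned on a marker */
        }
        uint8_t marker = data[pos + 1];
        uint16_t seg_len = read_be16(data + pos + 2);
        if (seg_len < 2 || pos + 2 + (size_t)seg_len > len) {
            return false; /* truncated or malformed segment */
        }
        if (marker == 0xC0 || marker == 0xC2) {
            if (seg_len < 7) {
                return false; /* too short to hold both dimensions */
            }
            *height = read_be16(data + pos + 5);
            *width = read_be16(data + pos + 7);
            return true;
        }
        pos += 2 + (size_t)seg_len; /* skip to the next marker */
    }
    return false;
}
```

Because both fields are plain 16-bit values taken from the file, whoever crafts the JPEG can set them to anything between 0x0000 and 0xFFFF.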

We can manipulate the width and height values found in the image’s header and set them to large values. When these values are multiplied with cinfo.output_components to compute the image buffer’s size, their product can exceed the maximum size of a 32-bit unsigned integer.

The integer overflow will cause the value of the integer passed to malloc to wrap around and be unexpectedly small. The overflow will result in a buffer that is too small for the contents of the image being processed.
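The wrap-around can be illustrated with a small sketch. It assumes, as described above, that the size computation is carried out in 32 bits. The function names are hypothetical, and the truncation is modeled with unsigned arithmetic so the behavior is well defined in C (the types in the real code may differ).

```c
#include <stdint.h>

/* Size the image really needs, computed in 64 bits so it cannot wrap here. */
uint64_t true_image_size(uint16_t width, uint16_t height, uint8_t components) {
    return (uint64_t)width * height * components;
}

/* The same product truncated to 32 bits, modeling the wrapped value
 * that would reach malloc under the 32-bit assumption. */
uint32_t wrapped_image_size(uint16_t width, uint16_t height, uint8_t components) {
    return (uint32_t)true_image_size(width, height, components);
}
```

For example, a CMYK image (4 components) with a width of 0x8000 and a height of 0x8001 needs 4,295,098,368 bytes, but the 32-bit product wraps to 131,072 bytes, which is exactly one row of the image.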

Heap-based Buffer Overflow

A heap overflow occurs when a program writes data beyond the boundaries of a heap buffer, leading to the overwriting of adjacent data.

Vulnerable Code:

while (cinfo.output_scanline < cinfo.output_height) {
    (void) jpeg_read_scanlines(&cinfo, buffer, 1);
    memcpy((void *)((*image) + row_stride * row), buffer[0], row_stride);
    row++;
}

This while-loop is meant to fill the image buffer with the contents of the JPEG being processed. Each iteration of the loop adds a line of the JPEG’s data to the image buffer. Normally, the size of the image buffer correlates with the dimensions of the JPEG.

By abusing the integer overflow described above, we can set the size of the image buffer to be smaller than it’s supposed to be. By making the image buffer too small for the contents of the JPEG, we can overflow it, causing data adjacent to the buffer to be overwritten and the program to crash.

Figure: Crash in decompression process due to tampered image header
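As a rough sketch of how far the copy loop overshoots, the helpers below compare the wrapped allocation size with the number of bytes the loop tries to write. The function names are hypothetical and the model assumes a 32-bit allocation size. In practice the process crashes long before the loop finishes.

```c
#include <stdint.h>

/* Number of whole rows that fit inside an allocation of alloc_size bytes. */
uint64_t rows_that_fit(uint32_t alloc_size, uint32_t row_stride) {
    return row_stride == 0 ? 0 : (uint64_t)alloc_size / row_stride;
}

/* Bytes the loop would write past the end of the allocation if it copied
 * every row of the image (row_stride bytes per row, height rows). */
uint64_t bytes_written_out_of_bounds(uint32_t alloc_size, uint32_t row_stride,
                                     uint16_t height) {
    uint64_t total = (uint64_t)row_stride * height;
    return total > alloc_size ? total - alloc_size : 0;
}
```

With a row_stride of 131,072 bytes, a height of 0x8001 rows, and a wrapped allocation of 131,072 bytes, only the first row fits; every later memcpy lands outside the buffer.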

Documentation Review

With a more comprehensive understanding of the vulnerability, I searched through libjpeg’s official examples and documentation for an explanation as to why the library was misused.

The file example.c provides a walkthrough on how the decompression process should be performed. On line 379 a function titled put_scanline_someplace can be found. This leaves it up to the developer to create their own process for storing the JPEG’s data since put_scanline_someplace is a mock function that is not defined.

Figure: libjpeg documentation

The documentation lacks guidance on how to create the buffer to store decompressed image data. This likely led to the use of attacker-controlled values when calculating the image buffer size in jpeg-recompress. The result was a straightforward combination of vulnerabilities: an integer overflow that caused the improper creation of a buffer, which subsequently overflowed.
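For contrast, here is a minimal sketch of what an overflow-aware buffer size computation could look like. It is not taken from libjpeg’s documentation or from any project’s patch; checked_image_size and alloc_image_buffer are hypothetical names, and the approach shown (validating each multiplication against SIZE_MAX before performing it) is only one of several reasonable options.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Computes width * height * components into *out.
 * Returns false if any dimension is zero or the product would overflow size_t. */
bool checked_image_size(size_t width, size_t height, size_t components,
                        size_t *out) {
    if (width == 0 || height == 0 || components == 0) {
        return false;
    }
    if (width > SIZE_MAX / components) {
        return false; /* row stride would overflow */
    }
    size_t row_stride = width * components;
    if (height > SIZE_MAX / row_stride) {
        return false; /* total size would overflow */
    }
    *out = row_stride * height;
    return true;
}

/* Allocates the image buffer only when the size is known not to wrap.
 * Returns NULL on overflow or allocation failure. */
unsigned char *alloc_image_buffer(size_t width, size_t height,
                                  size_t components) {
    size_t size;
    if (!checked_image_size(width, height, components, &size)) {
        return NULL;
    }
    return malloc(size);
}
```

Doing the arithmetic in size_t matters here: on a platform with a 64-bit size_t, the product of two 16-bit dimensions and a component count of at most 4 cannot overflow, whereas the same computation in a 32-bit type can.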

Finding Variants

Vulnerability variant analysis is the process of identifying different versions of a vulnerability and assessing their context to understand how each variant can be exploited.

The lack of guidance in the official documentation for libjpeg piqued my interest and led me to hypothesize that these documentation gaps might have caused the vulnerability. If my theory was correct, other projects that relied on libjpeg or one of its forks could also be affected.

Having a clear understanding of the vulnerability in the decompression process, I began looking for variants of it in other projects. I targeted repos that perform JPEG decompression with one of the three API-compatible libraries (libjpeg, libjpeg-turbo, and MozJPEG) and that may have implemented it incorrectly.

To identify variants of the vulnerability, I employed both automated and manual approaches. For automation, I utilized CodeQL’s variant analysis capabilities. For manual searching, I leveraged Sourcegraph’s code search functionality.

CodeQL

CodeQL is a powerful static analysis tool developed by GitHub for identifying security vulnerabilities, code quality issues, and other types of bugs in codebases. It allows developers to query their code using a query language based on logic programming.

CodeQL treats code as data, enabling users to write custom queries to explore patterns, track data flows, and detect issues across large codebases. We use it along with other tools to identify variants of vulnerabilities in our vulnerability research process at Blackwing.

Sourcegraph

Sourcegraph is a code search utility that lets users search for specific code patterns in repos hosted not only on GitHub but also on GitLab and Bitbucket, among other platforms.

The goal for the automated process was to cover as much code as possible in search of the pattern that led to the vulnerability. I developed a process that allowed me to scan hundreds of projects simultaneously for this pattern.

BigQuery

The first task was finding repos that use libjpeg or one of its forks and perform JPEG decompression. To accomplish this I used Google’s BigQuery.

BigQuery is a serverless data storage service accessible via Google Cloud Platform (GCP). It offers precompiled datasets that can be queried with GoogleSQL. One of these datasets, bigquery-public-data.github_repos, contains data for more than 2.8 million GitHub repos, including information about each repo’s commits, files, and file contents.

I wrote a GoogleSQL query to retrieve every repo in the dataset that invokes the function jpeg_start_decompress. If a repo includes this function, it indicates that the project is using one of the target libraries and is performing JPEG decompression. This makes it a prime candidate to search for the vulnerable pattern that was identified.

The query successfully narrowed down the 2.8 million repos in the dataset to the 10,440 that contained the libraries of interest.

SELECT files.repo_name
FROM `bigquery-public-data.github_repos.contents` AS file_contents
JOIN `bigquery-public-data.github_repos.files` AS files
ON file_contents.id = files.id
WHERE file_contents.content LIKE '%jpeg_start_decompress%'
GROUP BY files.repo_name;

GoogleSQL query used to find relevant repos

Figure: Results of GoogleSQL query

GitHub API

Some GitHub repos have CodeQL databases available to enhance security analysis and code quality. This enables other developers and security researchers to run their own CodeQL queries against the code without needing to set up CodeQL themselves.

My next task was identifying which of the 10,440 repos had an available CodeQL database. To accomplish this, I wrote a bash script that took the list of repos found with BigQuery as input and used GitHub’s API to check whether each one had a CodeQL database. Through this process I identified 831 repos with a publicly available CodeQL database.

#!/bin/bash

GH_API_TOKEN=""
REPO_LIST=""
OUTPUT_FILE=""
LANGUAGES=("cpp")

TOTAL_LINES=$(wc -l < "$REPO_LIST")
CURRENT_LINE=0

while IFS= read -r REPO
do
   ((CURRENT_LINE++))
   echo "Processing repo $CURRENT_LINE/$TOTAL_LINES: $REPO"
  
   # Iterate over languages we're interested in
   for LANGUAGE in "${LANGUAGES[@]}"
   do
       # Send API request to verify existence of repo's CodeQL database
       RESPONSE=$(curl -s -L \
           -H "Accept: application/vnd.github+json" \
           -H "Authorization: Bearer $GH_API_TOKEN" \
           -H "X-GitHub-Api-Version: 2022-11-28" \
              "https://api.github.com/repos/$REPO/code-scanning/codeql/databases/$LANGUAGE")
      
       # If CodeQL database is found then print out the response and save it to a file
       if ! echo "$RESPONSE" | grep -q -e "No database" -e "Not Found" -e "repo access blocked" -e "repo was archived"; then
           echo "$REPO has a CodeQL database available" | tee -a "$OUTPUT_FILE"
           echo "$RESPONSE" | tee -a "$OUTPUT_FILE"
       else
           echo "Filtered out response for repo $REPO with language $LANGUAGE"
       fi
   done
   echo
done < "$REPO_LIST"

Bash script used to find repos with a CodeQL database

CodeQL Query

Lastly, I wrote a CodeQL query that identifies the code pattern that led to the vulnerability in the JPEG decompression process.

// This CodeQL query detects an integer overflow that
// leads to a heap buffer overflow within projects
// that implement libjpeg or one of its forks
import cpp
import semmle.code.cpp.dataflow.DataFlow

from
  MulExpr mult_1, MulExpr mult_2, FunctionCall allocation, Function function,
  DataFlow::Node source1, DataFlow::Node source2, DataFlow::Node source3, DataFlow::Node productNode
where

// Find two variables that are multiplied together
// They should have "output_" or "image_" as part of the name to match either
// output_height, output_width, image_height, or image_width
exists(Expr val1, Expr val2 |
  DataFlow::localFlow(DataFlow::exprNode(val1), DataFlow::exprNode(mult_1.getAnOperand())) and
  DataFlow::localFlow(DataFlow::exprNode(val2), DataFlow::exprNode(mult_1.getAnOperand())) and
  source1.asExpr() = val1 and
  source2.asExpr() = val2 and
  source1 != source2 and
  (mult_1.getAnOperand().(VariableAccess).getTarget().getName().matches("%output_%") or
  mult_1.getAnOperand().(VariableAccess).getTarget().getName().matches("%image_%") )
) and

// Find a third variable that is multiplied with the last two
exists(Expr val3 |
  DataFlow::localFlow(DataFlow::exprNode(val3), DataFlow::exprNode(mult_2.getAnOperand())) and
  DataFlow::localFlow(DataFlow::exprNode(mult_1), DataFlow::exprNode(mult_2.getAnOperand())) and
  source3.asExpr() = val3 and
  source3 != source1 and
  source3 != source2
) and

// Make sure the product of the three variables ends up in a memory
// allocation function or a vector resize function call
productNode.asExpr() = mult_2 and
(
  allocation.getTarget().getName().matches("%alloc%")
  or
  allocation.getTarget().getName().matches("%esize%")
) and
DataFlow::localFlow(productNode, DataFlow::exprNode(allocation.getArgument(0))) and

// By detecting the existence of "jpeg_start_decompress"
// we can confirm that the program is going through the
// decompression process we found the vulnerability in
function.getName() = "jpeg_start_decompress"

// Select the allocation function call and its location 
// in code for the results of this query
select allocation, allocation.getLocation().toString()

Multi-repo Variant Analysis

Equipped with a list of repositories containing CodeQL databases and a specific CodeQL query, I began the scanning process. I utilized CodeQL’s multi-repo variant analysis feature to scan all the databases at once using the query.

After analyzing the repositories with available CodeQL databases, 104 instances of the code pattern that led to the vulnerability were identified.

Although the automated scanning process using CodeQL’s variant analysis is efficient in finding specific code patterns, there were only a limited number of repos that had an available CodeQL database. Additionally, I couldn’t search for these patterns in repos stored outside of GitHub.

For the manual search process, I aimed to find the vulnerability in projects the automated process couldn’t reach. This included code repositories without a CodeQL database and those hosted on other source management sites like GitLab and Bitbucket. I used Sourcegraph for this.

The approach for using Sourcegraph was simple: I searched for key portions of the vulnerable code pattern. For example, I searched for instances where cinfo.output_width and cinfo.output_height were being multiplied.

I also searched for distinctive function calls used during the JPEG decompression process such as jpeg_start_decompress, jpeg_read_scanlines, and jpeg_finish_decompress.

Although the process of identifying code patterns through string matching is somewhat fragile, the JPEG processing libraries I targeted all share similar naming conventions and implementations. This consistency allowed me to successfully locate the pattern associated with the decompression vulnerability in various projects.

Figure: Sourcegraph usage example

Conclusion

My work didn’t end with identifying the pattern that led to the vulnerability across various repositories. Reviewing the context in which the pattern existed was still necessary to confirm if the projects were vulnerable.

After reviewing the various repos, over 40 projects were discovered that had variants of a file format vulnerability. Affected software included web browsers, operating systems, and popular image processing libraries. The findings have been reported to the corresponding software vendors and I’m currently working with them to mitigate these issues.

By sharing the methodology for identifying vulnerabilities and their variants at scale, I hope to enable other security researchers to enhance their own audits and contribute to a more secure software ecosystem. A seemingly simple vulnerability can have a much greater impact when its variants are considered, revealing potential risks that might otherwise go unnoticed.

Findings

The following is a list of vulnerable projects that were discovered. Each vulnerability has been reported to the corresponding vendor.

The list isn’t comprehensive. It only includes the projects that have a public report, have been patched, or have exceeded Blackwing Intelligence’s 90-day vulnerability disclosure policy.

Blackwing publishes security advisories in our Advisory Database in accordance with our disclosure policy. Once published, advisories for each vulnerability identified during this research will be linked in the findings table below.

Software             | Report Date | Patch Date       | Report
Chromium             | 3/21/24     | 3/27/24          | Chromium Issue
glmark2              | 4/7/24      | -                | -
openMVG              | 4/7/24      | -                | -
Solar2D Game Engine  | 4/7/24      | -                | -
GTK Radiant          | 4/7/24      | -                | -
CImg                 | 4/7/24      | 7/5/24           | GitHub Issue
Kodi                 | 4/7/24      | 4/8/24           | GitHub Issue
Google’s Squoosh     | 4/7/24      | -                | Google Issue
C++ Tango            | 4/8/24      | -                | GitLab Issue
Cocos2D-X            | 4/8/24      | -                | -
DLib                 | 4/8/24      | 4/9/24           | -
Irrlicht             | 4/8/24      | -                | -
NVIDIA-Texture-Tools | 4/23/24     | -                | -
GNU-Step             | 4/23/24     | -                | -
ARTOS                | 4/23/24     | -                | -
COVISE               | 4/23/24     | -                | -
ViZDoom              | 4/23/24     | -                | -
Nintendont           | 4/23/24     | -                | -
Raptor               | 4/23/24     | -                | -
SatDump              | 4/23/24     | 5/29/24          | -
Tachyon              | 4/23/24     | -                | -
Lightspark           | 4/23/24     | 4/24/24          | GitHub Issue
ArtoolkitX           | 4/23/24     | 4/29/24          | -
MiniGUI              | 4/23/24     | -                | GitHub Issue
Jpeg Archive         | 4/23/24     | -                | -
Kiwi Browser         | 4/23/24     | 4/24/24          | Release Notes
WINE                 | 4/24/24     | 5/13/24          | WINE Issue
React OS             | 4/24/24     | -                | Jira Issue
Cocos Engine         | 4/24/24     | -                | GitHub Issue
Electron             | 5/4/24      | 9/15/24          | CVE-2024-46993
NW.js                | 5/25/24     | -                | GitHub Issue
Ravyn OS             | 7/3/24      | -                | GitHub Issue
Samsung ONE          | 7/3/24      | Attempt: 7/12/24 | GitHub Issue
Rive C++             | 7/3/24      | 7/12/24          | -
