
From One Thousand Pages of Specification to Unveiling Hidden Bugs: Large Language Model Assisted Fuzzing of Matter IoT Devices

Xiaoyue Ma, Lannan Luo, and Qiang Zeng
George Mason University

https://www.usenix.org/conference/usenixsecurity24/presentation/ma-xiaoyue

This paper is included in the Proceedings of the 33rd USENIX Security Symposium, August 14–16, 2024, Philadelphia, PA, USA. ISBN 978-1-939133-44-1. Open access to the Proceedings of the 33rd USENIX Security Symposium is sponsored by USENIX.

Abstract

Matter is an IoT connectivity standard backed by over two hundred companies. Since the release of its specification in October 2022, numerous IoT devices have become Matter-compatible. Identifying bugs and vulnerabilities in Matter devices is thus an emerging important problem. This paper introduces mGPTFuzz, the first Matter fuzzer in the literature. Our approach harnesses the extensive and detailed information within the Matter specification to guide the generation of test inputs. However, due to the sheer volume of the Matter specification, surpassing one thousand pages, manually converting human-readable content to machine-readable information is tedious, time-consuming and error-prone. To overcome this challenge, we leverage a large language model to successfully automate the conversion process. mGPTFuzz conducts stateful analysis, which generates message sequences to uncover bugs that would be challenging to discover otherwise. The evaluation involves 23 various Matter devices and discovers 147 new bugs, with three CVEs assigned. In comparison, a state-of-the-art IoT fuzzer finds zero bugs from these devices.

1 Introduction

Matter is an open, royalty-free IoT connectivity standard, which is endorsed by over two hundred companies, including Apple, Google, Amazon, and Samsung [6]. It aims to establish a unified standard [66], facilitating interoperability among smart devices of different vendors. With the support of many large technology companies, this unified standard is expected to completely change the IoT ecology [59].

Since the release of the Matter specification in October 2022, Amazon has turned 100 million Echo devices into Matter-compatible ones through a software update [4]. Google has updated billions of smart devices to support Matter [58]. Apple TV and HomePod devices extend support for Matter through the installation of iOS 16.1 [28]. Given the popularity, discovering bugs and vulnerabilities in Matter devices is an emerging important problem.

To find IoT bugs, significant efforts have been dedicated to emulating IoT firmware to facilitate greybox or whitebox fuzzing [11, 16, 37, 76, 77]. Nevertheless, emulating IoT firmware remains challenging due to the extensive array of custom and proprietary hardware components, and constructing a precise emulator is complex and difficult [51, 69]. Consequently, blackbox fuzzing emerges as an attractive option and has demonstrated noteworthy results [12, 22, 35, 51].

To conduct blackbox fuzzing for Matter, our observation is that the Matter specification contains extensive and detailed information regarding device behaviors, such as the valid and invalid values for each parameter in a command, the expected effect, and the response message after a command is executed. It is a promising approach to make use of the rich information in a specification for test input generation. To build this approach to Matter fuzzing, however, multiple challenges arise.

Challenge 1 (C1): Sheer volume of specification. The Matter specification describes a unified connection standard that facilitates interoperability of various devices across vendors. It is not surprising that the specification contains thorough details, contained in 1,258 pages. Manually converting the large amount of human-readable content to machine-readable information is tedious, time-consuming and error-prone.

Challenge 2 (C2): Stateful bugs. Many IoT commands only make sense when the device is at a specific state. For example, a command turning off a light has an effect only when the light is on. A bug that can only be triggered when the device is at a certain state is called a stateful bug [7]. Prior IoT blackbox fuzzers, such as IoTFuzzer [12] and SNIPUZZ [22], usually ignore the impact of device states for bug finding. How to obtain the complete state space of IoT devices in the application layer is not studied in prior work.

Challenge 3 (C3): Non-crash bugs. One well-known difficulty in IoT fuzzing is that, in general, one cannot modify the firmware and then install the altered firmware onto the device, as most IoT devices enforce code-signature checking [31]. Thus, it is infeasible to collect the program execution information inside a device, such as branch coverage, path conditions, and function return values. While existing fuzzers use side channels, e.g., network connection, to infer whether a test input has triggered a crash bug, how to discover non-crash bugs, such as logic errors [9], is a known challenge.

Challenge 4 (C4): Command coverage. Ideally, a fuzzer should test all the commands of a device. However, manually comprehending the user manual of a device to derive the list of commands tends to be imprecise and unscalable. Prior work, SNIPUZZ [22], collects testing scripts disclosed by IoT vendors, but they are incomplete and very few vendors provide them (detailed in Section 4.3).

Our work has overcome all the challenges and built the first Matter fuzzer, named mGPTFuzz, in the open literature. To address C1 and C2, we employ a large language model (LLM) to transform human-readable content in the specification to machine-readable information. Furthermore, the machine-readable information, including the device states, commands and their relations, is represented in finite-state machines (FSMs), where each node represents a device state and each edge a transition due to command execution. By iterating over the FSMs, our fuzzer performs systematic stateful analysis, effectively identifying stateful bugs.

To handle C3, we propose to leverage command semantics. For instance, if the execution of a command is expected to modify specific device attributes as per the specification, our fuzzer queries the corresponding attributes. On the other hand, if a command is meant to be rejected, an error message is expected. Precise extraction of semantic information concerning commands ensures the efficacy of this method.

Finally, to tackle C4, we are inspired by prior work, HubFuzzer [35], which makes use of an IoT hub to test ZigBee/Z-Wave devices. Our observation is that there is a special role in Matter, the controller, which can add (formally, commission), manage and control Matter devices. Notably, while the certificate of an ordinary Matter device undergoes stringent verification by the controller, a Matter device does not verify the controller equally. Consequently, even an uncertified vendor can create a controller [48]. We thus propose to build our fuzzer within a controller. According to the specification, when a controller adds a device, the device declares all supported commands. Without relying on the device's manual or testing scripts, our fuzzer can extract the supported commands of a device under test from pairing messages.

We implement mGPTFuzz (https://iot-fuzz.github.io) and conduct an extensive evaluation, which involves 23 various Matter devices. It has revealed 147 new bugs (61 zero-day vulnerabilities), including 5 crash bugs and 142 non-crash bugs. Three CVEs have been assigned: CVE-2023-42189, CVE-2023-45955, CVE-2023-45956. In comparison, a state-of-the-art blackbox fuzzer, SNIPUZZ [22], finds zero bugs from these devices.

mGPTFuzz can be used by IoT vendors that cannot afford a security testing team, third-party security analysts, as well as companies and organizations seeking to assess the security of their Matter devices before placing trust in them.

This work makes the following contributions.

• We present the first Matter fuzzer in the literature. It can help IoT vendors, security researchers and numerous companies and organizations identify bugs and vulnerabilities from Matter devices.

• The detailed Matter specification enables a unique research opportunity but also imposes a challenge. We harness an LLM to extract information from the sheer volume of the specification, showcasing the effectiveness of LLM-assisted fuzzing on a large scale.

• mGPTFuzz is a blackbox IoT fuzzer that is able to uncover non-crash bugs in the application layer. In addition, semantic information is leveraged to conduct systematic stateful fuzzing, unveiling stateful bugs.

• We employ a controller-based fuzzing architecture. This design eliminates the need for reverse engineering any companion apps or collecting API-testing scripts, and can derive a complete list of the supported commands of a device under test.

• We implement mGPTFuzz and test 23 various Matter devices. The evaluation finds 147 new bugs (including 61 zero-day vulnerabilities), comprising 5 crash bugs and 142 non-crash bugs, with three CVEs assigned.

2 Related Work

2.1 Large Language Model Assisted Fuzzing

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Very recently, researchers have started to leverage LLMs for fuzzing. For example, TitanFuzz [17] and FuzzGPT [18] use LLMs to generate code for testing deep learning libraries. Fuzz4All [68] further demonstrates that code of different languages can be generated using LLMs for testing a variety of compilers and libraries. LLM4Fuzz [55] employs LLMs to guide fuzzing of smart contracts. ChatAFL [36] utilizes LLMs to process Request for Comments (RFCs) and generate test inputs. ChatFuzz [26] leverages LLMs to mutate seeds in order to generate format-conforming inputs.

All these works study LLM-assisted greybox fuzzing, assuming that the code under test can be instrumented, while our work is the first that studies LLM-assisted blackbox fuzzing. Without instrumenting the code (i.e., IoT firmware), how to infer the program execution status to guide fuzzing is a unique challenge resolved in this work. In addition, while these works are mostly focused on finding crash bugs, our work also finds non-crash bugs in a principled way. This stems from our approach that extracts meticulous information per command and utilizes it to verify the correctness of each command execution. Finally, this work demonstrates the efficacy of LLM-assisted fuzzing in an important domain—Internet of Things.

2.2 Fuzzing of IoT Firmware

Static analysis of IoT firmware for finding bugs [3, 21, 33, 38, 53, 60, 61, 78, 80] tends to report many false positives. Emulating firmware can support whitebox or greybox fuzzing, which dynamically finds IoT vulnerabilities [11, 16, 20, 34, 37, 52, 69, 77]. Despite extensive efforts [32, 74, 75, 79], precisely emulating IoT firmware remains a challenging task [51, 69, 73]. Worse, many vendors do not provide firmware images publicly and make it difficult to extract unencrypted firmware from devices [12, 22, 51].

Recently, blackbox fuzzing of IoT firmware demonstrates noteworthy results and becomes an attractive approach [12, 22, 35, 51]. For example, IoTFuzzer [12] and DIANE [51] reverse engineer the companion app of a device under test. For each IoT device, they analyze the companion app to locate and modify the code corresponding to IoT functionalities. As strong obfuscation techniques are widely available [8, 71, 72], this requires enormous per-device efforts for analyzing and modifying obfuscated apps [8, 19], and the analysis may be incomplete in identifying IoT functionalities [22, 35]. Furthermore, a Matter device, in general, cannot be controlled by the vendor's companion app.

SNIPUZZ [22] first collects API-testing scripts disclosed by IoT vendors and then captures network messages when running these scripts. The network messages are then mutated to conduct fuzzing. It saves the effort for reverse engineering companion apps. However, it has multiple other limitations. First, very few IoT vendors disclose their API-testing scripts (Section 4.3). Second, even when a vendor discloses API-testing scripts, they usually do not cover all the supported commands of a device. Third, SNIPUZZ fails if network messages are encrypted, while most IoT devices use encryption for secure communication [12].

HubFuzzer [35] uses a hub to issue test messages to IoT devices, and is limited to fuzzing ZigBee and Z-Wave devices. It inspires our controller-based fuzzing architecture.

BrakTooth [25] mainly tests the data link and network layers of Bluetooth Classic, while Matter is a standard focused on the application layer. Unlike BrakTooth, which infers state machines about Bluetooth Classic from exchanged packets, we extract state machines from the specification (Section 4.1).

In sum, mGPTFuzz is the first fuzzer for Matter, an important IoT standard. It does not need to emulate the firmware, reverse engineer companion apps, or collect API-testing scripts.

2.3 Specification-Guided Fuzzing

Specification-guided fuzzing [29, 46, 50] uses manually prepared machine-readable information (or a generator) about a protocol to generate test messages. However, this process is tedious, incomplete and error-prone for non-trivial protocols [47]. We leverage LLMs to obtain the specification information, eliminating the need for excessive manual efforts.

3 Background

3.1 Matter

[Figure 1: Protocol stack with Matter. Matter sits above TCP/UDP over IPv6; IPv6 runs over WiFi, Thread, or Ethernet; BLE is used for commissioning.]

Figure 1 shows the protocol stack with Matter [65]. Matter is built on top of the IP layer and uses it as a common layer for communicating with IP-based networks, such as WiFi, Ethernet, and Thread.

A special device, called a controller, such as a smart speaker, can add (formally, commission) and manage Matter devices. To add a device, the controller (or a trusted commissioner, such as the companion app of the controller) first verifies the device's attestation certificate to make sure it is signed by a vetted vendor. Then, an operational certificate is signed by the controller and sent to the device. Here, the controller serves as a root certification authority (CA). Matter devices that share the same root CA form a Matter fabric, and a controller can control all the Matter devices in its fabric. Matter devices in the same fabric conduct secure P2P communication with each other through their operational certificates.

A Matter device in a fabric is a node, which has a number of endpoints, each representing a specific functional unit. For example, a smart bulb usually has an endpoint about lighting. An endpoint has one or more clusters, where a cluster is a group of related functionalities.

A cluster contains attributes and commands. (1) An attribute represents a device state. For example, the On/Off cluster has an OnOff attribute that indicates the on/off state of the device. An attribute has a data type, such as boolean, integer, and string, and may be read-only or read-write. (2) A command represents an action that may be performed by a device. For example, the On/Off cluster provides an On() command, which sets the OnOff attribute to "1".

3.2 Large Language Models

Emerging Large Language Models (LLMs) have demonstrated impressive performance on a variety of tasks. These LLMs are pre-trained on billions of tokens of available text. Due to the extensive training data, LLMs can be directly employed for specific downstream tasks without undergoing fine-tuning on specialized datasets [10]. This is achieved through prompt engineering [64], wherein a task description, along with a few task demonstrations, is presented to the LLM. Researchers have shown that the paradigm of directly leveraging language models through prompts can already attain state-of-the-art performance on downstream tasks [27, 57, 63, 70].

Since the Matter specification is written in a natural language, LLMs pre-trained on extensive corpora should be capable of processing the specification. We thus leverage an LLM to extract information from the Matter specification. While the approach of mGPTFuzz is general for different LLMs, we utilize the well-known LLM GPT-4 [44].

"This command (KeySetRemove) SHALL fail with an INVALID_COMMAND status code back to the initiator if the GroupKeySetID being removed is 0, which is the Key Set associated with the Identity Protection Key (IPK)."

Figure 2: A specification passage ignored by developers.

4 Overview

4.1 Motivation of Using LLM

To extract information from the Matter specification, a straightforward approach is to manually read it carefully and draw all the FSMs. However, we use an LLM for the process because of the following reasons.

• To save much tedious manual effort. The specification spans 1,258 pages. While the part describing clusters has 589 pages, the remaining is not useless. For example, it covers data types and their ranges, as well as definitions of the symbols used in the cluster description. Such knowledge is first extracted and then used for subsequent interactions with the LLM for cluster-specific queries.

• To avoid overlooking important information. Manually extracting information from the specification will likely neglect important information. For example, Matter SDK developers omitted the information, illustrated in Figure 2, that the GroupKeySetID = 0 should not be deleted (CVE-2023-42189). The many non-crash bugs we found also show that developers omit information.

• To cope with the quick evolution of the standard. Since Version 1.0 was released in October 2022, three new versions have been published (V1.1 in May 2023, V1.2 in October 2023, and V1.3 in May 2024). The automation of knowledge-base extraction can accelerate the update of the fuzzer.

An alternative approach is to extract FSMs from code [23, 25, 56]. However, first, it is incomplete; for example, the Matter SDK provides a framework for developing an IoT device, but it does not stipulate all the details. Second, it may contain bugs, e.g., due to handling parameter value ranges incorrectly. In sum, the approach does not provide a "standard" that a Matter device should adhere to. We thus choose to extract information from the specification.

Finally, LLM-assisted fuzzing is promising but not mature yet. We encounter multiple questions which, to our knowledge, are not documented in prior LLM-assisted fuzzing work. For example, how to deal with a large text that exceeds the token limit; how to generate a complex FSM; how to provide non-trivial context for subsequent queries. Such experiences can help other research in this direction.

4.2 Threat Model

System Under Attack and Assumptions. We consider a smart environment (such as a warehouse, office, or home) that contains various Matter devices. They interact through user-defined automation rules. We assume the devices have vulnerabilities like those discussed in our work.

Attacker Model. We consider two types of attackers. (1) A user in the smart environment, such as a warehouse, can make use of Matter devices under his/her control to launch attacks. (2) An attacker has no authorized control over any Matter devices in the target environment, but can make use of vulnerabilities in commissioning [39, 54] to add a malicious device into the target Matter fabric. The attacker then makes use of the device(s) under his/her control to send exploit commands to vulnerable devices. The goal is to drive a device into a state desired by the attacker, which facilitates subsequent exploitation, such as unauthorized access and burglaries. For example, an attacker can crash a light to cause poor illuminance for surveillance recordings. As another example, given the automation rule "when temp > 80°F, open the window", by crashing the thermostat, a non-vulnerable smart window may open. Similar attacks are stressed in the literature [13–15].

Furthermore, without involving attacks, bugs of critical devices, such as heaters, locks, valves, and smoke detectors, can pose hazards to smart homes and their residents [24].

4.3 Limitations of a SOTA IoT Fuzzer

SNIPUZZ [22] represents a state-of-the-art approach to blackbox IoT fuzzing. Below, we discuss why SNIPUZZ does not work well for analyzing Matter devices.

Manually Collecting Testing Programs. SNIPUZZ needs to manually collect API-testing scripts for each device under test, while only a few vendors disclose them. For example, among the 23 devices involved in our evaluation, only 6 have their API-testing scripts publicly available.

Low Command Coverage. Even for devices that have API-testing scripts available, these scripts typically only cover a small portion of commands supported by the devices. Figure 3 shows the comparison between the number of commands covered by SNIPUZZ and that by our approach.
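The "commands" counted here are rooted in the Matter data model of Section 3.1: a node contains endpoints, an endpoint contains clusters, and a cluster groups attributes (device state) and commands (actions). A minimal illustrative sketch in Python (our own types for exposition, not the Matter SDK API):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Illustrative model of the Matter hierarchy: node -> endpoints -> clusters,
# where a cluster groups attributes (state) and commands (actions).

@dataclass
class Cluster:
    cid: int                          # 2-byte cluster identifier, e.g. 0x0006 for On/Off
    attributes: Dict[str, object]     # attribute name -> current value
    commands: Dict[str, Callable] = field(default_factory=dict)

    def invoke(self, command: str):
        return self.commands[command](self)

@dataclass
class Endpoint:
    clusters: List[Cluster]

@dataclass
class Node:
    endpoints: List[Endpoint]

def make_onoff_cluster() -> Cluster:
    # Per Section 3.1, the On() command sets the OnOff attribute to 1.
    c = Cluster(cid=0x0006, attributes={"OnOff": 0})
    c.commands = {
        "On": lambda cl: cl.attributes.__setitem__("OnOff", 1),
        "Off": lambda cl: cl.attributes.__setitem__("OnOff", 0),
        "Toggle": lambda cl: cl.attributes.__setitem__("OnOff", 1 - cl.attributes["OnOff"]),
    }
    return c

# A smart bulb as a node with one endpoint carrying the On/Off cluster.
bulb = Node(endpoints=[Endpoint(clusters=[make_onoff_cluster()])])
onoff = bulb.endpoints[0].clusters[0]
onoff.invoke("On")
assert onoff.attributes["OnOff"] == 1
```

Enumerating every command of every cluster on every endpoint of such a node is exactly the coverage goal that API-testing scripts, in the paper's measurement, fail to reach.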



[Figure 3: a bar chart comparing, per device, the number of commands covered by SNIPUZZ vs. our approach (x-axis 0 to 200) for 23 devices: SONOFF Switch, Linkind Dimmer Switch, TP-Link Light Switch, Kasa Light Switch, Vuytret Plug, Tuo Contact Sensor, Tuo Button, Philip Hue Hub, Yeelight Cube, Yale Locker, EVE Switch, Aqara Sensor, Linkind Bulb, Govee Light Strip, Onvis Plug, Sengled Smart bulb, Eve Door Sensor, Eve Motion Sensor, Switchbot Hub2, Nanoleaf Light Strip, Orein Smart Lighting, Tapo Plug, and Kasa Plug.]

Figure 3: Commands covered by SNIPUZZ vs. mGPTFuzz. Only 6 out of 23 Matter devices (Kasa Plug, Tapo Plug, Nanoleaf Light Strip, Switchbot Hub2, Govee Light Strip, and SONOFF Switch) disclosed their API-testing scripts. For the remaining devices, we enhance SNIPUZZ by considering a device as an abstract one and counting the commands of an abstract device as being covered by SNIPUZZ. For example, a smart switch is abstracted into a binary switch, supporting the on() and off() commands.

Neglecting the Rich Information in the Specification. SNIPUZZ does not make use of the rich information in specifications for fuzzing, and is unable to detect stateful or non-crash bugs.

Cannot Handle Encrypted Messages. SNIPUZZ mutates test inputs by modifying the collected network messages. The approach fails for the encrypted communication used by Matter.

4.4 Goals and Ideas

We aim to build a Matter fuzzer with the following features: (1) No need to collect API-testing scripts or reverse engineer companion apps. (2) High command coverage. (3) Making use of the specification information to guide fuzzing. (4) Working with encrypted communication protocols that support Matter, such as WiFi, Ethernet and Thread. Below, we present the insights and ideas for constructing these features.

First, as a Matter device can be configured to control another, we initially attempted to build our fuzzer into a custom Matter device. However, the custom device cannot obtain a legitimate attestation certificate signed by a vetted vendor. Our observation is that the certificate of a controller is not checked, and thus a custom controller can be built, integrating our fuzzer. This way, we can use a controller to test a device, without relying on API-testing scripts or companion apps.

Second, according to the Matter specification, when a device is added by the controller, it announces the device types along with the supported commands and attributes in the setting-up messages. This way, we can obtain a complete list of the supported commands and attributes from the setting-up messages. Hence, high command coverage can be attained.

Third, the Matter specification contains extensive details regarding the command parameter types, value ranges, the expected response of a command, and the state change due to a command. Given the relatively new Matter implementation and the meticulous specification, it is highly improbable that developers have thoroughly digested the specification and adhered to all the details when writing the code. Thus, it is a promising and valuable idea to check the Matter implementation against the specification. Given the lengthy specification, we leverage a pre-trained large language model to convert the human-readable content to machine-readable information, which guides the fuzzing.

Fourth, we do not mutate network messages for fuzzing, but modify the code of a controller to generate test messages in plaintext, which are then encrypted and sent to the device. Furthermore, we configure a Thread border router in the computer that runs our custom controller. This way, the controller can test Thread devices, as well as WiFi and Ethernet devices.

4.5 System Architecture

[Figure 4: Architecture of mGPTFuzz. A custom Matter Controller exchanges test messages and responses with the device; a Functionality Extractor parses the setting-up messages; GPT-4, driven by prompts over the Matter Specification, produces a Knowledge Base; a Fuzzing Mutator, guided by Fuzzing Policies, generates test messages; a Device State Monitor observes the device.]

Figure 4 shows the architecture of mGPTFuzz. It contains the following main components. (1) A custom Matter Controller commissions Matter devices, sends test messages to them, and receives responses. (2) When a Matter device is commissioned, it generates a sequence of setting-up messages. From these messages, the Functionality Extractor component learns the functionalities of the device, such as the supported commands and attributes (Section 5.1). (3) An LLM is leveraged, through prompt engineering, to convert the Matter specification to a Knowledge Base (Section 5.2). (4) According to our rich Fuzzing Policies (Section 5.3), the Fuzzing Mutator generates test messages (Section 5.4). (5) The Device State Monitor monitors the IoT device to capture bugs and vulnerabilities, and the results are used to further guide the fuzzing (Section 5.5).
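Putting the components of Figure 4 together, the stateful fuzzing loop can be sketched as follows. This is a hypothetical illustration (the FSM layout, the device stub, and all function names are ours, not mGPTFuzz's code): it walks an On/Off FSM derived from the specification, checks the spec-mandated state transition after each command, and flags a non-crash bug when an out-of-range argument is not rejected with INVALID_COMMAND.

```python
# Sketch of FSM-driven stateful fuzzing: nodes are device states, edges are
# commands with their expected target state, per the specification.
ONOFF_FSM = {
    "off": {"On": "on", "Toggle": "on"},
    "on":  {"Off": "off", "Toggle": "off"},
}

def fuzz_stateful(device, fsm, invalid_inputs):
    bugs = []
    for state, edges in fsm.items():
        for command, expected_state in edges.items():
            device.force_state(state)             # drive device to the source state
            device.send(command)
            if device.state() != expected_state:  # spec says the state must change
                bugs.append((state, command, "wrong state transition"))
            for bad in invalid_inputs:
                device.force_state(state)
                resp = device.send(command, arg=bad)   # out-of-range argument
                if resp != "INVALID_COMMAND":          # spec says it must be rejected
                    bugs.append((state, command, f"accepted invalid arg {bad!r}"))
    return bugs

class BuggyLight:
    """Stub device that forgets to validate arguments (a seeded non-crash bug)."""
    def __init__(self): self._on = False
    def force_state(self, s): self._on = (s == "on")
    def state(self): return "on" if self._on else "off"
    def send(self, command, arg=None):
        if arg is not None:
            return "SUCCESS"          # bug: invalid argument is not rejected
        self._on = {"On": True, "Off": False, "Toggle": not self._on}[command]
        return "SUCCESS"

bugs = fuzz_stateful(BuggyLight(), ONOFF_FSM, invalid_inputs=[256])
assert any("invalid" in b[2] for b in bugs)
```

In the real system, the Device State Monitor plays the role of `device.state()` by querying attributes after each command, and the Fuzzing Mutator supplies the valid/invalid argument values from the knowledge base.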


5 Design of mGPTFuzz over the FSMs, our fuzzer performs systematical fuzzing to
detect bugs in devices.
5.1 Learning Functionality of Matter Devices
5.2.1 Information Extraction
There are two parts in the Matter specification: Matter Core
Specification [2] and Matter Application Cluster Specifica- We design prompt engineering to extract information from
tion [1]. The former provides information about the foun- the Matter specification. However, it is known that LLMs can
dational clusters (such as the group key management and be creative and may make up information in their responses.
network diagnose cluster) for establishing and maintaining Worse, given the same prompts, LLMs may produce different
communications. The latter provides information about the outputs across interactions. Thus, a challenge arises: How
application clusters, detailing how devices interact via specific to extract accurate and stable information via LLMs? To
application data and commands. overcome this challenge, we employ three methods.
A cluster represents a group of related functionalities and First, the temperature in an LLM is a parameter that con-
has a unique 2-byte cluster identifier (CID). For instance, trols the randomness of the LLM’s output [43]. A higher
in the Matter Core Specification, the Access Control cluster temperature results in more creative and imaginative text,
(with the CID = 0x001F) sets the rules for managing the while a lower one results in more factual and stable text. We
access control list of a device. As another example, in the aim to obtain factual information extracted from the Matter
Matter Application Cluster Specification, the Level Control specification; thus, temperature=0 is employed. This setting
cluster (with the CID = 0x0008) allows for the regulation of a ensures that the LLM strictly adheres to the factual nature
device’s physical quantity level, such as the brightness of a of the source material for extracting knowledge, providing
bulb or the extension length of a blind. A list of all available stable and consistent information across different queries.
clusters can be found in the two Matter specifications.

Extracting Supported Clusters of Devices. A sequence of setting-up messages is generated when a Matter device connects to a controller, which contains rich information about the device, including the device ID, manufacturer code, and supported clusters. Based on the reported information and according to the two Matter specifications, we can learn the functionalities supported by the device, and determine (1) which commands can control this device, and (2) which attributes are supported in the device. As an example, from the setting-up messages of the Kasa Plug device, we learn that the device contains two endpoints, where endpoint 0 includes 10 clusters, and endpoint 1 includes 5 clusters.

5.2 Learning Knowledge Base via LLM

The Matter specification provides a comprehensive description of commands and attributes, including the data type of each argument for every command, the value range of each argument, as well as how these commands and attributes mutually influence each other. To support fuzz testing, we need to extract critical information from the specification and convert the large amount of human-readable content into machine-readable information.

We employ an LLM to convert human-readable content from the specification into machine-readable information. We first extract information related to commands and attributes from the specification using the LLM (Section 5.2.1), and then query the LLM to represent the information as FSMs (Section 5.2.2). In an FSM, each node represents a device state and each edge represents a transition triggered by a command. The entire process of generating the knowledge base takes roughly 15 minutes for all clusters. By iterating

Second, we employ in-context few-shot learning [57] to ensure the information extracted by LLMs is accurate and follows the specified output format. In-context few-shot learning is an effective strategy for improving model accuracy by augmenting the context with a small number of examples that illustrate desired inputs and outputs. This approach enriches the context for LLMs, enabling them to better understand the syntax of the prompt, recognize output patterns, and accurately extract information. By employing this technique, we guide the LLM with examples to accurately extract useful information in the desired format.

Third, we employ self-consistency checks [62] to refine and validate the generated responses, ensuring the reliability of the results. Even with the methods above employed, the model may still output answers that contain some stochastic information, although such instances are rare. We engage in multiple conversations with the LLM and consider the majority of consistent answers as the final results.

Due to the token limit of GPT-4, we cannot feed the whole specification to it. We notice that each cluster corresponds to one chapter in the specification. We thus segment the cluster-description part of the specification into multiple pieces, one for each cluster. However, prior to the extension of the token limit in November 2023, one long cluster, DoorLock, spanning 67 pages, exceeded the token limit. Thus, we further segment the content of the DoorLock cluster and query the information from each segment one by one. Afterwards, we concatenate the responses.²

Prompts. There are two types of datatypes: (1) base datatypes, such as uint, int, and bool, and (2) derived datatypes, which are derived from the base datatypes. The base datatypes are

² Since November 2023, the token limit has increased to 128,000 tokens, which corresponds to around 96,000 words or 192 single-spaced pages.

4788 33rd USENIX Security Symposium USENIX Association
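The segmentation and self-consistency steps above can be sketched as follows. This is a minimal illustration, not mGPTFuzz's released code: `ask` is a stand-in for an actual GPT-4 API call, and the helper names and the line-based chapter-splitting heuristic are our own assumptions.

```python
import json
from collections import Counter

def split_by_cluster(spec_text, cluster_names):
    """Split the OCR'd specification text into per-cluster segments,
    assuming each cluster chapter starts with the cluster name on its own line."""
    segments, current = {}, None
    for line in spec_text.splitlines():
        if line.strip() in cluster_names:
            current = line.strip()
            segments[current] = []
        elif current is not None:
            segments[current].append(line)
    return {name: "\n".join(body) for name, body in segments.items()}

def self_consistent_query(ask, prompt, rounds=3):
    """Issue the same prompt several times and keep the majority answer.
    Answers are JSON strings; they are canonicalized (sorted keys) so that
    semantically identical answers compare equal."""
    canonical = [json.dumps(json.loads(ask(prompt)), sort_keys=True)
                 for _ in range(rounds)]
    majority, _ = Counter(canonical).most_common(1)[0]
    return json.loads(majority)

# Simulated LLM: two rounds agree; one contains a stochastic extra command.
replies = iter([
    '{"Commands": ["On", "Off"], "Attributes": ["OnOff"]}',
    '{"Attributes": ["OnOff"], "Commands": ["On", "Off"]}',   # same content, reordered keys
    '{"Commands": ["On", "Off", "Toggle2"], "Attributes": ["OnOff"]}',
])
result = self_consistent_query(lambda p: next(replies), "List the OnOff commands")
print(result["Commands"])  # ['On', 'Off']
```

With temperature=0 most rounds agree; the majority vote only matters in the rare stochastic cases described above.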


Prompt for Base Datatype

Please provide responses to the question, and it is imperative that your responses be strictly based on the text provided below.

[Base Datatype Text].

The query is as follows:
List the data types in the text and their value ranges in JSON format.

Figure 5: Prompt for querying base datatypes.

shared across all clusters; thus, we only need to make one query for all the 26 base datatypes, rather than a query for each cluster. Figure 5 shows the prompt that queries the base datatypes and their corresponding value ranges.

Figure 6 shows the prompt template for querying information per cluster, consisting of the Cluster Text, Queries, and an Example Output. There are 67 clusters in total. Each cluster is queried separately, and the prompt generation is automated by assembling the prompt template using scripts. Given a cluster, the Cluster Text is converted from its chapter in the specification. Specifically, we use the Optical Character Recognition tool [49] to convert the PDF-formatted specification into text.

Prompt Template for Extracting Cluster Information

Please provide responses to the questions in the specified order and format.

[Cluster Text].

Queries are as follows:
1. Derived Datatypes and Corresponding Value Ranges
2. List of Commands
3. List of Attributes
4. List of Command IDs
5. List of Attribute IDs

Here is an Example Output in JSON format.

[Example Output].

Figure 6: Prompt template for querying information of a cluster. It is simplified for the sake of presentation. Appendix A provides a verbatim example of assembled prompts.

Derived datatypes only appear in certain clusters. Thus, for each cluster, we query the derived datatypes and the corresponding value ranges (Query 1 in Figure 6). Given a cluster, we also need to know its commands and attributes, the data types and value ranges for each command's argument and each attribute, as well as their IDs (Queries 2-5 in Figure 6).

Responses. Figure 7 shows an example response for the OnOff cluster, which includes five pieces of information (corresponding to the five queries in the prompt). Specifically, there is one derived datatype, six commands, and five attributes for this cluster. For each command, its arguments, the corresponding datatypes, and the command ID are extracted. For each attribute, its datatype and ID are also extracted. In particular, StartUpOnOffEnum is a derived datatype in the OnOff cluster, serving as the datatype of the StartUpOnOff attribute.

Model Response

"OnOff": {
1. "Derived DataTypes and Value Ranges": {
    "StartUpOnOffEnum": {"values": [0, 1, 2]}
},
2. "Commands": {
    "Off": [], "On": [], "Toggle": [],
    "OffWithEffect": ["EffectIdentifier": "uint8", "EffectVariant": "uint8"],
    "OnWithRecallGlobalScene": [],
    "OnWithTimedOff": ["OnOffControl": "map8", "OnTime": "uint16", "OffWaitTime": "uint16"]
},
3. "Attributes": {
    "OnOff": "bool", "GlobalSceneControl": "bool",
    "OnTime": "uint16", "OffWaitTime": "uint16",
    "StartUpOnOff": "StartUpOnOffEnum"
},
4. "Command IDs": ["0x00", "0x01", "0x02", "0x40", "0x41", "0x42"],
5. "Attribute IDs": ["0x0000", "0x4000", "0x4001", "0x4002", "0x4003"]
}

Figure 7: Model response for the OnOff cluster.
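The per-cluster prompt assembly described above can be sketched as below. The template mirrors Figure 6; the function name and the string-formatting plumbing are our own illustration, not mGPTFuzz's released scripts.

```python
# Template mirroring Figure 6; {cluster_text} and {example_output}
# are filled in per cluster by a script.
CLUSTER_PROMPT = """Please provide responses to the questions in the specified order and format.

{cluster_text}

Queries are as follows:
1. Derived Datatypes and Corresponding Value Ranges
2. List of Commands
3. List of Attributes
4. List of Command IDs
5. List of Attribute IDs

Here is an Example Output in JSON format.

{example_output}"""

def build_cluster_prompt(cluster_text: str, example_output: str) -> str:
    """Fill the two placeholders of the template for one cluster."""
    return CLUSTER_PROMPT.format(cluster_text=cluster_text,
                                 example_output=example_output)

prompt = build_cluster_prompt("OnOff cluster chapter text (OCR'd from the PDF)",
                              '{"OnOff": {"Commands": ["On", "Off", "Toggle"]}}')
print("Queries are as follows:" in prompt)  # True
```

Running one such assembled prompt per cluster (67 in total) keeps each query well under the model's token limit, except for the oversized DoorLock chapter discussed earlier.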

5.2.2 FSM Generation

An FSM is a tuple (Q, Σ, ∆, δ), where Q denotes a finite set of states, Σ represents the initial state, ∆ represents the destination state, and δ stands for the commands that can map Σ to ∆. Given a cluster, we make a query for each command, and generate an FSM specific to that command. After that, all FSMs are combined to form a comprehensive FSM representing the entire cluster.

Figure 8 shows the prompt template designed to query useful information of a command (specified using the command ID) to generate FSMs. In this process, we employ in-context few-shot learning, as illustrated in the Shot 1 section of Figure 8. Moreover, we provide the datatype knowledge about the base datatypes (extracted by the prompt for base datatypes in Figure 5) and the derived datatypes related to the cluster (extracted by Query 1 in Figure 6). This enables the LLM to accurately understand the data ranges in the second step of the Chain-of-Thought process. In addition, we leverage Chain-of-Thought prompting [63] to ensure the precision of the information extracted by LLMs (see the Chain-of-Thought part in Figure 8). Chain-of-Thought prompting involves structuring prompts to guide LLMs through a series of logical steps,


similar to a human thought process, to arrive at the desired output. This technique proves especially effective in complex situations where the straightforward question-to-answer format may not produce comprehensive results. Moreover, we leverage self-consistency checks [62] to enhance the reliability of the responses.

We observed that even when employing methods like setting temperature=0 to suppress randomness and creativity, there are still rare situations where 1) the LLM may produce a response whose format does not align with Shot 1, and 2) the LLM fabricates information when the provided Cluster Text does not contain information relevant to a query. We attribute the first issue to the complexity of the output. To ensure the LLM response follows the desired format, we add the Desired Output Format section at the end of the prompt. To address the second issue, we emphasize not making up information, as shown in the last sentence of the first paragraph in Figure 8.

Prompt Template for Generating an FSM

You will be assigned the role of a Software Testing Assistant and will receive a portion of the protocol specification text related to the cluster [Name]. Your primary task is to prepare the Finite State Machine (FSM) test cases for software black-box testing. It is imperative that your responses be strictly based on the text provided. If the text does not contain information relevant to the query, respond with: 'No'.

Chain-of-Thought:
1. Extract the initial and destination states as mentioned in the "Effect" section in the provided cluster text.
2. Extract the value range of each command's argument to transfer from the initial state to the destination state. Please refer to the content marked as "Datatype Knowledge" regarding datatypes and their value ranges.
3. Extract the invalid values of each command's argument and error messages, if specified in the cluster text.
4. Extract unaccepted values of each state.

[Cluster Text]

[Datatype Knowledge]

According to the text, extract the FSM information for the command with the ID = [CommandID].

Shot 1:
[Place an example of generating an FSM here.]

Desired Output Format:
[Here is the desired output format for the FSM.]

Figure 8: Prompt template for generating an FSM.

Through these methods of designing the prompt, we are able to effectively obtain the FSM information. After generating FSMs for each command within a cluster, we merge all FSMs into a comprehensive FSM representing this cluster. There are 52 FSMs generated, with a total of 521 states and 522 transitions. Note that 15 clusters do not have any command, and the involved attributes are read-only (e.g., the BooleanState cluster has only one read-only attribute and zero commands). For the 52 FSMs, the number of states is in the range [1, 46], and the number of edges is in the range [1, 50]. The most complex FSM is for the DoorLock cluster, which contains 46 states and 50 edges.

Example. Figure 9 shows part of the generated FSM, achieved by merging multiple FSMs corresponding to all commands within the LevelControl cluster. This example contains 10 states and 13 edges. Each edge encompasses detailed information on the state transition process, including the command name and the possible value and data type of each argument.

Verifying the Quality of FSMs. Besides self-consistency checks with multiple queries, we also manually validate the quality of the FSMs. We first randomly sample 100 out of the 522 transitions. Three authors spent a total of 9 hours manually and independently checking the accuracy of the information described in the 100 transitions. We confirm that all the information is accurate. We then pick a cluster, LevelControl, and check whether all the transitions described in the specification are covered by the FSM; the result is positive. The checking demonstrates that the LLM is able to accurately extract the FSM information.

5.3 Fuzzing Policies

Our fuzzer iterates over the FSMs to generate test inputs. The following policies are used.

Policy 1: For each FSM edge, we (a) change the argument values to the values specified by the edge; (b) if the valid argument value is a range, provide extreme values (such as the min and max of the valid range); (c) provide random valid values excluding the extreme values. Moreover, for each command, we (d) change the length of a string-type argument, trying to trigger buffer overflows; (e) provide empty values to strings to trigger uninitialized reads or null pointer dereferences; and (f) provide NULL or only one element to arrays, sets, or bags to cause null pointer dereferences or out-of-bounds access.

Policy 2: Changing Argument Types. Given an argument supposedly with the data type t, we change its type to a randomly selected one t′. For example, for an argument with the String type, we change its type to the integer type by replacing a string value with an integer value, to check whether the device can handle the special "string".

Policy 3: Changing the Number of Arguments. For a command requiring n arguments, we provide n + 1, n − 1, or 0 arguments.

Policy 4: Trying Unsupported Clusters and Commands. Be-



[Figure 9 diagram: a graph of 10 states and 13 edges for the LevelControl cluster. Each state records the CurrentLevel and OnOff attribute values; each edge is a command such as MoveToLevel, Move, Step, MoveWithOnOff, MoveToLevelWithOnOff, StepWithOnOff, Write_Attri_OnLevel, or Write_Attri_StartupCurrentLevel, annotated with the data type (e.g., uint8, uint16) and value constraints of each argument.]
Figure 9: The generated FSM for the LevelControl cluster.

sides the supported clusters, we also randomly select a few unsupported clusters. For each command in a selected unsupported cluster, we generate test messages following the command definition. Through this, we check whether unexpected commands can cause the device to crash.

5.4 Constructing Test Messages

To build a test message, given a command, a straightforward way is to invoke the API in the controller that issues the command. However, such APIs contain various input sanitization. Consequently, invalid test messages cannot be constructed. To resolve this issue, our solution is to locate the procedure that packs messages, which we call the (message) packing procedure. It is invoked by each API to generate the messages to be sent to IoT devices. We then remove the input sanitization in the packing procedures. Note that unlike prior work [12, 51], which removes sanitization in the companion app of each device, our sanitization elimination is a one-time effort.

There are two types of commands.

• Ordinary commands. Each cluster contains zero or more ordinary commands. The packing procedure, InteractionModelCommands::SendCommand, from the controller chip-tool [48], is used for generating such commands.

• Write-Attribute commands, which can modify the specified cluster attribute. The packing procedure, InteractionModelCommands::WriteAttribute, from chip-tool [48], is used for generating such commands.

5.5 Device State Monitor

From the perspective of mGPTFuzz, it is trivial to detect a device crash, as a crash causes a disconnection (and a timeout exception for the next test message). To detect a non-crash bug, for each test message, if the response message and the destination state, in terms of the involved attribute values, do not adhere to the transition described in the FSMs, a non-crash bug is captured. Specifically, given a valid test message (i.e., valid command/attribute ID and argument values), the controller expects a SUCCESS response message from the device under test, and the destination state is also checked by querying the attribute describing the state. Given an invalid one, it expects an error message, such as INVALID_COMMAND. Given a bug, if the symptom can only be reproduced when the device is at certain states, it is a stateful bug; otherwise, it is a non-stateful one.

6 Evaluation

This section presents the implementation and the evaluation results. Section 6.1 describes the implementation details. Section 6.2 presents the experimental setup. Section 6.3 summarizes the bug-finding results. Section 6.4 presents the results of detecting crash bugs, and Section 6.5 non-crash bugs. We compare mGPTFuzz with a state-of-the-art work in Section 6.6. Finally, Section 6.7 discusses the efficiency.

6.1 Implementation

We implement a prototype of mGPTFuzz. We utilize an open-source tool, chip-tool [48], provided by the Matter Consortium, to build our custom controller. We remove the input sanitization in its message packing procedures, such that our test inputs are not rejected due to sanitization [51].

The controller is able to commission Matter-over-WiFi and Matter-over-Ethernet devices using the chip-tool code-wifi pairing script. To support Thread radio communication capabilities, we insert an nRF52840 Micro Dev Kit USB Dongle (priced at $21.99 on Amazon [5]) into our desktop. Moreover, we install the ot-br-posix library, which turns our desktop into an OpenThread Border Router (OTBR) [45]. Subsequently, our custom controller



ID Device Type Vendor Model Firmware Version Protocol
1 Plug Kasa KP125MP4 1.0 Matter over WiFi
2 Plug Tapo P125M 1.0.7 Matter over WiFi
3 Bulb Orein OS0100811267 3.01.26 Matter over WiFi
4 Lightstrip Nanoleaf NF080K03-2LS v3.5.10 Matter over Thread
5 Hub SwitchBot W3202100 v1.0-0.8 Matter over WiFi
6 Motion Sensor Eve 20EBY9901 2.11 Matter over Thread
7 Door Sensor Eve Door 20EBN9901 2.1.1 Matter over Thread
8 Bulb Sengled W41-N15A v22 Matter over WiFi
9 Plug Onvis S4 1.1 Matter over Thread
10 LED Strip Govee H61E1 v3.00.42 Matter over WiFi
11 Smart Bulb Linkind LS0101811266 3.01.26 Matter over WiFi
12 Water Sensor Aqara DW-S02E 1.0 Matter over Thread
13 Switch Eve 20EBU4101 3.2.1 Matter over Thread
14 Locker Yale W41-N15A 1.0 Matter over Thread
15 Cube Smart Lamp Yeelight YLFWD-0009 v1.12.69 Matter over WiFi
16 Hub Philips Hue 453761 v1.59.195909703 Matter over Ethernet
17 Button Tuo TSB3194 1.0 Matter over Thread
18 Contact sensor Tuo TCS-07505 1.0 Matter over Thread
19 Wifi plug Vuytret YX-WS02B v1.0.5 Matter over WiFi
20 Light Switch TP-Link KS225 1.0 Matter over WiFi
21 Light Switch Tapo Tapo S505 1.0 Matter over WiFi
22 Dimmer Switch Linkind B0C74J9FCN 1.0 Matter over WiFi
23 Smart Switch SONOFF MINIR4M 1.0 Matter over WiFi
(a) Device details (b) Photo of devices

Figure 10: IoT devices used in our experiments.

is able to commission Matter-over-Thread devices using the chip-tool code-thread pairing script.

For the LLM, we use GPT-4-Turbo [44]. The temperature in an LLM is a parameter that controls the randomness of the LLM's output [42]. A higher temperature results in more creative and imaginative text, while a lower one results in more accurate and factual text. We aim to obtain precise and factual information extracted from the Matter protocol specification; thus, temperature=0 is employed.

To fuzz-test a device, the only manual effort is to pair it with mGPTFuzz. Note that developing mGPTFuzz, including prompt engineering, is a one-time effort.

6.2 Experimental Setup

Matter Devices Under Test. We acquire 23 popular consumer Matter IoT devices from both online and offline markets, covering various brands, such as Philips Hue, Yeelight, and Yale. The types of the Matter devices include smart switches, plugs, lighting, locks, sensors, and hubs. These devices are either recommended by Amazon or the best-selling products in supermarkets. Their details are illustrated in Figure 10.

Testing Environment. Our mGPTFuzz runs on an Ubuntu 20.04 PC with a 4.9 GHz Intel® Core(TM) i7 CPU and 32 GB RAM. We configure the Matter devices in a fully controlled network to avoid the interference of irrelevant traffic.

Baseline Method. Blackbox fuzzing of IoT firmware demonstrates noteworthy results. There are a variety of IoT blackbox fuzzers, such as IoTFuzzer [12], Diane [51], HubFuzzer [35], FIoT [78] and SNIPUZZ [22]. (1) We excluded IoTFuzzer and Diane for comparison because they send test inputs from companion apps, while a Matter device cannot be controlled through the device's companion app. (2) We excluded HubFuzzer, as it only tests ZigBee and ZWave devices. (3) We also excluded fuzzers that are not open source. We thus picked SNIPUZZ for comparison. Another reason we chose SNIPUZZ is that its evaluation shows it outperforms prior work, such as NEMESYS [30], BooFuzz [29] and DooNA [67].

6.3 Bug Discovery Results

We divide bugs into two categories: (1) crash bugs, which result in device crashes, and (2) non-crash bugs, which cause incorrect behaviors but do not crash devices. From the 23 Matter devices, we discover 147 bugs, including 5 crash bugs and 142 non-crash bugs. Among the 147 bugs, there are 10 stateful bugs, where 4 are stateful crash bugs and 6 are stateful non-crash bugs. The other 137 bugs can be triggered regardless of the current device state. The detailed results are outlined in Table 1. Among the 147 bugs, 61 bugs lead to a denial of service, i.e., the devices crash (CVE-2023-45955 & CVE-2023-45956), or do not respond until they are re-paired with the controller (CVE-2023-42189). Given the DoS nature, we classify the 61 bugs as vulnerabilities.

6.4 Crash Bugs

The 5 identified crash bugs are distributed as follows:

• One crash bug exists in the device Nanoleaf Lighting NF080K03-2LS (with the device ID = 4), and it has been assigned CVE-2023-45955.

• Four crash bugs exist in Govee Lighting H61E1 (with the device ID = 10), which are stateful bugs, requiring



the device to be set to a particular state to be triggered. These bugs have been assigned CVE-2023-45956.

Note that to save CVE resources, given multiple bugs of a device that are related to a group of similar commands or exploit messages, only one CVE is requested.

Below we discuss these discovered crash bugs. The details of these bugs are summarized in Table 2. A hidden API means that the command or attribute is not covered in the vendor's API-testing scripts or described on its website.

Table 1: Summary of bugs detected by mGPTFuzz. (1) UT stands for Unexpected Transition, meaning the device transits to an unexpected state. (2) DoS means Denial of Service. Note that all the bugs are missed by the baseline tool SNIPUZZ. The fuzzing time is mainly determined by the number of commands and attributes supported by the device.

Device ID | Crash Bugs (#) | Crash Impact | Non-crash Bugs (#) | Non-crash Impact | Fuzzing Time
1 | 0 | - | 8 | UT, DoS | 1.28 h
2 | 0 | - | 8 | UT, DoS | 1.24 h
3 | 0 | - | 10 | UT, DoS | 1.45 h
4 | 1 | DoS | 9 | UT, DoS | 3.00 h
5 | 0 | - | 3 | UT, DoS | 2.01 h
6 | 0 | - | 4 | UT, DoS | 1.68 h
7 | 0 | - | 4 | UT, DoS | 4.67 h
8 | 0 | - | 7 | UT, DoS | 4.00 h
9 | 0 | - | 9 | UT, DoS | 3.70 h
10 | 4 | DoS | 12 | UT, DoS | 4.20 h
11 | 0 | - | 10 | UT, DoS | 4.55 h
12 | 0 | - | 4 | UT, DoS | 1.67 h
13 | 0 | - | 9 | UT, DoS | 1.03 h
14 | 0 | - | 4 | UT, DoS | 3.09 h
15 | 0 | - | 9 | UT, DoS | 3.67 h
16 | 0 | - | 3 | UT, DoS | 1.37 h
17 | 0 | - | 2 | DoS | 4.50 h
18 | 0 | - | 2 | DoS | 2.93 h
19 | 0 | - | 6 | UT, DoS | 3.92 h
20 | 0 | - | 7 | UT, DoS | 3.40 h
21 | 0 | - | 6 | UT, DoS | 3.60 h
22 | 0 | - | 2 | DoS | 3.07 h
23 | 0 | - | 4 | UT, DoS | 3.70 h

6.4.1 Crash Bug in Nanoleaf Lighting Device

This bug is related to a hidden write-attribute command, Write_Attribute_Binding, within the Binding cluster, which is used to establish a persistent relationship between an endpoint and local/remote endpoints. This command accepts one argument with the data type List in one of the following two formats: List[node-id, endpoint-id, cluster-id] or List[group-id].

Triggering Bug. In the fourth column of Table 2, for the sake of presentation simplicity, only the message payload is displayed. The payload of a message should follow a specific JSON format. For example, given a payload, {"0" : 1}, the integer 0 within the quotation marks indicates the index of the argument, and the value of 1 after the colon represents the value of the corresponding argument. If there is more than one argument, there will be more than one set of quotation marks (with each enclosed value indicating the index of an argument) as well as the corresponding values.

To trigger the bug, mGPTFuzz constructs a command message where the list has only one element with value = 0 (following fuzzing policy 1 in Section 5.3); the generated payload is ["0" : {"0" : 0}].

Observation. The device is supposed to reject the aforementioned invalid input. However, it accepts the message. We observe that the light initially exhibits a flickering behavior, followed by a crash.

6.4.2 Stateful Crash Bugs in Govee Lighting Device

The four bugs are related to four hidden commands, Move_up (uint8), Move_down (uint8), Move_up_OnOff (uint8), and Move_down_OnOff (uint8), which can increase or decrease the brightness of the device with or without the OnOff effect at a certain rate. Each command accepts one argument with the data type uint8, which specifies the rate value.

Triggering Bugs. If the device is at the required initial state and then an invalid value of 0 is provided as the command argument (following fuzzing policy 1 in Section 5.3), the test message makes the device crash. Taking Move_up_OnOff (uint8) as an example (the last row in Table 2), the normal payload is {"0" : 20}, where the value of 20 (without quotes) denotes the rate value. If the brightness level (CurrentLevel) is the lowest (i.e., 1) and the rate in the command is an invalid value of 0, the generated test message causes the device to crash. Note that 254 represents the highest brightness value in Matter, and 1 the lowest.

It is worth noting that a stateful bug can be triggered only if the device is first set to a certain initial state. Otherwise, an identical exploit input cannot trigger the bug. This exemplifies the importance of stateful fuzzing.

6.5 Non-Crash Bugs

From the 23 Matter devices, mGPTFuzz finds 142 non-crash bugs, 6 of which are stateful non-crash bugs. Detecting non-crash bugs presents a greater challenge compared to crash bugs, as the network connection state, which can be employed as a clue for crash bug detection, is not useful for detecting non-crash bugs.

We find two types of non-crash bugs: N1) bugs where the device should reject the corresponding exploit messages but accepts and processes them; N2) bugs where the device should accept the corresponding exploit messages but mistakenly rejects them. The distribution of the two types of non-crash bugs across the tested devices is shown in Table 3.

Below we present some cases of non-crash bugs. The details of these cases are summarized in Table 4. Specifically, Section 6.5.1 and Section 6.5.2 discuss some cases of non-



Table 2: Details of discovered crash bugs.

Device ID | Hidden API? | Command | Normal Message → Exploit (only payload is shown) | Required Initial State | Observation | CVE
4 | ✓ | Write_Attribute_Binding(List[uint16]) | ["0": {"0" : 20}] → ["0": {"0" : 0}] (Policy 1: provide an invalid value 0 to the element of the argument) | Any state | Device crashed | CVE-2023-45955
10 | ✓ | Move_down(uint8) | {"0": 10} → {"0": 0} (Policy 1: provide an invalid value 0 to the argument) | CurrentLevel = 254 | Device crashed | CVE-2023-45956
10 | ✓ | Move_up(uint8) | {"0": 20} → {"0": 0} (Policy 1: provide an invalid value 0 to the argument) | CurrentLevel = 1 | Device crashed | CVE-2023-45956
10 | ✓ | Move_down_OnOff(uint8) | {"0": 20} → {"0": 0} (Policy 1: provide an invalid value 0 to the argument) | CurrentLevel = 254 | Device crashed | CVE-2023-45956
10 | ✓ | Move_up_OnOff(uint8) | {"0": 20} → {"0": 0} (Policy 1: provide an invalid value 0 to the argument) | CurrentLevel = 1 | Device crashed | CVE-2023-45956

crash bugs of Type N1 (Section 6.5.2 focuses on stateful non-crash bugs of Type N1), and Section 6.5.3 Type N2.

Table 3: Summary of non-crash bugs. There are two types of non-crash bugs. Type N1 refers to bugs where the device should reject the test messages but instead accepts them. Type N2 refers to bugs where the device should accept the test messages but mistakenly rejects them.

Device ID | Type N1 (# of Bugs) | Type N2 (# of Bugs)
1 | 8 | -
2 | 8 | -
3 | 9 | 1
4 | 9 | -
5 | 3 | -
6 | 4 | -
7 | 4 | -
8 | 7 | -
9 | 9 | -
10 | 12 | -
11 | 9 | 1
12 | 4 | -
13 | 9 | -
14 | 4 | -
15 | 9 | -
16 | 3 | -
17 | 2 | -
18 | 2 | -
19 | 6 | -
20 | 7 | -
21 | 6 | -
22 | 2 | -
23 | 4 | -

6.5.1 Non-Crash Bugs of Type N1 in All Matter Devices

Two non-crash bugs affect all the Matter devices, as shown in Table 4 (with the device ID labeled as All). These bugs have been assigned CVE-2023-42189. They are not stateful bugs, so they can be triggered in any device state. Both bugs are related to the hidden command KeySetRemove (uint16) (see Figure 2). This command is used to remove a key set from an entire stack storing all keys, where the index for removal is determined by the argument of the command. This command accepts one argument with the data type uint16, and its valid value range is [1, 65534].

Triggering Bugs. There are two ways to exploit the hidden command KeySetRemove() to trigger the non-crash bugs, as shown in Table 4 (with the device ID labeled as All).

(1) If an invalid value {"0" : 0} is generated for the argument (following fuzzing policy 1 in Section 5.3), no matter what the current device state is, the message causes the device to become unresponsive and out of service.

(2) If no arguments are provided to the command (following fuzzing policy 3 in Section 5.3), the generated test message causes the device to become unresponsive to any subsequent request, regardless of the current device state.

Analysis. Since the non-crash bugs are present across all the devices, we suspect that they are associated with the Matter SDK. We thus reported the bugs to the Matter SDK developers, who confirmed the bugs and fixed them promptly (in Matter v1.1), as documented in [40]. Below, we analyze the root cause of the bug and its patch.

When a Matter controller sends the KeySetRemove command to a Matter device, the Matter device invokes the KeySetRemoveCallBack function to manage the command and process its payload. The payload specifies the index of the key set that should be removed. When the index is 0, the key set is associated with the Identity Protection Key (IPK). On the other hand, if the index is not specified (i.e., no argument is provided to the KeySetRemove command), the KeySetRemoveCallBack function automatically assigns the removal index to 0. The IPK serves as a crucial public key utilized by both the Matter device and Matter controller. It undergoes verification throughout the entire communication between the Matter device and controller to guarantee the integrity of the communication. If the verification of the IPK fails, any further service request of the device is denied, resulting in a Denial of Service (DoS). Therefore, the IPK should not be removed, in order to maintain the security and functionality of the Matter system.

However, the vulnerable KeySetRemoveCallBack function, when provided with a removal index of 0, incorrectly removes the IPK, and sends a SUCCESS response to the controller (or mGPTFuzz in our work). As the IPK is removed, the encrypted communication cannot be decrypted and verified, rendering the Matter device unresponsive to any subsequent requests. To rectify this issue, the patched KeySetRemoveCallBack function incorporates the following check: it first verifies whether the removal index is 0 or not



150 120
before executing the key removal action. If the index is 0, it
replies with INVALID_COMMAND as a status code.

[Figure 11 shows two scatter plots of testing time (minutes, Y-axis) versus test cases (X-axis): (a) Nanoleaf Lightstrip NF080K03-2LS (ID = 4); (b) Eve Motion Sensor 20EBY9901 (ID = 6).]
Figure 11: Efficiency results, where a red dot denotes a crash bug and a green dot denotes a non-crash bug.

6.5.2 Stateful Non-Crash Bugs of Type N1

We find six Type N1 stateful non-crash bugs in the device Govee Lighting H61E1 (with the device ID = 10). These non-crash bugs are related to three hidden commands: MoveHue, MoveSaturation, and EnhancedMoveHue. Each command accepts two arguments, where the first argument is of the data type enum and takes a value ∈ {0, 1, 2, 3}, and the second one is of the data type uint8. Taking MoveHue as an example, the first argument is MoveMode, which determines the direction of the hue change. Specifically, when MoveMode equals 0, it indicates the stop direction (i.e., no hue change); when MoveMode equals 1, the device should increase its hue; when MoveMode equals 3, the device should decrease its hue. The second argument Rate specifies the rate of movement per second.

Triggering Bugs. The Matter specification explicitly states that (1) a message where the first argument, MoveMode, is set to 1 (increase) or 3 (decrease) and the second argument Rate equals 0 is considered an invalid message, and (2) if this invalid message is sent to the device, the device should reject the message and respond INVALID_COMMAND.

Our tool mGPTFuzz successfully extracts this critical information from the specification, and finds two ways to trigger the non-crash bugs for each command, as shown in Table 4 (corresponding to the rows where the device ID = 10). Taking MoveHue as an example, if the first parameter is assigned a value of 1 or 3, the second parameter is set to a value of 0, and, at the same time, the current hue value is set to the maximum (i.e., 254), the device accepts the invalid test message. The details of the test messages that trigger the other non-crash bugs are also outlined in Table 4.

Observations. According to the Matter specification, the aforementioned test messages are invalid, and the expected behavior of the device is to reject them and reply with INVALID_COMMAND. But our observations reveal that upon receiving these test messages, the actual behavior of the device is to accept and process them, resulting in an alteration of the light color and the hue value changed to 0.

6.5.3 Non-Crash Bugs of Type N2

We find one Type N2 non-crash bug in each of the two devices, Orein Bulb OS0100811267 (with the device ID = 3) and Linkind Bulb LS0101811266 (with the device ID = 11). This bug is related to the hidden command MoveColor(int16, int16), which is used to modify the ColorMode attribute on a device, prompting it to transition colors continuously at the specified rates. It accepts two arguments of the data type int16, which specify the rates of color changes per second.

Triggering Bug and Observations. When both the first and second arguments are assigned a value of 0, the resulting test message is valid and should be accepted by the device. However, the devices reject the message.

6.6 Comparison with Baseline Method

We consider SNIPUZZ as the baseline, which represents the state of the art in blackbox fuzzing of IoT devices [22]. For a fair comparison, we have extended the capabilities of SNIPUZZ: (1) SNIPUZZ is integrated into our custom controller, so plaintext messages are presented to it; consequently, it is able to test Matter devices, which always use encrypted communication. (2) Hidden commands are provided to it. The details are described below.

SNIPUZZ is designed to detect crash bugs and is not capable of detecting non-crash bugs. We thus compare the performance of crash-bug detection between SNIPUZZ and our tool mGPTFuzz. We use the enhanced SNIPUZZ to test all the 23 Matter devices. However, after 24 hours of fuzz testing on each device, no bugs are found by SNIPUZZ.

SNIPUZZ requires the API-testing programs of IoT devices to collect seed messages and can only test the commands covered by the API-testing programs. As a result, it cannot detect the crash bugs triggered by the hidden commands, which include all the 5 crash bugs detected by mGPTFuzz. We then proceed to investigate whether SNIPUZZ could detect these bugs if the corresponding hidden commands were provided to it. Specifically, for each discovered crash bug, we provide SNIPUZZ with a message associated with the hidden command that involves this bug. We then use the snippet determination algorithm of SNIPUZZ to partition these messages. The results show SNIPUZZ is unable to accurately determine the snippets for any of them. E.g., the message that can trigger a bug related to the command Move_up (discussed in Section 6.4.2) should be partitioned into 14 snippets, but it is inaccurately partitioned into 6 snippets after a 2-hour analysis.

We further investigate the snippet determination algorithm of SNIPUZZ, and have the following findings. The Matter protocol requires the payload of a Matter message to follow a JSON format. However, as SNIPUZZ removes the bytes in a message one by one to generate probe messages, this results in probe messages not following the JSON format.
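The effect of byte-level probe generation on JSON payloads is easy to reproduce. The snippet below is a toy illustration of the problem, not SNIPUZZ's actual code:

```python
import json

def is_valid_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except json.JSONDecodeError:
        return False

# A Matter-style payload; delete one byte at a time, as byte-level
# probe generation does, and check how many probes remain valid JSON.
payload = '{"0": 1, "1": 0}'
probes = [payload[:i] + payload[i + 1:] for i in range(len(payload))]
broken = [p for p in probes if not is_valid_json(p)]
print(f"{len(broken)} of {len(probes)} probes are no longer valid JSON")
```

Most single-byte deletions (removing a brace, quote, comma, colon, or value digit) yield syntactically invalid payloads, which a device can discard before any interesting logic runs.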



Table 4: Some of the discovered non-crash bugs.

| Device ID | Hidden API? | Hidden Command | Normal Message → Test Message (only payload is shown) | Required Initial State | Expected Behavior | Actual Behavior |
|---|---|---|---|---|---|---|
| All | ✓ | KeySetRemove(uint16) (CVE-2023-42189) | {"0": 20} → {"0": 0} (Policy 1: provide an invalid value 0 to the argument); {"0": 20} → {} (Policy 3: provide no argument) | Any state | Device should reject message & respond INVALID_COMMAND | Device was out of service |
| 10 | ✓ | MoveHue(enum, uint8) | {"0": 1, "1": 2} → {"0": 1, "1": 0}; {"0": 3, "1": 5} → {"0": 3, "1": 0} (both Policy 1: provide an invalid value 0 to the argument) | CurrentHue = 254 | Device should reject message & respond INVALID_COMMAND | Device state was changed to CurrentHue = 0 |
| 10 | ✓ | MoveSaturation(enum, uint8) | {"0": 1, "1": 2} → {"0": 1, "1": 0}; {"0": 3, "1": 5} → {"0": 3, "1": 0} (both Policy 1: provide an invalid value 0 to the argument) | CurrentSaturation = 254 | Device should reject message & respond INVALID_COMMAND | Device state was changed to CurrentSaturation = 0 |
| 10 | ✓ | EnhancedMoveHue(enum, uint8) | {"0": 1, "1": 2} → {"0": 1, "1": 0}; {"0": 3, "1": 5} → {"0": 3, "1": 0} (both Policy 1: provide an invalid value 0 to the argument) | EnhancedCurrentHue = 254 | Device should reject message & respond INVALID_COMMAND | Device state was changed to EnhancedCurrentHue = 0 |
| 3, 11 | ✓ | MoveColor(int16, int16) | {"0": 3, "1": 2} → {"0": 0, "1": 0} (Policy 1: provide a value 0 to both arguments) | Any state | Device should accept message | Device rejected message |
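The test messages in Table 4 instantiate simple, spec-guided mutation policies. A minimal sketch follows; the policy numbering matches Section 5.3, but the helper functions are our own illustration, not mGPTFuzz's implementation:

```python
# Illustrative sketch of the mutation policies behind Table 4's test
# messages; helper names are ours, not mGPTFuzz's implementation.

def policy1_invalid_value(payload: dict, field: str, bad_value=0) -> dict:
    """Policy 1: replace one argument with a spec-invalid value."""
    mutated = dict(payload)
    mutated[field] = bad_value
    return mutated

def policy3_no_argument(payload: dict) -> dict:
    """Policy 3: send the command with all arguments stripped."""
    return {}

# KeySetRemove: {"0": 20} -> {"0": 0} (Policy 1) and {} (Policy 3).
assert policy1_invalid_value({"0": 20}, "0") == {"0": 0}
assert policy3_no_argument({"0": 20}) == {}

# MoveHue: {"0": 1, "1": 2} -> {"0": 1, "1": 0}, invalid per the spec
# because Rate ("1") must not be 0 when MoveMode ("0") is 1 or 3.
assert policy1_invalid_value({"0": 1, "1": 2}, "1") == {"0": 1, "1": 0}
```

The device's response to each mutated payload is then compared against the spec-mandated behavior (accept, or reject with INVALID_COMMAND) to flag non-crash bugs.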

6.7 Efficiency

In the last column of Table 1, we present the total time spent by mGPTFuzz on testing each device. The longest testing time is approximately 5 hours, for the device with ID = 7.

We use two devices as examples to illustrate the fuzzing efficiency in terms of bugs discovered over time (Y-axis) and over the number of test messages (X-axis), as shown in Figure 11. mGPTFuzz can discover crash bugs and non-crash bugs efficiently. For the device Nanoleaf Lightstrip NF080K03-2LS shown in Figure 11(a), all bugs are found within 110 minutes and ≤3100 test messages, and the first bug is found within 10 minutes. For the device Eve Motion Sensor 20EBY9901 shown in Figure 11(b), all four bugs are found within 90 minutes and ≤1200 test messages.

7 Discussion

Compared to prior work [12, 22, 51], mGPTFuzz is limited to fuzzing Matter devices. However, given the importance of Matter, it is worth the dedicated effort. Furthermore, the approach of LLM-assisted blackbox fuzzing can be generalized to other scenarios where the specification is available, such as Zigbee, Thread, and Bluetooth.

Ethical Considerations and Proactive Harm Prevention. We have contacted all the vendors regarding the bugs and vulnerabilities of their products. We have reported the vulnerability (CVE-2023-42189) to the Matter SDK developer, since it impacts all Matter devices; it has been fixed in Matter V1.1. After contacting the vendors, we waited at least 90 days before reporting the vulnerabilities for CVE assignment. Matter is released under the Apache 2.0 license, permitting various uses, and the specification is publicly accessible on the official website [1, 2]. More importantly, according to OpenAI's documentation of ChatGPT [41], ChatGPT does not use content from its business offerings, such as ChatGPT Team or ChatGPT Enterprise, to train its models; we utilized ChatGPT Team throughout the study. Therefore, our approach using ChatGPT does not raise an ethical issue.

8 Conclusion

As an industry-wide IoT standard, Matter is expected to profoundly change the ecosystem of smart devices. Thus, fuzzing of Matter devices is an emerging and important problem. We present the first Matter fuzzer in the literature. A large language model is leveraged to transform the human-readable specification, over one thousand pages, into machine-readable information in the form of finite state machines (FSMs). Guided by the FSMs, our blackbox fuzzing is able to find stateful bugs and non-crash bugs, as well as crash bugs. We have built a prototype of mGPTFuzz and conducted an extensive evaluation involving 23 Matter devices. It finds 147 new bugs, including 61 zero-day vulnerabilities, with three CVEs assigned.

Acknowledgements

This work was supported in part by the US National Science Foundation (NSF) under grants CNS-2304720, CNS-2310322, CNS-2309550, and CNS-2309477. It was also supported in part by the Commonwealth Cyber Initiative (CCI). The authors would like to thank the anonymous reviewers for their valuable comments.



References

[1] Matter 1.0 application cluster specification, 2022. https://csa-iot.org/wp-content/uploads/2022/11/22-27350-001_Matter-1.0-Application-Cluster-Specification.pdf.

[2] Matter 1.0 core specification, 2022. https://csa-iot.org/wp-content/uploads/2022/11/22-27349-001_Matter-1.0-Core-Specification.pdf.

[3] Zafeer Ahmed, Ibrahim Nadir, Haroon Mahmood, Ali Hammad Akbar, and Ghalib Asadullah Shah. Identifying mirai-exploitable vulnerabilities in IoT firmware through static analysis. In Proc. IEEE International Conference on Cyber Warfare and Security, 2020.

[4] Amazon Developer. Alexa and matter, 2024. https://developer.amazon.com/en-US/alexa/matter.

[5] Amazon.com. GeeekPi nRF52840 Micro Dev Dongle. https://www.amazon.com/GeeekPi-nRF52840-Micro-Dev-Dongle/dp/B07MJ12XLG/ref=sr_1_2?crid=OQHJSOLRCRHI&keywords=nRF52840+Dongle&qid=1678331320&sprefix=nrf52840+dongle%2Caps%2C74&sr=8-2.

[6] Arrow Electronics. Matter solves IoT and smart home challenges, 2023. https://www.arrow.com/en/research-and-events/articles/matter-solves-iot-and-smart-home-challenges.

[7] Vaggelis Atlidakis, Patrice Godefroid, and Marina Polishchuk. Restler: Stateful REST API fuzzing. In Proc. IEEE International Conference on Software Engineering (ICSE), 2019.

[8] Lina Berzinskas. Obfuscating Android Apps: Do you know your choices for protection?, 2020. https://proandroiddev.com/obfuscation-is-important-do-you-know-your-options-30b3ef396dfe.

[9] Tim Blazytko, Moritz Schlögel, Cornelius Aschermann, Ali Abbasi, Joel Frank, Simon Wörner, and Thorsten Holz. AURORA: Statistical crash analysis for automated root cause explanation. In USENIX Security Symposium (USENIX Security), 2020.

[10] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.

[11] Daming D Chen, Maverick Woo, David Brumley, and Manuel Egele. Towards automated dynamic analysis for Linux-based embedded firmware. In Network and Distributed System Security Symposium (NDSS), 2016.

[12] Jiongyi Chen, Wenrui Diao, Qingchuan Zhao, Chaoshun Zuo, Zhiqiang Lin, XiaoFeng Wang, Wing Cheong Lau, Menghan Sun, Ronghai Yang, and Kehuan Zhang. IoTFuzzer: Discovering memory corruptions in IoT through App-based fuzzing. In Network and Distributed System Security Symposium (NDSS), 2018.

[13] Haotian Chi, Chenglong Fu, Qiang Zeng, and Xiaojiang Du. Delay wreaks havoc on your smart home: Delay-based automation interference attacks. In Proc. IEEE Symposium on Security and Privacy (S&P), 2022.

[14] Haotian Chi, Qiang Zeng, and Xiaojiang Du. Detecting and handling IoT interaction threats in multi-platform multi-control-channel smart homes. In USENIX Security Symposium (USENIX Security), 2023.

[15] Haotian Chi, Qiang Zeng, Xiaojiang Du, and Jiaping Yu. Cross-App interference threats in smart homes: Categorization, detection and handling. In Proc. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2020.

[16] Andrei Costin, Apostolis Zarras, and Aurélien Francillon. Automated dynamic firmware analysis at scale: A case study on embedded web interfaces. In Proc. ACM Asia Conference on Computer and Communications Security (ASIACCS), 2016.

[17] Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models. In Proc. ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2023.

[18] Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. Large language models are edge-case fuzzers: Testing deep learning libraries via FuzzGPT. arXiv preprint arXiv:2304.02014, 2023.

[19] Shuaike Dong, Menghao Li, Wenrui Diao, Xiangyu Liu, Jian Liu, Zhou Li, Fenghao Xu, Kai Chen, Xiaofeng Wang, and Kehuan Zhang. Understanding Android obfuscation techniques: A large-scale investigation in the wild. In Proc. Security and Privacy in Communication Networks (SecureComm), 2018.

[20] Xuechao Du, Andong Chen, Boyuan He, Hao Chen, Fan Zhang, and Yan Chen. AflIoT: Fuzzing on Linux-based IoT device with binary-level instrumentation. Computers & Security, 122:102889, 2022.



[21] Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. Scalable graph-based bug search for firmware images. In Proc. ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016.

[22] Xiaotao Feng, Ruoxi Sun, Xiaogang Zhu, Minhui Xue, Sheng Wen, Dongxi Liu, Surya Nepal, and Yang Xiang. Snipuzz: Black-box fuzzing of IoT firmware via message snippet inference. In Proc. ACM Conference on Computer and Communications Security (CCS), 2021.

[23] Paul Fiterau-Brostean, Bengt Jonsson, Robert Merget, Joeri de Ruiter, Konstantinos Sagonas, and Juraj Somorovsky. Analysis of DTLS implementations using protocol state fuzzing. In USENIX Security Symposium (USENIX Security), 2020.

[24] Chenglong Fu, Qiang Zeng, and Xiaojiang Du. HAWatcher: Semantics-aware anomaly detection for appified smart homes. In USENIX Security Symposium (USENIX Security), 2021.

[25] Matheus E Garbelini, Vaibhav Bedi, Sudipta Chattopadhyay, Sumei Sun, and Ernest Kurniawan. BrakTooth: Causing havoc on bluetooth link manager via directed fuzzing. In USENIX Security Symposium (USENIX Security), 2022.

[26] Jie Hu, Qian Zhang, and Heng Yin. Augmenting greybox fuzzing with generative AI. arXiv preprint arXiv:2306.06782, 2023.

[27] Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Ka-Wei Lee. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.

[28] Apple Inc. Matter support in iOS 16, 2023. https://developer.apple.com/apple-home/matter/.

[29] Joshua Pereyda. Boofuzz: Network protocol fuzzing for humans, 2017. https://github.com/jtpereyda/boofuzz.

[30] Stephan Kleber, Henning Kopp, and Frank Kargl. NEMESYS: Network message syntax reverse engineering by analysis of the intrinsic structure of individual messages. In USENIX Workshop on Offensive Technologies, 2018.

[31] Platon Kotzias, Srdjan Matic, Richard Rivera, and Juan Caballero. Certified PUP: abuse in authenticode code signing. In Proc. ACM SIGSAC Conference on Computer and Communications Security (CCS), 2015.

[32] Wenqiang Li, Jiameng Shi, Fengjun Li, Jingqiang Lin, Wei Wang, and Le Guan. µAFL: Non-intrusive feedback-driven fuzzing for microcontroller firmware. In Proc. IEEE International Conference on Software Engineering (ICSE), 2022.

[33] Lannan Luo and Qiang Zeng. Solminer: Mining distinct solutions in programs. In Proc. IEEE International Conference on Software Engineering (ICSE), 2016.

[34] Lannan Luo, Qiang Zeng, Bokai Yang, Fei Zuo, and Junzhe Wang. Westworld: Fuzzing-assisted remote dynamic symbolic execution of smart Apps on IoT cloud platforms. In Annual Computer Security Applications Conference (ACSAC), 2021.

[35] Xiaoyue Ma, Qiang Zeng, Haotian Chi, and Lannan Luo. No more companion Apps hacking but one dongle: Hub-based blackbox fuzzing of IoT firmware. In Proc. ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2023.

[36] Ruijie Meng, Martin Mirchev, Marcel Böhme, and Abhik Roychoudhury. Large language model guided protocol fuzzing. In Network and Distributed System Security Symposium (NDSS), 2024.

[37] Alejandro Mera, Bo Feng, Long Lu, and Engin Kirda. DICE: Automatic emulation of DMA input channels for dynamic firmware analysis. In Proc. IEEE Symposium on Security and Privacy (S&P), 2021.

[38] Ibrahim Nadir, Zafeer Ahmad, Haroon Mahmood, Ghalib Asadullah Shah, Farrukh Shahzad, Muhammad Umair, Hassam Khan, and Usman Gulzar. An auditing framework for vulnerability analysis of IoT system. In Proc. IEEE European Symposium on Security and Privacy Workshops, 2019.

[39] National Institute of Standards and Technology (NIST). CVE-2022-25836, 2022. https://nvd.nist.gov/vuln/detail/CVE-2022-25836.

[40] National Institute of Standards and Technology (NIST). CVE-2023-42189, 2023. https://nvd.nist.gov/vuln/detail/CVE-2023-42189.

[41] OpenAI. What is ChatGPT, 2022. https://help.openai.com/en/articles/6783457-what-is-chatgpt.

[42] OpenAI. API for authentication, 2023. https://platform.openai.com/docs/api-reference/authentication.

[43] OpenAI. Createtranscription: Temperature parameter, 2023. https://platform.openai.com/docs/api-reference/audio/createTranscription#audio-createtranscription-temperature.

[44] OpenAI. GPT-4 and GPT-4 Turbo documentation, 2023. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo.



[45] OpenThread. ot-br-posix, 2024. https://github.com/openthread/ot-br-posix.

[46] PeachTech. Peach fuzzer community. https://peachtech.gitlab.io/peach-fuzzer-community/.

[47] Van-Thuan Pham, Marcel Böhme, and Abhik Roychoudhury. AFLNet: A greybox fuzzer for network protocols. In Proc. IEEE International Conference on Software Testing, Validation and Verification (ICST), 2020.

[48] Project CHIP. Connected home over IP, 2024. https://github.com/project-chip/connectedhomeip.

[49] PyPI. pytesseract: Python-tesseract is an optical character recognition (OCR) tool for python, 2023. https://pypi.org/project/pytesseract/.

[50] Shisong Qin, Fan Hu, Zheyu Ma, Bodong Zhao, Tingting Yin, and Chao Zhang. NSFuzz: Towards efficient and state-aware network service fuzzing. ACM Transactions on Software Engineering and Methodology, 32(6):1–26, 2023.

[51] Nilo Redini, Andrea Continella, Dipanjan Das, Giulio De Pasquale, Noah Spahn, Aravind Machiry, Antonio Bianchi, Christopher Kruegel, and Giovanni Vigna. DIANE: Identifying fuzzing triggers in Apps to generate under-constrained inputs for IoT devices. In Proc. IEEE Symposium on Security and Privacy (S&P), 2021.

[52] Mengfei Ren, Xiaolei Ren, Huadong Feng, Jiang Ming, and Yu Lei. Z-fuzzer: device-agnostic fuzzing of zigbee protocol implementation. In Proc. ACM Conference on Security and Privacy in Wireless and Mobile Networks, 2021.

[53] Vinay Sachidananda, Suhas Bhairav, and Yuval Elovici. OVER: Overhauling vulnerability detection for IoT through an adaptable and automated static analysis framework. In Proc. ACM Symposium on Applied Computing (SAC), 2020.

[54] Schutzwerk GmbH. Security considerations for matter developers, 2023. https://www.schutzwerk.com/en/blog/matter-security-considerations/.

[55] Chaofan Shou, Jing Liu, Doudou Lu, and Koushik Sen. LLM4Fuzz: Guided fuzzing of smart contracts with large language models. arXiv preprint arXiv:2401.11108, 2024.

[56] Zhan Shu and Guanhua Yan. Iotinfer: Automated blackbox fuzz testing of IoT network protocols guided by finite state machine inference. IEEE Internet of Things Journal, 9(22):22737–22751, 2022.

[57] Simeng Sun, Yang Liu, Dan Iter, Chenguang Zhu, and Mohit Iyyer. How does in-context learning help prompt tuning? arXiv preprint arXiv:2302.11521, 2023.

[58] The Verge. Nest thermostat gains Matter support, works with Apple home, 2023. https://www.theverge.com/2023/4/18/23687751/nest-thermostat-matter-support-apple-home.

[59] Jennifer Pattison Tuohy. Matter's plan to save the smart home, 2021. https://www.theverge.com/22787729/matter-smart-home-standard-apple-amazon-google.

[60] Junzhe Wang and Lannan Luo. Privacy leakage analysis for colluding smart apps. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, 2022.

[61] Junzhe Wang, Matthew Sharp, Chuxiong Wu, Qiang Zeng, and Lannan Luo. Can a deep learning model for one architecture be used for others? retargeted-architecture binary code analysis. In USENIX Security Symposium (USENIX Security), 2023.

[62] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR), 2023.

[63] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[64] Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382, 2023.

[65] Wiki. Matter (standard), 2022. https://en.wikipedia.org/wiki/Matter_(standard).

[66] Wired. What is matter?, 2023. https://www.wired.com/story/what-is-matter/.

[67] wireghoul. Doona, 2019. https://github.com/wireghoul/doona.

[68] Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Fuzz4all: Universal fuzzing with large language models. In Proc. IEEE International Conference on Software Engineering (ICSE), 2024.



[69] Jonas Zaddach, Luca Bruno, Aurelien Francillon, and Davide Balzarotti. AVATAR: a framework to support dynamic security analysis of embedded systems' firmwares. In Network and Distributed System Security Symposium (NDSS), 2014.

[70] JD Zamfirescu-Pereira, Richmond Y Wong, Bjoern Hartmann, and Qian Yang. Why Johnny can't prompt: how non-AI experts try (and fail) to design LLM prompts. In Proc. ACM Conference on Human Factors in Computing Systems (CHI), 2023.

[71] Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, and Zhoujun Li. Resilient decentralized android application repackaging detection using logic bombs. In Proc. ACM International Symposium on Code Generation and Optimization (CGO), 2018.

[72] Qiang Zeng, Lannan Luo, Zhiyun Qian, Xiaojiang Du, Zhoujun Li, Chin-Tser Huang, and Csilla Farkas. Resilient user-side android application repackaging and tampering detection using cryptographically obfuscated logic bombs. IEEE Transactions on Dependable and Secure Computing, 18(6):2582–2600, 2021.

[73] Yu Zhang, Nanyu Zhong, Wei You, Yanyan Zou, Kunpeng Jian, Jiahuan Xu, Jian Sun, Baoxu Liu, and Wei Huo. NDFuzz: A non-intrusive coverage-guided fuzzing framework for virtualized network devices. Cybersecurity, 5(1):1–21, 2022.

[74] Yaowen Zheng, Ali Davanian, Heng Yin, Chengyu Song, Hongsong Zhu, and Limin Sun. FIRM-AFL: High-throughput greybox fuzzing of IoT firmware via augmented process emulation. In USENIX Security Symposium (USENIX Security), 2019.

[75] Yaowen Zheng, Yuekang Li, Cen Zhang, Hongsong Zhu, Yang Liu, and Limin Sun. Efficient greybox fuzzing of applications in Linux-based IoT devices via enhanced user-mode emulation. In Proc. ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), 2022.

[76] Wei Zhou, Le Guan, Peng Liu, and Yuqing Zhang. Automatic firmware emulation through invalidity-guided knowledge inference. In USENIX Security Symposium (USENIX Security), 2021.

[77] Wei Zhou, Lan Zhang, Le Guan, Peng Liu, and Yuqing Zhang. What your firmware tells you is not how you should emulate it: A specification-guided approach for firmware emulation. In Proc. ACM Conference on Computer and Communications Security (CCS), 2022.

[78] Lipeng Zhu, Xiaotong Fu, Yao Yao, Yuqing Zhang, and He Wang. FIoT: Detecting the memory corruption in lightweight IoT device firmware. In Proc. International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2019.

[79] Xiaogang Zhu, Sheng Wen, Seyit Camtepe, and Yang Xiang. Fuzzing: A survey for roadmap. ACM Computing Surveys (CSUR), 54(11s):1–36, 2022.

[80] Fei Zuo, Xiaopeng Li, Patrick Young, Lannan Luo, Qiang Zeng, and Zhexin Zhang. Neural machine translation inspired binary code similarity comparison beyond function pairs. In Network and Distributed System Security Symposium (NDSS), 2019.

A Example of an Assembled Prompt

Figure 12 shows the prompt for querying the information of the Groups cluster.

Prompt:

You will be provided with a section of the protocol specification text about the Groups Cluster. Please provide responses to the questions in the specified order and format as outlined in the text provided. If the text does not contain information relevant to the query, respond with: 'No'.

Groups Cluster text:
The Groups cluster manages, per endpoint, the content of the node-wide Group Table that is part of the underlying interaction layer. ... The GroupID field is set to the GroupID field of the received RemoveGroup command.

Please respond to the questions based on the provided text of the Groups cluster with the required format. Queries are as follows:
1. From the "Data Types" section, extract all derived datatypes and their corresponding value ranges, especially the datatypes with the suffixes "Enum" and "Struct". Ensure the return format is a Dictionary similar to the format of the Example Output.
2. From the "Commands" section, extract all commands, and for each command, extract the datatype and value range for each of its arguments. Ensure the return format is a Dictionary similar to the format of the Example Output.
3. From the "Attributes" section, extract all attributes, and for each attribute, extract its datatype and value range. Ensure the return format is a Dictionary similar to the format of the Example Output.
4. Return a Python list that includes all command IDs.
5. Return a Python list that includes all attribute IDs.

Here is an Example Output in JSON format.

[Example Output].

Figure 12: Assembled prompt for querying the information of the Groups cluster.

