cozy: Comparative Symbolic Execution for Binary Programs
Caleb Helbling, Graham Leach-Krouse, Sam Lasser, and Greg Sullivan
Draper
{chelbling, gleach-krouse, slasser, gsullivan}@draper.com

Workshop on Binary Analysis Research (BAR) 2025
28 February 2025, San Diego, CA, USA
ISBN 979-8-9919276-4-2
https://dx.doi.org/10.14722/bar.2025.23004
www.ndss-symposium.org
arXiv:2504.00151v1 [cs.SE] 31 Mar 2025

Abstract: This paper introduces cozy, a tool for analyzing and visualizing differences between two versions of a software binary. The primary use case for cozy is validating “micropatches”: small binary or assembly-level patches inserted into existing compiled binaries. To perform this task, cozy leverages the Python-based angr symbolic execution framework. Our tool analyzes the output of symbolic execution to find end states for the pre- and post-patched binaries that are compatible (reachable from the same input). The tool then compares compatible states for observable differences in registers, memory, and side effects. To aid in usability, cozy comes with a web-based visual interface for viewing comparison results. This interface provides a rich set of operations for pruning, filtering, and exploring different types of program data.

I. INTRODUCTION

Much of today’s infrastructure is built on a foundation of legacy software; maintaining and securing this software is a critically important task. Patching legacy software must sometimes take place at the binary level due to loss of source code, build toolchain/environment “bit rot,” or limitations on the deployment system (for example, bandwidth-limited systems in contested environments). Under these conditions, software maintainers sometimes deploy software micropatches: minimal assembly-level changes that fix a bug or add functionality. Due to the low-level nature of binary patches, it can be difficult to reason about their effects on program behavior.

In theory, one could gain confidence that a patch has made all and only the desired changes by using a variant of comparative symbolic execution¹ (CSE) [1], [2]. In other words, one could run the pre- and post-patched programs on symbolic input in order to identify inputs that cause the programs to behave differently or violate a relative correctness specification. However, two challenges limit CSE’s suitability for validating real-world binary patches. First, existing CSE techniques target source code or idealized high-level languages. Second, CSE results can be difficult to interpret. CSE typically produces a formal description of the programs’ semantic differences; this description can be complex when the programs under analysis are large, when the patch produces a large change in program behavior, or both.

This work presents cozy, a tool that provides insight into the effects of binary patches by identifying and visualizing semantic differences between binary programs. The tool has two main components: (1) a symbolic execution framework for analyzing pairs of binaries, and (2) a visualization engine for displaying and exploring CSE results. The cozy approach to CSE involves running two programs on symbolic input in order to identify pairs of final machine states that are compatible: reachable by the same input. Differences in program behavior can then be characterized in terms of differences between compatible states. One attractive feature of this approach is that unlike some comparative analyses, cozy CSE does not require a correctness specification as input. This feature is useful when the analyst does not know in advance how the programs should differ. In such a case, the analyst can examine compatible state pairs manually or check various specifications against the pairs in a post hoc manner.

Because cozy targets a scenario in which source code is unavailable, it must be able to symbolically execute binary programs. To achieve this goal, the tool builds upon the angr [3] binary analysis platform.

In summary, this work makes the following contributions:
• We present the cozy comparative symbolic execution (CSE) framework, a novel adaptation of CSE to the binary domain.
• We present the cozy graphical interface for visualizing the results of CSE and for exploring the effects of binary patches on program behavior.

cozy is an open-source Python package. The tool can be installed via the Python Package Index (PyPI) [4]; its source code and documentation are available on GitHub [5].

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Naval Information Warfare Center (NIWC) Pacific, under Contract No. N66001-20-C-4018. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement “A” (Approved for Public Release, Distribution Unlimited).

¹ Prior work sometimes refers to “differential symbolic execution” [1] and “relational symbolic execution” [2]. Throughout this work, we use “comparative symbolic execution” as a generic term for this family of techniques.

II. EXAMPLE

We introduce cozy with an example that involves two attempts at patching a vulnerable binary. cozy helps the user discover that while the first patch fixes the vulnerability, it also introduces unintended behavior. The tool then confirms that the second patch fixes the vulnerability without producing unintended behavior.
     1  void update(char *serialized) {
     2    // begin patch
     3    if (num_semicolons(serialized) > 2) {
     4      puts("bad serialization!"); exit(1); }
     5    // end patch
     6    char *command = strtok(serialized, ";");
     7    char *role = strtok(NULL, ";");
     8    char *data = strtok(NULL, "");
     9    if ((command is not "DELETE"|"STORE") ||
    10        (role is not "root"|"guest")) {
    11      puts("bad input!"); exit(1); }
    12    if ((command is "DELETE") && (role is "root")) {
    13      delete(data);
    14    } else if (command is "STORE") {
    15      store(data);
    16    } else {
    17      puts("permission denied");
    18    }
    19    exit(0); }
    20  int main(int argc, char **argv) {
    21    char *command = argv[1], *role = argv[2],
    22         *data = argv[3];
    23    int len = strlen(command) + strlen(role) +
    24              strlen(data) + 8;
    25    char *serialized = malloc(len * sizeof(char));
    26    sprintf(serialized, "%s;%s;%s", command,
    27            role, data);
    28    update(serialized); }

(a) Pseudo-C code for a simplified database front end. A user with root access is allowed to both store and delete data; a user with guest access is only allowed to store data. The original version of the program, which excludes lines 2–5, has a command injection vulnerability that enables a malformed command string to bypass the prohibition on guest deletions. Lines 2–5 are an overly restrictive patch that fixes the vulnerability but also rejects valid data payloads.

(b) cozy visual comparison of the pre- and post-patched versions of the Figure 1a program. Trees on the left and right represent possible execution paths for the pre- and post-patched programs, respectively. The purple node on the left represents a violation of the assertion that a guest cannot delete data; i.e., cozy finds an input to the pre-patched binary that breaks the “no guest deletions” rule. The right pane shows paths through the post-patched binary that are triggered by the same input. All such paths, as well as some additional paths that are not triggered by the input, have a square endpoint indicating that they print “bad serialization!” (the patch’s error message). In other words, the patch rejects all vulnerability-triggering inputs, but it rejects some benign inputs as well.

Fig. 1: Program with patch (1a) and cozy visualization of CSE results for the pre- and post-patched program versions (1b).
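
The behavior summarized in the Figure 1a caption can be reproduced with a small model. The following Python sketch is an illustration only: it is not part of cozy, it approximates the three strtok calls with a single str.split, and it returns a label instead of touching a database. The patched flag and the string return values are assumptions introduced here.

```python
def update(serialized: str, patched: bool = False) -> str:
    """Toy model of update() from Figure 1a; returns a label describing the outcome."""
    # Lines 2-5 of Figure 1a: the overly restrictive patch.
    if patched and serialized.count(";") > 2:
        return "bad serialization!"
    # Approximation of the three strtok calls: command, role, then the remainder.
    parts = serialized.split(";", 2)
    if len(parts) < 3:
        return "bad input!"
    command, role, data = parts
    if command not in ("DELETE", "STORE") or role not in ("root", "guest"):
        return "bad input!"
    if command == "DELETE" and role == "root":
        return f"delete({data})"
    if command == "STORE":
        return f"store({data})"
    return "permission denied"


def main(command: str, role: str, data: str, patched: bool = False) -> str:
    """Serialize the three arguments with semicolons, as main() does in Figure 1a."""
    return update(f"{command};{role};{data}", patched)


# A guest smuggles "root" into the role position through the command argument.
injected = main("DELETE;root", "guest", "x")
# A benign store whose data payload contains a semicolon.
benign = main("STORE", "root", "a;b", patched=True)
```

Under this model, the command string “DELETE;root” with role “guest” reaches the delete branch in the unpatched version, while the patched version rejects both that input and a benign STORE whose data contains a semicolon.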

The example program, shown in Figure 1a,² is a simplified database server interface. The program’s update function takes a serialized string containing arguments command, role, and data separated by semicolons. The command argument must be “STORE” or “DELETE”, the role argument must be “root” or “guest”, and data can be any string. update either (a) stores data to the database, (b) deletes data from the database, or (c) rejects the input as invalid, depending on the values of command and role. A “DELETE” command is only allowed when the role is “root”; the check on line 12 enforces this restriction. The main function serializes the command line arguments into a single semicolon-delimited string (lines 26–27) and passes the string to the update function.

² While cozy operates directly on binaries, we present the program as pseudo-C source code for ease of understanding. An executable version of this example is available on the cozy GitHub repo [5].

The original binary, which corresponds to the pseudocode in Figure 1a minus highlighted lines 2–5, has a command injection vulnerability: if the role argument is “guest” but the command argument is the string “DELETE;root”, then the serialization-deserialization process incorrectly allows a guest to delete data.

Lines 3–4 in Figure 1a show an incorrect patch, which reports an error if the serialized string contains more than two semicolons. While this patch fixes the vulnerability, it is overly restrictive because semicolons should be allowed in the data payload argument, and the patch disallows such payloads.

To validate this change, the patch author runs cozy on the pre- and post-patched binaries. Doing so produces the visualization in Figure 1b. The trees in the left and right panes represent execution paths through the pre- and post-patched binaries, respectively. The operator has used a cozy feature to assert that the delete function should never be called when the role command line argument is “guest” (see Section III-E for details on assertions). The left (pre-patch) pane includes a purple node that indicates an assertion violation; in other words, cozy identifies a path through the pre-patched binary that corresponds to a command injection attack.

The user has clicked on the violation node in the left pane, which highlights all compatible paths in the right pane. Two paths are compatible when there is at least one concrete input

that causes execution to proceed down both paths (see Section III-B for a detailed discussion of compatibility). Additionally, we have searched for paths in the right pane that print the string “bad serialization” (the error message that the patch produces); all such paths have a larger square endpoint.

We can immediately see that all paths compatible with the assertion violation print “bad serialization.” In other words, the patch rejects all inputs that would have triggered the vulnerability. However, the right pane also shows several free-floating squares, which are paths that print “bad serialization” but are incompatible with the path to the assertion violation. Why is the patch rejecting serialized input that would not have violated the assertion?

To investigate further, we click one of the “bad serialization” matches in the right (post-patch) pane, and then hover over a compatible endpoint in the pre-patch pane, as shown in Figure 2. This sequence of actions corresponds to finding a violation-free path through the pre-patched binary that the patch would intercept. The standard output of the pre-patch endpoint shows that this path involves a store operation. If we click that store endpoint in the left (pre-patch) pane, we can ask cozy for concrete inputs that trigger the corresponding paths. As shown in Figure 3, cozy synthesizes an input that indeed has a semicolon in the data argument and that is flagged as an error by the incorrect patch.

Fig. 2: To understand why the first patch attempt rejects valid input, the user finds a violation-free path through the pre-patched binary that is compatible with a “bad serialization” path through the post-patched binary.

Fig. 3: cozy generates a concrete input that exercises the paths from Figure 2. The input command=“STORE”, role=“ROOT”, data=“;” is valid (data is allowed to contain semicolons), but the patch rejects it.

Finally, we replace the bad patch with a check in the main function that the command argument contains no semicolons, and we confirm that “bad command” errors arising from the new patch correspond to either (a) the assertion condition, or (b) “bad input” conditions in the pre-patched binary.

III. COMPARATIVE ANALYSIS

cozy uses symbolic execution to compare binary programs. The tool runs both programs on the same symbolic input until there are no remaining states to explore. Once symbolic execution is complete, cozy pairs each terminal state in the pre-patched binary with each compatible terminal state in the post-patched binary. For each compatible pair, cozy computes a diff of the pair’s register contents, memory contents, and IO side effects. Once this process is complete, the user may either view the results in textual form or explore them via a graphical interface. In this section, we describe the program analysis that cozy implements, and we outline the user’s options for controlling and customizing that analysis.

A. Setup

cozy typically runs in a harness: a Python script that first configures various cozy parameters and then invokes the tool on the target binaries. To streamline the process of creating an application-specific harness, cozy provides an interactive wizard that asks the user a series of questions about how the tool should perform its analysis (see Figure 4 for an example). The wizard generates a harness based on the user’s responses. A typical harness performs the following steps:

1) Create cozy projects for both binaries. A project is an object that acts as an interface between cozy and a binary to analyze.
2) Define any hooks that are needed to model hard-to-emulate functions. Hooks are common on embedded system targets where the callee function performs a side effect that cannot be modeled in the angr emulation environment.
3) Create all symbolic variables that will be used during execution. Symbolic variables can represent function input as well as sources of nondeterminism. A common example of nondeterminism is when the program requires user input from stdin or over the network. For example, one may simulate the getchar function with a hook that returns a symbolic value.
4) Define a run function that takes a project as input and symbolically executes its underlying binary, using the hooks defined in step #2 along with any user-defined preconditions and initial memory values.
5) Call the run function once on the pre-patched binary and once on the post-patched binary to produce run results containing lists of deadended states.
6) Compare the run results to determine which state pairs are compatible, and then check each compatible pair for differences in registers, memory, and side effects.
7) Launch a web browser window that shows a visualization of the comparison results.

Fig. 4: Interactive wizard that generates an application-specific cozy harness based on user input. As shown here, one input to the wizard is the type signature of the function that will serve as the entry point for symbolic execution.

B. Compatible States

Core to cozy’s analysis and visualization is the notion of compatible states. We say that two terminal states $s$ and $s'$ are compatible if there exists at least one concrete input that causes execution to terminate in state $s$ in the pre-patch execution, and in state $s'$ in the post-patch execution. We collect all compatible state pairs into the Compatible relation. More formally:

Definition 1 (Compatibility).

\[ \mathit{Compatible} \triangleq \{ (s, s') \mid \mathit{compatible}(s, s') \} \]

\[ \mathit{compatible}(s, s') \triangleq \texttt{is\_sat}(s.\texttt{constraints} \land s'.\texttt{constraints}) \]

where the notation s.constraints refers to the path constraints of terminal state $s$.³

Unsat core optimization: A naive way to compute the Compatible set is to check all $n^2$ pairs of terminal states for joint satisfiability. cozy implements a memoization-based optimization to enhance performance. When $s.\texttt{constraints} \land s'.\texttt{constraints}$ is unsatisfiable for a pair of states $(s, s')$, cozy computes the unsat core and caches it. The unsat core is a subset of the clauses, ideally minimal, whose conjunction is already unsatisfiable. Later, when we want to know if a new pair $(s, s')$ is compatible, we first check if any previously discovered unsat core is a subset of the joint constraints $s.\texttt{constraints} \land s'.\texttt{constraints}$. If this check succeeds, then the joint constraints are immediately unsatisfiable, and we can skip the expensive call to is_sat. Since most state pairs are incompatible in practice, the unsat core optimization drastically reduces the number of SMT solver queries.

“No orphans” property: A desirable property of our analysis is the “no orphans” property; that is, every terminal state that the analysis reaches in one program should be compatible with at least one terminal state in the other program. The “no orphans” property supports intuitive user interaction, such that whenever the user selects a path in one program, at least one corresponding path in the other program is highlighted. The “no orphans” property holds for the symbolic execution strategies cozy implements: complete execution (the variant described so far) and incomplete concolic execution (Section III-F). Each filter that the user can apply to the states through the cozy interface (Section IV) also preserves this property. We now give a proof for the complete execution case:

Lemma 1 (No Orphans). After complete symbolic execution of two programs $P$ and $P'$, a terminal state $s_i$ from $P$ always has at least one compatible terminal state from $P'$.

Proof. After complete exploration, the path conditions of the terminal states induce a disjoint complete partition over the set of possible inputs. Suppose that the input partition for the terminal states from $P$ is $\{X_0, X_1, \ldots, X_n\}$ and that the input partition for the terminal states from $P'$ is $\{Y_0, Y_1, \ldots, Y_m\}$. Because the inputs are the same for both programs, we have the following union condition:

\[ \bigcup_{i=0}^{n} X_i = \bigcup_{j=0}^{m} Y_j \]

Assume, for contradiction, that for state $s_i$ with corresponding non-empty input set $X_i$, the intersection of $X_i$ with every $P'$ input set $Y_j$ is empty. This is equivalent to saying that $s_i$ is an orphan.

However, this would mean that there exists at least one concrete input $x \in X_i$ that cannot be found in $\bigcup_{j=0}^{m} Y_j$, the union of the $P'$ input sets. This contradicts the previous union condition, which says that the two unions must be equal. Therefore, the state $s_i$ is not an orphan state.

³ Note that angr stores memory and register contents separately from path constraints; cozy is built on top of angr and inherits this design choice.
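
The unsat core cache can be illustrated with a toy model. The sketch below is not cozy's implementation and does not call an SMT solver: it assumes that each path constraint is a conjunction of Boolean literals, so a conjunction is unsatisfiable exactly when some variable appears with both polarities, and that conflicting pair of literals plays the role of the unsat core. All names (find_unsat_core, CompatibilityChecker, solver_calls, and the variables a, b, c) are hypothetical.

```python
def find_unsat_core(joint):
    """Stand-in for the solver: return a conflicting pair of literals, or None.

    A literal is a (variable, polarity) tuple; joint is a frozenset of literals
    interpreted as their conjunction.
    """
    for var, value in joint:
        if (var, not value) in joint:
            return frozenset({(var, value), (var, not value)})
    return None


class CompatibilityChecker:
    """Decides compatibility of state pairs, caching unsat cores between queries."""

    def __init__(self):
        self.cores = []        # unsat cores discovered so far
        self.solver_calls = 0  # number of "expensive" satisfiability checks

    def compatible(self, s, s_prime):
        joint = s | s_prime  # conjunction of both path constraints
        # Cheap check: a cached core contained in joint proves unsatisfiability.
        if any(core <= joint for core in self.cores):
            return False
        self.solver_calls += 1
        core = find_unsat_core(joint)
        if core is None:
            return True
        self.cores.append(core)
        return False


def compatible_pairs(pre_states, post_states, checker):
    """Return index pairs (i, j) whose joint path constraints are satisfiable."""
    return [(i, j)
            for i, s in enumerate(pre_states)
            for j, t in enumerate(post_states)
            if checker.compatible(s, t)]


# Terminal-state path constraints; each list partitions the input space.
pre_states = [
    frozenset({("a", True)}),
    frozenset({("a", False), ("b", True)}),
    frozenset({("a", False), ("b", False)}),
]
post_states = [
    frozenset({("a", True), ("c", True)}),
    frozenset({("a", True), ("c", False)}),
    frozenset({("a", False)}),
]
checker = CompatibilityChecker()
pairs = compatible_pairs(pre_states, post_states, checker)
```

In this example the single cached core (a together with its negation) lets the checker skip four of the nine pairwise queries. Every terminal state on either side appears in at least one compatible pair, which matches what Lemma 1 predicts when both sets of path constraints partition the same input space.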

C. IO Side Effects

In addition to comparing programs’ final states, the cozy user might wish to compare programs in terms of the side effects that they produce. To enable this use case, cozy has a subsystem for modeling IO side effects. Common examples of IO side effects that we have modeled in example programs include writing to stdout/stderr, writing to the network, and writing over a serial connection.

Modeling IO side effects with cozy involves defining a hook for a side effect-producing function that simulates the function’s behavior. When symbolic execution reaches a call to a hooked function, cozy runs the hook and stores the resulting side effect payload in the state. When a child state forks off from its parent, it obtains a copy of the parent’s stored side effects. cozy keeps track of IO side effects over different channels (stdout, network, etc.). When the user examines compatible states in the UI, cozy visually aligns their side effects so that any differences are clear.

D. Observational Differences

Two compatible states with observational differences (i.e., differences in their register values, memory values, or side effects) indicate the existence of an input that causes the two programs to behave differently. Because such differences may be of interest to users, cozy checks each pair of compatible terminal states for equality of their registers, written memory, and IO side effects. Note that these state components may be a combination of concrete and symbolic values because cozy runs programs on symbolic input.

For a compatible pair $(s, s')$, register contents $r$ in $s$ and register contents $r'$ in $s'$ are observationally different when the following condition holds:

\[ \texttt{is\_sat}(s.\texttt{constraints} \land s'.\texttt{constraints} \land r \neq r') \tag{1} \]

cozy constructs analogous conditions for memory writes and IO side effects, and it checks the conditions with an SMT solver. Because cozy targets a micropatch scenario in which differences between programs are small, the tool is able to use several optimizations that reduce the number of SMT queries it must perform. Registers and memory values are often entirely concrete or syntactically identical, so they can be compared for equality without a solver query.

cozy also employs a model-caching feature from angr’s built-in solver. When a formula like Condition 1 is satisfiable, cozy caches the model (concrete assignments that make the condition true). Later, when cozy needs to determine whether a different formula is satisfiable, the tool checks whether any of the cached models satisfy the formula before it attempts to construct a fresh model.

E. Directives

cozy supports several kinds of directives, which are special hooks that run when execution reaches a specified program address. A directive can be thought of as a breakpoint that runs a snippet of user-provided code, for example to debug symbolic execution or provide extra information to the execution engine. cozy supports the following directives:

• Breakpoint pauses execution so that the program state can be inspected by user-provided Python code. When used in conjunction with a Python debugger, the simulation state can be inspected interactively.
• Assume attaches extra constraints to the program state when execution reaches a specified point.
• Assert by default operates like an assert in an ordinary programming language or testing environment. When cozy performs a complete symbolic exploration, an assert can be used to ensure that for all possible inputs, the provided condition cannot be falsified. A common example of an assertion states that an array index stored in a register is in bounds before it is used in an array operation. Listing 1 gives an example of such an assertion. When symbolic execution encounters an assertion directive, it splits the current state into two child states: one in which the assertion is triggered, and one in which it is not. The state with the triggered assertion is stashed, and it is not executed further.
• Postcondition is a special type of assert that executes after the simulated function returns.
• Virtual print produces an IO side effect on the virtual print channel, which is useful for debugging an execution trace within the program. This technique is analogous to a symbolic version of printf debugging.
• Error is a directive that is triggered whenever the program reaches a specified address. When execution reaches an Error directive, cozy stashes the current state; execution does not proceed further. This directive is useful for marking certain branches of the program as throwing an error.

    import angr
    import cozy

    # project, session, and BUFFER_SIZE are assumed to be defined earlier in the harness.
    def index_assertion(state: angr.SimState):
        index = state.regs.r2
        # SGE and SLT are signed comparisons that yield a symbolic Boolean.
        return (index.SGE(0) & index.SLT(BUFFER_SIZE))

    session.add_directives(
        cozy.directive.Assert.from_fun_offset(
            project, "loop", 0x20,
            index_assertion, "index out of bounds"))

Listing 1: Example of creating an assertion for an array bounds check. At instruction loop+0x20, we assert that the index (stored in register r2) must be in range. Note: SGE means “signed greater or equal” and SLT means “signed less than.”

F. Concolic Exploration

By default, cozy uses angr’s standard symbolic execution strategy of exploring non-terminal states in a breadth-first manner. As an alternative strategy, cozy provides a variant of concolic execution [6]. Concolic execution is desirable when the state space is large because it allows for incomplete exploration while still producing a set of final states that satisfy the “no orphans” property (Lemma 1).

In the typical concolic execution scenario as presented in the literature [7], the program first runs on a concrete input and generates an execution trace. Next, the program runs on

symbolic input, which is forced to follow the concrete trace. After symbolic execution reaches a terminal state, a portion of the symbolic path condition is negated and a new concrete input is synthesized from this condition. This newly generated concrete input therefore exercises a different execution path.

cozy achieves results similar to those of ordinary concolic execution, but it uses a different exploration process. When child states are generated from a parent, cozy substitutes concrete inputs into their constraints; the tool then defers (halts exploration of) all children with constraints that evaluate to false. This approach obviates the need for separate concrete execution of the program; it fuses concrete and symbolic execution into a single process. This fusion decreases the engineering effort required to implement the concolic approach and integrate it with the existing complete exploration code.

Once symbolic execution reaches a terminal state, cozy uses one or both of the following heuristics to decide how to continue exploration:

1) Termination Heuristic: A termination heuristic determines whether cozy should halt concolic execution. The default termination heuristic says that concolic execution should continue until the exploration of state space is complete. cozy also enables the user to choose termination heuristics based on cyclomatic complexity and basic block code coverage metrics; these heuristics may lead to incomplete exploration. In addition, the user can define custom termination metrics.
2) Candidate Heuristic: If the termination heuristic says that exploration should continue, cozy needs to decide which deferred state to explore next. Choosing a deferred state is equivalent to negating part of the path condition of a previous exploration. The “trivial” candidate heuristic simply chooses an arbitrary deferred state from the list of options. cozy also provides a more complex n-gram branch coverage heuristic [8] that attempts to choose the deferred state with the most unique basic block address history.

Once the candidate heuristic chooses the next state to explore, cozy generates a new concrete input from that state’s path constraints. cozy then feeds this concrete input into both programs under comparison by activating the appropriate deferred states (those with path conditions that are now satisfied). The program used to generate the concrete input alternates between the pre- and post-patch binaries to ensure that both versions of the function are being explored.

By feeding the same concrete input to both programs, cozy ensures that no orphaned states will be generated. This invariant is important because it ensures that any terminal state a user selects in the cozy UI is compatible with at least one state in the other program.

IV. VISUALIZATION

The cozy Graphical User Interface (GUI) is a simple web application. As shown in Figure 5, the GUI presents the user with three main interfaces: (1) a menubar; (2) a pair of panels displaying two symbolic execution trees; and (3) a “diff panel,” which presents detailed comparative information once the user selects a pair of branches from the execution trees. In the remainder of this section, we describe the GUI’s presentation of symbolic execution trees and its diff panel in more detail.

Fig. 5: The cozy GUI consists of (1) a menubar, (2) two panels displaying symbolic execution trees, and (3) a diff panel that enables the user to compare program branches across various dimensions.

A. Symbolic Execution Trees

A symbolic execution tree depicts the results of symbolically executing a given program with angr. The root is the initial program state, an internal node is the program state after execution of a basic block, and an edge is a symbolic execution step.

When analyzing symbolic execution results, the user needs a way to cut out extraneous noise. Typically, only a small subset of all of the possible paths through a program is of genuine interest. The cozy GUI offers three main mechanisms for focusing on the relevant parts of symbolic execution results: highlighting, pruning, and compression.

Several types of program states that are likely to be significant are automatically highlighted in the GUI. These include states that raised errors during execution, states at which a syscall or SimProc (modeled function) call occurred, states at which the program exceeded user-specified boundaries on loop iteration, and states at which a user-provided assert or postcondition failed. Different colors indicate different categories of potentially significant states. The color palette, and toggles to hide or show each type of state, are available under the “View” menu in the menubar.

Besides calling attention to relevant results, it can be helpful to filter out irrelevant results. cozy’s main mechanism for filtering out irrelevancies is pruning. Pruning works as follows: cozy prunes (hides) each branch unless it is “interestingly related” to a compatible branch in the facing tree, where the user specifies (via the GUI) which relationships are interesting. For example, the user can indicate that two branches are interestingly related when their terminal states have different

6
memory contents; pruning will then leave only the branches that differ from at least one compatible branch of the facing tree in terms of their final memory contents. The relations that the GUI checks are symmetric, so if a branch b survives pruning because it is partnered with a compatible branch b′, then b′ will survive as well. Therefore, pruning will never result in an orphaned branch.

Several pruning actions are available under the “Prune” menu. In addition to memory differences, cozy can check for differences in register contents as well as stdout and stderr output. The tool can also check whether at least one of two compatible branches ends with an error state, and whether at least one branch produces stdout that does not match a user-provided regular expression. In addition, the user can apply multiple pruning relations simultaneously, which results in pruning with the conjunction of the selected relations. We found that while it is possible in principle to apply arbitrary Boolean combinations of prunings, this approach makes for an excessively complicated UI. Hence, we restrict ourselves to the simpler case of pure conjunction.

The final mechanism that cozy provides for sorting through the results of symbolic execution is compression: merging successive nodes that represent uninteresting or inevitable computation steps. There are two available compression levels: the user can (1) merge adjacent nodes that have identical constraints and (2) merge every node that has a unique child with that child node, eliminating all straight-line sequences of symbolic states (see footnote 4).

Footnote 4: Symbolic execution can add a constraint without branching when, for example, the result of adding the negation of the constraint is unsatisfiable.

Besides sorting through branches using highlighting, filtering, and compression, the cozy user must be able to extract information from symbolic execution results. Within the GUI, there are two primary features that expose information about a particular branch to a user. One of these features, tooltips, offers simple at-a-glance information about the nodes in a branch, taken in isolation. The other feature, cozy’s diff panel (Section IV-B), looks at a branch in comparison with a compatible branch selected from the facing tree.

A tooltip appears when the cursor hovers over a node. Depending on the type of node, different kinds of information are available. A tooltip displays the following information: the assembly instructions provided by angr’s disassembler for the given state; the representation of those instructions in VEX [9] (the IR over which angr performs symbolic execution); the operative symbolic constraints; and concrete examples of possible contents of stdout and stderr. Special states (roughly those with special highlighting rules as described above) expose additional information. For example, error states expose error messages, and states that invoke SimProcs give the name of the function being hooked as well as the library that provides it.

To get a genuinely comparative analysis, however, a user needs to select two full branches as follows. First, the user clicks on the leaf of a candidate comparison branch, and that branch will be highlighted, along with all compatible branches in the facing tree. The user can then click on a compatible branch from the facing tree and begin to use the diff panel, as described in the next section.

B. The Diff Panel

The diff panel becomes available when the user selects a pair of compatible branches for deeper analysis. The types of comparisons that the diff panel supports can be grouped into three broad categories: comparisons of event streams, terminal states, and concrete inputs.

The sequence of nodes along a symbolic execution path corresponds to several different kinds of event streams: the stream of assembly instructions executed, the stream of read and write operations on memory and registers, and the stream of modeled IO effects. cozy compares these types of event streams using a familiar git-style line diff. An example of an assembly stream comparison appears in Figure 6, where it is possible to see the exact region where program execution passes through a small patch applied to a shared object file.

For each type of event stream comparison, when the user hovers over an event, the UI highlights the tree node that corresponds to that event. This behavior enables the user to intuitively connect the contents of the tree view to the contents of the event stream. In some cases, the event stream also contextually exposes other types of information. For example, the stream of assembly instructions can provide the location in the original source that corresponds to a given line of assembly, if this information is recoverable from DWARF debug information in the binaries that cozy has analyzed.

In addition to event stream comparisons, cozy supports comparisons of terminal states. For example, cozy can compare the final memory contents of two compatible branches. This process may involve comparing symbolic values, since terminal states can contain symbolic values. In such a case, cozy checks whether the symbolic values in the states are logically equivalent. If they are, cozy reports this fact, and if they are not, cozy generates some concretions that illustrate a possible scenario in which the terminal states differ in spite of an identical initial state.

Finally, the diff panel can generate concrete inputs that exercise compatible branches under comparison. Compatibility guarantees the existence of an input that produces the two sequences of behaviors that the branches represent. The concretion view in the GUI’s diff panel displays example inputs that are shared between the two compatible paths. This feature, in combination with cozy’s pruning functionality, makes it possible to recover specific inputs that generate execution paths of interest, especially paths where behavior differs interestingly between the two binaries being compared.

Compatibility does not guarantee that every input that produces the behavior associated with the first branch also produces the behavior associated with the second branch, or vice versa. In cases where there are inputs that will produce the behavior of the first branch, but not the second (or vice versa), cozy also makes these inputs available, and in cases
where no such inputs exist, cozy makes it clear that one of the two branches “refines” the other, or that the two branches are “equivalent,” in the sense that they represent behaviors that are produced by exactly the same set of inputs.

Fig. 6: Diff panel showing the assembly instructions in the original (left) and patched (right) versions of a program. Red and green highlighting represents deletion and insertion, respectively.

V. EVALUATION

We evaluated cozy by measuring the tool’s execution time as it symbolically executed pairs of binary functions and checked a relative correctness property over each pair. The evaluation goals were as follows: (1) to observe cozy’s execution speed on widely used binary functions; (2) to demonstrate cozy’s ability to verify a desirable correctness property; and (3) to produce a set of cozy test harnesses that can serve as a reference point for other users of the tool.

A. Data Set

Each instance in the data set is a pair of binary functions:
1) A function f taken from a Linux base64 binary
2) A modified version of f instrumented with code that supports coverage-guided fuzz testing

To create the data set, we used the RetroWrite binary rewriting tool [10] to instrument the base64 binary with code that supports integration with the American Fuzzy Lop (AFL) fuzzer [11]. We then selected 15 functions from the original binary and paired them with their instrumented versions from the modified binary.

B. Correctness Property and Experimental Setup

Because the instrumentation only exists to support fuzzing, a function from the original binary should have the same observable behavior as its instrumented counterpart. We use cozy to verify this property as follows. First, cozy symbolically executes both functions and computes the set of compatible state pairs. Second, for each pair, cozy checks an assertion that the states agree in terms of their register and memory contents. If cozy can falsify this assertion, then there exists an input that causes the two functions to behave differently, and verification fails.

The precise formulation of state agreement depends on a function’s return type. For example, if a function returns a 64-bit integer, then two compatible states hold equal return values when the full contents of their RAX registers are equal (RAX is the 64-bit return register for the x86-64 ISA). However, in the case of a function that returns a 32-bit integer, only the lower 32 bits of RAX (i.e., register EAX) must be equal across the states; the higher-order bits of RAX are allowed to differ. Parameter types place similar constraints on the functions’ symbolic input. For these reasons, each data instance requires a custom test harness that captures function-specific behavior. These harnesses, along with our full evaluation framework, are included in the public cozy repository [5].

C. Results

Using the process described above, we checked each function pair in the data set for equivalent observable behavior. The evaluation took place on a machine running Ubuntu 20.04 with an Intel i9-12900H processor and 64 GB of RAM. The results appear in Figure 7. The table shows symbolic execution time for the original and modified binaries, as well as comparison time, which includes time spent computing compatible states and comparing register and memory contents. cozy verifies that the instrumentation code leaves each function’s observable behavior unaffected.

VI. RELATED WORK

Computing differences between programs has a long history in the literature. Unlike the symbolic execution discussed here, the majority of previous tools operate on the textual or abstract syntax tree (AST) level [12]–[14], and do not attempt any actual simulation of the programs under analysis.

The diff utility [15] distributed with Unix-based operating systems is one example of an early comparison program. diff reports differences in lines, and performs a longest

Function Name                  # Terminal States  Symbolic Execution Time (s)   Comparison Time (s)
                                                  Original      Instrumented
base64_decode_alloc_ctx        31                 16.0683       41.1273         2.0919
base64_decode_ctx              31                 15.1894       38.2189         2.0648
base64_decode_ctx_init         1                  0.0108        0.0685          0.0451
base64_encode_alloc            17                 6.8573        10.5143         3.9269
base64_encode                  17                 7.6123        12.4908         4.6590
clone_quoting_options          1                  0.1203        0.1525          0.0461
close_stdout                   1                  1.4348        5.3738          0.0699
close_stdout_set_file_name     1                  0.0098        0.0656          0.0469
close_stdout_set_ignore_EPIPE  1                  0.0092        0.0629          0.0442
close_stream                   58                 6.2557        16.2690         4.0236
decode_4                       29                 4.3319        6.7472          1.5532
deregister_tm_clones           1                  0.0125        0.0529          0.0453
fadvise                        2                  0.1976        0.1738          0.1010
get_quoting_style              1                  0.1713        0.1030          0.0506
isbase64                       1                  0.1334        0.0849          0.0401

Fig. 7: Binary functions from a Linux base64 utility, numbers of terminal states that cozy symbolic execution finds for them, and running times for cozy symbolic execution and verification. For each function, the original and instrumented versions have the same number of terminal states because the instrumentation code is branchless. A comparison time is the total time spent comparing all pairs of terminal states drawn from an original and an instrumented function.

common subsequence computation to attempt to align two text files. The diff utility is generic, in the sense that it will function over any programs that can be represented in text files. However, because this approach does not understand program semantics, it cannot be used to provide a rich understanding of program behavior.

cozy does utilize a textual diff over the assembly trace (see Figure 6) of a program in the visualization interface. When two terminal states are selected, the assembly pane will give a linear list of instructions executed for that trace, in the format of a color-coded, line-based diff.

The most relevant prior work to our approach is that of Person et al. in their paper on “Differential Symbolic Execution” [1]. Our approach differs in a number of key ways. First, we analyze binary programs, whereas Person’s approach analyzes high-level Java programs. Second, the method by which we check for pair compatibility and report deltas differs. In Person’s computation of the partition effect delta, path conditions are checked for strict equivalence using an “if and only if.” This approach may detect inconsequential changes in control flow. Our approach is only concerned with observational differences: differences in registers, memory, and IO side effects after execution. It ignores differences at intermediate execution points that Person’s tool would flag. Finally, our analysis of final register, memory, and IO side effect content is more fine-grained than Person’s approach, which has enabled us to create a novel visualization interface.

Person additionally discusses symbolic summary, which we do not utilize in our execution model. Symbolic summaries may be used to summarize the effects of common blocks of code. Additionally, abstract summaries may be used to skip execution of code that the two programs under comparison have in common. For example, a common code block B, when fed identical inputs (registers and memory), will result in the two programs reaching an identical ending state, regardless of the actual execution that occurs within B.

C standard library hooks are one location where symbolic summaries are currently used in cozy. These hooks intercept calls to standard library functions, and perform the equivalent computation via a Python callback. The hooks are meant to simplify hard-to-execute standard library functions, typically resulting in far fewer child states.

Abstract symbolic summaries, while providing interesting benefits, do suffer from several drawbacks that make them infeasible to use in cozy. Due to their black-box nature, abstract symbolic summaries do not allow for fine-grained analysis of register and memory contents in terminal states. Additionally, abstract symbolic summaries, since they are essentially computation that is skipped, do not allow for generation of concrete example inputs that lead to selected terminal states. In our experience with the micropatching process, generating concrete example inputs is essential for aiding in understanding program behavior.

Shadow symbolic execution is another body of work [16]–[18] that functions on principles similar to cozy. In shadow symbolic execution, an original and patched program are symbolically executed in lockstep until divergence is reached. Divergent program points are used to generate new test cases that exercise the impact of the patch. Divergence must be manually annotated by constructing a combined original and patched program via a special change() macro.

cozy differs in several key ways from shadow symbolic
execution. cozy executes the original and patched binary in two separate symbolic execution runs, removing the need for manual change() annotations. Additionally, cozy operates on binary programs, whereas the literature on shadow symbolic execution has focused on Java, C, and C++ programs.

VII. DISCUSSION

As part of the DARPA Assured Micropatching (AMP) program, we tested cozy on a variety of third-party challenge problems. For example, we used cozy to (1) examine a proposed micropatch for the Army MRZR platform; (2) identify a shortcoming in the initial patch; and (3) show that all execution paths are correctly handled with an improved patch [19]. We have additionally created a variety of example programs designed to exercise different portions of the tool. In this section, we discuss our observations of the micropatching process and how cozy performs in the overall workflow.

The primary challenge of understanding micropatch behavior is making sense of the large volume of information that cozy generates. For all but the simplest programs, the textual report cozy generates is too cumbersome to understand. This fact led to the creation of the interactive visualization interface.

Direct examination of the symbolic values attached as state constraints, or stored in registers or memory, is generally unhelpful. These symbolic expressions are typically large and too complex to be easily understood with manual inspection. cozy’s ability to generate example concrete inputs for a pair of compatible states has proven both intuitive and useful.

States with assertion failures are flagged with a purple color, making them easily visible in the tree view. One common workflow is to check that all assertions triggered in the prepatched program are not triggered in the postpatch program. Prepatched assertion failures should be compatible with postpatch states that jump to micropatch code. By exploring various execution traces, concrete examples, and comparisons, the operator can achieve a high degree of assurance that the micropatch is behaving exactly as intended.

The skills required to use cozy overlap with those needed to use angr. A rough understanding of assembly code is required to attach assertions at certain program points. The initial effort to apply cozy to two versions of an application is outlined in Section III-A. The top-level arguments must be constructed, which requires knowledge of the argument types and their memory layouts. One can obtain this information from original source code or from a reverse engineering tool like Ghidra [20].

Since cozy uses symbolic execution as its base analysis, it inherits the challenges of that technique: path explosion, non-termination, and costly SMT queries. To mitigate the path explosion problem, we have implemented joint concolic execution (Section III-F). The concolic execution we have implemented may be used for incomplete exploration while preserving terminal state compatibility. The difficulty of generating “interesting” concrete inputs is still a weakness of this approach. Although the heuristics attempt to explore deferred execution states that have unique basic block histories, we cannot know what concrete inputs will lead to interesting future states.

Non-termination presents another problem for symbolic execution. It is obviously difficult, in general, to detect non-termination. In some programs, non-termination is a feature; for example, in event-handling loops. To deal with non-termination, we allow the user to place an upper bound on the number of times a loop executes. As a simple mechanism to avoid non-termination, cozy uses angr’s LocalLoopSeer exploration technique, which detects loops by recording the history of execution. If the upper bound on instruction iteration count is reached, we halt execution of that state and stash it. In the visualization, the spinning state can be seen as a downward-facing arrow.

In this paper, we have not yet touched on the creation of formal specifications for intended patch behavior. Our initial work on the DARPA AMP program focused on this area and heavily utilized the CBAT tool [21]. A formal specification boiled down to creating an SMT formula with an if-then-else (ITE) at the top level. The condition of the ITE determined when the patch changed program behavior, the true branch specified how the patch changed behavior, and the false branch specified that memory and registers must be identical in all other circumstances.

Tool operators had several complaints about creating these formal specifications: (1) the specifications were difficult to write, requiring the construction of complex SMT formulas; and (2) writing a formal specification was similar to writing the patch in the first place, so there were complaints about having to do the same work twice. Based on feedback from these third-party operators, we determined that an interactive, visualization-based approach would be more helpful.

The feedback loop created by the cozy tool is, in essence, an interactive way to explore the formal specification space. It is possible to use cozy to check directly that a patch changes behavior only in some specified way. This kind of formal verification is accomplished by writing a function that takes in a compatible state pair and returns an assertion condition. If cozy can falsify the assertion for any compatible state pair, then verification fails.

VIII. CONCLUSION

In this paper we have presented cozy, a Python-based framework built on top of angr that uses symbolic execution to detect observable differences in binary programs. The cozy project is designed to analyze micropatches, which are small binary or assembly patches inserted into existing legacy programs. By using cozy’s novel visualization interface, the tool’s operator can gain confidence that a given micropatch has its intended effect. Operators who already have experience with the angr symbolic execution framework will find it easy to get started with cozy. We hope that operators will find cozy useful as part of the verification step of the micropatch development process.
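
As a small illustration of the pairwise check described in Sections V-B and VII, the following sketch shows the general shape of an assertion function over a compatible state pair. It is not cozy's API: State, make_agreement_check, and verify are hypothetical names introduced here, and the states hold concrete values, whereas cozy works with symbolic values and asks an SMT solver whether the assertion can be falsified. The sketch assumes only what the text states, namely that agreement on the return value depends on the width of the return type.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass(frozen=True)
class State:
    """Hypothetical concrete stand-in for a terminal state (not a cozy type)."""
    rax: int        # value of the 64-bit return register on x86-64
    memory: bytes   # snapshot of the memory region under comparison


Assertion = Callable[[State, State], bool]


def make_agreement_check(ret_bits: int) -> Assertion:
    """Build an assertion that compares only the low `ret_bits` bits of RAX."""
    mask = (1 << ret_bits) - 1

    def states_agree(pre: State, post: State) -> bool:
        same_return = (pre.rax & mask) == (post.rax & mask)
        return same_return and pre.memory == post.memory

    return states_agree


def verify(pairs: Iterable[Tuple[State, State]],
           assertion: Assertion) -> Optional[Tuple[State, State]]:
    """Return the first pair that falsifies the assertion, or None if all hold."""
    for pre, post in pairs:
        if not assertion(pre, post):
            return (pre, post)
    return None


# One compatible pair whose RAX values differ only in the upper 32 bits.
pairs = [
    (State(0x0000_0000_0000_002A, b"ok"), State(0xFFFF_FFFF_0000_002A, b"ok")),
]

result32 = verify(pairs, make_agreement_check(32))  # 32-bit return type
result64 = verify(pairs, make_agreement_check(64))  # 64-bit return type
```

Under a 32-bit return type the pair agrees, because only the EAX portion of RAX is compared; under a 64-bit return type the same pair is a counterexample. This is why each function in the evaluation needs its own harness.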

REFERENCES

[1] S. Person, M. B. Dwyer, S. Elbaum, and C. S. Păsăreanu, “Differential Symbolic Execution,” in Foundations of Software Engineering (FSE), 2008. [Online]. Available: https://doi.org/10.1145/1453101.1453131
[2] G. P. Farina, S. Chong, and M. Gaboardi, “Relational Symbolic Execution,” in Principles and Practice of Declarative Programming (PPDP), 2019. [Online]. Available: https://doi.org/10.1145/3354166.3354175
[3] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, “SoK: (State of) The Art of War: Offensive Techniques in Binary Analysis,” in IEEE Symposium on Security and Privacy, 2016. [Online]. Available: https://doi.org/10.1109/SP.2016.17
[4] C. Helbling, “cozy Python Package Index (PyPI) entry,” https://pypi.org/project/cozy-re/.
[5] C. Helbling, “GitHub repository for the cozy development,” https://github.com/draperlaboratory/cozy.
[6] P. Godefroid, N. Klarlund, and K. Sen, “DART: Directed Automated Random Testing,” SIGPLAN Not., vol. 40, no. 6, pp. 213–223, Jun. 2005. [Online]. Available: https://doi.org/10.1145/1064978.1065036
[7] K. Sen, “Concolic Testing,” in Automated Software Engineering (ASE), 2007. [Online]. Available: https://doi.org/10.1145/1321631.1321746
[8] J. Wang, Y. Duan, W. Song, H. Yin, and C. Song, “Be Sensitive and Collaborative: Analyzing Impact of Coverage Metrics in Greybox Fuzzing,” in Research in Attacks, Intrusions and Defenses (RAID), 2019. [Online]. Available: https://www.usenix.org/conference/raid2019/presentation/wang
[9] Y. Shoshitaishvili, R. Wang, C. Hauser, C. Kruegel, and G. Vigna, “Firmalice - Automatic Detection of Authentication Bypass Vulnerabilities in Binary Firmware,” in Network and Distributed System Security Symposium (NDSS), 2015. [Online]. Available: https://www.ndss-symposium.org/ndss2015/firmalice-automatic-detection-authentication-bypass-vulnerabilities-binary-firmware
[10] S. Dinesh, N. Burow, D. Xu, and M. Payer, “RetroWrite: Statically Instrumenting COTS Binaries for Fuzzing and Sanitization,” in IEEE Symposium on Security and Privacy, 2020. [Online]. Available: https://doi.org/10.1109/SP40000.2020.00009
[11] M. Zalewski, “american fuzzy lop,” 2017. [Online]. Available: https://lcamtuf.coredump.cx/afl/
[12] J. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-Grained and Accurate Source Code Differencing,” in Automated Software Engineering (ASE), 2014. [Online]. Available: http://doi.acm.org/10.1145/2642937.2642982
[13] B. Fluri, M. Würsch, M. Pinzger, and H. Gall, “Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 725–743, 2007.
[14] S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom, “Change Detection in Hierarchically Structured Information,” SIGMOD Rec., vol. 25, no. 2, pp. 493–504, Jun. 1996. [Online]. Available: https://doi.org/10.1145/235968.233366
[15] J. W. Hunt and M. D. McIlroy, An Algorithm for Differential File Comparison. Bell Laboratories Murray Hill, 1976.
[16] H. Palikareva, T. Kuchta, and C. Cadar, “Shadow of a Doubt: Testing for Divergences between Software Versions,” in International Conference on Software Engineering (ICSE), 2016. [Online]. Available: https://doi.org/10.1145/2884781.2884845
[17] Y. Noller, H. L. Nguyen, M. Tang, and T. Kehrer, “Shadow Symbolic Execution with Java PathFinder,” ACM SIGSOFT Softw. Eng. Notes, vol. 42, no. 4, pp. 1–5, 2017. [Online]. Available: https://doi.org/10.1145/3149485.3149492
[18] T. Kuchta, H. Palikareva, and C. Cadar, “Shadow Symbolic Execution for Testing Software Patches,” ACM Trans. Softw. Eng. Methodol., vol. 27, no. 3, pp. 10:1–10:32, 2018. [Online]. Available: https://doi.org/10.1145/3208952
[19] C. Helbling, “cozy subdirectory for DARPA AMP MRZR challenge problem,” https://github.com/draperlaboratory/cozy/tree/main/test_programs/amp_target3_hackathon.
[20] NSA, “Ghidra,” https://www.ghidra-sre.org/.
[21] C. Fortuna, C. Casinghino, S. Lasser, J. Paasch, C. Roux, and P. Zucker, “CBAT: A Comparative Binary Analysis Tool,” in Binary Analysis Research Workshop (BAR), March 2024. [Online]. Available: https://dx.doi.org/10.14722/bar.2024.23009
