steg86: hiding messages in x86 binaries
rust munich meetup
william woodruff
august 25 2020
agenda
- yours truly
- steganography?
- steg on programs
- x86 instruction encoding
- steg86
yours truly
- william woodruff
- @8x5clPW2 • yossarian.net • blog.yossarian.net
- senior security engineer @ trail of bits
- work: program analysis research, mostly in LLVM
- disclaimer: independent talk, not representing employer
- open source: member of homebrew, miscellaneous contributor
steganography?
- “hiding data within data”
- not cryptography (the goal is to hide that a message exists, not to make it unreadable)
- different techniques for different data
- popular targets:
  - images
  - sound files
  - plain text
- what about programs?
steg on programs
programs are a natural choice for steg
- can be very large (lots of info capacity)
- complex binary formats (PE, Mach-O, ELF)
- complex instruction encodings (x86/AMD64, ARM w/ Thumb)
- present on every computer, not inherently suspicious
steg on programs: approaches
- hide information in stack layout, register selection
  - problem: need the program’s source
  - problem: need to maintain a compiler...
- hide information in the format itself (e.g. segment order)
  - problem: specific to a format, may not apply to others
- rewrite the program after compilation
  - ex: add eax, -50 → sub eax, 50
  - problem: code/data disambiguation (difficult to solve)
  - problem: relocations, position independent code (-fPIC)
  - problem: CPU-level semantics (arithmetic, status flags)
- can we do better?
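the "CPU-level semantics" problem can be made concrete with the add/sub example above. a minimal sketch, assuming 32-bit registers and modelling only the carry flag (the helper names `add32` and `sub32` are hypothetical, not part of steg86): both rewrites produce the same value, but they leave CF in different states, so any later instruction that reads the flags may behave differently.

```rust
/// Result and carry flag (CF) of a 32-bit x86 `add`.
/// CF is set when the unsigned addition carries out of bit 31.
fn add32(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_add(b)
}

/// Result and carry flag (CF) of a 32-bit x86 `sub`.
/// CF is set when the unsigned subtraction needs a borrow.
fn sub32(a: u32, b: u32) -> (u32, bool) {
    a.overflowing_sub(b)
}
```

with eax = 100, `add32(100, (-50i32) as u32)` and `sub32(100, 50)` agree on the result but disagree on the carry flag.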
x86 instruction encoding
- variable length (up to 15 bytes)
- extremely complex (decades of compat, overloaded fields)
- rich source/sink combinations
  - register-to-register (mov ebx, eax)
  - register-to-memory (mov dword [1337], eax)
  - memory-to-register (mov eax, dword [1337])
  - immediate-to-register (mov eax, 1337)
  - immediate-to-memory (mov dword [1337], 1337)
x86 instruction encoding: modr/m
- essentially an 8-bit lookup table of (some) operand encodings
- doesn’t cover all possible operands, for historical reasons...
- simplest case: encodes one or two operands
  - reg/opcode field: one register operand
  - r/m field: one register or memory operand
- enables mem-to-reg, reg-to-mem, reg-to-reg operations
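the fields described above can be pulled out of a ModR/M byte with a few shifts and masks. a minimal sketch (the `ModRm` struct and `split_modrm` helper are illustrative names, not taken from steg86 or iced): the top two bits are the mode, the middle three the reg/opcode field, the low three the r/m field, and mode 0b11 means the r/m field names a register rather than memory.

```rust
/// The three fields packed into a ModR/M byte.
#[derive(Debug, PartialEq, Eq)]
struct ModRm {
    mode: u8, // bits 7..6: addressing mode
    reg: u8,  // bits 5..3: register operand (or opcode extension)
    rm: u8,   // bits 2..0: register or memory operand
}

/// Split a ModR/M byte into its mode, reg and r/m fields.
fn split_modrm(byte: u8) -> ModRm {
    ModRm {
        mode: byte >> 6,
        reg: (byte >> 3) & 0b111,
        rm: byte & 0b111,
    }
}

impl ModRm {
    /// Mode 0b11 means both operands are registers.
    fn is_reg_to_reg(&self) -> bool {
        self.mode == 0b11
    }
}
```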
x86 instruction encoding: xor
opcode   instruction
31 /r    xor r/m32, r32
33 /r    xor r32, r/m32
- reg-to-mem, mem-to-reg, reg-to-reg, ...
- there are two reg-to-reg encodings!
  - 31 C0 → xor eax, eax
  - 33 C0 → also xor eax, eax!
  - they’re even the same size!
  - 64-bit variants (w/ REX prefix) work too!
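to see why the two encodings are interchangeable, here is a minimal sketch that converts one into the other. it assumes the plain two-byte 32-bit form (opcode plus ModR/M, no prefixes), and `xor_dual` is a hypothetical helper, not how steg86 does it (steg86 re-encodes via iced). flipping the direction bit of the opcode turns 31 into 33 and back; swapping the reg and r/m fields keeps the destination and source registers the same.

```rust
/// Given a reg-to-reg `xor` encoded as [opcode, modrm], return the
/// other encoding of the same instruction, or None if the bytes are
/// not a register-to-register 31 /r or 33 /r.
fn xor_dual(bytes: [u8; 2]) -> Option<[u8; 2]> {
    let [opcode, modrm] = bytes;
    if opcode != 0x31 && opcode != 0x33 {
        return None; // not a 32-bit xor in either direction
    }
    if modrm >> 6 != 0b11 {
        return None; // r/m is a memory operand: no dual exists
    }
    let reg = (modrm >> 3) & 0b111;
    let rm = modrm & 0b111;
    // Flip the direction bit (0x31 <-> 0x33) and swap the two fields.
    Some([opcode ^ 0x02, 0b1100_0000 | (rm << 3) | reg])
}
```

for example, 31 C3 and 33 D8 both mean xor ebx, eax, and applying the function twice gives back the original bytes.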
steg86
- central conceit: each reg-to-reg pair represents one bit of information
  - with enough bits, we can hide messages!
- binary format independent
  - uses goblin to unpack PE/ELF/Mach-O binaries
  - encodings are the same size, so PIC/relocations aren’t broken
- uses iced for decoding/encoding/semantics
- ~700 lines of rust total (much of it constants)
- CLI: steg86 {profile,embed,extract}
steg86: semantic duals
- it turns out there are a bunch of these
  - 9 instructions (add, adc, sub, sbb, and, or, xor, mov, cmp)
  - 4 variants (8, 16, 32, 64-bit) each (actually 3 in any particular CPU mode...)
- each dual gives us 1 bit of information
  - minus a little space for a header with metadata
- how common are these instructions?
$ steg86 profile /bin/bash
Summary for /bin/bash:
175828 total instructions
27957 potential semantic pairs
27925 bits of information capacity (3490 bytes)
- not bad!
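the numbers in the profile output fit together with simple arithmetic. a minimal sketch, assuming the gap between pairs and usable bits (27957 - 27925 = 32) is the header mentioned above; the 32-bit figure is inferred from the output, not from the steg86 source, and `capacity` is a hypothetical helper.

```rust
/// Usable capacity as (bits, whole bytes), given the number of
/// candidate pairs and the number of bits reserved for a header.
fn capacity(pairs: u32, header_bits: u32) -> (u32, u32) {
    let bits = pairs.saturating_sub(header_bits);
    (bits, bits / 8) // partial trailing bytes are not usable
}
```

`capacity(27957, 32)` reproduces the 27925 bits and 3490 bytes reported for /bin/bash.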
steg86: semantic duals
each pair represents (false, true)...
static SEMANTIC_PAIRS: &[(Code, Code)] = &[
    // ADD
    (Code::Add_rm8_r8, Code::Add_r8_rm8),
    (Code::Add_rm16_r16, Code::Add_r16_rm16),
    (Code::Add_rm32_r32, Code::Add_r32_rm32),
    (Code::Add_rm64_r64, Code::Add_r64_rm64),
    // ... snip ...
];
steg86: profiling
for every instruction in the program...
// skip instructions we don't support
if !SUPPORTED_OPCODES.contains(&instruction.code()) {
    continue;
}
// skip non reg-to-reg instructions
if instruction.op0_kind() != OpKind::Register
    || instruction.op1_kind() != OpKind::Register
{
    continue;
}
offsets.push(instruction.ip() as usize);
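the same two filters can be tried without a disassembler. a minimal std-only sketch: `Insn`, `Mnemonic`, `Operand` and `SUPPORTED` are toy stand-ins for iced's decoded instruction types and steg86's opcode table, invented here for illustration.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Operand {
    Register,
    Memory,
    Immediate,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Mnemonic {
    Add,
    Xor,
    Mov,
    Lea,
}

/// Toy decoded instruction: address, mnemonic and two operand kinds.
struct Insn {
    ip: usize,
    mnemonic: Mnemonic,
    op0: Operand,
    op1: Operand,
}

/// Mnemonics that have two reg-to-reg encodings (toy subset).
const SUPPORTED: &[Mnemonic] = &[Mnemonic::Add, Mnemonic::Xor, Mnemonic::Mov];

/// Offsets of every instruction that can carry one hidden bit.
fn candidate_offsets(program: &[Insn]) -> Vec<usize> {
    program
        .iter()
        // skip instructions we don't support
        .filter(|i| SUPPORTED.contains(&i.mnemonic))
        // skip non reg-to-reg instructions
        .filter(|i| i.op0 == Operand::Register && i.op1 == Operand::Register)
        .map(|i| i.ip)
        .collect()
}
```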
steg86: embedding
for each candidate instruction...
let new_code = {
    // find the pair that contains the current encoding
    let tuple = SEMANTIC_PAIRS
        .iter()
        .find(|&&t| old_code == t.0 || old_code == t.1)
        .unwrap();
    // false => first encoding (tuple.0), true => second (tuple.1)
    match (bit, tuple.0 == old_code) {
        (false, true) | (true, false) => {
            // already correct!
            continue;
        }
        (false, false) => tuple.0,
        (true, true) => tuple.1,
    }
};
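the selection logic above can be exercised on its own. a minimal sketch with a toy `Code` enum and a two-entry `PAIRS` table (both invented for illustration; the real table uses iced's `Code` values), returning the encoding that represents a given bit instead of skipping with `continue`.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Code {
    AddRmR,
    AddRRm,
    XorRmR,
    XorRRm,
    Nop, // has no dual
}

/// Each pair is (encoding for false, encoding for true).
const PAIRS: &[(Code, Code)] = &[
    (Code::AddRmR, Code::AddRRm),
    (Code::XorRmR, Code::XorRRm),
];

/// The code that encodes `bit` for the pair containing `old`,
/// or None if `old` is not part of any pair.
fn code_for_bit(old: Code, bit: bool) -> Option<Code> {
    let &(zero, one) = PAIRS.iter().find(|&&(a, b)| old == a || old == b)?;
    Some(if bit { one } else { zero })
}
```

when the returned code equals the old one, nothing needs rewriting, which is the "already correct!" case.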
steg86: embedding
// rebuild the instruction with the chosen encoding, same registers
let new_instruction = Instruction::with_reg_reg(
    new_code,
    instruction.op0_register(),
    instruction.op1_register(),
);
let new_len = encoder
    .encode(&new_instruction, offset as u64)
    .map_err(|s| anyhow!(s))?;
// ... snip ...
// overwrite the old bytes in place; the dual has the same length
text_copy
    .data
    .splice(offset..(offset + new_len), encoder.take_buffer());
steg86: results
binary diff: [figure]
$ cargo install steg86
$ echo "hello!" > message.txt
$ steg86 embed \
/bin/bash test.steg \
< message.txt
$ steg86 extract test.steg
hello!
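the embed/extract round trip can be modelled without touching a binary. a minimal sketch under stated assumptions: each candidate instruction is reduced to one `bool` slot (which side of its pair it currently uses), bits are written most-significant first, and the message length is passed to `extract` explicitly instead of being read from a header. steg86's actual bit order and header layout are not shown in these slides, and all function names here are hypothetical.

```rust
/// Expand bytes into bits, most significant bit first.
fn bytes_to_bits(bytes: &[u8]) -> Vec<bool> {
    bytes
        .iter()
        .flat_map(|&b| (0..8).rev().map(move |i| (b >> i) & 1 == 1))
        .collect()
}

/// Pack bits (MSB first) back into bytes; a trailing partial byte is dropped.
fn bits_to_bytes(bits: &[bool]) -> Vec<u8> {
    bits.chunks_exact(8)
        .map(|chunk| chunk.iter().fold(0u8, |acc, &bit| (acc << 1) | bit as u8))
        .collect()
}

/// Write the message into the leading slots, one bit per slot.
fn embed(slots: &mut [bool], message: &[u8]) -> Result<(), String> {
    let bits = bytes_to_bits(message);
    if bits.len() > slots.len() {
        return Err(format!("need {} bits, have {}", bits.len(), slots.len()));
    }
    for (slot, bit) in slots.iter_mut().zip(bits) {
        *slot = bit;
    }
    Ok(())
}

/// Read `len` bytes back out of the leading slots, if there are enough.
fn extract(slots: &[bool], len: usize) -> Option<Vec<u8>> {
    slots.get(..len * 8).map(bits_to_bytes)
}
```

slots that the message does not reach keep whatever encoding the compiler originally chose.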
steg86: next steps
- other tricks
  - test reg1, reg2 is the same as test reg2, reg1
  - same with xchg
  - multi-byte nops
- deficiencies
  - code/data disambiguation is impossible in the general case
    - many open problems in program analysis reduce to this
    - partial workarounds: CFG recovery, jump table identification
  - very easy to detect (real compilers stick to one encoding)
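the test trick listed under "other tricks" differs from the add/xor duals: `test r/m32, r32` (85 /r) has no reversed-direction opcode, so the spare bit would come from the operand order instead. a minimal sketch, assuming the reg-to-reg form; `swap_reg_rm` is a hypothetical helper and this is not implemented in steg86 as presented.

```rust
/// Swap the reg and r/m fields of a reg-to-reg ModR/M byte.
/// Returns None when r/m refers to memory, where a swap is not possible.
fn swap_reg_rm(modrm: u8) -> Option<u8> {
    if modrm >> 6 != 0b11 {
        return None;
    }
    let reg = (modrm >> 3) & 0b111;
    let rm = modrm & 0b111;
    Some(0b1100_0000 | (rm << 3) | reg)
}
```

85 C3 (test ebx, eax) becomes 85 D8 (test eax, ebx); since test only ANDs its operands and sets flags, both forms have the same effect.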
thank you!
slides: yossarian.net/publications#munich-rust-2020
github: woodruffw/steg86
blog post: hiding messages in x86 binaries using semantic duals
contact:
[email protected] / @8x5clPW2
links and prior work
- A86 assembler (1980s!)
- HYDAN (2004)
- ARMaHYDAN (2019, PoC||GTFO)