This project is a minimalistic compiler written in C that compiles shell scripts into standalone ELF64 executables for Linux on the x86_64 architecture. It translates shell command lines into machine code, embeds them in an ELF binary, and handles process control and system calls natively without relying on an external shell or interpreter.

    make

The produced `sh2elf` binary is a normal host executable. No additional runtime libraries are required.

    ./sh2elf script.sh -o elf.out   # emits `elf.out` (defaults to a.out)
    ./elf.out                       # runs the translated script

The source language is intentionally tiny. Anything outside the rules below is rejected with a parse error or left uninterpreted.

- Command layout: commands are separated by newlines or `;`. Blank lines are ignored. Trailing `|` entries are rejected.
- Comments: an unquoted `#` begins a comment that extends to the end of the current line (after any inline whitespace). Inside single or double quotes, `#` is treated literally.
- Pipelines: the `|` operator connects stdout of the left stage to stdin of the right one. Arbitrary-length pipelines are supported, mixing built-ins and external commands.
- Conditional execution: `&&` runs the following pipeline only when the previous command succeeds (exit status `0`), while `||` runs the next pipeline only when the previous command fails (non-zero status). Conditions short-circuit without altering the last exit status.
- Redirection: each stage accepts a single input redirection (`< file`) and a single output redirection (`> file` overwrite, `>> file` append). Redirections must have a word argument and cannot appear without an accompanying command.
- Built-ins:
  - `echo` prints its arguments separated by single spaces and appends a newline.
  - `cd` changes to the provided directory (`cd DIR`). Missing arguments are ignored.
  - `exit` terminates the program with status 0.
- External commands: names containing `/` are executed verbatim. Otherwise the compiler tries `/bin/NAME` and then `/usr/bin/NAME`. No `PATH` lookup occurs. The runtime passes an empty environment (`envp` holds only the terminating `NULL`).
- Tokenisation & quoting:
  - Unquoted tokens are split on spaces, tabs, and carriage returns.
  - Backslash outside quotes escapes the next character (e.g. `echo foo\ bar`).
  - Single quotes (`'literal'`) preserve characters verbatim until the matching `'`.
  - Double quotes recognise `\"`, `\\`, `\$`, and `` \` `` escapes; all other backslash pairs keep the backslash (e.g. `"Hello\n"` stays `Hello\n`).
  - Newlines inside double quotes can be escaped with `\` + newline (line continuation).
- Argument vectors: argv is constructed exactly as parsed; no globbing, parameter expansion, command substitution, arithmetic expansion, or brace expansion is implemented.
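
The external-command lookup rule above can be sketched in C as follows. This is only an illustration under assumptions: `candidate_paths` and `PATH_MAX_LEN` are hypothetical names, not taken from the project, and the sketch does not say whether the real compiler resolves the path at compile time or emits code that tries both paths at run time.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_MAX_LEN 256 /* hypothetical limit chosen for this sketch */

/* Hypothetical helper: fills out[] with the paths to try for `name`,
 * in order, and returns how many there are (0 if the name is too long). */
static int candidate_paths(const char *name, char out[2][PATH_MAX_LEN])
{
    assert(name != NULL && name[0] != '\0');

    if (strchr(name, '/') != NULL) {
        /* Names containing '/' are used verbatim. */
        if (strlen(name) >= PATH_MAX_LEN) return 0;
        strcpy(out[0], name);
        return 1;
    }

    /* No PATH lookup: only /bin and then /usr/bin are tried. */
    if (strlen(name) + sizeof("/usr/bin/") > PATH_MAX_LEN) return 0;
    snprintf(out[0], PATH_MAX_LEN, "/bin/%s", name);
    snprintf(out[1], PATH_MAX_LEN, "/usr/bin/%s", name);
    return 2;
}
```

Called with `"ls"`, the helper yields `/bin/ls` followed by `/usr/bin/ls`; called with `"./tool"`, it yields the name unchanged as the only candidate.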
- Not supported: background jobs (`&`), subshells, functions, here documents, `set`, variable assignment, and environment inheritance.
- Signals are not trapped; generated programs exit on a failed `execve` or an unhandled system call error.

Compile and run the included samples:

    cat scripts/hello.sh
    ./sh2elf scripts/hello.sh -o hello
    ./hello

    cat scripts/pipeline.sh
    ./sh2elf scripts/pipeline.sh -o pipeline
    ./pipeline

    cat scripts/logic.sh
    ./sh2elf scripts/logic.sh -o logic
    ./logic

The sh2elf compiler includes a fully integrated tokenizer and parser to transform raw shell script text into executable machine code:

- The tokenizer reads the shell script input character by character and breaks it into meaningful tokens while respecting shell syntax.
- It handles complex quoting rules:
  - Single quotes `'...'` treat everything literally until the closing quote.
  - Double quotes `"..."` allow escapes and preserve spaces within the string.
  - Backslash `\` escapes the next character.
- Token terminators include whitespace, pipeline symbols `|`, command separators `;` or newlines, and redirection symbols `<`, `>`.
- It accumulates characters into tokens until a terminator or quote is detected, enabling commands and arguments to be accurately extracted.
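
The word-splitting and quoting behaviour described above might look roughly like the sketch below. It is a simplified illustration, not the project's tokenizer: `is_terminator` and `next_word` are hypothetical names, only the terminators listed above are recognised (comments and the `&&`/`||` operators are left out), and an empty result does not distinguish "no word" from an empty quoted string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if c ends an unquoted word. */
static int is_terminator(char c)
{
    return c == '\0' || strchr(" \t\r\n|;<>", c) != NULL;
}

/* Hypothetical helper: copies the next word from *src into out (capacity
 * cap), applying the quoting rules, and advances *src past the word.
 * Returns the word length, or -1 on an unterminated quote or overflow. */
static int next_word(const char **src, char *out, size_t cap)
{
    const char *p = *src;
    size_t n = 0;

    assert(cap > 0);
    while (*p == ' ' || *p == '\t' || *p == '\r') p++; /* skip blanks */

    while (!is_terminator(*p)) {
        char c = *p++;
        if (c == '\'') {                      /* single quotes: literal */
            while (*p != '\0' && *p != '\'') {
                if (n + 1 >= cap) return -1;
                out[n++] = *p++;
            }
            if (*p != '\'') return -1;        /* unterminated quote */
            p++;
        } else if (c == '"') {                /* double quotes: few escapes */
            while (*p != '\0' && *p != '"') {
                /* backslash + newline is a line continuation */
                if (*p == '\\' && p[1] == '\n') { p += 2; continue; }
                /* \" \\ \$ \` drop the backslash; other pairs keep it */
                if (*p == '\\' && p[1] != '\0' && strchr("\"\\$`", p[1])) p++;
                if (n + 1 >= cap) return -1;
                out[n++] = *p++;
            }
            if (*p != '"') return -1;         /* unterminated quote */
            p++;
        } else if (c == '\\' && *p != '\0') { /* escape the next character */
            if (n + 1 >= cap) return -1;
            out[n++] = *p++;
        } else {                              /* ordinary character */
            if (n + 1 >= cap) return -1;
            out[n++] = c;
        }
    }
    out[n] = '\0';
    *src = p;
    return (int)n;
}
```

Calling `next_word` repeatedly on the line `echo foo\ bar 'a b'` produces the words `echo`, `foo bar`, and `a b`; the caller is expected to inspect `**src` afterwards to handle `|`, `;`, `<`, `>`, or a newline.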
- The parser consumes tokens sequentially and organizes them into a hierarchical structure representing the shell script logic:
  - Stage: represents a single command and its arguments, along with input/output redirections.
  - Pipeline: a sequence of `Stage`s connected by pipe `|` operators.
  - Script: one or more pipelines separated by command terminators (`;` or newline).
- Redirections are parsed and attached to the relevant `Stage`.
- Error checking is performed to detect syntax errors such as a missing command after a pipe or unterminated quotes.
- The output is a tree-like structure that fully describes the commands, their arguments, pipes, and redirections.
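
One plausible shape for that tree is sketched below. The `Stage`, `Pipeline`, and `Script` names come from the description above, but every field, the fixed array limits, the `Join` enum for `&&`/`||`, and the `stage_add_arg` helper are assumptions made for illustration; the project's actual definitions may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed limits, chosen only to keep the sketch simple. */
enum { MAX_ARGS = 16, MAX_STAGES = 8, MAX_PIPELINES = 32 };

/* How a pipeline is joined to the one before it (assumed representation). */
typedef enum { JOIN_ALWAYS, JOIN_AND, JOIN_OR } Join;

typedef struct {
    const char *argv[MAX_ARGS + 1]; /* NULL-terminated argument vector */
    int         argc;
    const char *in_path;            /* "< file", or NULL */
    const char *out_path;           /* "> file" or ">> file", or NULL */
    int         append;             /* nonzero for ">>" */
} Stage;

typedef struct {
    Stage stages[MAX_STAGES];       /* connected left to right by '|' */
    int   nstages;
    Join  join;                     /* ';' or newline, "&&", or "||" */
} Pipeline;

typedef struct {
    Pipeline pipelines[MAX_PIPELINES];
    int      npipelines;
} Script;

/* Appends one argument to a stage; returns 0 on success, -1 if full. */
static int stage_add_arg(Stage *st, const char *word)
{
    assert(st != NULL && word != NULL);
    if (st->argc >= MAX_ARGS) return -1;
    st->argv[st->argc++] = word;
    st->argv[st->argc] = NULL;      /* keep argv ready for execve */
    return 0;
}
```

Keeping `argv` NULL-terminated at all times means a stage can be handed to the code generator without a separate finalisation step.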
- The structured script representation feeds into code emission routines generating native `x86_64` machine code.
- Built-in commands (`echo`, `cd`, `exit`) are implemented inline by emitting syscall instructions directly.
- External commands are executed using the `fork()` and `execve()` syscalls; a command name without `/` is resolved by trying `/bin` and then `/usr/bin`.
- Pipelines are handled by creating pipes and managing file descriptors between forked children.
- Arguments and strings are stored in a dedicated string pool, with relocations patched once the ELF layout is finalized.
- The final machine code is wrapped in a minimal ELF64 executable with proper headers and segments, making the binary runnable on Linux without dependencies.

Together, the tokenizer and parser turn shell script text into an intermediate representation, keeping lexical analysis, syntactic parsing, and code generation cleanly separated.

This low-level, modular design lets shell behavior be implemented with system calls alone, without an external interpreter, while keeping the transformation from source text to executable machine code easy to follow.

- The compiler generates raw `x86_64` machine instructions byte by byte into a dynamic buffer.
- Instruction helper functions emit opcodes and immediates manually, for example:
  - `mov_rax_imm32(c, x)` emits bytes for `mov rax, imm32`.
  - `syscall_(c)` emits the `syscall` instruction to invoke Linux kernel syscalls.
  - Conditional jumps (`je_rel32`, `jne_rel32`) emit placeholder offsets to be patched later once the target address is known.
- Registers (like `rax`, `rdi`, `rsi`, `rdx`, `r10`) are loaded with immediate values or addresses for syscall arguments.
- System calls for typical shell operations are implemented:
  - `sys_write` (write to file descriptor),
  - `sys_fork` (create child process),
  - `sys_execve` (execute a binary),
  - `sys_wait4` (wait for child process),
  - file operations like `sys_openat`, `sys_dup2`, `sys_close` for managing redirections.
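
A minimal sketch of such emit helpers follows. The helper names `mov_rax_imm32`, `syscall_`, and `je_rel32` appear above, but their signatures here are guesses, and the `Code` buffer type, `emit8`, `emit32`, and `patch_rel32` are hypothetical. The byte sequences are standard x86_64 encodings (`48 C7 C0 imm32`, `0F 05`, `0F 84 rel32`); the project may well pick different but equivalent encodings.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable code buffer. */
typedef struct {
    uint8_t *data;
    size_t   len, cap;
} Code;

static void emit8(Code *c, uint8_t b)
{
    if (c->len == c->cap) {
        c->cap = c->cap ? c->cap * 2 : 64;
        c->data = realloc(c->data, c->cap);
        assert(c->data != NULL);
    }
    c->data[c->len++] = b;
}

/* Emits a 32-bit value in little-endian byte order. */
static void emit32(Code *c, uint32_t v)
{
    for (int i = 0; i < 4; i++) emit8(c, (uint8_t)(v >> (8 * i)));
}

/* mov rax, imm32 (sign-extended to 64 bits): REX.W C7 /0 id */
static void mov_rax_imm32(Code *c, int32_t x)
{
    emit8(c, 0x48); emit8(c, 0xC7); emit8(c, 0xC0);
    emit32(c, (uint32_t)x);
}

/* syscall: 0F 05 */
static void syscall_(Code *c)
{
    emit8(c, 0x0F); emit8(c, 0x05);
}

/* je rel32: 0F 84 followed by a zero placeholder.
 * Returns the offset of the rel32 field so it can be patched later. */
static size_t je_rel32(Code *c)
{
    emit8(c, 0x0F); emit8(c, 0x84);
    size_t field = c->len;
    emit32(c, 0);
    return field;
}

/* Patches a rel32 field so the jump lands on code offset `target`.
 * The displacement is relative to the end of the jump instruction. */
static void patch_rel32(Code *c, size_t field, size_t target)
{
    uint32_t rel = (uint32_t)(target - (field + 4));
    for (int i = 0; i < 4; i++)
        c->data[field + i] = (uint8_t)(rel >> (8 * i));
}
```

A forward jump is emitted in two steps: call `je_rel32` and remember the returned field offset, emit the code to be skipped, then call `patch_rel32` with the current buffer length as the target.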
- All string literals (command arguments, file names) are stored in a read-only string pool buffer.
- When emitting code that loads the address of a string, a zero placeholder is emitted.
- These placeholders are registered in a relocation list to be patched later.
- After code emission is complete, the final absolute addresses of the strings inside the ELF `.rodata` string pool are used to patch the machine code.
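
The placeholder-and-patch scheme can be sketched as below. All names (`Strings`, `Reloc`, `pool_add`, `reloc_add`, `apply_relocs`) and the fixed capacities are hypothetical, and the sketch assumes each placeholder is a 64-bit little-endian absolute address; the project may use a different width or record layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { POOL_CAP = 256, MAX_RELOCS = 16 }; /* hypothetical limits */

typedef struct {
    size_t code_off; /* where the 8-byte placeholder sits in the code */
    size_t pool_off; /* offset of the string it must point at */
} Reloc;

typedef struct {
    char   pool[POOL_CAP];
    size_t pool_len;
    Reloc  relocs[MAX_RELOCS];
    size_t nrelocs;
} Strings;

/* Copies s (including its NUL) into the pool and returns its offset. */
static size_t pool_add(Strings *st, const char *s)
{
    size_t n = strlen(s) + 1, off = st->pool_len;
    assert(off + n <= POOL_CAP);
    memcpy(st->pool + off, s, n);
    st->pool_len += n;
    return off;
}

/* Records that the placeholder at code_off refers to the string at pool_off. */
static void reloc_add(Strings *st, size_t code_off, size_t pool_off)
{
    assert(st->nrelocs < MAX_RELOCS);
    st->relocs[st->nrelocs++] = (Reloc){ code_off, pool_off };
}

/* Once the pool's load address is known, overwrites every placeholder
 * with pool_vaddr + pool_off, stored little-endian. */
static void apply_relocs(const Strings *st, uint8_t *code, uint64_t pool_vaddr)
{
    for (size_t i = 0; i < st->nrelocs; i++) {
        uint64_t addr = pool_vaddr + st->relocs[i].pool_off;
        for (int b = 0; b < 8; b++)
            code[st->relocs[i].code_off + b] = (uint8_t)(addr >> (8 * b));
    }
}
```

Deferring the patch is what lets code be emitted before its final size, and therefore the pool's address, is known.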
- The executable is built as a minimal `ELF64` file with these components:
  - ELF header: identifies `ELF64`, type executable, machine `x86_64`, entry point.
  - Program headers (segments):
    - A loadable text segment containing machine code followed by the `.rodata` string pool.
    - A loadable `bss` segment reserved for uninitialized data used at runtime (e.g., environment pointers, pipe file descriptors, child pids).
- Sections are not included separately; only program segments are generated directly.
- Virtual addresses are chosen as conventional Linux `x86_64` load addresses (e.g., code at 0x400000, `.bss` at 0x600000).
- The binary is written out to the specified output filename, and file permissions are set to executable (0755).
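
The headers described above could be filled in along these lines, using the type and constant definitions from the Linux `<elf.h>` header. The function names are hypothetical, the two addresses are the example values quoted above, and the sketch assumes the text segment maps the file from offset 0 (headers included) with 4 KiB alignment; the project's real layout is not confirmed here.

```c
#include <assert.h>
#include <elf.h>
#include <stdint.h>
#include <string.h>

#define TEXT_VADDR 0x400000ULL /* example addresses from the text above */
#define BSS_VADDR  0x600000ULL

/* Fills an ELF64 header for a static x86_64 executable whose `phnum`
 * program headers follow the header directly; no section table. */
static void fill_ehdr(Elf64_Ehdr *eh, uint64_t entry, uint16_t phnum)
{
    memset(eh, 0, sizeof *eh);
    memcpy(eh->e_ident, ELFMAG, SELFMAG);
    eh->e_ident[EI_CLASS]   = ELFCLASS64;
    eh->e_ident[EI_DATA]    = ELFDATA2LSB;
    eh->e_ident[EI_VERSION] = EV_CURRENT;
    eh->e_ident[EI_OSABI]   = ELFOSABI_SYSV;
    eh->e_type      = ET_EXEC;
    eh->e_machine   = EM_X86_64;
    eh->e_version   = EV_CURRENT;
    eh->e_entry     = entry;
    eh->e_phoff     = sizeof(Elf64_Ehdr);
    eh->e_ehsize    = sizeof(Elf64_Ehdr);
    eh->e_phentsize = sizeof(Elf64_Phdr);
    eh->e_phnum     = phnum;
}

/* PT_LOAD covering headers, code, and string pool; read + execute. */
static void fill_text_phdr(Elf64_Phdr *ph, uint64_t file_size)
{
    memset(ph, 0, sizeof *ph);
    ph->p_type   = PT_LOAD;
    ph->p_flags  = PF_R | PF_X;
    ph->p_offset = 0;
    ph->p_vaddr  = ph->p_paddr = TEXT_VADDR;
    ph->p_filesz = ph->p_memsz = file_size;
    ph->p_align  = 0x1000;
}

/* PT_LOAD with no file bytes: the kernel zero-fills p_memsz bytes. */
static void fill_bss_phdr(Elf64_Phdr *ph, uint64_t mem_size)
{
    memset(ph, 0, sizeof *ph);
    ph->p_type   = PT_LOAD;
    ph->p_flags  = PF_R | PF_W;
    ph->p_vaddr  = ph->p_paddr = BSS_VADDR;
    ph->p_filesz = 0;
    ph->p_memsz  = mem_size;
    ph->p_align  = 0x1000;
}
```

Under these assumptions the entry point is `TEXT_VADDR` plus the size of the ELF header and both program headers, since the first instruction sits right after them in the file.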
- The parsed script structure is converted sequentially into machine code.
- For each parsed pipeline and stage:
  - Code for commands, argument setup, syscalls for fork/exec, and pipe/redirection management is emitted.
- Built-in commands bypass creating new processes; their behavior is implemented inline in the emitted code.
- Pipelines set up multiple pipes and forked children, duplicating file descriptors to implement Unix pipe semantics.
- For error cases and exec failures, code is emitted that prints an error string and then exits with a failure status.
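
To make the run-time behaviour concrete, here is a C-level sketch of what the emitted code for a two-stage pipeline has to accomplish. It is not the generated code itself, which issues raw syscalls without libc. `run_pipe2` is a hypothetical name, the exit status 127 on a failed `execve` is borrowed from common shell convention rather than taken from the project, and error-message printing is omitted.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `left | right` with an empty environment and returns the exit
 * status of the right-hand stage, or -1 on a setup error. */
static int run_pipe2(char *const left[], char *const right[])
{
    static char *const empty_env[] = { NULL };
    int fds[2];

    assert(left != NULL && right != NULL);
    if (pipe(fds) < 0) return -1;

    pid_t l = fork();
    if (l < 0) { close(fds[0]); close(fds[1]); return -1; }
    if (l == 0) {                      /* left child: stdout -> pipe */
        dup2(fds[1], STDOUT_FILENO);
        close(fds[0]); close(fds[1]);
        execve(left[0], left, empty_env);
        _exit(127);                    /* exec failed (assumed status) */
    }

    pid_t r = fork();
    if (r == 0) {                      /* right child: stdin <- pipe */
        dup2(fds[0], STDIN_FILENO);
        close(fds[0]); close(fds[1]);
        execve(right[0], right, empty_env);
        _exit(127);
    }

    /* The parent must close both ends, or the reader never sees EOF. */
    close(fds[0]); close(fds[1]);

    int lst = 0, rst = 0;
    waitpid(l, &lst, 0);
    if (r > 0 && waitpid(r, &rst, 0) == r && WIFEXITED(rst))
        return WEXITSTATUS(rst);
    return -1;
}
```

Longer pipelines repeat the same pattern with one pipe per `|`, each child duplicating the read end of the previous pipe onto stdin and the write end of the next pipe onto stdout.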
This compiler manually assembles every byte of machine code and ELF headers from scratch, without an assembler or linker, demonstrating full control over:
- Encoding of instructions and operands.
- Address and offset relocations for strings.
- ELF layout with precise segment and memory mapping.
- Implementation of shell-like process and I/O management using the Linux syscall ABI.

This project is provided under the GPL3 License.