rawccopy-rs is a low-level library designed for the direct extraction of file content from New Technology File System (NTFS) volumes. It operates by parsing on-disk data structures, bypassing high-level operating system file I/O APIs. This methodology provides unfettered access to file data, irrespective of file system locks, security descriptors, or API-level data concealment mechanisms. The library's approach is rooted in a direct interpretation of NTFS metadata, including the Master File Table ($MFT), attribute runlists, and index B-trees, enabling the reconstruction of any file's data stream from a raw disk image or a live volume.
Standard interaction with file systems is mediated through operating system APIs (e.g., CreateFileW, ReadFile in the Windows API). While suitable for general-purpose computing, this abstracted model presents significant limitations in specialized fields such as digital forensics and incident response (DFIR), where obtaining a "ground truth" representation of on-disk data is required.
API-level access is subject to several constraints:
- Exclusive File Locking: The operating system and its applications often place exclusive locks on critical system files (e.g., registry hives, pagefiles, active database files), preventing them from being read by other processes.
- Security and Permissions: Access to files is governed by security descriptors, which may prevent even a privileged user from reading specific data.
- API-Level Obfuscation: Malicious software (rootkits) can intercept or "hook" file system APIs to conceal the presence of files, directories, or alternate data streams from user-mode applications.
- Filesystem Abstractions: The API presents a simplified view of a file, hiding the underlying complexity of its physical storage, such as fragmentation, compression, or residency within the master file table.
Overcoming these limitations requires a methodology that circumvents the OS file system driver and interacts directly with the volume at the block level.
The rawccopy-rs library implements a direct-access model by parsing NTFS on-disk structures. This process reconstructs the location and content of a file by interpreting the file system's metadata as a database.
The process begins by acquiring a handle to a raw block device (e.g., \\.\PhysicalDrive0, \\.\C:) or a forensic disk image. The first sector of the target volume, the Volume Boot Record (VBR), is read. The 8-byte OEM ID field at offset 0x03 is validated to confirm the presence of an NTFS file system: it must contain "NTFS" followed by four spaces.
From the VBR's BIOS Parameter Block (BPB), critical geometry parameters are extracted, including:
- Bytes per sector.
- Sectors per cluster.
- The logical cluster number (LCN) of the Master File Table ($MFT).
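The boot sector checks above can be sketched as follows. This is a minimal illustration, not the library's actual API: the `BootGeometry` struct and `parse_boot_sector` function are hypothetical names, and the field offsets (0x03, 0x0B, 0x0D, 0x30) follow the commonly documented NTFS boot sector layout.

```rust
/// Geometry fields read from an NTFS boot sector.
/// Illustrative type only; not taken from rawccopy-rs.
#[derive(Debug, PartialEq)]
pub struct BootGeometry {
    pub bytes_per_sector: u16,
    pub sectors_per_cluster: u8,
    pub mft_lcn: u64,
}

/// Parses the first sector of a volume. Returns `None` if the buffer is
/// too short or the OEM ID does not identify an NTFS volume.
pub fn parse_boot_sector(sector: &[u8]) -> Option<BootGeometry> {
    if sector.len() < 0x38 {
        return None;
    }
    // OEM ID: 8 bytes at offset 0x03, "NTFS" padded with four spaces.
    if &sector[0x03..0x0B] != b"NTFS    " {
        return None;
    }
    let bytes_per_sector = u16::from_le_bytes([sector[0x0B], sector[0x0C]]);
    // Raw byte value; very large cluster sizes use a different encoding
    // that this sketch does not handle.
    let sectors_per_cluster = sector[0x0D];
    // LCN of the $MFT: little-endian u64 at offset 0x30.
    let mut lcn = [0u8; 8];
    lcn.copy_from_slice(&sector[0x30..0x38]);
    Some(BootGeometry {
        bytes_per_sector,
        sectors_per_cluster,
        mft_lcn: u64::from_le_bytes(lcn),
    })
}
```

Returning `None` rather than panicking keeps the sketch usable on arbitrary, possibly non-NTFS, input sectors.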
The $MFT is the central metadata file in NTFS, containing at least one entry, an MFT record, for every file and directory on the volume. The library first locates the $MFT using the LCN from the boot sector and reads the $MFT's own record (always at index 0), whose $DATA runlist describes where the rest of the table is stored.
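As a rough illustration of the arithmetic involved, the sketch below converts the boot sector values into byte offsets. The function names are hypothetical. It assumes the record size comes from the "clusters per MFT record" byte at boot sector offset 0x40, where a negative value means 2^(-value) bytes, and that the requested record lies in the first contiguous fragment of the $MFT. Records beyond that fragment have to be located through the $MFT's own runlist.

```rust
/// Decodes the signed "clusters per MFT record" byte from the boot sector.
/// Non-negative values count clusters; negative values mean 2^(-value) bytes.
pub fn mft_record_size(raw: i8, bytes_per_cluster: u64) -> u64 {
    if raw >= 0 {
        raw as u64 * bytes_per_cluster
    } else {
        1u64 << (-(raw as i32))
    }
}

/// Byte offset of MFT record `index` from the start of the volume.
/// Only valid while the record lies in the first fragment of the $MFT.
pub fn mft_record_offset(
    mft_lcn: u64,
    bytes_per_cluster: u64,
    record_size: u64,
    index: u64,
) -> u64 {
    mft_lcn * bytes_per_cluster + index * record_size
}
```

With 4096-byte clusters and a raw value of -10 (0xF6), the record size is 1024 bytes, which matches the typical size mentioned in the text.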
Each MFT record is a fixed-size block of data (typically 1024 bytes). Before parsing, each record undergoes a "fix-up" procedure. NTFS uses an Update Sequence Array (USA) to detect torn writes. When a record is written, the last two bytes of each sector in the record are replaced with an update sequence number, and the original bytes are stored in the USA, whose location is given in the record header. The library checks that every sector ends with the expected sequence number and patches the original bytes back into the record to ensure its integrity before further processing.
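The fix-up procedure can be sketched as below. This is an illustrative helper, not the library's code: `apply_fixups` is a hypothetical name, and the header offsets (USA offset at 0x04, USA entry count at 0x06) follow the commonly documented record layout. The first USA entry is the update sequence number; the remaining entries hold the original sector tails.

```rust
/// Verifies and applies update sequence fix-ups in place.
/// `sector_size` is the fix-up stride (512 bytes on typical volumes).
pub fn apply_fixups(record: &mut [u8], sector_size: usize) -> Result<(), &'static str> {
    if record.len() < 8 || sector_size < 2 {
        return Err("record too short");
    }
    let usa_offset = u16::from_le_bytes([record[4], record[5]]) as usize;
    let usa_count = u16::from_le_bytes([record[6], record[7]]) as usize;
    if usa_count == 0 {
        return Err("empty update sequence array");
    }
    // One entry is the sequence number itself; the rest map to sectors.
    let sectors = usa_count - 1;
    if usa_offset + usa_count * 2 > record.len() || sectors * sector_size > record.len() {
        return Err("update sequence array out of bounds");
    }
    let usn = [record[usa_offset], record[usa_offset + 1]];
    for i in 0..sectors {
        let tail = (i + 1) * sector_size - 2;
        // Every sector must end with the sequence number.
        if record[tail..tail + 2] != usn {
            return Err("update sequence mismatch (possible torn write)");
        }
        // Restore the two original bytes saved in the array.
        let src = usa_offset + 2 + i * 2;
        record[tail] = record[src];
        record[tail + 1] = record[src + 1];
    }
    Ok(())
}
```

A mismatch is reported as an error instead of being patched over, since it indicates that the sectors of the record were not written together.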
An MFT record is composed of a series of variable-length attribute structures. These attributes define the characteristics of a file, such as its name, timestamps, and data content. The library iterates through these attributes to locate the primary data stream, represented by the $DATA attribute.
NTFS file data can be stored in two ways:
- Resident Data: For very small files, the data is stored directly within the $DATA attribute inside the MFT record itself. Extraction is a simple matter of reading the bytes from the attribute's value offset.
- Non-Resident Data: For larger files, the data is stored in clusters elsewhere on the volume. The $DATA attribute contains not the data itself, but a set of pointers known as a runlist (or mapping pairs).
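The attribute walk and the resident case described above can be sketched as follows. The helper names (`find_attribute`, `resident_value`) are hypothetical, and the offsets used (first attribute offset at 0x14 in the record header; type, length, non-resident flag and name length at 0x00, 0x04, 0x08 and 0x09 in the attribute header; value length and value offset at 0x10 and 0x14 for resident attributes) are the commonly documented ones. The sketch assumes fix-ups have already been applied to the record.

```rust
pub const ATTR_DATA: u32 = 0x80; // attribute type code of $DATA
const ATTR_END: u32 = 0xFFFF_FFFF; // end-of-attributes marker

fn le16(b: &[u8], o: usize) -> Option<usize> {
    let s = b.get(o..o + 2)?;
    Some(u16::from_le_bytes([s[0], s[1]]) as usize)
}

fn le32(b: &[u8], o: usize) -> Option<u32> {
    let s = b.get(o..o + 4)?;
    Some(u32::from_le_bytes([s[0], s[1], s[2], s[3]]))
}

/// Returns the raw bytes of the first unnamed attribute of type `wanted`.
pub fn find_attribute(record: &[u8], wanted: u32) -> Option<&[u8]> {
    let mut pos = le16(record, 0x14)?; // offset of the first attribute
    loop {
        let ty = le32(record, pos)?;
        if ty == ATTR_END {
            return None;
        }
        let len = le32(record, pos + 4)? as usize;
        if len < 0x18 {
            return None; // shorter than a resident header: corrupt record
        }
        let attr = record.get(pos..pos + len)?;
        // A name length of zero selects the unnamed (primary) stream.
        if ty == wanted && attr[9] == 0 {
            return Some(attr);
        }
        pos += len;
    }
}

/// Returns the value bytes of a resident attribute, or `None` if the
/// attribute is non-resident or malformed.
pub fn resident_value(attr: &[u8]) -> Option<&[u8]> {
    if *attr.get(8)? != 0 {
        return None; // non-resident: the data is described by a runlist
    }
    let value_len = le32(attr, 0x10)? as usize;
    let value_off = le16(attr, 0x14)?;
    attr.get(value_off..value_off + value_len)
}
```

Named $DATA attributes (alternate data streams) are skipped here by the name length check; selecting one would require comparing the attribute name as well.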
The runlist is a highly compact representation of data extents. Each entry encodes the length of a contiguous run in clusters and the run's starting Logical Cluster Number (LCN) on the disk, stored as a signed offset from the start of the previous run. Virtual Cluster Numbers (VCNs) within the file are implicit: each run begins where the previous one ended. The library parses these runlists to build a complete map of the file's physical layout on the disk, allowing for the precise reconstruction of fragmented files.
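A runlist decoder along these lines might look like the sketch below. The `Run` struct and `decode_runlist` function are hypothetical names used for illustration. Each entry starts with a header byte whose low nibble gives the size of the length field and whose high nibble gives the size of the LCN offset field; a zero header byte terminates the list.

```rust
/// One decoded extent. Illustrative type, not the library's own.
#[derive(Debug, PartialEq)]
pub struct Run {
    pub vcn: u64,         // first virtual cluster covered by this run
    pub lcn: Option<u64>, // starting disk cluster; None when no offset field is present
    pub length: u64,      // run length in clusters
}

/// Decodes mapping pairs into a list of runs. Returns `None` on malformed input.
pub fn decode_runlist(bytes: &[u8]) -> Option<Vec<Run>> {
    let mut runs = Vec::new();
    let (mut pos, mut vcn, mut lcn) = (0usize, 0u64, 0i64);
    while pos < bytes.len() && bytes[pos] != 0 {
        let len_size = (bytes[pos] & 0x0F) as usize;
        let off_size = (bytes[pos] >> 4) as usize;
        pos += 1;
        if len_size == 0 || len_size > 8 || off_size > 8 {
            return None;
        }
        let len_bytes = bytes.get(pos..pos + len_size)?;
        pos += len_size;
        let off_bytes = bytes.get(pos..pos + off_size)?;
        pos += off_size;

        // Run length: unsigned little-endian integer.
        let mut length = 0u64;
        for (i, b) in len_bytes.iter().enumerate() {
            length |= (*b as u64) << (8 * i);
        }

        let run_lcn = if off_size == 0 {
            None // no offset field: the run has no clusters on disk
        } else {
            // Signed little-endian delta, sign-extended from its top byte.
            let mut delta: i64 = if off_bytes[off_size - 1] & 0x80 != 0 { -1 } else { 0 };
            for b in off_bytes.iter().rev() {
                delta = (delta << 8) | (*b as i64);
            }
            lcn += delta;
            if lcn < 0 {
                return None;
            }
            Some(lcn as u64)
        };
        runs.push(Run { vcn, lcn: run_lcn, length });
        vcn += length;
    }
    Some(runs)
}
```

For example, the bytes `21 18 34 56 00` decode to a single run of 0x18 clusters starting at LCN 0x5634.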
The methodology extends to handle more complex NTFS file system features:
- Sparse Files: A runlist entry whose LCN offset field is omitted (an offset size of zero in the entry header) indicates a sparse region with no allocated clusters. The library interprets this as a block of zero bytes of the specified length.
- Compressed Files: NTFS supports transparent file compression using the LZNT1 algorithm. A compressed $DATA attribute has a non-zero compression unit size. The library reads the compressed data in blocks from the disk, identifies compressed versus uncompressed regions within a compression unit, and applies an LZNT1 decompression routine to reconstruct the original data.
- Attribute Lists: If a file has too many attributes to fit within a single MFT record, some attributes are moved to extension MFT records. An $ATTRIBUTE_LIST attribute is created in the base record, which contains pointers to the MFT records holding the externalized attributes. The library parses the $ATTRIBUTE_LIST to locate and read all parts of a fragmented attribute, such as a highly fragmented $DATA stream.
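To make the compressed-file case more concrete, here is a minimal sketch of LZNT1 chunk decoding, written from the publicly documented format rather than from the library's source. The function name `lznt1_decompress` is hypothetical. Each chunk starts with a 16-bit header: the low 12 bits hold the chunk data size minus one, and bit 15 marks the chunk as compressed. Inside a compressed chunk, a flag byte precedes each group of up to eight tokens, and the split between offset and length bits in a back-reference depends on how much of the chunk has been decoded so far. Deciding which compression units are compressed at all (based on the runlist) is outside the scope of this sketch.

```rust
/// Decompresses a sequence of LZNT1 chunks. Returns `None` on malformed input.
pub fn lznt1_decompress(input: &[u8]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    while pos + 2 <= input.len() {
        let header = u16::from_le_bytes([input[pos], input[pos + 1]]);
        if header == 0 {
            break; // a zero header ends the chunk sequence
        }
        pos += 2;
        let size = (header & 0x0FFF) as usize + 1;
        let chunk = input.get(pos..pos + size)?;
        pos += size;
        if header & 0x8000 == 0 {
            out.extend_from_slice(chunk); // chunk stored without compression
            continue;
        }
        let chunk_start = out.len();
        let mut i = 0usize;
        while i < chunk.len() {
            let flags = chunk[i];
            i += 1;
            for bit in 0..8 {
                if i >= chunk.len() {
                    break;
                }
                if (flags & (1 << bit)) == 0 {
                    out.push(chunk[i]); // literal byte
                    i += 1;
                    continue;
                }
                // Back-reference: 16-bit token holding (offset, length).
                let token = u16::from_le_bytes([chunk[i], *chunk.get(i + 1)?]) as usize;
                i += 2;
                let written = out.len() - chunk_start;
                if written == 0 {
                    return None; // nothing to refer back to yet
                }
                // The length field starts at 12 bits and shrinks as the
                // decoded position within the chunk grows.
                let mut length_bits = 12u32;
                let mut p = written - 1;
                while p >= 0x10 {
                    length_bits -= 1;
                    p >>= 1;
                }
                let length = (token & ((1usize << length_bits) - 1)) + 3;
                let offset = (token >> length_bits) + 1;
                if offset > written {
                    return None; // reference before the start of the chunk
                }
                // Copy byte by byte so overlapping references repeat data.
                for _ in 0..length {
                    let b = out[out.len() - offset];
                    out.push(b);
                }
            }
        }
    }
    Some(out)
}
```

The byte-by-byte copy is deliberate: a back-reference may overlap the bytes it is producing, which is how runs of repeated data are encoded.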
To locate a file by its path, the library implements a parser for NTFS index structures, which are used for directories. A directory's $INDEX_ROOT and $INDEX_ALLOCATION attributes form a B-tree that maps file names to their MFT record references.
The library traverses this B-tree by starting at the root directory (MFT record #5) and recursively searching the index for each component of the target path. This allows for the resolution of any file path to its corresponding MFT record number without relying on OS API calls. The path resolution logic also supports NTFS reparse points by parsing the $REPARSE_POINT attribute to follow symbolic links and volume mount points.
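The component-by-component walk can be sketched independently of the index parsing itself. In the example below, `resolve_path` is a hypothetical function, and the `lookup` callback stands in for a search of one directory's $INDEX_ROOT and $INDEX_ALLOCATION B-tree: given a directory's MFT record number and a name, it returns the child's record number. Reparse point handling and NTFS's upcase-table name comparison are left out.

```rust
/// MFT record number of the volume's root directory.
pub const ROOT_DIRECTORY_RECORD: u64 = 5;

/// Resolves a path to an MFT record number by looking up each component
/// in turn, starting from the root directory.
///
/// `lookup(directory_record, name)` is an assumed callback that searches
/// a single directory index and returns the matching child record.
pub fn resolve_path<F>(path: &str, mut lookup: F) -> Option<u64>
where
    F: FnMut(u64, &str) -> Option<u64>,
{
    let mut record = ROOT_DIRECTORY_RECORD;
    let components = path
        .split(|c: char| c == '\\' || c == '/')
        .filter(|s| !s.is_empty());
    for component in components {
        // A missing component aborts the walk with `None`.
        record = lookup(record, component)?;
    }
    Some(record)
}
```

Passing the index search as a closure keeps the traversal logic testable against an in-memory directory map, without needing a disk image.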
The rawccopy-rs library provides a filesystem-native methodology for file data extraction. By parsing on-disk NTFS structures directly, from the boot sector to MFT records, attribute runlists, and index trees, it reconstructs file content with high fidelity. This approach bypasses the abstractions and limitations of standard file I/O APIs, making it a suitable engine for forensic tooling that requires unmediated access to file system data.