The LZW (Lempel-Ziv-Welch) algorithm is a well-known data compression technique that uses a dictionary-based approach for encoding and decoding data. It is particularly effective in handling repetitive data, making it a popular choice for formats like GIF and TIFF.
This project implements an LZW encoder and decoder in C to compress and decompress text files using a dictionary with 15-bit indices.
- The source file must contain only ASCII characters.
- The source file size should not exceed a few megabytes.
- No new dictionary entries can be added once the dictionary is full.
- Character Encoding: Each ASCII character is encoded using one byte, with the most significant bit set to 0. This distinguishes it from dictionary entries, as ASCII characters are 7-bit.
- Index Encoding: Each dictionary index is represented using two bytes, with the most significant bit set to 1. This allows for a maximum of 32,768 entries, suitable for compressing files within the specified size limit.
- Minimum Entry Size: Dictionary entries are only referenced when they contain at least 3 characters. This is because 2 bytes are used for an index, making smaller entries less space-efficient.
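- Index Offset: Dictionary indexing starts from 0, not 256. Because literal characters are identified by their most significant bit rather than by reserved index values, indices 0 to 255 do not need to be set aside for single characters, which leaves 256 additional entries for strings.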
- Hashed Indexing: Each dictionary entry is indexed by a checksum value. A linked list holds the entries that share a checksum, so a lookup scans one chain rather than the whole dictionary, and checksum collisions are resolved by walking that chain.
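
The two output forms described above (one byte for a literal, two bytes for an index) can be sketched as follows. This is a minimal illustration rather than the project's code: the names `put_literal` and `put_index` are hypothetical, and the byte order of the index (high 7 bits first) is an assumption. The flag bit has to sit in the first byte so that a reader can tell the two forms apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 32768u   /* 2^15 indices fit in the 15 payload bits */

/* Writes a 7-bit ASCII literal as one byte (MSB = 0).
 * Returns the number of bytes written. */
static size_t put_literal(uint8_t *out, unsigned char ch)
{
    assert(ch < 128);
    out[0] = ch;
    return 1;
}

/* Writes a dictionary index as two bytes with the MSB of the first
 * byte set to 1. Byte order is an assumption of this sketch. */
static size_t put_index(uint8_t *out, unsigned index)
{
    assert(index < MAX_ENTRIES);
    out[0] = (uint8_t)(0x80u | (index >> 8));   /* flag + high 7 bits */
    out[1] = (uint8_t)(index & 0xFFu);          /* low 8 bits */
    return 2;
}
```

With this layout, index 300 (0x012C) is written as the bytes 0x81 0x2C, and the largest index, 32,767, as 0xFF 0xFF.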

Initialisation:
- The encoder starts by checking if the correct number of command-line arguments is provided.
- It opens the input file for reading in binary mode and the output file for writing.
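
The initialisation steps might look roughly like the sketch below. The helper name `open_files`, the usage message, and the `"wb"` mode for the output file are assumptions made for illustration; the description above only states that the input is opened in binary mode.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not taken from the project source.
 * Returns 0 on success and -1 on a usage or I/O error. */
static int open_files(int argc, char **argv, FILE **in, FILE **out)
{
    assert(in != NULL && out != NULL);
    *in = NULL;
    *out = NULL;

    if (argc != 3) {                      /* program name + two file names */
        fprintf(stderr, "usage: %s inputFile outputFile\n",
                argc > 0 ? argv[0] : "lencode");
        return -1;
    }

    *in = fopen(argv[1], "rb");           /* binary mode, as described above */
    if (*in == NULL) {
        perror(argv[1]);
        return -1;
    }

    *out = fopen(argv[2], "wb");          /* "wb" is an assumption */
    if (*out == NULL) {
        perror(argv[2]);
        fclose(*in);
        *in = NULL;
        return -1;
    }
    return 0;
}
```

Opening the input first means a missing input file is reported before an output file is created.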

Dictionary Setup:
- A dictionary (`dict`) is initialised to store up to 32,768 entries.
- An array of linked lists (`idx`) is used to map checksum values to dictionary entries, enhancing lookup efficiency.
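
One possible shape for these two structures is sketched below. The names `dict` and `idx` come from the description above; everything else is hypothetical, including the bucket count `NBUCKETS`, the checksum (a plain sum of byte values), and the function names. The project's real checksum is not documented here and may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DICT_SIZE 32768   /* 15-bit index space, as in the text */
#define NBUCKETS  4096    /* hypothetical number of checksum values */

typedef struct { char *str; size_t len; } Entry;

typedef struct Node {     /* one link in a checksum chain */
    int index;            /* position of the entry in dict[] */
    struct Node *next;
} Node;

typedef struct {
    Entry dict[DICT_SIZE];   /* entries, indexed from 0 */
    Node *idx[NBUCKETS];     /* chains of entries sharing a checksum */
    int count;
} Dictionary;

/* Hypothetical checksum: sum of byte values reduced to a bucket number. */
static unsigned checksum(const char *s, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += (unsigned char)s[i];
    return sum % NBUCKETS;
}

/* Returns the index of the entry equal to s[0..len), or -1 if absent. */
static int dict_find(const Dictionary *d, const char *s, size_t len)
{
    for (const Node *n = d->idx[checksum(s, len)]; n != NULL; n = n->next) {
        const Entry *e = &d->dict[n->index];
        if (e->len == len && memcmp(e->str, s, len) == 0)
            return n->index;
    }
    return -1;
}

/* Adds s if there is room. Returns the new index, or -1 when the
 * dictionary is full or memory runs out. */
static int dict_add(Dictionary *d, const char *s, size_t len)
{
    assert(len > 0);
    if (d->count >= DICT_SIZE)
        return -1;
    char *copy = malloc(len);
    Node *n = malloc(sizeof *n);
    if (copy == NULL || n == NULL) {
        free(copy);
        free(n);
        return -1;
    }
    memcpy(copy, s, len);
    unsigned h = checksum(s, len);
    n->index = d->count;
    n->next = d->idx[h];          /* push onto the front of the chain */
    d->idx[h] = n;
    d->dict[d->count].str = copy;
    d->dict[d->count].len = len;
    return d->count++;
}

/* Frees every entry string and chain node. */
static void dict_free(Dictionary *d)
{
    for (int i = 0; i < d->count; i++)
        free(d->dict[i].str);
    for (int b = 0; b < NBUCKETS; b++) {
        Node *n = d->idx[b];
        while (n != NULL) {
            Node *next = n->next;
            free(n);
            n = next;
        }
        d->idx[b] = NULL;
    }
    d->count = 0;
}
```

A `Dictionary` of this size is several hundred kilobytes, so it is better allocated with `calloc(1, sizeof(Dictionary))` than placed on the stack. With the sum-of-bytes checksum, "ab" and "ba" land in the same chain, which is exactly the case the linked list handles.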

Main Encoding Loop:
- For each subsequent character read from the input file:
  - It creates a string `pc` by appending the current character to the previous string `p`.
  - The program calculates a checksum for `pc` to determine if it exists in the dictionary.
  - If `pc` is found in the dictionary, it becomes the new `p`, and the loop continues.
  - If `pc` is not found and the dictionary is not full, `pc` is added to the dictionary.
  - If `p` is found in the dictionary, its corresponding dictionary index is written to the output file.
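
The loop can be sketched in memory as shown below. This is an illustration under several assumptions, not the project's implementation: it works on buffers instead of files, uses a linear search in place of the checksum lookup, stores entries as pointers into the input, writes a `p` shorter than 3 characters as literal bytes (following the minimum entry size rule), and puts the high bits of an index in the first byte. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ENTRIES 32768          /* 15-bit index space */

typedef struct {
    const char *s;                 /* points into the input buffer */
    size_t len;
} Entry;

static Entry dict[MAX_ENTRIES];
static int dict_count;

/* Linear search stands in for the checksum-based lookup. */
static int find(const char *s, size_t len)
{
    for (int i = 0; i < dict_count; i++)
        if (dict[i].len == len && memcmp(dict[i].s, s, len) == 0)
            return i;
    return -1;
}

/* Writes p as a two-byte index if it is a dictionary entry of at least
 * 3 characters, otherwise as literal bytes. Returns bytes written. */
static size_t emit(uint8_t *out, const char *p, size_t len, int index)
{
    if (len >= 3 && index >= 0) {
        out[0] = (uint8_t)(0x80 | (index >> 8));
        out[1] = (uint8_t)(index & 0xFF);
        return 2;
    }
    memcpy(out, p, len);
    return len;
}

/* Encodes the ASCII text in[0..n) into out, which must hold n bytes. */
static size_t lzw_encode(const char *in, size_t n, uint8_t *out)
{
    size_t o = 0, start = 0, len = 1;   /* p = in[start .. start+len) */
    int p_index = -1;                   /* dictionary index of p, or -1 */

    assert(in != NULL && out != NULL);
    dict_count = 0;
    if (n == 0)
        return 0;

    for (size_t i = 1; i < n; i++) {
        int found = find(in + start, len + 1);   /* pc = p + in[i] */
        if (found >= 0) {               /* pc is known: it becomes p */
            len++;
            p_index = found;
            continue;
        }
        if (dict_count < MAX_ENTRIES) { /* add pc unless the dictionary is full */
            dict[dict_count].s = in + start;
            dict[dict_count].len = len + 1;
            dict_count++;
        }
        o += emit(out + o, in + start, len, p_index);
        start = i;                      /* p restarts at the current character */
        len = 1;
        p_index = -1;
    }
    o += emit(out + o, in + start, len, p_index);   /* final p */
    return o;
}
```

Under these assumptions, the 12-character input `abababababab` is written as 10 bytes: `a b a b 0x80 0x02 b a 0x80 0x04`, where index 2 stands for "aba" and index 4 for "bab". The output never exceeds the input size because a two-byte index always replaces at least three characters.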

Final Output:
- After the entire input file is processed, the final value of `p` is output to the file.
- The program ensures that any remaining data is written before closing the files.

Cleanup:
- The program frees all dynamically allocated memory, including the dictionary entries and linked lists.
- The files are closed, and the program exits.

Initialisation:
- The decoder starts by verifying that the correct number of command-line arguments (input file and output file) is provided.
- It opens the input file for reading in binary mode and the output file for writing.

Dictionary Setup:
- A dictionary (`dict`) is initialised to store up to 32,768 entries.
- An array of linked lists (`idx`) is used to map checksum values to dictionary entries, improving lookup efficiency.

Main Decoding Loop:
- The program reads the first character from the input file, writes it directly to the output file, and stores it as the previous character (`p`).
- For each subsequent byte read from the input file:
  - If the byte is an ASCII value (less than 128), it is written directly to the output file.
  - If the byte has its most significant bit set (greater than or equal to 128), it is the first byte of a dictionary index, and the program reads an additional byte to form the complete index.
  - The program then checks if the dictionary entry exists. If it does, the corresponding string is written to the output file. If it does not, the string is derived from the previous dictionary entry and written to the output file.
  - The program forms a new dictionary entry by appending the first character of the current string to the previous string and adds it to the dictionary if there is space.
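
Two pieces of this loop are sketched below: splitting the byte stream into literals and indices, and building the string for an index that the decoder has not defined yet. Both are illustrations with hypothetical names (`Token`, `read_token`, `derive_missing`). The index byte order (high 7 bits first) is an assumption, and the missing-entry rule shown is the one from textbook LZW (previous string plus its own first character); the description above does not spell out the exact rule the project uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    int is_index;      /* 1: dictionary index, 0: ASCII literal */
    unsigned value;    /* character code or 15-bit index */
} Token;

/* Reads one token from buf starting at *pos and advances *pos.
 * Returns 1 on success, 0 at end of input or on a truncated index. */
static int read_token(const uint8_t *buf, size_t len, size_t *pos, Token *tok)
{
    assert(buf != NULL && pos != NULL && tok != NULL);
    if (*pos >= len)
        return 0;

    uint8_t first = buf[(*pos)++];
    if (first < 128) {                 /* MSB clear: one-byte literal */
        tok->is_index = 0;
        tok->value = first;
        return 1;
    }
    if (*pos >= len)                   /* index is missing its second byte */
        return 0;
    tok->is_index = 1;                 /* MSB set: drop the flag, join both bytes */
    tok->value = ((unsigned)(first & 0x7Fu) << 8) | buf[(*pos)++];
    return 1;
}

/* Textbook LZW rule for an index that is not in the dictionary yet:
 * the string is the previous string followed by its own first character.
 * out must hold prev_len + 1 bytes. Returns the new length. */
static size_t derive_missing(const char *prev, size_t prev_len, char *out)
{
    assert(prev != NULL && out != NULL && prev_len > 0);
    memcpy(out, prev, prev_len);
    out[prev_len] = prev[0];
    return prev_len + 1;
}
```

For example, the bytes `'h' 0x81 0x2C 'i'` split into the literal `h`, the index 300, and the literal `i`, and a missing entry that follows the string "ab" is derived as "aba".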

Final Index Handling:
- The program checks if the newly formed string (`pc`) exists in the dictionary using a checksum-based lookup.
- If `pc` is not found and the dictionary is not full, the program adds `pc` to the dictionary and updates the corresponding index.

Cleanup:
- After processing the entire input file, the program frees all dynamically allocated memory for the dictionary entries and linked lists.
- It closes the input and output files before exiting.
- The size of the encoded file is always less than or equal to the size of the source file.
- Both encoding and decoding processes complete in under 5 seconds for files smaller than 2MB in size.
To compile the LZW encoder and decoder, run `make` or the following commands:

    gcc -o lencode lencode.c
    gcc -o ldecode ldecode.c

To encode a file, run the command:

    ./lencode originalFile encodedFile

To decode an encoded file, run the command:

    ./ldecode encodedFile decodedFile

Replace originalFile, encodedFile, and decodedFile with the appropriate file names.
To run the sanity test script, run the command:

    ./autotest

The script will execute tests based on example files provided in the test folder.
This project presents a straightforward attempt at the problem with limited focus on performance optimisation.