This is a basic LZW codec written in C for compressing and decompressing ASCII text files using a 15-bit dictionary.

denniemok/lzw-codec

LZW Encoder and Decoder

Overview

The LZW (Lempel-Ziv-Welch) algorithm is a well-known data compression technique that uses a dictionary-based approach for encoding and decoding data. It is particularly effective in handling repetitive data, making it a popular choice for formats like GIF and TIFF.

This project implements an LZW encoder and decoder in C that compress and decompress text files using a 15-bit dictionary.

Constraints

  • The source file must contain only ASCII characters.
  • The source file size should not exceed a few megabytes.
  • No new dictionary entries can be added once the dictionary is full.

Technical Details

  • Character Encoding: Each ASCII character is encoded as one byte with the most significant bit set to 0. Because ASCII characters need only 7 bits, the top bit is free to distinguish literal characters from dictionary indexes.
  • Index Encoding: Each dictionary index is represented using two bytes, with the most significant bit set to 1. This allows for a maximum of 32,768 entries, suitable for compressing files within the specified size limit.
  • Minimum Entry Size: Dictionary entries are only referenced when they contain at least 3 characters. An index costs 2 bytes, so referencing an entry of 1 or 2 characters would save nothing.
  • Index Offset: Dictionary indexing starts from 0, not 256. Because single characters are written as literal bytes rather than dictionary codes, the dictionary does not need to reserve its first 256 slots for them, which frees 256 additional entries.
  • Hashed Indexing: Each dictionary entry is indexed by a checksum value. Entries sharing the same checksum are kept in a linked list, which speeds up dictionary lookups and resolves checksum collisions.
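
As an illustration of the two token formats described above, the following is a minimal sketch rather than the project's actual code. The function names put_literal, put_index and get_token are hypothetical, and storing the high 7 bits of the index in the first byte is an assumption; the description only requires that the first byte of an index has its most significant bit set.

```c
#include <stddef.h>
#include <stdint.h>

/* Write one ASCII character as a single byte with the top bit clear. */
size_t put_literal(unsigned char ch, unsigned char *out) {
    out[0] = ch & 0x7F;
    return 1;
}

/* Write a 15-bit dictionary index as two bytes.
 * ASSUMPTION: high 7 bits go first, together with the marker bit 0x80. */
size_t put_index(uint16_t index, unsigned char *out) {
    out[0] = (unsigned char)(0x80 | ((index >> 8) & 0x7F));
    out[1] = (unsigned char)(index & 0xFF);
    return 2;
}

/* Read one token and return the number of bytes consumed (1 or 2).
 * *is_index is 1 for a dictionary index and 0 for a literal character. */
size_t get_token(const unsigned char *in, int *is_index, uint16_t *value) {
    if (in[0] & 0x80) {
        *is_index = 1;
        *value = (uint16_t)(((in[0] & 0x7F) << 8) | in[1]);
        return 2;
    }
    *is_index = 0;
    *value = in[0];
    return 1;
}
```

With this layout, index 300 (0x012C) is stored as the bytes 0x81 0x2C, and a reader can tell the token kind from the first byte alone.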

Encoding Implementation

  1. Initialisation:

    • The program starts by checking if the correct number of command-line arguments is provided.
    • It opens the input file for reading in binary mode and the output file for writing.
  2. Dictionary Setup:

    • A dictionary (dict) is initialised to store up to 32,768 entries.
    • An array of linked lists (idx) is used to map checksum values to dictionary entries, enhancing lookup efficiency.
  3. Main Encoding Loop:

    • For each subsequent character read from the input file:
      • The program creates a string pc by appending the current character to the previous string p.
      • It calculates a checksum for pc and uses it to look pc up in the dictionary.
      • If pc is found, it becomes the new p, and the loop continues.
      • If pc is not found, it is added to the dictionary, provided the dictionary is not full.
      • In that case p is written to the output file: as its dictionary index if p is a dictionary entry long enough to be referenced, and otherwise as literal characters (see Minimum Entry Size above).
  4. Final Output:

    • After the entire input file is processed, the final value of p is output to the file.
    • The program ensures that any remaining data is written before closing the files.
  5. Cleanup:

    • The program frees all dynamically allocated memory, including the dictionary entries and linked lists.
    • The files are closed, and the program exits.
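
The encoding loop above can be sketched on in-memory buffers as follows. This is a simplified illustration under stated assumptions, not the repository's lencode.c: the names Entry, checksum, dict_find, dict_add, emit and lzw_encode are hypothetical, the checksum is assumed to be a byte sum reduced to NBUCKETS (an arbitrary value), strings shorter than 3 characters are assumed to be written as literals, and index bytes are assumed to be stored high bits first. File handling and cleanup are omitted, and the dictionary is global state, so the function is meant to be called once per process with an output buffer at least as large as the input.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 32768          /* 15-bit index space */
#define NBUCKETS    4096           /* ASSUMPTION: arbitrary bucket count */

typedef struct Entry {
    unsigned char *str;            /* entry text, not NUL-terminated */
    size_t len;
    int index;                     /* dictionary index, starting at 0 */
    struct Entry *next;            /* next entry with the same checksum */
} Entry;

static Entry *idx[NBUCKETS];       /* checksum -> linked list of entries */
static int dict_count;

/* ASSUMPTION: a plain byte sum stands in for the project's checksum. */
static unsigned checksum(const unsigned char *s, size_t len) {
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++) sum += s[i];
    return sum % NBUCKETS;
}

static Entry *dict_find(const unsigned char *s, size_t len) {
    for (Entry *e = idx[checksum(s, len)]; e; e = e->next)
        if (e->len == len && memcmp(e->str, s, len) == 0) return e;
    return NULL;
}

static void dict_add(const unsigned char *s, size_t len) {
    if (dict_count >= MAX_ENTRIES) return;      /* full: add nothing */
    Entry *e = malloc(sizeof *e);
    if (!e || !(e->str = malloc(len))) exit(EXIT_FAILURE);
    memcpy(e->str, s, len);
    e->len = len;
    e->index = dict_count++;
    unsigned h = checksum(s, len);
    e->next = idx[h];
    idx[h] = e;
}

/* Write p as a 2-byte index if it has 3+ characters, else as literals. */
static size_t emit(const unsigned char *p, size_t len, unsigned char *out) {
    Entry *e = len >= 3 ? dict_find(p, len) : NULL;
    if (e) {
        out[0] = (unsigned char)(0x80 | (e->index >> 8));
        out[1] = (unsigned char)(e->index & 0xFF);
        return 2;
    }
    memcpy(out, p, len);
    return len;
}

/* Encode n ASCII bytes from in to out; returns the encoded length. */
size_t lzw_encode(const unsigned char *in, size_t n, unsigned char *out) {
    static unsigned char p[MAX_ENTRIES + 2];    /* holds p, then pc */
    size_t plen = 0, o = 0;
    for (size_t i = 0; i < n; i++) {
        p[plen] = in[i];                        /* p[0..plen] is now pc */
        if (plen == 0 || dict_find(p, plen + 1)) { plen++; continue; }
        dict_add(p, plen + 1);                  /* pc is new: remember it */
        o += emit(p, plen, out + o);            /* write the old p */
        p[0] = in[i];                           /* current char starts next p */
        plen = 1;
    }
    if (plen) o += emit(p, plen, out + o);      /* final value of p */
    return o;
}
```

Under these assumptions, the 7-byte input aaaaaaa encodes to 6 bytes: three literal a bytes, the index bytes 0x80 0x01 (the entry aaa), and a final literal a.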

Decoding Implementation

  1. Initialisation:

    • The program starts by verifying that the correct number of command-line arguments (input file and output file) is provided.
    • It opens the input file for reading in binary mode and the output file for writing.
  2. Dictionary Setup:

    • A dictionary (dict) is initialised to store up to 32,768 entries.
    • An array of linked lists (idx) is used to map checksum values to dictionary entries, improving lookup efficiency.
  3. Main Decoding Loop:

    • The program reads the first character from the input file, writes it directly to the output file, and stores it as the previous character (p).
    • For each subsequent byte read from the input file:
      • If the byte is an ASCII value (less than 128), it is written directly to the output file.
      • If the byte is greater than or equal to 128, it is the first byte of a dictionary index, so the program reads one more byte to form the complete index.
      • The program then checks whether that dictionary entry exists. If it does, the corresponding string is written to the output file. If it does not, the encoder has referenced an entry it had only just created, so the string is derived from the previous string (in standard LZW, that string followed by its own first character) and written to the output file.
      • The program forms a new dictionary entry by appending the first character of the current string to the previous string and adds it to the dictionary if there is space.
  4. Final Index Handling:

    • The program checks if the newly formed string (pc) exists in the dictionary using a checksum-based lookup.
    • If pc is not found and the dictionary is not full, the program adds pc to the dictionary and updates the corresponding index.
  5. Cleanup:

    • After processing the entire input file, the program frees all dynamically allocated memory for the dictionary entries and linked lists.
    • It closes the input and output files before exiting.
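
The decoding steps above can be sketched on in-memory buffers as follows. This is a simplified illustration, not the repository's ldecode.c. It assumes an encoder that adds p plus the next character to the dictionary each time it emits p, writes strings shorter than 3 characters as literals, and stores index bytes high bits first. The names Entry, checksum, dict_find, dict_add, feed and lzw_decode are hypothetical, and the byte-sum checksum and NBUCKETS value are placeholders. To keep its dictionary in step, the sketch replays the encoder's lookup-and-add step for every decoded character, which is one possible approach rather than necessarily the project's. The output buffer must be large enough for the decoded text, and the global state means one call per process.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 32768          /* 15-bit index space */
#define NBUCKETS    4096           /* ASSUMPTION: arbitrary bucket count */

typedef struct Entry {
    unsigned char *str;            /* entry text, not NUL-terminated */
    size_t len;
    struct Entry *next;            /* next entry with the same checksum */
} Entry;

static Entry *dict[MAX_ENTRIES];   /* index -> entry */
static Entry *idx[NBUCKETS];       /* checksum -> linked list of entries */
static int dict_count;
static unsigned char p[MAX_ENTRIES + 2];   /* current string, as in the encoder */
static size_t plen;

/* ASSUMPTION: a plain byte sum stands in for the project's checksum. */
static unsigned checksum(const unsigned char *s, size_t len) {
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++) sum += s[i];
    return sum % NBUCKETS;
}

static Entry *dict_find(const unsigned char *s, size_t len) {
    for (Entry *e = idx[checksum(s, len)]; e; e = e->next)
        if (e->len == len && memcmp(e->str, s, len) == 0) return e;
    return NULL;
}

static void dict_add(const unsigned char *s, size_t len) {
    if (dict_count >= MAX_ENTRIES) return;      /* full: add nothing */
    Entry *e = malloc(sizeof *e);
    if (!e || !(e->str = malloc(len))) exit(EXIT_FAILURE);
    memcpy(e->str, s, len);
    e->len = len;
    unsigned h = checksum(s, len);
    e->next = idx[h];
    idx[h] = e;
    dict[dict_count++] = e;
}

/* Replay the encoder's dictionary update for one decoded character. */
static void feed(unsigned char c) {
    p[plen] = c;                                /* p[0..plen] is now pc */
    if (plen == 0 || dict_find(p, plen + 1)) { plen++; return; }
    dict_add(p, plen + 1);
    p[0] = c;
    plen = 1;
}

/* Decode n bytes from in to out; returns the decoded length. */
size_t lzw_decode(const unsigned char *in, size_t n, unsigned char *out) {
    static unsigned char s[MAX_ENTRIES + 2];    /* string for this token */
    size_t i = 0, o = 0;
    while (i < n) {
        size_t slen;
        if (in[i] < 128) {                      /* literal ASCII byte */
            s[0] = in[i++];
            slen = 1;
        } else {                                /* two-byte dictionary index */
            if (i + 1 >= n) break;              /* truncated input */
            int k = ((in[i] & 0x7F) << 8) | in[i + 1];
            i += 2;
            if (k < dict_count) {               /* entry already known */
                slen = dict[k]->len;
                memcpy(s, dict[k]->str, slen);
            } else if (k == dict_count && plen > 0) {
                memcpy(s, p, plen);             /* entry not created yet: */
                s[plen] = p[0];                 /* previous string + its first char */
                slen = plen + 1;
            } else {
                break;                          /* invalid index */
            }
        }
        memcpy(out + o, s, slen);
        o += slen;
        for (size_t j = 0; j < slen; j++) feed(s[j]);
    }
    return o;
}
```

Under these assumptions, the 6 bytes a a a 0x80 0x01 a decode to aaaaaaa. The index 1 arrives before the decoder has created entry 1, so this input exercises the missing-entry branch.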

Performance

  • The encoded file is never larger than the source file: a literal character costs one byte, and a two-byte index is only used for entries of at least 3 characters.
  • Both encoding and decoding complete in under 5 seconds for files smaller than 2 MB.

Instructions

To compile the LZW encoder and decoder, run the make command or the following:

gcc -o lencode lencode.c
gcc -o ldecode ldecode.c

To encode a file, run the command:

./lencode originalFile encodedFile

To decode an encoded file, run the command:

./ldecode encodedFile decodedFile

Replace originalFile, encodedFile, and decodedFile with the appropriate file names.

Sanity Test

To run the sanity tests, execute the script:

./autotest

The script will execute tests based on example files provided in the test folder.

Disclaimer

This project presents a straightforward attempt at the problem with limited focus on performance optimisation.
