Huffman Coding is one of the lossless data compression techniques. It assigns variable-length codes to the input characters, based on the frequencies of their occurence. The most frequent character is given the smallest length code.

I thought of implementing the data compression program. The key things in the implementation were:

  • Building frequency dictionary

  • Select 2 minimum frequency symbols and merge them repeatedly: Used Min Heap

  • Build a tree of the above process: Created a HeapNode class and used objects to maintain tree structure

  • Assign code to characters: Recursively traversed the tree and assigned the corresponding codes

  • Encode the input text. Added padding to the encoded text, if it’s not of a length of multiple of 8. Stored this padding information, in 8 bits, in the beginning of the resultant code.

  • Write the result to an output binary file, which will be our compressed file.

End result

After running on a several sample text files, Compression Ratio on an average was achieved to be 2.1 : 1.

So it’s like you have your very own text file compression program.

I implemented both the compression and decompression functions. Decompressing the compressed file brought back the original state of the file, without any data loss.

Before you look at the code from beginning, first check out the outline of the 2 functions compress (line 101) and decompress (line 157) , and then look at details of the other functions called from them.

Compression (line 101) Decompression (line 157)
  1. Make frequency dictionary
  2. Make heap
  3. Merge Nodes and build tree
  4. Make codes
  5. Encode Text
  6. Pad encodded text
  7. Make byte array
  8. Output the byte array to binary file
  1. Read binary file
  2. Remove padding
  3. Decode text
  4. Output decoded text to txt file

The class HuffmanCoding takes complete path of the text file to be compressed as parameter. (as its data members store data specific to the input file).

The compress() function returns the path of the output compressed file.

The function decompress() requires path of the file to be decompressed. (and decompress() is to be called from the same object created for compression, so as to get code mapping from its data members)

Running the program:

Save the above code.

Create a sample text file. Or download a sample file from sample.txt (right click, save as)

Save the code below, in the same directory as the above code, and Run this python code (edit the path variable below before running. initialize it to text file path)
from huffman import HuffmanCoding

#input file path
path = "/home/ubuntu/Downloads/sample.txt"

h = HuffmanCoding(path)

output_path = h.compress()

The compressed .bin file and the decompressed file are both saved in the same directory as of the input file.


On running on the above linked sample text file:

Initial Size: 715.3 kB
Compressed file Size: 394.0 kB

Plus, the decompressed file comes out to be exactly the same as the original file, without any data loss.

And that is all for Huffman Coding implementation, with compression and decompression. This was fun to code.