Huffman Coding

From Sega Retro


Huffman Coding,[1] also known as Huffman Compression,[2] is an algorithm for the lossless compression of files, based on how frequently each symbol occurs in the file being compressed. It was developed in 1952 by David Albert Huffman,[3] an American pioneer in computer science.[4]

The Huffman algorithm is based on statistical coding, which means that the probability of a symbol has a direct bearing on the length of its representation. The more probable a symbol is, the fewer bits are used to represent it. In any file, certain characters are used more than others. With a fixed-length binary representation, the number of bits required for each character depends on how many distinct characters have to be represented. One bit can represent two characters, i.e., 0 represents the first character and 1 represents the second. Two bits can represent four characters, and so on.
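The fixed-length rule above can be sketched in a few lines of Python. This is only an illustration; the function name `fixed_length_bits` is made up for this example and does not come from any particular tool.

```python
def fixed_length_bits(symbol_count):
    """Smallest number of bits that gives every symbol its own fixed-length code."""
    if symbol_count < 1:
        raise ValueError("need at least one symbol")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    # A lone symbol still needs one bit to be written at all.
    return max(1, (symbol_count - 1).bit_length())


for n in (2, 4, 128):
    print(n, "symbols need", fixed_length_bits(n), "bits each")
```

With 128 symbols this gives seven bits per symbol, which matches the size of the ASCII character set.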

The algorithm starts by building a list of all the alphabet symbols in descending order of their probabilities. It then constructs a tree,[5] with a symbol at every leaf, from the bottom up. This is done in steps, where at each step the two symbols with smallest probabilities are selected, added to the top of the partial tree, deleted from the list, and replaced with an auxiliary symbol representing the two original symbols. When the list is reduced to just one auxiliary symbol (representing the entire alphabet), the tree is complete. The tree is then traversed to determine the codes of the symbols.
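The steps above can be sketched in Python as follows. This is a minimal sketch under some assumptions, not the implementation used by any particular game: it counts symbol frequencies rather than probabilities, uses a priority queue (`heapq`) in place of the sorted list, and the name `huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(data):
    """Return a prefix code (symbol -> bit string) for the symbols in data."""
    freq = Counter(data)
    if not freq:
        return {}
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    # A node is either a symbol (leaf) or a (left, right) tuple, which plays
    # the role of the "auxiliary symbol" described in the text.
    heap = [(weight, next(tie), symbol) for symbol, weight in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Assumption: a file with a single distinct symbol gets the code "0".
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Select the two least frequent entries and merge them.
        w1, _, first = heapq.heappop(heap)
        w2, _, second = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (first, second)))

    codes = {}

    def walk(node, prefix):
        # Traverse the finished tree: left edge = "0", right edge = "1".
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


print(huffman_codes("XXXXXXYYYYZZ"))
```

The exact bit patterns depend on how ties are broken and on which branch is labelled 0, so different implementations can produce different but equally short codes.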

Unlike ASCII, which is a fixed-length code using seven bits per character, Huffman compression is a variable-length coding system that assigns shorter codes to more frequently used characters and longer codes to less frequently used characters, in order to reduce the size of files being compressed and transferred.

For example, consider a file with the following data:

XXXXXXYYYYZZ

The frequency of "X" is 6, the frequency of "Y" is 4, and the frequency of "Z" is 2. If each character is represented using a fixed-length code of two bits, the number of bits required to store this file is 24, i.e., (2 x 6) + (2 x 4) + (2 x 2) = 24. If the same data is compressed using Huffman compression, the more frequently occurring characters are represented by shorter codes, such as:

X by the code 0 (1 bit)
Y by the code 10 (2 bits)
Z by the code 11 (2 bits)

The size of the file then becomes 18 bits, i.e., (1 x 6) + (2 x 4) + (2 x 2) = 18.

In the above example, more frequently occurring characters are assigned shorter codes, resulting in fewer bits in the final compressed file.
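The arithmetic in this example can be checked with a short Python sketch. It uses the code table given in the example; the `decode` helper is a hypothetical addition that shows the data can be recovered exactly, which works because no code is a prefix of another.

```python
from collections import Counter

data = "XXXXXXYYYYZZ"
codes = {"X": "0", "Y": "10", "Z": "11"}  # code table from the example

counts = Counter(data)
fixed_bits = sum(2 * n for n in counts.values())  # two bits per character
huffman_bits = sum(len(codes[s]) * n for s, n in counts.items())
encoded = "".join(codes[s] for s in data)


def decode(bits, table):
    """Read bits left to right, emitting a symbol whenever a full code is seen."""
    inverse = {code: symbol for symbol, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


print(fixed_bits, huffman_bits)  # 24 18
print(encoded)                   # 000000101010101111
print(decode(encoded, codes) == data)  # True
```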

Burt Sloane of Recreational Brainware used an in-circuit emulator on the game The Revenge of Shinobi to find out how it was able to include so much art while using the same ROM size as Spider-Man vs. The Kingpin (4 Mbit). He then used Huffman Coding, modeling his compression on the algorithm used in The Revenge of Shinobi.

Mega Drive games that use Huffman Coding


  2. File:Data Compression The Complete Reference Book.pdf, page 99