Huffman Code is a lossless data compression algorithm that assigns variable-length binary codes to input characters based on their frequency. It is widely used in data compression applications such as file compression and encoding.
Concept
Huffman coding works by assigning shorter codes to more frequent characters and longer codes to less frequent characters, minimizing the total number of bits used to represent the data.
Algorithm
The Huffman coding algorithm follows these steps:
- Calculate Frequency: Count the occurrence of each character in the input data.
- Build a Priority Queue: Create a min-heap where each node represents a character and its frequency.
- Construct a Huffman Tree:
- Extract the two nodes with the lowest frequency.
- Merge them into a new node with their combined frequency.
- Repeat until only one node remains, which becomes the root of the tree.
- Assign Binary Codes: Traverse the tree and assign binary values (0 for left, 1 for right) to generate unique Huffman codes for each character.
Example
Given the input string hello world, the frequency table and corresponding Huffman codes might be:
Character | Frequency | Huffman Code |
---|---|---|
h | 1 | 1100 |
e | 1 | 1101 |
l | 3 | 10 |
o | 2 | 00 |
w | 1 | 1110 |
r | 1 | 1111 |
d | 1 | 01 |
The compressed representation replaces each character with its Huffman code.
Implementation
A simple implementation of Huffman coding in Python:
import heapq
from collections import Counter, namedtuple
class Node(namedtuple("Node", ["char", "freq", "left", "right"])):
def __lt__(self, other):
return self.freq < other.freq
def build_huffman_tree(text):
freq = Counter(text)
heap = [Node(char, freq, None, None) for char, freq in freq.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
merged = Node(None, left.freq + right.freq, left, right)
heapq.heappush(heap, merged)
return heap[0]
def generate_codes(node, prefix="", codebook={}):
if node:
if node.char:
codebook[node.char] = prefix
generate_codes(node.left, prefix + "0", codebook)
generate_codes(node.right, prefix + "1", codebook)
return codebook
text = "hello world"
tree = build_huffman_tree(text)
codes = generate_codes(tree)
print(codes)
Advantages
- Lossless Compression
- Ensures no data loss in reconstruction.
- Efficient Encoding
- Optimized for frequent characters, reducing overall storage size.
- Used in Many Applications
- Commonly applied in ZIP, GZIP, JPEG, and other compression formats.
Limitations
- Requires Frequency Table
- The table must be transmitted alongside the encoded data.
- Not Optimal for Small Texts
- Overhead can be significant for short messages.
- Sensitive to Input Data
- Performance depends on character frequency distribution.
Applications
- File Compression
- Used in ZIP and GZIP formats.
- Multimedia Encoding
- Applied in JPEG and MP3 compression.
- Data Transmission
- Efficient encoding in communication protocols.