Correctness of Huffman's algorithm

 First of all, I would like to give a bit of context and explain my motivation behind this article. I have gone through the book [1] and found that the proof of correctness is a bit less informative and making it a bit more hard for a reader to understand on their own. Then I checked some other lecture slides, materials and unfortunately found none of them helpful in understanding the proof easily. The first part of the proof, which is Lemma 16.2 is much better and has some illustrations to assist readers, hence I don’t think it requires more explanation. But the last part which is Lemma 16.3 lacks the required details, hence I thought of adding them here to assist other readers because I don’t want them to spend their time going through the painstaking process which I had when I was first reading the proof.

Lemma 16.3

Let $C$ be a given alphabet with frequency $c.freq$ defined for each character $c \in C$. Let $x$ and $y$ be two characters in $C$ with minimum frequency. Let $C^\prime$ be the alphabet $C$ with the characters $x$ and $y$ removed and a new character $z$ added, so that $C^\prime = C - \{x, y\}\; \cup \; \{z\}$. Define freq for $C^\prime$ as for $C$, except that $z.freq = x.freq + y.freq$. Let $T^\prime$ be any tree representing an optimal prefix code for the alphabet $C^\prime$. Then the tree $T$, obtained from $T^\prime$ by replacing the leaf node for $z$ with an internal node having $x$ and $y$ as children, represents an optimal prefix code for the alphabet $C$. 

Proof

We first show how to express the cost $B(T)$ of tree $T$ in terms of the cost $B(T^\prime)$ of the tree $T^\prime$, by considering the component costs in equation (16.4). Let’s first consider the two trees $T$ and $T^\prime$ with some sample alphabet for each represented by $C$ and $C^\prime$ respectively.

$C = \{f: 5,\; e: 9,\; c: 12,\; b: 13,\; d: 16,\; a: 45\}$  
$C^\prime = \{c: 12,\; b: 13,\; g: 14,\; d: 16,\; a: 45\}$

Now, let’s draw the two trees for the above two alphabet, denoted by $T$ and $T^\prime$ respectively. 


 



Now note that our $f$ and $e$ characters are equivalent to $x$ and $y$ in the general case respectively and our character $g$ is equivalent to the character $z$ in the general case.

For each character $c\; \in\; C - \{x,\; y\}$, we have that $d_T(c) = d_{T^\prime}(c)$, and hence $c.freq \cdot d_T(c) = c.freq \cdot d_{T^\prime}(c)$. Putting this into our instance, we have, $c\; \in\; C - \{f,\; e\} \equiv \{c, b, d, a\}$ and you may see that the depth of those characters are the same in both the trees $T$ and $T^\prime$ respectively. Also note that, $d_T(x) = d_T(y) = d_{T^\prime}(z) + 1$ is equivalent to $d_T(f) = d_T(e) = d_{T^\prime}(g) + 1$ in our instance.

Now we can obtain $T$ from the tree $T^\prime$ just by making the leaf node $g$ an internal node, and then adding $f$ and $e$ as it’s children. This is equivalent to changing the leaf node $z$ in $T^\prime$ into an internal node in $T$ and then adding $x$ and $y$ as it’s children in the general case.  

Now, we define the cost of the tree $T$ as,
$$B(T) = \sum_{c \in C}c.freq \cdot d_T(c)$$
Now the tree $T$, obtained from $T^\prime$ by inserting two characters $f$ and $e$ and then removing the character $g$. You may realize this if you compare the alphabet $C$ and $C^\prime$. This is equivalent to adding two characters $x$ and $y$ and then removing the character $z$ in our more general case. Now, using the above equation, we can write down the cost functions for two trees as  follows. Note that here after I’ll only consider the general case.

$$B(T) = B(T^\prime) + x.freq \cdot d_T(x) + y.freq \cdot d_T(y) - z.freq \cdot d_{T^\prime}(z)$$
$$x.freq \cdot d_T(x) + y.freq \cdot d_T(y) = (x.freq + y.freq) \cdot d_T(x) = (x.freq + y.freq) \cdot (d_{T^\prime}(z) + 1)$$
$$x.freq \cdot d_T(x) + y.freq \cdot d_T(y) = z.freq \cdot d_{T^\prime}(z) + x.freq + y.freq$$
$$B(T) = B(T^\prime) + x.freq + y.freq$$
$$B(T^\prime) = B(T) - x.freq - y.freq$$

We now prove the lemma by contradiction. Suppose that $T$ does not represent an optimal prefix code for $C$. Then there exists an optimal tree $T^{\prime\prime}$ such that $B(T^{\prime\prime}) \lt B(T)$. Without loss of generality (by Lemma 16.2), $T^{\prime\prime}$ has $x$ and $y$ as siblings. Let $T^{\prime\prime\prime}$ be the tree $T^{\prime\prime}$ with the common parent of $x$ and $y$ replaced by a leaf $z$ with frequency $z.freq = x.freq + y.freq$. Then

$$B(T^{\prime\prime \prime}) = B(T^{\prime \prime}) - x.freq - y.freq$$
$$B(T^{\prime\prime \prime}) \lt B(T) - x.freq - y.freq$$
$$B(T^{\prime\prime \prime}) \lt B(T^\prime)$$

yielding a contradiction to the assumption that $T^\prime$ represents an optimal prefix code for  $C^\prime$. Thus, $T$ must represent an optimal prefix code for the alphabet $C$. 

Conclusion

Now, given that $T^\prime$ is an optimal code tree for  $C^\prime$ which is $n - 1$ in size, we can conclude that $T$ yields an optimal tree for the alphabet $C$ of size $n$. And this proves the optimal substructure property of Huffman's algorithm.

References

[1] https://www.amazon.com/Introduction-Algorithms-3rd-MIT-Press/dp/0262033844

Comments

Popular posts from this blog

Introducing Java Reactive Extentions in to a SpringBoot Micro Service

Optimal binary search trees

Edit distance