Errata list for A Concise Introduction to Data Compression

Last updated: 9 Feb 2015

------ Error found by Ulrich Tipp

I am working with your book "A Concise Introduction to Data Compression" in a student seminar. Unfortunately, I cannot follow your argument on page 72, where the number of different Huffman codes is calculated. I agree with the count of 96 different code trees. But why are there 94 codes produced by the Huffman method, and not only 32? The argument that symbols 5 and 6 must start with a different bit excludes many more codes, because for the 3-bit codewords we still have more than two choices, don't we?

I look forward to hearing from you.
Kind regards,
Ulrich Tipp

------- Answer by Giovanni Motta:

Dear Ulrich,

I have read your notes and checked the discussion in Prof. Salomon's book. I think that both answers reported in Section 2.2.3 are numerically correct; however, the claim that the Huffman method generates 94 of the 96 codes is not. Answer 1 counts the number of ways a Huffman tree with 5 internal nodes (or 6 symbols) can be labeled. Answer 2 counts the number of optimal prefix codes for the distribution in the example (more on this later).

The probability distribution used in the example allows only a unique construction (unlike the example you provided in your notes), so there is only one tree that can be generated by strictly following the Huffman construction: the tree in Fig. 2.11(a), which is tree (A) in this 1-page pdf file (sorry for the handwriting). It is well known that for many probability distributions (such as the one in your example) there can be multiple Huffman trees. This is exploited in the construction of codes that minimize the variance of the codeword lengths or the maximum depth of the tree. Minimizing the depth of the tree, for example, is necessary to minimize the decoding delay (and so the buffering) of a variable-length code.

Returning to the example in the book, there are four possible minimal-length trees (labeled A through D in the same pdf file) that can be constructed starting from that distribution. ONLY ONE is a proper Huffman tree, since the remaining three constructions (B, C, D) at some point merge subtrees with probabilities that are not the smallest available. Nevertheless, they still generate codeword lengths of 3, 3, 3, 3, 2, 2, so they are optimal in the Huffman sense. The conclusion is that Answer 2 counts the number of such prefix codes: 24 * 4 = 96 codes, since each of the four non-isomorphic trees can be labeled in 24 different ways. This is different from the phenomenon you pointed out in your example, i.e., being able to make multiple choices during the construction.

I do not know of any closed formula to count, given a probability distribution, the number of Huffman codes. However, given the number of symbols, there is a formula to count the number of non-isomorphic Huffman trees. The numerical sequence should be the Catalan numbers:
http://www.research.att.com/~njas/sequences/index.html?q=1%2C+1%2C+2%2C+5%2C+14&language=english&go=Search
This problem is identical to ours:
- Number of ways to insert n pairs of parentheses in a word of n+1 letters.
- E.g., for n=3 there are five ways: ((ab)(cd)), (((ab)c)d), ((a(bc))d), (a((bc)d)), (a(b(cd))).

If you want to try deriving a closed formula, I would suggest starting from the lengths of the codewords rather than from the probability distribution, since the lengths are integers; but you still have the problem that one probability distribution can generate different sets of codeword lengths.

Let me know what you think.
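To check the parenthesization claim above, here is a minimal Python sketch (my own illustration, not code from the book; the function names are mine). catalan(n) evaluates the closed form C_n = binomial(2n, n)/(n + 1), and parenthesizations() brute-forces every full grouping of a word, reproducing the five groupings listed for n=3:

    from math import comb

    def catalan(n):
        # Closed form: C_n = binomial(2n, n) / (n + 1).
        return comb(2 * n, n) // (n + 1)

    def parenthesizations(word):
        # Combine adjacent terms pairwise until one term remains;
        # each result corresponds to one full binary tree shape.
        if len(word) == 1:
            return {word}
        results = set()
        for i in range(1, len(word)):
            for left in parenthesizations(word[:i]):
                for right in parenthesizations(word[i:]):
                    results.add("(" + left + right + ")")
        return results

    print([catalan(n) for n in range(6)])     # [1, 1, 2, 5, 14, 42]
    print(sorted(parenthesizations("abcd")))  # the 5 groupings for n = 3

The brute force returns exactly the five strings listed above, and for small n its counts match catalan(n).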
------ [Feb 2015] Error found by Yin Chen

Page 160, line -5: "and therefore equals \sqrt(1/n) times the average of the n data values" should be "and therefore equals \sqrt(n) times the average of the n data values".
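A quick numerical check of the corrected statement, under my assumption (the surrounding text of page 160 is not reproduced here) that the quantity in question is the sum of the n data values divided by \sqrt(n), as in the first coefficient of an orthonormal transform: since the sum is n times the average, dividing it by \sqrt(n) leaves \sqrt(n) times the average.

    import math

    # Assumption (mine, not the book's): the quantity is sum(data)/sqrt(n),
    # e.g. the first (DC) coefficient of an orthonormal transform.
    data = [3.0, 5.0, 7.0, 9.0]
    n = len(data)
    average = sum(data) / n                  # 6.0
    quantity = sum(data) / math.sqrt(n)      # 12.0
    print(quantity, math.sqrt(n) * average)  # 12.0 12.0 -- corrected factor
    print(math.sqrt(1 / n) * average)        # 3.0 -- the erroneous factor

------- "If you don't make mistakes you're not working on hard enough problems. And that's a big mistake." Frank Wilczek, unsolicited advice to a grad student (1974?); appears in Longing for the Harmonies, p. 261.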