Variable-Length Codes for Data Compression

By David Salomon. Errors, questions, comments, and suggestions should be emailed. Published by Springer Verlag, September 2007. ISBN 978-1-84628-958-3. xii+191 pages.

A BibTeX style file and an Errata list are available.

I would like to thank Giovanni Motta and Cosmin Truta for their valuable help in developing this book.

Preface

The dates of most of the important historical events are known, but not always very precisely. We know that Kublai Khan, grandson of Ghengis Khan, founded the Yuan dynasty in 1280 (it lasted until 1368), but we don't know precisely (i.e, the month, day and hour) when this act took place. A notable exception to this state of affairs is the modern age of telecommunications, a historical era whose birth is known precisely, up to the minute. On Friday, 24 May 1844, at precisely 9:45 in the morning, Samuel Morse inaugurated the age of modern telecommunications by sending the first telegraphic message in his new code. The message was sent over an experimental line funded by the American Congress from the Supreme Court chamber in Washington, D.C. to the B & O railroad depot in Baltimore, Maryland. Taken from the Bible (Numbers 23:23), the message was "What hath God wrought?" It had been suggested to Morse by Annie Ellsworth, the young daughter of a friend. It was prerecorded on a paper tape, was sent to a colleague in Baltimore, and was then decoded and sent back by him to Washington. An image of the paper tape can be viewed at http://memory.loc.gov/mss/mmorse/071/071009/0001d.jpg.

Morse was born near Boston and was educated at Yale. We would expect the inventor of the telegraph (and of such a sophisticated code) to have been a child prodigy who tinkered with electricity and gadgets from an early age (the electric battery was invented when Morse was nine years old). Instead, Morse became a successful portrait painter with more than 300 paintings to his credit. It wasn't until 1837 that the 46-year-old Morse suddenly quit his painting career and started thinking about communications and tinkering with electric equipment. It is not clear why he made such a drastic career change at such an age, but it is known that two large, wall-size paintings that he made for the Capitol building in Washington, D.C. were ignored by museum visitors and rejected by congressmen. It may have been this disappointment that gave us the telegraph and the Morse code.

Given this background, it is easy to imagine how the 53-year-old Samuel Morse felt on that fateful day, Friday the 24th of May 1844, as he sat hunched over his mysterious apparatus, surrounded by a curious crowd of onlookers, some of whom had only a vague idea of what he was trying to demonstrate. He must have been very anxious, because his telegraph project, his career, and his entire future depended on the success of this one test. The year before, the American Congress awarded him $30,000 to prepare this historical test and prove the value of the electric telegraph (and thus also confirm the ingenuity of yankees), and here he is now, dependent on the vagaries of his batteries, on the new, untested 41-mile-long telegraph line, and on a colleague in Baltimore.

Fortunately, all went well. The Friend in Baltimore received the message, decoded it, and resent it within a few minutes, to the great relief of Morse and to the amazement of the many congressmen assembled around him.

The morse code, with its quick dots and dashes, was extensively used for many years, first for telegraphy, and beginning in the 1890s, for early radio communications. The development of more advanced communications technologies in the 20th century displaced the Morse code, which is now largely obsolete. Today, it is used for emergencies, for navigational radio beacons, land mobile transmitter identification, and by continuous wave amateur radio operators.

Our interest in the Morse code is primarily with a little-known aspect of this code. In addition to its advantages for telecommunications, the Morse code is also an early example of text compression. The various dot-dash codes developed by Morse (and possibly also by his associate, Alfred Vail) have different lengths, and Morse intuitively assigned the short codes (a single dot and a single dash) to the letters E and T, the longer, four dots-dashes, he assigned to Q, X, Y, and Z. The even longer, five dots-dashes codes, were assigned to the 10 digits, and the longest codes (six dots and dashes) became those of the punctuation marks. Morse also specified that the signal for error is eight consecutive dots, in response to which the receiving operator should delete the last word received.

It is interesting to note that Morse was not the first to think of compression (in terms of time saving) by means of a code. The Well-known Braille code for the blind was developed by Louis Braille in the 1820s and is still in common use today. It consists of groups (or cells) of 3x2 dots each, embossed on thick paper. Each of the six dots in a group may be flat or raised, implying that the information content of a group is equivalent to six bits, resulting in 64 possible groups. The letters, digits, and common punctuation marks do not require all 64 codes, which is why the remaining groups may be used to code common words---such as |and|, |for|, and |of|---and common strings of letters---such as |ound|, |ation| and |th|.

The Morse code has another feature that makes it relevant to us. Because the individual codes have different lengths, there must be a way to identify the end of a code. Morse solved this problem by requiring accurate relative timing. If the duration of a dot is taken to be one unit, then that of a dash is three units, the space between the dots and dashes of one character is one unit, the space between characters is three units, and the interword space is six units (five for automatic transmission). This book is concerned with the use of variable-length codes to compress digital data. With these codes, it is important not to have any extra spaces. In fact, there is no such thing as a space, because computers use only zeros and 1's. Thus, when a string of data symbols is compressed by assigning short codes (that are termed "codewords") to the symbols, the codewords (whose lengths vary) are concatenated into a long binary string without any spaces or separators. Such variable-length codes must therefore be designed to allow for unambiguous reading. Somehow, the decoder should be able to read bits and identify the end of each codeword. Such codes are referred to as uniquely decodable or uniquely decipherable (UD).

Variable-length codes have become important in many areas of computer science. This book is a survey of this important topic. It presents the principles underlying this type of codes and describes the important classes of variable-length codes. Many examples illustrate the applications of these codes to data compression. The book is devoted to the codes, which is why it describes very few actual compression algorithms. Notice that many important (and some not so important) methods, algorithms, and techniques for compressing data are described in detail in "Data Compression: The Complete Reference, 4th edition".

The term "representation" is central to our discussion. A number can be represented in decimal, binary, or any other number base (or number system. Mathematically, a representation is a bijection (or a bijective function) of an infinite, countable set S_1 of strings onto another set S_2 of strings (in practice, S_2 consists of binary strings, but it may also be ternary or based on other number systems), such that any concatenation of any elements of S_2 is UD. The elements of S_1 are called data symbols and those of S_2 are codewords. Set S_1 is an alphabet and set S_2 is a code. An interesting example is the standard binary notation. We normally refer to it as the binary representation of the integers, but according to the definition above it is not a representation because it is not UD. It is easy to see, for example, that a string of binary codewords that starts with 11 can be either two consecutive 1's or the code of 3.

This book is aimed at readers who have a basic knowledge of data compression and who want to know more about the specific codes used by the various compression algorithms. The necessary mathematical background includes logarithms, polynomials, a bit of calculus and linear algebra, and the concept of probability. This book is not intended as a guide to software implementors and has no programs. Errors, mistypes, comments, and questions should be emailed.

It is my pleasant duty to acknowledge the substantial help and encouragement I have received from Giovanni Motta and Cosmin Truta and for their painstaking efforts. They read drafts of the text, found many errors and misprints, and provided valuable comments and suggestions that improved this book and made it what it is.

(Epigraph). The Preface is the most important part of the book. Even reviewers read a preface.

---Philip Guedalla

Book Cover

From the back cover:

Most data compression methods that are based on variable-length codes employ the Huffman or Golomb codes. However, there are a large number of less-known codes that have useful properties - such as those containing certain bit patterns, or those that are robust - and these can be useful. This book brings this large set of codes to the attention of workers in the field and to students of computer science.

David Salomon's clear style of writing and presentation, which has been familiar to readers for many years now, allows easy access to this topic. This comprehensive text offers readers a detailed, reader-friendly description of the variable length codes used in the field of data compression. Readers are only required to have a general familiarity with computer methods and essentially an understanding of the representation of data in bits and files.

Topics and Features:

Discusses codes in-depth, not the compression algorithms, which are readily available in many books

Includes detailed illustrations, providing readers with a deeper and broader understanding of the topic

Provides a supplementary author-maintained website, with errata and auxiliary material "http://www.davidsalomon.name/VLCadvertis/VLC.html"

Easily understood and used by computer science majors requiring only a minimum of mathematics

Can easily be used as a main or auxiliary textbook for courses on algebraic codes or data compression and protection

An ideal companion volume to David Salomon's fourth edition of Data Compression: The Complete Reference

Computer scientists, electrical engineers and students majoring in computer science or electrical engineering will find this volume a valuable resource, as will those readers in various physical sciences and mathematics.

Table of Contents

 Preface vii;
 Introduction 1;
1 Basic Codes 9;
1.1 Codes, Fixed- and Variable-Length 9;
1.2 Prefix Codes 12;
1.3 VLCs, Entropy, and Redundancy 13;
1.4 Universal Codes 18;
1.5 The Kraft-McMillan Inequality 19;
1.6 Tunstall Code 21;
1.7 Schalkwijk's Coding 23;
1.8 Tjalkens--Willems V-to-B Coding 28;
1.9 Phased-In Codes 31;
1.10 Redundancy Feedback (RF) Coding 33;
1.11 Recursive Phased-In Codes 37;
1.12 Self-Delimiting Codes 40;
1.13 Huffman Coding 41;
2 Advanced Codes 67;
2.1 VLCs for Integers 67;
2.2 Start-Step-Stop Codes 69;
2.3 Start/Stop Codes 71;
2.4 Elias Codes 72;
2.5 Levenstein Code 78;
2.6 Even Rodeh Code 79;
2.7 Punctured Elias Codes 80;
2.8 Other Prefix Codes 81;
2.9 Ternary Comma Code 84;
2.10 Location Based Encoding (LBE) 85;
2.11 Stout Codes 87;
2.12 Yamamoto's Recursive Code 89;
2.13 VLCs And Search Trees 93;
2.14 Taboo Codes 95;
2.15 Wang's Flag Code 100;
2.16 Yamamoto Flag Code 102;
2.17 Number Bases 106;
2.18 Fibonacci Code 108;
2.19 Generalized Fibonacci Codes 112;
2.20 Goldbach Codes 116;
2.21 Additive Codes 122;
2.22 Golomb Code 125;
2.23 Rice Codes 132;
2.24 Subexponential Code 134;
2.25 Codes Ending with "1" 135;
3 Robust Codes 139;
3.1 Codes For Error Control 139;
3.2 The Free Distance 145;
3.3 Synchronous Prefix Codes 146;
3.4 Resynchronizing Huffman Codes 152;
3.5 Bidirectional Codes 155;
3.6 Symmetric Codes 164;
3.7 VLEC Codes 166;
Summary and Unification 173;
Bibliography 175;
Index 183;

New Developments

The field of variable-length codes is still very active. As an example of a recent, sophisticated code please see "Prefix encoding by means of the (2,3)-representation of numbers" by Anatolii Anisimov, "IEEE Transactions on Information Theory", 2013, vol. 59, issue 4, pp. 2359-2374.

Important!

The following significant error was found by Jan de Vries. Section 1.9 on the phased-in codes is all wrong. The correct description of these interesting codes can now be found here.

Review

The following review, by Farhan Nasim, was published in the Journal of the ACM SIGACT, volume 44, number 4, pages 24-26, December 2013.

The author followed quite an informal style to write the book; this makes the book more readable than a usual technical book. Mathematical backgrounds required to understand any code or any class of codes are reviewed before describing that. Some of the codes are accompanied by Mathematica codes that implement them; readers who want to do experiments with the code will find them helpful. The book's fairly rich bibliography will help those who want to study the topic further. Besides all of these, the biographical sketches of people who made important contributions to variable-length codes, historical remarks, quotations, etc., make the book more enjoyable.

The book can serve well as a gentle introduction to the topic as well as a supplement to advanced books on the topic. Any advanced computer science undergraduate or students of other mathematical disciplines with introductory knowledge of computer and data communication can self-study the book.

Quotation

If, by any chance, I have omitted anything more or less proper or necessary, I beg forgiveness, since there is no one who is without fault and circumspect in all matters.

---Leonardo Fibonacci, 1202

Last updated on 15 Mar 2014.