A Concise Introduction to Data Compression

By David Salomon. Errors, questions, comments, and suggestions should be emailed. Published by Springer in 2008. ISBN 978-1-84800-071-1. xii + 314 pages.

A BibTeX style file and an Errata list are available.

I would like to thank Giovanni Motta and Cosmin Truta for their valuable help in developing this book.

Book Cover

From the back cover

Compressing data is an option naturally selected when faced with problems of high costs or restricted space. Written by a renowned expert in the field, this book offers readers a succinct, reader-friendly foundation to the chief approaches, methods and techniques currently employed in the field of data compression.

Part I presents the basic approaches to data compression and describes a few popular techniques and methods commonly used to compress data. The reader discovers essential concepts, such as variable-length and prefix codes, statistical distribution and run-length encoding. Part II then concentrates on advanced techniques, such as arithmetic coding, orthogonal transforms, subband transforms and the Burrows-Wheeler transform.

Features:

Clear overview of the principles underlying this field

Outlines the essentials of the various approaches to compressing data

Contains many learning aids such as: chapter introductions and summaries, chapter-end exercises, comprehensive glossary, etc.

Provides numerous examples of important compression algorithms

Offers a supplementary author-maintained website, with errata and auxiliary material

An ideal introductory volume to David Salomon's fourth edition of Data Compression: The Complete Reference

Complete and clear, this book is the perfect resource for undergraduates in computer science and requires a minimum of mathematics. It is also ideal for readers with a basic knowledge of computer science wanting to learn about data compression.

Preface

It is virtually certain that a reader of this book is both a computer user and an Internet user, and thus the owner of digital data. More and more people all over the world generate, use, own, and enjoy digital data. Digital data is created (by a word processor, a digital camera, a scanner, an audio A/D converter, or other devices), it is edited on a computer, stored (either temporarily, in memory, less temporarily, on a disk, or permanently, on an optical medium), transmitted between computers (on the Internet or in a local-area network), and output (printed, watched, or played, depending on its type).

These steps often apply mathematical methods to modify the representation of the original digital data, because of three factors, time/space limitations, reliability (data robustness), and security (data privacy). These are discussed in some detail here:

The first factor is time/space limitations. It takes time to transfer even a single byte either inside the computer (between the processor and memory) or outside it over a communications channel. It also takes space to store data, and digital images, video, and audio files tend to be large. Time, as we know, is money. Space, either in memory or on our disks, doesn't come free either. More space, in terms of bigger disks and memories, is becoming available all the time, but it remains finite. Thus, decreasing the size of data files saves time, space, and money---three important resources. The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data, before it is stored or transmitted).

In addition to being a useful concept, the idea of saving space and time by compression is ingrained in us humans, as illustrated by (1) the rapid development of nanotechnology and (2) the quotation at the end of this Preface.

The second factor is reliability. We often experience noisy telephone conversations (with both cell and landline telephones) because of electrical interference. In general, any type of data, digital or analog, sent over any kind of communications channel may become corrupted as a result of channel noise. When the bits of a data file are sent over a computer bus, a telephone line, a dedicated communications line, or a satellite connection, errors may creep in and corrupt bits. Watching a high-resolution color image or a long video, we may not be able to tell when a few pixels have wrong colors, but other types of data require absolute reliability. Examples are an executable computer program, a legal text document, a medical X-ray image, and genetic information. Change one bit in the executable code of a program, and the program will not run, or worse, it may run and do the wrong thing. Change or omit one word in a contract and it may reverse its meaning. Reliability is therefore important and is achieved by means of error-control codes. The formal name of this mathematical discipline is channel coding, because these codes are employed when information is transmitted on a communications channel.

The third factor that affects the storage and transmission of data is security. Generally, we do not want our data transmissions to be intercepted, copied, and read on their way. Even data saved on a disk may be sensitive and should be hidden from prying eyes. This is why digital data can be encrypted with modern, strong encryption algorithms that depend on long, randomly-selected keys. Anyone who doesn't possess the key and wants access to the data may have to resort to a long, tedious process of either trying to break the encryption (by analyzing patterns found in the encrypted file) or trying every possible key. Encryption is especially important for diplomatic communications, messages that deal with money, or data sent by members of secret organizations. A close relative of data encryption is the field of data hiding (steganography). A data file A (a payload) that consists of bits may be hidden in a larger data file B (a cover) by taking advantage of ``holes'' in B that are the result of redundancies in the way data is represented in B.

Overview and goals

This book is devoted to the first of these factors, namely data compression. It explains why data can be compressed, it outlines the principles of the various approaches to compressing data, and it describes several compression algorithms, some of which are general, while others are designed for a specific type of data.

The goal of the book is to introduce the reader to the chief approaches, methods, and techniques that are currently employed to compress data. The main aim is to start with a clear overview of the principles behind this field, to complement this view with several examples of important compression algorithms, and to present this material to the reader in a coherent manner.

Organization and features

The book is organized in two parts, basic concepts and advanced techniques. The first part consists of the first three chapters. They discuss the basic approaches to data compression and describe a few popular techniques and methods that are commonly used to compress data. Chapter 1 introduces the reader to the important concepts of variable-length codes, prefix codes, statistical distributions, run-length encoding, dictionary compression, transforms, and quantization. Chapter 2 is devoted to the important Huffman algorithm and codes, and Chapter 3 describes some of the many dictionary-based compression methods.

The second part of this book is concerned with advanced techniques. The original and unusual technique of arithmetic coding is the topic of Chapter 4. Chapter 5 is devoted to image compression. It starts with the chief approaches to the compression of images, explains orthogonal transforms, and discusses the JPEG algorithm, perhaps the best example of the use of these transforms. The second part of this chapter is concerned with subband transforms and presents the WSQ method for fingerprint compression as an example of the application of these sophisticated transforms. Chapter 6 is devoted to the compression of audio data and in particular to the technique of linear prediction. Finally, other approaches to compression---such as the Burrows-Wheeler method, symbol ranking, and SCSU and BOCU-1-are given their due in Chapter 7.

The many exercises sprinkled throughout the text serve two purposes, they illuminate subtle points that may seem insignificant to readers and encourage readers to test their knowledge by performing computations and obtaining numerical results.

Other aids to learning are a prelude at the beginning of each chapter and various intermezzi where interesting topics, related to the main theme, are examined. In addition, a short summary and self-assessment exercises follow each chapter. The glossary at the end of the book is comprehensive, and the index is detailed, to allow a reader to easily locate all the points in the text where a given topic, subject, or term appear.

Other features that liven up the text are puzzles (with answers at the end of the book) and various boxes with quotations or with biographical information on relevant persons.

Target audience

This book was written with undergraduate students in mind as the chief readership. In general, however, it is aimed at those who have a basic knowledge of computer science; who know something about programming and data structures; who feel comfortable with terms such as bit, mega, ASCII, file, I/O, and binary search; and who want to know how data is compressed. The necessary mathematical background is minimal and is limited to logarithms, matrices, polynomials, calculus, and the concept of probability. This book is not intended as a guide to software implementors and has few programs.

The book's web site, with an errata list, BibTeX information, and auxiliary material, is part of the author's web site, located at |http://www.ecs.csun.edu/~dsalomon/|. Any errors found, comments, and suggestions should be directed to |[email protected]|.

Acknowlegments

I would like to thank Giovanni Motta John Motil for their help and encouragement. Giovanni also contributed to the text and pointed out numerous errors.

In addition, my editors at Springer Verlag, Wayne Wheeler and Catherine Brett, deserve much praise. They went over the manuscript, made numerous suggestions and improvements, and contributed much to the final appearance of the book.

Lakeside, California, David Salomon

August 2007

To see a World in a Grain of Sand And a Heaven in a Wild Flower, Hold Infinity in the palm of your hand And Eternity in an hour.

-William Blake, Auguries of Innocence

Preface vii;
Part I: Basic Concepts 1;
Introduction 5;
1 Approaches to Compression 21;
 1.1 Variable-Length Codes 25;
 1.2 Run-Length Encoding 41;
  Intermezzo: Space-Filling Curves 46;
 1.3 Dictionary-Based Methods 47;
 1.4 Transforms 50;
 1.5 Quantization 51;
 Chapter Summary 58;
2 Huffman Coding 61;
 2.1 Huffman Encoding 63;
 2.2 Huffman Decoding 67;
 2.3 Adaptive Huffman Coding 76;
  Intermezzo: History of Fax 83;
 2.4 Facsimile Compression 85;
 Chapter Summary 90;
3 Dictionary Methods 93;
 3.1 LZ78 95;
  Intermezzo: The LZW Trio 98;
 3.2 LZW 98;
 3.3 Deflate: Zip and Gzip 108;
 Chapter Summary 120;
Part II: Advanced Techniques 121;
4 Arithmetic Coding 123;
 4.1 The Basic Idea 124;
 4.2 Implementation Details 130;
 4.3 Underflow 133;
 4.4 Final Remarks 134;
  Intermezzo: The Real Numbers 135;
 4.5 Adaptive Arithmetic Coding 137;
 4.6 Range Encoding 140;
 Chapter Summary 142;
5 Image Compression 145;
 5.1 Introduction 146;
 5.2 Approaches To Image Compression 148;
  Intermezzo: History of Gray Codes 153;
 5.3 Image Transforms 154;
 5.4 Orthogonal Transforms 158;
 5.5 The Discrete Cosine Transform 162;
  Intermezzo: Statistical Distributions 180;
 5.6 JPEG 181;
  Intermezzo: Human Vision and Color 186;
 5.7 The Wavelet Transform 200;
 5.8 Filter Banks 217;
 5.9 WSQ, Fingerprint Compression 221;
 Chapter Summary 227;
6 Audio Compression 229;
 6.1 Companding 232;
 6.2 The Human Auditory System 233;
  Intermezzo: Heinrich Georg Barkhausen 236;
 6.3 Linear Prediction 237;
 6.4 Mu-Law and A-Law Companding 240;
 6.5 Shorten 246;
 Chapter Summary 247;
7 Other Methods 249;
 7.1 The Burrows-Wheeler Method 250;
  Intermezzo: Fibonacci Codes 255;
 7.2 Symbol Ranking 256;
 7.3 SCSU: Unicode Compression 260;
 Chapter Summary 266;
Bibliography 267;
Glossary 273;
Solutions to Puzzles 283;
Answers to Exercises 285;
Index 307;

Auxiliary Material

• Number of Huffman Codes. The following message was sent on 2 December 2009 by Prof Ulrich Tipp ([email protected]) to the data compression community:

Dear Coding enthusiast,

in my seminar on Data Compression I am working with my students with the Book "A concise Introduction to Data Compression" by David Salomon. When we came to chapter 2.2.3 on the "Number of Huffman Codes" I was not able to understand the reason why there should be 94 instead of 96 or 32 Codes (in the example given there). Further I missed the possibility that the tree of Huffman Codes might be unisomorphic in special probability distributions. You can find an example in the attached note. I wrote the question how to get a general formula for the number of Huffman codes (or trees) to David Salomon and he encouraged me to put this question to a broader audience. For this purpose he gave me your e-mail-address. Any hints or comments are welcome.

• The Intermezzo on page 135 is a short discussion of real numbers and their baffling properties. The document found here (PDF, 9 pages, 100KB) is an extension of this discussion and includes the concept of computability and the famous uncomputable number Ω (Omega).

Book Review

by Ville Hautamaki in ACM SIGACT News, June 2012.

Overview

Prof. David Salomon is known for many excellent books covering different aspects of data communications. One of his main focuses is source coding, more commonly known as data compression. This book aims to bring the wealth of knowledge from his previous books to a one- or two-semester undergraduate course.

Style

I have always enjoyed reading Prof. Salomon's books, a large part of the enjoyment comes from the his judicious use of quotations to spice up the text. This book is consistent with that style. In the present book, multiple explanations and examples are used to clarify and expose different and important concepts, such as entropy and variable length coding.

The book also includes a numerous program code snippets. These programs are clearly written and so should be easily understood by students.

Opinion

The book is excellent and fills the stated goal. When I am going to teach data compression again, I will most definitely use A Concise Introduction to Data Compression as a textbook for the class.

Quotation

Nothing is more impossible than to write a book that wins every reader's approval.

---Miguel de Cervantes

Last Updated 8 Nov, 2012.