Data Compression Techniques Using SQ and USQ

Mark DiVecchio
San Diego IBM PC Users Group

In any computer system, efficient management of the file storage space is important. The two programs SQ and USQ reduce the size of data files by using the Huffman data compression algorithm.

A file is composed of a sequence of bytes, and a byte is composed of 8 bits, so a byte can have a decimal value between 0 and 255. A typical text file, like a C language program, is composed of the ASCII character set (decimal 0 to 127). This character set requires only seven of the eight bits in a byte. Just think -- if there were a way to use that extra bit for another character, you could reduce the size of the file by 12.5%. For those of you who use upper case characters only, you use only about 100 of the ASCII codes from 0 to 255. You could save 60% of your storage if the bits not needed within each byte could be used for other characters.

What if you could encode the most frequently used characters in the file with only one bit? For example, if you could store the letter "e" (the most commonly used letter in the English language) using only the single bit "1", you would save 87.5% of the space that the "e" normally takes when it is stored as the ASCII "0110 0101" binary.

The answer to these questions is the SQ and USQ programs. The SQ program uses the Huffman coding technique: it counts the frequency of use of each of the 256 possible byte patterns and then assigns each character a translation to a bit string. All of these bit strings are placed end to end and written onto a disk file. The encoding information is also put on the file, since the USQ program needs to know the character distribution of the original file. The USQ program reads in the encoding information and then reads in the encoded file. It is then a simple matter to scan the encoded file and produce an output file which is identical to the file that SQ started with.

Huffman Coding Technique

This is by far the most popular encoding technique in use today. Huffman encoding replaces fixed-length characters with variable-length bit strings. The length of each bit string is roughly inversely proportional to the frequency of occurrence of the character. For those of you inclined to such symbolism:

    length of bit string  ~=  log2 (1 / probability of character)

The implementation of the algorithm which we will discuss encodes fixed bit strings of length 8. The algorithm requires two passes through the input file. The first pass builds a table of 256 entries showing the frequency of occurrence of each of the 256 possible values for a byte of information. Once the counting is complete, the algorithm must decide which bit string to associate with each of the 256 characters that were found in the file. Note that if a particular byte value was never used, no string association is needed. The second pass through the input file converts each byte into its encoded string. Remember that when the output file is created, the information used for encoding must also be written on the file for use by the decoding program. The decoding program reads the encoding information from the file and then starts reading the bit strings. As soon as enough bits are read to interpret a character, that character is written onto the final output file. See the next two sections on how SQ and USQ actually implement this.
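As a concrete illustration of the first pass, here is a minimal sketch in C of how the frequency table might be built. It is not taken from SQ itself; the file name is made up for the example, and a real program would add error reporting.

    #include <stdio.h>

    /* First pass of the Huffman scheme described above: count how often
       each of the 256 possible byte values occurs in the input file.
       This is only a sketch -- SQ's own code is organized differently. */
    int main(void)
    {
        long count[256] = {0};               /* one counter per byte value */
        FILE *fp = fopen("INPUT.TXT", "rb"); /* hypothetical input file    */
        int c;

        if (fp == NULL)
            return 1;
        while ((c = getc(fp)) != EOF)
            count[c]++;                      /* tally this byte value      */
        fclose(fp);

        /* Print the distribution; a real encoder would instead use it
           to build the Huffman tree and the translation table. */
        for (c = 0; c < 256; c++)
            if (count[c] != 0)
                printf("byte %3d occurs %ld times\n", c, count[c]);
        return 0;
    }

Notice that the counters here are longs; SQ itself, as described below, keeps the counts in 16-bit values and must guard against overflow on very large files.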
Even though this article primarily addresses ASCII input files, there is nothing which restricts the algorithm to ASCII. It will work on binary files (.COM or .EXE) as well. But since the length of each encoded bit string is roughly inversely related to the frequency of occurrence of each 8-bit byte value, a binary file may not compress very much. This is because a binary file most likely has a fairly uniform distribution over the 256 possible values of a byte. A machine language program is not like the English language, where the letter "e" is used far more than other letters. If the distribution is uniform, the encoded bit strings will all be about the same length, and the encoded file could be longer than the original (because of the encoding information on the front of the file). All of this has to be qualified, because machine language programs tend to use a lot of "MOV" instructions and have a lot of bytes of zeros, so encoding .COM and .EXE files does save some disk space.

SQ

The SQ program is an example of the Huffman algorithm. The first thing that SQ does is read through the input file and create a distribution array for the 256 possible characters. This array contains counts of the number of occurrences of each of the 256 characters. The program keeps each count in a 16 bit number and makes sure that, if you are encoding a big file, no count exceeds a 16 bit value. This is highly unlikely, but it must be accounted for. At the same time, SQ removes strings of identical characters and replaces them with the ASCII character DLE followed by a character count of 2-255; a real ASCII DLE in the data is replaced with the pair of characters DLE DLE. This is not related to the Huffman algorithm but just serves to compress the file a little more.

Once SQ has scanned the input file, it creates a binary tree structure containing this frequency-of-occurrence information. The most frequently occurring characters have the shortest path from the root to their node, and the least frequently occurring characters have the longest path. For example, if your file were:

    ABRACADABRA

(a very simple and magical example), the table of frequency of occurrences would be:

    Letter          # of Occurrences
    ------          ----------------
    A                      5
    B                      2
    C                      1
    D                      1
    R                      2
    all the rest           0

Since the letter "A" occurs most often, it should have the shortest encoded bit string. The letters "C" and "D" should have the longest. The other characters, which don't appear in the input file, don't need to be considered. SQ would create a binary tree to represent this information. The tree might look something like this (for purposes of discussion only):

                root            <-- Computer trees are
               /    \               always upside down!
              1      0          <-- This is called a node.
             /      / \
            A      1   0
                  /   / \       <-- This is called a branch.
                 B   1   0
                    /   / \
                   C   1   0
                      /     \
                     D       R  <-- This is a leaf.

From this, our encoded bit strings, which are kept in a translation table, would be:

    Table Entry    Character    Binary
    -----------    ---------    ------
         1             A        1
         2             B        01
         3             C        001
         4             D        0001
         5             R        0000

The output file would be:

    A  B   R     A  C    A  D     A  B   R     A
    ---------------------------------------------
    1  01  0000  1  001  1  0001  1  01  0000  1    (binary)

    A1 31 A1                                        (hex)

We have reduced the size of the file from eleven bytes to three bytes, a savings of about 73%. For this simple example, things aren't really that good, since we must also put the binary tree encoding information onto the file, so the file actually grows quite a bit. But consider a file with the word ABRACADABRA repeated 100,000 times. Now the encoding information is a very, very small percentage of the output file, and the file will shrink tremendously.
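To make the second pass concrete, here is a small C sketch that packs the ABRACADABRA bit strings from the table above into bytes. The translation table is simply typed in from the example rather than built from a real Huffman tree, so this illustrates only the bit-packing idea, not SQ's actual code.

    #include <stdio.h>

    /* Encode "ABRACADABRA" using the translation table from the example
       above and print the packed bytes. */
    int main(void)
    {
        const char *message = "ABRACADABRA";
        const char *code[128];      /* bit strings indexed by character */
        unsigned char out = 0;      /* the byte currently being filled  */
        int nbits = 0;              /* how many bits of 'out' are used  */
        int i;
        const char *p;

        code['A'] = "1";
        code['B'] = "01";
        code['C'] = "001";
        code['D'] = "0001";
        code['R'] = "0000";

        for (i = 0; message[i] != '\0'; i++) {
            for (p = code[(int)message[i]]; *p != '\0'; p++) {
                out = (unsigned char)((out << 1) | (*p - '0')); /* shift in one bit */
                if (++nbits == 8) {              /* a full byte -- emit it */
                    printf("%02X ", out);
                    out = 0;
                    nbits = 0;
                }
            }
        }
        if (nbits > 0)                           /* flush any leftover bits */
            printf("%02X ", (unsigned char)(out << (8 - nbits)));
        printf("\n");
        return 0;
    }

Run on the eleven-character input, this prints A1 31 A1, the same three bytes shown above. SQ also has to put the tree information at the front of the output so that USQ can rebuild the same table.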
SQ opens the output file and writes out the binary tree information. Then it rewinds the input file and rereads it from the beginning. As it reads each character, it looks into the translation table and outputs the corresponding bit string. SQ is a little more complicated than what I have outlined, since it must operate in the real world of hardware, but this is a fairly complete description of the algorithm.

USQ

The USQ program is very straightforward. It reads in the encoding information written out by SQ and builds the identical binary tree that SQ used to encode the file. USQ then reads the input file as if it were a string of bits. Starting at the root of the tree, it traverses one branch of the tree with each input bit. When it reaches a leaf, it has a character, and that character is written to the output file. USQ then starts at the root again with the next bit from the input file.

What does it all mean?

Now that we understand the algorithm and a little about how the SQ and USQ programs work, we can use that knowledge to help us run our systems a little more efficiently. (So all of this theory is worth something after all!)

1. Files must be above a threshold size, or else the output file will be longer than the input file because of the decoding information put at the beginning of the compressed data. We don't know the exact size of the threshold because the encoding binary tree information depends on the distribution of the characters in a file. At least we know to check the size of the encoded file after we run SQ to make sure our file didn't grow.

2. Some files will not compress well if they have a uniform distribution of byte values -- .COM or .EXE files, for example. This is because of the way SQ builds the tree. Remember that bytes with the same frequency of occurrence are (usually) at the same depth in the tree, so if all of the bytes have about the same frequency, the output strings are all the same length.

3. SQ reads the input file twice. If you can, use a RAM disk, at least for the input file, or for both files if you have the room. The next best case is to use two floppy drives, one for input and one for output. This will cause a lot of disk starts and stops but not much head movement. The worst case is to use one floppy drive for both input and output, which causes a lot of head movement as the programs alternate between the input and output files.

Other Compression Techniques

RUN-LENGTH ENCODING

Run-length encoding is a technique whereby sequences of identical bytes are replaced by the repeated byte and a byte count. As you might guess, this method is effective only on very specialized files. One good candidate is a screen display buffer, since a screen is made up mostly of "spaces". A completely blank line could be reduced from 80 bytes of spaces to one space followed by a count of 80. To go from 80 bytes down to two bytes is a savings of almost 98%. (A short sketch of this idea appears at the end of this section.) You might guess that for text files or binary files, this technique does not work well at all.

ADAPTIVE COMPRESSION

This technique replaces strings of characters with codes. For example, the string "ABRACADABRA" would be replaced by a single code; typical algorithms use a 12 bit code. The approach is notable in that it requires only a single pass through the input file, with the encoding taking place as the file is read. The current incarnation of this procedure is called the LZW method (after its co-inventors A. Lempel, J. Ziv and T. Welch). The method is claimed to save 66% on machine language files and up to 83% on COBOL files.
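Here is the short sketch of run-length encoding promised above. The (value, count) output format is invented for the illustration; it is not the exact scheme used by SQ's DLE preprocessing or by any particular screen packer.

    #include <stdio.h>

    /* A bare-bones run-length encoder: every run of identical bytes is
       written as the byte value followed by a count of 1 to 255.  Data
       with no runs would actually double in size under this scheme,
       which is why run-length encoding pays off only on specialized
       files such as screen buffers. */
    void rle_encode(const unsigned char *in, int len)
    {
        int i = 0;
        while (i < len) {
            unsigned char value = in[i];
            int count = 1;
            while (i + count < len && in[i + count] == value && count < 255)
                count++;
            printf("%02X %02X  ", (unsigned)value, (unsigned)count); /* emit (value, count) */
            i += count;
        }
        printf("\n");
    }

    int main(void)
    {
        unsigned char line[80];
        int i;

        for (i = 0; i < 80; i++)    /* a completely blank 80-column line */
            line[i] = ' ';
        rle_encode(line, 80);       /* prints "20 50": two bytes for 80  */
        return 0;
    }

For the blank line, the output is the single pair 20 50 (hex), a space and a count of 80 -- the two bytes mentioned in the run-length paragraph above.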
Other Reading

If you would like to read more about data compression techniques, you may find these articles of interest:

H. K. Reghbati, "An Overview of Data Compression Techniques," Computer Magazine, Vol. 14, No. 4, April 1981, pp. 71-76.

T. A. Welch, "A Technique for High-Performance Data Compression," Computer Magazine, Vol. 17, No. 6, June 1984, pp. 8-19.

J. Ziv and A. Lempel, "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977, pp. 337-343.