Deja Vu #06
30 September 1998
CODING - Data compression algorithms.
SoundTrack: RELIZ BY D.J.DARKMAN / SMG 1998
__________________________________________
(C) Diver / CRG
(C) Elena Gladkova / CRG
__________________________________________

          Data compression algorithms
          Part 1. Huffman compression

Introduction
-----------------------------------------
This series of articles is devoted to a systematic exposition of data compression algorithms. Why might it interest you?

Do you want to know how data compression programs work? The compression algorithms they use will be described in a detailed and systematic way, and perhaps, if readers wish, future instalments will include source code illustrating how these algorithms work. :-)

Do you want to use data compression in programs you are writing, or are you going to write the 'coolest' packer on the Speccy? Are you interested in the scientific side of data compression? A systematic exposition with brief mathematical digressions will help you acquire reasonably solid knowledge of the subject.

The magic formula: COMPRESSION = MODELING + CODING
-----------------------------------------
Data compression is the process of reducing the number of bits needed to store a certain amount of information.

Lossless compression: the information recovered from the compressed state corresponds exactly to the original (as it was before compression).

Lossy compression: after decompression the information corresponds to the original only partially (used in processing images and sound).

Generally speaking, data compression is the process of taking the characters of some message and translating them into codes. If this process is organized effectively, the resulting coded message takes up less space than the original.

While scanning the message being processed, a compression algorithm carries out two almost independent processes:
- it maintains a model of the message processed so far;
- based on that model, it CODES the next fragment of the message.

The whole compression process is often mistakenly identified with the coding step alone, whereas coding is only the part of the process that interacts with the data model. The same coding method compresses differently under different modeling techniques. So if you ever hear a statement like "Huffman coding gives optimal results, it is proved mathematically", or, conversely, "Huffman coding does not give good results", take it easy: both statements are, to put it mildly, inaccurate. In every case the overall result of a compression algorithm depends both on the modeling method and on the coding method.

Huffman coding
-----------------------------------------
A graph is a set of nodes together with a set of arcs directed from one node to another.

A tree is a graph with the following properties:
a) no node has more than one incoming arc (i.e. there are no cycles);
b) exactly one node has no incoming arc at all; it is called the root of the tree;
c) moving along the arcs from the root, you can reach any node.

A leaf of a tree is a node with no outgoing arcs.

In a pair of tree nodes connected by an arc, the node the arc leaves is called the parent, the other the child. Two nodes are called siblings if they have the same parent.

A binary tree is a tree in which exactly two arcs leave every node except the leaves.

A Huffman coding tree (H-tree) is a binary tree in which every node has a weight, and the weight of a parent equals the sum of the weights of its children.

The input alphabet is the set of symbols that occur in the message.
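To make these definitions concrete, here is a minimal H-tree node structure in C (the language the article's own code uses later on). The structure and field names are illustrative assumptions of mine, not part of the original article:

   #include <stddef.h>

   /* A minimal H-tree node: every node carries a weight, and a
      parent's weight equals the sum of its children's weights.
      Leaves are the nodes whose child pointers are both NULL. */
   struct HNode {
       unsigned weight;       /* occurrence count, or a probability
                                 scaled to an integer               */
       int symbol;            /* valid only in leaves; -1 otherwise */
       struct HNode *child0;  /* arc labelled with bit 0            */
       struct HNode *child1;  /* arc labelled with bit 1            */
       struct HNode *parent;  /* NULL only for the root             */
   };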
In the late 1940s, at the dawn of information theory, the idea of developing new, effective ways of encoding information was in the air. Researchers were studying entropy, information content and redundancy. Interestingly, the initial work on compression was done before the advent of the modern digital computer. Today information theory develops in parallel with programming, but back then the very idea of algorithms using the binary arithmetic of computers to encode characters was a significant step forward.

One of the first efficient information coding algorithms was proposed by D. A. Huffman in 1952. The idea of the algorithm is as follows: knowing the probabilities with which characters occur in a message, we can describe a procedure for constructing codes of variable length, each consisting of a whole number of bits. More probable characters are assigned shorter codes. Huffman codes have the prefix property (no code is a prefix of another), which allows them to be decoded unambiguously despite their variable length.

The classic Huffman algorithm receives as input a table of the frequencies of occurrence of the characters in the message. From this table a Huffman coding tree (H-tree) is built. The algorithm for constructing an H-tree is simple and elegant:

1. The characters of the input alphabet form a list of free nodes. Each leaf has a weight, which may be either the probability or the number of occurrences of the character in the message being compressed.
2. The two free nodes with the lowest weights are selected.
3. A parent is created for them with a weight equal to their total weight.
4. The parent is added to the list of free nodes, and its two children are removed from this list.
5. One arc leaving the parent is assigned bit 1, the other bit 0.
6. Steps 2-5 are repeated until only one free node remains in the list. It is considered the root of the tree.

Suppose we have the following frequency table:

    A   B   C   D   E
   15   7   6   6   5

In the first step the two leaves with the lowest weights, D and E, are joined to a new parent node whose weight is set to 6 + 5 = 11. Nodes D and E are then removed from the free list; node D corresponds to branch 0 of the parent, node E to branch 1.

In the next step the same happens with nodes B and C, since this pair now has the lowest weight in the tree. A new node with weight 13 is created, and nodes B and C are removed from the free list. After all this the coding tree looks as shown in Figure 1.

              13        11
             /  \      /  \
            0    1    0    1
            B    C    D    E
     A      7    6    6    5
    15

   Figure 1. The Huffman coding tree after the second step.

In the next step the "lightest" pair is the nodes B/C and D/E. A parent is created for them, this time with weight 24. Node B/C corresponds to branch 0 of the parent, D/E to branch 1.

In the last step only two nodes remain in the free list: node A and node (B/C)/(D/E). Once again a parent is created, with weight 39, and the former free nodes are attached to its branches. Since only one free node remains, the construction of the Huffman coding tree is finished. The H-tree is presented in Figure 2.

                39
               /  \
              0    1
             /      \
            A        24
           15       /  \
                   0    1
                  /      \
                13        11
               /  \      /  \
              0    1    0    1
              B    C    D    E
              7    6    6    5

   Figure 2. The final Huffman coding tree.

To determine the code of each character occurring in the message, we walk from the leaf corresponding to that character up to the root of the tree, accumulating bits as we move along the branches. The bit sequence obtained in this way is the character's code, written in reverse order.
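As an illustration, here is a compact C sketch of the construction procedure just described, run on the example frequencies above. All the names (build_tree, struct Node and so on) are mine, not the article's, and the tie-breaking among equal weights is arbitrary, so branch labels may differ from the figures while the code lengths stay the same:

   #include <stdio.h>

   #define MAXSYM  5                  /* symbols in the example alphabet */
   #define MAXNODE (2 * MAXSYM - 1)   /* a full binary tree with n
                                         leaves has exactly 2n-1 nodes  */
   struct Node {
       unsigned weight;
       int child0, child1;            /* -1 in leaves                   */
       int free;                      /* still in the free list?        */
   };

   static struct Node node[MAXNODE];

   /* Build the H-tree by repeatedly joining the two lightest free
      nodes (steps 1-6 above).  Returns the index of the root. */
   static int build_tree(const unsigned freq[], int n)
   {
       int count = n, i;

       for (i = 0; i < n; i++) {      /* step 1: leaves form the free list */
           node[i].weight = freq[i];
           node[i].child0 = node[i].child1 = -1;
           node[i].free   = 1;
       }
       for (;;) {
           int lo1 = -1, lo2 = -1;    /* step 2: two lightest free nodes   */
           for (i = 0; i < count; i++) {
               if (!node[i].free)
                   continue;
               if (lo1 < 0 || node[i].weight <= node[lo1].weight) {
                   lo2 = lo1;
                   lo1 = i;
               } else if (lo2 < 0 || node[i].weight <= node[lo2].weight)
                   lo2 = i;
           }
           if (lo2 < 0)               /* step 6: one free node - the root  */
               return lo1;
           node[count].weight = node[lo1].weight + node[lo2].weight; /* 3 */
           node[count].child0 = lo1;  /* step 5: one arc gets 0, the other */
           node[count].child1 = lo2;  /* 1; the choice of sides is free    */
           node[count].free   = 1;    /* step 4: parent in, children out   */
           node[lo1].free = node[lo2].free = 0;
           count++;
       }
   }

   int main(void)
   {
       const unsigned freq[MAXSYM] = { 15, 7, 6, 6, 5 };  /* A B C D E */
       int root = build_tree(freq, MAXSYM);
       printf("root weight = %u\n", node[root].weight);   /* prints 39 */
       return 0;
   }

The scan for the two minima is O(n) per step, O(n^2) overall, which is perfectly adequate for small alphabets; a heap would bring it down to O(n log n).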
For the given symbol table the Huffman codes are as follows:

    A    0
    B    100
    C    101
    D    110
    E    111

Since none of the codes is a prefix of another, they can be decoded unambiguously while being read from the stream. Note also that the most frequent symbol of the message, A, is encoded with the fewest bits, and the rarest symbol, E, with the most.

The classic Huffman algorithm has a significant drawback. To restore the contents of the compressed message, the decoder must know the frequency table the coder used. Consequently, the length of the compressed message grows by the length of the frequency table, which must be sent ahead of the data and which may negate all the effort spent on compressing the message. Moreover, the need for complete frequency statistics before the actual coding requires two passes over the message: one for building the model of the message (the frequency table and the H-tree), the other for the actual coding.

Adaptive compression
-----------------------------------------
Adaptive compression allows us not to transmit the model of the message along with the message itself, and to make do with a single pass over the data for both encoding and decoding. Virtually any form of coding can be converted into an adaptive one. In general, a program implementing an adaptive compressor can be expressed in the following form:

   InitializeModel();
   While not end of message
       Symbol = GetNextSymbol();
       EncodeSymbol(Symbol);
       UpdateModelWithSymbol(Symbol);
   End While

The decoder in an adaptive scheme works similarly:

   InitializeModel();
   While not end of compressed data
       Symbol = DecodeNextSymbol();
       EmitSymbol(Symbol);
       UpdateModelWithSymbol(Symbol);
   End While

The adaptive encoding/decoding scheme works because both encoding and decoding use the same procedures InitializeModel() and UpdateModelWithSymbol(). Both the compressor and the decompressor start with an "empty" model (containing no information about the message) and update it in the same way after every symbol processed.

Adaptive Huffman coding
-----------------------------------------
The next step in the development of the Huffman algorithm was its adaptive version. It is the main subject of this article. The greatest difficulty in making Huffman coding adaptive lies in the procedure UpdateModelWithSymbol(). We could simply put the complete construction of the Huffman coding tree inside this procedure; the result, however, would be the slowest compression algorithm in the world, since building the H-tree from scratch is too much work to repeat for every symbol processed. Fortunately, there is a way to modify an existing H-tree so that it reflects the processing of a new symbol.

Ordered tree
-----------------------------------------
We say that a tree has the ordering property if its nodes can be listed in order of non-decreasing weight so that in this list every node is adjacent to its sibling. An example of an ordered tree is shown in Figure 3. Here W is the weight of a node and N is its ordinal number in the list of nodes.

                  W=17
                  N=9
                 /    \
             W=7       E
             N=7      W=10
            /    \     N=8
        W=3      W=4
        N=5      N=6
       /   \    /   \
      A     B  C     D
     W=1  W=2  W=2  W=2
     N=1  N=2  N=3  N=4

   Figure 3. The ordering property of a Huffman coding tree.

We state, without proof, that a binary tree is a Huffman coding tree if and only if it has the ordering property. Preserving the ordering property while updating the tree lets us be sure that the binary tree we are working with is an H-tree both before and after the weights of its leaves are updated.
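For readers who like to see a property stated in code, below is a small C sketch that checks the ordering property for a tree stored as a list, using the tree of Figure 3 as test data. The list representation (index = ordinal number N minus 1, siblings in adjacent positions, root last) and all the names are my own assumptions, not the article's:

   #include <stdio.h>

   /* Nodes are stored in list order: index i holds the node with
      ordinal number N = i + 1.  parent[i] is the index of node i's
      parent; the root has parent -1 and sits at the last index. */
   struct ListNode { unsigned weight; int parent; };

   /* Returns 1 if weights are non-decreasing along the list and every
      pair of siblings occupies adjacent positions.  In such a list
      siblings necessarily pair up from the bottom: (0,1), (2,3), ... */
   static int is_ordered(const struct ListNode t[], int n)
   {
       int i;
       for (i = 1; i < n; i++)
           if (t[i].weight < t[i - 1].weight) return 0;
       for (i = 0; i + 1 < n; i += 2)
           if (t[i].parent != t[i + 1].parent) return 0;
       return 1;
   }

   int main(void)
   {
       /* The tree of Figure 3: leaves A..D, two inner nodes, the
          inner node of weight 7, leaf E, the root. */
       struct ListNode fig3[9] = {
           {1, 4}, {2, 4}, {2, 5}, {2, 5},  /* A B C D           */
           {3, 6}, {4, 6},                  /* inner 3, inner 4  */
           {7, 8}, {10, 8},                 /* inner 7, leaf E   */
           {17, -1}                         /* root              */
       };
       printf("ordered: %d\n", is_ordered(fig3, 9));  /* prints 1 */
       return 0;
   }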
Updating the tree when the next message symbol is read consists of two operations.

The first, increasing the weights of tree nodes, is shown in Figure 4. First the weight of the leaf corresponding to the symbol just read is increased by one. Then the weight of its parent is increased to bring it into line with the new weights of the children. This process continues until we reach the root of the tree. The average number of weight-increase operations equals the average number of bits needed to encode one symbol.

The second operation, a permutation of tree nodes, is required when increasing a weight breaks the ordering property, that is, when the increased weight of a node becomes greater than the weight of the next node in the list (Fig. 5). If we simply kept pushing the weight increase towards the root, our tree would stop being a Huffman tree.

   Step 4 ->      W=18
                  N=9
                 /    \
   Step 3 -> W=8       E
             N=7      W=10
            /    \     N=8
   Step 2 ->
        W=4      W=4
        N=5      N=6
       /   \    /   \
      A     B  C     D     <- Step 1 (leaf A)
     W=2  W=2  W=2  W=2
     N=1  N=2  N=3  N=4

   Figure 4. Updating the coding tree when the weight of leaf A
   increases from 1 to 2.

                  W=18
                  N=9
                 /    \
             W=8       E
             N=7      W=10
            /    \     N=8
        W=4      W=4
        N=5      N=6
       /   \    /   \
      A     B  C     D
     W=3  W=2  W=2  W=2
     N=1  N=2  N=3  N=4

   Figure 5. A further increase in the weight of leaf A violates
   the ordering property.

To preserve the ordering of the coding tree, the algorithm works as follows. Let the new, increased weight of the node be W+1. We move along the list in the direction of increasing weights until we find the last node with weight W. We then swap the current node with the node found, thus restoring order in the tree; the parents of each of the two nodes change as well. At this point the permutation is finished (Figure 6).

                  W=18
                  N=9
                 /    \
             W=8       E
             N=7      W=10
            /    \     N=8
        W=4      W=4
        N=5      N=6
       /   \    /   \
      D     B  C     A
     W=2  W=2  W=2  W=3
     N=1  N=2  N=3  N=4

   Figure 6. The coding tree after the first permutation of nodes
   D and A.

After the permutation, the weight-increase operation continues: the next node whose weight the algorithm increases is the new parent of the node whose weight increase caused the permutation.

Suppose that symbol A occurs in the message two more times in a row. The coding tree after the update procedure has been called twice more is shown in Figure 7.

                  W=21
                  N=9
                 /    \
               E       W=11
              W=10     N=8
              N=7     /    \
                     A      W=6
                    W=5     N=6
                    N=5    /   \
                          C     W=4
                         W=2    N=4
                         N=3   /   \
                              D     B
                             W=2   W=2
                             N=1   N=2

   Figure 7. The tree after the updates are complete.

Notice how updating the coding tree affects the lengths of the Huffman codes of the symbols occurring in the message: the code of the now-frequent symbol A has become shorter, while the codes of B, C and D have become longer.

In general, the tree update algorithm can be written as follows:

   UpdateModelWithSymbol(Symbol)
   {
       CurrentNode = LeafCorrespondingTo(Symbol);
       Loop
           IncreaseWeight(CurrentNode);
           If CurrentNode = TreeRoot
               Exit Loop;
           If Weight(CurrentNode) > Weight(NextInList(CurrentNode))
               SwapNodes();
           CurrentNode = Parent(CurrentNode);
       End Loop
   }
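Here is a C sketch of this update procedure over the same list representation as before, traced against Figures 4-6: increment a weight, swap with the last node of the old weight when the ordering property breaks, and continue from the new parent. It is a sketch under my own representation assumptions, not the article's code; in particular, a real coder would also maintain a symbol-to-leaf map, which is omitted here:

   #include <stdio.h>

   #define NODES 9                    /* the 9-node tree of Figure 3 */

   /* Index i holds the node with ordinal number N = i + 1; the root
      sits at index NODES-1.  Leaves have both children set to -1. */
   static unsigned weight[NODES];
   static int parent[NODES], child0[NODES], child1[NODES];

   /* Swap two list positions: the nodes exchange places (and thus
      parents), taking their subtrees with them, so the children's
      parent links must be re-pointed. */
   static void swap_nodes(int a, int b)
   {
       unsigned tw; int t;
       tw = weight[a]; weight[a] = weight[b]; weight[b] = tw;
       t = child0[a];  child0[a] = child0[b]; child0[b] = t;
       t = child1[a];  child1[a] = child1[b]; child1[b] = t;
       if (child0[a] >= 0) { parent[child0[a]] = a; parent[child1[a]] = a; }
       if (child0[b] >= 0) { parent[child0[b]] = b; parent[child1[b]] = b; }
   }

   /* One call of the update procedure, starting from the leaf of the
      symbol just read (given here by its list index). */
   static void update_model(int cur)
   {
       for (;;) {
           weight[cur]++;             /* increase the weight            */
           if (parent[cur] < 0)
               break;                 /* reached the root: done         */
           {   /* ordering broken?  find the LAST node still lighter
                  than us (the last node of our old weight) and swap  */
               int last = cur;
               while (last + 1 < NODES - 1 && weight[last + 1] < weight[cur])
                   last++;
               if (last != cur) {
                   swap_nodes(cur, last);
                   cur = last;
               }
           }
           cur = parent[cur];         /* continue from the (new) parent */
       }
   }

   int main(void)
   {
       /* the tree of Figure 3 */
       static const unsigned w0[NODES] = { 1, 2, 2, 2, 3, 4, 7, 10, 17 };
       static const int p0[NODES]  = {  4,  4,  5,  5,  6,  6,  8,  8, -1 };
       static const int c00[NODES] = { -1, -1, -1, -1,  0,  2,  4, -1,  6 };
       static const int c10[NODES] = { -1, -1, -1, -1,  1,  3,  5, -1,  7 };
       int i;
       for (i = 0; i < NODES; i++) {
           weight[i] = w0[i]; parent[i] = p0[i];
           child0[i] = c00[i]; child1[i] = c10[i];
       }
       update_model(0);               /* A read once:  Figure 4          */
       update_model(0);               /* A read again: Figures 5 and 6;
                                         A now lives at index 3, which is
                                         why a symbol-to-leaf map is
                                         needed in a real implementation */
       printf("root weight = %u\n", weight[NODES - 1]);  /* prints 19 */
       return 0;
   }

Note that a node is never swapped with its own parent: before the increment the parent already weighs at least W+1, so the forward scan always stops short of it.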
The problems of adaptive coding
-----------------------------------------
Initialization. It would be good if the coder did not waste code space on characters that never occur in the message. With the classic Huffman algorithm the characters that do not occur in the message are known before encoding begins, since the frequency table is known, and with it the symbols whose frequency of occurrence is 0. In the adaptive version of the algorithm we cannot know in advance which symbols will appear in the message.

We could initialize the Huffman tree so that it contains all 256 symbols of the alphabet (for 8-bit codes), each with a frequency of 1. At the start of encoding every code would then be 8 bits long; as the model adapts, the most common symbols would be coded with fewer and fewer bits. This approach works, but it significantly reduces compression, especially for short messages.

It is better to start with an empty tree and add symbols to it only as they appear in the message being compressed. But this leads to an obvious contradiction: when a symbol occurs in the message for the first time, it cannot be encoded, since it is not yet in the coding tree.

To resolve this contradiction we introduce a special ESCAPE code, which tells the decoder that the following symbol is encoded outside the context of the message model. For example, it can be transmitted into the compressed data stream as is, without any coding at all. The method EncodeSymbol() of the adaptive Huffman coding algorithm can be written as follows:

   EncodeSymbol(Symbol)
   {
       If SymbolIsAlreadyInTable(Symbol)
           EmitHuffmanCodeFor(Symbol)
       Else
       {
           EmitHuffmanCodeFor(ESCAPE)
           EmitSymbol(Symbol)        ; as is, without coding
       }
   }

Using the special ESCAPE symbol implies a certain initialization of the tree before encoding and decoding begin: two special symbols are placed in it, ESCAPE and EOF (end of file), each with a weight of 1 (Fig. 8). Since the tree update process never touches their weights, in the course of encoding they drift to the most distant branches of the tree and acquire very long codes.

        ROOT
         2
        /  \
      ESC   EOF
       1     1

   Fig. 8. The coding tree after initialization.

Overflow
-----------------------------------------
As the compression algorithm runs, the weights of the nodes in the Huffman coding tree grow steadily. The first problem arises when the weight of the root begins to exceed the capacity of the cell it is stored in. Typically this is a 16-bit value, which cannot exceed 65535.

The second problem deserves even more attention, and it can occur much earlier: when the length of the longest Huffman code exceeds the capacity of the cell used to assemble it for the output stream. The decoder does not care how long a code is, since it decodes by moving down the coding tree, taking one bit at a time from the input stream. The coder, however, has to start from a leaf of the tree and move up to the root, collecting the bits to be transmitted. This is usually done in a variable of type "integer", and when the length of a Huffman code exceeds the size of "integer" in bits, an overflow occurs.

SoundTrack: NIK-O: -= CrAzY NOSTALGIE =- ...
__________________________________________

It can be shown that, for one and the same input alphabet, the Huffman code of maximum length is obtained when the frequencies of the symbols form a Fibonacci sequence (Fig. 9).

                     W=13
                    /    \
                 W=8      F
                /   \    W=5
             W=5     E
            /   \   W=3
         W=3     D
        /   \   W=2
      W=2     C
     /   \   W=1
    A     B
   W=1   W=1

   Fig. 9. A Huffman tree built on Fibonacci frequencies.

The Fibonacci function is defined as follows:

   int Fib(int n)
   {
       if (n <= 1) return 1;
       else return Fib(n - 1) + Fib(n - 2);
   }

The function is written in C, but hopefully its meaning is clear. For example:

   Fib(5) = Fib(4) + Fib(3)
          = (Fib(3) + Fib(2)) + (Fib(2) + Fib(1))
          = 3 + 2 + 2 + 1 = 8.

If the weight of the root of a Huffman tree equals Fib(i), then the longest code possible for this tree has length i-1.

This means that if the "integer" used to assemble codes is 16 bits long, a tree whose root weight has reached Fib(18) = 4181 can already overflow it. (Generally speaking, a message whose character frequencies are the Fibonacci numbers up to Fib(18) is a great way to test any Huffman compression software.)
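A quick numeric check of this statement: the recursive Fib() above recomputes subproblems exponentially many times, so the sketch below uses an equivalent iterative version (my own) to print the root weights at which each maximum code length first becomes possible:

   #include <stdio.h>

   /* Iterative equivalent of the article's Fib(): Fib(0)=Fib(1)=1. */
   static unsigned fib(int n)
   {
       unsigned a = 1, b = 1;
       while (n-- > 1) { unsigned c = a + b; a = b; b = c; }
       return b;
   }

   int main(void)
   {
       int i;
       /* A root weight of Fib(i) permits a code of i-1 bits, so
          Fib(18) = 4181 is the danger threshold for 16-bit codes. */
       for (i = 15; i <= 18; i++)
           printf("root weight >= %5u -> max code length %d bits\n",
                  fib(i), i - 1);
       return 0;
   }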
Scaling the weights of the nodes of an H-tree
-----------------------------------------
Taking the above into account, the Huffman tree update algorithm should be changed as follows: when increasing a weight, check whether it has reached the permitted maximum. If it has, the weights must be "scaled", usually by dividing the weights of the leaves by an integer, 2 for example, and then recomputing the weights of all the other nodes.

However, halving the weights brings a problem: after this operation the tree may change its shape. The reason is that we are dividing integers, and the division truncates the remainders. A properly organized Huffman tree may, after scaling, take a shape considerably different from the original one. This happens because scaling loses some of the accuracy of our statistics. But as new statistics are collected, the consequences of these "errors" practically fade away.

Scaling the weights is quite an expensive operation, since it requires rebuilding the whole coding tree. But since the need for it arises relatively rarely, we can put up with that.

Gains from scaling
-----------------------------------------
Scaling the weights of the tree nodes at regular intervals gives an unexpected result. Although scaling loses statistical accuracy, tests show that it leads to better compression than postponing the scaling would. This can be explained by the fact that the current symbols of the stream being compressed are more "similar" to their close predecessors than to symbols encountered much earlier. Scaling reduces the influence of "old" symbols on the statistics and increases the influence of "recent" ones. This is very difficult to measure quantitatively, but in principle scaling has a positive effect on the compression ratio. Experiments with scaling at various points of the compression process show that the compression ratio depends strongly on the moment the weights are scaled, but no rule has been found for choosing the optimal moment for a program intended to compress any type of information.
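The halving step itself is simple; below is a minimal C sketch of it, assuming the leaf weights live in a plain count[] array (my own representation, not the article's). Rounding up keeps every active symbol in the tree; after scaling, the coding tree is rebuilt from the counts:

   #include <stdio.h>

   #define SYMBOLS 256

   static unsigned count[SYMBOLS];    /* leaf weights (occurrence counts) */

   /* Halve all leaf weights when the root weight reaches its permitted
      maximum.  Rounding up keeps every active symbol's weight at least
      1, so no symbol drops out of the tree. */
   static void scale_counts(void)
   {
       int i;
       for (i = 0; i < SYMBOLS; i++)
           if (count[i])
               count[i] = (count[i] + 1) / 2;
       /* ...then rebuild the coding tree from the scaled counts, e.g.
          with build_tree() from the construction sketch above */
   }

   int main(void)
   {
       count['A'] = 15; count['B'] = 7; count['C'] = 1;
       scale_counts();
       printf("%u %u %u\n", count['A'], count['B'], count['C']); /* 8 4 1 */
       return 0;
   }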
Conclusion
-----------------------------------------
Since D. A. Huffman published his 1952 paper "A Method for the Construction of Minimum-Redundancy Codes", his coding algorithm has become the basis of a huge number of further studies in this area. To this day computer journals carry a large number of publications devoted both to various implementations of the Huffman algorithm and to the search for its best applications. Huffman coding is used in commercial compression software, is built into some fax machines, and is even used in the JPEG algorithm for lossy compression of graphic images.

THE END.

The contents of the following articles:
- Arithmetic coding
- The LZ algorithms (Lempel-Ziv)
- The LZW algorithm (Lempel-Ziv-Welch)
- Writing a packer, or combined compression schemes
- Etc...