Deja Vu #06
30 September 1998
CODING - Data compression algorithms.

SoundTrack: RELIZ BY D.J.DARKMAN / SMG 1998
__________________________________________
(C) Diver / CRG
(C) Elena Gladkova / CRG
__________________________________________
Algorithms for data compression
Part 1.
Huffman compression
Introduction
-----------------------------------------
This series of articles is devoted to a systematic exposition
of data compression algorithms. Why might this be interesting
to you?
Do you want to know how data compression programs work? The
compression algorithms they use will be described in a detailed
and systematic way, and perhaps, if readers wish, the next
issues will include source code illustrating how these
algorithms work. :-)
Do you want to use data compression in programs you write, or
are you going to write the 'coolest' packer for the Speccy? Are
you interested in the scientific side of data compression? A
systematic exposition with short mathematical derivations will
help you gain reasonably solid knowledge of the subject.
The magic formula:
COMPRESSION = MODELING + CODING
-----------------------------------------
Data compression is the process of reducing the number of bits
needed to store a certain amount of information.
Lossless - the information recovered from the compressed form
corresponds exactly to the original (before compression).
Lossy - the recovered information corresponds to the original
only partially (used in processing images and sound).
Generally speaking, data compression is the process of reading
the symbols of some message and translating those symbols into
codes. If this process is organized effectively, the resulting
coded message takes up less space than the original.
While scanning the message being processed, a compression
algorithm carries out two almost independent processes:
- maintaining a MODEL of the message being processed;
- CODING the next message fragment based on that model.
The whole compression process is often mistakenly identified
with coding alone, whereas coding is only the part of the
compression process that interacts with the data model. Hence
the compression achieved with the same coder will be different
for different modeling techniques.
So if you ever hear a statement like "Huffman coding gives
optimal results, it has been proved mathematically", or,
conversely, "Huffman coding does not give good results", take
it easy. Both statements are, to say the least, incorrect. In
every case the overall result of a compression algorithm
depends both on the modeling method and on the coding method.
Huffman coding
-----------------------------------------
Graph - a collection of a set of nodes and a set of arcs
directed from one node to another.
Tree - a graph with the following properties:
a) no node has more than one incoming arc (i.e., there are no
cycles);
b) exactly one node has no incoming arc; it is called the root
of the tree;
c) moving along the arcs from the root, one can reach any node.
Leaf of a tree - a node from which no arc leaves. In a pair of
tree nodes connected by an arc, the one the arc leaves is
called the parent, the other the child. Two nodes are called
brothers if they have the same parent.
Binary tree - a tree in which exactly two arcs leave every
node except the leaves.
Huffman coding tree (H-tree) - a binary tree in which every
node has a weight, and the weight of a parent equals the total
weight of its children.
Input alphabet - the set of symbols that occur in the message.
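As an illustration of these definitions (the later examples in
this article use C), an H-tree node can be declared as a small
C structure. This sketch is ours, not part of the original
text:

typedef struct Node {
    int weight;                 /* weight of the node                  */
    int symbol;                 /* the character; -1 in internal nodes */
    struct Node *left, *right;  /* the 0- and 1-arcs; NULL in a leaf   */
} Node;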
In the late forties, at the dawn of information theory, ideas
for new and effective ways of encoding information were in the
air. Researchers were studying entropy, information content and
redundancy. Interestingly, the initial work on compression was
done before the advent of the modern digital computer. Today
information theory develops hand in hand with programming, but
back then the idea of developing algorithms that encode
characters using binary arithmetic was a significant step
forward.
One of the first efficient information coding algorithms was
proposed by D. A. Huffman in 1952. The idea of the algorithm is
as follows: knowing the probabilities with which characters
occur in the message, we can describe a procedure for
constructing variable-length codes consisting of a whole number
of bits. Shorter codes are assigned to the more probable
characters. Huffman codes have the unique-prefix property,
which allows them to be decoded unambiguously despite their
variable length.
The classic Huffman algorithm receives as input a table of
character occurrence frequencies in the message. A Huffman
coding tree (H-tree) is then built from this table. The
algorithm for constructing an H-tree is simple and elegant.
1. The characters of the input alphabet form a list of free
nodes. Each leaf has a weight, which may equal either the
probability or the number of occurrences of the character in
the message to be compressed.
2. The two free nodes with the lowest weights are selected.
3. A parent is created for them, with a weight equal to their
total weight.
4. The parent is added to the list of free nodes, and its two
children are removed from this list.
5. One arc leaving the parent is assigned bit 1, the other
bit 0.
6. The steps are repeated from step 2 until only one node
remains in the list of free nodes. It is considered the root
of the tree.
Suppose we have the following frequency table:

 15    7    6    6    5
  A    B    C    D    E
In the first step, the two tree leaves with the lowest
weights, D and E, are joined. They are attached to a new
parent node, whose weight is set to 6+5=11. Then nodes D and E
are removed from the free list. Node D corresponds to branch 0
of the parent, node E to branch 1.
In the next step the same happens with nodes B and C, since
this pair now has the lowest weight in the tree. A new node
with weight 13 is created, and nodes B and C are removed from
the free list. After all this, the coding tree looks as shown
in Figure 1.
        13          11
      0/  \1      0/  \1
 15   7    6      6    5
  A   B    C      D    E

Figure 1. The Huffman coding tree after the second step.
In the next step the "lightest" pair of nodes are B/C and D/E.
For them, once again, a parent is created, now with weight 24.
Node B/C corresponds to branch 0 of the parent, D/E to
branch 1.
In the last step only two nodes remain in the free list: node
A and node (B/C)/(D/E). Once more a parent with weight 39 is
created, and the former free nodes are attached to its two
branches.
Since only one free node is left, the algorithm for
constructing the Huffman coding tree is complete. The
resulting H-tree is presented in Figure 2.
             39
           0/  \1
         15      24
          A    0/  \1
             13      11
           0/  \1  0/  \1
           7    6  6    5
           B    C  D    E

Figure 2. The final Huffman coding tree.
To determine the code for each character of the message, we
walk from the leaf of the tree corresponding to that character
up to the root, accumulating bits as we move along the
branches. The bit sequence obtained this way is the
character's code, written in reverse order.
For the given table the Huffman codes will be as follows:

 A   0
 B   100
 C   101
 D   110
 E   111
Since none of the obtained codes is a prefix of another, they
can be decoded unambiguously while being read from the stream.
In addition, the most frequent character of the message, A, is
encoded with the smallest number of bits, and the rarest
character, E, with the largest.
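To make the construction concrete, here is a small C program
that builds the H-tree for the frequency table above and prints
the codes. It is our illustrative sketch of steps 1-6 (reusing
the Node structure from the earlier sketch), not code from the
original article; ties between equal weights and the 0/1
labelling of arcs may be resolved differently than in the
figures, so the exact codes may differ from the table while the
code lengths match.

#include <stdio.h>
#include <stdlib.h>

static Node *make_node(int w, int sym, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->weight = w; n->symbol = sym; n->left = l; n->right = r;
    return n;
}

/* Steps 2-6 of the algorithm: repeatedly join the two lightest
   free nodes until a single node, the root, remains. */
static Node *build_tree(Node *list[], int n)
{
    while (n > 1) {
        int i, a, b, t;
        a = (list[0]->weight <= list[1]->weight) ? 0 : 1;  /* lightest  */
        b = 1 - a;                                         /* runner-up */
        for (i = 2; i < n; i++) {
            if (list[i]->weight < list[a]->weight)      { b = a; a = i; }
            else if (list[i]->weight < list[b]->weight) { b = i; }
        }
        if (a > b) { t = a; a = b; b = t; }
        /* step 3: the parent's weight is the children's total weight */
        list[a] = make_node(list[a]->weight + list[b]->weight, -1,
                            list[a], list[b]);
        list[b] = list[--n];          /* step 4: drop the children     */
    }
    return list[0];
}

/* Step 5 happens here: going left appends a 0, going right a 1. */
static void print_codes(const Node *t, char *buf, int depth)
{
    if (t->symbol >= 0) {
        buf[depth] = '\0';
        printf("%c %s\n", t->symbol, buf);
        return;
    }
    buf[depth] = '0'; print_codes(t->left,  buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void)
{
    const int freq[5] = { 15, 7, 6, 6, 5 };   /* the table above */
    Node *list[5];
    char buf[16];
    int i;
    for (i = 0; i < 5; i++)
        list[i] = make_node(freq[i], 'A' + i, NULL, NULL);
    print_codes(build_tree(list, 5), buf, 0);
    return 0;
}

With the codes from the table above, a message such as ABAAD
would be packed into 0 100 0 0 110, i.e. 11 bits instead of
the 40 bits of its plain 8-bit representation.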
The classic Huffman algorithm has one significant drawback. To
recover the contents of a compressed message, the decoder must
know the frequency table that the coder used. Consequently, the
length of the compressed message grows by the length of the
frequency table, which has to be sent ahead of the data and
may undo all the effort spent compressing the message. In
addition, the need to have the complete frequency statistics
before coding begins requires two passes over the message: one
for constructing the model of the message (the frequency table
and the H-tree), the other for actually coding it.
Adaptive Compression
-----------------------------------------
Adaptive compression allows us not to transmit the model of
the message along with the message itself, and to limit both
encoding and decoding to a single pass over the data.
Virtually any form of coding can be converted into an adaptive
one. In general, a program implementing adaptive compression
can be expressed in the following form:
InitializeModel();
While not end of message
    Symbol = GetNextSymbol();
    EncodeSymbol(Symbol);
    UpdateModelWithSymbol(Symbol);
End While
The decoder in the adaptive scheme works similarly:
InitializeModel();
While not end of compressed data
    Symbol = DecodeNextSymbol();
    OutputSymbol(Symbol);
    UpdateModelWithSymbol(Symbol);
End While
The adaptive encoding/decoding scheme works because both
encoding and decoding use the same procedures,
InitializeModel() and UpdateModelWithSymbol(). Both the
compressor and the decompressor start with an "empty" model
(containing no information about the message), and each
updates it in the same way with every symbol processed.
Adaptive Huffman Coding
-----------------------------------------
The next step in the development of the Huffman algorithm was
its adaptive version. It is the main subject of this article.
In making the Huffman algorithm adaptive, the greatest
difficulties arise with the procedure UpdateModelWithSymbol().
We could simply place the complete construction of the Huffman
coding tree inside this procedure. The result would be the
slowest compression algorithm in the world, since building the
H-tree is too much work and doing it while processing every
character is unwise. Fortunately, there is a way to modify an
existing H-tree so that it reflects the processing of a new
character.
Ordered tree
-----------------------------------------
We say that a tree has the ordering property if its nodes can
be listed in order of non-decreasing weight, and in this list
every node is adjacent to its brother. An example of an
ordered tree is shown in Figure 3. Here W is the weight of a
node and N is its ordinal number in the list of nodes.
            W=17
            N=9
           /    \
       W=7        E
       N=7       W=10
      /    \      N=8
   W=3      W=4
   N=5      N=6
  /   \    /   \
 A     B  C     D
W=1  W=2  W=2  W=2
N=1  N=2  N=3  N=4

Figure 3. The ordering property of a Huffman coding tree.
We accept without proof the statement that a binary tree is a
Huffman coding tree if and only if it has the ordering
property.
Preserving the ordering property while updating the tree
allows us to be sure that the binary tree we are working with
is an H-tree both before and after the weights of its leaves
change.
Updating the tree while reading the next message symbol
consists of two operations.
The first, increasing the weight of tree nodes, is shown in
Figure 4. First the weight of the leaf corresponding to the
symbol is increased by one. Then the weight of its parent is
increased to bring it into line with the new weights of its
children. This process continues until we reach the root of
the tree. The average number of weight-increase operations
equals the average number of bits needed to encode a
character.
The second operation, a permutation of tree nodes, is required
when the weight increase breaks the ordering property, that
is, when the increased weight of a node becomes greater than
the weight of the next node in the list (Fig. 5). If we simply
continued to propagate the weight increase towards the root,
the tree would stop being a Huffman tree.
Step 4:         W=18
                N=9
               /    \
Step 3:    W=8        E
           N=7       W=10
          /    \      N=8
Step 2: W=4      W=4
        N=5      N=6
       /   \    /   \
Step 1: A    B  C    D
      W=2  W=2 W=2  W=2
      N=1  N=2 N=3  N=4

Figure 4. Updating the coding tree when the weight of leaf A
is increased.
            W=18
            N=9
           /    \
       W=8        E
       N=7       W=10
      /    \      N=8
   W=4      W=4
   N=5      N=6
  /   \    /   \
 A     B  C     D
W=3  W=2  W=2  W=2
N=1  N=2  N=3  N=4

Figure 5. As the weight of leaf A increases further, the
ordering property is violated.
To preserve the ordering of the coding tree, the algorithm
works as follows. Let the new, increased weight of the node be
W+1. We then move along the list in the direction of
increasing weight until we find the last node with weight W.
We swap the current node and the found one in the list
(Figure 6), thus restoring order in the tree. (The parents of
both nodes change as well.) The permutation operation ends
here.
            W=18
            N=9
           /    \
       W=8        E
       N=7       W=10
      /    \      N=8
   W=4      W=4
   N=5      N=6
  /   \    /   \
 D     B  C     A
W=2  W=2  W=2  W=3
N=1  N=2  N=3  N=4

Figure 6. The coding tree after the first permutation of
nodes D and A.
After the permutation, the weight-increase operation
continues. The next node whose weight the algorithm will
increase is the new parent of the node whose weight increase
caused the permutation.
Suppose that the symbol A occurs in the message two more times
in a row. The coding tree after the update procedure is called
twice more is shown in Figure 7.
        W=21
        N=9
       /    \
      E      W=11
    W=10     N=8
     N=7    /    \
           A      W=6
          W=5     N=6
          N=5    /    \
                C      W=4
               W=2     N=4
               N=3    /    \
                     D      B
                    W=2    W=2
                    N=1    N=2

Figure 7. The tree after the update is complete.
Notice how the update of the coding tree is reflected in the
lengths of the Huffman codes of the characters occurring in
the message.
In general, the tree update algorithm can be written as
follows:
UpdateModelWithSymbol(Symbol)
{
    CurrentNode = LeafCorrespondingTo(Symbol)
    Loop
        IncreaseWeight(CurrentNode)
        If (CurrentNode = TreeRoot) Exit
        If (Weight(CurrentNode) >
            Weight(NextInList(CurrentNode)))
            Permute()
        CurrentNode = Parent(CurrentNode)
    End Loop
}
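Below is a C sketch of this procedure, under the following
assumptions (all names are ours, not the article's): each node
additionally stores a parent pointer and its ordinal number N,
list[] is the ordered node list from Figure 3, nnodes is its
length, and leaf[] maps each character to its leaf. A swap
exchanges whole subtrees, in the tree and in the list at once:

typedef struct ANode {                 /* H-tree node, adaptive version */
    int weight, order, symbol;         /* symbol < 0 in internal nodes  */
    struct ANode *parent, *left, *right;
} ANode;

ANode *list[2 * 258];                  /* nodes in order of weight (N); */
int    nnodes;                         /* room for 256 chars + ESC/EOF  */
ANode *leaf[256];                      /* character -> leaf, NULL if new */

/* Exchange subtrees a and b: their parents and list positions change,
   the subtrees themselves do not. */
static void swap_nodes(ANode *a, ANode *b)
{
    ANode **pa = (a->parent->left == a) ? &a->parent->left : &a->parent->right;
    ANode **pb = (b->parent->left == b) ? &b->parent->left : &b->parent->right;
    ANode *p; int t;
    *pa = b; *pb = a;
    p = a->parent; a->parent = b->parent; b->parent = p;
    t = a->order;  a->order  = b->order;  b->order  = t;
    list[a->order] = a; list[b->order] = b;
}

/* UpdateModelWithSymbol: climb from the leaf to the root, increasing
   weights; when the order breaks, swap with the last node whose weight
   is still the old value W, exactly as described above. */
void update_model(int symbol)
{
    ANode *cur = leaf[symbol];
    while (cur != NULL) {
        int j = cur->order;
        cur->weight++;
        while (j + 1 < nnodes && list[j + 1]->weight < cur->weight)
            j++;                       /* the last node with weight W   */
        if (j != cur->order)
            swap_nodes(cur, list[j]);  /* cur->parent is now the NEW    */
        cur = cur->parent;             /* parent, as the text requires  */
    }
}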
The problems of adaptive coding
-----------------------------------------
Initialization
It would be good if the coder did not waste code space on
characters that do not occur in the message.
In the classic Huffman algorithm, the characters that do not
occur in the message are known before coding begins, since the
frequency table is known, and with it the characters whose
occurrence frequency is 0. In the adaptive version of the
algorithm we cannot know in advance which characters will
appear in the message. We could initialize the Huffman tree so
that it contains all 256 characters of the alphabet (for 8-bit
codes) with a frequency of 1 each. At the start of coding,
each code will then be 8 bits long. As the model adapts,
frequent characters will be coded with fewer and fewer bits.
This approach works, but it significantly reduces compression,
especially for short messages.
It is better to start modeling with an empty tree and add
characters to it only as they appear in the message being
compressed. But this leads to an obvious contradiction: when a
character appears in the message for the first time, it cannot
be encoded, since it is not yet in the coding tree.
To resolve this contradiction we introduce a special ESCAPE
code, which tells the decoder that the following character is
encoded outside the context of the message model. For example,
it can be sent into the compressed data stream as is, without
any coding at all. The method EncodeSymbol() of the adaptive
Huffman coding algorithm can be written as follows:
EncodeSymbol(Symbol)
{
    If SymbolIsAlreadyInTable(Symbol)
        OutputHuffmanCodeFor(Symbol)
    Else
    {
        OutputHuffmanCodeFor(ESCAPE)
        OutputSymbol(Symbol)
    }
}
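In C this might look as follows (our sketch with hypothetical
helpers; leaf[] is the symbol-to-leaf table from the update
sketch above):

void output_huffman_code(const ANode *nd);  /* emit the leaf's code    */
void output_raw_byte(int symbol);           /* emit 8 bits, uncoded    */
extern ANode *esc_leaf;                     /* leaf of the ESCAPE code */

void encode_with_escape(int symbol)
{
    if (leaf[symbol] != NULL) {             /* already in the table    */
        output_huffman_code(leaf[symbol]);
    } else {                                /* first occurrence:       */
        output_huffman_code(esc_leaf);      /* announce with ESCAPE,   */
        output_raw_byte(symbol);            /* then send it as is      */
    }
}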
Using the special ESCAPE character implies a certain
initialization of the tree before coding and decoding begin:
two special characters, ESCAPE and EOF (end of file), are
placed in it, each with a weight of 1 (Fig. 8). Since the tree
update process does not touch their weights, during coding
they drift to the most distant branches of the tree and end up
with very long codes.
     ROOT
     W=2
    /    \
  EOF    ESC
  W=1    W=1

Fig. 8. The coding tree after initialization.
Overflow
As the compression algorithm runs, the weights of the nodes in
the Huffman coding tree grow steadily. The first problem
arises when the weight of the root of the tree begins to
exceed the capacity of the cell in which it is stored.
Typically this is a 16-bit value, which therefore cannot
exceed 65535. A second problem deserves even more attention:
it can occur much earlier, when the length of the longest
Huffman code exceeds the capacity of the cell used to pass it
to the output stream. The decoder does not care how long a
code is, since it decodes by moving down the coding tree,
taking one bit at a time from the input stream.
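As a sketch (get_bit() is a hypothetical routine returning the
next bit of the compressed stream; ANode is the structure from
the update sketch):

int get_bit(void);                     /* next bit of the input stream */

int decode_symbol(const ANode *root)
{
    const ANode *n = root;
    while (n->symbol < 0)              /* internal node: keep walking  */
        n = get_bit() ? n->right : n->left;
    return n->symbol;                  /* a leaf: the character (or    */
}                                      /* ESCAPE/EOF) has been decoded */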
The coder, however, has to start at a leaf of the tree and
move up to the root, collecting the bits to be transmitted.
This is usually done in a variable of type "integer", and when
the length of a Huffman code exceeds the size of "integer" in
bits, an overflow occurs.
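The coder's walk might look like this (our sketch; put_bits()
is a hypothetical routine that emits len bits starting from
the most significant one). The accumulator "code" is exactly
the cell that overflows when a Huffman code gets longer than
the integer is wide:

void put_bits(unsigned code, int len); /* emit len bits, MSB first     */

void output_huffman_code(const ANode *nd)
{
    unsigned code = 0, mask = 1;
    int len = 0;
    for (; nd->parent != NULL; nd = nd->parent) {
        if (nd == nd->parent->right)   /* this arc is labelled 1       */
            code |= mask;
        mask <<= 1;                    /* bits collected leaf-to-root  */
        len++;                         /* end up in root-to-leaf order */
    }
    put_bits(code, len);               /* len > 16 overflows a 16-bit  */
}                                      /* accumulator                  */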
SoundTrack: NIK-O: -= CrAzY NOSTALGIE =- ...
__________________________________________
It can be proved that, for one and the same input alphabet, a
Huffman code of maximum length occurs in a message whose
symbol frequencies form a Fibonacci sequence (Fig. 9).
                    W=13
                   /    \
               W=8       F
              /    \    W=5
           W=5      E
          /    \   W=3
       W=3      D
      /    \   W=2
   W=2      C
  /    \   W=1
 A      B
W=1    W=1

Fig. 9. A Huffman tree for Fibonacci frequencies.
The Fibonacci function is defined as follows:
int Fib(int n)
{
    if (n <= 1) return 1;
    else return Fib(n - 1) + Fib(n - 2);
}
The function is written in C, but I hope its meaning is clear.
For example:

Fib(5) = Fib(4) + Fib(3)
       = Fib(3) + Fib(2) + Fib(2) + Fib(1)
       = Fib(2) + Fib(1) + Fib(1) + Fib(0) + Fib(1) + Fib(0) + 1
       = 2 + 1 + 1 + 1 + 1 + 1 + 1 = 8.
If the weight of the root of a Huffman tree equals Fib(i),
then the longest code possible for this tree is i-1 bits long.
This means that if the integer variable in which the coder
assembles a code is 16 bits long, then a tree whose root
weight has reached 4181 (Fib(18)) can already produce a code
that overflows it: by the statement above, the longest code
would be 17 bits. (Generally speaking, a message with
character frequencies equal to the Fibonacci numbers up to
Fib(18) is a great way to test the correctness of a Huffman
compression program.)
Scaling the weights of the H-tree nodes
-----------------------------------------
Taking the above into account, the Huffman tree update
algorithm should be changed as follows: when increasing a
weight, check whether it has reached the permitted maximum. If
it has, the weights must be "scaled", usually by dividing the
weights of the leaves by an integer, for example 2, and then
recomputing the weights of all the other nodes.
However, dividing the weights in half brings a problem: after
this operation the tree can change its shape. The reason is
that we are dividing integers, and the division truncates.
A properly organized Huffman tree after scaling can have a
shape significantly different from the original one. This is
because scaling loses some of the accuracy of our statistics.
But as new statistics are collected, the consequences of these
"errors" practically fade away. Scaling the weights is quite
an expensive operation, since it requires rebuilding the whole
coding tree. But since the need for it arises relatively
rarely, we can put up with it.
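As a sketch (leaf[] is the table from the sketches above;
rebuild_tree() is a hypothetical routine that rebuilds the
H-tree from the leaf weights, e.g. by the classic algorithm
from the first part of the article):

void rebuild_tree(void);               /* rebuild the H-tree from the  */
                                       /* current leaf weights         */
void scale_weights(void)
{
    int i;
    for (i = 0; i < 256; i++)
        if (leaf[i] != NULL) {
            leaf[i]->weight /= 2;      /* integer division truncates - */
            if (leaf[i]->weight == 0)  /* this is what can change the  */
                leaf[i]->weight = 1;   /* shape of the rebuilt tree    */
        }
    rebuild_tree();
}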
Gains from scaling
-----------------------------------------
Scaling the weights of the tree nodes at regular intervals
gives an unexpected result. Although scaling loses statistical
accuracy, tests show that it leads to better compression than
postponing the scaling would. This can be explained by the
fact that the current characters of the stream being
compressed are more "similar" to their near predecessors than
to those that occurred much earlier. Scaling reduces the
influence of "old" characters on the statistics and increases
the influence of "recent" ones. This is very hard to measure
quantitatively, but in principle scaling has a positive effect
on the degree of compression. Experiments with scaling at
various points of the compression process show that the degree
of compression depends substantially on the moment the weights
are scaled, but there are no rules for choosing the optimal
moment of scaling for a program intended to compress arbitrary
types of information.
Conclusion
-----------------------------------------
Since D. A. Huffman published his 1952 paper "A Method for the
Construction of Minimum-Redundancy Codes", his coding
algorithm has become the basis for an enormous number of
further studies in this area. To this day computer journals
carry a large number of publications devoted both to various
implementations of the Huffman algorithm and to the search for
its best applications. Huffman coding is used in commercial
compression software, is built into some fax machines, and is
even used in the JPEG algorithm for lossy compression of
graphic images.
THE END.
The contents of the following articles:
- Arithmetic coding
- The LZ algorithms (Lempel-Ziv)
- The LZW algorithm (Lempel-Ziv-Welch)
- Writing a packer, or combined compression schemes
- etc...