Deja Vu #06
30 September 1998
CODING - Data compression algorithms.

SoundTrack: RELIZ BY D.J.DARKMAN / SMG 1998
__________________________________________
(C) Diver / CRG
(C) Elena Gladkova / CRG
__________________________________________
Algorithms for data compression
Part 1.
Huffman compression
Introduction
-----------------------------------------
This series of articles is devoted to a systematic exposition
of data compression algorithms. Why might this be interesting
to you?
Do you want to know how data compression programs work? The
compression algorithms they use will be described in a detailed
and systematic way, and perhaps, if readers wish, the next
issues will include source code illustrating how these
algorithms work. :-)
Do you want to use data compression in programs you write, or
are you going to write the 'coolest' packer for the Speccy? Are
you interested in the scientific side of data compression? A
systematic exposition with short mathematical derivations will
help you gain reasonably solid knowledge of the subject.
The magic formula:
COMPRESSION = MODELING + CODING
-----------------------------------------
Data compression is the process of reducing the number of bits
needed to store a certain amount of information.
Lossless - the information recovered from the compressed form
corresponds exactly to the original (before compression).
Lossy - the recovered information corresponds to the original
only partially (used in processing images and sound).
Generally speaking, data compression is the process of reading
the symbols of some message and translating those symbols into
codes. If this process is organized effectively, the resulting
coded message takes up less space than the original.
While scanning the message being processed, a compression
algorithm carries out two almost independent processes:
- maintaining a MODEL of the message being processed;
- CODING the next message fragment based on that model.
The whole compression process is often mistakenly identified
with coding alone, whereas coding is only the part of the
compression process that interacts with the data model. Hence
the compression achieved with the same coder will be different
for different modeling techniques.
So if you ever hear a statement like "Huffman coding gives
optimal results, it has been proved mathematically", or,
conversely, "Huffman coding does not give good results", take
it easy. Both statements are, to say the least, incorrect. In
every case the overall result of a compression algorithm
depends both on the modeling method and on the coding method.
Huffman coding
-----------------------------------------
Graph - a collection of a set of nodes and a set of arcs
directed from one node to another.
Tree - a graph with the following properties:
a) no node has more than one incoming arc (i.e., there are no
cycles);
b) exactly one node has no incoming arc; it is called the root
of the tree;
c) moving along the arcs from the root, one can reach any node.
Leaf of a tree - a node from which no arc leaves. In a pair of
tree nodes connected by an arc, the one the arc leaves is
called the parent, the other the child. Two nodes are called
brothers if they have the same parent.
Binary tree - a tree in which exactly two arcs leave every
node except the leaves.
Huffman coding tree (H-tree) - a binary tree in which every
node has a weight, and the weight of a parent equals the total
weight of its children.
Input alphabet - the set of symbols that occur in the message.
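As an illustration of these definitions (the later examples in
this article use C), an H-tree node can be declared as a small
C structure. This sketch is ours, not part of the original
text:

typedef struct Node {
    int weight;                 /* weight of the node                  */
    int symbol;                 /* the character; -1 in internal nodes */
    struct Node *left, *right;  /* the 0- and 1-arcs; NULL in a leaf   */
} Node;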
In the late forties, at the dawn of information theory, ideas
for new and effective ways of encoding information were in the
air. Researchers were studying entropy, information content and
redundancy. Interestingly, the initial work on compression was
done before the advent of the modern digital computer. Today
information theory develops hand in hand with programming, but
back then the idea of developing algorithms that encode
characters using binary arithmetic was a significant step
forward.
One of the first efficient information coding algorithms was
proposed by D. A. Huffman in 1952. The idea of the algorithm is
as follows: knowing the probabilities with which characters
occur in the message, we can describe a procedure for
constructing variable-length codes consisting of a whole number
of bits. Shorter codes are assigned to the more probable
characters. Huffman codes have the unique-prefix property,
which allows them to be decoded unambiguously despite their
variable length.
The classic Huffman algorithm receives as input a table of
character occurrence frequencies in the message. A Huffman
coding tree (H-tree) is then built from this table. The
algorithm for constructing an H-tree is simple and elegant.
1. The characters of the input alphabet form a list of free
nodes. Each leaf has a weight, which may equal either the
probability or the number of occurrences of the character in
the message to be compressed.
2. The two free nodes with the lowest weights are selected.
3. A parent is created for them, with a weight equal to their
total weight.
4. The parent is added to the list of free nodes, and its two
children are removed from this list.
5. One arc leaving the parent is assigned bit 1, the other
bit 0.
6. The steps are repeated from step 2 until only one node
remains in the list of free nodes. It is considered the root
of the tree.
Suppose we have the following frequency table:

 15    7    6    6    5
  A    B    C    D    E
In the first step, the two tree leaves with the lowest
weights, D and E, are joined. They are attached to a new
parent node, whose weight is set to 6+5=11. Then nodes D and E
are removed from the free list. Node D corresponds to branch 0
of the parent, node E to branch 1.
In the next step the same happens with nodes B and C, since
this pair now has the lowest weight in the tree. A new node
with weight 13 is created, and nodes B and C are removed from
the free list. After all this, the coding tree looks as shown
in Figure 1.
        13          11
      0/  \1      0/  \1
 15   7    6      6    5
  A   B    C      D    E

Figure 1. The Huffman coding tree after the second step.
In the next step the "lightest" pair of nodes are B/C and D/E.
For them, once again, a parent is created, now with weight 24.
Node B/C corresponds to branch 0 of the parent, D/E to
branch 1.
In the last step only two nodes remain in the free list: node
A and node (B/C)/(D/E). Once more a parent with weight 39 is
created, and the former free nodes are attached to its two
branches.
Since only one free node is left, the algorithm for
constructing the Huffman coding tree is complete. The
resulting H-tree is presented in Figure 2.
             39
           0/  \1
         15      24
          A    0/  \1
             13      11
           0/  \1  0/  \1
           7    6  6    5
           B    C  D    E

Figure 2. The final Huffman coding tree.
To determine the code for each character of the message, we
walk from the leaf of the tree corresponding to that character
up to the root, accumulating bits as we move along the
branches. The bit sequence obtained this way is the
character's code, written in reverse order.
For the given table the Huffman codes will be as follows:

 A   0
 B   100
 C   101
 D   110
 E   111
Since none of the obtained codes is a prefix of another, they
can be decoded unambiguously while being read from the stream.
In addition, the most frequent character of the message, A, is
encoded with the smallest number of bits, and the rarest
character, E, with the largest.
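To make the construction concrete, here is a small C program
that builds the H-tree for the frequency table above and prints
the codes. It is our illustrative sketch of steps 1-6 (reusing
the Node structure from the earlier sketch), not code from the
original article; ties between equal weights and the 0/1
labelling of arcs may be resolved differently than in the
figures, so the exact codes may differ from the table while the
code lengths match.

#include <stdio.h>
#include <stdlib.h>

static Node *make_node(int w, int sym, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->weight = w; n->symbol = sym; n->left = l; n->right = r;
    return n;
}

/* Steps 2-6 of the algorithm: repeatedly join the two lightest
   free nodes until a single node, the root, remains. */
static Node *build_tree(Node *list[], int n)
{
    while (n > 1) {
        int i, a, b, t;
        a = (list[0]->weight <= list[1]->weight) ? 0 : 1;  /* lightest  */
        b = 1 - a;                                         /* runner-up */
        for (i = 2; i < n; i++) {
            if (list[i]->weight < list[a]->weight)      { b = a; a = i; }
            else if (list[i]->weight < list[b]->weight) { b = i; }
        }
        if (a > b) { t = a; a = b; b = t; }
        /* step 3: the parent's weight is the children's total weight */
        list[a] = make_node(list[a]->weight + list[b]->weight, -1,
                            list[a], list[b]);
        list[b] = list[--n];          /* step 4: drop the children     */
    }
    return list[0];
}

/* Step 5 happens here: going left appends a 0, going right a 1. */
static void print_codes(const Node *t, char *buf, int depth)
{
    if (t->symbol >= 0) {
        buf[depth] = '\0';
        printf("%c %s\n", t->symbol, buf);
        return;
    }
    buf[depth] = '0'; print_codes(t->left,  buf, depth + 1);
    buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
}

int main(void)
{
    const int freq[5] = { 15, 7, 6, 6, 5 };   /* the table above */
    Node *list[5];
    char buf[16];
    int i;
    for (i = 0; i < 5; i++)
        list[i] = make_node(freq[i], 'A' + i, NULL, NULL);
    print_codes(build_tree(list, 5), buf, 0);
    return 0;
}

With the codes from the table above, a message such as ABAAD
would be packed into 0 100 0 0 110, i.e. 11 bits instead of
the 40 bits of its plain 8-bit representation.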
The classic Huffman algorithm has one significant drawback. To
recover the contents of a compressed message, the decoder must
know the frequency table that the coder used. Consequently, the
length of the compressed message grows by the length of the
frequency table, which has to be sent ahead of the data and
may undo all the effort spent compressing the message. In
addition, the need to have the complete frequency statistics
before coding begins requires two passes over the message: one
for constructing the model of the message (the frequency table
and the H-tree), the other for actually coding it.
Adaptive Compression
-----------------------------------------
Adaptive compression allows us not to transmit the model of
the message along with the message itself, and to limit both
encoding and decoding to a single pass over the data.
Virtually any form of coding can be converted into an adaptive
one. In general, a program implementing adaptive compression
can be expressed in the following form:
InitializeModel();
While not end of message
    Symbol = GetNextSymbol();
    EncodeSymbol(Symbol);
    UpdateModelWithSymbol(Symbol);
End While
The decoder in the adaptive scheme works similarly:
InitializeModel();
While not end of compressed data
    Symbol = DecodeNextSymbol();
    OutputSymbol(Symbol);
    UpdateModelWithSymbol(Symbol);
End While
The adaptive encoding/decoding scheme works because both
encoding and decoding use the same procedures,
InitializeModel() and UpdateModelWithSymbol(). Both the
compressor and the decompressor start with an "empty" model
(containing no information about the message), and each
updates it in the same way with every symbol processed.
Adaptive Huffman Coding
-----------------------------------------
The next step in the development of the Huffman algorithm was
its adaptive version. It is the main subject of this article.
In making the Huffman algorithm adaptive, the greatest
difficulties arise with the procedure UpdateModelWithSymbol().
We could simply place the complete construction of the Huffman
coding tree inside this procedure. The result would be the
slowest compression algorithm in the world, since building the
H-tree is too much work and doing it while processing every
character is unwise. Fortunately, there is a way to modify an
existing H-tree so that it reflects the processing of a new
character.
Ordered tree
-----------------------------------------
We say that a tree has the ordering property if its nodes can
be listed in order of non-decreasing weight, and in this list
every node is adjacent to its brother. An example of an
ordered tree is shown in Figure 3. Here W is the weight of a
node and N is its ordinal number in the list of nodes.
            W=17
            N=9
           /    \
       W=7        E
       N=7       W=10
      /    \      N=8
   W=3      W=4
   N=5      N=6
  /   \    /   \
 A     B  C     D
W=1  W=2  W=2  W=2
N=1  N=2  N=3  N=4

Figure 3. The ordering property of a Huffman coding tree.
We accept without proof the statement that a binary tree is a
Huffman coding tree if and only if it has the ordering
property.
Preserving the ordering property while updating the tree
allows us to be sure that the binary tree we are working with
is an H-tree both before and after the weights of its leaves
change.
Updating the tree while reading the next message symbol
consists of two operations.
The first, increasing the weight of tree nodes, is shown in
Figure 4. First the weight of the leaf corresponding to the
symbol is increased by one. Then the weight of its parent is
increased to bring it into line with the new weights of its
children. This process continues until we reach the root of
the tree. The average number of weight-increase operations
equals the average number of bits needed to encode a
character.
The second operation, a permutation of tree nodes, is required
when the weight increase breaks the ordering property, that
is, when the increased weight of a node becomes greater than
the weight of the next node in the list (Fig. 5). If we simply
continued to propagate the weight increase towards the root,
the tree would stop being a Huffman tree.
Step 4:         W=18
                N=9
               /    \
Step 3:    W=8        E
           N=7       W=10
          /    \      N=8
Step 2: W=4      W=4
        N=5      N=6
       /   \    /   \
Step 1: A    B  C    D
      W=2  W=2 W=2  W=2
      N=1  N=2 N=3  N=4

Figure 4. Updating the coding tree when the weight of leaf A
is increased.
            W=18
            N=9
           /    \
       W=8        E
       N=7       W=10
      /    \      N=8
   W=4      W=4
   N=5      N=6
  /   \    /   \
 A     B  C     D
W=3  W=2  W=2  W=2
N=1  N=2  N=3  N=4

Figure 5. As the weight of leaf A increases further, the
ordering property is violated.
To preserve the ordering of the coding tree, the algorithm
works as follows. Let the new, increased weight of the node be
W+1. We then move along the list in the direction of
increasing weight until we find the last node with weight W.
We swap the current node and the found one in the list
(Figure 6), thus restoring order in the tree. (The parents of
both nodes change as well.) The permutation operation ends
here.
            W=18
            N=9
           /    \
       W=8        E
       N=7       W=10
      /    \      N=8
   W=4      W=4
   N=5      N=6
  /   \    /   \
 D     B  C     A
W=2  W=2  W=2  W=3
N=1  N=2  N=3  N=4

Figure 6. The coding tree after the first permutation of
nodes D and A.
After the permutation, the weight-increase operation
continues. The next node whose weight the algorithm will
increase is the new parent of the node whose weight increase
caused the permutation.
Suppose that the symbol A occurs in the message two more times
in a row. The coding tree after the update procedure is called
twice more is shown in Figure 7.
        W=21
        N=9
       /    \
      E      W=11
    W=10     N=8
     N=7    /    \
           A      W=6
          W=5     N=6
          N=5    /    \
                C      W=4
               W=2     N=4
               N=3    /    \
                     D      B
                    W=2    W=2
                    N=1    N=2

Figure 7. The tree after the update is complete.
Notice how the update of the coding tree is reflected in the
lengths of the Huffman codes of the characters occurring in
the message.
In general, the tree update algorithm can be written as
follows:
UpdateModelWithSymbol(Symbol)
{
    CurrentNode = LeafCorrespondingTo(Symbol)
    Loop
        IncreaseWeight(CurrentNode)
        If (CurrentNode = TreeRoot) Exit
        If (Weight(CurrentNode) >
            Weight(NextInList(CurrentNode)))
            Permute()
        CurrentNode = Parent(CurrentNode)
    End Loop
}
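Below is a C sketch of this procedure, under the following
assumptions (all names are ours, not the article's): each node
additionally stores a parent pointer and its ordinal number N,
list[] is the ordered node list from Figure 3, nnodes is its
length, and leaf[] maps each character to its leaf. A swap
exchanges whole subtrees, in the tree and in the list at once:

typedef struct ANode {                 /* H-tree node, adaptive version */
    int weight, order, symbol;         /* symbol < 0 in internal nodes  */
    struct ANode *parent, *left, *right;
} ANode;

ANode *list[2 * 258];                  /* nodes in order of weight (N); */
int    nnodes;                         /* room for 256 chars + ESC/EOF  */
ANode *leaf[256];                      /* character -> leaf, NULL if new */

/* Exchange subtrees a and b: their parents and list positions change,
   the subtrees themselves do not. */
static void swap_nodes(ANode *a, ANode *b)
{
    ANode **pa = (a->parent->left == a) ? &a->parent->left : &a->parent->right;
    ANode **pb = (b->parent->left == b) ? &b->parent->left : &b->parent->right;
    ANode *p; int t;
    *pa = b; *pb = a;
    p = a->parent; a->parent = b->parent; b->parent = p;
    t = a->order;  a->order  = b->order;  b->order  = t;
    list[a->order] = a; list[b->order] = b;
}

/* UpdateModelWithSymbol: climb from the leaf to the root, increasing
   weights; when the order breaks, swap with the last node whose weight
   is still the old value W, exactly as described above. */
void update_model(int symbol)
{
    ANode *cur = leaf[symbol];
    while (cur != NULL) {
        int j = cur->order;
        cur->weight++;
        while (j + 1 < nnodes && list[j + 1]->weight < cur->weight)
            j++;                       /* the last node with weight W   */
        if (j != cur->order)
            swap_nodes(cur, list[j]);  /* cur->parent is now the NEW    */
        cur = cur->parent;             /* parent, as the text requires  */
    }
}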
The problems of adaptive coding
-----------------------------------------
Initialization
It would be good if the coder did not waste code space on
characters that do not occur in the message.
In the classic Huffman algorithm, the characters that do not
occur in the message are known before coding begins, since the
frequency table is known, and with it the characters whose
occurrence frequency is 0. In the adaptive version of the
algorithm we cannot know in advance which characters will
appear in the message. We could initialize the Huffman tree so
that it contains all 256 characters of the alphabet (for 8-bit
codes) with a frequency of 1 each. At the start of coding,
each code will then be 8 bits long. As the model adapts,
frequent characters will be coded with fewer and fewer bits.
This approach works, but it significantly reduces compression,
especially for short messages.
It is better to start modeling with an empty tree and add
characters to it only as they appear in the message being
compressed. But this leads to an obvious contradiction: when a
character appears in the message for the first time, it cannot
be encoded, since it is not yet in the coding tree.
To resolve this contradiction we introduce a special ESCAPE
code, which tells the decoder that the following character is
encoded outside the context of the message model. For example,
it can be sent into the compressed data stream as is, without
any coding at all. The method EncodeSymbol() of the adaptive
Huffman coding algorithm can be written as follows:
EncodeSymbol(Symbol)
{
    If SymbolIsAlreadyInTable(Symbol)
        OutputHuffmanCodeFor(Symbol)
    Else
    {
        OutputHuffmanCodeFor(ESCAPE)
        OutputSymbol(Symbol)
    }
}
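In C this might look as follows (our sketch with hypothetical
helpers; leaf[] is the symbol-to-leaf table from the update
sketch above):

void output_huffman_code(const ANode *nd);  /* emit the leaf's code    */
void output_raw_byte(int symbol);           /* emit 8 bits, uncoded    */
extern ANode *esc_leaf;                     /* leaf of the ESCAPE code */

void encode_with_escape(int symbol)
{
    if (leaf[symbol] != NULL) {             /* already in the table    */
        output_huffman_code(leaf[symbol]);
    } else {                                /* first occurrence:       */
        output_huffman_code(esc_leaf);      /* announce with ESCAPE,   */
        output_raw_byte(symbol);            /* then send it as is      */
    }
}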
Using the special ESCAPE character implies a certain
initialization of the tree before coding and decoding begin:
two special characters, ESCAPE and EOF (end of file), are
placed in it, each with a weight of 1 (Fig. 8). Since the tree
update process does not touch their weights, during coding
they drift to the most distant branches of the tree and end up
with very long codes.
     ROOT
     W=2
    /    \
  EOF    ESC
  W=1    W=1

Fig. 8. The coding tree after initialization.
Overflow
As the compression algorithm runs, the weights of the nodes in
the Huffman coding tree grow steadily. The first problem
arises when the weight of the root of the tree begins to
exceed the capacity of the cell in which it is stored.
Typically this is a 16-bit value, which therefore cannot
exceed 65535. A second problem deserves even more attention:
it can occur much earlier, when the length of the longest
Huffman code exceeds the capacity of the cell used to pass it
to the output stream. The decoder does not care how long a
code is, since it decodes by moving down the coding tree,
taking one bit at a time from the input stream.
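As a sketch (get_bit() is a hypothetical routine returning the
next bit of the compressed stream; ANode is the structure from
the update sketch):

int get_bit(void);                     /* next bit of the input stream */

int decode_symbol(const ANode *root)
{
    const ANode *n = root;
    while (n->symbol < 0)              /* internal node: keep walking  */
        n = get_bit() ? n->right : n->left;
    return n->symbol;                  /* a leaf: the character (or    */
}                                      /* ESCAPE/EOF) has been decoded */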
The coder, however, has to start at a leaf of the tree and
move up to the root, collecting the bits to be transmitted.
This is usually done in a variable of type "integer", and when
the length of a Huffman code exceeds the size of "integer" in
bits, an overflow occurs.
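The coder's walk might look like this (our sketch; put_bits()
is a hypothetical routine that emits len bits starting from
the most significant one). The accumulator "code" is exactly
the cell that overflows when a Huffman code gets longer than
the integer is wide:

void put_bits(unsigned code, int len); /* emit len bits, MSB first     */

void output_huffman_code(const ANode *nd)
{
    unsigned code = 0, mask = 1;
    int len = 0;
    for (; nd->parent != NULL; nd = nd->parent) {
        if (nd == nd->parent->right)   /* this arc is labelled 1       */
            code |= mask;
        mask <<= 1;                    /* bits collected leaf-to-root  */
        len++;                         /* end up in root-to-leaf order */
    }
    put_bits(code, len);               /* len > 16 overflows a 16-bit  */
}                                      /* accumulator                  */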
SoundTrack: NIK-O: -= CrAzY NOSTALGIE =- ...
__________________________________________
It can be proved that, for one and the same input alphabet, a
Huffman code of maximum length occurs in a message whose
symbol frequencies form a Fibonacci sequence (Fig. 9).
                    W=13
                   /    \
               W=8       F
              /    \    W=5
           W=5      E
          /    \   W=3
       W=3      D
      /    \   W=2
   W=2      C
  /    \   W=1
 A      B
W=1    W=1

Fig. 9. A Huffman tree for Fibonacci frequencies.
The Fibonacci function is defined as follows:
int Fib(int n)
{
    if (n <= 1) return 1;
    else return Fib(n - 1) + Fib(n - 2);
}
The function is written in C, but I hope its meaning is clear.
For example:

Fib(5) = Fib(4) + Fib(3)
       = Fib(3) + Fib(2) + Fib(2) + Fib(1)
       = Fib(2) + Fib(1) + Fib(1) + Fib(0) + Fib(1) + Fib(0) + 1
       = 2 + 1 + 1 + 1 + 1 + 1 + 1 = 8.
If the weight of the root of a Huffman tree equals Fib(i),
then the longest code possible for this tree is i-1 bits long.
This means that if the integer variable in which the coder
assembles a code is 16 bits long, then a tree whose root
weight has reached 4181 (Fib(18)) can already produce a code
that overflows it: by the statement above, the longest code
would be 17 bits. (Generally speaking, a message with
character frequencies equal to the Fibonacci numbers up to
Fib(18) is a great way to test the correctness of a Huffman
compression program.)
Scaling the weights of the H-tree nodes
-----------------------------------------
Taking the above into account, the Huffman tree update
algorithm should be changed as follows: when increasing a
weight, check whether it has reached the permitted maximum. If
it has, the weights must be "scaled", usually by dividing the
weights of the leaves by an integer, for example 2, and then
recomputing the weights of all the other nodes.
However, dividing the weights in half brings a problem: after
this operation the tree can change its shape. The reason is
that we are dividing integers, and the division truncates.
A properly organized Huffman tree after scaling can have a
shape significantly different from the original one. This is
because scaling loses some of the accuracy of our statistics.
But as new statistics are collected, the consequences of these
"errors" practically fade away. Scaling the weights is quite
an expensive operation, since it requires rebuilding the whole
coding tree. But since the need for it arises relatively
rarely, we can put up with it.
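As a sketch (leaf[] is the table from the sketches above;
rebuild_tree() is a hypothetical routine that rebuilds the
H-tree from the leaf weights, e.g. by the classic algorithm
from the first part of the article):

void rebuild_tree(void);               /* rebuild the H-tree from the  */
                                       /* current leaf weights         */
void scale_weights(void)
{
    int i;
    for (i = 0; i < 256; i++)
        if (leaf[i] != NULL) {
            leaf[i]->weight /= 2;      /* integer division truncates - */
            if (leaf[i]->weight == 0)  /* this is what can change the  */
                leaf[i]->weight = 1;   /* shape of the rebuilt tree    */
        }
    rebuild_tree();
}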
Gains from scaling
-----------------------------------------
Scaling the weights of the tree nodes at regular intervals
gives an unexpected result. Although scaling loses statistical
accuracy, tests show that it leads to better compression than
postponing the scaling would. This can be explained by the
fact that the current characters of the stream being
compressed are more "similar" to their near predecessors than
to those that occurred much earlier. Scaling reduces the
influence of "old" characters on the statistics and increases
the influence of "recent" ones. This is very hard to measure
quantitatively, but in principle scaling has a positive effect
on the degree of compression. Experiments with scaling at
various points of the compression process show that the degree
of compression depends substantially on the moment the weights
are scaled, but there are no rules for choosing the optimal
moment of scaling for a program intended to compress arbitrary
types of information.
Conclusion
-----------------------------------------
Since D. A. Huffman published his 1952 paper "A Method for the
Construction of Minimum-Redundancy Codes", his coding
algorithm has become the basis for an enormous number of
further studies in this area. To this day computer journals
carry a large number of publications devoted both to various
implementations of the Huffman algorithm and to the search for
its best applications. Huffman coding is used in commercial
compression software, is built into some fax machines, and is
even used in the JPEG algorithm for lossy compression of
graphic images.
THE END.
The contents of the following articles:
- Arithmetic coding
- The LZ algorithms (Lempel-Ziv)
- The LZW algorithm (Lempel-Ziv-Welch)
- Writing a packer, or combined compression schemes
- etc...