Odyssey Magazine #01
05 марта 1997 |
|
System - IBM: On algaritme compression Lempel-Ziw Welch and its implementation for format GIF.
Steve Blackstock
Steve Blackstock helped porusski speak at the Institute of
Applied Mathematics AH CCCP A. Samotohin
Music by Phantom Lord
EXPLANATION LZW and GIF
I hope this little document will help enlighten those who
want to know more about the compression algorithm
Lempel-Ziv Welch, and specifically on its realization for the
format of GIF.
Before we begin, a little bit about
terminology in the light of this document:
Symbol: a fundamental element of the data. In ordinary text
files is a single byte. Into a bitmap you are interested, this
index which specifies the color of the pixel. I
I will refer to any character as
on "K".
The stream of characters: a stream of characters such
as a data file.
"Chain": a few consecutive
characters. Chain length may vary
1 to a very large number of characters. I
can specify an arbitrary string as
"[...] K ".
"Prefix": almost the same as a chain, but it is understood
that the prefix immediately preceding character, and
prefix can have zero length. I will
refer to an arbitrary prefix, as in
"[...]".
"Root": a single-character string. For
most purposes, it's just a symbol, but
sometimes it can be otherwise. This [...] K,
where [...] is empty.
"Code": the number determined by the well-known
the number of bits that encodes the chain.
The stream codes: output codes
such as "raster data".
"Element": the code and its chain.
"String table": a list of items
usually, but not necessarily unique.
This should be sufficient for understanding the document.
LZW - a method of data compression, which removes the
advantages of the chains with repeating data. Since raster data
usually contain a lot of this repetition, LZW is a good
method to compress them, and disclosure.
At the moment, let's consider
conventional encoding and decoding using LZW-algorithm. In the
GIF used variation of this algorithm.
Compression and disclosure LZW manipulates three objects:
a stream of characters, codestream and a table conversations.
When compressed stream of characters is an input and
codestream - weekends. In disclosing
input is a stream of codes, and the flow
symbols - the weekends. String table is generated and the
compression and the disclosure, but it is never transmitted
from the compression to the disclosure and vice versa.
The first thing we do in
LZW-compression is initialize our
strings. To do this, we
must choose a code size (bits) and know how many possible
values can take our characters.
Let's put the code size to 12 bits, which means you can store
0FFF, or 4096, elements in our table
chains. Let's also assume that
We have 32 different possible characters.
(This corresponds, for example, a picture with
32 possible colors for each pixel.) To initialize the table, we
establish compliance with the code # 0 character
# 0, code # 1 to character # 1, and so on, until code # 31 and
# 31 symbols. In fact, we indicated that each code from 0 to 31
is the root. More in the table will not be other codes with
this property.
Now we will begin to compress the data. Let's first define
something called "Current prefix". This prefix, we will always
remember and to compare with him here in the future. I will
denote it as "[. c.]". Initially, current prefix contains
nothing. Let us also define as the "current string", which
formed the current prefix and the next character in the stream
of characters. I will refer to the current chain as "[. c.] K",
where K - some character.
Now look at the first symbol in
stream of symbols. Let's call it P. We make
[. C.] P the current chain. (At this point,
This, of course, the root P.) Now let's run
Search string table to determine
Does it [. c.] P. Of course, now
this happens because in our table
initialization have been placed all the roots. In this case we
do nothing. Now make the current prefix [. C.] P.
Take the next character from stream
symbol. Let's call it Q. Add current
prefix to form [. c.] Q, ie
current string. Do a search in the string table to determine
whether the in it [. c.] Q. In this case, of course, will not.
Aha! Now we need to do something. Add [. C.] Q (which
in this case is PQ) in the string table as code # 32, and
derive the code for [. C.] to the codestream. Now start again
with the current prefix, the corresponding
root of P. We continue to add characters to the
[. C.], to form [. C.] K, until
long as we can not find [. c.] K in
string table. Then output the code for
[. C.] and add [. C.] K to the string table. In pseudo code the
algorithm will be described approximately as follows:
[1] Initialize string table;
[2] [. C.] <- empty;
[3] K <- next character in the stream
characters;
[4] Does [. C.] K in the table chain
kidneys? (Yes: [. C.] <- [. C.] K;
go to [3])
(No: add [. C.] K in the table
zu chains;
output the code for [. c.] in the flow
codes; [. c.] <- K; go to [3])
How easy it is! Of course, when we execute step [3] and
the input stream does not have more characters, you draw code
for [. C.] and leave the table. All done.
Want an example? Let's assume
that we have a 4-character alphabet:
A, B, C, D. Stream of characters looks like ABACABA. Let's
compress it. First, we initialize our string table: # 0 = A, #
1 = B, # 2 = C, # 3 = D. The first character is A, which is
included in the string table, so [. c.] becomes A. Next we take
the AB, which is not included in the table hence we derive code
# 0 (for [. C.]), and add AB to the string table
code # 4. [. C.] becomes B.
Next we get [. C.] A = BA, which is not
included in the string table, so
output code # 1, and add BA to the table
chains with a code # 5. [. C.] becomes A. Next we get AC, which
is not included in the string table. Output code # 0, and add
AC to the string table with code # 6. Now [. C.] is equal to C.
Next we get [. C.] A = CA, which is not included in the table.
Output # 2 for C, and add CA to the table as code # 7. Now [.
C.] = A. Next, we take AB, which is a string table, so [. c.]
becomes equal to AB, and we are looking for ABA, which is not
in the string table, so we output the code for AB, which is #
4, and add ABA to the string table as code # 8. [. C.] is equal
to A. We can no longer take the characters, so we display code
# 0 for A and ends. Consequently, the codestream is # 0 # 0 # 1
# 2 # 4 # 0.
A few words (three), it should be said about efficiency:
use hashing strategy. Search for the string table can be paired
with significant computation and hashing significantly
reduce these costs. Note
that "direct LZW" compression runs the risk of
overflow string table - and you can
code that can not be represented
number of bits, previously established for the codes. There are
several ways for To cope with this problem, and GIF
implements the most simple of them. We will
do well.
The important point, which is
note is that at any point during the compression of the
condition: if [...] K in the string table, then [...] also
enters into it. This circumstance leads to an effective
method of storing the chains in the table.
Instead of memorizing the table
the whole chain, use the fact that any
chain can be represented as a prefix plus a character: [...] K.
If you make [...] K to the table, you know that [...]
already there, so you can
remember the code for [...] plus closing
symbol K.
It's all about what care should be taken
under compression. Disclosure, perhaps more
conceptually difficult, but software
it is easier to implement.
We describe how to do it. We again
begin with the initialization string table.
This table is formed based on the knowledge we possess about
the generators in the end the stream of symbols, such as the
possible values of the characters. In GIF-files, this
information can be found in the title, as the number of
possible values pixels. However, the beauty of LZW is
that is all we need. Compression
was carried out in such a way that we never meet in a stream of
codes code, which we could not convert into a chain.
We need to define something
called the "current Code" to which we refer to as "", and
"old code", which we refer to as "<old>". To begin unpacking
take the first code. Now it becomes .
This code will initialize the table
chains as root. Derive a root in a stream of characters. This
code does Legacy Code <old>.
(*) Now we take the code and
assigning it . It is possible that
This code is not included in the string table, but
let's assume that it's there
there. Derive the chain corresponding
a stream of characters. Now we find
first character in the chain, you
just received. Call it K. Add it to the prefix [...], generated
by <old>, to get a new string [...] K. Add this chain to string
table and set the old code <old> equal to the current code
. Repeat on the place that I have outlined an asterisk
and you're all done. Read this paragraph again if you just
"run" on it!
Now let's consider the possibility that not
included in the string table. Get back to the compression and
try to understand what happens when you input stream appears
the chain-type P [...] P [...] PQ. Assume that P [...]
is already in the table, but P [...] P - no.
The encoder performs parsing
P [...], and finds that P [...] P is not in the table. This
will lead to the conclusion code for P [...] and add P [...] P
to string table. Then he will take P [...] P
for the chain and find that
P [...] P is in the table and give off
code for P [...] P, if it appears that
P [...] PQ in the table is missing.
The decompressor is always "on
one step behind "the encoder. When the decoder sees the code
for P [...] P, it is not add this code to your table right away,
because he needed the beginning character
P [...] P to add to the chain for the last code P [...], to form
code for P [...] P. However, when the decoder finds the code
that it is still unknown, he will always be on a more recent
addition to the table. Consequently, he can guess that the
chain this code should be and, in fact,
will always be correct.
If I decoder, and I saw the code
# 124, and my string table contains the latest code is only #
123, I assume that code # 124 must be added to
to my string table and to bring the very chain. If code # 123
generates a chain to which I refer here as the prefix
[...], Then code # 124 in this particular case
will be [...] plus the first character of [...]. So I have to
add the first symbol [...] To itself. Not so bad.
As an example (often
occurring), let us assume that
we have a bitmap image in which
the first three pixels have the same color.
Ie My character stream looks like:
QQQ .... For definiteness, let us say that we have 32 colors
and Q corresponds to the color # 12. The encoder generates a
sequence of codes 12.32 ,.... (If you do not understand why,
take a minute to understand.) Recall that # 32 not included in
the initial table, which contains codes from # 0 to # 31.
Decoder see # 12 and transmits it as a color
Q. Then he will see # 32, the significance of
which he does not know. But if he thinks about it long enough,
he can understand that QQ should be part of # 32
in the table and QQ should be the following chain of inference.
Thus, the pseudo-code decoding can be represented as
follows:
[1] Initialize string chains;
[2] to take the first code: ;
[3] output circuit for a
stream of characters;
[4] <old> = ;
[5] <- the following code to
current codes;
[6], there is in Table
chains chains? (Yes: output chain
Key to a stream of symbols
fishing ;[...] <- listening to
<old>; K <- first character of the transit
slyatsii for ; add
[...] K to the string table;
<old> <- ;) (no: [...]
<- Listening to <old>; K
<- First character [...]; output
[...] K in a stream of characters and to
The addition to the table the chains
check; <old> <- )
[7] go to [5];
Again, if you discover in step [5], there are no more
characters, you must complete. Conclusion chains and finding
the initial character in them put on their own performance
problems, but I not going there to offer ways to
solutions. Half the fun of programming is to allow these pieces!
And now GIF'a variations on this theme. In the header
part of the GIF-file, there is a field called the stream of
raster Data code size. " This is a very confusing name for this
field, but we must come to terms with it. In fact, this
"The size of the root." The actual size (in bytes) of
compression codes actually changes during compression /
decompression, and I I will refer to it here, as in
"Compression size."
The initial table, as usual, contains codes for all the
roots, but its top adds two special
code. Suppose we have a "code size", which is usually equal to
the number of bits per pixel. Denote its N. If the number of
bits per pixel is 1, N must be 2:
roots take up slots # 0 and # 1 in the initial
table and two special codes will take up slots # 4 # 5. In any
other case where N equals the number of bits per pixel, the
roots with the cells from # 0 to # (2 ** N-1), and
Special codes are (2 ** N) and (2 ** N +
1).
The initial compression size will be equal
N +1 bit of the code. If you're encoding, you draw first codes
of length (N +1) bits, and if you're decoding, you choose the
first (N +1) bits of flow codes. As the special codes are used:
or cleanup code is (2 ** N), and <EOI> or end of
information, equal to (2 ** N + 1). Said encoder, which
need to re-initialize the string table and then reinstall
compression size equal to (N +1). <EOI> Means that the code is
no more. If you're encoding or decoding, you should
start adding items to the string table + 2. If you're
encoding, you should bring as the very first code, and
then again each time, once you reach the code
# 4095 (hex FFF), because
GIF does not allow for a greater amount of compression
12 bits. If you're opening, you
should re-initialize your table
chains, as soon as you find .
Variable length compression on the
really does not give much trouble. If you
're encoding you start with a size
compression in the (N +1) bits, and as soon as you
output the code (2 ** (compression size) -1), you
increase the compression size by one bit.
Consequently, the following code to your output will be one bit
longer. Remember that the largest compression size is 12 bits,
which corresponds to code 4095. If you reached this limit, you
must withdraw As the next code and begin
first. If you're decoding, you
must increase your compression size AS
Once you burn an element of # (2 ** (size
compression) - 1) in the string table. Next
the code that you read will be one
bit longer. Do not make the mistake of waiting until you need
to add to the table code (2 ** compression size). You already
missed the last bit of code.
Packaging codes bitstream
raster data is also a potential stumbling block for beginners
encoding and decoding. Least significant bit
code must match the lowest available
bit of the first available byte in the stream
codes. For example, if you start with a 5-bit compression
codes, and your first three code, for example, ,
<fghij>, <klmno>, where e, j, and bits o # 0, your codestream
begin as:
byte # 0: hijabcde
byte # 1:. klmnofg
Thus the distinction between conventional LZW and GIF LZW
to lie in the presence of two additional special codes and a
variable amount of compression. If you understand LZW, and you
understand these differences, you understood everything!
As P.S. You may have noticed
that the encoder has a little bit
flexibility during compression. I have described "greedy"
method that selects the code before displaying so many
characters, how possible. In fact, this method is standard for
LZW and results in the best compression ratio. However, there
is no rule that prohibits you to stop and withdraw code for the
current prefix, regardless of whether there he was on the table
or not, and add this string plus the next character in the
string table. There are various reasons to wish to do so,
especially if the chain is too long and creates difficulties
when hashing. If you need it, do it.
Hope this helps you.
__________________________________________ Steve Blackstock
Other articles:
Similar articles:
В этот день... 21 November