Odyssey Magazine #01
05 марта 1997

System - IBM: On algaritme compression Lempel-Ziw Welch and its implementation for format GIF.

<b>System</b> - IBM: On algaritme compression Lempel-Ziw Welch and its implementation for
 format GIF.
  Steve Blackstock

  Steve Blackstock helped porusski speak at the Institute of 
Applied Mathematics AH CCCP A. Samotohin



Music by Phantom Lord


          EXPLANATION LZW and GIF



      I hope this little document will help enlighten those who 
want to know more about the compression algorithm

Lempel-Ziv Welch, and specifically on its realization for the 
format of GIF. 

     Before we begin, a little bit about
terminology in the light of this document:

  Symbol: a fundamental element of the data. In ordinary text 
files is a single byte. Into a bitmap you are interested, this 
index which specifies the color of the pixel. I

I will refer to any character as
on "K".

  The stream of characters: a stream of characters such
as a data file.

  "Chain": a few consecutive
characters. Chain length may vary
1 to a very large number of characters. I
can specify an arbitrary string as
"[...] K ".

  "Prefix": almost the same as a chain, but it is understood 
that the prefix immediately preceding character, and

prefix can have zero length. I will
refer to an arbitrary prefix, as in
"[...]".

  "Root": a single-character string. For
most purposes, it's just a symbol, but
sometimes it can be otherwise. This [...] K,
where [...] is empty.

  "Code": the number determined by the well-known
the number of bits that encodes the chain.

  The stream codes: output codes
such as "raster data".

  "Element": the code and its chain.

  "String table": a list of items
usually, but not necessarily unique.

     This should be sufficient for understanding the document.

     LZW - a method of data compression, which removes the 
advantages of the chains with repeating data. Since raster data 
usually contain a lot of this repetition, LZW is a good

method to compress them, and disclosure.

     At the moment, let's consider
conventional encoding and decoding using LZW-algorithm. In the 
GIF used variation of this algorithm.


     Compression and disclosure LZW manipulates three objects: 
a stream of characters, codestream and a table conversations. 
When compressed stream of characters is an input and

codestream - weekends. In disclosing
input is a stream of codes, and the flow
symbols - the weekends. String table is generated and the 
compression and the disclosure, but it is never transmitted 
from the compression to the disclosure and vice versa. 

     The first thing we do in
LZW-compression is initialize our
strings. To do this, we
must choose a code size (bits) and know how many possible
values ​​can take our characters.
Let's put the code size to 12 bits, which means you can store
0FFF, or 4096, elements in our table
chains. Let's also assume that
We have 32 different possible characters.
(This corresponds, for example, a picture with
32 possible colors for each pixel.) To initialize the table, we
establish compliance with the code # 0 character
# 0, code # 1 to character # 1, and so on, until code # 31 and 
# 31 symbols. In fact, we indicated that each code from 0 to 31 
is the root. More in the table will not be other codes with 
this property. 

     Now we will begin to compress the data. Let's first define 
something called "Current prefix". This prefix, we will always 
remember and to compare with him here in the future. I will 
denote it as "[. c.]". Initially, current prefix contains 
nothing. Let us also define as the "current string", which 
formed the current prefix and the next character in the stream 
of characters. I will refer to the current chain as "[. c.] K", 
where K - some character. 

     Now look at the first symbol in
stream of symbols. Let's call it P. We make
[. C.] P the current chain. (At this point,
This, of course, the root P.) Now let's run
Search string table to determine
Does it [. c.] P. Of course, now
this happens because in our table
initialization have been placed all the roots. In this case we 
do nothing. Now make the current prefix [. C.] P.


     Take the next character from stream
symbol. Let's call it Q. Add current
prefix to form [. c.] Q, ie
current string. Do a search in the string table to determine 
whether the in it [. c.] Q. In this case, of course, will not. 
Aha! Now we need to do something. Add [. C.] Q (which

in this case is PQ) in the string table as code # 32, and 
derive the code for [. C.] to the codestream. Now start again

with the current prefix, the corresponding
root of P. We continue to add characters to the
[. C.], to form [. C.] K, until
long as we can not find [. c.] K in
string table. Then output the code for
[. C.] and add [. C.] K to the string table. In pseudo code the 
algorithm will be described approximately as follows:


     [1] Initialize string table;

     [2] [. C.] <- empty;

     [3] K <- next character in the stream

         characters;

     [4] Does [. C.] K in the table chain
         kidneys? (Yes: [. C.] <- [. C.] K;

         go to [3])

         (No: add [. C.] K in the table
         zu chains;

         output the code for [. c.] in the flow

         codes; [. c.] <- K; go to [3])


      How easy it is! Of course, when we execute step [3] and 
the input stream does not have more characters, you draw code 
for [. C.] and leave the table. All done.


      Want an example? Let's assume
that we have a 4-character alphabet:
A, B, C, D. Stream of characters looks like ABACABA. Let's 
compress it. First, we initialize our string table: # 0 = A, # 
1 = B, # 2 = C, # 3 = D. The first character is A, which is 
included in the string table, so [. c.] becomes A. Next we take 
the AB, which is not included in the table hence we derive code 
# 0 (for [. C.]), and add AB to the string table

code # 4. [. C.] becomes B.
Next we get [. C.] A = BA, which is not
included in the string table, so
output code # 1, and add BA to the table
chains with a code # 5. [. C.] becomes A. Next we get AC, which 
is not included in the string table. Output code # 0, and add 
AC to the string table with code # 6. Now [. C.] is equal to C. 
Next we get [. C.] A = CA, which is not included in the table.

Output # 2 for C, and add CA to the table as code # 7. Now [. 
C.] = A. Next, we take AB, which is a string table, so [. c.] 
becomes equal to AB, and we are looking for ABA, which is not 
in the string table, so we output the code for AB, which is # 
4, and add ABA to the string table as code # 8. [. C.] is equal 
to A. We can no longer take the characters, so we display code 
# 0 for A and ends. Consequently, the codestream is # 0 # 0 # 1 
# 2 # 4 # 0. 

      A few words (three), it should be said about efficiency: 
use hashing strategy. Search for the string table can be paired 
with significant computation and hashing significantly

reduce these costs. Note
that "direct LZW" compression runs the risk of
overflow string table - and you can
code that can not be represented
number of bits, previously established for the codes. There are 
several ways for To cope with this problem, and GIF

implements the most simple of them. We will
do well.

      The important point, which is
note is that at any point during the compression of the 
condition: if [...] K in the string table, then [...] also 
enters into it. This circumstance leads to an effective

method of storing the chains in the table.
Instead of memorizing the table
the whole chain, use the fact that any
chain can be represented as a prefix plus a character: [...] K. 
If you make [...] K to the table, you know that [...]

already there, so you can
remember the code for [...] plus closing
symbol K.

      It's all about what care should be taken
under compression. Disclosure, perhaps more
conceptually difficult, but software
it is easier to implement.

      We describe how to do it. We again
begin with the initialization string table.
This table is formed based on the knowledge we possess about 
the generators in the end the stream of symbols, such as the 
possible values ​​of the characters. In GIF-files, this 
information can be found in the title, as the number of 
possible values pixels. However, the beauty of LZW is

that is all we need. Compression
was carried out in such a way that we never meet in a stream of 
codes code, which we could not convert into a chain. 

      We need to define something
called the "current Code" to which we refer to as "", and 
"old code", which we refer to as "<old>". To begin unpacking 
take the first code. Now it becomes .

This code will initialize the table
chains as root. Derive a root in a stream of characters. This 
code does Legacy Code <old>.


     (*) Now we take the code and
assigning it . It is possible that
This code is not included in the string table, but
let's assume that it's there
there. Derive the chain corresponding
 a stream of characters. Now we find
first character in the chain, you
just received. Call it K. Add it to the prefix [...], generated 
by <old>, to get a new string [...] K. Add this chain to string 
table and set the old code <old> equal to the current code 
. Repeat on the place that I have outlined an asterisk 
and you're all done. Read this paragraph again if you just 
"run" on it! 

     Now let's consider the possibility that  not 
included in the string table. Get back to the compression and 
try to understand what happens when you input stream appears 
the chain-type P [...] P [...] PQ. Assume that P [...]

is already in the table, but P [...] P - no.
The encoder performs parsing
P [...], and finds that P [...] P is not in the table. This 
will lead to the conclusion code for P [...] and add P [...] P 
to string table. Then he will take P [...] P

for the chain and find that
P [...] P is in the table and give off
code for P [...] P, if it appears that
P [...] PQ in the table is missing.

      The decompressor is always "on
one step behind "the encoder. When the decoder sees the code 
for P [...] P, it is not add this code to your table right away,

because he needed the beginning character
P [...] P to add to the chain for the last code P [...], to form
code for P [...] P. However, when the decoder finds the code 
that it is still unknown, he will always be on a more recent 
addition to the table. Consequently, he can guess that the 
chain this code should be and, in fact,

will always be correct.

     If I decoder, and I saw the code
# 124, and my string table contains the latest code is only # 
123, I assume that code # 124 must be added to

to my string table and to bring the very chain. If code # 123 
generates a chain to which I refer here as the prefix

[...], Then code # 124 in this particular case
will be [...] plus the first character of [...]. So I have to 
add the first symbol [...] To itself. Not so bad.


     As an example (often
occurring), let us assume that
we have a bitmap image in which
the first three pixels have the same color.
Ie My character stream looks like:
QQQ .... For definiteness, let us say that we have 32 colors 
and Q corresponds to the color # 12. The encoder generates a 
sequence of codes 12.32 ,.... (If you do not understand why, 
take a minute to understand.) Recall that # 32 not included in 
the initial table, which contains codes from # 0 to # 31. 
Decoder see # 12 and transmits it as a color

Q. Then he will see # 32, the significance of
which he does not know. But if he thinks about it long enough, 
he can understand that QQ should be part of # 32

in the table and QQ should be the following chain of inference.

     Thus, the pseudo-code decoding can be represented as 
follows: 

     [1] Initialize string chains;

     [2] to take the first code: ;

     [3] output circuit for a 

         stream of characters;

     [4] <old> = ;

     [5]  <- the following code to
         current codes;

     [6], there is  in Table
         chains chains? (Yes: output chain
         Key to  a stream of symbols
         fishing ;[...] <- listening to

         <old>; K <- first character of the transit
         slyatsii for ; add

         [...] K to the string table;

         <old> <- ;) (no: [...]

         <- Listening to <old>; K

         <- First character [...]; output

         [...] K in a stream of characters and to
         The addition to the table the chains
         check; <old> <- )

     [7] go to [5];


      Again, if you discover in step [5], there are no more 
characters, you must complete. Conclusion chains and finding 
the initial character in them put on their own performance 
problems, but I not going there to offer ways to

solutions. Half the fun of programming is to allow these pieces!

      And now GIF'a variations on this theme. In the header 
part of the GIF-file, there is a field called the stream of 
raster Data code size. " This is a very confusing name for this 
field, but we must come to terms with it. In fact, this

"The size of the root." The actual size (in bytes) of 
compression codes actually changes during compression / 
decompression, and I I will refer to it here, as in

"Compression size."

     The initial table, as usual, contains codes for all the 
roots, but its top adds two special

code. Suppose we have a "code size", which is usually equal to 
the number of bits per pixel. Denote its N. If the number of 
bits per pixel is 1, N must be 2:

roots take up slots # 0 and # 1 in the initial
table and two special codes will take up slots # 4 # 5. In any 
other case where N equals the number of bits per pixel, the 
roots with the cells from # 0 to # (2 ** N-1), and

Special codes are (2 ** N) and (2 ** N +
1).

     The initial compression size will be equal
N +1 bit of the code. If you're encoding, you draw first codes 
of length (N +1) bits, and if you're decoding, you choose the 
first (N +1) bits of flow codes. As the special codes are used: 
 or cleanup code is (2 ** N), and <EOI> or end of 
information, equal to (2 ** N + 1).  Said encoder, which 
need to re-initialize the string table and then reinstall 
compression size equal to (N +1). <EOI> Means that the code is 
no more. If you're encoding or decoding, you should

start adding items to the string table  + 2. If you're 
encoding, you should bring  as the very first code, and 
then again each time, once you reach the code

# 4095 (hex FFF), because
GIF does not allow for a greater amount of compression
12 bits. If you're opening, you
should re-initialize your table
chains, as soon as you find .

     Variable length compression on the
really does not give much trouble. If you
're encoding you start with a size
compression in the (N +1) bits, and as soon as you
output the code (2 ** (compression size) -1), you
increase the compression size by one bit.
Consequently, the following code to your output will be one bit 
longer. Remember that the largest compression size is 12 bits, 
which corresponds to code 4095. If you reached this limit, you 
must withdraw  As the next code and begin

first. If you're decoding, you
must increase your compression size AS
Once you burn an element of # (2 ** (size
compression) - 1) in the string table. Next
the code that you read will be one
bit longer. Do not make the mistake of waiting until you need 
to add to the table code (2 ** compression size). You already 
missed the last bit of code. 

     Packaging codes bitstream
raster data is also a potential stumbling block for beginners
encoding and decoding. Least significant bit
code must match the lowest available
bit of the first available byte in the stream
codes. For example, if you start with a 5-bit compression 
codes, and your first three code, for example, , 
<fghij>, <klmno>, where e, j, and bits o # 0, your codestream

begin as:


       byte # 0: hijabcde

       byte # 1:. klmnofg


      Thus the distinction between conventional LZW and GIF LZW 
to lie in the presence of two additional special codes and a 
variable amount of compression. If you understand LZW, and you 
understand these differences, you understood everything!


      As P.S. You may have noticed
that the encoder has a little bit
flexibility during compression. I have described "greedy" 
method that selects the code before displaying so many 
characters, how possible. In fact, this method is standard for 
LZW and results in the best compression ratio. However, there 
is no rule that prohibits you to stop and withdraw code for the 
current prefix, regardless of whether there he was on the table 
or not, and add this string plus the next character in the 
string table. There are various reasons to wish to do so, 
especially if the chain is too long and creates difficulties 
when hashing. If you need it, do it.



       Hope this helps you.

__________________________________________ Steve Blackstock





Other articles:

Entry - About the authors of the magazine and the magazine.

Entry - About the authors of the journal

Entry - Guide to the shell of the magazine.

a rest - GLODING PROGRAMMING (Programming upwards diagonally)

Assembler - How to calculate the sine assemlere.

System - IBM: On algaritme compression Lempel-Ziw Welch and its implementation for format GIF.

Miscellaneous - about computer problems: PROS and Scorpio, IBM ...

a rest - "Fun with the parties."

Demorynok - Hit Parade music demonstrations.

History - Hackers - art "it" - about the history of the emergence of hacking.

History - The classification of hackers.

a rest - "How to break down half?".

Letters - Reviews the readers of the magazine.

Contest - Competition for the best puzzle!

Assembler - A quick calculation of the address by two coordinates.

IS DOS - Problems and Solutions

News - news of city.

Demolition - Game Description THE DOOBLE.

Demolition - Game Description BLOOD WYCH.

Review - New games: RETURN TO HOME 4, CITADEL, KLADEMINER, BRIDGE PLAYER, CRUSHER, AMERICAN TURBO KING, RAD RAMP RACER, KUNG FU MASTER, CHOY LEE, SIDERAL WAR, ARKARUM, DIRT TRACK RACER, DOUBLE DRAGON 2, NIGHT BREED, THE CYCLES, MOONTORC, KOMMANDO 2.

System - Description of system software: UNIVERSAL SPRATE STUDIO (USS)

Guests - Old friends: On the history of the Krasnodar group UNIT-5

System - Description of system software: ACCEPT PROTECTION SYSTEM V1.0.



Similar articles:

В этот день...   30 April