Inferno #06
03 December 2004

For Coderz - Arithmetic coding.

Arithmetic coding
www.codenet.ru, author unknown. Published in abridged form.
The idea of arithmetic coding.
In arithmetic coding a text is represented by a real number between
0 and 1. As the text is encoded, the interval representing it shrinks
and the number of bits needed to specify it grows. Successive symbols
of the text reduce the interval in proportion to their probabilities,
as determined by the model. More probable symbols reduce it less than
improbable ones and therefore add fewer bits to the result.
Before anything is coded, the interval for the text is [0, 1). As
each symbol is processed, the interval is narrowed to the portion
allotted to that symbol. For example, let us apply to the text
"eaii!" the alphabet {a, e, i, o, u, !} and a model with fixed
probabilities (the width of each interval equals the probability of
its symbol):
"A" [0.0; 0.2);
"E" [0.2; 0.5);
"I" [0.5; 0.6);
"O" [0.6; 0.8);
"U" [0.8; 0.9);
"!" [0.9, 1.0).
Both the encoder and the decoder know that the initial interval is
[0, 1). After seeing the first symbol, "e", the encoder narrows the
interval to [0.2, 0.5), the range the model allots to this symbol.
The second symbol, "a", narrows this new interval to its first fifth,
since "a" is allotted the fixed interval [0.0, 0.2). The result is
the working interval [0.2, 0.26), because the previous interval was
0.3 units wide and one fifth of that is 0.06. The next symbol, "i",
corresponds to the fixed interval [0.5, 0.6), which, applied to the
working interval [0.2, 0.26), narrows it to [0.23, 0.236). Continuing
in the same spirit, we obtain:
In the beginning      [0.0; 1.0)
After seeing "e"      [0.2; 0.5)
After seeing "a"      [0.2; 0.26)
After seeing "i"      [0.23; 0.236)
After seeing "i"      [0.233; 0.2336)
After seeing "!"      [0.23354; 0.2336)
Suppose that all the decoder knows about the text is the final
interval [0.23354, 0.2336). It can immediately deduce that the first
encoded symbol is "e", because the final interval lies entirely
inside the range that the model allots to this symbol in the first
table. Now it can repeat the steps of the encoder:
In the beginning      [0.0; 1.0)
After seeing "e"      [0.2; 0.5)
From this it is clear that the second symbol is "a", since that
leads to the interval [0.2, 0.26), which entirely contains the final
interval [0.23354, 0.2336). Continuing to work in the same way, the
decoder will extract the whole text.
The decoder does not need to know both boundaries of the final
interval produced by the encoder. Even a single value lying inside
it, for example 0.23355, is enough. (Other numbers - 0.23354,
0.23357, or even 0.23354321 - would do just as well.)
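The same search can be sketched in C as well; as before this is only
an illustration, with the length of the text assumed to be known (in
the article the terminating symbol "!" plays that role):

#include <stdio.h>

int main(void)
{
    const char  *syms = "aeiou!";
    const double lo_tab[] = {0.0, 0.2, 0.5, 0.6, 0.8, 0.9};
    const double hi_tab[] = {0.2, 0.5, 0.6, 0.8, 0.9, 1.0};

    double value = 0.23355;            /* any number inside [0.23354; 0.2336) */
    for (int n = 0; n < 5; n++) {      /* 5 symbols: "eaii!"                  */
        int i = 0;
        while (!(lo_tab[i] <= value && value < hi_tab[i]))
            i++;                       /* whose slot does value fall into?    */
        putchar(syms[i]);
        /* undo the effect of this symbol: rescale value back to [0; 1) */
        value = (value - lo_tab[i]) / (hi_tab[i] - lo_tab[i]);
    }
    putchar('\n');
    return 0;
}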
Five decimal digits seems a lot for encoding a text of four vowels
and an exclamation mark. However, it is clear that different models
give different entropies. A better model, based on an analysis of the
individual characters of the text "eaii!", is the following set of
symbol frequencies:
{"e" (0.2), "a" (0.2), "i" (0.4), "!" (0.2)}.
The program for arithmetic coding.
Below is a fragment of pseudocode that unifies the encoding and
decoding procedures. The symbols in it are numbered 1, 2, 3, ... The
frequency range of the i-th symbol runs from cum_freq[i] to
cum_freq[i-1]. As i decreases, cum_freq[i] increases, so that
cum_freq[0] = 1. (The reason for this "reverse" convention is that
cum_freq[0] will later hold a normalizing factor, which is convenient
to keep at the beginning of the array.) The current working interval
is [low; high), and at the very beginning it is [0, 1) for both the
encoder and the decoder.
Unfortunately this pseudocode is simplified; in practice there are
several factors that complicate both encoding and decoding.
// ARITHMETIC ENCODING ALGORITHM
// For each character of the text, call the procedure encode_symbol()
// Make sure that the "terminating" symbol is encoded last
// Then output any value from the resulting range [low; high)
encode_symbol (symbol, cum_freq)
    range = high - low
    high = low + range * cum_freq[symbol-1]
    low = low + range * cum_freq[symbol]

// ARITHMETIC DECODING ALGORITHM
// value - the number arriving on the input
// Call the procedure decode_symbol() until it returns
// the "terminating" symbol
decode_symbol (cum_freq)
    search for the symbol such that
        cum_freq[symbol] <= (value - low) / (high - low) < cum_freq[symbol-1]
    // this is the decoded symbol; now narrow the interval
    // exactly as the encoder does
    range = high - low
    high = low + range * cum_freq[symbol-1]
    low = low + range * cum_freq[symbol]
    return symbol

The encoded bits are output incrementally: as soon as the working
interval falls entirely into the lower or the upper half, the next
output bit is known and low and high can be scaled up:
for (;;) {
    if (high < Half) {
        output_bit(0);
        low = 2 * low;
        high = 2 * high + 1;
    }
    else if (low >= Half) {
        output_bit(1);
        low = 2 * (low - Half);
        high = 2 * (high - Half) + 1;
    }
    else break;
}
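In a real implementation low, high and value are integers of limited
width and the model supplies integer counts; the interval update then
usually takes a form like the following sketch (the usual integer
variant, not the article's exact code; the constant names here are
assumed):

#define Code_value_bits 16                               /* c = 16 bits        */
#define Top_value (((long)1 << Code_value_bits) - 1)     /* largest code value */

extern long low, high;                                   /* current interval   */

void encode_symbol(int symbol, int cum_freq[])
{
    long range = (high - low) + 1;
    /* cum_freq[] now holds integer counts, cum_freq[0] being their total */
    high = low + (range * cum_freq[symbol - 1]) / cum_freq[0] - 1;
    low  = low + (range * cum_freq[symbol])     / cum_freq[0];
    /* ... followed by the bit-output loop shown above ... */
}

The truncating divisions here are exactly the ones the proof below
reasons about.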
Incremental input of the source data is done through a number called
value. Processed bits flow out at the top end, while newly received
ones come in at the bottom. At first, start_decoding() fills value
with received bits. After the next input symbol has been determined,
decode_symbol() discards the high-order bits that have become
identical in low and high, shifting value by the same number of
positions (the bits shifted out are replaced by new ones arriving at
the lower end):
for (;;) {
    if (high < Half) {
        value = 2 * value + input_bit();
        low = 2 * low;
        high = 2 * high + 1;
    }
    else if (low >= Half) {
        value = 2 * (value - Half) + input_bit();
        low = 2 * (low - Half);
        high = 2 * (high - Half) + 1;
    }
    else break;
}
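For completeness, a sketch of what the start_decoding() mentioned
above might look like in the integer variant (using the
Code_value_bits and Top_value constants from the earlier sketch;
only an illustration):

extern long low, high, value;          /* working interval and input window */
extern int  input_bit(void);           /* next bit of the encoded stream    */

void start_decoding(void)
{
    int i;
    value = 0;                               /* fill value with the first   */
    for (i = 1; i <= Code_value_bits; i++)   /* Code_value_bits input bits  */
        value = 2 * value + input_bit();
    low = 0;                                 /* the interval starts out as  */
    high = Top_value;                        /* the full code range         */
}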
Proof of the decoding inequality.
Let "[]" denote the integer part (division with truncation), and let
range = high - low + 1. Suppose that

cum_freq[symbol] <= [((value - low + 1) * cum_freq[0] - 1) / (high - low + 1)] < cum_freq[symbol-1];

in other words,

cum_freq[symbol] <= ((value - low + 1) * cum_freq[0] - 1) / range - e < cum_freq[symbol-1],   (1)

where 0 <= e <= (range - 1) / range (the last inequality holds
because cum_freq[symbol-1] is an integer). We must show that the new
boundaries low' and high' still enclose value. For the upper
boundary:

(a) high' = low + [range * cum_freq[symbol-1] / cum_freq[0]] - 1
         >= low + [(range / cum_freq[0]) * (((value - low + 1) * cum_freq[0] - 1) / range + 1 - e)] - 1

and, substituting e <= (range - 1) / range, from expression (1) we have:

         >= value + [(range / cum_freq[0]) * (-1/range + 1 - (range - 1)/range)] = value.

The bound low' <= value for the lower boundary follows in the same
way from the left-hand inequality in (1).
Underflow.
As the pseudocode shows, arithmetic coding works by scaling the
cumulative probabilities supplied by the model into the interval
[low; high] for each symbol being transmitted. Suppose that low and
high become so close to each other that this scaling operation maps
different symbols of the model onto the same integer inside
[low; high]. In that case further coding cannot continue. The encoder
must therefore make sure that the interval [low; high] always remains
sufficiently wide. The simplest way to do this is to require that the
width of the interval be no less than Max_frequency, the maximum
allowed total of the cumulative frequencies.
How can this condition be made less restrictive? The bit-shifting
operation explained above ensures that low and high can become
dangerously close only when they straddle Half. Suppose they become
so close that

First_qtr <= low < Half <= high < Third_qtr.
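In this situation the usual remedy, which is what the
bit_plus_follow() procedure mentioned in the conclusion is for, is to
expand the middle half of the range and remember how many such
expansions have been made; the simplified output loop above then
grows a third case. A sketch, assuming the Top_value constant from
the integer sketch above and an output_bit() routine:

#define First_qtr (Top_value / 4 + 1)     /* point after the first quarter  */
#define Half      (2 * First_qtr)         /* point after the first half     */
#define Third_qtr (3 * First_qtr)         /* point after the third quarter  */

extern long low, high;
extern void output_bit(int bit);
static long bits_to_follow;               /* opposite bits owed to the output */

static void bit_plus_follow(int bit)
{
    output_bit(bit);                      /* the bit itself ...              */
    for (; bits_to_follow > 0; bits_to_follow--)
        output_bit(!bit);                 /* ... then the deferred opposites */
}

static void scale_up_interval(void)       /* encoder side                    */
{
    for (;;) {
        if (high < Half) {
            bit_plus_follow(0);           /* interval in the lower half      */
        } else if (low >= Half) {
            bit_plus_follow(1);           /* interval in the upper half      */
            low -= Half;  high -= Half;
        } else if (low >= First_qtr && high < Third_qtr) {
            bits_to_follow += 1;          /* dangerous middle: defer the bit */
            low -= First_qtr;  high -= First_qtr;
        } else
            break;
        low = 2 * low;                    /* expand the chosen half          */
        high = 2 * high + 1;
    }
}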
When the model is updated after each symbol, the translation tables
are reordered as necessary to reflect the changed counts; the update
procedure then increments the value of the appropriate frequency
counter and adjusts the corresponding cumulative frequencies.
Compression efficiency.
When a text is encoded by the arithmetic method, the number of bits
in the encoded string equals the entropy of the text with respect to
the model used for encoding. Three factors cause the performance to
deteriorate:
* the cost of terminating the text;
* the use of finite-precision arithmetic;
* the scaling of the counters so that their total does not exceed
Max_frequency.
As will be shown, none of them is significant. To isolate the results
of arithmetic coding itself, the model will be regarded as exact (in
the sense defined above).
Arithmetic coding has to send additional bits at the end of each
text, so terminating the text takes some extra effort. To avoid
ambiguity with the last symbol, the procedure done_encoding() sends
two extra bits. When the bit stream must be packed into 8-bit
characters before encoding, it is also necessary to round out to the
end of a block. Together this may require an additional 9 bits.
The cost of using finite-precision arithmetic shows up in the
truncation of remainders during division. It can be seen by comparing
the coder's output with the theoretical entropy computed from exactly
the same scaled frequency counters that the coder uses. Here the cost
is negligible - about 10^-4 bits/symbol.
The additional cost of scaling the counters is somewhat larger, but
still very small. For short texts (less than 2^14 bytes) there is
none at all. Even for texts of 10^5 - 10^6 bytes the overhead,
estimated experimentally, is less than 0.25% of the encoded string.
When the total cumulative frequency threatens to exceed
Max_frequency, the adaptive model in the program scales down all the
counters. This has the effect of weighting recent events more heavily
than earlier ones; the statistics thus tend to track changes in the
input sequence, which can be very useful. (We have encountered cases
where limiting the counters to 6-7 bits gave better results than
increasing the precision of the arithmetic.) Of course, this depends
on the source to which the model is applied.
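The scaling itself is simple; a sketch of how it is usually done
(halve every count, keeping each at least 1, and rebuild the
cumulative totals in the "reverse" order described earlier - the
names here are assumed, not the article's):

#define No_of_symbols 256
#define Max_frequency ((1 << 14) - 1)        /* f = 14 bits for the counts    */

extern int freq[No_of_symbols + 1];          /* freq[1..No_of_symbols]        */
extern int cum_freq[No_of_symbols + 1];      /* cum_freq[0] is the total      */

static void rescale_counts(void)
{
    int i, cum = 0;
    for (i = No_of_symbols; i >= 1; i--) {
        freq[i] = (freq[i] + 1) / 2;         /* halve, but keep at least 1    */
        cum_freq[i] = cum;                   /* counts of all "later" symbols */
        cum += freq[i];
    }
    cum_freq[0] = cum;                       /* new total of all counters     */
}

The model would call something like this whenever incrementing a
counter would push cum_freq[0] above Max_frequency.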
Implementation limits.
The limitations associated with word length and the resulting
possibility of overflow can be summarized as follows: assume that the
frequency counters are represented in f bits and the code values in
c bits. The program will work correctly as long as f <= c - 2 and
f + c <= p, where p is the precision of the arithmetic.
In most C implementations p = 31 if integers of type long are used,
and p = 32 for unsigned long. In our program f = 14 and c = 16. With
the appropriate changes in the declarations to unsigned long, f = 15
and c = 17 could be used. In assembly language c = 16 is the natural
choice, since it speeds up some comparison and bit-manipulation
operations.
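Written out as constant definitions, the constraints of this
paragraph look roughly like this (the macro names are only
illustrative):

#define Freq_bits       14                                /* f                   */
#define Code_value_bits 16                                /* c                   */
#define Precision       31                                /* p for signed long   */

#define Top_value     (((long)1 << Code_value_bits) - 1)  /* largest code value  */
#define Max_frequency ((1 << Freq_bits) - 1)              /* largest total count */

#if Freq_bits > Code_value_bits - 2 || Freq_bits + Code_value_bits > Precision
#error "f <= c - 2 and f + c <= p must hold"
#endif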
If p is restricted to 16 bits, the best possible values of c and f
are 9 and 7 respectively, which is not enough to encode a full
alphabet of 256 symbols, since each of them must have a counter value
of at least one. A smaller alphabet (for example, 26 letters, or
4-bit values) can still be handled.
Conclusion.
When encoding is completed, it is necessary to send a unique
terminating symbol [it is needed if the decoder does not know the
length of the text] and then to send enough additional bits to ensure
that the encoded string falls within the final working interval.
Since the procedure done_encoding() can be certain that low and high
are bounded either by expression (1a) or by (1b), it only needs to
transmit 01 or 10, respectively, to remove the remaining uncertainty.
It is convenient to do this with the bit_plus_follow() procedure
considered earlier. The procedure input_bit() will actually read a
few more bits than output_bit() produced, because it has to keep the
lower end of the buffer filled. It does not matter what values these
bits have, because EOF is determined uniquely by the last bits that
were actually transmitted.
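Using the names from the underflow sketch above, done_encoding() can
be pictured roughly like this (again only a sketch of the usual
form):

void done_encoding(void)
{
    bits_to_follow += 1;              /* the bit below will be followed by    */
    if (low < First_qtr)              /* its opposite: together they transmit */
        bit_plus_follow(0);           /* 01, selecting the second quarter,    */
    else
        bit_plus_follow(1);           /* or 10, selecting the third quarter   */
}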
I have removed all exact excerpts from the author's C program, but I
have kept the information about the names of the variables and
procedures, which is enough to reconstruct the program's logic. The
program is formatted very poorly and is therefore unlikely to be of
use to you (you can easily write your own in C), so instead we
include its assembler version, written by Vitamin and sped up by me
without any change to the algorithm. To achieve higher decompression
speed (packing speed is less important) the model update and symbol
lookup should be changed. In almost any case the algorithm will need
changes, because it supports only 256 literals and keeps only one
model at a time - this is not enough to write a good packer. See
ARIF16m.H in the appendix.