Forum - compression programs.

      Compression programs


    To begin with sore topic:
compression of the text. Almost
ideal is
compression of the text by the method of Huffman (net) - for 
text (large size), this method is more progressive than

LZSS (RLE generally keep quiet about). About
Huffman compression and will
speech.


            Theory


   Texts usually written either in Russian or English, so often 
repeated would be no more than 90 (Latin) or 120-140 characters 
(Russian). And the rough estimated 30% of them occupy

very frequent letters.

   From all this it follows that
we need to convert the algorithm
Huffman compression, so that
He worked in these conditions.

   The first thing we need to do - make a table of frequency 
used items - array of 256 bytes. In the first

place should be the most common symbol, at last - the least. 
Examples will not lead, as do

it is very easy.


   Thus, the array is obtained, now
most important: developing
format in which data is stored. The principle is: all
numbers are divided into 5 groups according to the number of 
bits: 


     Group Number of bits

       January 1

       February 2

       March 3

       4 4-8


   Numbers with the first of the third group
coded as follows: first there is a N
included bits - which number
group, as many bits. Then
is the sheer number (of bits in the group), the first of its 
value set to 0. (For ease of mastering the material, see 
ZXRevyu 1 / 95). Example: 


   Number 0. Occupies 1 bit:% 0.
The number of the first group. Hence, it
will appear as% 10 - total
2 bits.


   Number 3. Occupies 2 bits:
% 11. Group 2:% 11 01 - four
bits. Number itself is transformed
from% 11 to% 01, the program-decompressor could count
the group number.


    As you can see, the more the group, the smaller the payoff:


   Group Gain (bits)

     June 1

     April 2

     March 2


   But still a win
great. But if you go further on
this way, the 8-bit numbers
will encode 16 bits
7-bit - 14, that does not suit us.

   Therefore, for 4-8 teams have a special format: instead of
units with the number of recorded bits = 0 - sign spetsformata. 
Then we write more one bit: if it is 0, then for

it is normal octet
number (in the case of 6, 7, 8 groups).
If it is 1, then followed by the
5-bit number (4, Group 5).

 Group (bits) Gain (bits)

      4 +1

      5 +1

      6 -2

      7 -2

      8 -2


   Example: number 255. 8-th group,% 11111111. After 
compression: % 00% 11111111 (loss of 2 bytes).



   Number 64. 7-I group,% 01000
000. Result:% 00% 01000000
(Again, a loss).


   Number 8. Group 4,% 0000100
0. Result:% 01% 01000 (gain 1 bit).


    But the numbers with the numbers
32 - 255 the vast majority - you might say. Here and
makes itself known Huffman algorithm: in fact in the table, we 
coded most occurring less character codes, and especially in 
the texts, such symbols and will make us the majority (in

Russian and English - the vowels
spaces).


   Algorithm: uncompressed
program should rasskompressirovat code, then the table
find the corresponding symbol.


   An example of such a decompressor:
(Program damp, not relotsiruemaya, not optimized).


   This program is only to perform the conversion code
from the compressed file. You
need to write your procedure to obtain the source code from the 
table (calculated from the resulting Opcode:


LD L, N; N-number code
LD H, 0, in the form of double-byte
LD DE, TABLE; address table
ADD HL, DE; compute address
LD A, (HL); received a code


    Log in to the program: IY - Address
compressed block, IX -
address where decompression occurs. Compressed file format: 
first go 2 bytes - length of the file before compression (how 
many codes) .140. 


        ORG 32768


        LD E, (IY +0); take long

        LD D, (IY +1)

        INC IY

        INC IY; in IY-address file

        LD (LONG), DE; keep the length

        LD D, 7, bit 7.
MAIN CALL GRUPPA; take a number of

        CALL PRIEM; calculate the code

        PUSH BC

        LD BC, (LONG)

        DEC BC; reduced the length of

        LD (LONG), BC

        LD A, B

        OR C

        POP BC

        JR NZ, MAIN; in the cycle

        RET
LONG DW 0; stores the length

; Increase the bit address

INCBIT PUSH AF; saved A

        LD A, D; number of bits

        AND A; if = 0, then

        JR Z, NEWBYTE; transition

        DEC A; reduced the number

        LD D, A; in the register D

        POP AF; restored A

        RET; Output
NEWBYTE LD D, 7; new bayt.Bit 7

        INC IY; increased byte

        POP AF; adres.Vosstanovili

        RET; A and output

; Check bits from the address
And if bit = 0, then A will be 0 if bit = 1,
, Then A will be 255.

GETBIT LD A, # 47; code BIT 0, A
GETBIT1 PUSH BC; remember BC

        LD (BT +1), A; pasted the code

        LD A, D

        AND A; if a bit is 0, then

        JR Z, PROH; transition

        LD B, A; counter bits
UST LD A, (BT +1); set

        ADD A, 8; team BIT 0, A

        LD (BT +1), A; 1,2,3,4,5,6,7

        DJNZ UST
PROH LD A, (IY +0); take a byte
BT BIT 0, A; check bits

        POP BC; restore BC
IZ DB 0

        JR Z, BIT0; if bit = 0, the transition
BIT1 LD A, 255, bit = 1

        RET
BIT0 XOR A; bit = 0

        RET

, Inclusion of bits of address

SETBIT PUSH AF; remembered A

        LD A, # C9; use GETBIT

        LD (IZ), A; of SETBIT: code RET

        LD A, # C7; code SET 0, A

        CALL GETBIT1; resulting in an A-B

        LD (IY +0), A; byte address

        XOR A; restored

        LD (IZ), A; GETBIT

        POP AF; restored A

        RET

; Getting a number of
; Entry: bit address on the tag team

GRUPPA CALL GETBIT; take a bit

        AND A; equal to 0?

        JR Z, GRUP58; 5-8 bit number

        CALL INCBIT; increased address

        LD C, 1 count of
GR1 CALL GETBIT; took bits

        AND A; equal to 0?

        JR Z, GR2; If yes, send the number of

        INC C; increased group

        CALL INCBIT; increased address

        JR GR1; in the cycle
GR2 CALL SETBIT; included a bit (because 1

        LD B, C; bit number equal to 1

        RET; set to 0
GRUP58 CALL INCBIT; increased address

        CALL GETBIT; took bits

        AND A; equal to 0?

        JR Z, GRUP5; 5-bit number
GRUP8 CALL INCBIT; 8-bit number

        LD B, 8, Group 8

        RET
GRUP5 CALL INCBIT; increased address

        LD B, 5, Group 5

        RET

, Getting numbers. B-group number
; Bit address is a number.

PRIEM XOR A; resets rotating

        LD (BYTE), A; byte receiver
PR CALL GETBIT; took bits

        AND A; if not equal to 0, then

        CALL NZ, WKL; include bits BYTE

        LD A, (BYTE); rotation bytes
        RLA; receiver

        LD (BYTE), A

        CALL INCBIT; increased address

        DJNZ PR; number of cycles =

        LD (IX +0), A; received a code

        INC IX; increased address

        RET; receiver
WKL SCF; inclusion of bits

        RET
BYTE DB 0, byte-receiver
2

   From this follows several
Huffman compression features:


   1. On small files (estimated at up to 3 kb), the compressor 
does not will win because she decompresses the program will

quite large - you need to save the table (already 256 bytes) and
etc.


   2. The larger the file and the
less than it used characters, the more powerful compression.


   3. Main feature: unlike other compressors
(LZSS, RLE), in a compressed
file contains no data, and their
Codes! Therefore, to modify them in any
Do not. (And here in
LZSS and, especially, in the RLE, you can:
got a doctor in a file on disk,
searched its inscription - perhaps
she does not compress - and
If found, then changed as needed. Here you not only do
not find, but also ruin
The entire file ...)


   4. Low speed of the decompressor (as compared to other
technologies).


   But that's the theory - so that
who tries Huffman
in practice - write a Review.


   Ca. Ed.: You can significantly reduce
frequency table of symbols, if we consider that
symbols with numbers 32-255 are encoded the same number of bits 
(ten bits), and therefore, the order they appear in

Table does not affect the size of the compressed file.
Thus, the table is sufficient to store only the 32 most 
frequently used symbol.


   And yet. Try to implement an improved algorithm for Huffman, 
which are encoded as individual characters, and most frequently 
occurring pairs of characters. For example, in the English 
language often found a pair of characters "th". Obviously, the 
length of the code corresponding to this pair characters will 
be less than the sum of the lengths of character codes "t" and 
"h". Due to this superiority is achieved in the compression. 


           *

(C) Ivan Roshchin, Moscow, 1997


     Features assembler

           ZX ASM 3.0


   When the command "Load
sts "(in the menu Setup) in the case
if a file with the specified name
is found, an error message is issued.


   If you select Edit
mode = Text, you lose access to all points
Menu Run, except Inspect, which is not
always convenient.


   Not a message about
error when assembling teams to load single-byte register 
double-byte numbers. For example, the command LD B, # 1234 will 
be assemble as LD B, # 34.



   ZX ASM somehow believes that
registers, HL, IX, IY equal,
ie in any team that uses the HL, can be used and registers IX, 
IY. If the teams type LD A, (HL) is really

so, to commands such as ADD HL,
HL, and some others do not
applies. Let us illustrate this by
Example:

 Compiles source code


   ADD HL, IX ADD HL, HL

   SBC HL, IY SBC HL, HL

   EX DE, IX EX DE, HL


   This error may lead to
a very unpleasant situation. For example, you wrote in your 
program ADD HL, IX, forgetting that this team does not. 
Assembler also "saw nothing", and then 

you wonder why your program does not work.


   In teams, working with the halves of the index registers,
have used the notation XL, XH, YL, YH. Meanwhile, STS
in the same team uses
notation LX, HX, LY, HY.


   Ill-treated
situation when you define
variable with a DS, and the number of bytes allocated for it
zero. Such a case can
arise, for example, if you
first defined a variable with
by DB, and then decided to replace the DB on the DS, while 
forgetting replace the 0 to 1:


 NAME DB 0 -> NAME DS 0


   If the text of the program changed since the last recording
CD, ZX ASM before performing
"Dangerous" operations prompted to save it. But what about ZX 
ASM determines changes the text or

No? Obviously, simply by comparing a checksum of the text
in-memory, and text
the last time recorded on
disc. But this method in some
cases, gives an incorrect result, and ZX ASM believes that
the text has not changed:

 a) LD HL, # 1234 LD HL, # 3214

    DB% DB% 01100110 11001100

(Because changing the order of characters
 does not affect the checksum
 text)

 b) LD A, 5 LD A, 4

    LD B, 6 LD B, 7

(As ASC-character codes "5" and
 "6" in the amount equal to # 35 + # 36 = # 6B,
 and the same value equal to the sum
 character code "4" and "7").


           *