Inferno #07
31 May 2005
For Coderz - archiver writing. Practical principles of LZ packing.
Practical principles of LZ packing

Before I tell you about unpacking and about the structure of packed data, it is time to talk about how those data are obtained.

0. LZ (A. Lempel, J. Ziv) is a packing method in which repeated fragments found in a file are, as a rule, replaced by references (of course, not everything can be represented as references - individual characters, literals, remain as well), and the parameters of those references are usually encoded by the Huffman method.

A reference is the record of the fact that a fragment (substring) repeats: it contains the distance disp (from the current position of the packer's or unpacker's pointer back to the place where the same fragment occurred) and the length len (how many bytes, in fact, coincided).

The window, i.e. the size of the dictionary, is the maximum distance allowed for disp. The packer does not look further back than that, and the code format usually simply provides no codes for more distant references.

The Huffman method (D. A. Huffman) is a method of entropy coding in which the code lengths of the characters (literals) differ (but always occupy a whole number of bits): frequent characters get short codes, rare ones get longer codes. It differs from similar methods (e.g. the Shannon-Fano code) in that the codes (more precisely, the code lengths, since for a given set of lengths all assignments of the actual codes are equally good) are chosen in the most effective way imaginable.

A tree is the set of the codes of all characters used in the Huffman coding. A tree can be stored in various ways. For unpacking it is most convenient as a real tree (read a 0 - step forward; read a 1 - jump forward by as much as the current cell indicates). For packing - as a table (each symbol with its code and the code's length). In a packed file - in compressed form (a list of the code lengths of all the characters, that sequence itself being encoded somehow, at the very least also by Huffman with a smaller tree - so in both RIP and RAR).

An example of a tree in canonical form, i.e. one in which the tiers deepen from left to right and the characters within a tier are sorted alphabetically (it is exactly this kind of tree that RAR uses):

                        root
                      /      \
                    o          o            1st tier (empty)
                   / \        / \
                  e   o      o   o          2nd tier: e
                     / \    / \ / \
                    a   b  d  f h  o        3rd tier: a b d f h
                                  / \
                                 c   g      4th tier: c g

A step down-left in the tree is coded by a zero, down-right by a one. So here the code of the character "e" is 00 and the code of the character "h" is 110. And so on.
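For illustration, here is a minimal C sketch of such a decode-time tree walk. The node layout (an array of left/right pairs, with negative entries marking leaves) is my own, chosen for clarity - it is neither the step/jump cell format just mentioned nor the 4-byte derip/smallunr format described in section 2:

    #include <stdint.h>

    /* One node: child[0] is taken on a 0 bit, child[1] on a 1 bit.
       A negative entry is a leaf and stores the symbol as -(sym+1);
       a non-negative entry is the index of the next node. */
    typedef struct { int16_t child[2]; } Node;

    /* Decode one symbol: walk from the root, one input bit per step. */
    static int huff_decode(const Node *tree,
                           int (*getbit)(void *ctx), void *ctx)
    {
        int16_t n = 0;                  /* index 0 is the root */
        for (;;) {
            n = tree[n].child[getbit(ctx)];
            if (n < 0)
                return -(n + 1);        /* leaf reached: its symbol */
        }
    }

On the example tree, the bits 1, 1, 0 of code 110 walk root - right - right - left and land on the leaf "h".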
1. Packing, as we understand, can go from one place in memory to another; it can overlap the original data in memory (losing some packing quality); it can go from memory to a file, from a file to memory, and from file to file. Packing straight from a file is rather problematic and slow, since it is hard to find the repeated fragments there, so in practice archivers load the file under consideration in portions, so that memory always holds its contents from the current pointer back over the length of the window. For this it is enough to have a buffer slightly larger than the window. But it is not as simple as it looks: with a circular window the repeat-search procedure becomes very complex, and with a non-circular one the buffer has to be moved regularly (and not only it, but also the hash table, of which more below), which is very slow. For this reason, and because the compressed data are laid over the unpacked data, I do not support a wrap-around dictionary in ZXRar and in fact cut the file into pieces of 127-128 sectors.

How do we find the repeats? A plain forward scan (CPIR or CPDR) is an unhurried process whose speed falls as the square of the window size. When the words "packs slowly" were used above, it was this forward scanning that was meant. The search is accelerated with keys and a hash table, about which I will now tell you.

The key table. This is a very common idea. Its essence is that a group of bytes (usually 2 or 3 bytes; in ALASM, for example, more, since there the key is used to look up labels by name) is condensed into a key sum of a few bits (naturally, different byte groups may produce the same key). A table is created in which, under each key sum, the address where that key last occurred is written - or, otherwise, an address below the start of the window. Thus, having read the 2-3 bytes at the current position in the file, we can quickly find an address at which, with high probability, the same 2-3 bytes occur.

A bit more about ALASM: its label table consists of 64 or 128 (depending on the version) lists, which group together the labels with equal key sums. This structure is not needed for displaying the labels; STS, for instance, simply ignores it. A label record looks like this:

+0 - the lower 6 bits hold the whole length of the record (i.e. the length of the label name plus 5). If bit 6 = 1, the label is the name of a macro; if bit 7 = 0, the label is defined. STS ignores this byte too. The last label (from STS's point of view) contains 0 here; on it STS stops.
+1,2 - the number the label stands for. STS crawls along the labels until this number matches the one it is looking for.
+3,4 - the address of the next label in this list. STS does not need it, since it moves through the table consecutively and, at that, in reverse order.
+5 - the label name, backwards.

In STS 5.x, 7.x: cell #fe88 holds the page number of the first page of labels (port #7ffd); the numbers of the other pages are not given anywhere; cells #fe7c,7d hold the address of the tail of the label table, from which (+1) the search starts. Unreal Speccy (v0.27) looks into STS at #17, #57 and takes from there the page number and the address of the beginning of the label table.

The hash table. To each cell of the window correspond 2 more bytes, which hold the address of the next cell (going backwards) at which the 2-3 bytes give the same key sum. Should 2 or 3 bytes be hashed? A matter of taste. Hrust 2.x hashed 2 bytes, and 2-byte short references (2-byte references are always short - under 256 bytes, otherwise it is cheaper simply to encode those 2 characters as literals) were sought in the main search loop. In ZXRar 3 bytes are hashed, and 2-byte references are sought in a separate loop. With this arrangement the .rzx packing mode (for text - 2-byte references do not exist there) works a bit faster, and the .rar mode (for everything else - there the references do exist) a bit slower. In the ZXRar sources you can enable 2-byte keys, but the .rar mode gains no speed from this: it is still organized separately (the search for a 2-byte reference runs only if the ordinary search has found nothing good). One can, of course, organize two hash tables and two key tables - one pair for the 2-byte matches, the other for everything else. But that needs more memory - 2..5 additional kilobytes.

In total, the memory required is window*3 + keys, which for 12-bit keys and a 32-kilobyte dictionary gives 32*3+8 = 104K. Hrust 2.x uses 13-bit keys (taking up a whole page), ZXRar 11-bit ones (occupying 2/3 of the screen). In those same ZXRar sources you can enable 13-bit keys, which then lie at the top of memory. In practice there is virtually no speed difference between 11- and 13-bit keys (as far as I remember, the measurements showed about 5%). As we can see, in 128K of memory this method can handle only a 32K, at a stretch a 40K, window. Perhaps that is why packers will soon move over to machines with 256K.
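As a sketch of the two tables working together, here is the idea in C. The positions are plain indices into one flat buffer (not the banked 128K layout), head[] is assumed initialized to -1 (the "address below the window start"), and key3() is an arbitrary illustrative mix, not any real ZXRar procedure:

    #include <stdint.h>
    #include <stddef.h>

    #define KEY_BITS 12
    #define KEYS     (1u << KEY_BITS)
    #define WINDOW   32768           /* maximum disp, as in the text */
    #define MIN_LEN  3               /* 3-byte keys, as in ZXRar     */

    /* key sum of 3 bytes; any mix that spreads the bits will do */
    static unsigned key3(const uint8_t *p)
    {
        return (unsigned)((p[0] << 4) ^ (p[1] << 2) ^ p[2]) & (KEYS - 1);
    }

    typedef struct {
        long  head[KEYS];  /* key table: last position of each key sum, */
                           /*   initialized to -1 before packing        */
        long *chain;       /* hash table: one entry per position, the   */
                           /*   previous position with the same key sum */
    } Tables;

    /* Find the longest match for buf[pos..end) by walking the chain
       back; caller ensures at least MIN_LEN bytes remain at pos. */
    static size_t find_match(const uint8_t *buf, size_t pos, size_t end,
                             const Tables *t, size_t *match_pos)
    {
        size_t best = 0;
        long cand = t->head[key3(buf + pos)];
        while (cand >= 0 && pos - (size_t)cand <= WINDOW) {
            size_t len = 0;
            while (pos + len < end && buf[cand + len] == buf[pos + len])
                len++;
            if (len > best) { best = len; *match_pos = (size_t)cand; }
            cand = t->chain[cand];      /* next occurrence, further back */
        }
        return best >= MIN_LEN ? best : 0;
    }

    /* Enter one position into both tables; after a match this is done
       for every character of the matched string (see section 3). */
    static void insert_pos(const uint8_t *buf, size_t pos, Tables *t)
    {
        unsigned k = key3(buf + pos);
        t->chain[pos] = t->head[k];  /* old key-table entry -> hash table */
        t->head[k]   = (long)pos;    /* current address -> key table      */
    }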
A search with hashing is, however, slower than CPIR/CPDR on long runs of repeated bytes. Can anyone suggest a solution for the speed? In ZXRar this is very noticeable! A. V. Kadach proposes limiting the search along the hash chains to a few dozen jumps, correspondingly losing some compression...

Hrumer> For Laser Compact I thought this problem over as follows. It concerns, in particular, byte sequences AAAAAA... and ABABAB... longer than a certain length L, chosen by tests (I think, 3..5 bytes). They occur very often in pictures. When filling the table - the one where the addresses of the previous occurrences of a character are kept (or of several characters, if more than one byte is hashed) - we write for the first AA, instead of the address of the second A, the length of the run of bytes AAA... (ABAB...). During the search, the previously recorded run length is used to determine the exact address from which to compare the strings. An example: earlier we have met the sequence AAAAAAAB. The current sequence is AAAAC. As a result we compare the strings AAAAB and AAAAC. That the A's match we, naturally, know by construction already; it remains to compare the characters B and C. But what if they match too, and the length of the run turns out to be 5 or more... Still, the biggest benefit comes from the fact that this table is no longer clogged with the frequent references inherent in processing such sequences. In effect, we stop hammering the table in certain places. And so that the packing quality does not drop, we should all the same enter into the table the AA characters that precede the character B in the sequence AAAAAAAB. Though this is not obligatory, and there are several options here; one can think and experiment. Much depends on how many bytes are hashed. Since in LC the search, on top of that, runs backwards, the scheme came out slightly more complicated, and even then it was only sketched out on paper. Unfortunately, I never implemented it. I am writing all this from memory now; some moments I have omitted. I can add one more thing: perhaps it is even better, for runs of a repeated character like AAAAA longer than a certain length, to keep a separate table of 256 elements. A field for experiments, again.
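My reading of Hrumer's trick, continuing the C sketch above (run_len[] is a hypothetical side table parallel to the buffer; the matching side, which would use the stored run lengths to align the comparison, is omitted - he says himself the scheme was only sketched on paper):

    #define RUN_MIN 4   /* "a certain length L, chosen by tests (3..5)" */

    static size_t run_length(const uint8_t *buf, size_t pos, size_t end)
    {
        size_t n = 1;
        while (pos + n < end && buf[pos + n] == buf[pos])
            n++;
        return n;
    }

    /* Fill the tables at pos and return the next position to fill from:
       a long run gets ONE entry plus its recorded length, instead of an
       entry per byte, so the table is not hammered in these places. */
    static size_t fill_tables(const uint8_t *buf, size_t pos, size_t end,
                              Tables *t, size_t *run_len)
    {
        size_t r = run_length(buf, pos, end);
        insert_pos(buf, pos, t);
        run_len[pos] = r;
        return pos + (r >= RUN_MIN ? r : 1);
    }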
2. Hrust encodes the references as soon as it has found them. But its trees are fixed. A serious archiver must build optimal trees before coding. Where do we keep the information on the basis of which we build them, and where do we keep what we are actually going to encode? After all, we encode not the source file but the LZ stream with its references, and searching for the references a second time is expensive, is it not? Therefore, while the LZ pass scans the file, we must remember: 1) the numbers of occurrences of all the text characters and of all the LZ special codes (there may be several sets of statistics: for example, the RAR offset special codes are encoded with a separate tree); 2) the WHOLE LZ stream, unpacked (this requires an additional buffer that does not coincide with the window buffer; one can, of course, encroach on the beginning of the window buffer, watching its changing size during the LZ search, but then compression will be worse). The characters, including the LZ special codes, may number more than 256, so I, for example, in ZXRar keep the 8th bits separately, organized as an interleaved bit stream a la Hrust. In the byte stream (without the 8th bits) the parameters of the LZ special codes must follow them, i.e. the reference length (when it is not already obvious from the LZ special code) and the offset. Having scanned only a sufficiently large portion of the file (tens of kilobytes), we can build the Huffman trees and encode the accumulated LZ stream by Huffman, writing out the .rar archive. Then a new LZ stream and new statistics can be started.

The Huffman method I described in general terms in IG #5. Its essence is this. 1. Different bytes, depending on their probability, are coded with codes of different lengths. 2. The code lengths do not lie in the stream every time; instead there is a binary tree of general form: if you read a zero from the stream, go left in the tree; if a one - right. Reached a leaf - the end: in the leaf the symbol (byte) is written. How to keep the tree in the stream I have already told in Inferno #4 (in the description of the .rar format). How it is kept in memory is a matter for the specific unpacker, but it is best to use the decoders from the unpackers derip or smallunr. Their tree is like this: each node is 4 bytes, and the 2 descendants nearest to the root lie at the beginning of the tree buffer - 2 bytes for the left child and 2 bytes for the right; in a "descendant" either a character is written (then the 2nd byte is < #40), or the address at which that descendant is stored as a node (then the 2nd byte is the high byte of the address). 3. Huffman codes, in contrast to the suboptimal Shannon-Fano codes, are chosen in the following way. In a buffer for sorting we keep a table of the symbols and their probabilities; we combine the 2 rarest symbols and store them in the same table as a tree node whose probability equals the sum of the probabilities of those symbols; the old symbols themselves are struck off the list, and again we look for the 2 rarest entries (an already obtained node may be taken as well) and combine them in just the same way - and so on, until one node remains that can no longer be combined with anything: it will be our tree. The list is best kept sorted throughout, 4 bytes per element: 2 bytes for the frequency, 2 bytes for the contents (the character number, or the address of the left child - in reality, of the four child bytes described above). Accordingly, after symbols/nodes are combined, they move into another buffer, which is organized at the end of the buffer where we keep the symbols and grows downwards. The symbols themselves should be sorted by probability (i.e., in practice, by frequency of occurrence in the file) in advance, and the combined node should be put into its proper place in the list according to its probability.

After this sorting we cannot directly save the tree according to the Rar/Rip standard. Their standard is such that the tree description in the packed file keeps, for each character, only the length of its code (from 0 to 15, where 0 means the symbol is absent). The symbols are then distributed within each tier (the set of characters with codes of equal length) in alphabetical order. In Rip - in reverse alphabetical order, but that does not matter. The tree therefore has to be rebuilt, after first finding the lengths of all the literals. Here is how. In accordance with the first tree (the one obtained by the sorting) we compile a table: for each character - its length (1..255). We compile it from our root node (which, as you understand, now lies all alone at the beginning of the sorting buffer), pseudo-recursively: at the start of the walk push "a certain code" (a sentinel) onto the stack; go left, at the same time pushing the right-hand branch (its address and the current depth) onto the stack, until a leaf (a character) is reached; the length of the path travelled is learned from the stack, and when "the certain code" surfaces, the tree is over.

Kadach writes in his thesis (p. 60) that there is a way of finding the lengths of all the literals without the tedious search-combine-strike-shift procedure (i.e. without compiling the first tree at all), but he does not give the method itself.
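The build just described, as a C sketch. To stay short it finds only the code lengths, keeps parent pointers instead of the sorted 4-byte records, and re-scans for the two rarest entries instead of maintaining a sorted list (slower, but the same tree):

    #define NSYM 256

    static void huff_lengths(const unsigned long freq[NSYM],
                             unsigned char len[NSYM])
    {
        unsigned long f[2 * NSYM];   /* frequencies: symbols, then nodes */
        int parent[2 * NSYM];
        int alive[2 * NSYM];
        int i, n = NSYM;

        for (i = 0; i < NSYM; i++) {
            f[i] = freq[i];
            alive[i] = freq[i] != 0;
            parent[i] = -1;
        }
        for (;;) {
            int a = -1, b = -1;
            for (i = 0; i < n; i++)     /* the two rarest live entries */
                if (alive[i]) {
                    if (a < 0 || f[i] < f[a]) { b = a; a = i; }
                    else if (b < 0 || f[i] < f[b]) b = i;
                }
            if (b < 0) break;           /* one root left: tree is done */
            alive[a] = alive[b] = 0;    /* strike both off the list... */
            f[n] = f[a] + f[b];         /* ...and add the joined node  */
            parent[a] = parent[b] = n;
            alive[n] = 1; parent[n] = -1;
            n++;
        }
        for (i = 0; i < NSYM; i++) {    /* leaf depth = code length    */
            int d = 0, p = i;
            if (freq[i] == 0) { len[i] = 0; continue; }
            while (parent[p] >= 0) { p = parent[p]; d++; }
            len[i] = (unsigned char)d;  /* a lone symbol gets depth 0; */
        }                               /* a real packer would force 1 */
    }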
Further, whichever way we build the first tree, the tiers above the 15th still have to be removed, because 15 is the maximum character code length allowed in Rar, Rip, and packers of that family. My way is the following. During the construction of the table of character lengths described above, we build in parallel a table of the characters of each particular length. Done? Fine. Now find the longest length that occurs at all. Suppose this length (call it N+1) is held by C symbols. That is no good: these symbols must be dropped to the tier below (the N-th), at the price of lowering a character of length N-1 (pushing it down onto the N-th tier as well). For every 2 dropped characters of length N+1 (and their number C is, clearly, even) there must be one character pushed down from length N-1: shortening two codes by one bit takes up exactly as much code space as lengthening one code that was one bit shorter sets free. It would be nice if characters of length N-1 could always be found. However, sometimes there are not enough of them (there may be none at all), so we have to settle for characters of length N-2 (pushing them down to the N-1-th tier): each character pushed down this way pays for 4 dropped characters of length N+1. If there is no N-2 either, we take N-3 - each such character pays for 8 dropped from N+1. And so on. Eventually, after several such operations on the characters of maximum length, no lengths > 15 remain. Pushing a character down by several tiers at once is accounted for by the same figures. (Remember that trees are customarily drawn from the root, so "down" here means towards the longer codes.)

Having the array of character lengths, the packer can compile the array of character codes: starting from the 1st tier and walking the tiers in our alphabetical order, we assign the characters codes beginning with 0 (after 0 the following code is 1 - two codes in total, but the second is used only if the tree has nothing but the 1st tier, i.e. there are only two characters in all). Suppose we found one character on the 1st tier; consequently, code 1 remained unused. We append a zero to it (getting 10) and walk the 2nd tier in the same manner, assigning the characters codes from 10 to 11 (had there been no characters on the 1st tier, we would have run through the codes 00, 01, 10, 11). Suppose we found nothing on the 2nd tier. Our stored code is, as I said, 10. We append a zero to it, go to the next tier... and so on. One can also start from the 15th tier and enumerate the codes downward from 111111111111111. The result will be the same, because our array of character lengths is not taken out of thin air - we have built a completely filled tree, which means all the codes are necessarily occupied. ZXRar, in any case, starts from the 15th.
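The tier walk just described, assigning canonical codes from the lengths alone - a C sketch (it assumes len[] describes a completely filled tree, as the text requires):

    #include <stdint.h>

    #define NSYM   256
    #define MAXLEN 15   /* as in Rar/Rip */

    /* Walk the tiers from the 1st, hand out codes inside a tier in
       alphabetical (symbol) order, then "append a zero" (shift left)
       when moving down to the next tier. */
    static void canon_codes(const unsigned char len[NSYM],
                            uint16_t code[NSYM])
    {
        uint16_t next = 0;
        int l, s;
        for (l = 1; l <= MAXLEN; l++) {
            for (s = 0; s < NSYM; s++)
                if (len[s] == l)
                    code[s] = next++;
            next <<= 1;
        }
    }

On the a..h example above (lengths: e = 2; a, b, d, f, h = 3; c, g = 4) this hands out e = 00 and h = 110, exactly as in the picture.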
3. There are several packing strategies (i.e. strategies for choosing the optimal partition of the text into "rags"). The simplest one is implemented in all Spectrum packers except, perhaps, RIP (it is, for example, in ZXRar: the Fast mode). The strategy is as follows. We look inside the window (say, from the current pointer backwards, further and further) for occurrences of the string lying at the pointer. Suppose we have found one at distance disp, with N symbols of the occurrence matching. We remember this result and look for the next occurrence. We check whether it is BETTER than the one already found. The notion of "betterness" is quite clear for packers with a fixed tree: will fewer bits be needed to encode the reference (good), or more (bad)? With packers like RIP and RAR things are more complicated.

Firstly, it is clear that, as our search moves back, finding references (occurrences) of the same length as the one already found (N) gives nothing for the packing: of two references of equal length (N) it is more advantageous to take the one whose offset is closer. The same applies to references with N still smaller. References of length 2 are a separate conversation altogether.

Secondly, we consider that increasing the length (N) by 2 symbols is profitable (although, strictly speaking, this is not always so).

Then (thirdly), after some experimenting it turns out that increasing the length of a reference by one is profitable only if the old reference had a large disp. More precisely: after a reference with offset #ff00..#ffff - only if the new reference is no farther than #f400; and after a reference with offset #fe00..#feff - only if the new one is no farther than #e800.

Having found the best reference, we must not only record the reference itself into the LZ stream, but also add entries to the hash table and the key table for every character of the found string. If we have found no good reference and have preferred to encode a literal, then the tables need replenishing only for the current byte, and only for the one key corresponding to this place in the text (i.e. the key formed from the current byte and one or two following ones). The old address lying in the key table under this key must be put into the hash table at the current address, and the current address - into that same place in the key table. The search procedure by Dmitry Pyankov, reworked by me for ZXRar, updates the tables for the one current address (see the inputs and outputs of the procedure); but this can be done without the procedure as well.

The next strategy is applied in the PC library Zlib and in ZXRar (perhaps in the PC Rar too): the Best mode. This is "lazy evaluation". The strategy is that, having found a reference, we do not put it into the stream at once, but first try to see what happens if we encode the current symbol as a literal and look for a reference from the next byte. Comparing the two optimal references found (one from the current pointer, the other from the next byte), we decide which is the more profitable; it may well be the 2nd. What does "more profitable" mean here? We scratch our heads and split the range of all possible disp into subranges, for each of which we find by experiment the maximum disp at which it still pays to increase the length (N) by 1 symbol. In ZXRar, up to the latest version, there was a tuner in which the user could pick these limits for his own amusement; but as far as I know, nobody amused himself with it, so the tuner was removed. In any case, the settings for the Rar method (for its natural partition into ranges) are given in the help to ZXRar.

Options: the top two are the boundaries beyond which it is permitted to increase the length of a reference by one (in the main loop) after finding a reference with offset #ffxx or #fexx respectively. Increasing it by 2, 3, ... is profitable without question. The remaining 15 are the boundaries beyond which it is permitted to increase the length of a reference by one (in lazy evaluation, Best mode only) after finding references in the following ranges:

    #ff80-#ffff
    #ff00-#ff7f
    #fe80-#feff
    #fe00-#fe7f
    #fdxx
    #fcxx
    #fa00-#fbff
    #f800-#f9ff
    #f400-#f7ff
    #f000-#f3ff
    #e800-#efff
    #e000-#e7ff
    #d000-#dfff
    #c000-#cfff
    #a000-#bfff

Note that at the #e000 boundary the offset code changes its length. That is, it is not known in advance which reference is encoded more cheaply: (#dxxx, 5) or (#exxx, 4).

Old settings (before the 07 versions):

    F0 E8  FE F8 F4 F4 FA FA F8 F4 F0 E8 E0 D0 C0 A0 80

New settings:

    F4 F0  FD F4 F0 F0 FA FA F8 F4 F0 E8 E0 D0 C0 A0 80
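A C sketch of how such numbers might be applied in the lazy decision. This is my reading, not ZXRar's actual code: the second match, one byte longer, wins only if its disp is no farther than the limit set for the range into which the first match's disp fell; the limits below are the 15 lazy values from the "new settings", and disp is the 16-bit backwards offset where #ffff means distance 1:

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint16_t disp; size_t len; } Match;

    /* lower bounds of the 15 ranges from the help... */
    static const uint16_t range_lo[15] = {
        0xff80, 0xff00, 0xfe80, 0xfe00, 0xfd00, 0xfc00, 0xfa00,
        0xf800, 0xf400, 0xf000, 0xe800, 0xe000, 0xd000, 0xc000, 0xa000
    };
    /* ...and the matching "new settings": high byte of the farthest
       disp at which a 1-longer reference is still taken */
    static const uint8_t limit[15] = {
        0xFD, 0xF4, 0xF0, 0xF0, 0xFA, 0xFA, 0xF8,
        0xF4, 0xF0, 0xE8, 0xE0, 0xD0, 0xC0, 0xA0, 0x80
    };

    /* m1 was found at the current byte, m2 one byte later. */
    static int prefer_second(Match m1, Match m2)
    {
        int i;
        if (m2.len >= m1.len + 2) return 1; /* longer by 2,3,...: always */
        if (m2.len <  m1.len + 1) return 0; /* not longer: keep the 1st  */
        for (i = 0; i < 15; i++)            /* longer by exactly one:    */
            if (m1.disp >= range_lo[i])     /* find m1's range...        */
                return (uint8_t)(m2.disp >> 8) >= limit[i];
        return 0;                           /* m1 beyond #a000: keep it  */
    }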
In RIP the offsets are encoded in a similar way, so a similar partition into ranges applies, and the most profitable settings are likely to be close to these.

It is difficult to combine this method with hashing: the state of the tables has to be restored after the 2nd search if the 1st turned out the more profitable (in fact, we then go on and run the table-update procedure for all the characters of the 1st reference except the 1st one! The 2nd reference is simply rolled back). More precisely, it is the key-table entry that has to be restored (the search has dragged the current address into it). True, the search has also pulled up its old address, which formerly lay in the hash table, but this has no consequences: a record would be made at that place of the hash table anyway, by the very logic of how the hash table is written. Updating the tables for all the characters of a reference except the 1st is the job of the FILPO1 loop in ZXRar. The lazy evaluation strategy is 1.5-2 times slower than the ordinary one, but gives a noticeable gain in packing.

The best strategy, "Optimal LZH", was described in IG #6. On the ZX it has not yet been implemented anywhere. And let me draw your attention: it makes no difference to the unpacker which strategy the archiver used - the difference is only in the file size! I have to make this observation because there are people who confuse strategy with the compression method (see iS-Files ##1 and 3). And if such people exist, there probably is a source of this misinformation. Strictly speaking, such people are themselves a source of disinformation - for their readers.

4. A few words should be said about creating the archive, or rather the file with headers that constitutes it. When the user's command to pack arrives, our program should check whether an archive with the given name exists on the disk. If it does not, then, logically, we should create it (a small file consisting of the archive header alone). If it cannot be created, or it exists but not at the end of the disk, the simplest way out is to swear and drop back to the program's starting point (show the catalogue, etc.). If it can be created, or it already lies at the end of the disk, the operation should continue. One can also not swear, but erase the archive of the same name and create a new one at the end of the disk. Moving the old archive itself to the end of the disk is a dubious pleasure, because archives are large and a floppy is not made of rubber. Continuing the thought, we might propose creating a new satellite of the old archive (a satellite being a file with the same name but with the 1st letter of the extension changed: 0, 1, 2 and so on, which the TR-DOS programs under discussion treat as a continuation of the previous file) separately from it - although processing such separated pieces of the archive will not be simple. Packing on a HDD is a separate conversation; there one can let oneself go at full strength.

For finding a file on the disk TR-DOS provides the function #3d13 with C=#0a: it returns in register C the number of the directory entry that coincides with the given name (#5cdd-e5), or #ff if there is none. How do we create a file? There are two fundamentally different ways. The 1st is again provided by TR-DOS: function #3d13 with C=#0b, HL = address of the start of the file in memory, DE = its length. The 2nd is manual formation of the descriptor in the catalogue.
The number for this descriptor is taken from cell #e4 of the system sector (hereafter: the 8th sector of the 0th track, counting both from zero), the sector and track - from #e1/2. After manually writing the descriptor, a zero byte must be written after it and three values corrected in the system sector: the amount of free space (#e5/6), the first free sector and its track (#e1/2), and the number of files (#e4).

Now, attention! Since packed blocks longer than any contiguous stretch of memory can end up in our archive, we shall certainly have to append to the file in parts, right? Hence we need a routine for writing bytes to the "output", which will write them into a memory buffer until it overflows, and on overflow will dump the whole thing to the floppy (and reset the pointer that shows where in memory to write, back to the beginning of our buffer). At the end of packing we unload the filled part of the buffer (if there really is something in it - unloading 0 bytes is pointless) and then re-form the file descriptors (!) and the system sector.

Why descriptors, in the plural? Because when a new block is added the archive may exceed 255 sectors (or 255*N sectors, depending on the number of pieces already in existence). In handling this there are again two options: either take for the beginning of the file that satellite (piece) which was the last one at the start of packing, or take for the beginning the actual beginning of the file. In the latter case, when creating new descriptors, all the old ones have to be looked through. The trouble is also that in our archives, broken into pieces of 255 sectors, we do not know a priori the extension letter of the last piece. The mere presence of the archive we can detect, but positioning ourselves at the last piece is already a problem... A manual search is needed after all.

What is the initial content of the buffer? That is right: the last under-filled sector of the archive, which we must first read from the disk. If we are creating the archive, there is nothing to read - we have only just written that under-filled sector ourselves, and it holds the archive header.

In ZXRar all this is organized somewhat more intricately, for the sake of speed. But to begin with you would still have to write the slow variant - otherwise the program is simply impossible to debug.

The handling of a full disk you may organize in any way you please. The simplest option is to remember the stack position at the program's starting point (see above about it) and bail out to there when the overflow situation arises. In that case the archive will most likely be erased - or maybe not: it depends on the state you left the 0th track in before packing the block.

5. Having strayed a little from the topic, I will make a brief comment on the swap during unpacking. In short, how is it implemented in ZXUnRar? Loading from the disk goes through LOADBLK (that is the loading of the archive itself; it has nothing to do with the swap). The swap is implemented by several loops of the form:

    out  frPG       ; page in the source page (a macro)
    ld   a,(de)
    out  curPG      ; page the current page back in (a macro)
    ld   (hl),a
    inc  e
    call z,restore
    inc  l
    call z,store

This is the operation of copying a substring of the output file from somewhere back to the current address. Here DE is the source address: in one case it lies in the output buffer, in the other - in the load buffer. (The track and sector to load from are calculated rather cunningly; you can admire it in the ZXUnRar sources - I got confused in there myself :) ). restore takes care of loading the next sector of the swap; store increments H and, if it has exceeded the limit (the output buffer has overflowed), writes the whole buffer out to the file. The swap, clearly, is this very output file.
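The byte-output routine from section 4 - essentially what store does here - as a C sketch (the buffer size and flush_to_disk(), which stands in for the TR-DOS sector-writing code, are mine):

    #include <stdint.h>

    #define OUTBUF 4096          /* illustrative buffer size */

    static uint8_t  outbuf[OUTBUF];
    static unsigned outpos;

    /* hypothetical stub for writing n bytes' worth of sectors */
    void flush_to_disk(const uint8_t *buf, unsigned n);

    static void put_byte(uint8_t b)
    {
        outbuf[outpos++] = b;
        if (outpos == OUTBUF) {     /* overflow: dump it all to floppy  */
            flush_to_disk(outbuf, OUTBUF);
            outpos = 0;             /* write from the buffer start again */
        }
    }

    /* At the end of packing: unload only the filled part (never 0
       bytes), then re-form the descriptor(s) and the system sector. */
    static void finish_output(void)
    {
        if (outpos)
            flush_to_disk(outbuf, outpos);
        /* ...rewrite the file descriptor(s) and system sector here... */
    }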
I wish you pleasant archiver writing ;) Fire up your ALASMs - and off you go...

A. Coder