Miracle #03
16 июля 1999 |
|
kodit Who's there? - Guru meditates: optimization programs for execution time and size.
(C) PSV / Dream Team ------------------ OPTIMIZATION PROGRAM ON-TIME PERFORMANCE PROLOGUE At the current stage of development of science and Technology can not imagine any one area of their application, which gradually penetrated to a microprocessor technique. And gone are the days when the result of simple calculations, we had to wait a long tiring. Modern problems require solutions not just fast, and issued in "real time", ie the rate-making should be higher than the rate of production problems in mode, continuous time processes. Our door is now knocking on the device, recognizing and generating human voice and moving images of real objects. ... Unfortunately the giant speed processing of information available yet only super-computers and do not show today and tomorrow. Until available to most microprocessor equipment operates at speeds of about several millions of elementary operations per second (for clarity, I will say that acceptable to the eye parameter screen this computer is not up to speed computing: he was just keeping up with allotted to one frame time, just to screen a predefined picture). It is not surprising desire to improve the capacity of programs due to their literate writing. And sometimes obtained remarkable results: the speed of the program increases are sometimes a few times just for through some minor changes, and only 5 - 10% can expedite any "raw" program (of course we are talking about programs in assembler, as a first step acceleration of the program - write it on low-level language). Fundamentally differ in the two approaches to the optimization process: the heuristic method is based on logical thinking and requires no special knowledge of any language no CPU or hardware environment, but it requires a deep knowledge of algorithms, the method of hardware fitting more accurate (it even possible to implement lay on your computer) and is based on actually a very deep knowledge of all properties of those devices with which the program works. Program optimization is usually in two stages: first - the original drawing program with some too slow, the second - modification-established programs to improve parameters. Here are some that I know of optimization techniques in software According to this classification. Immediately it should be noted that due to scant coverage of this issue in the literature, almost all have been developed by a small circle of elite programmers (Hacker), and me and my friends. As a result, the material has chaotic character. CHAPTER ONE (On how to write fast programs without knowing the specific equipment) First of all, if your program running fast enough to think about all the other ways of achieving goal and, perhaps, a priori, there are faster algorithms. For example, conversion of coordinates of the object in three-dimensional space does not necessarily done by Euler equations, the same results can be achieved in several different methods and choose from them optimal. Another example - Automatic strip conductors in the design of the PCB, which can make a few notable ways. All methods need to explore and choose again - still the fastest. If the algorithm has already been selected and implemented, then the easiest way to do program faster - to remove from it cycles by repeating the program of the loop body. When the cycle holds less than 100 times a loop body simply insert the number of times time should run a cycle. If the cycle longer than 100 times, then repeat it as many times as possible, and the outer loop leave, but running less number of times. By this method the gain in speed is obtained by cutting away from the software processes such as change loop variable, its monitoring and the operators go to the beginning of the cycle. Also exempt register occupied previously by loop variable. In any program that goes beyond the transfer of data, there are algebraic operations. Accumulated experience allows programmers to derive some rules, guided by which these operations can be written so that they run faster. For example, A = BxB will always work faster than the A = B ^ 2 (if Of course, these operations are written as universal procedures, but not specifically optimized, as will be discussed below). Also faster to run A = Bx (C + D) compared with A = BxC + BxD, so as even if the processor is provided multiply there will always be slower than addition. We generalize this subject: here in what order should speed perform algebraic operations (from fastest to slowest): addition, subtraction, multiplication, division, raising to degree, taking root. And if you torment the question "which is faster: two multiplications Addition or four? ", then you have engage in a more detailed study of the equipment (see potaktovoe execution teams). Next, we consider more complex features such as: COS, SIN, TG, LOG, EXP. On this occasion it should be noted that in most cases the computation these functions, you can write integer, if their scaled, which will enormous gain in speed. Example: suppose that we need to calculate the expression A = BxSIN (C). In the calculation of sine, we calculate it is not in fractional numbers, but in the whole (due to the fact that the result will not change from 0 to 1, and from 0 to assume 256), multiplying everything on 256. Next, multiplying the resulting result should be divided by 256 (the is simply discarded low byte). Another way to avoid the application of complicated computation (and hence slow) functions - to calculate them in advance and set in an array. This path finds in recent years more and more widely used as memory expansion is much easier and cheaper to increase performance. For example, instead of calculating the sine series expansion in Most programs can be universal pre-calculate its value to say increments of one degree and within 4.5 decimal places that will take some 720 bytes of memory, but will speed up the program several times. Another area of application of the accelerating efforts - programs that most Part-time work does not require performance, but life spoil rare moments when you need something quickly do (for this type are almost all interactive programs - first they do not give and are waiting for commands from slow compared with human and then, when the team filed, they forced to wait for a man, until execute some complex action). In then helps the remedy - while the program is free bars (Assuming the keyboard is not possible to interview more than 100 times per second, which is not take even one per cent speed processor), it tries to predict What follow-up and prepare everything to the instantaneous processing: to find a free place on a medium to calculate the required constant check errors and to things like that. Even if cooking will be in vain harm from them is not be. But if you come in handy - required work is accomplished much bystree. The above mentioned techniques make it difficult and the program is extended, but there is one trick to reduce not only Execution time of procedure, but also reduce the required memory of her. We are talking about self-modifying programs. Consider the following example: Suppose a program needs to move some information from one medium to another, adding at the same time to each byte constant whose value every time when you start is defined by the user. Naturally in this case to allocate to constant of any memory cell anywhere in unoccupied areas. Let her address will NNNN. It should be action will be written like this: to take the next byte MOVE register, (NNNN) ADD another byte register put the next byte In this case, firstly in the team MOVE featured some of no importance to the algorithm address, but the program in memory is necessary take place under the constant. Using same self-modification techniques, we can write as follows: to take the next byte MOVE register, constant ADD another byte register put the next byte In this case, one of the teams primary cycle, running many thousands of times (That is, the speed of which must be fight), we get almost 2-fold gain in execution time due to removal of indirection to a constant. But how? Because of the constant condition is not fixed but set by the user, and our program will have to enter one specific number! A case in that instead of a specific number of would always be exactly the constant which is entered by the user, because, as known, in memory of the program is as a set of numbers (bytes), and our certain number will also take some - that The number of bytes in memory is quite particular address, and what this address is worse than that which we wish to allocate to keeping a constant? So it turns out that and the program quickly, and the head does not hurt when searching for free memory for the constant, and the program is shorter, and memory for data must be less. In addition to the constants in program can also change the code operations, which is much harder to write and requires a thorough knowledge of equipment. CHAPTER TWO (On how to write for a specific hardware) To apply the methods of the type described below is necessary to fully study the specifics of all the hardware which will operate the program, and First and foremost, a processor and storage media. Moreover, the study does not should restrict the logic of devices, and should include the study of temporal characteristics. Huge role in the performance of the program is the exchange rate with carriers information. In a large number of cases program is running slow not because of processor, but because of the slow exchange data from the media. As a rule, even in electronic media (memory ICs) speed is much lower than the processor, and is easy to see the lion proportion of teams just up and running with the memory (Remember that even the team retrieved the processor from memory). In such media as magnetic and optical discs speed and did a funny. The moral: do not be lazy, and thoroughly examine the characteristics of the carrier of your information and you will not regret it. Usually immediately find ways to increase the speed exchange: it turns out that some part of memory faster than the other, it turns out that the media can be cached (to provide fast buffer memory), then suddenly found that with even addresses work faster than with odd and so on. If there are differences in the characteristics necessary to have the most used data in faster sites. For example, many computers, the memory used as a video memory used to store non graphic information that provides more memory for the data. But having studied the rate of memory operation is usually found to that this memory is much slower and There is desirable to store only rarely used data. The next step - studying potaktovogo instruction execution by the processor. Immediately there is a numerical criterion of performance: you write well - 1200 cycles, and think and write simpler - 750 cycles. Also there is a nice opportunity to evaluate the performance of the program even introducing it into the car, and considering required to complete the number of cycles. For example, making the program that processes signals with frequencies up to 100 hertz income pointless to enter into computer program for signal processing, if one cycle it will take for the calculation more than 10000 cycles (when clocked at 1 MHz). At this stage, be useful to have a for a black list and bring back team running for an excessively large number of cycles to always avoid their use. If your program is working with any other intellectual partners (drives, graphics / arithmetic / music and other coprocessor or another computer on the network), then work with them can accelerate the also study in detail their logical and time parameters. For example do not need write a program drawing vectors given coordinates, if it is able to quickly making himself the video processor (a "smart" video processors are able to draw not only vector but also the three-dimensional polygons on a given picture cards, lighting and so on). Another example: why program to print text on the paper to align it along the edges where it might do the printer itself without spending while neither the time nor the memory. Or anything to control the checksum of the array when transmitting data, summing up a vast array of numbers, if it's hardware level is controlled by I / O port and only one bit of control in its flag state. These and similar errors lead to an annoying pauses where they can be eliminated with proper approach to speed of the program during its writing. In addition to "Iron" nice to know and features of human perception, if Your program provides its participation. For example, in recent programs with the creation of vysokorealnyh computed images using a strong simplification of computing the scene with its rapid motion, since the human eye all will not have time to consider the details. Also on these properties are calculated algorithms for a strong package of images and sound due to the disappearance of small and imperceptible parts. CHAPTER THREE (Techniques for writing fast programs KR580 and not far from him gone Z80) Despite the fact that an 8-bit processors are obsolete (by the way outdated and a 16-bit), they remains one of the most common in our processor, which makes Squeeze of them did not even see their developers speed. The most common considerations when writing programs: how the largest possible number of variables to keep in registers rather than memory; small routines do not execute as subroutines, and paste into the required place; try whenever possible to use the increment / decrement, and shift left / right instead of addition / subtraction and multiplication / division, respectively (DCR B: DCR B instead of MVI A, B: SUB 2: MVI B, A); apply when dealing with DBCS variable processing only junior bytes if it is known that for double-byte handling the high byte is still not change (eg if necessary handle 5000 bytes blocks of 4 is within one block from the counter will change only the low byte). Next Several techniques for specific common cases: 1. Operators of the branch (only for Z80): JP at 2 cycles faster than JR (albeit to a byte in short: someone that). 2. Transferring addresses from one register pair to another without thinking write this: Z-80 KR580 PUSH HL PUSH H POP BC POP B And almost three times faster to write as follows: LD B, H MVI B, H LD C, L MVI C, L 3. Fast cycle for double-byte variable. Usually written as: LD BC, chisloLXI B, the number of M: Cycle M: cycle DEC BC DCX B LD A, B MOV A, B OR C ORA C JP NZ, M JNZ M In the cycle but useful body spent a further 24 cycles, so we write this: LD BC, the number of LXI B, the number of M: Cycle M: cycle DEC C DCR C JP NZ, M JNZ M DEC B DCR B JP NZ, M JNZ M and spends only (14 +14 / 256) = 14.05 clocks per cycle. 4. Bit processing memory. Usually write like this (in a pair of HL - address in register C - mask): bit processing RRC C JR NC, M INC HL M: cycling It "eats" (20x7 +21) / 8 = 20.125 cycles for each transition to the next bat, and I propose to write like this: processing bits for KR580 RRC C this method JR C, M2 is not very M1 loop effective ... M2 INC HL JP M1 that "eats" (15x7 +36) / 8 = 17.625 cycles, although at first glance seems worse. 5. The fastest memory fill (Often used as a cleaning the screen buffer or the screen itself). As usually write will not even resemble - everyone writes their own way. But to squeeze all what can this processor can be as follows: LD SP, konets_ochischaemoy memory LD HL, kod_zapolneniya LD B, length / (number of PUSH in the cycle) / 2 M: PUSH HL PUSH HL ... DJNZ M The more write PUSH'ey - the faster to fill. And do not forget restore the stack pointer at the end cycle! The average is 1.5 times faster than conventional filling. By the same the principle of writing and transfer program memory (again, yet their performance obtain the maximum attainable for processor), but they are much more cumbersome. 6. Rapid procedure for multiplying (Parameters: Register E - the first factor, the register A - the second factor, HL - The result of the operation). A big plus program - selects the time is almost does not depend on the factors. LD HL, # 0000 to be KR580 LD D, L differ only RRA mnemonics JR NC, M1 ADD HL, DE M1: ADD HL, HL RRA JR NC, M2 ADD HL, DE M2: ADD HL, HL RRA JR NC, M3 ADD HL, DE M3: ADD HL, HL RRA JR NC, M4 ADD HL, DE M4: ADD HL, HL RRA JR NC, M5 ADD HL, DE M5: ADD HL, HL RRA JR NC, M6 ADD HL, DE M6: ADD HL, HL RRA JR NC, M7 ADD HL, DE M7: ADD HL, HL RRA RET NC ADD HL, DE RET Saving memory can collect repetitive action in a cycle, but is strongly lose in performance. On this for now. Happy optimizing!
Other articles:
Similar articles:
В этот день... 21 November