kodit Who's there? - Guru meditates: optimization programs for execution time and size.

<b>kodit Who's there?</b> - Guru meditates: optimization programs for execution time and size.

    (C) PSV / Dream Team

    ------------------

    OPTIMIZATION PROGRAM

    ON-TIME PERFORMANCE



            PROLOGUE


    At the current stage of development of science and
Technology can not imagine any
one area of their application, which gradually penetrated to 
a microprocessor technique. And gone are the days when

the result of simple calculations, we had to wait a long 
tiring. Modern problems require solutions not just fast, and 
issued in "real time", ie the rate-making should be higher than 
the rate of production problems in mode, continuous time 
processes. Our door is now knocking on the device,

recognizing and generating human
voice and moving images of real
objects.


    ... Unfortunately the giant speed
processing of information available yet
only super-computers and do not show
today and tomorrow. Until
available to most microprocessor
equipment operates at speeds of about
several millions of elementary operations
per second (for clarity, I will say that
acceptable to the eye parameter screen
this computer is not up to speed
computing: he was just keeping up with
allotted to one frame time, just
to screen a predefined
picture). It is not surprising
desire to improve the capacity of programs
due to their literate writing. And
sometimes obtained remarkable results: the speed of the program 
increases are sometimes a few times just for through some minor 
changes, and only 5 - 10% can expedite any "raw"

program (of course we are talking about programs in assembler, 
as a first step acceleration of the program - write it on

low-level language).


    Fundamentally differ in the two approaches to the 
optimization process: the heuristic method is based on logical 
thinking and requires no special knowledge of any language

no CPU or hardware environment,
but it requires a deep knowledge of algorithms, the method of 
hardware fitting more accurate (it even possible to implement

lay on your computer) and is based on
actually a very deep knowledge of all
properties of those devices with which
the program works.


    Program optimization is usually
in two stages: first - the original
drawing program with some
too slow, the second - modification-established programs to 
improve parameters.



    Here are some that I know of optimization techniques in 
software According to this classification. Immediately it 
should be noted that due to scant coverage of this issue in the 
literature, almost all have been developed by a small circle of 
elite programmers (Hacker), and me and my friends. As a result, 
the material has chaotic character.




         CHAPTER ONE

   (On how to write fast programs

      without knowing the specific equipment)



    First of all, if your program
running fast enough to
think about all the other ways of achieving
goal and, perhaps, a priori, there are faster algorithms.
For example, conversion of coordinates of the object in
three-dimensional space does not necessarily
done by Euler equations, the same
results can be achieved in several
different methods and choose from them
optimal. Another example - Automatic strip conductors in the 
design of the PCB, which can make a few notable ways. All 
methods need to explore and choose again - still the fastest.



    If the algorithm has already been selected and implemented, 
then the easiest way to do program faster - to remove from it 
cycles by repeating the program of the loop body. When the cycle

holds less than 100 times a loop body
simply insert the number of times
time should run a cycle. If the cycle
longer than 100 times, then repeat it as many times as 
possible, and the outer loop leave, but running less

number of times. By this method the gain in
speed is obtained by cutting away from the
software processes such as change
loop variable, its monitoring and the operators go to the 
beginning of the cycle. Also exempt register occupied 
previously by loop variable.



    In any program that goes beyond the transfer of data, there 
are algebraic operations. Accumulated experience allows 
programmers to derive some rules, guided by which these

operations can be written so that they
run faster. For example, A = BxB will
always work faster than the A = B ^ 2 (if
Of course, these operations are written as
universal procedures, but not specifically
optimized, as will be discussed
below). Also faster to run
A = Bx (C + D) compared with A = BxC + BxD, so
as even if the processor is provided
multiply there will always be
slower than addition. We generalize this subject: here in what 
order should speed perform algebraic operations (from

fastest to slowest): addition, subtraction, multiplication, 
division, raising to degree, taking root. And if you torment 
the question "which is faster: two multiplications Addition or 
four? ", then you have engage in a more detailed study of the 
equipment (see potaktovoe execution teams). Next, we consider 
more complex features such as: COS, SIN, TG, LOG,

EXP. On this occasion it should be noted
that in most cases the computation
these functions, you can write integer, if their scaled, which 
will enormous gain in speed. Example: suppose that we need to 
calculate the expression A = BxSIN (C). In the calculation of 
sine, we calculate it is not in fractional numbers, but in the 
whole (due to the fact that the result will not change from 0 
to 1, and from 0 to assume 256), multiplying everything on

256. Next, multiplying the resulting
result should be divided by 256 (the
is simply discarded low byte).


    Another way to avoid the application of
complicated computation (and hence
slow) functions - to calculate them in advance and set in an 
array. This path finds in recent years more and more widely 
used as memory expansion is much easier and cheaper to increase 
performance. For example, instead of calculating the sine 
series expansion in Most programs can be universal

pre-calculate its value to say
increments of one degree and within 4.5 decimal places that 
will take some 720 bytes of memory, but will speed up the 
program several times. 


    Another area of application of the accelerating
efforts - programs that most
Part-time work does not require
performance, but life spoil rare
moments when you need something quickly
do (for this type are almost
all interactive programs - first
they do not give and are waiting for commands from
slow compared with human
and then, when the team filed, they
forced to wait for a man, until
execute some complex action). In
then helps the remedy -
while the program is free bars
(Assuming the keyboard is not possible to interview
more than 100 times per second, which is not
take even one per cent speed
processor), it tries to predict
What follow-up and prepare everything
to the instantaneous processing: to find a free
place on a medium to calculate the required
constant check errors and to
things like that. Even if cooking
will be in vain harm from them is not
be. But if you come in handy -
required work is accomplished much
bystree.


    The above mentioned techniques make it difficult
and the program is extended, but there is one
trick to reduce not only
Execution time of procedure, but also reduce the required 
memory of her. We are talking about self-modifying programs. 
Consider the following example: Suppose a program needs to move 
some information from one medium to another,

adding at the same time to each byte
constant whose value every time
when you start is defined by the user.
Naturally in this case to allocate to
constant of any memory cell
anywhere in unoccupied areas. Let her
address will NNNN. It should be
action will be written like this:


    to take the next byte

    MOVE register, (NNNN)

    ADD another byte register

    put the next byte


    In this case, firstly in the team MOVE
featured some of no importance
to the algorithm address, but
the program in memory is necessary
take place under the constant. Using
same self-modification techniques, we can write
as follows:


    to take the next byte

    MOVE register, constant

    ADD another byte register

    put the next byte


    In this case, one of the teams primary
cycle, running many thousands of times
(That is, the speed of which must be
fight), we get almost 2-fold
gain in execution time due to
removal of indirection to a constant. But how? Because of the 
constant condition is not fixed but set by the user, and our 
program will have to enter one specific number! A case in that 
instead of a specific number of would always be exactly the 
constant which is entered by the user, because, as

known, in memory of the program is
as a set of numbers (bytes), and our certain number will also 
take some - that The number of bytes in memory is quite

particular address, and what this address is worse than that 
which we wish to allocate to keeping a constant? So it turns 
out that and the program quickly, and the head does not hurt

when searching for free memory for the constant, and the 
program is shorter, and memory for data must be less. In 
addition to the constants in program can also change the code

operations, which is much harder to write and requires a 
thorough knowledge of equipment. 



         CHAPTER TWO
(On how to write for a specific hardware)



    To apply the methods of the type described below is 
necessary to fully study the specifics of all the hardware

which will operate the program, and
First and foremost, a processor and storage media. Moreover, 
the study does not should restrict the logic of

devices, and should include the study of temporal 
characteristics. 


    Huge role in the performance of the program is the exchange 
rate with carriers information. In a large number of cases

program is running slow not because of
processor, but because of the slow exchange
data from the media. As a rule, even in
electronic media (memory ICs)
speed is much lower than the processor, and is easy to see the 
lion proportion of teams just up and running with the memory

(Remember that even the team retrieved the processor from 
memory). In such media as magnetic and optical discs

speed and did a funny. The moral:
do not be lazy, and thoroughly examine the
characteristics of the carrier of your information and you will 
not regret it. Usually immediately find ways to increase the 
speed exchange: it turns out that some part of

memory faster than the other, it turns out
that the media can be cached (to provide
fast buffer memory), then suddenly
found that with even addresses
work faster than with odd and so
on. If there are differences in the characteristics necessary 
to have the most used data in faster sites. For example, many 
computers, the memory used as a video memory used to store non 
graphic information that provides more memory for the data. But 
having studied the rate of memory operation is usually found to

that this memory is much slower and
There is desirable to store only rarely
used data.


    The next step - studying potaktovogo
instruction execution by the processor. Immediately
there is a numerical criterion of performance: you write well - 
1200 cycles, and think and write simpler - 750 cycles. Also 
there is a nice opportunity to evaluate the performance of the 
program even introducing it into the car, and considering

required to complete the number of
cycles. For example, making the program that processes signals 
with frequencies up to 100 hertz income pointless to enter into 
computer program for signal processing, if one cycle it will 
take for the calculation more than 10000 cycles (when clocked at

1 MHz). At this stage, be useful to have a
for a black list and bring back
team running for an excessively large number of cycles to always
avoid their use.


    If your program is working with
any other intellectual
partners (drives, graphics /
arithmetic / music and other
coprocessor or another computer on the
network), then work with them can accelerate the
also study in detail their logical and
time parameters. For example do not need
write a program drawing vectors
given coordinates, if it is able to
quickly making himself the video processor (a "smart" video 
processors are able to draw not only vector but also the 
three-dimensional polygons on a given picture cards, lighting 
and so on). Another example: why program to print text on the 
paper to align it along the edges where it might do the printer 
itself without spending while neither the time nor the memory. 
Or anything to control the checksum of the array when 
transmitting data, summing up a vast array of numbers, if it's 
hardware level is controlled by I / O port and only one bit of 
control in its flag state. These and similar errors

lead to an annoying pauses where they
can be eliminated with proper approach to
speed of the program during its
writing.


    In addition to "Iron" nice to know and
features of human perception, if
Your program provides its
participation. For example, in recent programs
with the creation of vysokorealnyh computed
images using a strong simplification of computing the scene 
with its rapid motion, since the human eye all

will not have time to consider the details.
Also on these properties are calculated
algorithms for a strong package of images and
sound due to the disappearance of small and
imperceptible parts.


         CHAPTER THREE
 (Techniques for writing fast programs

  KR580 and not far from him gone Z80)



    Despite the fact that an 8-bit
processors are obsolete (by the way
outdated and a 16-bit), they
remains one of the most common in our processor, which makes
Squeeze of them did not even see their developers speed. The 
most common considerations when writing programs: how

the largest possible number of variables to keep in registers 
rather than memory; small routines do not execute as 
subroutines, and paste into the required place; try whenever 
possible to use the increment / decrement, and shift left / 
right instead of addition / subtraction and multiplication / 
division, respectively (DCR B: DCR B instead of MVI A, B: SUB 
2: MVI B, A); apply when dealing with DBCS

variable processing only junior
bytes if it is known that for double-byte
handling the high byte is still not
change (eg if necessary
handle 5000 bytes blocks of 4 is
within one block from the counter will
change only the low byte). Next
Several techniques for specific common cases:


    1. Operators of the branch (only for
Z80): JP at 2 cycles faster than JR (albeit to a byte in short: 
someone that). 


    2. Transferring addresses from one register pair to another 
without thinking write this: 


         Z-80 KR580


        PUSH HL PUSH H

        POP BC POP B


    And almost three times faster to write
as follows:


        LD B, H MVI B, H

        LD C, L MVI C, L


    3. Fast cycle for double-byte variable. Usually written as:


       LD BC, chisloLXI B, the number of

    M: Cycle M: cycle

       DEC BC DCX B

       LD A, B MOV A, B

       OR C ORA C

       JP NZ, M JNZ M


    In the cycle but useful body spent a further 24 cycles, so 
we write this: 


       LD BC, the number of LXI B, the number of

    M: Cycle M: cycle

       DEC C DCR C

       JP NZ, M JNZ M

       DEC B DCR B

       JP NZ, M JNZ M


    and spends only (14 +14 / 256) = 14.05
clocks per cycle.


    4. Bit processing memory. Usually write like this (in a 
pair of HL - address in register C - mask): 


        bit processing

        RRC C

        JR NC, M

        INC HL

    M: cycling


    It "eats" (20x7 +21) / 8 = 20.125
cycles for each transition to the next
bat, and I propose to write like this:


        processing bits for KR580

        RRC C this method

        JR C, M2 is not very

    M1 loop effective

        ...

    M2 INC HL

        JP M1


    that "eats" (15x7 +36) / 8 = 17.625
cycles, although at first glance seems
worse.


    5. The fastest memory fill
(Often used as a cleaning
the screen buffer or the screen itself). As
usually write will not even resemble -
everyone writes their own way. But to squeeze all
what can this processor can be as follows:


        LD SP, konets_ochischaemoy memory

        LD HL, kod_zapolneniya

        LD B, length / (number of PUSH in the cycle) / 2

    M: PUSH HL

        PUSH HL

        ...

        DJNZ M


    The more write PUSH'ey - the
faster to fill. And do not forget
restore the stack pointer at the end
cycle! The average is 1.5 times faster
than conventional filling. By the same
the principle of writing and transfer program
memory (again, yet their performance
obtain the maximum attainable for
processor), but they are much more
cumbersome.


    6. Rapid procedure for multiplying
(Parameters: Register E - the first factor, the register A - 
the second factor, HL - The result of the operation). A big plus

program - selects the time is almost
does not depend on the factors.


        LD HL, # 0000 to be KR580

        LD D, L differ only

        RRA mnemonics

        JR NC, M1

        ADD HL, DE

    M1: ADD HL, HL

        RRA

        JR NC, M2

        ADD HL, DE

    M2: ADD HL, HL

        RRA

        JR NC, M3

        ADD HL, DE

    M3: ADD HL, HL

        RRA

        JR NC, M4

        ADD HL, DE

    M4: ADD HL, HL

        RRA

        JR NC, M5

        ADD HL, DE

    M5: ADD HL, HL

        RRA

        JR NC, M6

        ADD HL, DE

    M6: ADD HL, HL

        RRA

        JR NC, M7

        ADD HL, DE

    M7: ADD HL, HL

        RRA

        RET NC

        ADD HL, DE

        RET


    Saving memory can collect
repetitive action in a cycle, but is strongly
lose in performance.

On this for now.
Happy optimizing!