Bugs #01
30 сентября 1999

Small Coding - Procedures for squaring and square root.

<b>Small Coding</b> - Procedures for squaring and square root.
(C) The Devils Corporation 1999.



                     Squaring.


    As you have guessed from the title of the article, here it 
goes as just about squared. This feature is one of the most 
common in the implementation of the computer. I offer two 
options for obtaining square numbers.


1. 8-BIT VERSION.

HL = A ^ 2


        LD H, 128; 7: 2

        LD L, A; 4: 1

        LD A, (HL); 7: 1

        INC H; 4: 1

        LD H, (HL); 7: 1

        LD L, A; 4: 1

        RET; 10: 1

        -----------------------
                         8 bytes.

Time = 43 clock cycles.


    It is seen that uses a table of 512 bytes, which is 
pre-placed squares of all the 256 numbers. So to speak - no 
brainer I did not invent and do not intend to. Again, recall 
that the subroutine can be-optimized if: 

a) Receiving no response to HL.
b) A number that is squared - send in L.
c) Get rid of the RET, introducing a subroutine in the main 
body. d) Keep the number 128 somewhere else.


Example for BC = L ^ 2:


        LD H, D; 4: 1 D = 128

        LD C, (HL); 7: 1

        INC H; 4: 1

        LD B, (HL); 7: 1

        ----------------------
                       4 bytes.
Time = 22 clock cycles.

Yes, I almost forgot to complete the picture lead the process 
of creating a table of squares of 8-bit numbers. 


       LD HL, 32768; start address table

       LD D, L

       LD E, D

       LD B, D

       LD C, 1

   M1: LD (HL), E

       INC H

       LD (HL), D

       DEC H

       EX DE, HL

       ADD HL, BC

       EX DE, HL

       INC BC

       INC BC

       INC L

       JR NZ, M1

       RET
-----------------
       21 bytes.

Well, that's cool? How not??

Okay, well:


       LD HL, 32768; start address table

       LD D, L

       LD E, D

       LD B, D

       LD C, D

   M1: LD (HL), E

       INC H

       LD (HL), D

       DEC H

       INC BC

       EX DE, HL

       ADD HL, BC

       EX DE, HL

       INC BC

       INC L

       JR NZ, M1

       RET
-----------------
       20 bytes.

What - again crap?
Granted, you can short - my limit was 18 bytes.


2. 16-BIT VERSION.


    Must be squared BC. From school for all of us (and may be
not all) is well known that:


         (A + B) ^ 2 = A ^ 2 + 2 * A * B + B ^ 2

So, if you take the register-pair BC and expanded it to 'A'
and 'B', then we obtain the following: (B * 256 + L) ^ 2.
In this and squaring based on my version:

A) First, create a table of squares of 8-bit numbers. (Above)
B) And now, the procedure itself:


   The principle is: DE: HL = BC ^ 2


        LD H, 128

        LD L, C

        LD E, L

        LD A, (HL)

        LD (M1 +1), A

        INC H

        LD C, (HL)

        LD L, B

        LD A, (HL)

        DEC H

        LD B, (HL)

        LD HX, A

        LD H, L

        LD D, 0

        LD L, D
And now 8 times the following three lines - this multiplication!

        ADD HL, HL

        JR NC, $ +3

        ADD HL, DE
; ---------------------------------
        LD D, HX

        ADD HL, HL

        JR NC, $ +3

        INC D

        ADD HL, BC

        LD E, H

        LD H, L
M1: LD L, 0

        RET NC

        INC D

        RET
------------------
        66 bytes.

Time = 400 cycles.

Written outright and not optimized. So go for it!
To begin, try to understand - what actually happened?



                                    Here, here purely, purely 
here 

                                    You understand that you are 
not cool! 


                         Square root.


   In this article I'll show you how literate guys calculate 
the square root. Just want to clarify that I calculate the root 
with no fractional part and the answer is rounded to the 
nearest integer. Well, so happened with me, ie:  SQR (16384) = 
128, not 127 or 129.  SQR (17000) = 130, not 129, and certainly 
not 130.3840484619141. It should be noted that I was satisfied 
with this round up to a certain extent only because of the 
speed of calculation. 

All the procedures operate on the principle A = SQR (HL)


1. Brake which no one saw!


      LD DE, 1, 10: 3

      XOR A; 4: 1

  M1: SBC HL, DE; 15: 2

      RET C; 11 / 5: 1

      INC DE; 6: 1

      INC DE; 6: 1

      INC A; 4: 1

      JR M1; 12: 2

      --------------------
                     12 bytes.

Min. time = 14 + 26 = 40 cycles
Maks.vremya = 14 + 255 * 48 = 12254 cycles.

Well, let's say - it works based on the fact that the number of 
responses with each successive integer response increases by 2. 
Written by me for the first time in 1994. 


2. Quickening brake ...


      LD DE, 65535; 10: 3

      XOR A; 4: 1

  M1: ADD HL, DE; 11: 1

      RET NC; 11 / 5: 1

      DEC DE; 6: 1

      DEC DE; 6: 1

      INC A; 4: 1

      JP M1; 10: 3 Place JR - will be shorter!

    --------------------
                     12 bytes.

Min. time = 14 + 22 = 36 cycles
Maks.vremya = 14 + 255 * 42 = 10724 cycles.

From quickening becomes evident only one brake - he is still 
quickening brake. The only advantage of these procedures - they 
are very short. 

3. The most appropriate size and speed.


      LD DE, 64, 10: 3

      LD A, L; 4: 1

      LD L, H; 4: 1

      LD H, D; 4: 1

      LD B, 8, 7: 2

  M1: SBC HL, DE; 15: 2

      JP NC, M2; 10: 3 (or JR NC, M2)

      ADD HL, DE; 11: 1

  M2: CCF; 4: 1

      RL D; 8: 2

      RLCA; 4: 1

      ADC HL, HL; 15: 2

      RLCA; 4: 1

      ADC HL, HL; 15: 2

      DJNZ M1; 13 / 8: 2

      LD A, D; 4: 1

      RET; 10: 1

    ----------------------
                  27 bytes.

Execution Time = 838 cycles.

Go tell - JR and JP does not matter.
The very method of calculation was taken, as I recall, from the 
ROM 48. 

4. Implicit spreadsheet. (Beware - an explosion of brains!)


      LD A, H; 4: 1

      CP 16, 7: 2

      JR NC, M1; 12 / 7: 2

      OR 128, 7: 2

      LD H, A; 4: 1

      LD A, (HL); 7: 1

      RET; 10: 1

  M1: LD L, H; 4: 1

      LD H, 127; 7: 2

      LD H, (HL); 7: 1

      LD L, A; 4: 1

      ADD A, (HL); 7: 1

      RET; 10: 1

    ---------------------
                   17 bytes.

When HL = 0 to 4095, time = 46 cycles.
When HL = from 4096 to 65535, Time = 62 clock cycles.

Table size 5120. (Memory address 32768)
A table is:
0000-4095 - responses to HL from 0-4095
4096-4351 - interpolation range 4096-19199
4352-4607 - interpolation range 19200-33791
4608-4863 - interpolation range 33792-34815
4864-5119 - interpolation range 34816-65535

Why so - you can not ask.
Additional 256-byte table with the address 32512.
Ingredients:

        DS 75, 144

        DS 57, 145

        DS 4, 146

        DS 120,147

 ===== Well, that - TOWER tear?? =====

Now let go, when to tell about the accuracy of the calculations:

Absolutely = 50046 numbers. 77%
Error on -1 = 10321 number. 15%
Error on +1 = 5199 numbers. 8%

Well, what to say in his defense - should be better to 
interpolate! (But vpadlu ... yes and no time at all ...)


Further, in principle, can not read anything until paragraph 5. 
!!!!!!! -------------------------------------------------- 
------------ You can get rid of 4096-byte table if you modify 
the program. 


       LD A, H; 4: 1

       CP 16, 7: 2

       JR NC, M1; 12 / 7: 2


       LD DE, 65535; 10: 3

       XOR A; 4: 1

   M2: ADD HL, DE; 11: 1

       RET NC; 11 / 5: 1

       DEC DE; 6: 1

       DEC DE; 6: 1

       INC A; 4: 1

       JP M2; 10: 3


   M1: LD L, H; 4: 1

       LD H, 127; 7: 2

       LD H, (HL); 7: 1

       LD L, A; 4: 1

       ADD A, (HL); 7: 1

       RET; 10: 1

       ---------------------
                    24 bytes.

Of course, you guessed it, I put way 2 inside.
The table has a total of 1024 bytes.
All HL, which are greater than 4096 on the table, but
the other for:
Min.vremya = 32 + 22 = 54 clock cycles.
Maks.vremya = 32 + 42 * 63 = 2678 cycles.

Well, that's probably very smart to add - consider themselves
steep optimizers. Only a fool does not see that here you can
unroll the loop, get rid of two DEC DE and a JP and get 
something like this: 

; First three lines of procedure, see above


       LD DE, 65535; 10: 1

       XOR A; 4: 1

       ADD HL, DE; 11: 1

       RET NC; 11 / 5: 1


       DUP 1963

       LD DE, 65533, short, DE = previous DE - 2

       INC A

       ADD HL, DE

       RET NC

       EDUP

; Last 6 lines see above

DUP 63 - Means to repeat 63 times the block of code.
Now the time:
Min. time = 18 (initial CP-scale) + 36 = 54 clock cycles.
Maks.vremya = 18 (------ / / ------) + 30 * 64 = 1938 cycles.

Something like 740 cycles snatched a very large response.
Blog cycle is: 6 * 64 = 384 bytes.

In such cases, I would like to recall the procedure 3, which is 
any root of 838 cycles. Thus, if use it to find the roots of 
these accidents 4096 You can win in time as follows:

1. (838-18) / 30 = 27. ...
2. 28 ^ 2 = 784

That is, the interval from 0 to 784 procedure 2 "makes" the 
procedure 3, but further ... bummer.

-------------------------------------------------- ------------
5. Explicit table calculation.

In advance, a table of answers to SQR (HL), and then you can
use something like this.
Example for the roots of 0 to 16383 (Table 1 segment).


        LD A, H; 4: 1

        OR 192; 7: 2

        LD H, A; 4: 1

        LD A, (HL); 7: 1

        RET; 10: 1

        ------------------
                    6 bytes.
Time = 32 clocks.

or

        ADD HL, BC; BC = 49152

        LD A, (HL)

        RET

        ----------------
        3 bytes, 28 cycles.


    In principle, this method does not work that way.
Number 192 should be stored in any case (for example in C)
but about the RET should be forgotten altogether, as do CALL 6 
bytes is complete insanity. For those who do not understand - a 
subroutine should be built into the main body of the program! 
Here then we obtain an almost full speed - 19 strokes.


    And this is not the limit. Because if the table a little 
lift to 16384 bytes so, the routine can be as follows:



        SET 7, H; 8: 2

        LD A, (HL); 7: 1

        RET; 10: 1

        -----------------
                  4 bytes.

Time = 25 cycles. (No RET - 15 cycles).


    But this is not the limit. Since many computers are
the ability to connect instead of ROM in your page, if you can 
pay under the table, this piece of memory the calculation take 
the square root of 7 cycles (LD A, (HL)). From a mathematical 
point of view is a complete mess. But it is too fast.


Okay - so far enough to engage in this nonsense. In fact, mode 
3, in principle, can make a malicious nelyubiteley tables (I 
have something did not understand - surely there are still such 
Downey and Thugs). Or that - a table of 512 bytes or 16,384 - 
it - all - Khan? Guys, I understand that all of you, as if to 
programmers, such as - hackers, and all that, but sank to the 
ground, because the rate of ZX SPECTRUMa not great, and spend 
time on the conversion of some formula - it's a luxury.


   I do not remember where, was given a procedure of extracting 
the square root of the HL. There used multiplication. I do not 
understand - that there needed to multiply, and why? For all 
that the procedures of multiplication was not a fountain. What 
kind of nonsense? You that, sorry publish the normal (I'm not 
saying sverhklassnye) algorithms? Or mind only enough to write 
a braking procedure? Guys, it's time to stop using the old 
formula - Pythagoras Heron, the Newton-Leibniz (square root) 
and other elders. It is now almost 2000. Think of something my!


PS: I hope you people are smart and do not take offense to 
criticism, and 

    Notes. Did not want to offend anyone and did not try.

    You are right only in that no matter how bad your algorithm
    rhythms - they still work in real programs - and this is

    perhaps their only plus.


    If someone wants to express its opinion on the matter and 
without ... 

    Accept the challenge from anyone in terms of writing itself 
short     What the procedure or program of your choice (broken 
off almost every     dogo! )



    My E-Mail: devils@ellink.ru


PPS: I propose to write a short procedure for calculating

      the cube root of HL in integers.

      A = HL ^ (1 / 3)

Here's my version:


        LD DE, 65535

        LD BC, 0
M1: ADD HL, DE

        RET NC

        INC A

        EX DE, HL

        DEC BC

        DEC BC

        DEC BC

        DEC BC

        DEC BC

        DEC BC

        ADD HL, BC

        EX DE, HL

        JR M1

        ---------
        20 bytes.

________________________________( C) The Devils Corporation 
1999. 






Other articles:

Bugs - On the newspaper.

Protect - Protect and antizaschita programs.

Small Coding - Procedures for squaring and square root.

Jokes - Anecdotes about drug addicts and drunks.

Business on Speccy - Business on the Spectrum.

little about my - Telephone Directory: The Great Luke.

News - News from the Great Bow. Reinkarnitsiya Z-80.

about a rock star - A brief chronology of the life and work of Viktor Tsoi.

Advertising - Advertising and announcements.

Poems - Three Monkeys (novel in verse).

Test - A is not a maniac you know? (For fans of the soldering iron). Rainbow name.


Темы: Игры, Программное обеспечение, Пресса, Аппаратное обеспечение, Сеть, Демосцена, Люди, Программирование

Similar articles:
From the Author - He received a diploma ...
Articles - Chipmeyking as art minimum or "My opinion about the AY-music. "
Programming - Z80 assembler on Russos.

В этот день...   21 November