October 19, 2020, 08:49:03 pm


Alternatively, join us on Discord.
You can help CodeWalrus stay online by donating here.

Show posts

This section allows you to view all posts made by this member. Note that you can only see posts made in areas you currently have access to.

Topics - Zeda

1
Hey all, I implemented the heapsort algorithm with some features inspired by Sean McLaughlin's implementation. Heapsort operates in O(n*log(n)) time, so it is fast even as array sizes increase. The cool perk to Sean's implementation that I wanted to include in mine is the ability to have a callback function that performs comparisons. This is fantastic if your array contains pointers to, say, an array of strings. The callback function can take the inputs (in my case, HL and DE), perform a string comparison, and return the result to the heapsort routine. Heapsort will then know whether or not to swap the elements.

As an example, I created a VAT sorting routine. It creates an array of pointers to the VAT entries of named variables (like Programs, Appvars, etc.), and then uses a fancy callback function that compares the names of the VAT entries. Returned is an alphabetically sorted array of pointers.

As well, I made a very fast routine to sort TI lists. The callback routine for this takes indices, constructs pointers to the actual list data, compares them, then returns the result. Since the OS routine uses an O(n2) algorithm, my program can perform way faster on larger lists. I ran a few tests on random lists and came up with:

* at 100 elements, the OS routine and my routine were pretty close in speed.
* at 400 elements, my routine sorted it in ~1.3 seconds, versus ~9.5. for the OS routine
* at 999 elements, my routine sorted it in ~3.0 seconds, versus ~55.5. for the OS.

Code (Heapsort) Select

#ifndef scrap
#define scrap 8000h
.echo "Warning, 'scrap' not defined, defining scrap=8000h"
#endif
#ifndef SMC
arraybase= scrap
arraylen = scrap+2
#endif
heapflags= 33
curchild = 0
heapsort:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
#ifdef fast
    call heapify
    ld hl,(arraylen)
#else
    push bc
    call heapify
    pop hl
#endif
_:
    dec hl
    ld (arraylen),hl
#ifndef fast
    push hl
#endif
    ld de,(arraybase)
    add hl,hl
    inc de
    add hl,de
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
#else
    call swap
#endif
    ld bc,1
    call propogate
#ifdef fast
    ld hl,(arraylen)
#else
    pop hl
#endif
    ld a,h
    or l
    jr nz,-_   
    ret
heapify:
;Inputs:
;   HL points to the array data
;   BC is the size of the array. NOT GREATER THAN 32767
;   IX points to the routine that compares the values
    ld (arraybase),hl
    ld (arraylen),bc
    srl b
    rr c
_:
    push bc
    call propogate
    pop bc
    dec bc
    ld a,b
    or c
    jr nz,-_
    ret
propogate:
;BC-1 is the parent index
;2BC is the child1 index
    res curchild,(iy+heapflags)
proppost:
    sla c
    rl b
    ld d,b
    ld e,c
#ifdef SMC
arraylen=$+1
    ld hl,0
#else
    ld hl,(arraylen)
#endif
    sbc hl,de
    add hl,de
    ret c  ;no children
;compare the two children
#ifdef SMC
arraybase=$+1
    ld hl,0
#else
    ld hl,(arraybase)
#endif
    add hl,de
    add hl,de
    inc hl
    ld d,(hl)
    dec hl
    ld e,(hl)
    dec hl
    push hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    ;HL holds the value of child0
    ;DE holds the value of child1
    jr z,+_
    call callix
    jr nc,+_
    ex de,hl
    pop de
    inc de
    inc de
    set curchild,(iy+heapflags)
    push de
_:
;{stack} points to the child
;HL is the value of the child
;BC points to the parent
;now compare the child and parent
    ex de,hl
    ld hl,(arraybase)
    add hl,bc
    push hl
    dec hl
    ld a,(hl)
    dec hl
    ld l,(hl)
    ld h,a
    call callix
    pop hl
    pop de
    ret nc
    dec hl
    call swap
;values swapped, now set parent=child
;BC is the index of child1
    bit curchild,(iy+heapflags)
    jp z,proppost
    inc bc
    jp propogate
swap:
;HL points to the top of one word
;DE points to the top of another
;Must preserve BC
#ifdef fast
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    inc c
    inc bc
#else
    call +_
_:
    ld a,(de)
    ldd
    inc hl
    ld (hl),a
    dec hl
    inc bc
#endif
    ret
callix:
    jp (ix)

Even when optimizing for speed over size, my code is smaller, but I'm fairly sure there are more optimizations to be found!
2
Hey all, I have been playing with compression. I made a Python program that compresses data, such as a TI Image, as well as a calculator program that can decompress such data to the graph screen.

While the Python program is general-purpose, the Z80 program assumes that the input is a compressed image, so you can mess things up! The neat thing is that in my example, the calc program (LZD) and the compressed data, along with their file headers, all came out to less than the size of the original image!
3
Other / Topic: Fast 3-digit Squaring, Toom-Cook
June 26, 2018, 06:36:55 pm
Hi all, I realized that I have a lot of unshared results that I've worked out over the years! (Then again, I have a lot of shared results :P )
Today's is dealing with squaring 3-digit numbers.
If you aren't familiar with the Toom-Cook algorithm, read the spoiler! It's cool stuff. Otherwise, continue down.
Prerequisite: ShowHide

The Toom-Cook algorithm basically generalizes multiplying integers as multiplying polynomials.For example, if we want to multiply 47*81, we can think of it as multiplying (7+4x)*(1+8x) and evaluating at x=10 (since our numbers were in base 10). This seems like a lot of unnecessary complexity, and it is for such small numbers, but let's look at a few key ideas.

If you are given 3 points on a plane, you can construct a unique quadratic function that passes through each point. The key here is that it is unique meaning there are no other degree-2 polynomials that pass through each point-- there is only one and that one must exists. On the other hand given 4 points, there might be a parabola that passes through each point, but it is not a guarantee and is in fact unlikely. As well, given three points, there exist infinitely many cubic polynomials (or quartic or quintic, etc.) that pass through each point. In general, if you have n+1 points, you are guaranteed to be able to find a unique polynomial of degree n that is the only such polynomial that passes through each point.

Guarantees and uniqueness are cool and all, but actually constructing the polynomials from those points can be tricky (but still "quickly" doable in a computational sense)! Luckily for our case, the points won't be totally arbitrary. Let's look back at our Toom-Cook example of multiplying 47*81. The polynomial product is (7+4x)*(1+8x), which we know should be a quadratic equation. Then if we evaluate this product at three easy points, we should be able to reconstruct the result. Specifically, let's evaluate at x=-1, x=0, and x=1:
(7-4)*(1-8)=-21
7*1=7
(7+4)*(1+8)=99

The easiest way to reconstruct the polynomial by hand, in my opinion, is to use matrices to represent a system of equations. In this case, with ai being the points we evaluated at -1,0, and 1:

[[  1   (-1)1    (-1)2   a0]
[  1   01       02      a1]
[  1   11       12      a2]]


   [[  1  -1   1   a0]
=   [  1   0   0   a1]
    [  1   1   1   a2]]

And now we reduce this to reduced row-eschelon form.

   [[  0   1   0   (a2-a0)/2]
=>  [  1   0   0   a1]
    [  1   0   1   (a2+a0)/2]]


   [[  0   1   0   (a2-a0)/2]
=>  [  1   0   0   a1]
    [  0   0   1   (a2+a0)/2-a1]]


Now that we know the resulting coefficients, let's construct our polynomial:
a1+(a2-a0)x/2+((a2+a0)/2-a1)x2
=7+60x+32x2
Let's evaluate this at x=10 and we have 3200+600+7 = 3807 which is indeed 47*81.

The nice thing about this is, since we've already solved the system of equations, we can reuse this and we don't need to compute it every time. We just compute the product at -1, 0, and , then plug it into our formula et voila. In fact, let's make a BASIC program that multiplies two ten-digit numbers and returns a 20-digit result. Input is passed through two 2-element lists, L1 and L2, and use those elements as 5-digit 'digits' to compute the product.

{1,-1
sum(L1Ans)sum(L2Ans->A
L1(1)L2(1->B
sum(L1)sum(L2->C
.5(C-A
{B,Ans,Ans+A-B->L1
0->C
L1e-5->L1    ;engineering E
For(K,1,3
C+L1(k->L1(K
e-5int(Ans->C
End
C->L1(4
e5fPart(L1->L1

The output is a 20-digit integer comprised of four 5-digit numbers. The first element is the lowest 5 digits, the fourth element is the upper 5 digits.

To square a 3-digit number, we'll result in a degree-4 polynomial. Consequently, we'll need to evaluate at 5 points. The points I chose were {0,1,i,-1,-i} where 'i' is the imaginary unit. It may seem like evaluating at a complex point would be, well, complex, but we'll see. First, let's write our polynomial as a0+a1x+a2x2.
Then:
b0=f(0)=a02
b1=f(1)=(a0+a1+a2)2
b2=f(i)=(a0+a1i-a2)2
b3=f(-1)=(a0-a1+a2)2
b4=f(-i)=(a0-ia1-a2)2
Squaring complex numbers is easy though-- (a+bi)2= a2-b2+2abi = (a-b)(a+b)+2abi
So simplifying:
b2=(a0-a2+a1)(a0-a2-a1)+2i(a0-a2)a1
b4=(a0-a2+a1)(a0-a2-a1)-2i(a0-a2)a1

Noticing symmetry, let's define instead:
c3=(a0-a2+a1)(a0-a2-a1)
c4=2(a0-a2)a1

Then:
b2=c3+ic4
b4=c3-ic4


    1       0       0       0       0   b0
    1       1       1       1       1   b1
    1       i      -1      -i       1   c3+ic4
    1      -1       1      -1       1   b3
    1      -i      -1       i       1   c3-ic4


    1       0       0       0       0   b0
    1       0       1       0       1   (b1+b3)/2 = c1
=>  0       1       0       1       0   (b1-b3)/2 = c2
    1       0      -1       0       1   c3
    0       1       0      -1       0   c4


    1       0       0       0       0   b0
    1       0       0       0       1   (c1+c3)/2 = d1
=>  0       0       1       0       0   (c1-c3)/2 = d2
    0       1       0       0       0   (c2+c4)/2 = d3
    0       0       0       1       0   (c2-c4)/2 = d4


Then our polynomial is:
b0+d3x+d2x2+d4x3+(d1-b0)x4


So for an example, let's find 348*348
Then we have:
a0=8
a1=4
a2=3
b0=64
b1=(8+4+3)2=225
b3=(8-4+3)2=49
c3=(8-3+4)(8-3-4)=9
c4=2(8-3)*4=40
c1=(225+49)/2 = 137
c2=c1-49 = 88
d1=(137+9)/2 = 73
d2=d1-9 = 64
d3=(88+40)/2 = 64
d4=d3-40 = 24

So then, our result is:
64+64x+64x2+24x3+9x4
Evaluating at x=10 eyelids yields:
64+640+6400+24000+90000 = 121104, which is indeed 384*384


Now let's generate some Python code:

def sqr3(a,b,c):
    d=a+b+c    #roughly 75% of cases exceed the original bit-width
    d*=d
    e=a-b+c    #roughly 1/6 of the time this will exceed the original bit-width. However, whenever this exceeds the bit-width, then d also exceeds it.
    e*=e
    d=(d+e)/2
    e=d-e
    f=(a+b-c)*(a-b-c)   #it is never the case that both multiplicands exceed  their original word width, but about 1/3 of the time, one of them exceeds it (~2/3 of the time, neither exceed).
    b*=(a-c)        #neither multiplicand exceeds the original bit-width (though the second operand may become negative).
    b+=b
    a*=a
    e=(e+b)/2
    b=e-b
    d=(d-f)/2
    f+=d-a
    return [a,e,d,b,f]

Note that three of the sub-multiplications are just squaring the input. Also note that all divisions are by a power of two, and all results and operations are integer operations, not complex (assuming all inputs are integers).
4
A fascination with fire is nothing new for humans, or even programmers. With that in mind, I wanted to give some insight into how to create your very own fire animations, like this:

In that program, I basically started from the top of the screen, working downward. I'd copy from the pixel below, with a 12.5% chance of the copy "burning up" and thus failing. To cause a perpetual burn, I then OR the original image back to the buffer and run the next iteration. If I didn't do this, then the image would disintegrate after a few seconds (which is cool, too).

Now of course, this is Assembly, which is fast, but it's also the Z80, so not that fast. Instead of operating on one pixel at a time, it's faster if we can operate on 8 pixels at a time (represented by 1 byte). Copying the pixels is easy-- if DE points to the current byte of pixels, and HL points to the byte of pixels below, then we can do ld a,(hl) \ ld (de),a (or ldi). The tricky part is coming up with some chance of messing with those pixels.

We want it to appear random or chaotic, so we need an RNG of some kind, in this case a Pseudo-Random Number Generator. Ideally, we also want it to be fast. If our PRNG returns numbers with the property that each bit is chaotic, but has a 50% chance of being 0 or 1, then we can apply some probability tricks. For example, if we OR three pseudo-random numbers, then the result has bits with probability 1/8 being 0, 7/8 being 1. Armed with this knowledge, if we then AND this with the pixels to copy, we have a 12.5% chance of turning any ON pixels to OFF.

This assumes the 'flames' are black. If we want white flames, AND three numbers from the PRNG and OR the result with the pixels to copy. This has a 12.5% chance of turning a pixel ON.

If we want more complicated probabilities, we can change the sequence of ANDs and ORs. For example, if we OR two PRNs and AND a third, the bit probabilities are 37.5% being 1, 62.5% being 0.
Just note that if we change the order, then the probabilities change! If we AND to PRNs and then OR a third, we have a 62.5% being 0 and 37.5% being 1.

To calculate, we start with p0=50%.
If we OR with a PRN, then pi+1=(1+pi)/2
If we AND with a PRN, then pi+1=pi/2

For example, if we AND/OR/AND/OR, we have:
p0=.5
p1=.5/2 = .25
p2=(1+.25)/2 = .625
p3=.625/2 = .3125
p4=(1+.3125)/2 = .65625



With my example, I wanted it to be easy, so I just went with ORing three pseudo-random numbers to generate a bitmask. I created a 24-bit PRNG with the necessary qualities (and as a bonus, it passed a suite of randomness tests), so by calling it once and ORing each of the bytes together, I can make a mask.

For those who might be able to follow Axe code a little better:

StorePic
While getKey(15)=0
L3->W
For(Z,L6,L6+767)
{Z} or {W}->{Z}
W+1->W
End
L6->Z+12->W
For(K,0,755)
rand or rand or rand or {W}->{Z}
W+1->W
Z+1->Z
End
DispGraph
End
5
A while ago I made Floatlib, an app that contained a bunch of Z80 single-precision floating point routines (as well as a few other helpful routines). My intent was for my own on-calc programming, so it includes in-app documentation. Anyways, I got bored and decided to make a program that draws the Mandelbrot Set as kind of an example. The program requires Floatlib to be on your calc, but the source is:

#include "header.z80"
cx   = vars
cy   = vars+4
delta= vars+8
zx   = vars+12
zy   = vars+16
zx2  = vars+20
zy2  = vars+24
x    = vars+28
y    = vars+29
bufptr=vars+30
mask = vars+32
lcd_y= vars+33
tmp  = vars+34
gbuf = 9340h
maxiter=16
start:
    in a,(2)
    add a,a
    sbc a,a
    and 3
    out (20h),a
    call init   ;initialize keyboard, LCD, variables
forx:
    ld a,64     ;set Y counter to 64
    ld (y),a
    ffmov(cy_init,cy)
    ffmov(delta_init,delta)
fory:
    in a,(1)    ;poll the keyboard for [Clear]
    and 40h
    ret z
    in a,(4)    ;poll for [ON]
    and 8
    ret z
    ffmov(cx,zx)
    ffmov(cy,zy)
    fmul(zx,zx,zx2)
    fmul(zy,zy,zy2)
    ld a,maxiter
    jp endwhile
startwhile:
    fmul(zx,zy,zy)
    ld hl,zy+3      ;multiply zy by 2 by incrementing the exponent. No worries of overflow, so we can do this cheaply
    inc (hl)
    fadd(cy,zy,zy)
    fsub(zx2,zy2,zx)
    fadd(cx,zx,zx)   
    fmul(zx,zx,zx2)
    fmul(zy,zy,zy2)
endwhile:
    dec a
    jr z,plotcalc
    fadd(zx2,zy2,tmp)
    ;fcmp(tmp,four)
    ld h,a
    ld a,(tmp+3)       ;check if tmp>=4. This happens if and only if the exponent of tmp is 82h or higher.
    cp 82h
    ld a,h
    jr c,startwhile
plotcalc:
    or a            ;plot the pixel if counter reached zero
    ld a,(mask)
    ld hl,(bufptr)
    jr nz,$+5
    or (hl)
    jr $+4
    cpl
    and (hl)
    ld (hl),a
    out (17),a
    ld de,12
    add hl,de
    ld (bufptr),hl
    fadd(cy,delta,cy)
    ld hl,y
    dec (hl)
    jp nz,fory
    fadd(cx,delta,cx)
    ld hl,(bufptr)
    ld a,(mask)
    dec h
    dec h
    dec h
    rrca
    ld (mask),a
    jr nc,+_
    inc l
    ld a,(lcd_y)
    inc a
    out (16),a
    cp 2Ch
    ld (lcd_y),a
_:
    ld (bufptr),hl
    ld hl,x
    dec (hl)
    jp nz,forx   
    ret
init:
    di
    ld a,80h        ;Program the LCD to move the index to the top
    out (16),a
    ;ld a,80h       ;load the mask for the pixel data. Commented since 'a' is already 80h
    ld (mask),a
    ld hl,gbuf      ;load the ptr to the actual pixel data
    ld (bufptr),hl
    ld a,96           ;set X counter
    ld (x),a
    ld hl,(cx_init) \ ld (cx),hl     ;load the starting value for cx
    ld hl,(cx_init+2) \ ld (cx+2),hl
    ld a,$FD        ;Program the keyboard to only poll for keys in the [Enter] through [Clear] range
    out (1),a
    ld a,20h
    ld (lcd_y),a
    out(16),a
    xor a
    ld hl,gbuf
    ld (hl),a
    ld de,gbuf+1
    ld bc,767
    ldir
    ret
cx_init:
    .db $00,$00,$80,$81 ;-2.0
;    .db $00,$00,$80,$80 ;-1.0
;    .db $51,$D6,$56,$7E ;.41960384
;    .db $00,$00,$A0,$80 ;-1.25
cy_init:
;    .db $00,$00,$80,$81 ;-2.0
    .db $00,$00,$80,$80 ;-1.0
;    .db $7C,$AA,$48,$7D ;.19596284
;    .db $00,$00,$80,$7E ;-.25
;48aa7C
delta_init:
;    .db $00,$00,$00,$7C ;.0625
    .db $00,$00,$00,$7B ;.03125
;    .db $1b,$29,$1d,$73 ;.00007494
;    .db $00,$00,$00,$79 ;1/128

.echo "Size: ",$-$9D95," bytes"

It is 386 bytes and I think it's reasonably fast for using floating point operations instead of integer or fixed-point arithmetic.

I also made a stand-alone Mandelbrot Set Explorer a few years ago, which I think is a lot cooler, but is 1647 bytes and you can't achieve as much of a zoom (this float-based program allows about 500x greater zoom). Then again, the explorer allows you to navigate the Mandelbrot set, zoom in and out, and change how many iterations are used, while the Floatlib one just draws a fixed image (though you can change the source and recompile).
Powered by EzPortal