Hello there!!
While on CodeWalr.us chat, PT_ and I thought about a way to clear the screen of a TI83PCE/TI84+CE as fast as possible (in 8bpp mode)!
Here's the result :
FastClr:
ld de,$555555 ; will write byte 85 (= blue color)
or a
sbc hl,hl
ld b,217
di
add hl,sp ; saves SP in HL
ld sp,vram+76818 ; for best optimisation, we write 18 extra bytes (just past the end of the screen)
ClrLp: .fill 118,$d5 ; = 118 * "PUSH DE"
djnz ClrLp ; repeat 217 times
ld sp,hl ; restore SP
ei
16+4+8+8+4+4+16+217*(118*10+13)-5+4+4=258944 States !!! ;D
(the classic LDIR takes about 537600 states)
Imagine this routine relocated in the faster memory-area $e30800 !!! (faster again !!)
** EDIT **
A little faster !
FastClr:
ld de,$555555 ; will write byte 85 (= blue color)
or a
sbc hl,hl
ld b,213
di
add hl,sp ; saves SP in HL
ld sp,vram+76800 ; as a PUSH is decreasing SP, begin at end of 8bpp mode physical screen
ClrLp: .fill 120,$d5 ; = 120 * "PUSH DE"
djnz ClrLp ; repeat 213 times
.fill 40,$d5 ; 40 * "PUSH DE"
ld sp,hl ; restore SP
ei
16+4+8+8+4+4+16+213*(120*10+13)-5+40*10+4+4 = 258832 States =D
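Both sums above are easy to re-check by machine. Here is a minimal Python sketch that rebuilds them from the per-instruction figures used in the posts; the function name `push_clear` and the constant names are made up for illustration, and the 7 states per byte for LDIR is only inferred from the 537600 total quoted earlier.

```python
# Per-instruction state counts below are the figures used in the post's own
# sums, not independently measured values.
PUSH_DE = 10          # one PUSH DE
DJNZ_TAKEN = 13       # DJNZ when it loops back
DJNZ_LAST_SAVING = 5  # the final DJNZ falls through and is 5 states cheaper
SETUP = 16 + 4 + 8 + 8 + 4 + 4 + 16   # everything before ClrLp, in listing order
TEARDOWN = 4 + 4                      # ld sp,hl / ei
SCREEN_BYTES = 76800                  # 320 * 240 pixels at 8bpp


def push_clear(pushes_per_pass, passes, tail_pushes=0):
    """Return (states, bytes_written) for an unrolled PUSH DE clear."""
    states = (SETUP
              + passes * (pushes_per_pass * PUSH_DE + DJNZ_TAKEN)
              - DJNZ_LAST_SAVING
              + tail_pushes * PUSH_DE
              + TEARDOWN)
    bytes_written = 3 * (passes * pushes_per_pass + tail_pushes)
    return states, bytes_written


first = push_clear(118, 217)        # first version: 217 passes of 118 pushes
second = push_clear(120, 213, 40)   # edited version: 213 passes of 120, plus 40
ldir_states = SCREEN_BYTES * 7      # assumed 7 states per byte copied

print(first, second, ldir_states)
```

The byte counts confirm that the first version overshoots the screen by 18 bytes while the edited one writes exactly 76800.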
Indeed, using push/pop is the fastest way possible, but it is also very large. This trick was already used on the z80, for filling, clearing or anything else. The drawback is that interrupts are disabled, but that isn't a huge issue. Actually, the fastest way ever would require 25600 bytes :P (but it is already good like this, with a relatively small footprint at ~170 bytes, vs less than 10 for ldir).
Hm, I am curious whether this would be a viable replacement for the clear screen routine in Sprites and the C libraries? Better speed is always better, but would this increase the libs' size? Nice work regardless :)
Probably not, since it disables interrupts, and lib functions are interrupt-safe.
However, for programmers using ASM directly in their project and already manually handling interrupts, well... :)
(BTW grosged, Runer said push is 10 states, not 12)
Ah right, that could be an issue then >.<
Quote from: TheMachine02 on June 11, 2016, 02:45:32 PM
Actually, the fastest way ever would require 25600 bytes :P
Do you mean something like this?
ClrVeryFast:
ld hl, 0
ld (plotsscreen), hl
ld (plotsscreen+2), hl
ld (plotsscreen+4), hl
ld (plotsscreen+6), hl
ld (plotsscreen+8), hl
ld (plotsscreen+10), hl
...
ld (plotsscreen+764), hl
ld (plotsscreen+766), hl
ret
Wait, are loops actually this much slower in ASM too? O.O I thought that was just a TI-BASIC-specific flaw O.O
Loop unrolling is a common trick to gain speed at the cost of size since you spend less time decrementing, comparing and jumping.
Loops are not that slow in ASM, but loops cause overhead in every language. The speed difference may not even be noticeable, and in this case it's definitely not worth the additional memory requirements, but it is technically faster.
Well, it doesn't have to jump every time and calculate which loop it is on. Instead everything is hardcoded.
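To put a number on that overhead for the PUSH routine from earlier in the thread, here is a rough Python sketch. It reuses the thread's figures (10 states per PUSH DE, 13 per taken DJNZ, 5 fewer on the last one); `loop_overhead` is a hypothetical helper, and the model ignores that a real DJNZ counter only holds 8 bits.

```python
TOTAL_PUSHES = 25600   # 76800 screen bytes / 3 bytes per PUSH DE
PUSH_DE = 10           # states per PUSH DE (thread's figure)
DJNZ_TAKEN = 13        # states per DJNZ that jumps back (thread's figure)
DJNZ_LAST_SAVING = 5   # the last DJNZ falls through and costs 5 less


def loop_overhead(unroll):
    """States spent only on looping when `unroll` pushes sit in each pass.

    Pushes that do not fill a whole pass are assumed to be written out
    after the loop, as in the edited routine, so they add no loop cost.
    """
    passes = TOTAL_PUSHES // unroll
    if passes <= 1:
        return 0   # fully unrolled: no DJNZ needed at all
    return passes * DJNZ_TAKEN - DJNZ_LAST_SAVING


work = TOTAL_PUSHES * PUSH_DE
for unroll in (1, 8, 120, 25600):
    extra = loop_overhead(unroll)
    print(unroll, extra, round(100 * extra / work, 2), "% of the useful work")
```

With one push per pass the loop costs more than the pushes themselves; at 120 pushes per pass it is down to about 1%, which is why going all the way to 25600 bytes buys almost nothing.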
Ah I see. I just thought it was TI sucking <_<
This is why the 83+ version of GalagACE used 12 Output commands to draw 12 ships instead of two For loops and 1 Output command.
Ah yes, Push does not take 12 but only 10 !! (I've checked)
Thanks, Adriweb...and Runer ;)
I also modified "PUSH IX/IY" which takes 14 states (not 16)
Has this actually been timed on calc? The ez80 'sort of' has some pipelining features that could introduce some benefits for certain instruction combinations.
This morning, I've just manually measured both methods : "LDIR" and "PUSH"
I used http://online-stopwatch.chronme.com/ , my TI83PCE (freshly "Ram cleared", unplugged)
Here are the 2 programs, each of which clears the screen 10,000 times!
First, the classic method "LDIR"...
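ld a,$27
ld ($e30018),a ; LCD control register: switch to 8bpp mode
ld bc,10000 ; number of screen clears
BigLp: push bc
;----------------------------------------------------------------
; di ; optional: the timing is the same with or without interrupts
ld hl,$d40000 ; start of vram
ld de,$d40001
ld (hl),85 ; seed byte, which LDIR then copies across the screen
ld bc,76799
ldir
; ei ; optional, pairs with the di above
;-----------------------------------------------------------------
pop bc
dec bc
ld a,b
or c
jp nz,BigLp
ld a,$2d
ld ($e30018),a ; back to the default 16bpp mode
ret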
which takes (with or without interrupts!) 1 minute and 59 seconds
Then, the method "PUSH" ...
ld a,$27
ld ($e30018),a ; LCD control register: switch to 8bpp mode
ld bc,10000
BigLp: push bc
;-----------------------------------------------------------------------
ld de,$555555 ; will write byte 85 (= blue color)
or a
sbc hl,hl
ld b,213
di
add hl,sp ; saves SP in HL
ld sp,vram+76800 ; begin at end of 8bpp mode physical screen
ClrLp: .fill 120,$d5 ; = 120 * "PUSH DE"
djnz ClrLp ; repeat 213 times
.fill 40,$d5 ; 40 * "PUSH DE"
ld sp,hl ; restore SP
ei
;------------------------------------------------------------------------
pop bc
dec bc
ld a,b
or c
jp nz,BigLp
ld a,$2d
ld ($e30018),a ; back to the default 16bpp mode
ret
which takes ... 58 seconds !!! ;D
And if we relocate the main routine in $e30800, time will decrease to 51 seconds !!!
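For comparison with the stopwatch numbers, here is a small Python sketch that turns the state counts into expected seconds. It assumes a 48 MHz CPU clock and ignores both the outer loop and any memory wait states, so treat the predictions as lower bounds rather than exact values; `predicted_seconds` is just an illustrative helper.

```python
CLOCK_HZ = 48_000_000   # assumption: eZ80 clock of the CE calculators
RUNS = 10_000           # screen clears per test program


def predicted_seconds(states_per_clear):
    """Idealised run time: no wait states, outer-loop overhead ignored."""
    return states_per_clear * RUNS / CLOCK_HZ


ldir_predicted = predicted_seconds(537600)   # state count quoted for LDIR
push_predicted = predicted_seconds(258832)   # state count of the edited routine

# Stopwatch results reported above, in seconds.
ldir_measured = 119    # 1 minute and 59 seconds
push_measured = 58
push_relocated = 51    # routine copied to $e30800

print(round(ldir_predicted, 1), round(push_predicted, 1))
print(round(ldir_measured / push_measured, 2))
print(round(ldir_measured / push_relocated, 2))
```

Under these assumptions the measured times sit a few seconds above the idealised ones, and the PUSH version comes out a bit more than twice as fast as LDIR.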
Wow, that's some impressive gain. Good job ;D
Wow, that's twice as fast. Good job! :)
Better method, again ;D
Yesterday I discussed with PT_ another method :
The aim is to clear the screen while generating code at the same time!!
"push de" is encoded as the single byte $d5.
With "ld de,$d5d5d5", each "push de" writes three $d5 bytes, i.e. three new "push de" instructions!! That's the trick :)
In 8bpp mode, we have to clear 76800 bytes; using PUSHs, we need 76800/3 = 25600 PUSHs.
As each PUSH creates 3 PUSHs, we only need to execute 1/4 of 25600 = 6400 PUSHs ourselves :)
Then we jump into this huge block of 19200 $d5 bytes and run it as code to clear the remaining 3/4!!
Of course, we put "ld sp,hl \ ei \ ret" at the very end of that block, so that the routine can exit ;)
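The bookkeeping of this trick can be checked with a toy model. The Python sketch below is not an emulator: it only models a downward-growing stack that writes 3 bytes per push and a program counter walking upward through the generated $d5 bytes. The function and variable names are invented for the illustration.

```python
SCREEN = 76800                    # bytes of 8bpp screen; index 0 = start of vram
PUSH_DE = 0xD5                    # opcode of "push de"
EXIT_CODE = (0xF9, 0xFB, 0xC9)    # ld sp,hl / ei / ret, lowest address first


def simulate():
    mem = bytearray(SCREEN + 3)   # screen plus 3 bytes for the exit code
    sp = SCREEN + 3               # stack pointer starts just past the exit code

    def push(three_bytes):
        nonlocal sp
        sp -= 3
        mem[sp:sp + 3] = bytes(three_bytes)

    push(EXIT_CODE)                       # sits right after the screen
    for _ in range(6400):                 # the pushes the routine runs itself
        push((PUSH_DE, PUSH_DE, PUSH_DE))

    pc = SCREEN - 6400 * 3                # jump into the bytes just written
    executed = 0
    while mem[pc] == PUSH_DE:             # each generated byte is a push de
        push((PUSH_DE, PUSH_DE, PUSH_DE))
        pc += 1
        executed += 1
    return mem, sp, pc, executed


mem, sp, pc, executed = simulate()
print(executed, sp, pc)
```

In this model the generated code runs 19200 pushes, the stack pointer lands exactly on the first screen byte, and execution stops on the exit code right after the screen.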
Here's the routine:
ld bc,$c9fbf9 ; to write "ld sp,hl \ ei \ ret"
ld de,$d5d5d5 ; $d5 = opcode of "push de"
or a ; by PUSHing $d5d5d5, we create code
sbc hl, hl ; (PUSHs that create PUSHs !!)
di
add hl, sp ; saves SP in HL
ld sp,$D52C03
push bc
ld b,52
PushLp: .fill 123,$d5 ; 6400 = 52*123+4
djnz PushLp ; here we "PUSH DE" 6400 times ( = 1/4 of the screen clear)
push de ; then we jump into them!! (since they are code too!)
push de ; (to carry on clearing the remaining 3/4 of the screen)
push de
push de
push de
jp $d52C00-(6400*3)
length = 153 bytes only !
16+16+4+8+4+4+16+10+8+(123*10+13)*52-5+10+10+10+10+17+19200*10+4+4+21= 256803 states !!!
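The size and the state count can both be re-derived from the listing. This Python sketch groups the terms of the sum by phase; the state figures are the ones from the post, the byte sizes assume 24-bit immediates (4-byte `ld rr,imm` and `jp`), and all names are made up for illustration.

```python
# (instruction, size in bytes, how many times it appears in the listing)
LISTING = [
    ("ld bc,imm24", 4, 1), ("ld de,imm24", 4, 1), ("or a", 1, 1),
    ("sbc hl,hl", 2, 1), ("di", 1, 1), ("add hl,sp", 1, 1),
    ("ld sp,imm24", 4, 1), ("push bc", 1, 1), ("ld b,n", 2, 1),
    ("push de (in loop)", 1, 123), ("djnz", 2, 1),
    ("push de (after loop)", 1, 4), ("jp imm24", 4, 1),
]
size_bytes = sum(size * count for _, size, count in LISTING)

# State counts, using the terms of the post's sum in listing order.
setup = 16 + 16 + 4 + 8 + 4 + 4 + 16 + 10 + 8   # up to and including ld b,52
seed_loop = (123 * 10 + 13) * 52 - 5            # 52 passes of 123 pushes + djnz
seed_tail = 4 * 10                              # the 4 pushes after the loop
jump = 17                                       # jp into the generated code
generated = 19200 * 10                          # pushes created by the pushes
exit_code = 4 + 4 + 21                          # ld sp,hl / ei / ret
total_states = setup + seed_loop + seed_tail + jump + generated + exit_code

saved = 258832 - total_states   # versus the 213*120+40 version
print(size_bytes, total_states, saved)
```

So the self-generating version saves only about two thousand states over the previous one, but it does so in a slightly smaller routine.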
The constraint is we must clear using byte $d5
But that may not be a problem as , in 8bpp mode, we can modify the palette ;)
pretty impressive, nice job
At this rate, you'll have a clear screen routine that is so fast that it will take a negative amount of time to execute, causing time travel of some sort... O.O
I like the idea of using the instruction byte as the cleared index, very clever :).
This is amazing O.O
wont all the pushes cause a stack overflow?
:-X
Quote from: c4ooo on July 05, 2016, 06:22:57 AM
This is amazing O.O
wont all the pushes cause a stack overflow?
This is literally a modified example from asm in 28 days. If you read the section, it will describe more of what it is doing :)
Of course, I do like the code creation and then executing aspect. Although it would be difficult to implement correctly, it is pretty neat.
Quote from: MateoConLechuga on July 10, 2016, 12:51:41 AM
:-X
Quote from: c4ooo on July 05, 2016, 06:22:57 AM
This is amazing O.O
wont all the pushes cause a stack overflow?
This is literally a modified example from asm in 28 days. If you read the section, it will describe more of what it is doing :)
Of course, I do like the code creation and then executing aspect. Although it would be difficult to implement correctly, it is pretty neat.
Ohh you mean page 10? http://tutorials.eeems.ca/ASMin28Days/lesson/day10.htm
TBH I never fully read that guide, merely skimmed over the pages that I needed :P