I'm now looking at the project, and I have several optimization suggestions:
* first and foremost: unless you run into problems with optimized compilation (which is rare), you should compile with optimization (the -Os compiler flag). Besides speeding up the program, it also saves a great deal of space;
* on top of that, the -fomit-frame-pointer -mregparm=5 compiler flags save some more space;
* you really should dynamically allocate five huge variables: external_level_buffer (jm_levels.c) and the four LCD_SIZE'd variables from jm_graphics.c. You can allocate a single block and play with pointer arithmetic a bit. Currently, they are stored in the BSS section, but BSS is bad for both size and speed efficiency: directly, because of inefficient instructions plus relocation information, and indirectly, because it can prevent other optimizations. Getting rid of BSS was a major win for most programs I worked on, not just TICT programs.
* for now, using compressed relocations and compressed BSS references saves some more space. Once the above variables are dynamically allocated, the BSS becomes small enough to be merged into the main executable, and compressed BSS references become moot.
* you used the old, large version of IsVTI(); the newest version can be found inline in, e.g., https://github.com/debrouxl/gcc4ti/blob/next/trunk/tigcc/archive/gray.s or https://github.com/debrouxl/gcc4ti/blob/next/trunk/tigcc/archive/hw_version.s
* in GraySetScreenColor_R(): 1) move.l #0xffffffff,%d0 / %d5 would be much smaller and faster as moveq #-1,%d0 / %d5, 2) the andi.l instructions would be smaller and faster as a moveq of the mask into an additional register followed by and.l (this works when the mask fits in moveq's signed 8-bit range), 3) the cmpi.b instructions might be redundant because the andi.l (with a single-bit mask) already sets the CCR flags, 4) you should use explicit short branches (bra.s, beq.s, etc.);
* in GraySingleSprite8_COLOR_R(): 1) cmpi.w #0,%d4 is better written as tst.w %d4 (and you could even drop the tst.w %d4 if you load d6 before d4, so that the load of d4 itself sets the flags), 2a) given that you're not using the upper part of d5, you should use moveq # instead of move #, 2b) in fact, you could replace everything between __GraySingleSprite8_R__Test_WHITE and __GraySingleSprite8_R__Test_Finish with a single lsr.w #3,%d4 instruction (and thereby avoid using d5 at all), 3) you should use an explicit .l on the adda, 4) you should use explicit short branches.
This computer doesn't have GCC4TI binaries, so I'll have to build them... or use the other computer's binaries.
On my side, the current build stats for jumpman are:
Program Variable Size: 45249 Bytes
BSS Size: 47878 Bytes
Absolute Relocs: 712
Natively Emitted Relocs: 2
Relocs Removed by Branch Optimization: 299
Relocs Removed by Move Optimization: 211
Relocs Removed by Test Optimization: 5
Space Saved by Range-Cutting: 1110 Bytes