[gLib][3d][z80][ez80] gLib a fast 3D asm/axiom library

TheMachine02 · September 18, 2016, 08:37:29 PM

Well, texturing should be about twice as fast as the alpha screenshot. I'm aiming for ~200 TStates per pixel (and want less, of course!) vs. the 450 TStates per pixel of the alpha.

TheMachine02 · September 22, 2016, 12:32:21 PM

triangle flat filling

advanced maths functions

depth sort

Framebuffer

Texture

clipping

As you can see, not everything has been implemented, but once it is, there will be a solid base for 3D on the CE.

EDIT: some progress on texture mapping:

Texture gradients are hardcoded for now and a lot of things aren't properly implemented, but it is on the right track.
It needs 4 divisions for setup though, so about 2000 ticks. The inner loop takes from 126 to 136 TStates, which gives a fillrate of about 350 KPxl/s.

TheMachine02 · September 25, 2016, 11:39:05 AM

Sooooo this is technically a triple post, but anyway

Texturing now works (and there are no hardcoded values, this is the full routine!). There is quite a lot of optimisation left to do, but that is less important right now. The maximum texture size is 127x255, which allows quite a lot. (A 127x255 texture is 32385 bytes.) I will now focus on clipping and on handling model seams correctly (that is, when one vertex in a model has several different texture coordinates). Texturing isn't finished either: I want to implement a BC1-type texture compression (ratio of 1/2) with decompression during texel fetch.

As for speed, the inner texture loop is exactly 114 TStates, +10 TStates when changing texture coordinate. That is a theoretical 421 KPxl/s. Of course, this isn't reachable in practice, but it gives a good indication of what is possible.
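
For reference, these fillrate figures are just the clock rate divided by the cost per pixel. A minimal sketch in C, assuming the 48 MHz clock of the CE's ez80 (the clock is not stated in this thread, but it matches the numbers quoted); `CPU_HZ` and `peak_fillrate` are names made up for the illustration:

```c
#include <assert.h>

/* Assumption: the ez80 runs at 48 MHz (not stated in the thread). */
#define CPU_HZ 48000000UL

/* Upper bound on pixels per second when every pixel costs
   `tstates_per_pixel` clock cycles and nothing else takes time. */
static unsigned long peak_fillrate(unsigned long tstates_per_pixel)
{
    assert(tstates_per_pixel != 0);
    return CPU_HZ / tstates_per_pixel;
}
```

With 114 TStates this gives about 421 thousand pixels per second, with 136 about 353 thousand, and with the extra 10 TStates for a coordinate change (124) about 387 thousand. Per-triangle setup is ignored, so these are upper bounds.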

catastropher · September 25, 2016, 03:16:55 PM

Nice work! I can't wait to see this once it's done! I have two questions:

Is this doing affine texture mapping or perspective correct? X3D does affine, but at one point I had it doing the perspective correction every 16 pixels (just like Descent did)
Is this only for drawing triangles or can it handle other polygons as well? You can use the polyline property of convex polygons to split a polygon into two polylines and then draw each scanline by walking down the edges. This saves a lot of time because e.g. to draw a hexagon you only need one poly instead of 4.

A lot of people think you have to sort all the points in the polygon or something, but I just use this method that I came up with (which runs in linear time):

[spoiler]

typedef struct X3D_PolyVertex {
    X3D_Vex2D v2d;
    int16 intensity;
    int32 u, v;
    int32 z;
} X3D_PolyVertex;

typedef struct X3D_PolyLine {
    uint16 total_v;
    X3D_PolyVertex** v;
} X3D_PolyLine;

_Bool x3d_polyline_split(X3D_PolyVertex* v, uint16 total_v, X3D_PolyLine* left, X3D_PolyLine* right) {
    // Force the polygon to be clockwise
    x3d_polyvertex_make_clockwise(v, total_v);
    
    int16 top_left = 0;
    int16 top_right = 0;
    int32 max_y = v[0].v2d.y;
    
    int16 i;
    // Find the top left point, the top right point, and the maximum y value
    for(i = 1; i < total_v; ++i) {
        if(v[i].v2d.y < v[top_left].v2d.y) {
            top_left = i;
            top_right = i;
        }
        else if(v[i].v2d.y == v[top_left].v2d.y) {
            if(v[i].v2d.x < v[top_left].v2d.x)    top_left = i;
            if(v[i].v2d.x > v[top_right].v2d.x)   top_right = i;
        }
        
        max_y = X3D_MAX(max_y, v[i].v2d.y);
    }
    
    left->total_v = 0;
    right->total_v = 0;
    
    // Grab the points for the left polyline
    do {
        left->v[left->total_v] = v + top_left;
        top_left = (top_left + 1 < total_v ? top_left + 1 : 0);
    } while(left->v[left->total_v++]->v2d.y != max_y);
    
    // Grab the points for the right polyline
    do {
        right->v[right->total_v] = v + top_right;
        top_right = (top_right != 0 ? top_right - 1 : total_v - 1);
    } while(right->v[right->total_v++]->v2d.y != max_y);
    
    return left->total_v > 1 && right->total_v > 1;
}

[/spoiler]

TheMachine02 · September 25, 2016, 03:31:41 PM

The routine definitely doesn't do any perspective correction, for speed's sake. The poor ez80 can't divide on its own, and a division costs about 500 TStates. That would mean about 30 TStates/pxl more, and that is without counting the fact that it would destroy many registers that the routine would then need to restore.
It only draws triangles, as drawing polygons might be quite hard to get right in pure ez80 asm. I'll look into your method, which looks quite interesting.
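
For readers unfamiliar with the trick catastropher mentions, here is a rough sketch in C of perspective correction every 16 pixels: u/z, v/z and 1/z are stepped linearly, the true divide happens only at sub-span boundaries, and texels in between are interpolated affinely. It is not taken from X3D, Descent or gLib; `SpanEnd`, `span_subdiv` and `SUBSPAN` are hypothetical names, and floats are used only for readability (an ez80 version would need fixed point).

```c
#include <assert.h>
#include <stdint.h>

#define SUBSPAN 16  /* pixels between true perspective divides (assumed) */

/* Hypothetical span endpoint: these three values are linear in screen space. */
typedef struct {
    float u_over_z, v_over_z, inv_z;
} SpanEnd;

/* Fill `count` pixels. `a` is the value at pixel 0, `b` the value at pixel
   `count` (one past the last). The texture is tex_size x tex_size texels.
   No wrapping or clamping: the caller keeps u and v inside the texture. */
void span_subdiv(uint8_t *dst, int count, SpanEnd a, SpanEnd b,
                 const uint8_t *texels, int tex_size)
{
    assert(count > 0 && tex_size > 0);

    /* Per-pixel gradients of the linear quantities. */
    float duz = (b.u_over_z - a.u_over_z) / (float)count;
    float dvz = (b.v_over_z - a.v_over_z) / (float)count;
    float diz = (b.inv_z - a.inv_z) / (float)count;

    /* True texture coordinates at the start of the first sub-span. */
    float u0 = a.u_over_z / a.inv_z;
    float v0 = a.v_over_z / a.inv_z;

    for (int x = 0; x < count; ) {
        int n = (count - x < SUBSPAN) ? count - x : SUBSPAN;

        /* Jump the linear quantities to the end of this sub-span. */
        a.u_over_z += duz * (float)n;
        a.v_over_z += dvz * (float)n;
        a.inv_z += diz * (float)n;

        /* The expensive part: one pair of divides per sub-span. */
        float u1 = a.u_over_z / a.inv_z;
        float v1 = a.v_over_z / a.inv_z;

        /* Plain affine interpolation inside the sub-span. */
        float du = (u1 - u0) / (float)n;
        float dv = (v1 - v0) / (float)n;
        float u = u0, v = v0;
        for (int i = 0; i < n; ++i) {
            dst[x + i] = texels[(int)v * tex_size + (int)u];
            u += du;
            v += dv;
        }

        u0 = u1;
        v0 = v1;
        x += n;
    }
}
```

A ~500 TState divide spread over 16 pixels is about 31 TStates per pixel, which seems to be where the estimate above comes from.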

Dream of Omnimaga · September 25, 2016, 11:32:45 PM

Ideally, you should avoid splitting your lib too much, because people will get confused about what Iris3D is and what gLib does, like with various other software. I'm glad you got textures working by the way. I can't wait for animated eye-candy.

TheMachine02 · September 26, 2016, 07:35:38 PM

So, there is progress :

Quite a lot of textured polys drawn, if you ask me. The UV texture coordinates are kind of messed up in some models due to the lack of texture repeat. Anyway, it looks good. I'll try to speed this up to the speed of light of course, but I don't promise anything. (Furthermore, the model was made to use two textures, and I haven't implemented texture switching yet.)

As for the split, well, I dunno. I can set things up as a modular environment, in which case the library will simply be renamed Iris3D on the ez80 platform (but stay gLib on b&w). That way, the developer directly chooses which functions to use, and only the functions used are included in the final program.

catastropher · September 26, 2016, 09:09:33 PM

Nice work so far!

I'm sure you know this, but I'm just throwing it out there. If you have a texture dimension that is a power of two (2^n), bitwise ANDing the u and v values with (2^n - 1) will give you texture wrapping. So your loop would look something like this (this isn't real code, I just wrote it as an example):

[spoiler]

typedef struct {
    int x;
    fp16x16 u, v;   // 16.16 fixed point
    fp16x16 z;
} ScanlineValue;

typedef struct {
    unsigned char size;
    unsigned char* texels;
} Texture;

void draw_scanline(ScanlineValue* left, ScanlineValue* right, Texture* tex, int y, short* zbuf) {
    int dx = (right->x - left->x != 0 ? right->x - left->x : 1);
    
    fp16x16 du = (right->u - left->u) / dx;
    fp16x16 dv = (right->v - left->v) / dx;
    fp16x16 dz = (right->z - left->z) / dx;
    
    fp16x16 u = left->u;
    fp16x16 v = left->v;
    fp16x16 z = left->z;
    
    for(int x = left->x; x <= right->x; ++x) {
        int zz = z >> 16;
        
        if(zz < zbuf[y * SCREEN_WIDTH + x]) {
            int uu = (u >> 16) & (tex->size - 1);
            int vv = (v >> 16) & (tex->size - 1);
            
            screen[y * SCREEN_WIDTH + x] = tex->texels[vv * tex->size + uu];
            zbuf[y * SCREEN_WIDTH + x] = zz;
        }
        
        u += du;
        v += dv;
        z += dz;
    }
}

[/spoiler]

TheMachine02 · September 27, 2016, 02:08:35 PM

Well, the issue isn't that I don't know how to do wrapping, it's that doing it would be way too slow.

I don't recalculate the real texture address in the inner pixel loop, I merely offset it. As such, it is quite a bit faster (but there is no wrapping). Actually, the only triangles that would pose a problem are those starting at one side of the texture and finishing at the other... but there aren't many triangles doing so.

EDIT: I might have fixed the texture coordinate bug in my converter. This is part of the model of Ashe from FF12 (about 1300 triangles).

Snektron · September 27, 2016, 03:07:06 PM

Shaders when

Looks very good btw.

TheMachine02 · September 27, 2016, 03:28:10 PM

Well, pixel shaders won't be implemented anytime soon, because a call is really slow.

Anyway, I could do vertex/geometry shaders.

Dream of Omnimaga · September 27, 2016, 04:07:16 PM

I am impressed at the speed those textured models run at.

TheMachine02 · September 27, 2016, 04:56:37 PM

Quote from: DJ Omnimaga on September 27, 2016, 04:07:16 PM
I am impressed at the speed those textured models run at.

Yeah, the ez80 still packs some punch. I wish they'd put 1-wait-state RAM everywhere though >_>. I hope to make it run faster, however, by creating LUTs for some of the setup operations and with general optimisation (and there is a lot to be found in this routine!). Also keep in mind there is no perspective correction, so textures can sometimes appear completely off.

Anyway, chocobo are cute <3

EDIT : dark shiva from ff10 isn't bad either. This is about 2500 triangles

(far fewer are displayed though, with backface culling)

catastropher · September 28, 2016, 02:06:57 AM

When I was writing the texture mapper for X3D, I figured out a way to avoid doing the texels[v * tex_w + u] calculation, or any sort of complex logic to update an internal pointer. I just pre-multiply v by the width of the texture and then apply a bitmask to v so that only whole multiples of tex_w are counted (since v is interpolated, you could otherwise end up with e.g. v = 2.5 * tex_w, which would give you two and a half scanlines, which is wrong):

// Texture size of 64 -> 6 bits
const int TEX_BITS = 6;

// Fixed point precision for u and v
const int FRAC_BITS = 10;

fp du = ((right->u - left->u) << FRAC_BITS) / dx;
fp u = left->u << FRAC_BITS;

// Premultiply v by the width of the texture (make sure the type of v is big enough!)
fp dv = ((right->v - left->v) << (FRAC_BITS + TEX_BITS)) / dx;
fp v = left->v << (FRAC_BITS + TEX_BITS);

...

// Mask because only whole multiples of tex_w are meaningful (i.e. whole scanlines of texels)
const int v_mask = ((1 << TEX_BITS) - 1) << (FRAC_BITS + TEX_BITS);

for(int i = left->x; i <= right->x; ++i) {
    int index = ((v & v_mask) + u) >> FRAC_BITS;
    
    *screen_ptr = texels[index];
    
    ++screen_ptr;
    u += du;
    v += dv;
    ...
}

Using a more complex mask, this can also be used for super fast texture wrapping. Let me know if you're interested and I can give you more details!

TheMachine02 · September 28, 2016, 06:47:16 AM

Well, I need to see how I can convert this method to asm

I use a slightly different method currently: I add int(du)+tex_size*int(dv) and update a fractional part with frac(du)*65536+frac(dv). When the register containing the fractional part produces a carry, I offset the texture coordinate by one, and if a bit is rolled into the mid register (i.e. the High register), it is offset by tex_size.
I use this method because the ez80 is quite limited on registers and it lets me avoid reloading the gradients into registers on each loop.
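
One possible reading of this stepping scheme, sketched in C rather than asm. This is an interpretation, not gLib's actual routine: gradients are assumed to be 8.8 fixed point, the two fractional accumulators are kept in separate variables instead of being packed into one 24-bit register, and `split_8_8` and `textured_span` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split an 8.8 fixed-point gradient into floor(integer part) and a
   fraction in [0, 255]. Works for negative gradients too. */
static void split_8_8(int g, int *whole, unsigned *frac)
{
    int f = ((g % 256) + 256) % 256;
    *frac = (unsigned)f;
    *whole = (g - f) / 256;  /* exact division, i.e. floor(g / 256) */
}

/* Copy `count` texels to `dst`, starting at `texel` and moving by
   (du, dv) per pixel through a texture that is `tex_w` texels wide.
   No wrapping or clamping is done. */
void textured_span(uint8_t *dst, int count, const uint8_t *texel,
                   int tex_w, int du, int dv)
{
    int whole_u, whole_v;
    unsigned frac_du, frac_dv;

    assert(count >= 0 && tex_w > 0);
    split_8_8(du, &whole_u, &frac_du);
    split_8_8(dv, &whole_v, &frac_dv);

    /* int(du) + tex_size * int(dv): one pointer add per pixel. */
    ptrdiff_t whole_step = whole_u + (ptrdiff_t)whole_v * tex_w;
    unsigned frac_u = 0, frac_v = 0;

    for (int i = 0; i < count; ++i) {
        dst[i] = *texel;
        if (i + 1 == count)
            break;  /* do not step the pointer past the last texel read */

        texel += whole_step;

        frac_u += frac_du;
        if (frac_u > 0xFF) {  /* carry out of the u fraction: one texel */
            frac_u -= 0x100;
            texel += 1;
        }

        frac_v += frac_dv;
        if (frac_v > 0xFF) {  /* carry out of the v fraction: one row */
            frac_v -= 0x100;
            texel += tex_w;
        }
    }
}
```

Stepping a texture 8 texels wide from its first texel with du = 1.5 and dv = 0.5 (0x0180 and 0x0080) reads offsets 0, 1, 11, 12. Like the method described above, nothing here wraps or clamps, so the caller has to keep the span inside the texture.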

EDIT :