Andrzej unjello Lichnerowicz

Polish Flag Size Coding Compo

2024-02-27T19:11:03+02:00

Throwback

I wrote this post back in February this year, but wanted to have nice disassembly syntax highlighting, so I made an MR first against Chroma that Hugo uses for syntax highlighting, and waited for Hugo to pick up Chroma with my changes, and in the meantime I have sort of forgot about the whole thing.. I haven’t written anything since then, so no biggie but wanted to clarify, just in case anyone actually reads this blog and wonders how come the post suddenly appeared in September with date in the past :)

A trip down memory lane. It’s 2002. I’m sitting in the dimly lit Heidmarkhalle in Fallingbostel, Germany. It feels great to be back at the Mekka & Symposium demoscene party.

Heidmarkhalle during Mekka & Symposium demoparty. Glowing CRT monitors in the dark

The room is shrouded in darkness, except for the glow of the computer screen illuminating my face. Ah, the good old days, when lugging a PC tower and a CRT monitor to parties was the norm! That year marked a record in the party’s history, boasting the highest number of PC demo submissions ever. Over 50 entries!

At some point, a organizers announced a bonus competition – the Polish flag compo! Organized by mados and tomaes of TAP, a small German demoscene group, this challenge sparked my interest. Even though the official Demozoo records don’t list the results, I managed to unearth them buried deep within the interwebs on the TAP webpage.

I accepted the challenge, with a bit of help from payda/Procreation and a couple others whose names, unfortunately, escape me after all these years. Here’s what I came up with

00000000  B013              mov al,0x13
00000002  CD10              int 0x10
00000004  B9027D            mov cx,0x7d02
00000007  51                push cx
00000008  6800A0            push word 0xa000
0000000B  07                pop es
0000000C  B00F              mov al,0xf
0000000E  F3AA              rep stosb
00000010  59                pop cx
00000011  B004              mov al,0x4
00000013  F3AA              rep stosb
00000015  CD16              int 0x16

and boy, did it look good:

A screenshot of my entry to the Polish flag bonus compo

A mere 23 bytes. Impressive, right? Curious if I could optimize it further, I consulted ChatGPT, which suggested this was as compact as it could get. However, the competition results suggested otherwise:

    
 #.. | handle..           | size.. | time. | additional remarks..
 1.  | bup / fuzzion      |     18 | 18:22 | scratches the rules a little bit, but yeah, 18 bytes finally
 2.  | Avoozl/ECFh        |     19 | 17:54 | text mode again
 3.  | PigPen^oCCult      |     19 | 18:28 | you made it
 4.  | muhmac^freestyle   |     19 | 19:12 | it's possible in mode 13h too, and 17 byte using dark gray
 5.  | lsd/MeltDown       |     19 | 19:19 | your 18 byte crashs
 6.  | Plex / BionFX      |     19 | 19:50 | goes
 7.  | MadMan/TAP         |     21 | 18:14 | good trick
 8.  | Seven /            |     22 | 15:46 | accepted
 9.  | baNa               |     22 | 17:31 | ok
 10. | angelo/exodus      |     23 | 16:41 | very good
 11. | Ctulhu/Headcrash   |     23 | 18:03 | your 22 byte is gray :(
 12. | RufUsul            |     23 | 19:33 | yo
 13. | cydo               |     23 | 19:37 | your 17 byte crashs, a ret would be nice
 14. | salt / kolabore    |     25 | 19:40 | clean flag
 15. | DaGrilling/#Define |     26 | 18:42 | still clear
 16. | rawhed/moojuice    |     27 | 16:58 | odd flag but ok
 17. | bastiw             |     27 | 19:16 | good flag
 18. | MiZC /CMOV         |     34 | 19:24 | clean flag
 19. | Topy44             |    512 | 16:24 | it needs ansi.sys
 20. | Gunhed             |   9216 | 16:51 | visual basic
 21. | mellon             |  14336 | 16:13 | directx and upx ;-)
 22. | puzzler/kekse      |  20772 | 14:16 | a puzzle game %-)
 23. | jack c./faque      | 238592 | 15:48 | huh, it's waving
 -   | vax|centric        |      - | 16:28 | shows a black screen
 -   | odi^hicknhack      |      - | 19:49 | working 23 byte version was to late

 

9 entries better than me! (of course I already worked through that dissapointment, it was 22 years ago after all:). The first place entry was by bup/fuzzion. His submission bent the rules more than just a bit:

Screen of bup’s entry. A screen should be half white, half red. It is 99% white.

Without prior knowledge of the theme, one might never guess it represented the Polish flag. But, moving past this cheeky observation, I’ve set my sights on matching the 19-byte record.

Public Service Announcement

Before diving deeper, I must confess something I overlooked until writing this post – my program lacked a ret instruction, and its functioning was purely coincidental. Since all com files load at the same memory location and some participants included a ret in their larger solutions, my program inadvertently stumbled upon this missing instruction, courtesy of these unsung heroes. Not all heroes wear capes, indeed :)

So can I do better in 2024?

I wished I new how to do it myself by now, but improving my entry will require a bit of reverse engineering of the other submissions. Looking at those that managed 19 bytes, only mumac and PigPen achieved this feat in graphics mode. The rest relied on text mode. Eager to understand their advantage, I discovered a clever trick both used to eliminate the need for a second loop. Instead of directly setting colors with values 15 (white) and 4 (red) and implementing two distinct loops, one can subtract 11 from white and use jns to control the loop. This approach works because the subtraction will trigger the sign flag on the second iteration (4 - 15 = -11):

00000000  B013              mov al,0x13
00000002  CD10              int 0x10
00000004  B9027D            mov cx,0x7d02
00000007  51                push cx
00000008  6800A0            push word 0xa000
0000000B  07                pop es
0000000C  B00F              mov al,0xf
0000000E  F3AA              rep stosb
00000010  59                pop cx
00000011  B004              mov al,0x4
00000013  F3AA              rep stosb
00000015  CD16              int 0x16

there’s a nice trick they both used to eliminate the second loop. Instead of using directly values 15 for white and 4 for red and having two explicit loops, we can substract 11 from white, and use jns for loop. For the first time we will still be zero, for second time we will break out because substraction will set the sign flag:

bits 16
section .text
org 100h

    mov al,0x13
    int 0x10
    mov al,0xf
    push word 0xa000
    pop es
loop:
    mov cx,0x7d02
    rep stosb
    sub al,0xb
    jns loop
    int 0x16
    ret

By replacing the need for push cx and pop cx with sub and jns, we save 2 bytes, though adding ret brings us to 22 bytes. The next byte we can shave off is here:

00000000  B013              mov al,0x13
00000002  CD10              int 0x10
00000004  B00F              mov al,0xf
00000006  6800A0            push word 0xa000
00000009  07                pop es
0000000A  B9027D            mov cx,0x7d02
0000000D  F3AA              rep stosb
0000000F  2C0B              sub al,0xb
00000011  79F7              jns 0xa
00000013  CD16              int 0x16
00000015  C3                ret

And we can achieve it with a small compromise. Why strive for perfection? We can adjust the half-byte count from 0x7D02 to 0x7D00, allowing us to use mov ch, which is one byte shorter than mov cx:

bits 16
section .text
org 100h

        mov al,0x13
        int 0x10
        mov al,0xf
        push word 0xa000
        pop es
loop:
        mov ch,0x7d
        rep stosb
        sub al,0xb
        jns loop
        int 0x16
        ret

The final bytes require some more background. Here’s what we are going to tackle:

00000000  B013              mov al,0x13
00000002  CD10              int 0x10
00000004  B00F              mov al,0xf
00000006  6800A0            push word 0xa000
00000009  07                pop es
0000000A  B57D              mov ch,0x7d
0000000C  F3AA              rep stosb
0000000E  2C0B              sub al,0xb
00000010  79F8              jns 0xa
00000012  CD16              int 0x16
00000014  C3                ret

I was staring at how the 19 byte versions got rid of pushing 0xA000 to es to write to video memory, not really understanding what’s going on. And then I did reach out to the only person I knew who might have had a slightes idea on this level of assembly optimization that I still had any contact with: ryg/farbraush. Not only he replied after all the years, but he did provide entire historical context!

So, I’ve already touched on how all .com files start at the same memory location, which is why we use the org 100h directive, right? But, for some reason, it never dawned on me to question what resides in those initial 256 bytes. Had I pondered this sooner, I would’ve stumbled upon something known as the Program Segment Prefix (PSP).

Here’s a brief overview of what the PSP contains at the start:

Offset Size Contents
00h-01h 2 bytes CP/M-80-like exit
02h-03h 2 bytes Segment of the first byte beyond the memory allocated to the program

Let’s unwrap this a little.

  1. CP/M-80-like exit, is a regular int 20h instruction, binary coded as 0xCD 0x20. It is a thing from the old days of automated software translation from CP/M, explained nicely by one-and-only Raymond Chen.
  2. This segment is usually upper memory segment limit, so 0x99 0xFF or close to it. Which is very close to our 0xA000!

This idea was apparently devleoped first by pascal/CubicTeam, first used in his parallax starfield 20-byte intro – an optimized version of 23 byte parallax starfield:

.model tiny

.code

org 100h

start:
  mov al,13h    ;//ah assumed to be 0. ok.
  int 10h

  pop sp        ;//com loader pushes 0, => sp == 0
  pop cx        ;//2 instruction bytes for int 20h => cx = 20CDh
                ;//pop something else for less stars
  pop ds        ;//we need the video segment 0A000h, here we
                ;//find 9FFFh (upper memory segment limit) on most machines.
                ;//this is not clean, but who cares...

it also turnes out, I couldn’t have reached out to a better person, because it was most likely ryg himself who found a way make it even more efficient. Instead of doing 3 pop introductions that take 3 bytes, we can do lea and have same effect in just 2 bytes:

00000000  B013              mov al,0x13
00000002  CD10              int 0x10
00000004  B00F              mov al,0xf
00000006  C417              les dx,[bx]
00000008  B57D              mov ch,0x7d
0000000A  F3AA              rep stosb
0000000C  2C0B              sub al,0xb
0000000E  79F8              jns 0x8
00000010  CD16              int 0x16
00000012  C3                ret

Screenshot of 19 bytes version. You can see a small shift between white and red, because of LDA ES,[BX] trick

This subtle yet effective modification not only honors the original challenge but also showcases the beauty of size coding. If it looks good, it is good. Even in a field as precise as assembly, there’s room for creativity and improvement.