Throwback
I wrote this post back in February this year, but wanted to have nice disassembly syntax highlighting, so I made an MR first against Chroma that Hugo uses for syntax highlighting, and waited for Hugo to pick up Chroma with my changes, and in the meantime I have sort of forgot about the whole thing.. I haven’t written anything since then, so no biggie but wanted to clarify, just in case anyone actually reads this blog and wonders how come the post suddenly appeared in September with date in the past :)
A trip down memory lane. It’s 2002. I’m sitting in the dimly lit Heidmarkhalle in Fallingbostel, Germany. It feels great to be back at the Mekka & Symposium demoscene party.
The room is shrouded in darkness, except for the glow of the computer screen illuminating my face. Ah, the good old days, when lugging a PC tower and a CRT monitor to parties was the norm! That year marked a record in the party’s history, boasting the highest number of PC demo submissions ever. Over 50 entries!
At some point, a organizers announced a bonus competition – the Polish flag compo! Organized by mados and tomaes of TAP, a small German demoscene group, this challenge sparked my interest. Even though the official Demozoo records don’t list the results, I managed to unearth them buried deep within the interwebs on the TAP webpage.
I accepted the challenge, with a bit of help from payda/Procreation and a couple others whose names, unfortunately, escape me after all these years. Here’s what I came up with
00000000 B013 mov al,0x13
00000002 CD10 int 0x10
00000004 B9027D mov cx,0x7d02
00000007 51 push cx
00000008 6800A0 push word 0xa000
0000000B 07 pop es
0000000C B00F mov al,0xf
0000000E F3AA rep stosb
00000010 59 pop cx
00000011 B004 mov al,0x4
00000013 F3AA rep stosb
00000015 CD16 int 0x16
and boy, did it look good:
A mere 23 bytes. Impressive, right? Curious if I could optimize it further, I consulted ChatGPT, which suggested this was as compact as it could get. However, the competition results suggested otherwise:
#.. | handle.. | size.. | time. | additional remarks.. 1. | bup / fuzzion | 18 | 18:22 | scratches the rules a little bit, but yeah, 18 bytes finally 2. | Avoozl/ECFh | 19 | 17:54 | text mode again 3. | PigPen^oCCult | 19 | 18:28 | you made it 4. | muhmac^freestyle | 19 | 19:12 | it's possible in mode 13h too, and 17 byte using dark gray 5. | lsd/MeltDown | 19 | 19:19 | your 18 byte crashs 6. | Plex / BionFX | 19 | 19:50 | goes 7. | MadMan/TAP | 21 | 18:14 | good trick 8. | Seven / | 22 | 15:46 | accepted 9. | baNa | 22 | 17:31 | ok 10. | angelo/exodus | 23 | 16:41 | very good 11. | Ctulhu/Headcrash | 23 | 18:03 | your 22 byte is gray :( 12. | RufUsul | 23 | 19:33 | yo 13. | cydo | 23 | 19:37 | your 17 byte crashs, a ret would be nice 14. | salt / kolabore | 25 | 19:40 | clean flag 15. | DaGrilling/#Define | 26 | 18:42 | still clear 16. | rawhed/moojuice | 27 | 16:58 | odd flag but ok 17. | bastiw | 27 | 19:16 | good flag 18. | MiZC /CMOV | 34 | 19:24 | clean flag 19. | Topy44 | 512 | 16:24 | it needs ansi.sys 20. | Gunhed | 9216 | 16:51 | visual basic 21. | mellon | 14336 | 16:13 | directx and upx ;-) 22. | puzzler/kekse | 20772 | 14:16 | a puzzle game %-) 23. | jack c./faque | 238592 | 15:48 | huh, it's waving - | vax|centric | - | 16:28 | shows a black screen - | odi^hicknhack | - | 19:49 | working 23 byte version was to late
9 entries better than me! (of course I already worked through that dissapointment, it was 22 years ago after all:). The first place entry was by bup/fuzzion. His submission bent the rules more than just a bit:
Without prior knowledge of the theme, one might never guess it represented the Polish flag. But, moving past this cheeky observation, I’ve set my sights on matching the 19-byte record.
Public Service Announcement
Before diving deeper, I must confess something I overlooked until writing this post – my program lacked a ret
instruction, and its functioning was purely coincidental. Since all com files load at the same memory location and some participants included a ret
in their larger solutions, my program inadvertently stumbled upon this missing instruction, courtesy of these unsung heroes. Not all heroes wear capes, indeed :)
So can I do better in 2024?
I wished I new how to do it myself by now, but improving my entry will require a bit of reverse engineering of the other submissions. Looking at those that managed 19 bytes, only mumac and PigPen achieved this feat in graphics mode. The rest relied on text mode. Eager to understand their advantage, I discovered a clever trick both used to eliminate the need for a second loop. Instead of directly setting colors with values 15
(white) and 4
(red) and implementing two distinct loops, one can subtract 11 from white and use jns
to control the loop. This approach works because the subtraction will trigger the sign flag on the second iteration (4 - 15 = -11
):
00000000 B013 mov al,0x13
00000002 CD10 int 0x10
00000004 B9027D mov cx,0x7d02
00000007 51 push cx
00000008 6800A0 push word 0xa000
0000000B 07 pop es
0000000C B00F mov al,0xf
0000000E F3AA rep stosb
00000010 59 pop cx
00000011 B004 mov al,0x4
00000013 F3AA rep stosb
00000015 CD16 int 0x16
there’s a nice trick they both used to eliminate the second loop. Instead of using directly values 15
for white and 4
for red and having two explicit loops, we can substract 11
from white, and use jns
for loop. For the first time we will still be zero, for second time we will break out because substraction will set the sign flag:
bits 16
section .text
org 100h
mov al,0x13
int 0x10
mov al,0xf
push word 0xa000
pop es
loop:
mov cx,0x7d02
rep stosb
sub al,0xb
jns loop
int 0x16
ret
By replacing the need for push cx
and pop cx
with sub
and jns
, we save 2 bytes, though adding ret
brings us to 22 bytes. The next byte we can shave off is here:
00000000 B013 mov al,0x13
00000002 CD10 int 0x10
00000004 B00F mov al,0xf
00000006 6800A0 push word 0xa000
00000009 07 pop es
0000000A B9027D mov cx,0x7d02
0000000D F3AA rep stosb
0000000F 2C0B sub al,0xb
00000011 79F7 jns 0xa
00000013 CD16 int 0x16
00000015 C3 ret
And we can achieve it with a small compromise. Why strive for perfection? We can adjust the half-byte count from 0x7D02
to 0x7D00
, allowing us to use mov ch
, which is one byte shorter than mov cx
:
bits 16
section .text
org 100h
mov al,0x13
int 0x10
mov al,0xf
push word 0xa000
pop es
loop:
mov ch,0x7d
rep stosb
sub al,0xb
jns loop
int 0x16
ret
The final bytes require some more background. Here’s what we are going to tackle:
00000000 B013 mov al,0x13
00000002 CD10 int 0x10
00000004 B00F mov al,0xf
00000006 6800A0 push word 0xa000
00000009 07 pop es
0000000A B57D mov ch,0x7d
0000000C F3AA rep stosb
0000000E 2C0B sub al,0xb
00000010 79F8 jns 0xa
00000012 CD16 int 0x16
00000014 C3 ret
I was staring at how the 19 byte versions got rid of pushing 0xA000
to es
to write to video memory, not really understanding what’s going on. And then I did reach out to the only person I knew who might have had a slightes idea on this level of assembly optimization that I still had any contact with: ryg/farbraush. Not only he replied after all the years, but he did provide entire historical context!
So, I’ve already touched on how all .com
files start at the same memory location, which is why we use the org 100h
directive, right? But, for some reason, it never dawned on me to question what resides in those initial 256 bytes. Had I pondered this sooner, I would’ve stumbled upon something known as the Program Segment Prefix (PSP).
Here’s a brief overview of what the PSP contains at the start:
Offset | Size | Contents |
---|---|---|
00h-01h | 2 bytes | CP/M-80-like exit |
02h-03h | 2 bytes | Segment of the first byte beyond the memory allocated to the program |
Let’s unwrap this a little.
- CP/M-80-like exit, is a regular
int 20h
instruction, binary coded as0xCD 0x20
. It is a thing from the old days of automated software translation from CP/M, explained nicely by one-and-only Raymond Chen. - This segment is usually upper memory segment limit, so
0x99 0xFF
or close to it. Which is very close to our0xA000
!
This idea was apparently devleoped first by pascal/CubicTeam, first used in his parallax starfield 20-byte intro – an optimized version of 23 byte parallax starfield:
.model tiny
.code
org 100h
start:
mov al,13h ;//ah assumed to be 0. ok.
int 10h
pop sp ;//com loader pushes 0, => sp == 0
pop cx ;//2 instruction bytes for int 20h => cx = 20CDh
;//pop something else for less stars
pop ds ;//we need the video segment 0A000h, here we
;//find 9FFFh (upper memory segment limit) on most machines.
;//this is not clean, but who cares...
it also turnes out, I couldn’t have reached out to a better person, because it was most likely ryg
himself who found a way make it even more efficient. Instead of doing 3 pop
introductions that take 3 bytes, we can do lea
and have same effect in just 2 bytes:
00000000 B013 mov al,0x13
00000002 CD10 int 0x10
00000004 B00F mov al,0xf
00000006 C417 les dx,[bx]
00000008 B57D mov ch,0x7d
0000000A F3AA rep stosb
0000000C 2C0B sub al,0xb
0000000E 79F8 jns 0x8
00000010 CD16 int 0x16
00000012 C3 ret
This subtle yet effective modification not only honors the original challenge but also showcases the beauty of size coding. If it looks good, it is good. Even in a field as precise as assembly, there’s room for creativity and improvement.