Created: 2024-06-23

Diatom: Baby Steps

Contents

Motivation
Goals
Parts
Playground

Diatom is my current long-term project and my take on a Forth implementation. In this post I'm gonna lay out the current state of it and dive into some implementation details.

Motivation

I've been reading about Forth now and again on Hackernews and didn't really get the appeal of the language as it doesn't resemble any other programming languages I've seen and is definitely up there with the other strange languages (looking at you Lisp, Prolog, K, ...). Chuck Moore, the original creator, also claimed ludicrously high productivity gains, up to 100x, when using Forth, that puts every 10x developer to shame. I didn't give it any further thought as this sounded just too surreal.

Some time passed, I did the occasional microcontroller project and always found the development process to be a drag. Flashing the microcontroller takes a couple of seconds, which isn't that big of a deal but also doesn't feel very nice when you are still in an exploratory phase.

More time passed and I did some Advent of Code puzzles with Common Lisp + Emacs and I was pretty amazed by the interactive development process (which is still the best one to this day, once you set everything up at least...).

Then one day, while resolving some transient dependency conflicts, I stumbled upon this article that cherishes the simplicity of Forth and argues that ill-fitting library abstractions are one of the main reasons developer productivity suffers. That got me kinda interested this time around but the last paragraph really brought it home for me:

Or you remain a cog in the machine, developing on absurdly convoluted "software stacks", using inadequate and brittle languages, connect barely matching interfaces to only small parts of too many bloated libraries, using insanely sophisticated tools of incomprehensble complexity and trot on, asking yourself why everything sucks and why the last time you actually enjoyed programming and were you could be genuinely and deservedly proud on what you accomplished, was when you were a kid...

Ouch... Maybe Chuck Moore isn't such a lunatic after all and there is some truth in what those Forth people are saying - only one way to find out though!

Goals

So I set out to implement a custom Forth dialect to explore interactive development of embedded devices that has the following priorities:

Simplicity - simple to use but also simple to implement
Portability - be the most portable virtual machine there is
Correctness - the obvious way to do something should also be the right way
Efficiency - be reasonable efficient in both time and space requirements but always provide an escape hatch to the host language

Parts

The current implementation consists of multiple parts that we'll explore in detail in this section.

Virtual Machine

The Diatom virtual machine (VM) is a very simple stack VM consisting of two stacks (return & data stack) and a flat memory space. Each element on the stack is a word wide (4 bytes) and can only be accessed by the provided stack-manipulating instructions (pop, dup, etc. - consult the /Instruction Set/ section for a comprehensive list). Memory can either be accessed byte- or word-wise.

Everything else mostly behaves as one would expect with the exception that arithmetic operations don't overflow but instead saturate, meaning that the value won't get past the maximum/minimum representable value. This definitely has some negative performance implications as it doesn't match the behaviour of most CPUs but it makes overflows easier to detect as one only needs to compare against the max/min values.

I also believe it makes way more sense for embedded applications that for example controll a plane's flight path to be stuck at not quite the right elevator deflection than overflowing and moving the elevator into the entirely wrong direction.

The stack depth and memory size is configurable in the current Go implementation.

Most Forths are implemented directly in assembly of the target CPU but implementing it against a VM makes porting to new environments pretty easy - e.g. getting it to run in the browser has been pretty straight forward.

Instruction Set

The currently supported instructions:

exit Halts the VM
nop Does nothing
ret Pops an address from the return stack and jumps to it
const Puts the next word of memory onto the data stack
@ Reads a word from the specified address
! Writes a word to the specified address
+ Adds the top two numbers on the data stack
- Subtracts the top two numbers on the data stack
* Multiplies the top two numbers on the data stack
/ Divides the second by the top most number on the data stack
% Same as / but stores the remainder
dup Duplicates the top most number on the data stack
drop Drops the top most item on the data stack
swap Swaps the two top most items on the data stack
over Duplicates the second item and stores in on top of the data stack
cjmp If the top most item equals -1 then jump to the address denoted by the next word in memory
call Jumps to the address denoted by the following word in memory and puts an appropriate return address onto the return stack
scall Same as call but takes the address to jump to from the data stack
key Reads a single byte from stdin (implementation dependent)
emit Writes a single byte to stdout (implementation dependent)
= Compares the two top most items, -1 if equal, otherwise 0
~~~ Bitwise NOT
& Bitwise AND
| Bitwise OR
< Same as = but with lesser-than semantics
> Same as = but with greater-than semantics
rpop Pops the top most item off the return stack and stores it on the data stack
rput Reverse opration of rpop
rpeek Same as rpop but rather copies the address from the return stack
b@ Same as @ but only reads a single byte
b! Same as ! but only writes a single byte

Not as minimal as I'd like it to be, but still very well on the small side.

Assembler

One downside of having a home-brew virtual machine is that we somehow have to generate the machine code for it as well. For this I have implemented a simple assembler that takes the VM assembly as input and emits the corresponding byte-code.

Apart from this basic translation task, the assembler also supports forward-referencing labels, some built-in macros tailored to implementing Forths and, of course, comments:

           ( Jump to the address of ":main". )
           const -1 cjmp @main

           ( Declare a variable "x" that is 4 bytes wide. )
           .var x 4 .end

           ( Define a new word "double" that executes "dup" and "+". )
           .codeword double dup + .end

           ( Define a new word "quadruple" that calls the word "double" twice. )
           ( "!<word>" is a macro that expands to "call _dict<word>". )
           .codeword quadruple !double !double .end

           ( Directly call the word "quadruple". )
           :main
           const 4 call _dictquadruple
           ( "_dict<word>" is a generated label that points to the
                 code section of "quadruple". )

Preamble

With the VM and the assembler in place, we already have everything to run a bare-bones Forth interpreter - except the actual code to bootstrap it...

The preamble.dasm file is exactly that. Clocking in at ~ 600 lines of assembly (including comments), it sets everything up to get the first Forth read-eval-print-loop (REPL) started.

It is based on the amazing jonesforth implementation and mainly takes care of setting up the Forth dictionary, implementing word lookups and compilation.

It is definitely larger in size than it needs to be but for now it gets the job done and I'll revisit it once the higher-level constructs are more established.

Standard Library

This is the first actual Diatom code and it's where stuff gets interesting. Right now it is pretty barren but this is the place where the basic language primitives are defined and where it will diverge from most Forth implementations quite a bit.

I'm still figuring this part out as I try to use Diatom in some everyday scripts and add what I deem necessary to complete the task.

Still, I think it is showing off one of Forths greatest advantages pretty well - it doesn't require much code to bootstrap itself. For example if-statements are not based on magic, built-in keywords, as in most languages, but are instead just a bunch of immediate words (a.k.a. macros in other languages). Same goes for implementing recursion (via recurse) and even ( <text> ) for comments!

With the standard lib loaded we have most of the basic building block available to write some code. For example a factorial implementation could look something like this:

         ( accumulator int -- result )
         : _factorial
           dup 0 = over 1 = |
           if return then
           swap over *
           swap 1-
           recurse
         ;

         ( int -- result )
         ( Calculates the factorial of the given number. )
         : factorial 1 swap _factorial drop ;

The stack shuffling with swap and over is probably not super nice but I think for a tail-recursive implementation it is not too bad and who knows, maybe the VM gets some addressable stack registers at one point. And that is the exact reason why I'm so excited about that project - there is no notion of how programming should work, everything is mallable and because it is so little code it will hopefully be easy to try out ideas and see if they suck or indeed make programming more enjoyable.

Playground

Before ending this post I want to leave you with something cool. Because the virtual machine is so simple, it has been pretty straight forward to port it to Javascript and therefore have an interactive Diatom REPL where you can try out some example programs!

If the VM doesn't do anything, press the Reset button as it is likely that it crashed for some reason. Error reporting is currently non-existent so expect the REPL to spontaneously combust when you try to use an undefined word for example.

You can also checkout this page's source code to see how little it takes to spin up this REPL.

Here is an example how to implement the classic fizz-buzz with the currently available language constructs:

          : 'fizz'
                ( No strings yet so ASCII codes it is. )
                102 emit
                105 emit
                122 emit
                122 emit
          ;

          : 'buzz'
                98 emit
                117 emit
                122 emit
                122 emit
          ;

          : 'space' 32 emit ;

          ( number accumulator -- )
          : _fizzBuzz
                ( Return if the accumulator reached the maximum. )
                over over = if drop drop return then

                dup 3 % 0 = if
                  'fizz'
                then
                dup 5 % 0 = if
                  'buzz'
                then
                dup 3 % 0 = over 5 % 0 = | ~ if
                  dup .
                then

                'space' 1+ recurse
          ;

          ( number -- )
          : fizzBuzz 1+ 1 _fizzBuzz ;

          ( Run it like this: 20 fizzBuzz )

That's it, hope I could get you a bit excited about Forth-like languages and I'll be back with more updates once there is something worth showing.