Part 2 of 5

Brainfuck Improvements

Last time we took a look at how Brainfuck works and how to execute it. We built the most basic implementation of an interpreter possible. It was so simple that it didn’t even know where the current loop began or ended. While that’s not a great way to run Brainfuck scripts, it gives us a baseline to work with.

Be better than baseline

If you look at any Brainfuck code, one thing becomes apparent pretty quickly: there is a lot of repetition of commands directly next to one another. As Brainfuck only allows the cell pointer and the current cell to be changed in increments of one, multiple of these commands need to be chained for any larger offset.

This makes for a particularly straightforward optimization. When loading the program from disk, replace any run of identical adjacent commands with a single command and the offset to be added or subtracted.



+++         ->      + (3)
>           ->      > (1)
++          ->      + (2)
>           ->      > (1)
++          ->      + (2)
<<          ->      < (2)
.           ->      .
.           ->      .
[           ->      [
  -         ->      - (1)
  >         ->      > (1)
  ++++      ->      + (4)
  >         ->      > (1)
  +++       ->      + (3)
  >         ->      > (1)
  +         ->      + (1)
  <<<       ->      < (3)
]           ->      ]

The example above shows how a bit of Brainfuck code is converted to the optimized version. The input and output commands are deliberately left alone, as two outputs, for example, need to be executed as two steps anyway; merging them would just require an extra loop in the interpreter. Moreover, compared to pointer and data arithmetic, these instructions are exceedingly rare. Usually, a lot of computation leads up to a single output command printing the result.

In the end, this process yields compressed Brainfuck code that could be described as selectively run-length encoded. RLE is a very basic lossless compression algorithm that, in its simplest form, looks at every byte in a data stream and merges adjacent repetitions: “HHEEEELOOOOO” would be encoded as “H2E4L1O5”. *End of tangent* 🙃
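To make the tangent concrete, a naive run-length encoder (not part of the interpreter, just an illustration) could look like this:

```javascript
// Naive run-length encoder: every byte is followed by its repeat count,
// mirroring the "H2E4L1O5" example above.
function rle( str ) {
  let out= '';
  for( let i= 0; i < str.length; ) {
    const cur= str[i];
    let n= 0;
    // Count how many identical characters follow
    while( i < str.length && str[i] === cur ) {
      i++;
      n++;
    }
    out += cur + n;
  }
  return out;
}
```

The command merger in the compiler does essentially the same thing, just limited to the four arithmetic commands.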

The other obvious area needing improvement is loops. Every time the interpreter reaches a loop and decides not to enter it, it should already know where the loop ends so it can jump over it immediately, without having to search for the matching `]`. The same goes for loops that are entered and should be rerun: at the end of the loop, the interpreter should know where the loop begins and jump there directly.

The latter could be achieved with a stack at runtime. When entering a new loop, its beginning position is pushed onto the stack. On every rerun, the topmost entry indicates where to jump. When a loop has to be skipped, the interpreter still has to search for the end; hypothetically, it could cache the end position it finds at the instruction marking the beginning of the loop. But as the command merger runs before the script gets executed anyway, the natural approach is to do all of this right there.
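For reference, the runtime-stack variant could be sketched like this. It interprets a plain, unmerged command string; the names and the missing underflow checks are illustrative, not taken from the actual interpreter:

```javascript
// Sketch of the runtime-stack approach: '[' pushes its position on entry,
// ']' peeks at the top of the stack to rerun, and pops it when the loop ends.
function runWithStack( src ) {
  const data= [ 0 ];
  const loops= [];          // stack of '[' positions
  let ptr= 0;
  let out= '';

  for( let i= 0; i < src.length; i++ ) {
    switch( src[i] ) {
      case '+': data[ptr]++; break;
      case '-': data[ptr]--; break;
      case '>': ptr++; if( data[ptr] === undefined ) data[ptr]= 0; break;
      case '<': ptr--; break;
      case '.': out += String.fromCharCode( data[ptr] ); break;
      case '[':
        if( data[ptr] ) {
          loops.push( i );  // remember where the loop begins
        } else {
          let depth= 1;     // still have to search for the matching ']'
          while( depth ) {
            i++;
            if( src[i] === '[' ) depth++;
            else if( src[i] === ']' ) depth--;
          }
        }
        break;
      case ']':
        if( data[ptr] ) {
          i= loops[loops.length - 1]; // rerun: jump just past the '['
        } else {
          loops.pop();      // loop is done
        }
        break;
    }
  }
  return out;
}
```

Note how skipping a loop still requires a linear search, which is exactly why doing the work once at compile time is the better option.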

Let’s look at some code

To store our “compiled” Brainfuck, I just created an array where every instruction takes up two slots: the first one is the type of instruction (opcode), the second one additional data (operand). The operand is either the amount to add/subtract for arithmetic commands or the code position to jump to for loop instructions.
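The `ASCII` table used by the snippets below isn’t shown in the article; presumably it just maps command names to their character codes. Under that assumption, a fragment like `++>+++` compiles to a flat opcode/operand array:

```javascript
// Assumed ASCII lookup table, mapping command names to character codes
const ASCII= {
  Plus: 0x2B, Minus: 0x2D, Left: 0x3C, Right: 0x3E,
  Dot: 0x2E, Comma: 0x2C, OBracket: 0x5B, CBracket: 0x5D
};

// '++>+++' compiles to: opcode, operand, opcode, operand, ...
const code= [ ASCII.Plus, 2, ASCII.Right, 1, ASCII.Plus, 3 ];
```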



  case ASCII.Plus:
  case ASCII.Minus:
  case ASCII.Left:
  case ASCII.Right:
    // Count how many identical bytes follow (check bounds before reading!)
    let n= 0;
    while( i < src.length && src.readUInt8(i) === cur ) {
      i++;
      n++;
    }
    code.push( cur );
    code.push( n );
    i--;
    break;

All commands to be merged are handled by the same switch-case. It’s essentially a loop counting the number of identical bytes.



  case ASCII.Dot:
  case ASCII.Comma:
    code.push( cur );
    code.push( 0 );
    break;

IO commands are simply added to the code array with an operand of 0.

Handling loops is a bit more complicated. When reaching a new loop, the current position in the code array is pushed onto the loop stack. As we don’t know where the loop ends yet, the instruction appended to the code has a placeholder operand of 0.



  case ASCII.OBracket:
    loops.push( code.length );
    code.push( cur );
    code.push( 0 );   // placeholder, patched when the matching ']' is found
    break;

Most of the actual work happens at the end of a loop. Popping the top value off the loop stack gives us the beginning position of the current loop, where we can now set the current position as the operand. The instruction added to the code array uses the beginning position as its operand. This way the interpreter knows where the loop ends when it needs to be skipped, and where it begins when it needs to be rerun.



  case ASCII.CBracket:
    if( !loops.length ) {
      throw Error('Unexpected end of loop');
    }
    const x= loops.pop();
    code[x+1]= code.length;   // back-patch the '[' with the end position
    code.push( cur );
    code.push( x );           // the ']' jumps back to the beginning
    break;

The execution loop has changed a bit to deal with our improved code. On every step, it reads the instruction’s opcode and operand into two variables. Plus and minus instructions simply add/subtract the operand.



function run() {
  const data= [ 0 ];
  let ptr= 0;

  for( let i= 0; i!= code.length; i+=2 ) {
    const cmd= code[i];
    const opr= code[i+1];

    switch( cmd ) {
      case ASCII.Plus:
        data[ptr] += opr;
        break;

      case ASCII.Minus:
        data[ptr] -= opr;
        break;

      //...

    }
  }
}

The pointer movement instructions do the same thing, plus some additional checks: the pointer must not underflow the beginning of the cell array, and new cells are appended at the end if needed.
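The article doesn’t show these cases. Based on the description, the movement logic might look like this, wrapped here in a small helper so the bounds handling is visible in isolation (the function name is illustrative):

```javascript
// Sketch of the pointer movement checks: grow the tape when moving right,
// refuse to underflow when moving left.
function movePtr( data, ptr, right, opr ) {
  if( right ) {
    ptr += opr;
    while( data.length <= ptr ) {
      data.push( 0 );           // append fresh cells on demand
    }
  } else {
    ptr -= opr;
    if( ptr < 0 ) {             // don't underflow the cell array
      throw Error('Cell pointer underflow');
    }
  }
  return ptr;
}
```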



  case ASCII.OBracket:
    if( !data[ptr] ) {
      i= opr;
    }
    break;

  case ASCII.CBracket:
    if( data[ptr] ) {
      i= opr;
    }
    break;

Being aware of the delimiting positions of loops, the loop instructions have changed into conditional jumps. They behave like JZ (jump if zero) and JNZ (jump if not zero) in x86 assembly.

Cutting time down like a chainsaw

After all this work, there is one question left. Did it pay off? Looking at the numbers, the answer is a big fat yes.

Comparing the best runtimes of each version, the new one runs at about 4.4 times the speed. These numbers include the time needed to compile the script; I measured both with and without the compile step, but it didn’t make much of a difference.

Next time we will look at another way to build the interpreter without a single large switch. Stay tuned! 😊

An exercise for the reader

Two more things could easily be improved in this version. The plus and minus instructions could be combined: instead of subtracting, the operand would simply be a negative value that gets added.
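One way to sketch that idea (hypothetical, not the article’s code): the compiler turns a run of minuses into a negative count, so the interpreter only needs a single add case.

```javascript
// Sketch of the combined instruction: a run like '---' compiles to a
// negative operand, so 'data[ptr] += opr' handles both plus and minus.
function compileArithmetic( run ) {          // run: e.g. '---' or '++'
  const n= run.length;
  return [ '+', run[0] === '-' ? -n : n ];   // one opcode, signed operand
}
```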

Furthermore, the switch could be simplified to use only numbers from 0 to N, incremented by one, instead of the arbitrarily distributed ASCII codes. Object lookups could additionally be avoided by using actual number constants in the `case` statements, although V8 should be able to recognize that the object is created and never modified, allowing it to inline the values. With the switch changed this way, V8 can JIT-compile it to a jump table, removing unnecessary branches. Although I’m not sure whether TurboFan, V8’s optimizing backend, is even able to do such a thing.
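A dense opcode numbering could look like this (the names are hypothetical, and they already fold in the merged arithmetic and movement instructions from the previous exercise):

```javascript
// Hypothetical dense opcode set: contiguous small integers instead of
// scattered ASCII codes, making the switch a jump-table candidate.
const Op= {
  Add: 0,    // merged +/-, signed operand
  Move: 1,   // merged </>, signed operand
  Out: 2,    // .
  In: 3,     // ,
  Jz: 4,     // [  jump if current cell is zero
  Jnz: 5     // ]  jump if current cell is not zero
};
```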

But based on this little test it at least makes some kind of difference.

last edited 10th Aug 20