After reimplementing one of my Uni assignments with scratch I wondered how far I can push the ghetto-scratch-CPU-emulator-thingy I came up with to handle local variables. I felt the urge to do some hello-world-printing but using the draw functions instead of a simple message box. However, to be fully honest, I don’t really know why I thought that Scratch needed this feature.
But first of all, I wanted to be able to allocate memory dynamically, so here we go…
Memory allocation basics
If you ever wrote anything in the C programming language and had to store either more than a few hundred bytes or didn’t know how much space you might need until the program runs you will be familiar with the malloc function. It is provided by the C standard library and allows for dynamic memory allocation. Whenever a block of memory is requested, a call is done to the OS to have additional memory mapped into the process’s address space. But as this would mean, that each process would consume more and more memory, the free function exists.
Calling free returns the memory back to the malloc-infrastructure, to allow recycling of memory blocks. Contrary to my own belief when I started programming it’s not the OS that is concerned with the memory management of user-level programs. It only provides them with a usable address space that might be expanded later on. This has mainly two advantages: Calls to the OS can be pretty slow and having every allocation trigger a context switch into kernel space would be a no-no. Moreover, memory mapping can also be slow depending on the requested block size.
Secondly, having the user handle the memory management themselves allows for custom implementations of malloc for specific applications. The whole system is part of the standard library, but one can write their own thing and do the system calls to the OS. C++ even provides the placement new operator for this exact purpose. Google for example provides the TCMalloc allocator.
Under the hood, malloc has to keep track of all the allocations done by the program, to allow correct freeing later on. As the user only gets handed over a pointer to the requested memory, the block’s size has to be stored by malloc. Furthermore, all the freed blocks need to be kept somewhere for recycling.
Although this doesn’t mean, memory is never returned to the OS (free might decide to call munmap to release memory back to the OS) usually most deallocations are absorbed by malloc . This reduces the traffic of syscalls a lot.
So as a little recap: Every time malloc is called, the system checks whether a block already exists, that can fit the requested amount of memory. In case there is nothing to reuse or more memory is needed than any of the free blocks can contain, the OS is asked to expand the address space. If the user decides to return memory via free , it is added to the list of reusable blocks.
Coming up with a memory layout
Scratch technically allows for dynamic memory allocation, but not the way it would be necessarily useful for more involved programming. Arrays or “lists” how Scratch calls them, can grow and shrink dynamically, which I used as a makeshift variable stack last time. Now that I want to be able to have an area for global variables, and a heap, I repurposed the array as general memory space. Everything except the registers is stored in it and pointers are just array indices. The fact that the creators of Scratch thought it was acceptable to have arrays start at index 1, has the interesting side effect, that null pointers with the value zero are actually invalid, as they don’t index anything in the memory array.
This principle of using an array-like memory segment to run low-level code expecting full control over its memory is also used by web assembly for example. There it is called linear memory, which is nothing more than a byte buffer allocated in JS and handed to the wasm execution context.
The diagram above shows, how I laid out the memory. The first section is the stack, that like every correct design should, grows from top to bottom. Then the following section of predetermined size contains any static or global data cells. Finally, the last section is the heap that grows upwards.
The possibly simplest implementation for an allocation function would be a so-called “bump allocator”. It basically works like a stack, with a pointer pointing to the next free cell. On every allocation, the pointer gets moved aka bumped by the number of required cells. Freeing is impossible with this kind of system, which results in a steadily growing memory footprint.
A better yet still no too complicated way is to use reusable blocks like discussed previously. Unused blocks are stored in a linked list, that is searched whenever memory is requested. Therefore every block has to store its size and the pointer to the next block in the list. This is done in the block’s header, which is hidden from the user.
My implementation is very naive and tries to minimize the search effort by returning the first block that is large enough to fit the number of requested cells. Meaning that it might return a block that is way too large for the request. This is not a problem for the user, as the only consequence, a write outside of the expected boundary will just not fail. It’s simply really inefficient if a 5 cell allocation occupies a 500 cell block of memory.
In case none of the blocks in the list is large enough or the list is currently empty, a new block is created on top of the heap. The number of usable cells is at least four, to make later reusage more likely. In a sense, this works like a bump-allocator.
While the internal list of blocks uses pointers, that point to the first element of the block structure, the pointer returned to the user points to the first usable cell instead, hiding the header. Hence, the free function first has to convert a received pointer by subtracting the header length. Then the block is pushed to the front of the list.
A scratch block can allocate memory on the heap by calling the alloc function which will return a pointer to the requested block in the A register.
The image shows the allocation of a glyph table, which will come in use in the next article. 😊
If you want to play around a bit yourself, you can download the scratch file. Don't forget to change the file extension to .sb3 !
Improvements and closing notes
I described my implementation as “simple” which is not false, but calling it “plainly bad” would probably be correct too. One simple way to improve the search performance would the usage of multiple lists, where each stores blocks of different size categories. If a large block is needed, required to fit more than 64 cells for example, the search loop doesn’t have to iterate over a bunch of 4 cell blocks to find an appropriate one. This comes with the additional plus, that the relative part of wasted cells is more controllable, as only smaller blocks are searched for smaller allocations. Another even better option would be to keep them inside a sorted tree, to reduce the search time even further.
Still, if a large block is only partly filled, the absolute number of wasted cells might be quite large. Furthermore, if the user at some point allocates a bunch of small objects and frees them, the heap is forever stuck with them, and can’t reuse them properly. It would be therefore desirable to merge adjacent blocks during freeing and split them on allocation if more than 16 cells get wasted for example.
Splitting would create two blocks, one containing the old header, that now needs to be changed to the split size, and one that needs to be initialized as a new block. The one that remains needs to be added to the list (or one of the lists) of free blocks.
Merging blocks is a little bit more tricks. If a block is returned it should check if any adjacent blocks are also free and can therefore be swallowed. But how would you even find the next block? As every block knows its own size, the heap itself can be traversed like a linked list, since every next block starts directly after the previous one. So on free, adding the own size to the block pointer reveals the block directly after this one. But it’s only a single linked list, which means we cannot check the block before us. Hence, there is still a fifty-fifty chance, that we cannot combine blocks, even though an adjacent one is free. On the other hand, there could be a whole trail of free blocks behind the current one.
Implementing such a feature requires the block header to grow a bit. Every block has to store now whether it is currently used with a flag. Additionally, when merging, we mustn’t forget to remove the swallowed block from the list of free blocks. To do that in an efficient manner, a double linked list needs to be used instead, requiring the header to store an extra “previous” pointer.
For debugging purposes, I put the free pointer, which stores the beginning of the list of free blocks, into its own register, which does not make much sense in practice. It should be stored in the memory’s static section.
Moreover, I erroneously called the function expanding the heap mmap . The way it behaves it should actually be called sbrk if I wanted to stick to system call names.