If you had to describe what a programming language looks like really precisely to someone lacking any ability of independent thought, someone struggling to understand basic natural language, suffering from the complete inability to make any kind of useful assumptions or fill in holes themselves. How would you explain to someone unable to decipher the meaning of basic words, what a variable declaration looks like? Some like a computer for example.
Well, you would use a language they understand. When talking about teaching a computer reading structured text, one might think of regular expressions (regex) and they would be correct in 90% of all cases. Using regex a programmer can define which characters, how often and in what order they should appear, to judge the correctness of an input or filter out some information.
There are many ways to write such expressions, but they are mostly made up of three things. A way to define what kind of character or character string to expect, a way to define a repetition pattern and finally a way to group them to create more complex structures. The order they should appear in is usually inferred from the direction the regex is read, therefore mostly left to right.
The regex helloworld for example expects a string containing helloworld . If you wanted to change the number of possible l characters in the word “hello” allowed from two to at least two you can use the + operator. hel+oworld . There are multiple such operators like ? , * and + , meaning “one or none”, “none or many” and “one or more”. To repeat whole words you need to group the characters with braces (helloworld)? making the whole word optional for example. An empty string would now also be accepted as a correct input value. To make our many l version optional we would write (hel+oworld)? .
Such regex can get extremely evolved, and there is also a fourth thing I glossed over called “look arounds” allowing to restrict if something may be followed or may follow something else. If you want to learn more about regex and maybe play around with them a little, I highly recommend this page where you can write your expressions and interactively see what and how parts of your input are matched.
After this super quick introduction into regex you might think, this is just the right tool to describe a programming language to build a syntax highlighter or parser for a compiler, right? Sadly no. There something very important regex cannot handle: recursion. What do I mean by that? If you wanted to express that a string is only valid if it’s constructed from the word "hello" enclosed in a matching set of hashtags, how would you write that?
Maybe try #*hello#* , which does the trick if you don’t care that the number of hashtags on the right and the left side is unequal. Writing something like this in regex is impossible. What we need is a means to define that we want an input consisting of something enclosed in a matching pair of # . This something is either the word "hello" or again something enclosed in hashtags, aka recursion.
Programming languages are full of these annoying recursions. Just think of if and else blocks, parenthesis in function calls or object definitions in JS.
Without getting too theoretical different text structures can be put into multiple different groups where one is the superset of the other one. The smallest set is so-called “finite languages”. These are made up of a predefined list of accepted inputs called words, like a lookup table. The next larger set is “regular expressions” which define languages that might be made up from an infinite number of different words. a+ for example defines a language that accepts the words a , aa , aaa and so on. Finally, there are “context-free grammars” which allow for recursion, which is exactly what we need. Then there are a few more, that we don’t care about today.
So what does a context-free grammar look like? Similar to regex, there are many ways to express them, but I prefer a modified version of EBNF, which is a very concise and easy-to-read format. Basically, EBNF is made up of a list of named regex rules that can use each other inside of them. To make it possible to distinguish between expected characters (terminals) and the name of a rule (non-terminal) all terminals are under quotation marks like string constants.
Going back to our previous example of hashtag enclosed words, we want to define a rule that either recurs with a pair of hashtags or terminates with the word "hello". MyRule ::= ("#" MyRule "#") | "hello" does just that. With the pipe operator, we allow two different ways to match an input, and the first one that matches will be consumed.
rule ::= name "::=" expression+ expression ::= quantified_value ( "|" quantified_value )* quantified_value ::= value ("?" | "*" | "+")? value ::= name | ("\"" text "\"") | ( "(" expression+ ")" ) text ::= (char | digit)+ name ::= char+ digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" char ::= "_" | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
The code above shows how the EBNF syntax for a single rule can be described using EBNF. To keep it simple, I made some simplification of the characters usable in strings and rule names.
In the next part, we will take a look at a tool I wrote to parse EBNF and use it to match input strings.