Introduction
When I stumbled upon the book “Writing an Interpreter in Go”, shared by some fellow developers, I was immediately intrigued. The idea of building an interpreter from scratch seemed daunting yet fascinating. In this post, I’ll break down the core components of an interpreter—lexers, parsers, abstract syntax trees (ASTs), evaluators, and more. Let’s dive in!
What’s a Lexer?
A lexer (or lexical analyzer, or scanner) is the first step in the process of interpreting source code. Its job is to read the raw input (the program as text) and break it up into a sequence of tokens. Each token represents a meaningful element: a keyword, identifier, operator, literal, etc.
Why is a Lexer important?
-
Simplifies Parsing: By converting raw text into tokens, the parser can focus on structure, not character-by-character details.
-
Error Detection: Lexers can catch invalid characters or malformed tokens early.
-
Foundation for Syntax: Every higher-level component (parser, evaluator) depends on the lexer’s output.
How does the Lexer work?
Lexer Structure:
type Lexer struct {
input string
position int
readPosition int
ch byte
}
- input: The full source code as a string.
- position: Index of the current character being looked at.
- readPosition: Index of the next character to read (helps with lookahead).
- ch: The current character under examination.
Initialization:
func New(input string) *Lexer {
l := &Lexer{
input: input,
}
l.readChar()
return l
}
- When a new lexer is created, it stores the input and reads the first character to initialize ch.
Reading Characters:
func (l *Lexer) readChar() {
if l.readPosition >= len(l.input) {
l.ch = 0 // ASCII code for 'NUL', signals end of input
} else {
l.ch = l.input[l.readPosition]
}
l.position = l.readPosition
l.readPosition += 1
}
- Moves the lexer forward by one character.
- If at the end of input, sets ch to 0 (signals EOF).
Tokenization:The Heart of the Lexer
func (l *Lexer) NextToken() token.Token
- Skips whitespace using skipWhitespaces().
- Switches on the current character (ch) to decide what kind of token to emit:
- Single-character tokens: e.g., +, -, *, /, ;, (, ), {, }, [, ], :
- Multi-character operators: e.g., ==, != (uses peekChar() to look ahead)
- String literals: If ch is ”, calls readString()
- Identifiers/Keywords: If ch is a letter, calls readIdentifier(), then uses token.LookupIdent() to distinguish keywords from identifiers.
- Numbers: If ch is a digit, calls readNumber()
- Illegal characters: Anything else is marked as ILLEGAL.
- Advances to the next character after emitting a token (except for identifiers and numbers, which already advanced).
Helper Methods:
- peekChar(): Looks ahead one character without advancing the lexer.
- readIdentifier(): Reads a sequence of letters/underscores (for identifiers and keywords).
- readNumber(): Reads a sequence of digits (for integer literals).
- readString(): Reads until the next ” or end of input.
- skipWhitespaces(): Skips spaces, tabs, newlines, and carriage returns.
Token Construction:
- newToken(tokenType, ch): Helper to create a token from a type and a character.
How the Lexer Advances:
- The lexer always keeps track of the current character and the next character.
- For each call to NextToken(), it emits the next token and advances its position.
- For identifiers and numbers, it reads the whole sequence before emitting the token.
- For operators that could be two characters (like ==), it peeks ahead and combines if needed.
Example Walkthrough:
Given input:
let x = 5 + 10;
- Skips whitespace.
- Reads let as an identifier, checks if it’s a keyword.
- Reads x as an identifier.
- Reads = as an assign token.
- Reads 5 as an integer.
- Reads + as a plus token.
- Reads 10 as an integer.
- Reads ; as a semicolon.
How Lexers Can Differ:
- Some lexers use regular expressions for tokenization.
- Some support more complex lookahead (not just one character).
- Some handle comments, escape sequences, or Unicode.
- Some are table-driven or use state machines for more complex languages.
Purpose of the Token Component
The Token component defines:
- The types of tokens the language understands (keywords, operators, literals, etc.)
- The structure of a token (what information each token holds)
- How to distinguish between identifiers and keywords
Tokens are the “words” of your language, produced by the lexer and consumed by the parser.
Token Structure:
type Token struct {
Type TokenType
Literal string
}
- Type: The category of the token (e.g., IDENT, INT, PLUS, LET).
- Literal: The actual string value from the source code (e.g., “let”, “x”, “5”).
Token Type Constants:
- Special: ILLEGAL, EOF
- Identifiers/Literals: IDENT, INT, STRING
- Operators: ASSIGN, PLUS, MINUS, BANG, SLASH, ASTERISK, LT, GT, EQ, NOT_EQ
- Delimiters: COMMA, SEMICOLON, LPAREN, RPAREN, LBRACE, RBRACE, LBRACKET, RBRACKET, COLON
- Keywords: FUNCTION, LET, TRUE, FALSE, IF, ELSE, RETURN
Keyword Lookup:
var keywords = map[string]TokenType{
"fn": FUNCTION,
"let": LET,
"true": TRUE,
"false": FALSE,
"if": IF,
"else": ELSE,
"return": RETURN,
}
func LookupIdent(ident string) TokenType {
if tok, ok := keywords[ident]; ok {
return tok
}
return IDENT
}
- keywords map: Maps reserved words in Monkey to their token types.
- LookupIdent: When the lexer reads an identifier, it uses this function to check if it’s a keyword. If so, it returns the keyword’s token type; otherwise, it returns IDENT.
How Tokens Are Used:
- The lexer emits tokens as it scans the input.
- The parser consumes these tokens to build the AST.
- Each token carries both its type (for syntax) and its literal value (for semantics).
Example:
Given:
let x = 5;
The lexer would emit:
- {Type: “LET”, Literal: “let”}
- {Type: “IDENT”, Literal: “x”}
- {Type: ”=”, Literal: ”=”}
- {Type: “INT”, Literal: “5”}
- {Type: ”;”, Literal: ”;”}
- {Type: “EOF”, Literal: ""}
Purpose of the Parser
The parser’s job is to take the stream of tokens produced by the lexer and turn them into an Abstract Syntax Tree (AST), which represents the syntactic structure of the program. This tree is then used by the evaluator to execute the program.
Parser Structure:
type Parser struct {
l *lexer.Lexer
errors []string
curToken token.Token
peekToken token.Token
prefixParseFns map[token.TokenType]prefixParseFn
infixParseFns map[token.TokenType]infixParseFn
}
- l: The lexer instance providing tokens.
- errors: List of parsing errors.
- curToken/peekToken: The current and next tokens (for lookahead).
- prefixParseFns/infixParseFns: Maps of functions for parsing expressions, depending on the token type and position.
Operator Precedence:
The parser uses a precedence climbing (Pratt parsing) approach:
- Precedence levels are defined as constants (e.g., LOWEST, SUM, PRODUCT).
- The precedences map assigns precedence to token types (e.g., + and - have the same precedence).
Initialization:
func New(l *lexer.Lexer) *Parser
- Registers parsing functions for different token types (prefix and infix).
- Reads two tokens to initialize curToken and peekToken.
Parsing Functions:
- Prefix parse functions: Handle tokens that start an expression (e.g., identifiers, literals, prefix operators).
- Infix parse functions: Handle tokens that appear between expressions (e.g., +, -, *, /, ==, !=).
These are registered in the constructor, allowing the parser to be easily extended.
Main Parsing Loop:
func (p *Parser) ParseProgram() *ast.Program
- Creates an
ast.Programnode. - Loops through tokens, calling
parseStatement()for each top-level statement until EOF.
Statement Parsing:
- parseStatement: Dispatches to specific statement parsers based on the current token (e.g.,
parseLetStatement,parseReturnStatement,parseExpressionStatement). - parseLetStatement: Parses
letstatements. - parseReturnStatement: Parses
returnstatements. - parseExpressionStatement: Parses statements that are just expressions.
Expression Parsing:
-
parseExpression(precedence int): The core of Pratt parsing. It:
-
Looks up the prefix parse function for the current token.
-
Calls it to parse the left side of the expression.
-
While the next token has higher precedence, looks up the infix parse function and calls it, passing the left expression.
-
parseInfixExpression: Handles infix operators (e.g., +, -, *, /).
-
parsePrefixExpression: Handles prefix operators (e.g., -, !).
Other Features:
- parseGroupedExpression: Handles parentheses for grouping.
- parseIfExpression: Parses if expressions.
- parseFunctionLiteral: Parses function definitions.
- parseCallExpression: Parses function calls.
- parseArrayLiteral, parseHashLiteral, parseIndexExpression: Handle arrays, hashes, and indexing.
Error Handling: If a token is not expected, the parser records an error but tries to continue parsing.
Example: Parsing an Expression:
Give:
let x = 5 + 3 * 2;
- The parser will build an AST that represents the correct precedence: 5 + (3 * 2).
Extensibility:
New syntax can be added by registering new prefix/infix parse functions and updating the precedence map.
Purpose of AST:
The AST is a tree-like data structure that represents the syntactic structure of the source code.
- Each node in the tree corresponds to a construct in the language (e.g., expressions, statements, literals).
- The parser builds the AST, and the evaluator walks it to execute the program.
Core Interfaces:
type Node interface {
TokenLiteral() string
String() string
}
type Statement interface {
Node
statementNode()
}
type Expression interface {
Node
expressionNode()
}
- Node: The base interface for all AST nodes. Requires methods for getting the literal value of the token and a string representation.
- Statement: Represents statements (e.g.,
let x = 5;). Embeds Node. - Expression: Represents expressions (e.g., 5 + 3). Embeds Node.
Key Node Types:
Program root:
type Program struct {
Statements []Statement
}
- The root node of every Monkey program. Contains a list of statements.
Statements:
- LetStatement: Represents
letbindings. - ReturnStatement: Represents
returnstatements. - ExpressionStatement: A statement that is just an expression.
Expressions:
- Identifier: Variable names.
- IntegerLiteral, StringLiteral, Boolean: Literal values.
- PrefixExpression: Prefix operators (e.g., -x, !x).
- InfixExpression: Infix operators (e.g., x + y).
- IfExpression: Conditional expressions.
- FunctionLiteral: Function definitions.
- CallExpression: Function calls.
- ArrayLiteral, HashLiteral, IndexExpression: Arrays, hashes, and indexing.
Node Examples:
type InfixExpression struct {
Token token.Token // the operator token, e.g., '+'
Left Expression
Operator string
Right Expression
}
- Represents expressions like 5 + 3.
- Has left and right expressions and the operator.
type LetStatement struct {
Token token.Token // the 'let' token
Name *Identifier
Value Expression
}
- Represents statements like
let x = 5;.
String Representation:
- Each node implements a String() method, which is useful for debugging and testing.
- For example, an infix expression’s String() method will recursively call String() on its left and right nodes.
How the AST is Built:
- The parser creates these nodes as it recognizes language constructs.
- For example, when parsing
let x = 5 + 3;, the parser creates aLetStatementnode, whoseValueis anInfixExpressionnode.
Extensibility:
- To add new language features, you add new AST node types and update the parser to produce them.
Example AST for a Simple Program:
Given:
let x = 5 + 3;
The AST would look like:
- Program
- LetStatement
- Name: Identifier (“x”)
- Value: InfixExpression
- Left: IntegerLiteral (5)
- Operator: ”+”
- Right: IntegerLiteral (3)
- LetStatement
Object:
The Object component defines the runtime data structures (values) used during the evaluation of a Monkey program. These objects represent the results of evaluating AST nodes and are the entities manipulated by the program at runtime (e.g., numbers, strings, functions).
Integer Object Creation
case *ast.IntegerLiteral:
return &object.Integer{Value: node.Value}
Key Role in Pipeline
- Objects are the end result of the evaluation process. They are created by the evaluator based on the AST structure (built by the parser from tokens produced by the lexer). They represent the dynamic state of the program and are used for computations, storage in the environment, and returning results to the user.
Evaluator:
The Evaluator is the execution engine of the Monkey interpreter. It walks the AST, interprets the meaning of each node, and produces runtime values (objects) based on the program’s logic. It manages variable scopes through an environment and handles control flow like returns and conditionals.
- Eval Function: The main entry point (Eval(node ast.Node, env *object.Environment) object.Object) that recursively evaluates AST nodes.
case *ast.InfixExpression:
left := Eval(node.Left, env)
right := Eval(node.Right, env)
return evalInfixExpression(node.Operator, left, right)
- Recursively evaluates left and right sides, then applies the operator
(e.g., 3 + 5 results in
object.Integer{Value: 8}).
REPL (Read-Eval-Print Loop):
The REPL provides an interactive interface for users to enter Monkey code, see it evaluated immediately, and view the results. It is a user-friendly way to experiment with the language without writing full programs.
- Read Loop: Continuously reads input from the user (likely using standard input).
- Integration with Pipeline: Feeds input to the lexer, triggers parsing and evaluation, and displays results.
- Error Display: Shows syntax or runtime errors to the user for debugging.
- Prompt: Provides a visual cue (e.g., >>) to indicate readiness for input.
func Start(in io.Reader, out io.Writer) {
scanner := bufio.NewScanner(in)
env := object.NewEnvironment()
for {
fmt.Print(PROMPT)
scanned := scanner.Scan()
if !scanned {
return
}
line := scanner.Text()
l := lexer.New(line)
p := parser.New(l)
program := p.ParseProgram()
if len(p.Errors()) != 0 {
printParserErrors(out, p.Errors())
continue
}
evaluated := evaluator.Eval(program, env)
if evaluated != nil {
io.WriteString(out, evaluated.Inspect())
io.WriteString(out, "\n")
}
}
}
func printParserErrors(out io.Writer, errors []string) {
io.WriteString(out, MONKEY_FACE)
io.WriteString(out, "Woops! We ran into some monkey business here!\n")
io.WriteString(out, " parser errors:\n")
for _, msg := range errors {
io.WriteString(out, "\t"+msg+"\n")
}
}
Summary of the Entire Pipeline
Now that we’ve covered all components, here’s a comprehensive summary of the Monkey interpreter pipeline, showing how each part interacts to transform source code into results:
- Source Code (Input): The raw text of a Monkey program, user input in the REPL.
- Component: Initiated by Main Program (decides input source) or REPL (interactive input).
- Role: Starting point of the pipeline.
- Lexer: Reads the source code character by character and produces a stream of Tokens (e.g., keywords, identifiers, operators).
- Component: Lexer and Token.
- Role: Breaks down raw text into meaningful units for parsing.
- Example: let x = 5; becomes tokens like LET, IDENT(“x”), ASSIGN, INT(“5”), SEMICOLON.
- Parser: Consumes tokens and builds an Abstract Syntax Tree (AST), representing the syntactic structure of the program.
- Component: Parser and AST.
- Role: Transforms linear tokens into a hierarchical structure for evaluation.
- Example: Tokens for let x = 5; become a LetStatement node with an Identifier and IntegerLiteral.
- Evaluator: Walks the AST recursively, interpreting each node to compute results and manage state using an environment. Produces runtime Objects (values like integers, strings).
- Component: Evaluator and Object.
- Role: Executes the program by transforming static AST into dynamic behavior and results.
- Example: Evaluates 5 + 3 in the AST to object.Integer{Value: 8}.
- Output: The results of evaluation (objects) are either stored in the environment, returned as output in the REPL, or used for further computation.
- Component: Managed by REPL (for interactive output) or Main Program (for script output).
- Role: Presents the final results to the user or system.
Source Code → [Main Program/REPL] → Lexer → Tokens → Parser → AST → Evaluator → Objects → Output
When I first started coding, all of this would have seemed like magic to me. However, it’s really just the result of a few dedicated individuals putting in the extra effort to create something valuable and useful.