Starting a tree-walk interpreter in C

I have been reading Crafting Interpreters by Robert Nystrom for a while now. It is one of those books that is hard to put down — not because it is gripping in the thriller sense, but because every chapter ends with something actually working. You write code, you run it, something happens. That feedback loop is rare in systems books.

The book walks you through two implementations of a language called Lox. The first, jlox, is a tree-walk interpreter written in Java. The second, clox, is a bytecode VM in C. The intention is that you do both, and the second one teaches you why the first one is slow.

I am doing both. But I am also doing something slightly off-script: I am translating jlox into C before moving on to clox. Not because the book tells me to — it does not — but because I want to understand the tree-walk approach at the level C forces you to think at, before the bytecode abstraction takes over.

This series documents that process.

Why C

Java handles a lot of things quietly. Memory is managed. Strings are objects. You can throw an exception and something will catch it. C does none of that, and that is the point.

When you write a scanner in Java, you return an ArrayList<Token> and move on. In C, you have to decide: where does this list live? Who owns it? When does it get freed? These are not interesting questions in the abstract — they are interesting because answering them forces you to understand exactly what is happening.

I want that understanding. The discomfort is the point.

What I have so far: the scanner

The scanner — or lexer, the terms are interchangeable here — takes raw source code as a string and produces a flat list of tokens. A token is a small struct: a type, a pointer into the source for the lexeme, its length, and the line it appeared on.

typedef enum {
  // Single character
  TOKEN_LEFT_PAREN, TOKEN_RIGHT_PAREN,
  TOKEN_LEFT_BRACE, TOKEN_RIGHT_BRACE,
  TOKEN_COMMA, TOKEN_DOT, TOKEN_MINUS, TOKEN_PLUS,
  TOKEN_SEMICOLON, TOKEN_SLASH, TOKEN_STAR,

  // One or two characters
  TOKEN_BANG, TOKEN_BANG_EQUAL,
  TOKEN_EQUAL, TOKEN_EQUAL_EQUAL,
  TOKEN_GREATER, TOKEN_GREATER_EQUAL,
  TOKEN_LESS, TOKEN_LESS_EQUAL,

  // Literals
  TOKEN_IDENTIFIER, TOKEN_STRING, TOKEN_NUMBER,

  // Keywords
  TOKEN_AND, TOKEN_CLASS, TOKEN_ELSE, TOKEN_FALSE,
  TOKEN_FOR, TOKEN_FUN, TOKEN_IF, TOKEN_NIL, TOKEN_OR,
  TOKEN_PRINT, TOKEN_RETURN, TOKEN_SUPER, TOKEN_THIS,
  TOKEN_TRUE, TOKEN_VAR, TOKEN_WHILE,

  TOKEN_ERROR, TOKEN_EOF
} TokenType;

typedef struct {
  TokenType type;
  const char* start;
  int length;
  int line;
} Token;

The scanner itself holds a pointer to the source and tracks where it currently is:

typedef struct {
  const char* start;
  const char* current;
  int line;
} Scanner;

start marks the beginning of the current lexeme being scanned. current is where we are right now. When we finish a token, everything between start and current is the lexeme — no copying, just a pointer and a length. This is one of the places where C is actually nicer than Java for this kind of work.

The core loop is straightforward:

Token scanToken(Scanner* scanner) {
  skipWhitespace(scanner);
  scanner->start = scanner->current;

  if (isAtEnd(scanner)) return makeToken(scanner, TOKEN_EOF);

  char c = advance(scanner);

  if (isAlpha(c)) return identifier(scanner);
  if (isDigit(c)) return number(scanner);

  switch (c) {
    case '(': return makeToken(scanner, TOKEN_LEFT_PAREN);
    case ')': return makeToken(scanner, TOKEN_RIGHT_PAREN);
    case '{': return makeToken(scanner, TOKEN_LEFT_BRACE);
    case '}': return makeToken(scanner, TOKEN_RIGHT_BRACE);
    case ';': return makeToken(scanner, TOKEN_SEMICOLON);
    case ',': return makeToken(scanner, TOKEN_COMMA);
    case '.': return makeToken(scanner, TOKEN_DOT);
    case '-': return makeToken(scanner, TOKEN_MINUS);
    case '+': return makeToken(scanner, TOKEN_PLUS);
    case '/': return makeToken(scanner, TOKEN_SLASH);
    case '*': return makeToken(scanner, TOKEN_STAR);
    case '!':
      return makeToken(scanner,
        match(scanner, '=') ? TOKEN_BANG_EQUAL : TOKEN_BANG);
    case '=':
      return makeToken(scanner,
        match(scanner, '=') ? TOKEN_EQUAL_EQUAL : TOKEN_EQUAL);
    case '<':
      return makeToken(scanner,
        match(scanner, '=') ? TOKEN_LESS_EQUAL : TOKEN_LESS);
    case '>':
      return makeToken(scanner,
        match(scanner, '=') ? TOKEN_GREATER_EQUAL : TOKEN_GREATER);
    case '"': return string(scanner);
  }

  return errorToken(scanner, "Unexpected character.");
}

Nothing here should be surprising if you have read the book. The one thing worth noting is that makeToken does not copy the lexeme — it stores a pointer into the original source string. That is fine as long as the source outlives the tokens, which it does in our case since we own it in main.

What is next

The scanner works — I can feed it source and get a token stream back. Next is the parser, which means decisions about how to represent the AST in C. Java gives you inheritance for free; in C you have to simulate it, usually with tagged unions or a void pointer plus a type tag.

That is where it gets interesting.