The lexer library breaks a text into tokens so your program can understand its structure. Tokens are the smallest meaningful pieces: numbers, names (like variable and function names), operators, and quoted strings. You tell the lexer which multi-character sequences count as one token and which words are reserved keywords, and it handles the rest.
Setting Up the Lexer
Create a lexer::Lexer and register your language's rules before you parse anything. set_tokens ensures operators like += or >> are scanned as one token instead of two separate characters. set_keywords prevents reserved words from being treated as ordinary names — the lexer will report them exactly as written so your parser can treat them specially.
use lexer;
fn main() {
l = lexer::Lexer { };
l.set_tokens(["+=", "*=", "-=", "<=", ">=", "!=", "==", ">>", "<<", "->", "=>", ">>>", "..", "..=", "&&", "||"]);
l.set_keywords(["for", "in", "if", "else", "fn", "pub", "use", "struct", "enum", "match", "and", "or"]);
Reading Tokens
parse_string(name, source) feeds source text into the lexer. The name is used in error messages and position reports. After that, call the typed reader functions one by one to consume tokens in order.
int() consumes and returns the next integer token, or null if the current token is not an integer. long_int() does the same for integers suffixed with l. matches(s) consumes the next token only when it equals s and returns true; otherwise it leaves the token in place and returns false. peek() returns the next token as text without consuming it. position() returns the current location as file:line:col.
l.parse_string("Tokens", "12 += -2 * 3l >> 4");
assert(l.int() == 12, "Integer");
assert(!l.matches("+"), "Incorrect plus");
assert(l.peek() != "+", "Incorrect plus");
assert(l.matches("+="), "Incorrect plus_is");
assert(l.int() == -2, "Second integer");
assert(l.matches("*"), "Incorrect multiply");
assert(l.int() == null, "Long is not a plain integer");
assert(l.long_int() == 3, "Incorrect long");
assert(l.position() == "Tokens:1:15", "Incorrect position {l.position()}");
assert(!l.matches(">"), "Incorrect higher");
assert(l.matches(">>"), "Incorrect logical shift");
assert(l.position() == "Tokens:1:18", "Incorrect position {l.position()}");
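Because matches() consumes a token only on success, the readers compose into simple branching without any lookahead bookkeeping. A standalone sketch (the input string and variable names are illustrative, not part of the library):

```
// Standalone sketch: read "<int> + <int>" using only the readers above.
l2 = lexer::Lexer { };
l2.parse_string("Sketch", "1 + 2");
sum = l2.int();             // consume the first integer
if l2.matches("+") {        // consumed only when the next token is +
    sum = sum + l2.int();   // consume the second integer
}
assert(sum == 3, "Sum of both integers");
```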
String Literals and Comments
constant_text() reads a double-quoted string and handles special codes like \n (newline) and \\ (backslash). constant_character() reads a single-quoted character literal and returns the character.
l.parse_string("Texts", "\"123\" + '4'");
assert(l.constant_text() == "123", "Incorrect text literal");
assert(l.matches("+"), "Incorrect add");
assert("{l.constant_character()}" == "4", "Incorrect character literal");
The lexer collects // comments automatically as it scans; you do not need to handle them yourself. last_comment() returns the comment text accumulated since the last consumed token. When multiple comment lines appear in a row they are joined with newlines into a single string. comment_behind() is true when the comment appeared on the same line as the preceding token rather than on its own line above. is_finished() returns true once every token has been consumed.
l.parse_string("Comments", "// starting comments\n123 // same line comment\n// extra comment\n4");
assert(!l.comment_behind(), "Initial comment not behind");
assert(l.last_comment() == "starting comments", "Initial comment");
assert(l.int() == 123, "Content integer");
assert(l.comment_behind(), "Second comment is behind");
assert(l.last_comment() == "same line comment\nextra comment", "Second comment");
assert(!l.is_finished(), "Not Ready");
assert(l.int() == 4, "Second integer");
assert(l.last_comment() == "", "No remaining comment");
assert(l.is_finished(), "Ready");
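Since last_comment() can be read before consuming the token that follows it, a leading comment is easy to attach to the thing it documents. A standalone sketch (the input and names are illustrative):

```
// Standalone sketch: pair a leading comment with the token after it.
l2 = lexer::Lexer { };
l2.parse_string("Docs", "// Adds two numbers.\nadd");
doc = l2.last_comment();    // comment collected while scanning ahead
name = l2.peek();           // the documented name, not yet consumed
assert(doc == "Adds two numbers.", "Leading doc comment");
assert(name == "add", "Documented name");
```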
Embedded Format Expressions
Loft string literals can embed expressions with {expr}. The lexer exposes a protocol that lets you parse these yourself. When constant_text() reaches a {, it returns the literal text before it and sets is_formatting() to true. At that point call set_formatting(false) and parse the embedded expression with the usual token readers. When the expression is done, call set_formatting(true) and consume the closing } (it appears as }} in the example below because braces inside Loft string literals are escaped by doubling). Then constant_text() continues with the next segment of the string.
l.parse_string("Formatting", "\"abc{{12 + 34}}def\"");
assert(l.constant_text() == "abc", "Before formatting");
assert(l.is_formatting(), "Formatting");
l.set_formatting(false);
assert(l.int() == 12, "First integer");
assert(l.matches("+"), "Incorrect plus");
assert(l.int() == 34, "Second integer");
l.set_formatting(true);
assert(l.matches("}}"), "Incorrect closing brace");
assert(l.constant_text() == "def", "After formatting");
assert(!l.is_formatting(), "Formatting");
}
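Putting the protocol together, a format string with one embedded expression can be rendered back to plain text. A standalone sketch (string interpolation is used for concatenation; the input and names are illustrative):

```
// Standalone sketch: render "a{7}b" (braces escaped by doubling in source).
l2 = lexer::Lexer { };
l2.parse_string("Render", "\"a{{7}}b\"");
out = l2.constant_text();               // "a", stops at the {
if l2.is_formatting() {
    l2.set_formatting(false);
    out = "{out}{l2.int()}";            // parse and splice the value
    l2.set_formatting(true);
    l2.matches("}}");                   // consume the closing }
    out = "{out}{l2.constant_text()}";  // remaining segment "b"
}
assert(out == "a7b", "Rendered text");
```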