Replies: 3 comments 2 replies
-
Hiya! I think the key examples you want are these:
OK, I've probably thrown enough at you to get started. This should hopefully spur some questions!
-
I think this is the biggest issue I want to discuss up-front. Most of my character classes are small, but XID_START and XID_CONTINUE are not. I suspect the right answer is to somehow cheat: when I would match an […] For example, suppose that my lexer believes that […]
Relatedly, one kind of token I want to offer recognition for is a generalization of Rust's […]
There's also the problem of continuing to explore the DFA after I exit this hand-implemented non-regular state. No idea how to do that.
Is it really the best idea to build an RE string and pass it into regex_syntax? I'm not being sarcastic, I genuinely want to know. I don't want to have to think too hard about quoting (even though regex_syntax does let me solve that problem, kinda). However, looking at the NFA builder was overwhelming and I didn't know where to start. I also don't understand how a string like "[abc]+" maps into transition IDs.
It sounds like this can be swapped out without too much pain, so that's a decision I can make later?
-
Andrew, this is extremely useful, and definitely enough for me to start messing around and hurting myself. I'm not sure when I'll get to that, but I will report back once I have concrete problems to debug.
-
Hello!
A month or so ago I chatted with Andrew about a somewhat unusual usage of regex_automata: taking a user-defined definition for a C-like language's tokens and generating a token-tree-yielding lexer from it. Andrew asked me to put together a discussion post so we can figure this out together!
The surface-level API for defining a syntax is described below, as a user would manipulate it. The resulting lexer produces something morally equivalent to proc_macro::TokenStream. I have not published this library yet, so I don't have any code I can point to... but ideally that should not be necessary.
Essentially I am building something similar to what a crate like logos does, but much more opinionated in what it can parse to simplify common cases and provide a simpler API for converting tokens into an AST.
Currently the lexer is implemented (somewhat incorrectly) by building a trie of prefixes that can start a lexeme, and then dealing with some special cases around things like Unicode XIDs.
This is... not ideal. Instead, I would like to be able to compile this specification, at runtime, into a DFA that recognizes a single token, and explore that DFA by hand. (This bit is necessary, since the token grammar I am parsing is almost entirely regular but technically context-sensitive in some places, so I need to implement a very limited version of backreferences.)
However, I have no desire to implement any of the DFA construction and evaluation algorithms necessary to do this for arbitrary user specifications, so I want to use regex_automata to do it, and learn something about DFAs along the way. I would appreciate some broad pointers on what to look at first. I expect the description I've given will be insufficient, so please ask clarifying questions about what I'm trying to do and I'll do my best to answer them; I don't know what I don't know, so to speak.