Replies: 3 comments 2 replies
-
Hiya! I think the key examples you want are these:
OK, I've probably thrown enough at you to get started. This should hopefully spur some questions!
-
I think this is the biggest issue I want to discuss up-front. Most of my character classes are small, but XID_START and XID_CONTINUE are not. I suspect the right answer is to somehow cheat: when I would match an […] For example, suppose that my lexer believes that […]
Relatedly, one kind of token I want to offer recognition for is a generalization of Rust's […]
There's also the problem of continuing to explore the DFA after I exit this hand-implemented non-regular state. No idea how to do that.
Is it really the best idea to build an RE string and pass it into regex_syntax? I'm not being sarcastic, I genuinely want to know. I don't want to have to think too hard about quoting (even though regex_syntax does let me solve that problem, kinda). However, looking at the NFA builder was overwhelming and I didn't know where to start. I also don't understand how a string like "[abc]+" maps into transition IDs.
It sounds like this can be swapped out without too much pain, so that's a decision I can make later?
-
Andrew, this is extremely useful, and definitely enough for me to start messing around and hurting myself. I'm not sure when I'll get to that, but I will report back once I have concrete problems to debug.
-
Hello!
A month or so ago I chatted with Andrew about a somewhat unusual usage of regex_automata: taking a user-defined definition for a C-like language's tokens and generating a token-tree-yielding lexer from it. Andrew asked me to put together a discussion post so we can figure this out together!
The surface-level API for defining a syntax is described below, as a user would manipulate it. The resulting lexer produces something morally equivalent to proc_macro::TokenStream. I have not published this library yet, so I don't have any code I can point to... but ideally that should not be necessary.
Essentially I am building something similar to what a crate like logos does, but much more opinionated in what it can parse to simplify common cases and provide a simpler API for converting tokens into an AST.
Currently the lexer is implemented (somewhat incorrectly) by building a trie of prefixes that can start a lexeme, and then dealing with some special cases around things like Unicode XIDs.
This is... not ideal. Instead, I would like to be able to compile this specification, at runtime, into a DFA that recognizes a single token, and explore that DFA by hand. (This bit is necessary, since the token grammar I am parsing is almost entirely regular but technically context-sensitive in some places, so I need to implement a very limited version of backreferences.)
However, I have no desire to implement any of the DFA construction and evaluation algorithms necessary to do this for arbitrary user specifications, so I want to use regex_automata to do it, and learn something about DFAs along the way. I would appreciate some broad pointers on what to look at first. I expect the description I've given will be insufficient, so please ask clarifying questions about what I'm trying to do and I'll do my best to answer them; I don't know what I don't know, so to speak.