[EXPERIMENT] Create a recursive descent version of KAS' parser #1746
base: main
Conversation
npm Snapshot: Published

Good news! We've packaged up the latest commit from this PR (21a716f) and published it to npm.

Example: `yarn add @khanacademy/perseus@PR1746`

If you are working in Khan Academy's webapp, you can run: `./dev/tools/bump_perseus_version.sh -t PR1746`
Size Change: +14 B (0%) Total Size: 867 kB
```ts
type PreparedRule = [RegExp, Token | ((match: string) => Token)];

const preparedRules: PreparedRule[] = rules.map((rule) => {
    const regex = new RegExp("^(?:" + rule[0] + ")");
```
So each rule becomes a regex that's anchored to the beginning of the input string (`^`) and is non-capturing (`(?:...)`). Am I reading this correctly?
Yes, that's correct. It's a simplified version of https://github.com/zaach/jison-lex/blob/master/regexp-lexer.js#L10-L68, which is what `jison` uses internally for its lexer.
I don't think the non-capturing group part is really necessary since we only ever look at the whole string that's matched. 🤷‍♂️
FWIW: I think it's useful to mark them as non-capturing. It is a form of stating that the match is not used.
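To make the thread above concrete, here's a minimal, self-contained sketch of how the anchoring and non-capturing grouping interact. The rule set and `Token` shape are hypothetical stand-ins, not KAS' actual rules:

```typescript
// Hypothetical token and rule shapes, for illustration only.
type Token = {kind: string; value: string};
type Rule = [string, (match: string) => Token];

const rules: Rule[] = [
    ["\\d+", (m) => ({kind: "FLOAT", value: m})],
    ["[+\\-]", (m) => ({kind: "SIGN", value: m})],
];

// Wrapping each pattern in ^(?:...) anchors it to the start of the
// remaining input. The group is non-capturing, which signals that we
// only ever use the whole match (m[0]), never a sub-group.
const preparedRules = rules.map(
    ([pattern, toToken]) =>
        [new RegExp("^(?:" + pattern + ")"), toToken] as const,
);

function matchStart(input: string): Token | null {
    for (const [regex, toToken] of preparedRules) {
        const m = regex.exec(input);
        if (m) {
            return toToken(m[0]);
        }
    }
    return null;
}
```

Because of the `^` anchor, `matchStart` only ever matches a prefix of the input; a rule that would match later in the string is ignored.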
…, add a comment with a brief description of how the parser works
…n is being consumed
Force-pushed from 4e016f1 to b5abf08.
This PR resulted in a minimal (+14 B) change in the bundle size because the new parser isn't being included in the bundle.
I have a few thoughts. I've also asked @jaredly to take a look as I get out of my depth when we get to the parser and how it handles the token stream. :)
```ts
    throw new Error(`No match for ${input}`);
}

return [token, input.slice(currentMatch.length)];
```
This is likely extremely nitpicky, but slicing the input every time we match means we create quite a few strings while parsing. Would it be a lot more complex to just maintain an index that marks where we've processed to?
I don't think it would add that much additional complexity. I did it this way because that's what jison-lex does; see https://github.com/zaach/jison-lex/blob/master/regexp-lexer.js#L330. I'll try keeping track of the current index instead, but to do so I think restructuring the code to use a class for the lexer will help.
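One way the index-based version could look, sketched with a hypothetical rule set: sticky (`y`-flag) regexes match exactly at `lastIndex`, so the lexer can advance a cursor without ever slicing the input string.

```typescript
type Token = {kind: string; value: string};

// Sticky (y-flag) regexes match exactly at lastIndex, so no slicing
// of the input string is needed. Rule set is illustrative only.
const stickyRules: Array<[RegExp, string]> = [
    [/\d+/y, "FLOAT"],
    [/[+\-]/y, "SIGN"],
    [/\s+/y, "WS"],
];

class Lexer {
    private index = 0;
    constructor(private input: string) {}

    next(): Token | null {
        if (this.index >= this.input.length) {
            return null; // EOF
        }
        for (const [regex, kind] of stickyRules) {
            regex.lastIndex = this.index;
            const m = regex.exec(this.input);
            if (m) {
                this.index = regex.lastIndex; // advance the cursor
                return {kind, value: m[0]};
            }
        }
        throw new Error(`No match at index ${this.index}`);
    }
}
```

With this shape, the error path can also report the exact offset of the failure instead of the remaining string.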
```ts
//
// This method can only be called with `tokenKind`s for tokens that
// have a `value` property. See lexer.ts.
expectValue<Kind extends "FUNC" | "VAR" | "TRIG">(tokenKind: Kind): string {
```
Is there a way in TypeScript to model the last statement of the doc comment instead of saying `Kind extends "FUNC" | "VAR" | "TRIG"`? Those aren't the only tokens that carry a `value` (see `CONST`, `SIGN`, `FLOAT`, etc.).
```ts
// This method can only be called with `tokenKind`s for tokens that
// have a `value` property. See lexer.ts.
```
There's probably a fancy way to compute this type from the `Token` type. I'll give that a try.
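For example, if `Token` is a discriminated union, the built-in `Extract` utility type can compute the kinds that carry a `value`. The union below is a hypothetical stand-in; KAS' actual `Token` type lives in lexer.ts:

```typescript
// Hypothetical Token union for illustration; the real one is larger.
type Token =
    | {kind: "FUNC"; value: string}
    | {kind: "VAR"; value: string}
    | {kind: "CONST"; value: string}
    | {kind: "LPAREN"}
    | {kind: "EOF"};

// Extract keeps only the union members that have a `value` property,
// so ValueKind stays in sync with Token automatically.
type ValueToken = Extract<Token, {value: string}>;
type ValueKind = ValueToken["kind"]; // "FUNC" | "VAR" | "CONST"

function expectValue<K extends ValueKind>(token: Token, kind: K): string {
    if (token.kind !== kind) {
        throw new Error(`Expected ${kind}, got ${token.kind}`);
    }
    return (token as ValueToken).value;
}
```

Adding a new value-carrying token to `Token` then widens `ValueKind` without any edit to `expectValue`'s signature.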
```ts
equation(): Expr {
    if (this.peek().kind === "EOF") {
        return new Add([]);
```
I'm confused why we place an empty `Add` operation here when we see `EOF`.
I don't know why we do this either, but this is what was in the original code.
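One plausible reading, though this is an assumption rather than something the original code confirms: an `Add` with no terms is the identity element for addition, so an empty input parses to a node that evaluates to 0. A toy illustration:

```typescript
// Toy Add node, NOT KAS' actual implementation: illustrates that an
// empty sum naturally evaluates to the additive identity.
class Add {
    constructor(public terms: number[]) {}
    eval(): number {
        // reduce over zero terms returns the initial value, 0.
        return this.terms.reduce((sum, t) => sum + t, 0);
    }
}
```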
Summary:
The reason for creating this alternate version of the KAS parser is that I want to create a parser that outputs a MathBlocks AST for my hackathon project. This project is heavily based on MathBlocks, which is a suite of node modules I've developed as a side project to facilitate the development of interactive math applications on the web. I thought that it would be easier to create the parser for my hackathon project if I first had a recursive descent parser (as opposed to the current JISON-based parser). It's much harder to see what's going on with the JISON parser because the parts of code that make up the parser are spread across different files. Also, the JISON parser doesn't produce type-safe code.
In the future, we may want to replace KAS' current parser with this one. I think it'll make maintenance a bit easier since it uses plain old TypeScript code, which means that maintainers don't have to understand JISON. The parser is implemented as a recursive descent parser, which is one of the simplest types of parsers to write. Because it uses TypeScript, it's also type-safe, which makes it easier to work on. The parser itself is also shorter: 577 lines (shipped) vs. 814 (shipped) + 214 (dev-only).
If we do want to adopt this parser, we'll likely also want to convert the unit parser to a recursive descent parser as well. This would allow us to remove JISON from the codebase completely (JISON is a dev dependency, so removing it wouldn't have much impact on bundle size). Additionally, converting nodes.js to TypeScript and converting all of the node types to ES6 classes would also help with maintainability, but neither is a requirement for adopting the new parser(s).
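For readers unfamiliar with the technique: a recursive descent parser maps each grammar rule to a plain function that consumes tokens and calls the functions for its sub-rules. A minimal sketch for a toy grammar (not KAS' actual grammar or node types):

```typescript
// Toy AST and grammar:
//   expr := term (("+" | "-") term)*
//   term := FLOAT
type Expr =
    | {type: "num"; value: number}
    | {type: "add"; left: Expr; right: Expr}
    | {type: "sub"; left: Expr; right: Expr};

function parse(input: string): Expr {
    const tokens = input.match(/\d+|[+\-]/g) ?? [];
    let pos = 0;

    // Each grammar rule becomes a function; recursion in the grammar
    // becomes recursion (or iteration) in the code.
    function term(): Expr {
        const tok = tokens[pos++];
        if (!tok || !/^\d+$/.test(tok)) {
            throw new Error(`Expected number, got ${tok}`);
        }
        return {type: "num", value: Number(tok)};
    }

    function expr(): Expr {
        let left = term();
        while (tokens[pos] === "+" || tokens[pos] === "-") {
            const op = tokens[pos++];
            const right = term();
            left =
                op === "+"
                    ? {type: "add", left, right}
                    : {type: "sub", left, right};
        }
        return left;
    }

    return expr();
}
```

Because the call structure mirrors the grammar, the whole parser lives in one file and TypeScript can type-check the AST it builds, which is the maintainability argument made above.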
Issue: None
TODO:
Test plan: