Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXPERIMENT] Create a recursive descent version of KAS' parser #1746

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

kevinb-khan
Copy link
Contributor

@kevinb-khan kevinb-khan commented Oct 12, 2024

Summary:

The reason for creating this alternate version of the KAS parse is that I want to create a parser that outputs a MathBlocks AST for my hackathon project. This project is heavily based on MathBlocks which is a suite of node modules I've developed as a side project to facilitate the development of interact math applications on the web. I thought that it would be easier to create the parser for my hackathon project if I first had a recursive descent parser (as opposed to the current JISON-based parser). It's much hard to see what's going on with the JISON parser because parts of code the make up the parser are in different files. Also, the JISON parser doesn't produce type-safe code.

In the future, we may want replace KAS' current parser with this one. I think it'll make maintenance a bit easier since it uses plain old TypeScript code which means that maintainers don't have to understand JISON. The parser is implemented as a recursive descent parser which is one of the simplest types of parsers to write. Because it uses TypeScript it's also typesafe which also makes it easier to work on. The parser itself is also fewer lines 577 (shipped) vs 814 (shipped) + 214 (dev-only).

If we do want to adopt this parser we'll likely also want to convert the unit parser to be a recursive descent parser as well. This would allow us to remove JISON from the code base completely (JISON is a dev dependency so this wouldn't have too much impact on bundle size). Additionally, converting nodes.js to TypeScript and all of the node types to be ES6 classes would also help with maintainability, but would not be a requirement for adopting the new parser(s).

Issue: None

TODO:

  • handle custom decimal separator

Test plan:

  • yarn test packages/kas/src

@kevinb-khan kevinb-khan self-assigned this Oct 12, 2024
Copy link
Contributor

github-actions bot commented Oct 12, 2024

npm Snapshot: Published

Good news!! We've packaged up the latest commit from this PR (21a716f) and published it to npm. You
can install it using the tag PR1746.

Example:

yarn add @khanacademy/perseus@PR1746

If you are working in Khan Academy's webapp, you can run:

./dev/tools/bump_perseus_version.sh -t PR1746

Copy link
Contributor

github-actions bot commented Oct 12, 2024

Size Change: +14 B (0%)

Total Size: 867 kB

Filename Size Change
packages/kas/dist/es/index.js 38.8 kB +14 B (+0.04%)
ℹ️ View Unchanged
Filename Size
packages/keypad-context/dist/es/index.js 760 B
packages/kmath/dist/es/index.js 4.27 kB
packages/math-input/dist/es/index.js 78 kB
packages/math-input/dist/es/strings.js 1.79 kB
packages/perseus-core/dist/es/index.js 1.48 kB
packages/perseus-editor/dist/es/index.js 280 kB
packages/perseus-linter/dist/es/index.js 22.2 kB
packages/perseus/dist/es/index.js 420 kB
packages/perseus/dist/es/strings.js 3.4 kB
packages/pure-markdown/dist/es/index.js 3.66 kB
packages/simple-markdown/dist/es/index.js 12.4 kB

compressed-size-action

type PreparedRule = [RegExp, Token | ((match: string) => Token)];

const preparedRules: PreparedRule[] = rules.map((rule) => {
const regex = new RegExp("^(?:" + rule[0] + ")");
Copy link
Collaborator

@jeremywiebe jeremywiebe Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So each rule becomes a regex that's anchored to the beginning of the input string (^) and is non-capturing ((?:...)). Am I reading this correctly?

Copy link
Contributor Author

@kevinb-khan kevinb-khan Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. It's a simplified version of https://github.com/zaach/jison-lex/blob/master/regexp-lexer.js#L10-L68 which is what jison uses internally for its lexer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the non-capturing group part is really necessary since we only ever look at the whole string that's matched. 🤷‍♂️

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: I think it's useful to mark them as non-capturing. It is a form of stating that the match is not used.

packages/kas/src/lexer.ts Outdated Show resolved Hide resolved
packages/kas/src/parser.ts Show resolved Hide resolved
packages/kas/src/parser.ts Outdated Show resolved Hide resolved
packages/kas/src/parser.ts Outdated Show resolved Hide resolved
@kevinb-khan kevinb-khan marked this pull request as ready for review October 17, 2024 15:27
@khan-actions-bot
Copy link
Contributor

Gerald

Required Reviewers
  • @Khan/perseus for changes to packages/kas/src/lexer.ts, packages/kas/src/nodes.ts, packages/kas/src/parser.ts, packages/kas/src/__tests__/lexer.test.ts, packages/kas/src/__tests__/parser.test.ts, packages/kas/src/__tests__/__snapshots__/lexer.test.ts.snap

Don't want to be involved in this pull request? Comment #removeme and we won't notify you of further changes.

@khan-actions-bot khan-actions-bot requested a review from a team October 17, 2024 15:27
@kevinb-khan kevinb-khan marked this pull request as draft October 17, 2024 16:05
@kevinb-khan
Copy link
Contributor Author

This PR resulted in a minimal (+14 B) change in the bundle size because the new parser wasn't being included in bundle.

Copy link
Collaborator

@jeremywiebe jeremywiebe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few thoughts. I've also asked @jaredly to take a look as I get out of my depth when we get to the parser and how it handles the token stream. :)

throw new Error(`No match for ${input}`);
}

return [token, input.slice(currentMatch.length)];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely extremely nitpicky, but slicing the input every time we match means we create quite a few strings while parsing. Would it be alot more complex to just maintain an index that marks where we've processed to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it would that much additional complexity. I did it this way because that's what jison-lex does, see https://github.com/zaach/jison-lex/blob/master/regexp-lexer.js#L330. I'll try keeping track of the current index instead, but to do so I think restructuring the code to use a class for the lexer will help.

type PreparedRule = [RegExp, Token | ((match: string) => Token)];

const preparedRules: PreparedRule[] = rules.map((rule) => {
const regex = new RegExp("^(?:" + rule[0] + ")");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: I think it's useful to mark them as non-capturing. It is a form of stating that the match is not used.

//
// This method can only be called with `tokenKind`s for tokens that
// have a `value` property. See lexer.ts.
expectValue<Kind extends "FUNC" | "VAR" | "TRIG">(tokenKind: Kind): string {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way in TypeScript to model the last statement of the doc comment instead of saying the Kind extends "FUN" | "VAR" | "TRIG"? Those aren't the only tokens that carry a value (see CONST, SIGN, FLOAT, etc)

    // This method can only be called with `tokenKind`s for tokens that
    // have a `value` property.  See lexer.ts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's probably a fancy way to compute this type from the Token type. I'll give that a try.


equation(): Expr {
if (this.peek().kind === "EOF") {
return new Add([]);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why we place an empty Add operation here when we see EOF.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we do this either, but this is what was in the original code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants