Add cut operator (`^`) to grammar #2104

traviscross · 2025-12-14T17:03:28Z

The cut operator (^) is a backtracking fence. Once the expression to its left succeeds, we become committed to the alternative; the remainder of the expression must parse successfully or parsing will fail. See Packrat Parsers Can Handle Practical Grammars in Mostly Constant Space, Mizushima et al., https://kmizu.github.io/papers/paste513-mizushima.pdf.

This operator solves a problem for us with C string literals. These literals cannot contain a null escape. But if we simply fail to lex the literal (e.g. `c"\0"`), we may instead lex it successfully as two separate tokens (`c "\0"`), and that would be incorrect.
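To make the failure mode concrete, here is a minimal sketch of a toy lexer, not the Reference's grammar or tooling. The token rules, the `lex` function, and the `use_cut` flag are all hypothetical, and the C string rule only rejects the `\0` escape. It shows `c"\0"` being lexed as two tokens when backtracking is allowed, and rejected when a cut follows the `c"` prefix.

```python
import re

# Hypothetical, heavily simplified token rules (assumptions for illustration).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Ordinary string: any non-quote, non-backslash char, or any backslash escape.
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')
# C string body after the c" prefix: same, except the escape \0 is not allowed.
C_STRING_BODY = re.compile(r'(?:[^"\\]|\\(?!0).)*"')


class LexError(Exception):
    pass


def lex(src, use_cut):
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        # Alternative 1: C string literal, written as  c" ^ body  when use_cut is set.
        if src.startswith('c"', pos):
            m = C_STRING_BODY.match(src, pos + 2)
            if m:
                tokens.append(("C_STRING", src[pos:m.end()]))
                pos = m.end()
                continue
            if use_cut:
                # Past the cut: the other alternatives are never tried.
                raise LexError(f"invalid C string literal at offset {pos}")
        # Alternatives 2 and 3: reached only when backtracking is permitted.
        kind, m = "IDENT", IDENT.match(src, pos)
        if not m:
            kind, m = "STRING", STRING.match(src, pos)
        if not m:
            raise LexError(f"no token at offset {pos}")
        tokens.append((kind, m.group()))
        pos = m.end()
    return tokens
```

Calling `lex('c"\\0"', use_cut=False)` yields an identifier token followed by a string token, which is the incorrect result described above; with `use_cut=True` the same input raises `LexError`.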

As long as we only use cut to express constraints that can be expressed in a regular language and we keep our alternations disjoint, the grammar can still be mechanically converted to a CFG.

Let's add the cut operator to our grammar and use it for C string literals and some similar constructs.

In the railroad diagrams, we'll render the cut as a "no backtracking" box around the expression or sequence of expressions after the cut. The idea is that once you enter the box the only way out is forward.

(H/t to @ehuss for suggesting the cut operator to solve this problem.)

cc @ehuss

This is stacked on #2097 and should merge after it.

mattheww · 2025-12-17T17:00:13Z

When you're defining a cut operator, it's important to specify the scope over which it cancels re-attempts.

For the lexer's purposes it would be fine to make that scope unlimited, saying that if the right-hand side of a production containing a cut fails after reaching the cut then the entire lexing process fails.

But in this PR the nearest thing to a definition of the cut operator is the reference to https://kmizu.github.io/papers/paste513-mizushima.pdf .

That paper defines a cut operator with a narrower scope: it allows a cut only on the left-hand side of an ordered choice expression, and cancels only the re-attempt of the right-hand side of that expression.

That definition doesn't work for the positions in which this PR is placing cuts.

mattheww · 2025-12-17T17:01:13Z

If you're still planning to use prioritised choice more widely in the Reference, then (given that the Reference already has the notion of reserved forms) perhaps the simplest way to define cut is to say that:

FOO ← a ^ b

is a shorthand for

FOO ← a b
RESERVED_PREFIX_OF_FOO ← a

with RESERVED_PREFIX_OF_FOO appearing immediately after FOO in the lexical grammar's top-level ordered choice for tokens.
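As an illustration of this desugaring, here is a minimal sketch under stated assumptions: `a` and `b` are stood in for by the literal characters `a` and `b`, and `TOKEN_RULES`, `ReservedForm`, and `next_token` are hypothetical names, not part of the Reference. The rules are tried in order, and a match by a `RESERVED_` rule is treated as a rejection by a check layered on top of the grammar.

```python
import re

# Ordered choice: rules are tried top to bottom and the first match wins.
# The patterns are hypothetical stand-ins for  a  and  b.
TOKEN_RULES = [
    ("FOO", re.compile(r"ab")),                    # FOO <- a b
    ("RESERVED_PREFIX_OF_FOO", re.compile(r"a")),  # RESERVED_PREFIX_OF_FOO <- a
    ("OTHER", re.compile(r"[a-z]")),               # stand-in for later token rules
]


class ReservedForm(Exception):
    pass


def next_token(src, pos=0):
    for name, pattern in TOKEN_RULES:
        m = pattern.match(src, pos)
        if m is None:
            continue
        if name.startswith("RESERVED_"):
            # The grammar itself matched; the rejection is an overlaid rule.
            raise ReservedForm(f"{name} matched at offset {pos}")
        return name, m.group()
    raise ValueError(f"no token at offset {pos}")
```

With these rules, `next_token("ab")` returns the `FOO` token, while `next_token("ac")` raises `ReservedForm` instead of falling through to `OTHER`, which is the same observable behaviour as `FOO ← a ^ b`.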

That characterisation also illustrates why adding the notion of a cut doesn't buy very much.

In particular if you use prioritised choice for the token rules then (for the lexing dialect used in Rust 2021 and later) you can simplify the existing reserved token rules, getting rid of the "except b or c or r or br or cr" business, and end up with something like this:

C_STRING_LITERAL ← c " (more stuff)
RESERVED_TOKEN_DOUBLE_QUOTE ← IDENTIFIER_OR_KEYWORD "
IDENTIFIER_OR_KEYWORD ← (same as at present)

This way c"\0" is rejected as a reserved form, and there's no need to bother the Reference's readers with a discussion of cuts.

traviscross · 2025-12-18T00:14:02Z

Yes, @ehuss and I had earlier discussed these same matters, point for point. Thanks for elaborating them here; good to have these written out on the PR.

With regard to using RESERVED_ and ordered choice, that was my first thought too for solving this problem. But it's less theoretically satisfying than cut (with global escape) -- to me anyway -- since the input is still parsed or lexed successfully as far as the grammar is concerned. The idea that RESERVED_ rules matching are failures is something that has to be overlaid.

traviscross · 2025-12-18T00:21:10Z

With regard to the locality (or lack thereof) of cut, it's interesting that Python's grammar has a rule with cut that would be a no-op under the Mizushima et al. interpretation:

assignment_expression:
    | NAME ':=' ~ expression

But their description of cut does not clearly suggest a global escape:

~ (“cut”): commit to the current alternative and fail the rule even if this fails to parse

traviscross · 2025-12-18T00:23:39Z

One library that takes the local interpretation is pegase. As they describe:

Used outside an ordered choice expression, it's simply a no-op.

mattheww · 2025-12-18T21:28:33Z

> With regard to the locality (or lack thereof) of cut, it's interesting that Python's grammar has a rule with cut that would be a no-op under the Mizushima et al. interpretation:

AIUI the cuts in the Python grammar are there for performance (and debuggability) reasons, not to change the accepted language.

The parser can discard some state as soon as it gets to the cut (as described in the paper you linked), so it's useful even when there's no choice operator following.

mattheww · 2025-12-18T21:30:39Z

While you're looking at the quoted literals, you might consider helping the cut operator earn its keep by changing

SUFFIX → IDENTIFIER_OR_KEYWORD_{except _}

to something equivalent to

SUFFIX → XID_Start XID_Continue* | _ ^ XID_Continue+

That would fix another bug in this family, by preventing something like "xxx"_ being analysed as two tokens.

traviscross · 2025-12-18T23:50:39Z

@ehuss has a forthcoming PR, likely to supersede this one, that resolves a large number of grammar issues. On that branch, looks like he went with:

SUFFIX ->
      `_` ^ XID_Continue+
    | XID_Start XID_Continue*

This clarifies the UNICODE_ESCAPE rule that the hex value must be a valid Unicode scalar value. This resolves the problem that a string like `"\u{ffffff}"` is not a valid token, but the grammar did not reflect that. I don't see a practical way to define this with character ranges. The resulting expression is huge. Note that this restriction means that the UNICODE_ESCAPE rule will not match an invalid value, and that all the places where UNICODE_ESCAPE is used, the preceding character must *not* be `\`, which forces those rules to fail their match. In turn the only rules that contain UNICODE_ESCAPE have `'` or `"` characters, which won't match any other rule in the grammar, forcing them to fail the parse. If all those assumptions seem too fragile, then we can consider adding the [cut operator](rust-lang#2104) just after the `\u` so that the interpretation is clear that a failure to match the part from the opening brace is an immediate parse failure.

ehuss · 2026-01-07T22:43:25Z

Here's a commit that changes this to be a unary operator: ehuss@21ea969

traviscross · 2026-01-08T09:50:45Z

Thanks; cherry-picked.

The hard cut operator (`^`) is a backtracking fence. Once the expressions to its left in a sequence match, the rest of the sequence must match or parsing fails unconditionally -- no enclosing expression can backtrack past the cut point.

This operator is necessary because some Rust tokens begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. If `c"\0"` fails to lex as a C string literal (because null bytes are not allowed in C strings), a PEG parser would normally backtrack and try other alternatives, potentially lexing it as the identifier `c` followed by the string `"\0"`. The hard cut after `c"` prevents this: once the opening delimiter matches, failure is unconditional.

We add `^` to the grammar notation and use it in the productions for C string literals, byte literals, byte string literals, and the raw string variants -- each of which has a prefix that could otherwise be consumed as a separate token.

In the notation chapter, we add a dedicated section explaining ordered alternation and backtracking, distinguishing a hard cut (which prevents all backtracking past the cut point) from a soft cut (which prevents backtracking only within the immediately enclosing choice), and citing Mizushima et al. for introducing cut operators to PEG.

In the grammar tooling, we add a `Cut` variant to the expression AST, parse `^` at the sequence level, and render it in both the Markdown and railroad diagram outputs.

In the railroad diagrams, the hard cut is rendered as a "no backtracking" box around the expressions after the cut point. The idea is that once you enter the box the only way out is forward.
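The difference between a hard cut and a soft cut can be sketched with a few toy PEG combinators. This is one possible encoding, assumed here for illustration only and not taken from the Reference's tooling: `lit`, `seq`, `choice`, `grammar`, and the `HARD`/`SOFT` markers are hypothetical names. A failure after a soft cut is absorbed by the immediately enclosing choice, which then fails normally; a failure after a hard cut propagates past every enclosing expression.

```python
class HardFail(Exception):
    """A sequence failed after passing a hard cut."""


class SoftFail(Exception):
    """A sequence failed after passing a soft cut."""


HARD, SOFT = object(), object()  # cut markers placed inside seq(...)


def lit(s):
    def parse(src, pos):
        return pos + len(s) if src.startswith(s, pos) else None
    return parse


def seq(*items):
    def parse(src, pos):
        cut = None
        for item in items:
            if item is HARD or item is SOFT:
                cut = item
                continue
            pos = item(src, pos)
            if pos is None:
                if cut is HARD:
                    raise HardFail()
                if cut is SOFT:
                    raise SoftFail()
                return None  # ordinary failure: the caller may backtrack
        return pos
    return parse


def choice(*alts):
    def parse(src, pos):
        for alt in alts:
            try:
                end = alt(src, pos)
            except SoftFail:
                # Skip the remaining alternatives of this choice only.
                return None
            if end is not None:
                return end
        return None  # HardFail is never caught, so it escapes every choice
    return parse


def grammar(cut=None):
    # inner <- "a" [cut] "b" / "a"      outer <- inner "!" / "ac"
    first = [lit("a")] + ([cut] if cut is not None else []) + [lit("b")]
    inner = choice(seq(*first), lit("a"))
    return choice(seq(inner, lit("!")), lit("ac"))
```

In this sketch, the input `a!` parses when there is no cut but fails with a soft cut, because the second alternative of `inner` is never tried. The input `ac` still parses with a soft cut, since the outer choice is free to backtrack, whereas with a hard cut it raises `HardFail`.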

traviscross force-pushed the TC/add-cut-to-grammar branch 2 times, most recently from 24690d2 to fc646a1 Compare December 15, 2025 06:15

rustbot added the S-waiting-on-author Status: The marked PR is awaiting some action (such as code changes) from the PR author. label Dec 18, 2025

ehuss mentioned this pull request Dec 20, 2025

Clarify UNICODE_ESCAPE valid token value #2123

Merged

ehuss force-pushed the TC/add-cut-to-grammar branch 2 times, most recently from 9330955 to babbd9b Compare February 13, 2026 03:38

traviscross force-pushed the TC/add-cut-to-grammar branch from babbd9b to aee21d8 Compare February 13, 2026 05:32

traviscross force-pushed the TC/add-cut-to-grammar branch from aee21d8 to 2fb34ab Compare February 13, 2026 05:53

traviscross marked this pull request as ready for review February 13, 2026 05:56

rustbot added the S-waiting-on-review Status: The marked PR is awaiting review from a maintainer label Feb 13, 2026
