7 changes: 6 additions & 1 deletion dev-guide/src/grammar.md
@@ -35,7 +35,9 @@ Name -> <Alphanumeric or `_`>+

Expression -> Sequence (` `* `|` ` `* Sequence)*

Sequence -> (` `* AdornedExpr)+
Sequence ->
(` `* AdornedExpr)* ` `* Cut
| (` `* AdornedExpr)+

AdornedExpr -> ExprRepeat Suffix? Footnote?

@@ -92,6 +94,8 @@ Prose -> `<` ~[`>` LF]+ `>`
Group -> `(` ` `* Expression ` `* `)`

NegativeExpression -> `~` ( Charset | Terminal | NonTerminal )

Cut -> `^` Sequence
```

The general format is a series of productions separated by blank lines. The expressions are as follows:
@@ -110,6 +114,7 @@ The general format is a series of productions separated by blank lines. The expr
| Prose | \<any ASCII character except CR\> | An English description of what should be matched, surrounded in angle brackets. |
| Group | (\`,\` Parameter)+ | Groups an expression for the purpose of precedence, such as applying a repetition operator to a sequence of other expressions. |
| NegativeExpression | ~\[\` \` LF\] | Matches anything except the given Charset, Terminal, or Nonterminal. |
| Cut | Expr1 ^ Expr2 \| Expr3 | The hard cut operator. Once the expressions preceding `^` in the sequence match, the rest of the sequence must match or parsing fails unconditionally --- no enclosing expression can backtrack past the cut point. |
| Sequence | \`fn\` Name Parameters | A sequence of expressions that must match in order. |
| Alternation | Expr1 \| Expr2 | Matches only one of the given expressions, separated by the vertical pipe character. |
| Suffix | \_except \[LazyBooleanExpression\]\_ | Adds a suffix to the previous expression to provide an additional English description, rendered in subscript. This can contain limited Markdown, but try to avoid anything except basics like links. |
16 changes: 16 additions & 0 deletions src/notation.md
@@ -24,13 +24,23 @@ The following notations are used by the *Lexer* and *Syntax* grammar snippets:
| ~\[ ] | ~\[`b` `B`] | Any characters, except those listed |
| ~`string` | ~`\n`, ~`*/` | Any characters, except this sequence |
| ( ) | (`,` _Parameter_)<sup>?</sup> | Groups items |
| ^ | `b'` ^ ASCII_FOR_CHAR | The rest of the sequence must match or parsing fails unconditionally ([hard cut operator]) |
| U+xxxx | U+0060 | A single unicode character |
| \<text\> | \<any ASCII char except CR\> | An English description of what should be matched |
| Rule <sub>suffix</sub> | IDENTIFIER_OR_KEYWORD <sub>_except `crate`_</sub> | A modification to the previous rule |
| // Comment. | // Single line comment. | A comment extending to the end of the line. |

Sequences have a higher precedence than `|` alternation.

r[notation.grammar.cut]
### The hard cut operator

The grammar uses ordered alternation: the parser tries alternatives left to right and takes the first that matches. If an alternative fails partway through a sequence, the parser normally backtracks and tries the next alternative. The cut operator (`^`) prevents this. Once every expression to the left of `^` in a sequence has matched, the rest of the sequence must match or parsing fails unconditionally.

Mizushima et al. introduced [cut operators][cut operator paper] to parsing expression grammars. In the PEG literature, a *soft cut* prevents backtracking only within the immediately enclosing ordered choice --- outer choices can still recover. A *hard cut* prevents all backtracking past the cut point; failure is definitive. The `^` used in this grammar is a hard cut.

The hard cut operator is necessary because some tokens in Rust begin with a prefix that is itself a valid token. For example, `c"` begins a C string literal, but `c` alone is a valid identifier. Without the cut, if `c"\0"` failed to lex as a C string literal (because null bytes are not allowed in C strings), the parser could backtrack and lex it as two tokens: the identifier `c` and the string literal `"\0"`. The [cut after `c"`] prevents this --- once the opening delimiter is recognized, the parser cannot go back. The same reasoning applies to [byte literals], [byte string literals], [raw string literals], and other literals with prefixes that are themselves valid tokens.
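
To make this concrete, here is a small, self-contained Rust sketch (not part of the Reference's tooling; the names `Fail`, `c_string`, `identifier`, and `token` are invented for this illustration). It distinguishes an ordinary non-match, which an enclosing alternation may recover from, from a failure after a cut, which nothing may recover from:

```rust
// Hypothetical sketch only: models hard-cut behavior, not rustc's lexer.
#[derive(Debug, PartialEq)]
enum Fail {
    /// No match; an enclosing alternation may still try its next branch.
    Soft,
    /// Failure after a cut has fired; nothing may backtrack past it.
    Hard,
}

type LexResult<'a> = Result<&'a str, Fail>;

/// Models `c"` ^ ... `"`: once the `c"` prefix matches, failure is hard.
fn c_string(input: &str) -> LexResult<'_> {
    let Some(rest) = input.strip_prefix("c\"") else {
        return Err(Fail::Soft); // not a C string literal at all
    };
    // The cut has fired: every problem from here on is fatal.
    let end = rest.find('"').ok_or(Fail::Hard)?;
    if rest[..end].contains('\0') {
        return Err(Fail::Hard); // NUL is rejected; do not re-lex as `c` + a string
    }
    Ok(&rest[end + 1..])
}

fn identifier(input: &str) -> LexResult<'_> {
    let len = input
        .bytes()
        .take_while(|b| b.is_ascii_alphanumeric() || *b == b'_')
        .count();
    if len == 0 { Err(Fail::Soft) } else { Ok(&input[len..]) }
}

/// Ordered choice: only a `Soft` failure falls through to the next branch.
fn token(input: &str) -> LexResult<'_> {
    match c_string(input) {
        Err(Fail::Soft) => identifier(input),
        other => other,
    }
}

fn main() {
    assert_eq!(token(r#"c"ok" rest"#), Ok(" rest"));
    // Without the cut, this could backtrack to the identifier `c`;
    // with it, lexing fails outright.
    assert_eq!(token("c\"\u{0}\""), Err(Fail::Hard));
}
```

Only a `Soft` failure lets `token` fall through to the identifier branch; that fallback is exactly what the `^` in the grammar rules out once `c"` has matched.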

r[notation.grammar.string-tables]
### String table productions

@@ -52,7 +62,13 @@ r[notation.grammar.visualizations]
Below each grammar block is a button to toggle the display of a [syntax diagram]. A square element is a non-terminal rule, and a rounded rectangle is a terminal.

[binary operators]: expressions/operator-expr.md#arithmetic-and-logical-binary-operators
[byte literals]: tokens.md#r-lex.token.byte.syntax
[byte string literals]: tokens.md#r-lex.token.str-byte.syntax
[cut after `c"`]: tokens.md#r-lex.token.str-c.syntax
[cut operator paper]: https://kmizu.github.io/papers/paste513-mizushima.pdf
[hard cut operator]: notation.md#the-hard-cut-operator
[keywords]: keywords.md
[raw string literals]: tokens.md#r-lex.token.literal.str-raw.syntax
[syntax diagram]: https://en.wikipedia.org/wiki/Syntax_diagram
[tokens]: tokens.md
[unary operators]: expressions/operator-expr.md#borrow-operators
13 changes: 6 additions & 7 deletions src/tokens.md
@@ -217,7 +217,7 @@ r[lex.token.literal.str-raw.syntax]
RAW_STRING_LITERAL -> `r` RAW_STRING_CONTENT SUFFIX?

RAW_STRING_CONTENT ->
`"` ( ~CR )*? `"`
`"` ^ ( ~CR )*? `"`
| `#` RAW_STRING_CONTENT `#`
```

@@ -251,7 +251,7 @@ r[lex.token.byte]
r[lex.token.byte.syntax]
```grammar,lexer
BYTE_LITERAL ->
`b'` ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?
`b'` ^ ( ASCII_FOR_CHAR | BYTE_ESCAPE ) `'` SUFFIX?

ASCII_FOR_CHAR ->
<any ASCII (i.e. 0x00 to 0x7F) except `'`, `\`, LF, CR, or TAB>
@@ -270,7 +270,7 @@ r[lex.token.str-byte]
r[lex.token.str-byte.syntax]
```grammar,lexer
BYTE_STRING_LITERAL ->
`b"` ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?
`b"` ^ ( ASCII_FOR_STRING | BYTE_ESCAPE | STRING_CONTINUE )* `"` SUFFIX?

ASCII_FOR_STRING ->
<any ASCII (i.e. 0x00 to 0x7F) except `"`, `\`, or CR>
@@ -306,7 +306,7 @@ RAW_BYTE_STRING_LITERAL ->
`br` RAW_BYTE_STRING_CONTENT SUFFIX?

RAW_BYTE_STRING_CONTENT ->
`"` ASCII_FOR_RAW*? `"`
`"` ^ ASCII_FOR_RAW*? `"`
| `#` RAW_BYTE_STRING_CONTENT `#`

ASCII_FOR_RAW ->
@@ -343,13 +343,12 @@ r[lex.token.str-c]
r[lex.token.str-c.syntax]
```grammar,lexer
C_STRING_LITERAL ->
`c"` (
`c"` ^ (
~[`"` `\` CR NUL]
| BYTE_ESCAPE _except `\0` or `\x00`_
| UNICODE_ESCAPE _except `\u{0}`, `\u{00}`, …, `\u{000000}`_
| STRING_CONTINUE
)* `"` SUFFIX?

```
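
As a concrete example of the behavior this cut describes (assuming a toolchain that supports C string literals, i.e. Rust 1.77 or later):

```rust
use std::ffi::CStr;

fn main() {
    // Accepted: after `c"` the lexer is committed to a C string literal,
    // and this content is valid.
    let s: &CStr = c"hello";
    assert_eq!(s.to_bytes(), b"hello");

    // Rejected at lex time: a NUL is not allowed after `c"`, and the lexer
    // does not fall back to reading `c` as an identifier followed by `"\0"`.
    // let bad = c"\0";
}
```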

r[lex.token.str-c.intro]
@@ -402,7 +401,7 @@ RAW_C_STRING_LITERAL ->
`cr` RAW_C_STRING_CONTENT SUFFIX?

RAW_C_STRING_CONTENT ->
`"` ( ~[CR NUL] )*? `"`
`"` ^ ( ~[CR NUL] )*? `"`
| `#` RAW_C_STRING_CONTENT `#`
```

5 changes: 4 additions & 1 deletion tools/grammar/src/lib.rs
@@ -76,6 +76,8 @@ pub enum ExpressionKind {
Charset(Vec<Characters>),
/// ``~[` ` LF]``
NegExpression(Box<Expression>),
/// `^ A B C`
Cut(Box<Expression>),
/// `U+0060`
Unicode(String),
}
@@ -116,7 +118,8 @@ impl Expression {
| ExpressionKind::RepeatPlus(e)
| ExpressionKind::RepeatPlusNonGreedy(e)
| ExpressionKind::RepeatRange(e, _, _)
| ExpressionKind::NegExpression(e) => {
| ExpressionKind::NegExpression(e)
| ExpressionKind::Cut(e) => {
e.visit_nt(callback);
}
ExpressionKind::Alt(es) | ExpressionKind::Sequence(es) => {
100 changes: 86 additions & 14 deletions tools/grammar/src/parser.rs
@@ -173,18 +173,19 @@
match es.len() {
0 => Ok(None),
1 => Ok(Some(es.pop().unwrap())),
_ => Ok(Some(Expression {
kind: ExpressionKind::Alt(es),
suffix: None,
footnote: None,
})),
_ => Ok(Some(Expression::new_kind(ExpressionKind::Alt(es)))),
}
}

fn parse_seq(&mut self) -> Result<Option<Expression>> {
let mut es = Vec::new();
loop {
self.space0();
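// A `^` begins a cut; `parse_cut` wraps the rest of the sequence, so the loop ends after it.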
if self.peek() == Some(b'^') {
let cut = self.parse_cut()?;
es.push(cut);
break;
}
let Some(e) = self.parse_expr1()? else {
break;
};
@@ -201,6 +202,19 @@ impl Parser<'_> {
}
}

/// Parse a cut (`^`) operator; the resulting `Cut` expression wraps the rest of the sequence after it.
fn parse_cut(&mut self) -> Result<Expression> {
self.expect("^", "expected `^`")?;
let Some(rhs) = self.parse_seq()? else {
bail!(self, "expected expression after cut operator");
};
Ok(Expression {
kind: ExpressionKind::Cut(Box::new(rhs)),
suffix: None,
footnote: None,
})
}

fn parse_expr1(&mut self) -> Result<Option<Expression>> {
let Some(next) = self.peek() else {
return Ok(None);
@@ -506,13 +520,71 @@ fn translate_position(input: &str, index: usize) -> (&str, usize, usize) {
("", line_number + 1, 0)
}

#[test]
fn translate_tests() {
assert_eq!(translate_position("", 0), ("", 0, 0));
assert_eq!(translate_position("test", 0), ("test", 1, 1));
assert_eq!(translate_position("test", 3), ("test", 1, 4));
assert_eq!(translate_position("test", 4), ("test", 1, 5));
assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
#[cfg(test)]
mod tests {
use crate::parser::{parse_grammar, translate_position};
use crate::{ExpressionKind, Grammar};
use std::path::Path;

#[test]
fn test_translate() {
assert_eq!(translate_position("", 0), ("", 0, 0));
assert_eq!(translate_position("test", 0), ("test", 1, 1));
assert_eq!(translate_position("test", 3), ("test", 1, 4));
assert_eq!(translate_position("test", 4), ("test", 1, 5));
assert_eq!(translate_position("test\ntest2", 4), ("test", 1, 5));
assert_eq!(translate_position("test\ntest2", 5), ("test2", 2, 1));
assert_eq!(translate_position("test\ntest2\n", 11), ("", 3, 0));
}

fn parse(input: &str) -> Result<Grammar, String> {
let mut grammar = Grammar::default();
parse_grammar(input, &mut grammar, "test", Path::new("test.md"))
.map_err(|e| e.to_string())?;
Ok(grammar)
}

#[test]
fn test_cut() {
let input = "Rule -> A ^ B | C";
let grammar = parse(input).unwrap();
grammar.productions.get("Rule").unwrap();
}

#[test]
fn test_cut_captures() {
let input = "Rule -> A ^ B C | D";
let grammar = parse(input).unwrap();
let rule = grammar.productions.get("Rule").unwrap();
// The top-level expression is an alternation: (A ^ B C) | D.
let ExpressionKind::Alt(alts) = &rule.expression.kind else {
panic!("expected Alt, got {:?}", rule.expression.kind);
};
assert_eq!(alts.len(), 2);
// First alternative is a sequence: A, Cut(Sequence(B, C)).
let ExpressionKind::Sequence(seq) = &alts[0].kind else {
panic!("expected Sequence, got {:?}", alts[0].kind);
};
assert_eq!(seq.len(), 2);
assert!(matches!(&seq[0].kind, ExpressionKind::Nt(n) if n == "A"));
// The cut captures the rest of the sequence (B and C).
let ExpressionKind::Cut(cut_inner) = &seq[1].kind else {
panic!("expected Cut, got {:?}", seq[1].kind);
};
let ExpressionKind::Sequence(cut_seq) = &cut_inner.kind else {
panic!("expected Sequence inside Cut, got {:?}", cut_inner.kind);
};
assert_eq!(cut_seq.len(), 2);
assert!(matches!(&cut_seq[0].kind, ExpressionKind::Nt(n) if n == "B"));
assert!(matches!(&cut_seq[1].kind, ExpressionKind::Nt(n) if n == "C"));
// Second alternative is just D.
assert!(matches!(&alts[1].kind, ExpressionKind::Nt(n) if n == "D"));
}

#[test]
fn test_cut_fail_trailing() {
let input = "Rule -> A ^";
let err = parse(input).unwrap_err();
assert!(err.contains("expected expression after cut operator"));
}
}
5 changes: 5 additions & 0 deletions tools/mdbook-spec/src/grammar/render_markdown.rs
@@ -79,6 +79,7 @@ fn last_expr(expr: &Expression) -> &ExpressionKind {
| ExpressionKind::Comment(_)
| ExpressionKind::Charset(_)
| ExpressionKind::NegExpression(_)
| ExpressionKind::Cut(_)
| ExpressionKind::Unicode(_) => &expr.kind,
}
}
@@ -171,6 +172,10 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, output: &mut String) {
output.push('~');
render_expression(e, cx, output);
}
ExpressionKind::Cut(e) => {
output.push_str("^ ");
render_expression(e, cx, output);
}
ExpressionKind::Unicode(s) => {
output.push_str("U+");
output.push_str(s);
5 changes: 5 additions & 0 deletions tools/mdbook-spec/src/grammar/render_railroad.rs
@@ -214,6 +214,11 @@ fn render_expression(expr: &Expression, cx: &RenderCtx, stack: bool) -> Option<B
let ch = node_for_nt(cx, "CHAR");
Box::new(Except::new(Box::new(ch), n))
}
ExpressionKind::Cut(e) => {
let rhs = render_expression(e, cx, stack)?;
let lbox = LabeledBox::new(rhs, Comment::new("no backtracking".to_string()));
Box::new(lbox)
}
ExpressionKind::Unicode(s) => Box::new(Terminal::new(format!("U+{}", s))),
};
}