Skip to content

Update graphlookup syntax#5209

Draft
LantaoJin wants to merge 1 commit intoopensearch-project:mainfrom
LantaoJin:pr/issues/5208
Draft

Update graphlookup syntax#5209
LantaoJin wants to merge 1 commit intoopensearch-project:mainfrom
LantaoJin:pr/issues/5208

Conversation

@LantaoJin
Copy link
Member

@LantaoJin LantaoJin commented Mar 6, 2026

Description

Update naming convention to support directional edge syntax
Unidirectional:

| graphLookup employees
  start=reportsTo
  edge=reportsTo-->name
  filter=(status='active')
  as reportingHierarchy

Bidirectional:

| graphLookup employees
  start=reportsTo
  edge=reportsTo<->name
  filter=(status='active')
  as reportingHierarchy

Related Issues

Resolves #5208

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Lantao Jin <ltjin@amazon.com>
@LantaoJin LantaoJin added the enhancement New feature or request label Mar 6, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Update graphLookup grammar and AST parsing for new edge syntax

Relevant files:

  • ppl/src/main/antlr/OpenSearchPPLLexer.g4
  • ppl/src/main/antlr/OpenSearchPPLParser.g4
  • ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java
  • core/src/main/java/org/opensearch/sql/ast/tree/GraphLookup.java

Sub-PR theme: Update tests and anonymizer for new graphLookup edge syntax

Relevant files:

  • integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLGraphLookupIT.java
  • ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLGraphLookupTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java
  • ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java
  • ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java

Sub-PR theme: Update graphLookup documentation for new edge syntax

Relevant files:

  • docs/user/ppl/cmd/graphlookup.md

⚡ Recommended focus areas for review

Lexer Mode Leak

The custom lexer modes (EDGE_MODE, EDGE_ARROW_MODE, EDGE_TO_MODE) use popMode only in EDGE_TO_MODE. If parsing fails mid-way (e.g., invalid arrow or missing toField), the lexer may remain stuck in a non-default mode, causing subsequent tokens to be misidentified. There is no error recovery or explicit popMode on failure paths.

mode EDGE_MODE;
EDGE_EQUAL:     '='                                         -> type(EQUAL);
EDGE_WS:        [ \t\r\n]+                                  -> skip;
EDGE_ID: [@*A-Z_]+? [*A-Z_0-9]*                             -> type(ID), mode(EDGE_ARROW_MODE);

// EDGE_ARROW_MODE: entered after fromField, handles optional trailing dashes, arrow, and toField
mode EDGE_ARROW_MODE;
EDGE_UNI_ARROW: '-->'                                       -> type(UNI_ARROW), mode(EDGE_TO_MODE);
EDGE_BI_ARROW:  '<->'                                       -> type(BI_ARROW), mode(EDGE_TO_MODE);
TRAIL_HYPHEN:  '-'+;
EDGE_ARROW_WS:  [ \t\r\n]+                                  -> skip;

// EDGE_TO_MODE: entered after arrow, lexes toField then pops back to default
mode EDGE_TO_MODE;
EDGE_TO_WS:     [ \t\r\n]+                                  -> skip;
EDGE_TO_ID: ID_LITERAL                                      -> type(ID), popMode;
EDGE_ID Pattern

The EDGE_ID rule pattern [@*A-Z_]+? [*A-Z_0-9]* uses case-insensitive matching only for uppercase letters (A-Z). If the lexer grammar is not configured with case-insensitive flag, lowercase field names like reportsTo or connects would not be matched. Verify that the grammar uses options { caseInsensitive=true; } or that the pattern covers lowercase letters.

EDGE_ID: [@*A-Z_]+? [*A-Z_0-9]*                             -> type(ID), mode(EDGE_ARROW_MODE);
Trailing Hyphen Handling

When trailHyphen is present, the code appends the hyphen text to the fromField name string. However, this approach relies on fromField.getField().toString() returning a plain string, which may not hold for all QualifiedName implementations (e.g., multi-part names). Additionally, there is no validation that the resulting field name is valid.

if (edgeCtx.trailHyphen != null) {
  String name = fromField.getField().toString() + edgeCtx.trailHyphen.getText();
  fromField = new Field(new QualifiedName(name));
}
Null Check Missing

startField is parsed from ctx.startClause() without a null check. If the grammar allows startClause to be absent (e.g., due to grammar ambiguity or future changes), a NullPointerException would occur. The old code had explicit null handling for optional fields.

OpenSearchPPLParser.StartClauseContext startCtx = ctx.startClause();
Field startField = (Field) internalVisitExpression(startCtx.startField);

@github-actions
Copy link
Contributor

github-actions bot commented Mar 6, 2026

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix regex to match lowercase identifiers

**The EDGE_ID regex [@A-Z_]+? [A-Z_0-9] uses a lazy quantifier followed by a space
literal, which likely doesn't match the intended pattern for identifiers. It should
use a character class without a space, such as [@A-Za-z_][A-Za-z_0-9] (or
similar), to correctly match field names. The current pattern would fail to match
lowercase field names like reportsTo or connects.

ppl/src/main/antlr/OpenSearchPPLLexer.g4 [607]

-EDGE_ID: [@*A-Z_]+? [*A-Z_0-9]*                             -> type(ID), mode(EDGE_ARROW_MODE);
+EDGE_ID: [@*A-Za-z_][*A-Za-z_0-9]*                          -> type(ID), mode(EDGE_ARROW_MODE);
Suggestion importance[1-10]: 8

__

Why: The EDGE_ID pattern [@*A-Z_]+? [*A-Z_0-9]* only matches uppercase characters, which would fail to match lowercase field names like reportsTo or connects used throughout the tests. This is a critical bug that would break the new edge= syntax entirely.

Medium
General
Remove unnecessary null check for required field

The start field is now a required parameter in the new syntax (not optional), but
the anonymizer still guards it with a null check. Since startField is always
required by the new grammar, this conditional may hide bugs where startField is
unexpectedly null. Consider asserting non-null or removing the guard to match the
required nature of the field.

ppl/src/main/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizer.java [235-239]

-if (node.getStartField() != null) {
-  command.append(" start=").append(MASK_COLUMN);
-}
+command.append(" start=").append(MASK_COLUMN);
 String arrow = node.getDirection() == GraphLookup.Direction.BI ? "<->" : "-->";
 command.append(" edge=").append(MASK_COLUMN).append(arrow).append(MASK_COLUMN);
Suggestion importance[1-10]: 4

__

Why: Since startField is now a required parameter in the new grammar, the null guard in the anonymizer is inconsistent with the required nature of the field. Removing it would make the anonymizer consistent with the grammar and could surface unexpected null values as bugs rather than silently omitting the field.

Low
Document hyphenated field name edge case

The trailHyphen token captures one or more trailing hyphens from the fromField name,
but the grammar rule TRAIL_HYPHEN: '-'+ is defined in EDGE_ARROW_MODE, which is
entered only after EDGE_ID is matched. This means the trailing hyphen is a separate
token and the concatenation logic here is correct, but there's no validation that
the resulting field name with appended hyphens is a valid field identifier. Consider
adding a check or documenting this behavior clearly to avoid silent misuse.

ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java [1526-1529]

 if (edgeCtx.trailHyphen != null) {
   String name = fromField.getField().toString() + edgeCtx.trailHyphen.getText();
+  // trailHyphen allows field names with trailing hyphens (e.g., "manager-")
   fromField = new Field(new QualifiedName(name));
 }
Suggestion importance[1-10]: 1

__

Why: This suggestion only adds a comment to existing code without changing any logic. Adding comments/documentation is explicitly excluded from scoring above 0, but the existing code is functionally correct.

Low

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[graphLookup] Update naming convention to support directional edge syntax

1 participant