Skip to content

Fix #5150: Fix dedup aggregation pushdown nullifying renamed fields#5192

Open
opensearchpplteam wants to merge 5 commits intoopensearch-project:mainfrom
opensearchpplteam:fix/issue-5150
Open

Fix #5150: Fix dedup aggregation pushdown nullifying renamed fields#5192
opensearchpplteam wants to merge 5 commits intoopensearch-project:mainfrom
opensearchpplteam:fix/issue-5150

Conversation

@opensearchpplteam
Copy link
Contributor

Description

Root cause: The DedupPushdownRule converts dedup operations to LITERAL_AGG top_hits aggregations. When a field is renamed before dedup (e.g., rename value as val), the top_hits response from OpenSearch still uses the original field name (value). However, the output schema expects the renamed name (val). The OpenSearchIndexEnumerator looks up fields by the renamed names, finds nothing in the top_hits response map, and returns null.

Fix: Build a field name mapping (original -> renamed) in AggregateAnalyzer for the LITERAL_AGG case by comparing inferNamedField().getRootName() with the output alias from the project args. Pass this mapping to TopHitsParser, which applies it to translate OpenSearch field names back to the expected output names before results are consumed by the enumerator.

Changes:

  • AggregateAnalyzer.java: Build original-to-renamed field name mapping from project args in LITERAL_AGG case, pass to TopHitsParser
  • TopHitsParser.java: Added fieldNameMapping field and applyFieldNameMapping() method to translate OS field names to renamed output names. Added backward-compatible 4-arg constructor; existing 3-arg constructor delegates with empty map.
  • Unit test: Verifies dedup top_hits parsing with field name remapping
  • YAML REST tests: End-to-end tests for rename-then-dedup and eval-then-dedup scenarios

Related Issues

Resolves #5150

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

PR Reviewer Guide 🔍

(Review updated until commit fe1766a)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add fieldNameMapping support to TopHitsParser

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java
  • opensearch/src/test/java/org/opensearch/sql/opensearch/response/OpenSearchAggregationResponseParserTest.java

Sub-PR theme: Wire fieldNameMapping from AggregateAnalyzer with integration tests

Relevant files:

  • opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java
  • integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5150.yml

⚡ Recommended focus areas for review

Incomplete Mapping

The field name mapping only handles RexInputRef args. If a project arg is a RexCall (e.g., an eval expression like value * 2), it is silently skipped. This means computed/eval fields that are also renamed will not be included in the mapping, potentially causing null values in the output for those fields.

for (Pair<RexNode, String> arg : args) {
  if (arg.getKey() instanceof RexInputRef) {
    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
    String outputName = arg.getValue();
    if (!originalName.equals(outputName)) {
      String previousMapping = fieldNameMapping.put(originalName, outputName);
      if (previousMapping != null) {
        LOG.warn(
            "Field name mapping collision: field '{}' was mapped to '{}', now"
                + " overwritten by '{}'",
            originalName,
            previousMapping,
            outputName);
      }
    }
  }
}
Mapping Collision

If two different original fields are mapped to the same output name (e.g., two fields both renamed to the same alias), applyFieldNameMapping will silently overwrite one value with the other in the result map. There is no warning or error for this case, which could lead to silent data loss.

private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
  if (fieldNameMapping.isEmpty()) {
    return map;
  }
  Map<String, Object> result = new LinkedHashMap<>();
  for (Map.Entry<String, Object> entry : map.entrySet()) {
    String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
    result.put(mappedName, entry.getValue());
  }
  return result;
}
Missing Schema Check

The second test case ("Eval new field then dedup on different field retains eval values") does not include a schema match assertion, unlike the first test. This means the test does not verify that the output schema (column names and types) is correct for the eval scenario.

"5150: Eval new field then dedup on different field retains eval values":
  - skip:
      features:
        - headers
        - allowed_warnings
  - do:
      allowed_warnings:
        - 'Loading the fielddata on the _id field is deprecated and will be removed in future versions. If you require sorting or aggregating on this field you should also include the id in the body of your documents, and map this field as a keyword field that has [doc_values] enabled'
      headers:
        Content-Type: 'application/json'
      ppl:
        body:
          query: source=test_5150 | eval doubled = value * 2 | dedup category | sort category | fields category, value, doubled

  - match: { total: 2 }
  - length: { datarows: 2 }
  - match: { datarows.0.0: "X" }
  - match: { datarows.0.1: 100 }
  - match: { datarows.0.2: 200 }
  - match: { datarows.1.0: "Y" }
  - match: { datarows.1.1: 300 }
  - match: { datarows.1.2: 600 }

@github-actions
Copy link
Contributor

github-actions bot commented Feb 27, 2026

PR Code Suggestions ✨

Latest suggestions up to fe1766a

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null field resolution result

The helper.inferNamedField(arg.getKey()) call may return null or throw if the field
cannot be resolved, leading to a NullPointerException at .getRootName(). Adding a
null check before calling getRootName() would make this more robust.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [617-632]

 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField == null) {
+      continue;
+    }
+    String originalName = namedField.getRootName();
     String outputName = arg.getValue();
     if (!originalName.equals(outputName)) {
       String previousMapping = fieldNameMapping.put(originalName, outputName);
       if (previousMapping != null) {
         LOG.warn(
             "Field name mapping collision: field '{}' was mapped to '{}', now"
                 + " overwritten by '{}'",
             originalName,
             previousMapping,
             outputName);
       }
     }
   }
 }
Suggestion importance[1-10]: 5

__

Why: Adding a null check for helper.inferNamedField(arg.getKey()) is a reasonable defensive measure to prevent a potential NullPointerException at .getRootName(). The improved code accurately reflects the suggested change.

Low
General
Detect output key collisions during field remapping

If two original fields map to the same output name, the second one will silently
overwrite the first in the result map. This could cause data loss without any
warning. Consider detecting such collisions and logging a warning or throwing an
exception.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [162-172]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      LOG.warn(
+          "Field name mapping collision in result: output key '{}' already exists, overwriting",
+          mappedName);
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion is valid and addresses a potential silent data loss scenario when two original fields map to the same output name. However, this is an edge case that is unlikely in practice, and the LOG reference used in TopHitsParser is not defined in that class (it's defined in AggregateAnalyzer), making the improved_code incorrect as-is.

Low

Previous suggestions

Suggestions up to commit e8a2e3a
CategorySuggestion                                                                                                                                    Impact
Possible issue
Detect output key collisions during field remapping

If two original field names map to the same output name, the second one will
silently overwrite the first in the result map. This could cause data loss without
any warning. Consider detecting such collisions and logging a warning or throwing an
exception.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [162-172]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      throw new IllegalStateException(
+          String.format(
+              "Field name mapping collision in result: key '%s' already exists when mapping '%s'",
+              mappedName, entry.getKey()));
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion addresses a valid edge case where two original fields could map to the same output name, causing silent data loss. However, this scenario is unlikely in practice given the mapping is built from project args, and throwing an exception may be overly aggressive compared to a warning log.

Low
General
Store defensive copy of mutable map parameter

The fieldNameMapping field is stored as a direct reference to the passed-in map,
which could be mutated externally after construction. Store a defensive copy to
ensure immutability of the parser's state.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [50]

-this.fieldNameMapping = fieldNameMapping;
+this.fieldNameMapping = Map.copyOf(fieldNameMapping);
Suggestion importance[1-10]: 4

__

Why: Storing a defensive copy of fieldNameMapping is a good defensive programming practice to prevent external mutation. The Map.copyOf() change is minimal and correct, though the risk of external mutation is low given the call sites in this PR.

Low
Suggestions up to commit e9a24ad
CategorySuggestion                                                                                                                                    Impact
General
Strengthen integration test assertions with exact value checks

The is_true assertion checks for truthiness, which would fail for falsy values like
0. Since value fields contain 100 and 300 (non-zero), this works here, but it's
fragile. Consider using match assertions to verify the exact expected values for
stronger test coverage.

integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5150.yml [49-53]

   - match: { schema: [{"name": "category", "type": "string"}, {"name": "val", "type": "bigint"}] }
   - length: { datarows: 2 }
-  # Each row should have non-null val
-  - is_true: datarows.0.1
-  - is_true: datarows.1.1
+  # Each row should have non-null val — verify exact values
+  - match: { datarows.0: ["X", 100] }
+  - match: { datarows.1: ["Y", 300] }
Suggestion importance[1-10]: 5

__

Why: The suggestion improves test robustness by replacing is_true (truthiness check) with exact match assertions. This is a valid improvement since is_true would fail for falsy values like 0, and exact value checks provide stronger test coverage for the renamed field behavior.

Low
Detect collisions when applying field name mapping

If two original field names map to the same output name, the second one will
silently overwrite the first in the result map. This could cause data loss without
any warning. Consider detecting and logging a warning when a collision occurs during
the mapping application.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [162-172]

     private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
       if (fieldNameMapping.isEmpty()) {
         return map;
       }
       Map<String, Object> result = new LinkedHashMap<>();
       for (Map.Entry<String, Object> entry : map.entrySet()) {
         String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+        if (result.containsKey(mappedName)) {
+          LOG.warn(
+              "Field name mapping collision in result: mapped name '{}' already exists, overwriting value",
+              mappedName);
+        }
         result.put(mappedName, entry.getValue());
       }
       return result;
     }
Suggestion importance[1-10]: 4

__

Why: The suggestion adds collision detection in applyFieldNameMapping, which is a valid defensive measure. However, this is a relatively rare edge case and the impact is low since the primary collision warning is already handled upstream in AggregateAnalyzer. The LOG reference would also need to be added to TopHitsParser.

Low
Suggestions up to commit d82a0a1
CategorySuggestion                                                                                                                                    Impact
Possible issue
Guard against null field resolution result

If helper.inferNamedField(arg.getKey()) returns null, calling .getRootName() on it
will throw a NullPointerException. A null check should be added before accessing
getRootName() to prevent a runtime crash when the field cannot be resolved.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [609-617]

 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField == null) continue;
+    String originalName = namedField.getRootName();
     String outputName = arg.getValue();
     if (!originalName.equals(outputName)) {
       fieldNameMapping.put(originalName, outputName);
     }
   }
 }
Suggestion importance[1-10]: 6

__

Why: If helper.inferNamedField(arg.getKey()) can return null, calling .getRootName() would throw a NullPointerException. Adding a null check is a reasonable defensive measure, though it depends on whether the method contract guarantees non-null returns.

Low
General
Detect field name mapping collisions

If two original field names map to the same output name, or if a mapped output name
collides with another existing key in the result map, one value will silently
overwrite the other. While this may be an edge case, it could lead to data loss
without any warning. Consider detecting and handling such collisions explicitly.

opensearch/src/main/java/org/opensearch/sql/opensearch/response/agg/TopHitsParser.java [159-169]

 private Map<String, Object> applyFieldNameMapping(Map<String, Object> map) {
   if (fieldNameMapping.isEmpty()) {
     return map;
   }
   Map<String, Object> result = new LinkedHashMap<>();
   for (Map.Entry<String, Object> entry : map.entrySet()) {
     String mappedName = fieldNameMapping.getOrDefault(entry.getKey(), entry.getKey());
+    if (result.containsKey(mappedName)) {
+      throw new IllegalStateException(
+          String.format("Field name mapping collision: multiple fields map to '%s'", mappedName));
+    }
     result.put(mappedName, entry.getValue());
   }
   return result;
 }
Suggestion importance[1-10]: 3

__

Why: While collision detection is a valid concern, this is an edge case that is unlikely in practice since field rename mappings are derived from explicit user-defined renames. Throwing an exception here could cause unexpected failures in valid scenarios where the mapping is well-defined.

Low
Suggestions up to commit 6eb5ac5
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for inferNamedField

The code doesn't handle potential null values from
helper.inferNamedField(arg.getKey()). If this method returns null, calling
getRootName() will throw a NullPointerException. Add a null check before accessing
getRootName() to prevent runtime failures.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [608-617]

 Map<String, String> fieldNameMapping = new HashMap<>();
 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
-    String outputName = arg.getValue();
-    if (!originalName.equals(outputName)) {
-      fieldNameMapping.put(originalName, outputName);
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField != null) {
+      String originalName = namedField.getRootName();
+      String outputName = arg.getValue();
+      if (!originalName.equals(outputName)) {
+        fieldNameMapping.put(originalName, outputName);
+      }
     }
   }
 }
Suggestion importance[1-10]: 7

__

Why: Valid defensive programming suggestion to prevent potential NullPointerException if helper.inferNamedField() returns null. However, without knowing the implementation details of inferNamedField(), it's unclear if this is a real issue or if the method guarantees non-null returns for RexInputRef instances.

Medium
Suggestions up to commit eddc6b9
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add null check for inferNamedField

The code assumes helper.inferNamedField(arg.getKey()) never returns null, which
could cause a NullPointerException. Add a null check before calling getRootName() to
prevent potential runtime failures when the field cannot be inferred.

opensearch/src/main/java/org/opensearch/sql/opensearch/request/AggregateAnalyzer.java [608-617]

 Map<String, String> fieldNameMapping = new HashMap<>();
 for (Pair<RexNode, String> arg : args) {
   if (arg.getKey() instanceof RexInputRef) {
-    String originalName = helper.inferNamedField(arg.getKey()).getRootName();
-    String outputName = arg.getValue();
-    if (!originalName.equals(outputName)) {
-      fieldNameMapping.put(originalName, outputName);
+    var namedField = helper.inferNamedField(arg.getKey());
+    if (namedField != null) {
+      String originalName = namedField.getRootName();
+      String outputName = arg.getValue();
+      if (!originalName.equals(outputName)) {
+        fieldNameMapping.put(originalName, outputName);
+      }
     }
   }
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion correctly identifies a potential NullPointerException if helper.inferNamedField(arg.getKey()) returns null. Adding a null check improves robustness, though the actual likelihood of this occurring depends on the codebase's guarantees about inferNamedField behavior.

Medium

@penghuo penghuo added PPL Piped processing language bugFix labels Feb 27, 2026
@github-actions
Copy link
Contributor

Persistent review updated to latest commit eddc6b9

…g renamed fields

When the DedupPushdownRule converts a dedup to an aggregation-based
top_hits query, fields that were renamed (via rename or eval) would
return null values. This happened because the TopHitsParser returned
results using original OpenSearch field names, but the output schema
expected the renamed names.

Added a field name mapping to TopHitsParser so it can translate original
OS field names to their renamed output names in the LITERAL_AGG (dedup)
aggregation response path.

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit 6eb5ac5

The testNoMvBasic and testNoMvWithEval failures in CalciteExplainIT
are unrelated to this PR's dedup pushdown changes and only occur on
Java 25. Re-triggering CI.

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit d82a0a1

String originalName = helper.inferNamedField(arg.getKey()).getRootName();
String outputName = arg.getValue();
if (!originalName.equals(outputName)) {
fieldNameMapping.put(originalName, outputName);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When two project args reference the same original field with different output names (e.g., eval pay2 = salary | rename salary as pay | dedup dept_id), the second HashMap.put silently overwrites the first mapping. One of the renamed fields would still get null values. This is an edge case from issue #5150 differential testing. Consider either documenting this as a known limitation or using a multimap / detecting the collision.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a warning log when a collision is detected (when the same original field maps to multiple output names). The last mapping wins, which is documented in a code comment as a known limitation. This edge case requires two derived fields from the same original field before dedup, which is rare in practice.

: new LinkedHashMap<>(hit.getSourceAsMap());
hit.getFields().values().forEach(f -> map.put(f.getName(), f.getValue()));
return map;
return applyFieldNameMapping(map);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field name mapping is applied only in the multi-row return path (the else branch at line 140). The returnSingleValue path (used by ARG_MAX/ARG_MIN pushdown) and returnMergeValue path skip mapping entirely. If those paths are ever used with renamed fields via a future pushdown rule, they would have the same null-value bug. Worth confirming this is an intentional scope limitation for now, or if those branches should also apply the mapping defensively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this is intentional. The returnSingleValue and returnMergeValue paths use agg.getName() as the map key (not field names), so field name mapping is irrelevant for those paths. Added a clarifying comment in the code explaining why mapping is only applied in the multi-row return path.

…apping limitations

- Add warning log when HashMap collision is detected in AggregateAnalyzer
  field name mapping (when same original field maps to multiple output names)
- Add clarifying comments in TopHitsParser explaining why field name mapping
  is only needed in the multi-row return path

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit e9a24ad

Comment on lines +52 to +53
- is_true: datarows.0.1
- is_true: datarows.1.1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare with real value, not just null value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — replaced is_true with match assertions for actual expected values (category and val columns).

Comment on lines +73 to +74
- is_true: datarows.0.2
- is_true: datarows.1.2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

compare with real value, not just null value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated — replaced is_true with match assertions for actual expected values (category, value, and doubled columns).

…project#5150

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

Persistent review updated to latest commit e8a2e3a

Signed-off-by: opensearchpplteam <opensearchpplteam@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 3, 2026

Persistent review updated to latest commit fe1766a

@LantaoJin
Copy link
Member

Hold on merging, the current fix is more like some kind of workaround. I'll need some time to recall the implementation, and if I can suggest a new fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugFix PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Dedup aggregation pushdown nullifies renamed non-dedup fields via top_hits response mapping

3 participants