
[BugFix] Fix PushDownContext shallow copy bug#5199

Draft
songkant-aws wants to merge 2 commits into opensearch-project:main from songkant-aws:pushdown-action-shallow-copy-bugfix

Conversation


@songkant-aws songkant-aws commented Mar 4, 2026

Description

PushDownContext.clone() / cloneWithoutSort() / cloneForAggregate() replays add(operation) to construct the new context. Aggregation operations reuse the same AggPushDownAction instance in both the old and new context because of the shallow copy performed while cloning the PushDownContext. Subsequent pushdown actions from other equivalent RelSubsets then modify this shared mutable action object, which can contaminate the previous context's aggregation state.
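The aliasing can be shown with a self-contained sketch. The `Context` and `MutableAction` classes below are hypothetical stand-ins for `PushDownContext` and `AggPushDownAction`, not the actual OpenSearch SQL types:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal model of the bug: cloning a context by replaying add(op)
// creates a new list, but each element still points at the same
// mutable action instance.
public class ShallowCopyDemo {
    static class MutableAction {
        final List<String> aggStates = new ArrayList<>();
    }

    static class Context {
        final List<MutableAction> ops = new ArrayList<>();

        void add(MutableAction op) { ops.add(op); }

        // Mirrors the replay-based clone: new list, same action instances.
        Context shallowClone() {
            Context c = new Context();
            ops.forEach(c::add);
            return c;
        }
    }

    public static void main(String[] args) {
        Context original = new Context();
        original.add(new MutableAction());

        Context cloned = original.shallowClone();
        // A later pushdown mutates the action through the clone...
        cloned.ops.get(0).aggStates.add("sort pushdown");

        // ...and the original context sees the contamination.
        System.out.println(original.ops.get(0).aggStates); // [sort pushdown]
    }
}
```

The fix is to deep-copy each action during the replay so the two contexts own independent state.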

Related Issues

Resolves #5125

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on the Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Songkan Tang <songkant@amazon.com>

github-actions bot commented Mar 4, 2026

PR Reviewer Guide 🔍

(Review updated until commit 21f8940)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Incomplete Copy

The copyAggregationBuilder method falls through to return builder (returning the original, uncopied instance) for any AggregationBuilder type not explicitly handled (e.g., RangeAggregationBuilder, FiltersAggregationBuilder, etc.). This means unrecognized builder types will still be shared across cloned contexts, silently reintroducing the shallow-copy bug for those types.

private static AggregationBuilder copyAggregationBuilder(AggregationBuilder builder) {
  if (builder instanceof CompositeAggregationBuilder composite) {
    return copyCompositeAggregationBuilder(composite);
  }
  if (builder instanceof TermsAggregationBuilder terms) {
    return copyTermsAggregationBuilder(terms);
  }
  if (builder instanceof MultiTermsAggregationBuilder multiTerms) {
    return copyMultiTermsAggregationBuilder(multiTerms);
  }
  if (builder instanceof DateHistogramAggregationBuilder dateHistogram) {
    return copyDateHistogramAggregationBuilder(dateHistogram);
  }
  if (builder instanceof HistogramAggregationBuilder histogram) {
    return copyHistogramAggregationBuilder(histogram);
  }
  if (builder instanceof TopHitsAggregationBuilder topHits) {
    return copyTopHitsAggregationBuilder(topHits);
  }
  if (builder instanceof NestedAggregationBuilder nested) {
    return copyNestedAggregationBuilder(nested);
  }
  return builder;
}
Shallow Sub-Aggregations

The copy inner classes (e.g., CompositeAggregationBuilderCopy, TermsAggregationBuilderCopy) call copySubAggregations which only does a shallow copy of sub-aggregations by re-adding the same sub-aggregation builder references. If sub-aggregations are also mutable and shared, mutations on them in one context will still affect the other. A recursive deep copy of sub-aggregations may be needed.

private static AggregatorFactories.Builder copySubAggregations(AggregationBuilder source) {
  AggregatorFactories.Builder copiedFactories = new AggregatorFactories.Builder();
  source.getSubAggregations().forEach(copiedFactories::addAggregator);
  source.getPipelineAggregations().forEach(copiedFactories::addPipelineAggregator);
  return copiedFactories;
}
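A recursive variant would copy each sub-builder before re-adding it, instead of re-adding the shared references. The sketch below uses a hypothetical `Builder` tree rather than the real `AggregationBuilder` API, to show the shape of the recursion:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder tree standing in for AggregationBuilder with
// sub-aggregations; deepCopy recurses so no node is shared between
// the source tree and the copy.
public class DeepCopySketch {
    static class Builder {
        final String name;
        int size;                       // mutable state that must not be shared
        final List<Builder> subs = new ArrayList<>();

        Builder(String name, int size) {
            this.name = name;
            this.size = size;
        }
    }

    static Builder deepCopy(Builder source) {
        Builder copy = new Builder(source.name, source.size);
        // Recurse into sub-aggregations rather than copying references.
        for (Builder sub : source.subs) {
            copy.subs.add(deepCopy(sub));
        }
        return copy;
    }

    public static void main(String[] args) {
        Builder root = new Builder("root", 10);
        root.subs.add(new Builder("child", 5));

        Builder copy = deepCopy(root);
        copy.subs.get(0).size = 99;     // mutate the copy's child...

        System.out.println(root.subs.get(0).size); // ...original still prints 5
    }
}
```

In the actual fix, the recursion would call `copyAggregationBuilder` on each sub-aggregation before passing it to `addAggregator`.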
Exception Swallowing

In copyDateHistogramInterval, both IllegalArgumentException and IllegalStateException are caught and silently ignored in the first try block. If getIntervalAsFixed() throws for a reason unrelated to the interval type (e.g., a programming error), the exception will be swallowed and the method will fall through to the calendar interval path, potentially producing incorrect results or a misleading error.

  try {
    fixedIntervalSetter.accept(source.getIntervalAsFixed());
    return;
  } catch (IllegalArgumentException | IllegalStateException ignored) {
    // Fallback to calendar interval.
  }
  try {
    calendarIntervalSetter.accept(source.getIntervalAsCalendar());
    return;
  } catch (IllegalArgumentException | IllegalStateException ignored) {
    throw new OpenSearchRequestBuilder.PushDownUnSupportedException(
        "Cannot copy interval for date histogram bucket " + source.name());
  }
}
Non-Copied Buckets

In pushDownSortIntoCompositeBucket, buckets that are NOT part of the collation (the "rest of buckets" path) are added to newBuckets directly from the original buckets list without copying. This means the original shared bucket objects are reused in the new composite builder, which could still be mutated by subsequent pushdown operations.

buckets.stream()
    .map(CompositeValuesSourceBuilder::name)
    .filter(name -> !selected.contains(name))
    .forEach(
        name -> {
          newBuckets.add(buckets.get(bucketNames.indexOf(name)));
          newBucketNames.add(name);
        });


github-actions bot commented Mar 4, 2026

PR Code Suggestions ✨

Latest suggestions up to 21f8940
Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Deep-copy all composite buckets, not just sorted ones

The remaining (non-sorted) buckets are added to newBuckets as direct references from
the original buckets list without copying, while the sorted buckets are properly
deep-copied via copyCompositeBucket. This inconsistency means the non-sorted buckets
are still shared references and could be mutated by other PushDownContext instances,
defeating the purpose of this bug fix.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java [664-671]

 buckets.stream()
     .map(CompositeValuesSourceBuilder::name)
     .filter(name -> !selected.contains(name))
     .forEach(
         name -> {
-          newBuckets.add(buckets.get(bucketNames.indexOf(name)));
+          newBuckets.add(copyCompositeBucket(buckets.get(bucketNames.indexOf(name))));
           newBucketNames.add(name);
         });
Suggestion importance[1-10]: 8


Why: This is a real correctness issue - the non-sorted buckets are added as direct references from the original buckets list, while sorted buckets are deep-copied via copyCompositeBucket. This inconsistency means shared bucket references could be mutated across different PushDownContext instances, which is exactly the bug this PR is trying to fix.

Medium
Verify index consistency between test and expected output

The integration test uses opensearch-sql_test_index_bank in the expected output file
(explain_agg_consecutive_sorts_issue_5125.yaml) but the REST test creates and
queries the issue5125 index. Verify that the expected output file references the
correct index, or that the REST test is querying the intended index to ensure the
test validates the actual bug fix scenario.

integ-test/src/yamlRestTest/resources/rest-api-spec/test/issues/5125.yml [62]

query: source=issue5125 | stats count() as c by gender | sort gender | sort - gender
Suggestion importance[1-10]: 5


Why: The suggestion raises a valid concern about the mismatch between the issue5125 index used in the REST test and opensearch-sql_test_index_bank referenced in the expected output YAML. However, the existing_code and improved_code are identical, meaning no actual fix is proposed — it only asks the user to verify. This is a legitimate observation worth investigating but doesn't constitute a code fix.

Low
General
Preserve root cause when rethrowing interval copy exception

The second catch block catches ignored but then immediately throws a new exception,
meaning the caught exception is silently discarded and the root cause is lost. The
throw statement should be outside the catch block (in a final else or after both try
blocks), or the caught exception should be chained as the cause of the new exception
to preserve the original error context.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java [422-439]

 private static void copyDateHistogramInterval(
     DateHistogramValuesSourceBuilder source,
     Consumer<DateHistogramInterval> fixedIntervalSetter,
     Consumer<DateHistogramInterval> calendarIntervalSetter) {
   try {
     fixedIntervalSetter.accept(source.getIntervalAsFixed());
     return;
   } catch (IllegalArgumentException | IllegalStateException ignored) {
     // Fallback to calendar interval.
   }
   try {
     calendarIntervalSetter.accept(source.getIntervalAsCalendar());
-    return;
-  } catch (IllegalArgumentException | IllegalStateException ignored) {
+  } catch (IllegalArgumentException | IllegalStateException e) {
     throw new OpenSearchRequestBuilder.PushDownUnSupportedException(
-        "Cannot copy interval for date histogram bucket " + source.name());
+        "Cannot copy interval for date histogram bucket " + source.name(), e);
   }
 }
Suggestion importance[1-10]: 5


Why: The suggestion is valid - the caught exception e should be chained as the cause of the new PushDownUnSupportedException to preserve the root cause. However, the PushDownUnSupportedException constructor may not accept a cause parameter, and this is a minor diagnostic improvement rather than a correctness issue.

Low

Previous suggestions

Suggestions up to commit 62df8d2
Category | Suggestion | Impact
Possible issue
Copy remaining composite buckets to avoid shared mutable state

The remaining (non-sorted) buckets are added directly from the original buckets list
without copying, meaning they still share the same mutable
CompositeValuesSourceBuilder instances. This can cause the same shallow-copy bug
that this PR is trying to fix. Each remaining bucket should also be copied via
copyCompositeBucket.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java [610-617]

 buckets.stream()
     .map(CompositeValuesSourceBuilder::name)
     .filter(name -> !selected.contains(name))
     .forEach(
         name -> {
-          newBuckets.add(buckets.get(bucketNames.indexOf(name)));
+          newBuckets.add(copyCompositeBucket(buckets.get(bucketNames.indexOf(name))));
           newBucketNames.add(name);
         });
Suggestion importance[1-10]: 8


Why: The non-sorted buckets are added directly from the original buckets list without copying, which means they still share mutable CompositeValuesSourceBuilder instances. This is the exact bug the PR aims to fix, and the copyCompositeBucket method is already available for this purpose.

Medium
General
Fix incorrect null return when limit already satisfies bucket size

When size >= bucketSize, resizeAggregationForLimit returns null, which causes
pushDownLimitIntoBucketSize to return false (cannot push down). However,
canPushDownLimitIntoBucketSize returns false in the same case. This is consistent,
but the pushDownLimitIntoBucketSize caller in CalciteLogicalIndexScan uses
canPushDownLimitIntoBucketSize as a read-only probe and then calls
pushDownLimitIntoBucketSize to actually mutate — if resizeAggregationForLimit
returns the original builder unchanged (e.g. for LeafOnly), replaceRootBuilder is
skipped correctly. However, when size >= bucketSize, returning null means the limit
is NOT pushed down even though it could be considered already satisfied. Consider
returning the original builder (no resize needed) instead of null when size >=
bucketSize to correctly signal that push-down is possible (the bucket is already
small enough).

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java [702-713]

 private AggregationBuilder resizeAggregationForLimit(AggregationBuilder builder, int size) {
   Integer bucketSize = getBucketSize(builder);
   if (bucketSize != null) {
-    return size < bucketSize ? copyAndResizeBucketBuilder(builder, size) : null;
+    return size < bucketSize ? copyAndResizeBucketBuilder(builder, size) : builder;
   }
   if (builder instanceof ValuesSourceAggregationBuilder.LeafOnly<?, ?>) {
     // all metric aggregations generate one row and are effectively already limited.
     return builder;
   }
   throw new OpenSearchRequestBuilder.PushDownUnSupportedException(
       "Unknown aggregation builder " + builder.getClass().getSimpleName());
 }
Suggestion importance[1-10]: 5


Why: The suggestion argues that returning null when size >= bucketSize incorrectly signals that push-down failed, but looking at the caller in CalciteLogicalIndexScan, canPushDownLimitIntoBucketSize is used as a read-only probe and pushDownLimitIntoBucketSize is the actual mutating call. The behavior is intentional — when the bucket is already small enough, no resize is needed and returning null (no-op) is consistent with the existing canPushDownLimitIntoBucketSize returning false. The suggestion may introduce unintended behavior changes.

Low
Throw on unrecognized builder type instead of silently sharing

When the builder type is not recognized (falls through all instanceof checks), the
original builder is returned as-is without copying. This silently shares the mutable
builder between the original and cloned AggPushDownAction, which is the exact bug
this PR aims to fix. An exception should be thrown for unrecognized types to make
the problem visible, or at minimum a comment should document why sharing is safe.

opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/context/AggPushDownAction.java [136-153]

 private static AggregationBuilder copyAggregationBuilder(AggregationBuilder builder) {
   if (builder instanceof CompositeAggregationBuilder composite) {
     return copyCompositeAggregationBuilder(composite);
   }
   if (builder instanceof TermsAggregationBuilder terms) {
     return copyTermsAggregationBuilder(terms);
   }
   if (builder instanceof MultiTermsAggregationBuilder multiTerms) {
     return copyMultiTermsAggregationBuilder(multiTerms);
   }
   if (builder instanceof TopHitsAggregationBuilder topHits) {
     return copyTopHitsAggregationBuilder(topHits);
   }
   if (builder instanceof NestedAggregationBuilder nested) {
     return copyNestedAggregationBuilder(nested);
   }
-  return builder;
+  throw new OpenSearchRequestBuilder.PushDownUnSupportedException(
+      "Cannot deep-copy unsupported AggregationBuilder type: " + builder.getClass().getSimpleName());
 }
Suggestion importance[1-10]: 4


Why: Throwing an exception for unrecognized builder types would make bugs more visible, but it could also break existing functionality if there are other builder types in use that are currently handled safely by the fall-through. The suggestion is valid as a defensive programming measure but carries risk.

Low

Signed-off-by: Songkan Tang <songkant@amazon.com>

github-actions bot commented Mar 6, 2026

Persistent review updated to latest commit 21f8940



Development

Successfully merging this pull request may close these issues.

Reverse Optimization fails when creating consecutive sorts due to Calcite physical optimizer merging
