feat: Add Docker Compose support for TPC benchmarks [WIP] #3539

Draft · andygrove wants to merge 24 commits into apache:main

Conversation
Replace 9 per-engine shell scripts with a single `run.py` that loads per-engine TOML config files. This eliminates duplicated Spark conf boilerplate and makes it easier to add new engines or modify shared settings.

Usage: `python3 run.py --engine comet --benchmark tpch [--dry-run]`

Also moves benchmarks from `dev/benchmarks/` to `benchmarks/tpc/` and updates all documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
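For illustration, a minimal sketch of what such a TOML-driven runner can look like (the per-engine file layout, the `[spark_conf]` table name, and the spark-submit invocation are assumptions here, not the PR's actual code):

```python
import subprocess
import tomllib  # Python 3.11+; older interpreters can use the third-party "toml" package


def run_benchmark(engine: str, benchmark: str, dry_run: bool = False) -> None:
    # Assumed layout: one TOML file per engine containing a [spark_conf] table
    # of Spark settings shared across benchmarks.
    with open(f"{engine}.toml", "rb") as f:
        config = tomllib.load(f)
    cmd = ["spark-submit"]
    for key, value in config.get("spark_conf", {}).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd += ["tpcbench.py", "--benchmark", benchmark]
    print(" ".join(cmd))  # --dry-run prints the command without executing it
    if not dry_run:
        subprocess.run(cmd, check=True)
```

Centralizing the conf in TOML means adding an engine is one new config file rather than another copy of a shell script.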
- Rename create-iceberg-tpch.py to create-iceberg-tables.py with a --benchmark flag supporting both tpch and tpcds table sets
- Remove hardcoded TPCH_QUERIES from comet-iceberg.toml required env vars
- Remove hardcoded ICEBERG_DATABASE default of "tpch" from comet-iceberg.toml
- Add check_benchmark_env() in run.py to validate benchmark-specific env vars and default ICEBERG_DATABASE to the benchmark name
- Update README with TPC-DS Iceberg table creation examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
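A rough sketch of the validation helper described above, with a hypothetical benchmark-to-variable mapping (the real run.py may require different variables):

```python
import os
import sys

# Hypothetical mapping of benchmark name -> env vars it requires;
# the actual run.py may check a different set.
REQUIRED_ENV = {
    "tpch": ["TPCH_DATA"],
    "tpcds": ["TPCDS_DATA"],
}


def check_benchmark_env(benchmark: str) -> None:
    """Validate benchmark-specific env vars before launching Spark."""
    missing = [v for v in REQUIRED_ENV.get(benchmark, []) if not os.environ.get(v)]
    if missing:
        sys.exit(f"Missing required env vars for {benchmark}: {', '.join(missing)}")
    # Default the Iceberg database to the benchmark name if unset.
    os.environ.setdefault("ICEBERG_DATABASE", benchmark)
```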
The script now configures the Iceberg catalog via SparkSession.builder instead of requiring --conf flags on the spark-submit command line. This adds --warehouse as a required CLI arg, makes --catalog optional (default: local), and validates paths with clear error messages before starting Spark.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
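For reference, configuring an Iceberg catalog programmatically looks roughly like this (the conf keys are standard Iceberg-on-Spark settings, but the script's exact catalog type and configuration may differ):

```python
from pyspark.sql import SparkSession


def build_spark(catalog: str = "local", warehouse: str = "/path/to/warehouse"):
    # Illustrative Iceberg catalog settings set on the builder, so no
    # --conf flags are needed on the spark-submit command line.
    return (
        SparkSession.builder.appName("create-iceberg-tables")
        .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
        .config(f"spark.sql.catalog.{catalog}.type", "hadoop")
        .config(f"spark.sql.catalog.{catalog}.warehouse", warehouse)
        .getOrCreate()
    )
```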
Provides a containerized Spark standalone cluster for running TPC-H and TPC-DS benchmarks. Includes a Dockerfile with Java 8 + 17 support, a three-service Compose file (master, worker, bench runner), a memory-constrained overlay with cgroup metrics collection, and README documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ports SparkMetricsProfiler from the unified-benchmark-runner branch to collect executor memory metrics via the Spark REST API during benchmark runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
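The general shape of such a profiler, sketched against the Spark REST API's executor endpoint (the endpoint paths are Spark's documented REST API; the polling loop and CSV schema here are assumptions, not the ported class):

```python
import csv
import time

import requests  # assumes the "requests" package is available

API = "http://localhost:4040/api/v1"  # Spark application REST API


def poll_executor_metrics(out_path: str, interval_s: float = 5.0) -> None:
    """Append one row per executor metric per poll.

    The real profiler presumably runs this in a background thread and
    stops when the benchmark finishes; this sketch loops forever.
    """
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_ms", "executor_id", "metric", "value"])
        while True:
            app_id = requests.get(f"{API}/applications").json()[0]["id"]
            for ex in requests.get(f"{API}/applications/{app_id}/executors").json():
                now = int(time.time() * 1000)
                # peakMemoryMetrics includes fields such as JVMHeapMemory
                # and JVMOffHeapMemory; it may be absent early in the run.
                for name, value in (ex.get("peakMemoryMetrics") or {}).items():
                    writer.writerow([now, ex["id"], name, value])
            f.flush()
            time.sleep(interval_s)
```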
Use a fixed 2-worker setup so shuffles go through the network stack, better reflecting real cluster behavior. Merge the constrained memory overlay into the main compose file and use YAML anchors to avoid duplication. Update TPC-H to use 2 executors to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
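To illustrate the anchor-based deduplication, a hypothetical compose fragment (not the PR's actual file) loaded with PyYAML, showing that an anchored worker template expands identically for both workers:

```python
import yaml  # PyYAML

# Made-up fragment in the spirit of the compose file: a shared worker
# template defined once via an anchor (&worker) and reused via aliases.
fragment = """
x-worker: &worker
  image: comet-bench
  environment:
    SPARK_WORKER_MEMORY: 16g

services:
  spark-worker-1: *worker
  spark-worker-2: *worker
"""
doc = yaml.safe_load(fragment)
assert doc["services"]["spark-worker-1"] == doc["services"]["spark-worker-2"]
```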
Replace ENGINE_JARS_DIR with individual COMET_JAR, GLUTEN_JAR, and ICEBERG_JAR env vars pointing to host paths. Each JAR is mounted into the container at a fixed path, making it easy to switch between JAR versions by changing the path and restarting.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a matplotlib script that generates four PNG charts from the JVM and cgroup metrics collected during TPC benchmark runs:

- jvm_memory_usage.png: peak JVM heap/off-heap per executor over time
- jvm_peak_memory.png: grouped bar chart of peak memory breakdown
- cgroup_memory.png: container memory usage and RSS per worker
- combined_memory.png: dual-axis overlay of JVM peaks and cgroup usage

Also fix cgroup metrics collection: move the collector from a separate sidecar container (which could only see its own cgroup) into the worker container itself, so it reads the worker's actual memory stats. Add timestamp_ms to the profiling CSV output so the visualization script can automatically align JVM and cgroup timelines.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
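A sketch of what an in-container cgroup collector can look like, assuming cgroup v2 (cgroup v1 exposes different file names, e.g. memory/memory.usage_in_bytes, so treat the paths and CSV columns as assumptions):

```python
import csv
import time
from pathlib import Path

# cgroup v2 interface files as seen from inside the worker container.
CGROUP = Path("/sys/fs/cgroup")


def collect_cgroup_metrics(out_path: str, interval_s: float = 1.0) -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # timestamp_ms lets the visualization script align these rows
        # with the JVM profiler's timeline.
        writer.writerow(["timestamp_ms", "memory_current_bytes", "anon_bytes"])
        while True:
            usage = int((CGROUP / "memory.current").read_text())
            stat = dict(
                line.split() for line in (CGROUP / "memory.stat").read_text().splitlines()
            )
            writer.writerow([int(time.time() * 1000), usage, int(stat["anon"])])
            f.flush()
            time.sleep(interval_s)
```

Running this inside the worker container (rather than a sidecar) is what makes the readings reflect the worker's own cgroup.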
Generate per-worker/executor charts instead of combined ones for better readability. Use distinct colors per series type (Heap, OffHeap, OffHeapExec, cgroup usage/RSS). Add JVMOffHeapMemory to the combined chart. Trim the chart x-axis to the JVM profiler window on both ends. Add a JAVA_HOME environment variable to all compose services to support engines that require a different JDK (e.g. Gluten with Java 8).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cgroup collector runs continuously and its CSV gets overwritten on cluster restart, so earlier engine data is lost. The new snapshot_cgroup_metrics() method filters the raw container-metrics CSVs to the profiler's start/stop time window and writes per-engine snapshots (e.g. comet-tpch-container-metrics-spark-worker-1.csv) alongside the existing JVM metrics CSV.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
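The windowing logic might look roughly like this (hypothetical signature; the PR's method presumably derives the paths and time window from profiler state rather than taking them as arguments):

```python
import csv


def snapshot_cgroup_metrics(src_csv: str, dst_csv: str, start_ms: int, stop_ms: int) -> None:
    """Copy only the rows that fall inside the profiler's run window."""
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if start_ms <= int(row["timestamp_ms"]) <= stop_ms:
                writer.writerow(row)
```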
- Gluten requires restarting the entire cluster with JAVA_HOME set to Java 8 (not just the bench container)
- Use --output /results so output lands alongside the cgroup CSVs
- mkdir -p /tmp/spark-events is needed in the ephemeral bench container
- Document the cgroup snapshot output files produced by --profile
- Add a visualize-metrics.py usage example for chart generation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the dual-axis linear chart with a single y-axis log scale so all JVM and cgroup memory series are directly comparable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
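In matplotlib terms the core of the change is a single set_yscale call; a minimal sketch with placeholder series names and data loading (not the script's actual code):

```python
import matplotlib.pyplot as plt


def plot_memory(series: dict[str, tuple[list[float], list[float]]], out_png: str) -> None:
    """series maps a label (e.g. "Heap", "cgroup usage") to (timestamps, bytes)."""
    fig, ax = plt.subplots()
    for label, (ts, values) in series.items():
        ax.plot(ts, values, label=label)
    # One shared log axis keeps small JVM series and large cgroup totals
    # readable on the same chart, unlike the previous dual linear axes.
    ax.set_yscale("log")
    ax.set_xlabel("elapsed seconds")
    ax.set_ylabel("bytes (log scale)")
    ax.legend()
    fig.savefig(out_png)
```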
Copies 22 TPC-H and 99 TPC-DS SQL query files (from SQLBench-H/DS, derived under the TPC Fair Use Policy) into benchmarks/tpc/queries/ so that the benchmarks are self-contained. Removes all TPCH_QUERIES/TPCDS_QUERIES env var configuration from run.py, tpcbench.py, docker-compose.yml, and the README.

Adds a requirements.txt with pyspark==3.5.2 and venv setup instructions. Excludes the query files from both the Maven and release RAT checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
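With the queries vendored into the repo, scripts can resolve them relative to their own location instead of reading env vars; a sketch under an assumed benchmarks/tpc/queries/&lt;benchmark&gt;/q*.sql layout (the actual directory structure and file naming may differ):

```python
from pathlib import Path

# Assumed layout: queries/ lives next to this script, with one
# subdirectory per benchmark (tpch, tpcds).
QUERIES_DIR = Path(__file__).parent / "queries"


def load_queries(benchmark: str) -> list[str]:
    return [p.read_text() for p in sorted((QUERIES_DIR / benchmark).glob("q*.sql"))]
```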
Create arch-agnostic Java symlinks in the Dockerfile using TARGETARCH so the image works on both amd64 and arm64. Rename JAVA_HOME to BENCH_JAVA_HOME in docker-compose.yml to prevent the host's JAVA_HOME from leaking into containers. Support both table.parquet and table directory layouts in tpcbench.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add Docker Desktop memory instructions to the README (need >= 48 GB for the default config). Mount Spark logs and work directories to $RESULTS_DIR so executor stderr survives container restarts. Expose port 4040 for the Spark Application UI on the bench container.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The Comet JAR contains native libraries for a specific OS/arch. On macOS (Apple Silicon), build for linux/amd64 with --platform to match the standard release JARs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On macOS, host-built Comet JARs contain darwin native libraries that don't work inside Linux Docker containers. Dockerfile.build-comet compiles the Comet JAR inside a Linux container, producing a JAR with the correct linux native libraries for the container architecture.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Ubuntu 20.04 (GLIBC 2.31) to match the apache/spark base image. Install GCC 11 from the ubuntu-toolchain-r PPA to work around the memcmp bug in GCC 9 (GCC #95189) that breaks aws-lc-sys. Remove the static port 4040 from the bench service in docker-compose.yml to avoid conflicts (use -p 4040:4040 with docker compose run instead).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --rm to all docker compose run commands in the README to auto-remove containers on exit, and add -p 4040:4040 to expose the Spark Application UI. Also add output/ to .gitignore for Comet build artifacts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Dockerfile based on apache/spark:3.5.2-python3 with Java 8 + 17 for Comet and Gluten engine support
- Three-service Compose setup: spark-master, spark-worker, and a bench runner
- Memory-constrained overlay (docker-compose.constrained.yml) with a cgroup metrics collector sidecar

Depends on
This PR should be merged after #3538 (consolidate TPC benchmark scripts).
Test plan
- docker build -t comet-bench -f benchmarks/tpc/infra/docker/Dockerfile .
- docker run --rm comet-bench ls /opt/benchmarks/
- docker run --rm comet-bench python3 /opt/benchmarks/run.py --help
- docker compose run

🤖 Generated with Claude Code