Code for "SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes".
Nicholas Pfaff1,
Thomas Cohn1,
Sergey Zakharov2,
Rick Cory2,
Russ Tedrake1
1Massachusetts Institute of Technology,
2Toyota Research Institute
Fully automated text-to-scene generation. This entire community center was generated by SceneSmith without any human intervention, from a single 151-word text prompt. Beyond explicitly specified elements, SceneSmith places additional objects from inferred contextual information, such as ping pong paddles and balls placed near a ping pong table. Objects are generated on-demand, are fully separable (non-composite), and include estimated physical properties, enabling direct interaction within a simulation. The resulting scenes are immediately usable in arbitrary physics simulators (robots added for demonstration).
@misc{scenesmith2026,
title={SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes},
author={Nicholas Pfaff and Thomas Cohn and Sergey Zakharov and Rick Cory and Russ Tedrake},
year={2026},
eprint={2602.09153},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2602.09153},
}
For Docker-based setup with GPU support, see Docker below.
This repository uses uv for dependency management.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install the dependencies (including dev tools like pytest) into .venv:
uv sync
To install without dev dependencies:
uv sync --no-dev
Activate the virtual env:
source .venv/bin/activate
Install the pre-commit hooks:
pre-commit install
Important: Installation is not complete yet. The default configuration requires additional data and model checkpoints. After installing dependencies, continue with:
- SAM3D Backend - 3D asset generation (default backend)
- Articulated Objects (ArtVIP) - Articulated furniture
- AmbientCG Materials - PBR materials
For distributing Blender rendering across multiple GPUs (recommended for parallel scene generation), install bubblewrap:
# Ubuntu/Debian
sudo apt-get install bubblewrap
This enables GPU isolation for EEVEE Next rendering, preventing OOM errors when running many scenes in parallel. Each scene's BlenderServer is isolated to a single GPU via OS-level namespacing. Without bubblewrap, all Blender instances share GPU 0.
SceneSmith supports two backends for open-set 3D asset generation. If using
strategy: "hssd" (retrieval from HSSD library), no generation backend is needed.
We support distributing generation across GPUs and hence recommend a multi-GPU node
for the best scene generation throughput when using generated assets. We found AWS
g6e.48xlarge instances to work well with our system.
Lower-quality asset generation, but it fits into 24GB of GPU memory. The results will be significantly worse than with SAM3D. We only recommend this as a proof of concept if you are highly GPU-memory limited. Use SAM3D for production results.
- Install submodules:
  git submodule update --init --recursive
- Install Hunyuan3D-2:
  bash scripts/install_hunyuan3d.sh
- Enable Hunyuan3D-2 in configuration:
  # In your agent config files
  asset_manager:
    backend: "hunyuan3d"
Higher-quality asset generation but requires 32GB GPU memory. This is the backend used in the paper.
- Request access to the SAM models:
  Automatic approval usually happens within 30min.
- Authenticate with HuggingFace (required for checkpoint download):
  huggingface-cli login
- Install SAM3D:
  bash scripts/install_sam3d.sh
  This will:
  - Clone SAM3 and SAM 3D Objects repositories
  - Install dependencies (pytorch3d, gsplat, kaolin, nvdiffrast)
  - Download model checkpoints (~5GB)
- Enable SAM3D in configuration:
  # In your agent config files
  asset_manager:
    backend: "sam3d"
Set the following environment variables. These are required for both local and Docker usage (Docker Compose forwards them from the host).
# Required: OpenAI API key for GPT-5 agents and default image generation
export OPENAI_API_KEY="your-openai-key"
# Optional: Google API key for Gemini image generation backend
# Only required if using image_generation.backend: "gemini" in config
export GOOGLE_API_KEY="your-google-key"
# Optional: Separate API key for OpenAI Agents tracing
# Allows traces to appear on a different account than the one used for billing
export OPENAI_TRACING_KEY="your-tracing-key"
Alternatively, run SceneSmith in a Docker container with NVIDIA GPU support. All servers (geometry generation, retrieval, blender, etc.) are auto-managed by the pipeline inside the container.
- Docker Engine (apt-based, not snap): Install via the official instructions or
  curl -fsSL https://get.docker.com | sudo sh
  The snap version has sandboxing issues that prevent GPU access.
- NVIDIA Container Toolkit: Install from the NVIDIA repo, then configure and restart Docker:
  sudo nvidia-ctk runtime configure --runtime=docker
  sudo systemctl restart docker
docker build -t scenesmith .
Data directories are not baked into the image. Download them on the host (see the dataset sections below) and mount them into the container.
SAM3D model weights require HuggingFace authentication and are mounted at runtime:
huggingface-cli login
# Run the host install script (`scripts/install_sam3d.sh`, only the checkpoint download part)
# Or manually download:
mkdir -p external/checkpoints
huggingface-cli download facebook/sam3 sam3.pt --local-dir external/checkpoints
huggingface-cli download facebook/sam-3d-objects \
--repo-type model \
--local-dir external/checkpoints/sam-3d-objects-download \
--include "checkpoints/*"
mv external/checkpoints/sam-3d-objects-download/checkpoints/* external/checkpoints/
rm -rf external/checkpoints/sam-3d-objects-download
# Interactive shell with all volumes and env vars
docker compose run --rm scenesmith bash
# Then inside the container:
python main.py +name=my_experiment
# Or run a one-off command directly
docker compose run --rm scenesmith python main.py +name=my_experiment
# Smoke test
docker compose run --rm scenesmith \
python -c "import torch; print(torch.cuda.is_available()); import scenesmith"
# Unit tests
docker compose run --rm scenesmith pytest tests/unit/ -x
| Host Path | Container Path | Content |
|---|---|---|
| `./data/` | `/app/data/` | HSSD models, Objaverse assets, materials, preprocessed indices |
| `./external/checkpoints/` | `/app/external/checkpoints/` | SAM3D model weights |
| `./outputs/` | `/app/outputs/` | Generated scenes |
Important: The Docker image does not include model checkpoints or datasets. Before running scene generation, you must download and mount the required data. The default configuration requires:
- SAM3D checkpoints (mounted via `external/checkpoints/`)
- Articulated Objects (ArtVIP) (mounted via `data/`)
- AmbientCG Materials (mounted via `data/`)
SceneSmith supports multiple asset and material sources. The default configuration requires three data dependencies:
- SAM3D - 3D asset generation backend (see installation above)
- ArtVIP - Articulated furniture (cabinets, drawers, etc.)
- AmbientCG Materials - PBR materials for walls, floors, and surfaces
HSSD, Objaverse, and PartNet-Mobility are optional alternative asset sources that can be enabled via configuration overrides.
For asset retrieval using the HSSD object library instead of generative 3D models:
- Accept the HSSD license at https://huggingface.co/datasets/hssd/hssd-models
- Download HSSD models (~72GB, requires Git LFS):
  cd data
  git lfs install
  git clone git@hf.co:datasets/hssd/hssd-models
- Download preprocessed data (~60MB + ~2GB):
  This data is from HSM (https://arxiv.org/abs/2503.16848) and includes:
  - CLIP indices and embeddings for semantic search (~60MB)
  - Pre-validated support surfaces (~2GB, provides ~10x speedup)
  bash scripts/download_hssd_data.sh
  Or manually:
  # Download CLIP indices
  wget https://github.com/3dlg-hcvc/hsm/releases/latest/download/data.zip
  unzip data.zip -d data/preprocessed
  # Download pre-validated support surfaces
  wget https://github.com/3dlg-hcvc/hsm/releases/latest/download/support-surfaces.zip
  unzip support-surfaces.zip -d data/hssd-models
- Enable HSSD in configuration:
  # In your experiment config file
  asset_manager:
    strategy: "hssd"  # Use HSSD retrieval instead of generation
The strategy: "generated" option uses either Hunyuan3D or SAM3D for open-set asset generation (controlled by the backend setting). HSSD assets automatically use pre-validated support surfaces when available, eliminating the need to recompute them.
For asset retrieval using the ObjectThor subset of Objaverse:
- Download ObjectThor data (~50GB assets + ~200MB features):
  bash scripts/download_objaverse_data.sh
  This downloads:
  - ObjectThor assets (GLB meshes)
  - Annotations with placement constraints and metadata
  - Pre-computed CLIP features (3 views, 768-dim)
- Preprocess for retrieval:
  python scripts/prepare_objaverse.py
  This creates data/objathor-assets/preprocessed/ with:
  - Averaged CLIP embeddings (768-dim per object)
  - Metadata index with categories and bounding boxes
  - Object category mapping for filtering
  (See the retrieval sketch after this list for how these embeddings can be queried.)
- Enable Objaverse in configuration:
  # In your experiment config file
  asset_manager:
    general_asset_source: "objaverse"  # Use Objaverse retrieval
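To make the retrieval step concrete, here is a minimal sketch of text-based retrieval over the averaged embeddings. The file names (clip_embeddings.npy, metadata.json) and the choice of text encoder (open_clip ViT-L/14, which produces 768-dim embeddings) are assumptions for illustration; the repository's actual preprocessed layout and ranking logic may differ.

```python
import json

import numpy as np
import open_clip
import torch

# Hypothetical file names; the actual preprocessed layout may differ.
PREPROCESSED = "data/objathor-assets/preprocessed"
embeddings = np.load(f"{PREPROCESSED}/clip_embeddings.npy")  # (N, 768), one row per object
with open(f"{PREPROCESSED}/metadata.json") as f:
    metadata = json.load(f)  # list of dicts with categories and bounding boxes

# Encode the text query with a 768-dim CLIP text encoder (ViT-L/14 here).
model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
with torch.no_grad():
    query = model.encode_text(tokenizer(["a red ceramic mug"])).numpy()[0]

# Cosine similarity between the query and the averaged per-object embeddings.
query = query / np.linalg.norm(query)
objects = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
scores = objects @ query

# Top-5 candidates; a real pipeline would further filter by category and bounding box.
for idx in np.argsort(-scores)[:5]:
    print(f"{scores[idx]:.3f}", metadata[idx])
```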
For articulated objects (cabinets with doors, drawers, etc.), we support two datasets: PartNet-Mobility and ArtVIP. Both require preprocessing to convert to our SDF format.
Note that the quality (both mesh quality and joint quality) of the PartNet-Mobility dataset is very low.
- Download PartNet-Mobility from https://sapien.ucsd.edu/downloads
  Extract to data/partnet-mobility-v0/.
- Convert to SDF format:
  python scripts/convert_partnet_mobility.py \
    --input data/partnet-mobility-v0 \
    --output data/partnet_mobility_sdf
  For parallel processing, use the wrapper script:
  bash scripts/convert_partnet_parallel.sh data/partnet-mobility-v0 data/partnet_mobility_sdf 8
  This uses VLM analysis to determine:
  - Physics properties (mass, inertia per link)
  - Front-facing orientation (canonicalized to +Y forward, Z up)
  - Scale correction
  - Placement type (floor/wall/ceiling/on-object)
- Compute CLIP embeddings for text-based retrieval:
  python scripts/compute_articulated_embeddings.py \
    --source partnet_mobility \
    --data-path data/partnet_mobility_sdf \
    --output-path data/partnet_mobility_sdf/embeddings
  Optional: Keep rendered images for inspection:
  python scripts/compute_articulated_embeddings.py \
    --source partnet_mobility \
    --data-path data/partnet_mobility_sdf \
    --output-path data/partnet_mobility_sdf/embeddings \
    --keep-renders
We provide preprocessed ArtVIP assets (converted to Drake SDFormat with collision geometries and CLIP embeddings) on HuggingFace:
# Download VHACD variant (recommended: tighter collision geometries)
huggingface-cli download nepfaff/scenesmith-preprocessed-data \
artvip/artvip_vhacd.tar.gz --repo-type dataset --local-dir .
mkdir -p data/artvip_sdf
tar xzf artvip/artvip_vhacd.tar.gz -C data/artvip_sdf
rm -rf artvip
Alternatively, download the CoACD variant (can produce faster simulations):
huggingface-cli download nepfaff/scenesmith-preprocessed-data \
artvip/artvip_coacd.tar.gz --repo-type dataset --local-dir .
mkdir -p data/artvip_sdf
tar xzf artvip/artvip_coacd.tar.gz -C data/artvip_sdf
rm -rf artvip
To preprocess ArtVIP assets yourself instead (e.g., with updated data or custom settings), use mesh-to-sim-asset to convert from USD to Drake SDFormat, then compute CLIP embeddings:
python scripts/compute_articulated_embeddings.py \
--source artvip \
--data-path data/artvip_sdf \
--output-path data/artvip_sdf/embeddings
Test that retrieval works correctly:
python scripts/test_asset_retrieval.py \
--source partnet_mobility \
--query "wooden cabinet with drawers" \
--top-k 5 \
--output-path output/retrieval_testThis renders multi-view images of retrieved objects for visual inspection.
Add articulated sources to your agent configuration:
# In configurations/furniture_agent/base_furniture_agent.yaml
asset_manager:
router:
enabled: true
articulated:
use_top_k: 5 # Number of CLIP candidates before bbox ranking
sources:
partnet_mobility:
enabled: false # Disabled by default (low quality)
data_path: data/partnet_mobility_sdf
embeddings_path: data/partnet_mobility_sdf/embeddings
artvip:
enabled: true # Enabled by default
data_path: data/artvip_sdf
embeddings_path: data/artvip_sdf/embeddings
Download free CC0 PBR materials from AmbientCG for scene rendering:
- Download materials:
  python scripts/download_ambientcg.py --output data/materials
  Options:
  # Download specific resolution/format
  python scripts/download_ambientcg.py -r 2K -f PNG --output data/materials
  # Limit number of materials (for testing)
  python scripts/download_ambientcg.py --limit 100 --output data/materials
  # Dry run to see what would be downloaded
  python scripts/download_ambientcg.py --dry-run
- Download pre-computed CLIP embeddings (recommended):
  huggingface-cli download nepfaff/scenesmith-preprocessed-data \
    --repo-type dataset \
    --include "ambientcg/embeddings/**" \
    --local-dir data/scenesmith-preprocessed-data
  mv data/scenesmith-preprocessed-data/ambientcg/embeddings data/materials/embeddings
  rm -rf data/scenesmith-preprocessed-data
  Or compute them yourself:
  python scripts/compute_ambientcg_embeddings.py --materials-dir data/materials
- Test retrieval:
  python scripts/test_material_retrieval.py \
    --materials-dir data/materials \
    --query "red brick wall" \
    --top-k 5
  # Save preview images for inspection
  python scripts/test_material_retrieval.py \
    --materials-dir data/materials \
    --query "wooden floor" \
    --top-k 5 \
    --output-path output/material_test
python main.py +name=run_name
Set the scene prompts in experiment.prompts (configurations/experiment/base_experiment.yaml).
Set floor_plan_agent.mode="house" for house scenes and floor_plan_agent.mode="room" for single room scenes.
Note that you will need >=24GB of GPU memory for Hunyuan3D asset generation and >=32GB for SAM3D asset generation. The material and articulated retrieval servers require additional GPU memory when enabled. Hence, we recommend >=45GB of GPU memory to run the full pipeline as documented in the research paper. All code was tested with an L40S GPU.
The geometry generation server automatically detects and uses all available GPUs, spawning one worker process per GPU for parallel asset generation. This significantly speeds up image-to-3D asset generation and thus scene generation.
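The idea, roughly, is to enumerate the visible GPUs and pin one worker process to each of them. The sketch below illustrates this pattern only; it is not the geometry server's actual code, and worker.py is a hypothetical placeholder for a per-GPU generation worker.

```python
import os
import subprocess

import torch

# Respect an externally set CUDA_VISIBLE_DEVICES, otherwise use every GPU torch can see.
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
gpu_ids = visible.split(",") if visible else [str(i) for i in range(torch.cuda.device_count())]

# One worker process per GPU, each pinned to its device via CUDA_VISIBLE_DEVICES.
procs = []
for gpu in gpu_ids:
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpu}
    procs.append(subprocess.Popen(["python", "worker.py"], env=env))
for proc in procs:
    proc.wait()
```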
To control which GPUs are used, set CUDA_VISIBLE_DEVICES:
# Use only GPUs 0, 1, 2, 3 (4 workers)
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py +name=my_experiment
# Force single GPU mode
CUDA_VISIBLE_DEVICES=0 python main.py +name=my_experiment
# Use all available GPUs (default - no env var needed)
python main.py +name=my_experiment
The scene generation pipeline has five stages that run in order:
- floor_plan - Generate room geometry (walls, floor)
- furniture - Place furniture in the room
- wall_mounted - Place wall-mounted objects (mirrors, artwork, shelves, clocks)
- ceiling_mounted - Place ceiling fixtures (chandeliers, pendant lights, ceiling fans)
- manipuland - Place small objects on furniture surfaces
Use pipeline.start_stage and pipeline.stop_stage to control which stages run:
# Run all stages
python main.py +name=my_experiment
# Run only through the furniture stage
python main.py +name=my_experiment experiment.pipeline.stop_stage=furniture
# Resume from the manipuland stage
python main.py +name=my_experiment experiment.pipeline.start_stage=manipuland
State persistence: The pipeline automatically saves state after each stage:
- `house_layout.json` - After floor_plan stage
- `scene_states/scene_after_furniture/` - After furniture stage
- `scene_states/scene_after_wall_objects/` - After wall_mounted stage
- `scene_states/scene_after_ceiling_objects/` - After ceiling_mounted stage
- `scene_states/final_scene/` - After manipuland stage
When resuming from a later stage, the pipeline loads from the previous stage's saved state. This enables iterative development: generate expensive stages once, then quickly iterate on later stages.
Use resume_from_path to create multiple independent runs from a single
checkpoint. This is useful for A/B testing different configurations or debugging
a specific stage.
# First run: generate floor plans and furniture
python main.py +name=base_run experiment.pipeline.stop_stage=furniture
# Branch 1: add manipulands with default config
python main.py +name=branch_1 \
experiment.pipeline.start_stage=manipuland \
experiment.pipeline.resume_from_path=outputs/2025-12-21/10-30-45
# Branch 2: add manipulands with different config
python main.py +name=branch_2 \
experiment.pipeline.start_stage=manipuland \
experiment.pipeline.resume_from_path=outputs/2025-12-21/10-30-45 \
manipuland_agent.some_param=different_value
When resume_from_path is set:
- The source scene is copied to the new output directory
- Absolute paths in checkpoint files are automatically fixed
- The pipeline continues from `start_stage` with the new configuration
This creates fully independent output directories, preserving the original run.
Intermediate scene states are saved as scene.dmd.yaml files inside scene_renders/
directories during generation. View them interactively with Drake's model visualizer:
python -m pydrake.visualization.model_visualizer \
outputs/YYYY-MM-DD/HH-MM-SS/scene_000/room_*/scene_renders/furniture/renders_001/scene.dmd.yaml
This opens an interactive 3D viewer in the browser where you can inspect object placement, collision geometry, articulated joints, and scene structure at any render checkpoint.
Final combined house scenes use package://scene/ URIs for portability. These scenes
can be moved or shared without breaking file paths. To view them, set ROS_PACKAGE_PATH
to the scene directory:
# View a portable house scene
export ROS_PACKAGE_PATH=/path/to/outputs/YYYY-MM-DD/HH-MM-SS/scene_000:$ROS_PACKAGE_PATH
python -m pydrake.visualization.model_visualizer \
outputs/YYYY-MM-DD/HH-MM-SS/scene_000/combined_house_after_furniture/house.dmd.yaml
Each scene directory contains a package.xml file that registers it as a ROS-style
package named "scene". Drake's model visualizer automatically discovers packages via
ROS_PACKAGE_PATH.
For programmatic loading (e.g., in Python scripts), register the package directly:
from pydrake.multibody.parsing import Parser
parser = Parser(plant)
parser.package_map().Add("scene", "/path/to/scene_000")
Note that final scenes after all agent stages are saved to outputs/YYYY-MM-DD/HH-MM-SS/scene_000/combined_house/. This includes both a house.blend Blender file and a house.dmd.yaml Drake file. See below for converting this scene into alternative formats (MuJoCo or USD).
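As a minimal sketch of programmatic use, the combined house can be loaded into a Drake MultibodyPlant and simulated directly (standard Drake APIs; replace the /path/to/scene_000 placeholder with your actual output directory):

```python
from pydrake.multibody.parsing import Parser
from pydrake.multibody.plant import AddMultibodyPlantSceneGraph
from pydrake.systems.analysis import Simulator
from pydrake.systems.framework import DiagramBuilder

builder = DiagramBuilder()
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=1e-3)

# Register the scene package so package://scene/ URIs resolve, then load the directives.
parser = Parser(plant)
parser.package_map().Add("scene", "/path/to/scene_000")
parser.AddModels("/path/to/scene_000/combined_house/house.dmd.yaml")
plant.Finalize()

# Build the diagram and let the scene settle under gravity for one second.
diagram = builder.Build()
simulator = Simulator(diagram)
simulator.AdvanceTo(1.0)
```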
The generated output directories can be huge. To remove all files that aren't used in the final scene, use scripts/clean_scene_output.py.
SceneSmith's native output format is Drake Directives (.dmd.yaml).
MuJoCo and USD export require a separate virtual environment because:
- `mujoco` is not included in the main venv
- `mujoco-usd-converter` conflicts with `bpy` (Blender) due to incompatible OpenUSD (`pxr`) versions
Setup the MuJoCo export environment:
./scripts/setup_mujoco_export.sh
source .mujoco_venv/bin/activate
# Export a generated scene
python scripts/export_scene_to_mujoco.py outputs/YYYY-MM-DD/HH-MM-SS/scene_000 \
-o mujoco_export
# Export a standalone SDF model (e.g., robot arm)
python scripts/export_scene_to_mujoco.py --sdf /path/to/model.sdf -o mujoco_export
# Export to USD format (adds usd/ subdirectory)
python scripts/export_scene_to_mujoco.py --sdf /path/to/model.sdf -o mujoco_export --usd
# View in MuJoCo
python -m mujoco.viewer --mjcf=mujoco_export/scene.xml
The USD output will be in mujoco_export/usd/.
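The exported MJCF can also be loaded programmatically. A minimal sketch, assuming the .mujoco_venv environment is active (it provides the mujoco Python package):

```python
import mujoco

# Load the exported scene and roll the simulation forward a few seconds.
model = mujoco.MjModel.from_xml_path("mujoco_export/scene.xml")
data = mujoco.MjData(model)
while data.time < 3.0:
    mujoco.mj_step(model, data)
print(f"Simulated {data.time:.2f}s with {model.nbody} bodies.")
```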
Note: Simulator export (MuJoCo, USD/Isaac Sim) is experimental. Scene quality may vary slightly across different simulators compared to the native Drake format. We welcome feedback and contributions for additional export targets.
The robot evaluation module provides tools for task-based scene generation and validation. It converts human tasks (e.g., "Find a fruit and place it on the table") into generated scenes and validates task completion.
Robot manipulation evaluation pipeline. Given a manipulation task (e.g., "Pick a fruit from the fruit bowl and place it on a plate"), an LLM generates diverse scene prompts specifying scene constraints implied by the task. SceneSmith generates scenes from each prompt. A robot policy attempts the task in simulation, and an evaluation agent verifies success using simulator state queries and visual observations. This enables scalable policy evaluation without manual environment or success predicate design.
The evaluation pipeline has 4 stages:
- Generate Prompts - LLM converts human task → diverse scene prompts
- Generate Scenes - Use main.py with the generated prompts (SceneSmith pipeline)
- Policy Interface - Convert scene → robot-executable poses (optional, not needed for language-conditioned policies)
- Validate - VLM agent checks if task is completed
Convert a human task description into diverse scene prompts:
python scripts/robot_eval/generate_prompts.py \
--task "Find a fruit and place it on the kitchen table" \
--output-dir outputs/eval_run \
--num-prompts 5
This generates:
- `outputs/eval_run/prompts.csv` - Scene prompts for main.py
- `outputs/eval_run/task_metadata.yaml` - Task metadata for validation
Run standard scene generation with the generated prompts:
python main.py +name=eval experiment.csv_path=outputs/eval_run/prompts.csv
Extract robot-executable poses from a generated scene. This is useful for model-based policies that require explicit pose targets instead of natural-language task descriptions. We used this as part of the proof-of-concept model-based policy in our paper. However, this component isn't very general, and language-conditioned policies that skip this step have the potential to perform better.
python scripts/robot_eval/policy_interface.py \
--scene-state outputs/.../scene_002/combined_house/house_state.json \
--dmd outputs/.../scene_002/combined_house/house.dmd.yaml \
--scene-dir outputs/.../scene_002 \
--task "Find a speaker and place it on the bed" \
--output-json robot_commands.json
Arguments:
- `--scene-state` - Path to scene_state.json (per-room) or house_state.json (combined house)
- `--dmd` - Path to scene.dmd.yaml or house.dmd.yaml (Drake scene with poses)
- `--scene-dir` - Scene root directory for package:// URI resolution (default: parent of DMD)
The policy interface uses a unified LLM agent with access to state tools (support, containment, spatial relations) and optionally vision tools (scene renders). The agent autonomously reasons about the task to:
- Parse goals and preconditions (e.g., "from the floor")
- Find objects matching categories
- Verify preconditions using tools
- Return ALL valid (target, reference) bindings ranked by confidence
The output includes both a sampled target pose and contained placement bounds (AABB shrunk by object half-extents) that robot policies can sample from.
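The placement bounds are the support region's AABB shrunk by the object's half-extents, so sampling uniformly inside them keeps the whole object within the region. A rough sketch of that computation (not the repository's implementation; the array layout of the bounds is assumed):

```python
import numpy as np


def sample_placement(region_min, region_max, half_extents, rng=None):
    """Sample an object-center position such that the whole object stays in the region.

    Shrinking the support region's axis-aligned bounds by the object's half-extents
    guarantees that an object centered at the sample lies fully inside the region.
    """
    rng = rng or np.random.default_rng()
    lo = np.asarray(region_min, dtype=float) + np.asarray(half_extents, dtype=float)
    hi = np.asarray(region_max, dtype=float) - np.asarray(half_extents, dtype=float)
    if np.any(lo > hi):
        raise ValueError("Object does not fit inside the placement region.")
    return rng.uniform(lo, hi)


# Example: a 10 cm x 10 cm footprint object on a 1.0 m x 0.6 m tabletop at z = 0.75 m.
print(sample_placement([0.0, 0.0, 0.75], [1.0, 0.6, 0.75], [0.05, 0.05, 0.0]))
```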
After the robot executes the task, it writes the final object poses back to a .dmd.yaml file.
The validator then compares the original scene metadata with the robot's output poses to
determine task success.
Robot policy contract:
- Robot receives the initial `scene.dmd.yaml` (from scene generation)
- Robot performs the task
- Robot outputs a modified `scene.dmd.yaml` with updated object poses
Validation:
python scripts/robot_eval/validate.py \
--scene-state outputs/.../scene_002/combined_house/house_state.json \
--dmd outputs/.../scene_002/combined_house/house.dmd.yaml \
--scene-dir outputs/.../scene_002 \
--task "Find a speaker and place it on the bed"
Arguments:
- `--scene-state` - Path to scene_state.json (per-room) or house_state.json (combined house)
- `--dmd` - Path to scene.dmd.yaml or house.dmd.yaml (poses from robot output)
- `--scene-dir` - Scene root directory for package:// URI resolution (default: parent of DMD)
The validator loads object metadata from scene_state.json and poses from the DMD file. It uses geometric state tools and vision tools (renders) to assess task completion.
Run unit tests:
pytest tests/unit/ --testmon
Run integration tests:
pytest tests/integration/ --testmon
Run all tests:
pytest tests/ --testmon
Omit --testmon to run the full test suite without caching. Use -x to stop on first failure.
Note that some heavy integration tests are skipped in the GitHub CI and must be run locally.
This project is licensed under the MIT License - see LICENSE for details.