Conversation
|
@microsoft-github-policy-service agree |
There was a problem hiding this comment.
Pull request overview
Adds a new prompt converter that outputs Word .docx documents to support indirect prompt-injection scenarios, and wires it into the prompt-converter discovery surface.
Changes:
- Introduces
word_doc_converter.pyimplementing.docxgeneration and placeholder-based injection for existing documents. - Exposes the new converter via
pyrit.prompt_converter.__init__so it participates in converter discovery. - Adds
python-docxas a project dependency.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
pyrit/prompt_converter/word_doc_converter.py |
New Word .docx converter with new-doc generation and placeholder injection logic, plus identifier + serialization. |
pyrit/prompt_converter/__init__.py |
Exports the new converter for discovery via __all__ / get_converter_modalities. |
pyproject.toml |
Adds python-docx runtime dependency needed by the converter. |
| async def _serialize_docx(self, docx_bytes: bytes) -> DataTypeSerializer: | ||
| """Serialize the generated document using a data serializer.""" | ||
| extension = "docx" | ||
|
|
||
| serializer = data_serializer_factory( | ||
| category="prompt-memory-entries", | ||
| data_type="binary_path", | ||
| extension=extension, | ||
| ) | ||
| await serializer.save_data(docx_bytes) | ||
| return serializer |
There was a problem hiding this comment.
_serialize_docx is an async method but its name does not end with _async. Rename it (and its call site) to follow the project convention for async method naming.
| The placeholder must appear fully inside a single run; if it only exists | ||
| across multiple runs, it will not be replaced. | ||
| """ | ||
| assert self._injection_config.existing_docx is not None |
There was a problem hiding this comment.
Avoid using assert for required runtime checks. In optimized runs (python -O) assertions are stripped, which could allow existing_docx to be None here and lead to a less clear failure. Prefer an explicit check with a clear exception (or rely on the earlier constructor validation and remove the assertion entirely).
| assert self._injection_config.existing_docx is not None | |
| if self._injection_config.existing_docx is None: | |
| raise ValueError( | |
| "Cannot inject into an existing Word document because no 'existing_docx' " | |
| "path was provided in the injection configuration." | |
| ) |
| async def convert_async(self, *, prompt: str, input_type: PromptDataType = "text") -> ConverterResult: | ||
| """Convert the given prompt into a Word document (.docx). | ||
|
|
||
| If ``prompt_template`` is provided, the prompt is first used to render the | ||
| template via ``SeedPrompt.render_template_value``. Otherwise, the raw | ||
| ``prompt`` string is used as the content. | ||
|
|
||
| - When ``existing_docx`` is set, this content is injected into the | ||
| document by replacing the configured placeholder string. | ||
| - When no ``existing_docx`` is provided, a new document with a single | ||
| paragraph containing the content is created. | ||
|
|
||
| Args: | ||
| prompt: The prompt or dynamic data used to generate the content. | ||
| input_type: The type of input data. Must be ``"text"``. | ||
|
|
||
| Returns: | ||
| ConverterResult: Contains the path to the generated `.docx` file in | ||
| ``output_text`` and ``output_type="binary_path"``. | ||
|
|
||
| Raises: | ||
| ValueError: If the input type is not supported. | ||
| """ | ||
| if not self.input_supported(input_type): | ||
| raise ValueError("Input type not supported") | ||
|
|
||
| content = self._prepare_content(prompt) | ||
|
|
||
| if self._injection_config.existing_docx: | ||
| doc_bytes = self._inject_into_existing_docx(content) | ||
| else: | ||
| doc_bytes = self._generate_new_docx(content) | ||
|
|
||
| serializer = await self._serialize_docx(doc_bytes) | ||
|
|
||
| return ConverterResult(output_text=serializer.value, output_type="binary_path") |
There was a problem hiding this comment.
This PR introduces a new converter with non-trivial behavior (new doc generation + placeholder injection across runs), but there are no corresponding unit tests under tests/unit/converter/ (the repo has extensive converter test coverage, e.g., test_pdf_converter.py). Add unit tests for the main success and failure paths to prevent regressions.
| final content before injection. If provided, ``prompt`` should | ||
| be a dict-like object (or string representation) whose keys map | ||
| to the template parameters. |
There was a problem hiding this comment.
The docstring for prompt_template mentions that prompt can be a “dict-like object”, but PromptConverter.convert_async is typed/documented to accept prompt: str and other converters treat the templated case as a string that can be parsed into a dict. Consider updating this documentation to avoid implying that non-string prompts are supported by the converter interface.
| final content before injection. If provided, ``prompt`` should | |
| be a dict-like object (or string representation) whose keys map | |
| to the template parameters. | |
| final content before injection. If provided, ``prompt`` passed | |
| to ``convert_async`` must be a string whose contents can be | |
| interpreted as the template parameters (for example, a | |
| JSON-encoded or other parseable mapping of keys to values). |
| """Convert a text prompt into a Word (.docx) document. | ||
|
|
||
| This converter supports two main modes: | ||
|
|
||
| 1. **New document generation** | ||
| If no existing document is provided, the converter creates a simple `.docx` | ||
| containing the rendered prompt content in a single paragraph. | ||
|
|
||
| 2. **Placeholder-based injection into an existing document** | ||
| If an ``existing_docx`` is provided, the converter searches for a literal | ||
| placeholder string (for example ``{{INJECTION_PLACEHOLDER}}``) in the | ||
| document's paragraphs. When the placeholder is found fully inside a single | ||
| run, it is replaced with the rendered prompt content while preserving the | ||
| rest of the paragraph and its formatting. | ||
|
|
||
| .. important:: | ||
| Placeholders must be fully contained within a single run. If a | ||
| placeholder spans multiple runs (for example due to mixed formatting), | ||
| this converter will not replace it. This limitation is intentional to | ||
| avoid collapsing mixed formatting or rewriting complex run structures. | ||
|
|
||
| Security note: | ||
| This converter does **not** render Jinja2 templates from arbitrary | ||
| ``.docx`` content. Templating is handled via ``SeedPrompt`` (if provided), | ||
| and only the already-rendered text is injected into the document. This | ||
| avoids executing untrusted Jinja2 templates from document bodies. | ||
| """ |
There was a problem hiding this comment.
PR description mentions updated docs and added unit tests for this converter, but the changeset here only adds the converter, an __init__ export, and a dependency. If docs/tests are expected, they appear to be missing from this PR.
| from pyrit.prompt_converter.word_doc_converter import WordDoc_Converter | ||
| from pyrit.prompt_converter.zalgo_converter import ZalgoConverter |
There was a problem hiding this comment.
WordDoc_Converter export/import name likely needs to change if the converter class is renamed to match existing converter naming conventions (no underscores in class names).
| "python-dotenv>=1.0.1", | ||
| "pypdf>=6.6.2", | ||
| "reportlab>=4.4.4", | ||
| "python-docx>=1.1.0", |
There was a problem hiding this comment.
Keep dependencies roughly sorted for maintainability; python-docx is currently placed after reportlab (R) but should be grouped with the other py* / python-* dependencies (near python-dotenv / pypdf).
| "python-dotenv>=1.0.1", | |
| "pypdf>=6.6.2", | |
| "reportlab>=4.4.4", | |
| "python-docx>=1.1.0", | |
| "pypdf>=6.6.2", | |
| "python-docx>=1.1.0", | |
| "python-dotenv>=1.0.1", | |
| "reportlab>=4.4.4", |
| placeholder: str | ||
|
|
||
|
|
||
| class WordDoc_Converter(PromptConverter): |
There was a problem hiding this comment.
The class name WordDoc_Converter is inconsistent with the naming pattern used by other converters (e.g., PDFConverter, UrlConverter) and will surface in discovery APIs (e.g., get_converter_modalities). Consider renaming to WordDocConverter and updating the corresponding import/export in pyrit.prompt_converter.__init__.
| class WordDoc_Converter(PromptConverter): | |
| class WordDocConverter(PromptConverter): |
Description
Implemented a Word document file converter to support indirect prompt injection scenarios, addressing FEAT File Converter: Word Document (#424).
Feature: WordDoc_Converter
Added WordDoc_Converter in pyrit.prompt_converter.word_doc_converter as a PromptConverter that transforms text prompts into .docx files.
Exposed via pyrit.prompt_converter.init for discovery through get_converter_modalities.
Supported Modes
New Document Generation
Creates a simple .docx with optionally templated prompt content in a single paragraph.
Placeholder-Based Injection
Replaces a literal placeholder in an existing .docx (within paragraph runs) with rendered content while preserving formatting.
Templating and Safety
Uses SeedPrompt.render_template_value for templating.
Does not render Jinja2 templates from arbitrary .docx content.
Injected content is plain text output from SeedPrompt.
Formatting Constraints
Placeholders must be fully contained within a single run.
If spanning multiple runs, replacement is skipped and a warning is logged to avoid formatting issues.
Tests
Added unit tests for:
Creating a new .docx from a plain text prompt.
Injecting content when the placeholder is within a single run.
Verifying no replacement and warning when placeholders span multiple runs.
Documentation
Updated documentation to describe:
WordDoc_Converter
Supported modes
Placeholder requirements
Use of SeedPrompt for templating