Skip to content

FEAT Add WordDocConverter#1368

Open
mhadica wants to merge 2 commits intoAzure:mainfrom
mhadica:main
Open

FEAT Add WordDocConverter#1368
mhadica wants to merge 2 commits intoAzure:mainfrom
mhadica:main

Conversation

@mhadica
Copy link

@mhadica mhadica commented Feb 12, 2026

Description

Implemented a Word document file converter to support indirect prompt injection scenarios, addressing FEAT File Converter: Word Document (#424).

Feature: WordDoc_Converter
Added WordDoc_Converter in pyrit.prompt_converter.word_doc_converter as a PromptConverter that transforms text prompts into .docx files.
Exposed via pyrit.prompt_converter.init for discovery through get_converter_modalities.

Supported Modes
New Document Generation
Creates a simple .docx with optionally templated prompt content in a single paragraph.
Placeholder-Based Injection
Replaces a literal placeholder in an existing .docx (within paragraph runs) with rendered content while preserving formatting.

Templating and Safety
Uses SeedPrompt.render_template_value for templating.
Does not render Jinja2 templates from arbitrary .docx content.
Injected content is plain text output from SeedPrompt.

Formatting Constraints
Placeholders must be fully contained within a single run.
If spanning multiple runs, replacement is skipped and a warning is logged to avoid formatting issues.

Tests
Added unit tests for:
Creating a new .docx from a plain text prompt.
Injecting content when the placeholder is within a single run.
Verifying no replacement and warning when placeholders span multiple runs.

Documentation
Updated documentation to describe:
WordDoc_Converter
Supported modes
Placeholder requirements
Use of SeedPrompt for templating

@mhadica
Copy link
Author

mhadica commented Feb 12, 2026

@microsoft-github-policy-service agree

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new prompt converter that outputs Word .docx documents to support indirect prompt-injection scenarios, and wires it into the prompt-converter discovery surface.

Changes:

  • Introduces word_doc_converter.py implementing .docx generation and placeholder-based injection for existing documents.
  • Exposes the new converter via pyrit.prompt_converter.__init__ so it participates in converter discovery.
  • Adds python-docx as a project dependency.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
pyrit/prompt_converter/word_doc_converter.py New Word .docx converter with new-doc generation and placeholder injection logic, plus identifier + serialization.
pyrit/prompt_converter/__init__.py Exports the new converter for discovery via __all__ / get_converter_modalities.
pyproject.toml Adds python-docx runtime dependency needed by the converter.

Comment on lines +247 to +257
async def _serialize_docx(self, docx_bytes: bytes) -> DataTypeSerializer:
"""Serialize the generated document using a data serializer."""
extension = "docx"

serializer = data_serializer_factory(
category="prompt-memory-entries",
data_type="binary_path",
extension=extension,
)
await serializer.save_data(docx_bytes)
return serializer
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_serialize_docx is an async method but its name does not end with _async. Rename it (and its call site) to follow the project convention for async method naming.

Copilot generated this review using guidance from repository custom instructions.
The placeholder must appear fully inside a single run; if it only exists
across multiple runs, it will not be replaced.
"""
assert self._injection_config.existing_docx is not None
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid using assert for required runtime checks. In optimized runs (python -O) assertions are stripped, which could allow existing_docx to be None here and lead to a less clear failure. Prefer an explicit check with a clear exception (or rely on the earlier constructor validation and remove the assertion entirely).

Suggested change
assert self._injection_config.existing_docx is not None
if self._injection_config.existing_docx is None:
raise ValueError(
"Cannot inject into an existing Word document because no 'existing_docx' "
"path was provided in the injection configuration."
)

Copilot uses AI. Check for mistakes.
Comment on lines +120 to +155
async def convert_async(self, *, prompt: str, input_type: PromptDataType = "text") -> ConverterResult:
"""Convert the given prompt into a Word document (.docx).

If ``prompt_template`` is provided, the prompt is first used to render the
template via ``SeedPrompt.render_template_value``. Otherwise, the raw
``prompt`` string is used as the content.

- When ``existing_docx`` is set, this content is injected into the
document by replacing the configured placeholder string.
- When no ``existing_docx`` is provided, a new document with a single
paragraph containing the content is created.

Args:
prompt: The prompt or dynamic data used to generate the content.
input_type: The type of input data. Must be ``"text"``.

Returns:
ConverterResult: Contains the path to the generated `.docx` file in
``output_text`` and ``output_type="binary_path"``.

Raises:
ValueError: If the input type is not supported.
"""
if not self.input_supported(input_type):
raise ValueError("Input type not supported")

content = self._prepare_content(prompt)

if self._injection_config.existing_docx:
doc_bytes = self._inject_into_existing_docx(content)
else:
doc_bytes = self._generate_new_docx(content)

serializer = await self._serialize_docx(doc_bytes)

return ConverterResult(output_text=serializer.value, output_type="binary_path")
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduces a new converter with non-trivial behavior (new doc generation + placeholder injection across runs), but there are no corresponding unit tests under tests/unit/converter/ (the repo has extensive converter test coverage, e.g., test_pdf_converter.py). Add unit tests for the main success and failure paths to prevent regressions.

Copilot uses AI. Check for mistakes.
Comment on lines +73 to +75
final content before injection. If provided, ``prompt`` should
be a dict-like object (or string representation) whose keys map
to the template parameters.
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring for prompt_template mentions that prompt can be a “dict-like object”, but PromptConverter.convert_async is typed/documented to accept prompt: str and other converters treat the templated case as a string that can be parsed into a dict. Consider updating this documentation to avoid implying that non-string prompts are supported by the converter interface.

Suggested change
final content before injection. If provided, ``prompt`` should
be a dict-like object (or string representation) whose keys map
to the template parameters.
final content before injection. If provided, ``prompt`` passed
to ``convert_async`` must be a string whose contents can be
interpreted as the template parameters (for example, a
JSON-encoded or other parseable mapping of keys to values).

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +57
"""Convert a text prompt into a Word (.docx) document.

This converter supports two main modes:

1. **New document generation**
If no existing document is provided, the converter creates a simple `.docx`
containing the rendered prompt content in a single paragraph.

2. **Placeholder-based injection into an existing document**
If an ``existing_docx`` is provided, the converter searches for a literal
placeholder string (for example ``{{INJECTION_PLACEHOLDER}}``) in the
document's paragraphs. When the placeholder is found fully inside a single
run, it is replaced with the rendered prompt content while preserving the
rest of the paragraph and its formatting.

.. important::
Placeholders must be fully contained within a single run. If a
placeholder spans multiple runs (for example due to mixed formatting),
this converter will not replace it. This limitation is intentional to
avoid collapsing mixed formatting or rewriting complex run structures.

Security note:
This converter does **not** render Jinja2 templates from arbitrary
``.docx`` content. Templating is handled via ``SeedPrompt`` (if provided),
and only the already-rendered text is injected into the document. This
avoids executing untrusted Jinja2 templates from document bodies.
"""
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions updated docs and added unit tests for this converter, but the changeset here only adds the converter, an __init__ export, and a dependency. If docs/tests are expected, they appear to be missing from this PR.

Copilot uses AI. Check for mistakes.
Comment on lines +100 to 101
from pyrit.prompt_converter.word_doc_converter import WordDoc_Converter
from pyrit.prompt_converter.zalgo_converter import ZalgoConverter
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WordDoc_Converter export/import name likely needs to change if the converter class is renamed to match existing converter naming conventions (no underscores in class names).

Copilot uses AI. Check for mistakes.
Comment on lines 51 to +54
"python-dotenv>=1.0.1",
"pypdf>=6.6.2",
"reportlab>=4.4.4",
"python-docx>=1.1.0",
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep dependencies roughly sorted for maintainability; python-docx is currently placed after reportlab (R) but should be grouped with the other py* / python-* dependencies (near python-dotenv / pypdf).

Suggested change
"python-dotenv>=1.0.1",
"pypdf>=6.6.2",
"reportlab>=4.4.4",
"python-docx>=1.1.0",
"pypdf>=6.6.2",
"python-docx>=1.1.0",
"python-dotenv>=1.0.1",
"reportlab>=4.4.4",

Copilot uses AI. Check for mistakes.
placeholder: str


class WordDoc_Converter(PromptConverter):
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class name WordDoc_Converter is inconsistent with the naming pattern used by other converters (e.g., PDFConverter, UrlConverter) and will surface in discovery APIs (e.g., get_converter_modalities). Consider renaming to WordDocConverter and updating the corresponding import/export in pyrit.prompt_converter.__init__.

Suggested change
class WordDoc_Converter(PromptConverter):
class WordDocConverter(PromptConverter):

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants