-
Notifications
You must be signed in to change notification settings - Fork 665
FEAT Add WordDocConverter #1368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -97,6 +97,7 @@ | |
| from pyrit.prompt_converter.unicode_sub_converter import UnicodeSubstitutionConverter | ||
| from pyrit.prompt_converter.url_converter import UrlConverter | ||
| from pyrit.prompt_converter.variation_converter import VariationConverter | ||
| from pyrit.prompt_converter.word_doc_converter import WordDoc_Converter | ||
| from pyrit.prompt_converter.zalgo_converter import ZalgoConverter | ||
|
Comment on lines
+100
to
101
|
||
| from pyrit.prompt_converter.zero_width_converter import ZeroWidthConverter | ||
|
|
||
|
|
@@ -177,6 +178,7 @@ | |
| "UnicodeSubstitutionConverter", | ||
| "UrlConverter", | ||
| "VariationConverter", | ||
| "WordDoc_Converter", | ||
| "VariationSelectorSmugglerConverter", | ||
| "WordIndexSelectionStrategy", | ||
| "WordKeywordSelectionStrategy", | ||
|
|
||
| Original file line number | Diff line number | Diff line change | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,258 @@ | ||||||||||||||||
| from __future__ import annotations | ||||||||||||||||
|
|
||||||||||||||||
| # Copyright (c) Microsoft Corporation. | ||||||||||||||||
| # Licensed under the MIT license. | ||||||||||||||||
|
|
||||||||||||||||
| from dataclasses import dataclass | ||||||||||||||||
| import ast | ||||||||||||||||
| import hashlib | ||||||||||||||||
| from io import BytesIO | ||||||||||||||||
| from pathlib import Path | ||||||||||||||||
| from typing import Any, Dict, Optional | ||||||||||||||||
|
|
||||||||||||||||
| from docx import Document # type: ignore[import-untyped] | ||||||||||||||||
|
|
||||||||||||||||
| from pyrit.common.logger import logger | ||||||||||||||||
| from pyrit.identifiers import ConverterIdentifier | ||||||||||||||||
| from pyrit.models import PromptDataType, SeedPrompt, data_serializer_factory | ||||||||||||||||
| from pyrit.models.data_type_serializer import DataTypeSerializer | ||||||||||||||||
| from pyrit.prompt_converter.prompt_converter import ConverterResult, PromptConverter | ||||||||||||||||
|
|
||||||||||||||||
|
|
||||||||||||||||
| @dataclass | ||||||||||||||||
| class _WordDocInjectionConfig: | ||||||||||||||||
| """Configuration for how to inject content into a Word document.""" | ||||||||||||||||
|
|
||||||||||||||||
| existing_docx: Optional[Path] | ||||||||||||||||
| placeholder: str | ||||||||||||||||
|
|
||||||||||||||||
|
|
||||||||||||||||
| class WordDoc_Converter(PromptConverter): | ||||||||||||||||
|
||||||||||||||||
| class WordDoc_Converter(PromptConverter): | |
| class WordDocConverter(PromptConverter): |
Copilot
AI
Feb 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR description mentions updated docs and added unit tests for this converter, but the changeset here only adds the converter, an __init__ export, and a dependency. If docs/tests are expected, they appear to be missing from this PR.
Copilot
AI
Feb 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring for prompt_template mentions that prompt can be a “dict-like object”, but PromptConverter.convert_async is typed/documented to accept prompt: str and other converters treat the templated case as a string that can be parsed into a dict. Consider updating this documentation to avoid implying that non-string prompts are supported by the converter interface.
| final content before injection. If provided, ``prompt`` should | |
| be a dict-like object (or string representation) whose keys map | |
| to the template parameters. | |
| final content before injection. If provided, ``prompt`` passed | |
| to ``convert_async`` must be a string whose contents can be | |
| interpreted as the template parameters (for example, a | |
| JSON-encoded or other parseable mapping of keys to values). |
Copilot
AI
Feb 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduces a new converter with non-trivial behavior (new doc generation + placeholder injection across runs), but there are no corresponding unit tests under tests/unit/converter/ (the repo has extensive converter test coverage, e.g., test_pdf_converter.py). Add unit tests for the main success and failure paths to prevent regressions.
Copilot
AI
Feb 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid using assert for required runtime checks. In optimized runs (python -O) assertions are stripped, which could allow existing_docx to be None here and lead to a less clear failure. Prefer an explicit check with a clear exception (or rely on the earlier constructor validation and remove the assertion entirely).
| assert self._injection_config.existing_docx is not None | |
| if self._injection_config.existing_docx is None: | |
| raise ValueError( | |
| "Cannot inject into an existing Word document because no 'existing_docx' " | |
| "path was provided in the injection configuration." | |
| ) |
Copilot
AI
Feb 16, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_serialize_docx is an async method but its name does not end with _async. Rename it (and its call site) to follow the project convention for async method naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep
dependenciesroughly sorted for maintainability;python-docxis currently placed afterreportlab(R) but should be grouped with the otherpy*/python-*dependencies (nearpython-dotenv/pypdf).