Add PDF Image Extractor script with README documentation #500
gracetyy wants to merge 1 commit into wasmerio:main
Pull request overview
This PR introduces a new utility script that recursively extracts all embedded images from PDF files in a directory tree. The script uses PyMuPDF (fitz) to process PDFs and supports optional deduplication of images per document.
Key changes:
- Adds `pdf_image_extractor.py` with a command-line interface for PDF image extraction
- Includes comprehensive README with usage examples and documentation
- Supports customizable output directory and per-PDF deduplication options
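For orientation, the approach described in this overview could look roughly like the sketch below. It is an illustration, not the script from this PR: the function names `find_pdfs` and `extract_images` and the output file naming are hypothetical, and it assumes PyMuPDF's `fitz.open`, `Page.get_images`, and `Document.extract_image`.

```python
import os


def find_pdfs(root):
    """Yield paths of all .pdf files under root, recursively, in sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)


def extract_images(pdf_path, output_folder, dedup=False):
    """Write every embedded image of one PDF to output_folder; return the count."""
    import fitz  # PyMuPDF; imported lazily so find_pdfs works without it

    seen = set()
    count = 0
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for img in page.get_images(full=True):
                xref = img[0]  # first tuple element is the image's xref
                if dedup:
                    if xref in seen:
                        continue  # already written for this PDF
                    seen.add(xref)
                info = doc.extract_image(xref)
                os.makedirs(output_folder, exist_ok=True)
                count += 1
                out_name = f"image_{count}.{info['ext']}"  # hypothetical naming
                with open(os.path.join(output_folder, out_name), "wb") as fh:
                    fh.write(info["image"])
    return count
```

Keying deduplication on the xref is one possible choice; it only catches images that the PDF itself stores once and references several times.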
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| PDF Image Extractor/pdf_image_extractor.py | Main script implementing recursive PDF scanning and image extraction logic with CLI argument parsing |
| PDF Image Extractor/README.md | Documentation covering requirements, usage, CLI options, and output structure |
Code Quality Observations:
The implementation is generally well-structured with clear separation of concerns. However, there are a few technical issues to address:
- Potential crash with `os.path.commonpath()` (lines 14-16): The code uses `os.path.commonpath([pdf_path, output_root])`, which raises a `ValueError` on Windows when the paths are on different drives, or when absolute and relative paths are mixed. This would crash the script in the common scenario where users specify an output directory on a different drive. The logic appears intended to mirror the directory structure, but using the common path as the base is problematic. A simpler approach would be to calculate the relative path from the input `pdf_dir` directory.
- Inefficient directory creation logic (lines 35-36): The condition `if img_count == 0 and not os.path.exists(output_folder)` only creates the directory before writing the first image. Because `os.makedirs()` is called with `exist_ok=True`, the extra `os.path.exists()` check is redundant. It would be clearer to create the directory once before the loop if there are images to extract.
- Redundant deduplication checks (lines 27-30): The code checks `if dedup` twice, once to skip duplicates and again to add to the set. This could be simplified to a single conditional block.
- Missing `requirements.txt`: Several other projects in this repository include a `requirements.txt` file (e.g., PDF Merger, Image Watermarker, Image to ASCII). Adding one for this project would improve consistency and make dependency installation clearer for users.
- Missing error handling for image extraction: If `doc.extract_image(xref)` fails (line 32), the script will crash. While PyMuPDF is generally robust, adding a try-except block would make the script more resilient.
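The path-mirroring, deduplication, and error-handling points above could be addressed along the lines of the sketch below. It is a suggestion under assumptions rather than the PR's code: `output_folder_for`, `collect_images`, and the `extract` callable are hypothetical names, with `extract` standing in for `doc.extract_image`, and dropping the `.pdf` extension from the folder name is a guess at the intended layout.

```python
import os


def output_folder_for(pdf_path, pdf_dir, output_root):
    """Mirror the PDF's location under pdf_dir beneath output_root.

    The relative path is taken against the input root only, so
    output_root is never compared with pdf_path and may sit on a
    different drive.
    """
    rel = os.path.relpath(pdf_path, pdf_dir)
    # Assumption: one folder per PDF, named after the file without ".pdf".
    return os.path.join(output_root, os.path.splitext(rel)[0])


def collect_images(xrefs, extract, dedup=False):
    """Return (images, failed_xrefs) for a sequence of image xrefs."""
    seen = set()
    images, failed = [], []
    for xref in xrefs:
        if dedup:  # single conditional block: skip or remember
            if xref in seen:
                continue
            seen.add(xref)
        try:
            images.append(extract(xref))
        except Exception:
            # Broad on purpose: the exact exception types raised by
            # PyMuPDF are not assumed here.
            failed.append(xref)
    return images, failed
```

Returning the failed xrefs instead of aborting lets the caller report which images were skipped and continue with the remaining PDFs.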
Documentation:
The README is well-written with clear examples and appropriate detail. The structure follows good practices with separate sections for requirements, usage, examples, and output structure.
This PR adds a new script, PDF Image Extractor, which recursively scans a directory tree for PDF files and extracts all embedded images from each document.
PDF within the input root directory by default (customizable via `--out`).
`--dedup` flag to enable per-PDF deduplication of images.
Additional notes: