Description
When converting HTML that contains circular parent-child references in the parsed BeautifulSoup tree (produced by certain PDF-to-HTML pipelines), process_element and process_tag recurse infinitely, crashing with a RecursionError.
This was encountered when using markdownify via marker-pdf to convert PDF documents. The HTML produced by the PDF parser contained structures that caused BeautifulSoup's html.parser to build a non-tree graph in which a descendant node held a reference back to an ancestor, so the traversal never terminated and the call stack grew without bound.
The mutual recursion introduced in 1.2.2 between process_element and process_tag has no cycle guard:
- process_tag iterates node.children and calls process_element for each child
- process_element calls process_tag for any Tag node
    RecursionError: maximum recursion depth exceeded
      File "markdownify/__init__.py", line 232, in process_element
        return self.process_tag(node, parent_tags=parent_tags)
      File "markdownify/__init__.py", line 287, in process_tag
        child_strings = [
      File "markdownify/__init__.py", line 288, in <listcomp>
        self.process_element(el, parent_tags=parent_tags_for_children)
      File "markdownify/__init__.py", line 232, in process_element
        return self.process_tag(node, parent_tags=parent_tags)
      ... (repeating until stack exhausted)
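The failure mode can be illustrated without markdownify or BeautifulSoup. The sketch below is a toy model, not the library's code: `Node` and `exhausts_stack` are hypothetical names, and the two functions only mimic the shape of the mutual recursion shown in the traceback.

```python
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus a list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def process_element(node):
    # Mirrors the reported call chain: every node is handed on to process_tag.
    return process_tag(node)


def process_tag(node):
    # Recurse into each child with no record of which nodes were already seen.
    return node.name + "".join(process_element(child) for child in node.children)


def exhausts_stack(root):
    """Return True if converting `root` raises RecursionError."""
    try:
        process_tag(root)
        return False
    except RecursionError:
        return True


# A well-formed tree converts normally.
tree = Node("div")
tree.children.append(Node("p"))

# A descendant that lists its own ancestor as a child forms a cycle.
ancestor = Node("div")
descendant = Node("span")
ancestor.children.append(descendant)
descendant.children.append(ancestor)

print(exhausts_stack(tree))      # False
print(exhausts_stack(ancestor))  # True
```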
Suggested fix:
Pass a visited set of node ids through the call chain to detect and break cycles:
    def process_element(self, node, parent_tags=None, _visited=None):
        if isinstance(node, NavigableString):
            return self.process_text(node, parent_tags=parent_tags)
        else:
            return self.process_tag(node, parent_tags=parent_tags, _visited=_visited)

    def process_tag(self, node, parent_tags=None, _visited=None):
        if parent_tags is None:
            parent_tags = set()

        # Cycle detection
        if _visited is None:
            _visited = set()
        node_id = id(node)
        if node_id in _visited:
            return ''
        _visited.add(node_id)

        # ... rest of method unchanged, but pass _visited= to process_element calls
        child_strings = [
            self.process_element(el, parent_tags=parent_tags_for_children, _visited=_visited)
            for el in children_to_convert
        ]
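One design choice is worth weighing here. A set that only ever grows also suppresses a node that is reached a second time through a different, non-cyclic path. The sketch below is a possible variant, again on a toy model rather than markdownify's code (`Node`, `convert`, and `_ancestors` are hypothetical names): it tracks only the nodes on the current ancestor path and removes each one on the way out, so only true back-references are dropped.

```python
class Node:
    """Hypothetical stand-in for a parsed tag: a name plus a list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def convert(node, _ancestors=None):
    """Concatenate node names depth-first, skipping any node already on the current path."""
    if _ancestors is None:
        _ancestors = set()
    node_id = id(node)
    if node_id in _ancestors:
        return ""  # back-reference to an ancestor: break the cycle here
    _ancestors.add(node_id)
    try:
        return node.name + "".join(convert(child, _ancestors) for child in node.children)
    finally:
        # Remove on the way out so the same node can still appear in another branch.
        _ancestors.discard(node_id)


# Cyclic graph: the descendant lists its ancestor as a child.
ancestor = Node("div")
descendant = Node("span")
ancestor.children.append(descendant)
descendant.children.append(ancestor)
print(convert(ancestor))  # divspan

# Shared, non-cyclic node: reachable twice from the same parent.
root = Node("a")
shared = Node("b")
root.children.extend([shared, shared])
print(convert(root))  # abb
```

With a grow-only visited set, the second example would produce `ab` instead, because the shared node would be skipped on its second visit. Whether that matters depends on whether the malformed trees can contain shared nodes as well as cycles.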