Make parsing of text be non-quadratic. #579
Conversation
In Python, appending strings is not guaranteed to be constant-time,
since they are documented to be immutable. In some corner cases,
CPython is able to make these operations constant-time, but reaching
into ETree objects is not such a case.
This leads to parse times being quadratic in the size of the text in
the input in pathological cases where parsing outputs a large number
of adjacent text nodes which must be combined (e.g. HTML-escaped
values). Specifically, we expect doubling the size of the input to
result in approximately doubling the time to parse; instead, we
observe quadratic behavior:
```
In [1]: import html5lib
In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
Switch from appending to the internal `str`, to appending text to an
array of text chunks, as appends can be done in constant time. Using
`bytearray` is a similar solution, but benchmarks slightly worse
because the strings must be encoded before being appended.
This improves parsing of text documents noticeably:
```
In [1]: import html5lib
In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
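For illustration, here is a minimal sketch of the chunk-list idea (hypothetical names, not the actual treebuilder code): collect text pieces in a Python list, where each append is amortized O(1), and join them into a single `str` only when the value is read.

```python
class TextAccumulator:
    """Collect text as a list of chunks; join lazily when read."""

    def __init__(self):
        self._chunks = []

    def append(self, text):
        # list.append is amortized O(1); concatenating onto a str held
        # by an attribute would copy the whole string every time.
        self._chunks.append(text)

    @property
    def value(self):
        # One O(total length) join, done only when the text is needed.
        return "".join(self._chunks)


buf = TextAccumulator()
for _ in range(100_000):
    buf.append("&")  # e.g. one decoded character reference per append
assert buf.value == "&" * 100_000
```

As the review comment below points out, whatever object holds the chunks still has to be converted back to a plain `str` before it is exposed through the public `.text` attribute.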
This solution can’t work, as it’s a breaking change to the public API.

Before:
```
>>> html5lib.parse("hello")[1].text
'hello'
```
After:
```
>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
```
From what I can see, there are also plenty of similar string-append operations in html5lib/_tokenizer.py, for example line 215 in fd4f032.
@lopuhin That line is slow even in CPython. In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) for an object attribute and the other cases demonstrated below:
```
import timeit

def linear_local(n):
    s = ""
    for i in range(n):
        s += "a"  # fast: sole reference, CPython can resize in place

def quadratic_object(n):
    class C:
        pass
    c = C()
    c.s = ""
    for i in range(n):
        c.s += "a"  # slow: copies the whole string on every append

def quadratic_array(n):
    a = [""]
    for i in range(n):
        a[0] += "a"  # slow

def quadratic_global(n):
    global s
    s = ""
    for i in range(n):
        s += "a"  # slow

def quadratic_nonlocal(n):
    s = ""
    def inner():
        nonlocal s
        for i in range(n):
            s += "a"  # slow
    inner()

for f in [linear_local, quadratic_object, quadratic_array,
          quadratic_global, quadratic_nonlocal]:
    for n in [100000, 200000, 400000, 800000]:
        print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))
```
Output with CPython 3.13.2:
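The usual workaround for the slow attribute case is the same chunk-list idea the PR description outlines: append pieces to a list stored on the object and join once at the end. A rough sketch (hypothetical, not html5lib code) that turns `quadratic_object` above into a linear variant:

```python
import timeit

def linear_object_with_list(n):
    class C:
        pass
    c = C()
    c.chunks = []
    for i in range(n):
        c.chunks.append("a")  # amortized O(1), even via an attribute
    c.s = "".join(c.chunks)   # single O(n) join at the end

for n in [100000, 200000, 400000, 800000]:
    print("linear_object_with_list", n,
          timeit.timeit(lambda: linear_object_with_list(n), number=1))
```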
Good point, thank you! Indeed I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.
Old flamegraph:

New flamegraph:
