Sanitize json with control characters by trifle · Pull Request #473 · python273/vk_api

trifle · 2022-10-15T12:40:33Z

Hi!

I recently saw that VK seems to be really bad at sanitizing some fields such as usernames.
Loading a bunch of members with vk_api yielded an ugly
JSONDecodeError: Invalid control character at: line 1 column 68248 (char 68247)

Inspecting this by hand, I found that the response contains this abomination:

In [32]: e[68200:68259]
Out[32]: ',"is_closed":false},{"id":27497241,"nickname":"\x01","domain":'

Of course, <0x01> is not a codepoint that you'd want in a nickname, ever!

Since the data is nonsensical, not printable and potentially dangerous, my suggestion would be to catch JSONDecodeErrors and try to sanitize the raw content before parsing it with json.loads.

There are many options for replacing the problematic characters, but regex should be reasonably fast. As a bonus, it allows us to catch control characters as a category (\p{C}) rather than listing them by hand.

python273 · 2022-10-15T13:33:27Z

might be better to use strict=False instead of regexp?

https://docs.python.org/3/library/json.html#json.JSONDecoder

trifle · 2022-10-15T13:39:45Z

Hi @python273, thanks for taking the time to look into it!

I've actually thought about using that flag: As far as I read it, this simply allows the json decoder to pass on this invalid data.
But I don't think it's best to use that here, because the resulting data would almost certainly cause problems for people. I mean, is there any valid use case where you would want a null byte or similar in a VK field to process further?

I also feel that delimiters are another special case. As far as I see it, having "\r" or "\n" in json data (unescaped!) means that almost certainly something on the sending end is broken. Because all tooling that works with newlines will be broken, and you essentially can't rely on any of the data to remain sensible (i.e. where do you stop parsing? Just kill everything before the newline?).

A final consideration is the maintainability. From that perspective I could see why using the json decoder flag is more attractive.

What do you think?

python273 · 2022-10-15T14:06:59Z

I reported the bug to VK, hopefully they just fix their JSON encoder to escape control chars ('\x01' as '\\u0001')

Not sure about regexp, since it alters the content of the response, so let's see what VK replies

trifle · 2022-10-15T14:10:13Z

Thanks, that's a great idea.
I've encountered way too many double- or under-encoded data, one less is certainly a good idea.
I'll keep using my fork in the meantime, so feel free to close this PR.
BTW, there still is a typo, the import is named re not rx, I'll change that.

Sanitize json

d75950e

Pascal Jürgens added 2 commits October 15, 2022 16:10

Fix namespace typo and switch to robust json parsing

03f45cc

Use working regex for control character range

cff791d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sanitize json with control characters#473

Sanitize json with control characters#473
trifle wants to merge 3 commits intopython273:masterfrom
trifle:master

trifle commented Oct 15, 2022

Uh oh!

python273 commented Oct 15, 2022

Uh oh!

trifle commented Oct 15, 2022

Uh oh!

python273 commented Oct 15, 2022 •

edited

Loading

Uh oh!

trifle commented Oct 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

trifle commented Oct 15, 2022

Uh oh!

python273 commented Oct 15, 2022

Uh oh!

trifle commented Oct 15, 2022

Uh oh!

python273 commented Oct 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

trifle commented Oct 15, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

python273 commented Oct 15, 2022 •

edited

Loading