Sanitize json with control characters#473
Conversation
|
might be better to use https://docs.python.org/3/library/json.html#json.JSONDecoder |
|
Hi @python273, thanks for taking the time to look into it! I've actually thought about using that flag: As far as I read it, this simply allows the json decoder to pass on this invalid data. I also feel that delimiters are another special case. As far as I see it, having "\r" or "\n" in json data (unescaped!) means that almost certainly something on the sending end is broken. Because all tooling that works with newlines will be broken, and you essentially can't rely on any of the data to remain sensible (i.e. where do you stop parsing? Just kill everything before the newline?). A final consideration is the maintainability. From that perspective I could see why using the json decoder flag is more attractive. What do you think? |
|
I reported the bug to VK, hopefully they just fix their JSON encoder to escape control chars ('\x01' as '\\u0001') Not sure about regexp, since it alters the content of the response, so let's see what VK replies |
|
Thanks, that's a great idea. |
Hi!
I recently saw that VK seems to be really bad at sanitizing some fields such as usernames.
Loading a bunch of members with vk_api yielded an ugly
JSONDecodeError: Invalid control character at: line 1 column 68248 (char 68247)Inspecting this by hand, I found that the response contains this abomination:
Of course, <0x01> is not a codepoint that you'd want in a nickname, ever!
Since the data is nonsensical, not printable and potentially dangerous, my suggestion would be to catch
JSONDecodeErrors and try to sanitize the raw content before parsing it with json.loads.There are many options for replacing the problematic characters, but regex should be reasonably fast. As a bonus, it allows us to catch control characters as a category (
\p{C}) rather than listing them by hand.