-
Notifications
You must be signed in to change notification settings - Fork 280
Finally, correct UTF-8 and charset handling: #1759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cromedome
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! I think this is close to ready. See other comments.
The tests are a little noisy; please disregard if this is expected:
Subroutine from_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 15.
Subroutine to_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 17.
Subroutine decode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 19.
Subroutine encode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 25.
Subroutine serialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 32.
Subroutine deserialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 50.
Subroutine _valid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 61.
Subroutine _decode_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 67.
Subroutine _ensure_characters redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 73.
Subroutine _ensure_scalar redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 106.
Subroutine _invalid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 117.
Let me see how to cut this down. We don't like noise. :) |
|
The tests consistently fail trying to install Dancer2 2.000001 from CPAN. (Not sure why it's even installing this as a dependency to testing Dancer2 code branch.) When I try to install it myself, all tests pass. |
|
Reworked another case that |
1f5ca45 to
bbaf1f3
Compare
veryrusty
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UTF-8 by default, yet changable is you want something else. 👍
Nice work @xsawyerx !!
05be997 to
2fc1029
Compare
|
Rebased. |
There are a bunch of tickets on this, and this finally resolves them.
We now:
* have decode boundaries
* enforce JSON characters
* honor charset config
Dancer2 mixed bytes and characters across requests, serializers, and responses.
UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and
charset config was ignored in some response paths.
So my approach is:
* Enforce a clear invariant: bytes at the edges, characters inside.
* Decode request inputs (`path`/`params`/`uploads`) to characters when valid
UTF‑8.
* Keep JSON serializers emitting UTF‑8 bytes, but guarantee their input is
character data. Add strict mode to fail on invalid UTF‑8.
* Honor configured charset consistently for text responses and file/template IO.
I changed the following:
* Request boundary decoding with lenient/strict UTF‑8:
* Decode `PATH_INFO` and params when valid UTF‑8, warn and keep bytes
by default. `strict_utf8` turns those warnings into errors.
* Added `strict_utf8` config (and env `DANCER_STRICT_UTF8`), passed
into request and serializer configs.
* JSON serialization:
* Walk data structures to ensure scalars are characters. Avoid
decoding ASCII bytes to preserve numeric scalars.
* Always emit UTF‑8 bytes. `strict_utf8` errors are re‑thrown instead
of logged.
* Response encoding/charset:
* `Response->charset` is now `Str` with a predicate. Empty disables encoding.
* Prefer charset from `Content‑Type` header. Otherwise use response
charset.
* Response `serializer` is `rw` so `set serializer => ...` updates the
current response and avoids stale serializer refs (fixes `log_levels`
test).
* File/template IO:
* File utilities accept an optional charset. `Template::Tiny` honors
app charset.
* Error pages read with configured charset.
Also updated documentation on:
* Corrected JSON keyword docs: both `to_json` and `encode_json` return UTF‑8
bytes. Clarified expected UTF‑8 input for `decode`/`from_json`.
* Added `strict_utf8` config and clarified charset behavior.
Tests:
* Updated expectations around default charset (no charset unless configured).
* Added `utf8_strict.t` for lenient/strict UTF‑8 behavior.
* Adjusted tests that assumed byte output when charset is unset.
Why this works:
* Eliminates double‑encoding and mojibake by ensuring JSON sees characters.
* Makes UTF‑8 routing and params consistent with body decoding.
* Honors charset config uniformly across responses and templates.
* Provides a strict mode for apps that want hard UTF‑8 enforcement.
We're assuming we will fail trying to load a misspelled module name but on case insensitive systems it finds it, loads it, and gives a warning about redefined subroutines. We cannot use `no warnings qw<redefine>` because the warning is happening in the core code, not in the test.
When Unicode::UTF8 is missing, we fall back to `Encode::decode` for UTF‑8 validation and decoding. `Encode::decode` modifies its input in place. So calling it "just to validate" with `FB_CROAK` can consume or clear the string, even if you never use the decoded result. In CI (without `Unicode::UTF8`), this meant we were validating bytes and then later trying to decode an already‑consumed string, which produced empty strings and broke param parsing, uploads, and paths. First, we need to make the validation non-destructive so before we make a copy of the input before calling `decode(..., FB_CROAK)`. Then we need to skip decoding when we don't need to decode so `_decode_bytes()` checks if PSGI gave us a character string (`utf8::is_utf8($bytes)`) when it's supposed to give us bytes, and if it's ASCII-only (`$bytes !~ /.../`). That way we don't decode unnecessarily and avoid messing up ASCII paths/params. (This is what Unicode::UTF8 path does.)
We now treat "characters with no charset" as a boundary error and resolve it in a predictable way. If a response body is still a Perl character string and no charset is configured, Response now logs a single warning (once per process), assumes UTF‑8, and encodes to bytes so PSGI always receives valid output. In `strict_utf8` mode, this is a fatal error instead of a guess. This prevents "HTTP::Message content must be bytes" failures for apps that return character data (e.g. `YAML::Dump`) without a charset. * Add `Response->log_cb` and `strict_utf8` support, plus a one‑time guard for the "no charset" warning. * Route warning through the app logger (+ weaken). * Pass `strict_utf8`/`log_cb` when building the response in `Core::App`.
Make charset default to UTF‑8, restoring the practical pre‑change behavior where responses were UTF‑8 unless explicitly overridden. This prevents "characters with no charset" errors. Update documentation to state UTF‑8 as the default charset and describe the opt‑out behavior.
2fc1029 to
1fc4149
Compare
|
Rebased again after merging #1767 but still lots of noisy tests... |
There are a bunch of tickets on this, and this finally resolves them. Closes #686, #1124, #1143, #1594.
This helps us have decode boundaries, enforce JSON characters, and honor charset config.
Basically, Dancer2 mixed bytes and characters across requests, serializers, and responses. UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and charset config was ignored in some response paths.
So my approach is:
path/params/uploads) to characters when valid UTF‑8. (Plack itself doesn't do this.)The goal is to:
Changes:
PATH_INFOand params when valid UTF‑8, warn and keep bytes by default.strict_utf8turns those warnings into errors.strict_utf8config (and envDANCER_STRICT_UTF8), passed into request and serializer configs.strict_utf8errors are re‑thrown instead of logged.Response->charsetis added. Empty means no encoding.Content‑Typeheader. Otherwise use response charset.serializerisrwsoset serializer => ...updates the current response and avoids stale serializer refs (fixeslog_levelstest).Template::Tinyhonors app charset.Also updated documentation on:
to_jsonandencode_jsonreturn UTF‑8 bytes. Clarified expected UTF‑8 input fordecode/from_json.strict_utf8config and clarified charset behavior.Tests:
utf8_strict.tfor lenient/strict UTF‑8 behavior.