Finally, correct UTF-8 and charset handling: #1759

xsawyerx · 2026-01-21T13:07:37Z

There are a bunch of tickets on this, and this finally resolves them. Closes #686, #1124, #1143, #1594.

This helps us have decode boundaries, enforce JSON characters, and honor charset config.

Basically, Dancer2 mixed bytes and characters across requests, serializers, and responses. UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and charset config was ignored in some response paths.

So my approach is:

Enforce a clear rule: bytes at the edges, characters inside.
Decode request inputs (path/params/uploads) to characters when valid UTF‑8. (Plack itself doesn't do this.)
Keep JSON serializers emitting UTF‑8 bytes, but guarantee their input is character data. Add strict mode to fail on invalid UTF‑8.
Honor configured charset consistently for text responses and file/template IO.

The goal is to:

Eliminates double‑encoding and mojibake by ensuring JSON sees characters.
Makes UTF‑8 routing and params consistent with body decoding.
Honors charset config uniformly across responses and templates.
Provides a strict mode for apps that want hard UTF‑8 enforcement.

Changes:

Request boundary decoding with lenient/strict UTF‑8:
- Decode PATH_INFO and params when valid UTF‑8, warn and keep bytes by default. strict_utf8 turns those warnings into errors.
- Added strict_utf8 config (and env DANCER_STRICT_UTF8), passed into request and serializer configs.
JSON serialization:
- Walk data structures to ensure scalars are characters. Avoid decoding ASCII bytes to preserve numeric scalars.
- Always emit UTF‑8 bytes. strict_utf8 errors are re‑thrown instead of logged.
Response encoding/charset:
- Response->charset is added. Empty means no encoding.
- Prefer charset from Content‑Type header. Otherwise use response charset.
- Response serializer is rw so set serializer => ... updates the current response and avoids stale serializer refs (fixes log_levels test).
File/template IO:
- File utilities accept an optional charset. Template::Tiny honors app charset.
- Error pages read with configured charset.

Also updated documentation on:

Corrected JSON keyword docs: both to_json and encode_json return UTF‑8 bytes. Clarified expected UTF‑8 input for decode/from_json.
Added strict_utf8 config and clarified charset behavior.

Tests:

Updated expectations around default charset (no charset unless configured).
Added utf8_strict.t for lenient/strict UTF‑8 behavior.
Adjusted tests that assumed byte output when charset is unset.

cromedome

This is awesome! I think this is close to ready. See other comments.

The tests are a little noisy; please disregard if this is expected:

Subroutine from_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 15.
Subroutine to_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 17.
Subroutine decode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 19.
Subroutine encode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 25.
Subroutine serialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 32.
Subroutine deserialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 50.
Subroutine _valid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 61.
Subroutine _decode_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 67.
Subroutine _ensure_characters redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 73.
Subroutine _ensure_scalar redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 106.
Subroutine _invalid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 117.

lib/Dancer2/Core/Request.pm

lib/Dancer2/Core/Role/Serializer.pm

lib/Dancer2/Core/App.pm

xsawyerx · 2026-01-22T07:38:57Z

The tests are a little noisy; please disregard if this is expected:

Let me see how to cut this down. We don't like noise. :)

xsawyerx · 2026-01-22T08:07:47Z

The tests consistently fail trying to install Dancer2 2.000001 from CPAN. (Not sure why it's even installing this as a dependency to testing Dancer2 code branch.) When I try to install it myself, all tests pass.

lib/Dancer2/Core/Request.pm

xsawyerx · 2026-01-22T15:43:19Z

Reworked another case that Auth::Extensible found.

t/dsl/send_as.t

veryrusty

UTF-8 by default, yet changable is you want something else. 👍

Nice work @xsawyerx !!

xsawyerx · 2026-02-08T12:46:29Z

Rebased.

There are a bunch of tickets on this, and this finally resolves them. We now: * have decode boundaries * enforce JSON characters * honor charset config Dancer2 mixed bytes and characters across requests, serializers, and responses. UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and charset config was ignored in some response paths. So my approach is: * Enforce a clear invariant: bytes at the edges, characters inside. * Decode request inputs (`path`/`params`/`uploads`) to characters when valid UTF‑8. * Keep JSON serializers emitting UTF‑8 bytes, but guarantee their input is character data. Add strict mode to fail on invalid UTF‑8. * Honor configured charset consistently for text responses and file/template IO. I changed the following: * Request boundary decoding with lenient/strict UTF‑8: * Decode `PATH_INFO` and params when valid UTF‑8, warn and keep bytes by default. `strict_utf8` turns those warnings into errors. * Added `strict_utf8` config (and env `DANCER_STRICT_UTF8`), passed into request and serializer configs. * JSON serialization: * Walk data structures to ensure scalars are characters. Avoid decoding ASCII bytes to preserve numeric scalars. * Always emit UTF‑8 bytes. `strict_utf8` errors are re‑thrown instead of logged. * Response encoding/charset: * `Response->charset` is now `Str` with a predicate. Empty disables encoding. * Prefer charset from `Content‑Type` header. Otherwise use response charset. * Response `serializer` is `rw` so `set serializer => ...` updates the current response and avoids stale serializer refs (fixes `log_levels` test). * File/template IO: * File utilities accept an optional charset. `Template::Tiny` honors app charset. * Error pages read with configured charset. Also updated documentation on: * Corrected JSON keyword docs: both `to_json` and `encode_json` return UTF‑8 bytes. Clarified expected UTF‑8 input for `decode`/`from_json`. * Added `strict_utf8` config and clarified charset behavior. Tests: * Updated expectations around default charset (no charset unless configured). * Added `utf8_strict.t` for lenient/strict UTF‑8 behavior. * Adjusted tests that assumed byte output when charset is unset. Why this works: * Eliminates double‑encoding and mojibake by ensuring JSON sees characters. * Makes UTF‑8 routing and params consistent with body decoding. * Honors charset config uniformly across responses and templates. * Provides a strict mode for apps that want hard UTF‑8 enforcement.

We're assuming we will fail trying to load a misspelled module name but on case insensitive systems it finds it, loads it, and gives a warning about redefined subroutines. We cannot use `no warnings qw<redefine>` because the warning is happening in the core code, not in the test.

When Unicode::UTF8 is missing, we fall back to `Encode::decode` for UTF‑8 validation and decoding. `Encode::decode` modifies its input in place. So calling it "just to validate" with `FB_CROAK` can consume or clear the string, even if you never use the decoded result. In CI (without `Unicode::UTF8`), this meant we were validating bytes and then later trying to decode an already‑consumed string, which produced empty strings and broke param parsing, uploads, and paths. First, we need to make the validation non-destructive so before we make a copy of the input before calling `decode(..., FB_CROAK)`. Then we need to skip decoding when we don't need to decode so `_decode_bytes()` checks if PSGI gave us a character string (`utf8::is_utf8($bytes)`) when it's supposed to give us bytes, and if it's ASCII-only (`$bytes !~ /.../`). That way we don't decode unnecessarily and avoid messing up ASCII paths/params. (This is what Unicode::UTF8 path does.)

We now treat "characters with no charset" as a boundary error and resolve it in a predictable way. If a response body is still a Perl character string and no charset is configured, Response now logs a single warning (once per process), assumes UTF‑8, and encodes to bytes so PSGI always receives valid output. In `strict_utf8` mode, this is a fatal error instead of a guess. This prevents "HTTP::Message content must be bytes" failures for apps that return character data (e.g. `YAML::Dump`) without a charset. * Add `Response->log_cb` and `strict_utf8` support, plus a one‑time guard for the "no charset" warning. * Route warning through the app logger (+ weaken). * Pass `strict_utf8`/`log_cb` when building the response in `Core::App`.

Make charset default to UTF‑8, restoring the practical pre‑change behavior where responses were UTF‑8 unless explicitly overridden. This prevents "characters with no charset" errors. Update documentation to state UTF‑8 as the default charset and describe the opt‑out behavior.

xsawyerx · 2026-02-08T12:48:11Z

Rebased again after merging #1767 but still lots of noisy tests...

xsawyerx added Bug Enhancement labels Jan 21, 2026

cromedome requested changes Jan 22, 2026

View reviewed changes

lib/Dancer2/Core/Request.pm Show resolved Hide resolved

lib/Dancer2/Core/Role/Serializer.pm Show resolved Hide resolved

veryrusty reviewed Jan 22, 2026

View reviewed changes

lib/Dancer2/Core/App.pm Show resolved Hide resolved

veryrusty reviewed Jan 22, 2026

View reviewed changes

lib/Dancer2/Core/Request.pm Outdated Show resolved Hide resolved

xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 1f5ca45 to bbaf1f3 Compare January 22, 2026 20:42

cromedome reviewed Jan 23, 2026

View reviewed changes

t/dsl/send_as.t Show resolved Hide resolved

xsawyerx mentioned this pull request Jan 26, 2026

fix utf8 in url #1143

Open

xsawyerx self-assigned this Jan 29, 2026

veryrusty approved these changes Feb 8, 2026

View reviewed changes

xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 05be997 to 2fc1029 Compare February 8, 2026 12:46

xsawyerx added 8 commits February 8, 2026 13:47

Get some verbosity

38fd7cf

No need for explicit copy

3933cf4

Adding test from GH #1143

ee55cfc

xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 2fc1029 to 1fc4149 Compare February 8, 2026 12:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finally, correct UTF-8 and charset handling: #1759

Finally, correct UTF-8 and charset handling: #1759

Uh oh!

xsawyerx commented Jan 21, 2026 •

edited

Loading

Uh oh!

cromedome left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

Uh oh!

veryrusty left a comment

Uh oh!

xsawyerx commented Feb 8, 2026

Uh oh!

xsawyerx commented Feb 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Finally, correct UTF-8 and charset handling: #1759

Are you sure you want to change the base?

Finally, correct UTF-8 and charset handling: #1759

Uh oh!

Conversation

xsawyerx commented Jan 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cromedome left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

Uh oh!

xsawyerx commented Jan 22, 2026

Uh oh!

Uh oh!

veryrusty left a comment

Choose a reason for hiding this comment

Uh oh!

xsawyerx commented Feb 8, 2026

Uh oh!

xsawyerx commented Feb 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xsawyerx commented Jan 21, 2026 •

edited

Loading