Skip to content

Conversation

@xsawyerx
Copy link
Member

@xsawyerx xsawyerx commented Jan 21, 2026

There are a bunch of tickets on this, and this finally resolves them. Closes #686, #1124, #1143, #1594.

This helps us have decode boundaries, enforce JSON characters, and honor charset config.

Basically, Dancer2 mixed bytes and characters across requests, serializers, and responses. UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and charset config was ignored in some response paths.

So my approach is:

  • Enforce a clear rule: bytes at the edges, characters inside.
  • Decode request inputs (path/params/uploads) to characters when valid UTF‑8. (Plack itself doesn't do this.)
  • Keep JSON serializers emitting UTF‑8 bytes, but guarantee their input is character data. Add strict mode to fail on invalid UTF‑8.
  • Honor configured charset consistently for text responses and file/template IO.

The goal is to:

  • Eliminates double‑encoding and mojibake by ensuring JSON sees characters.
  • Makes UTF‑8 routing and params consistent with body decoding.
  • Honors charset config uniformly across responses and templates.
  • Provides a strict mode for apps that want hard UTF‑8 enforcement.

Changes:

  • Request boundary decoding with lenient/strict UTF‑8:
    • Decode PATH_INFO and params when valid UTF‑8, warn and keep bytes by default. strict_utf8 turns those warnings into errors.
    • Added strict_utf8 config (and env DANCER_STRICT_UTF8), passed into request and serializer configs.
  • JSON serialization:
    • Walk data structures to ensure scalars are characters. Avoid decoding ASCII bytes to preserve numeric scalars.
    • Always emit UTF‑8 bytes. strict_utf8 errors are re‑thrown instead of logged.
  • Response encoding/charset:
    • Response->charset is added. Empty means no encoding.
    • Prefer charset from Content‑Type header. Otherwise use response charset.
    • Response serializer is rw so set serializer => ... updates the current response and avoids stale serializer refs (fixes log_levels test).
  • File/template IO:
    • File utilities accept an optional charset. Template::Tiny honors app charset.
    • Error pages read with configured charset.

Also updated documentation on:

  • Corrected JSON keyword docs: both to_json and encode_json return UTF‑8 bytes. Clarified expected UTF‑8 input for decode/from_json.
  • Added strict_utf8 config and clarified charset behavior.

Tests:

  • Updated expectations around default charset (no charset unless configured).
  • Added utf8_strict.t for lenient/strict UTF‑8 behavior.
  • Adjusted tests that assumed byte output when charset is unset.

Copy link
Contributor

@cromedome cromedome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! I think this is close to ready. See other comments.

The tests are a little noisy; please disregard if this is expected:

Subroutine from_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 15.
Subroutine to_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 17.
Subroutine decode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 19.
Subroutine encode_json redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 25.
Subroutine serialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 32.
Subroutine deserialize redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 50.
Subroutine _valid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 61.
Subroutine _decode_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 67.
Subroutine _ensure_characters redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 73.
Subroutine _ensure_scalar redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 106.
Subroutine _invalid_utf8 redefined at /Users/jason/code/PerlDancer/Dancer2/.build/M0rLVtsgh3/blib/lib/Dancer2/Serializer/jSoN.pm line 117.

@xsawyerx
Copy link
Member Author

The tests are a little noisy; please disregard if this is expected:

Let me see how to cut this down. We don't like noise. :)

@xsawyerx
Copy link
Member Author

The tests consistently fail trying to install Dancer2 2.000001 from CPAN. (Not sure why it's even installing this as a dependency to testing Dancer2 code branch.) When I try to install it myself, all tests pass.

@xsawyerx
Copy link
Member Author

Reworked another case that Auth::Extensible found.

@xsawyerx xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 1f5ca45 to bbaf1f3 Compare January 22, 2026 20:42
@xsawyerx xsawyerx mentioned this pull request Jan 26, 2026
@xsawyerx xsawyerx self-assigned this Jan 29, 2026
Copy link
Member

@veryrusty veryrusty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UTF-8 by default, yet changable is you want something else. 👍

Nice work @xsawyerx !!

@xsawyerx xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 05be997 to 2fc1029 Compare February 8, 2026 12:46
@xsawyerx
Copy link
Member Author

xsawyerx commented Feb 8, 2026

Rebased.

There are a bunch of tickets on this, and this finally resolves them.

We now:
* have decode boundaries
* enforce JSON characters
* honor charset config

Dancer2 mixed bytes and characters across requests, serializers, and responses.
UTF‑8 in paths/params was left as bytes, JSON could double‑encode, and
charset config was ignored in some response paths.

So my approach is:

* Enforce a clear invariant: bytes at the edges, characters inside.
* Decode request inputs (`path`/`params`/`uploads`) to characters when valid
  UTF‑8.
* Keep JSON serializers emitting UTF‑8 bytes, but guarantee their input is
  character data. Add strict mode to fail on invalid UTF‑8.
* Honor configured charset consistently for text responses and file/template IO.

I changed the following:

* Request boundary decoding with lenient/strict UTF‑8:
    * Decode `PATH_INFO` and params when valid UTF‑8, warn and keep bytes
      by default. `strict_utf8` turns those warnings into errors.
    * Added `strict_utf8` config (and env `DANCER_STRICT_UTF8`), passed
      into request and serializer configs.
* JSON serialization:
    * Walk data structures to ensure scalars are characters. Avoid
      decoding ASCII bytes to preserve numeric scalars.
    * Always emit UTF‑8 bytes. `strict_utf8` errors are re‑thrown instead
      of logged.
* Response encoding/charset:
    * `Response->charset` is now `Str` with a predicate. Empty disables encoding.
    * Prefer charset from `Content‑Type` header. Otherwise use response
      charset.
    * Response `serializer` is `rw` so `set serializer => ...` updates the
      current response and avoids stale serializer refs (fixes `log_levels`
      test).
* File/template IO:
    * File utilities accept an optional charset. `Template::Tiny` honors
      app charset.
    * Error pages read with configured charset.

Also updated documentation on:

* Corrected JSON keyword docs: both `to_json` and `encode_json` return UTF‑8
  bytes. Clarified expected UTF‑8 input for `decode`/`from_json`.
* Added `strict_utf8` config and clarified charset behavior.

Tests:

* Updated expectations around default charset (no charset unless configured).
* Added `utf8_strict.t` for lenient/strict UTF‑8 behavior.
* Adjusted tests that assumed byte output when charset is unset.

Why this works:

* Eliminates double‑encoding and mojibake by ensuring JSON sees characters.
* Makes UTF‑8 routing and params consistent with body decoding.
* Honors charset config uniformly across responses and templates.
* Provides a strict mode for apps that want hard UTF‑8 enforcement.
We're assuming we will fail trying to load a misspelled module name
but on case insensitive systems it finds it, loads it, and gives
a warning about redefined subroutines.

We cannot use `no warnings qw<redefine>` because the warning is
happening in the core code, not in the test.
When Unicode::UTF8 is missing, we fall back to `Encode::decode`
for UTF‑8 validation and decoding.

`Encode::decode` modifies its input in place. So calling it
"just to validate" with `FB_CROAK` can consume or clear the
string, even if you never use the decoded result.

In CI (without `Unicode::UTF8`), this meant we were validating
bytes and then later trying to decode an already‑consumed
string, which produced empty strings and broke param parsing,
uploads, and paths.

First, we need to make the validation non-destructive so before
we make a copy of the input before calling
`decode(..., FB_CROAK)`.

Then we need to skip decoding when we don't need to decode so
`_decode_bytes()` checks if PSGI gave us a character string
(`utf8::is_utf8($bytes)`) when it's supposed to give us bytes,
and if it's ASCII-only (`$bytes !~ /.../`). That way we don't
decode unnecessarily and avoid messing up ASCII paths/params.

(This is what Unicode::UTF8 path does.)
We now treat "characters with no charset" as a boundary error and
resolve it in a predictable way. If a response body is still a Perl
character string and no charset is configured, Response now logs a
single warning (once per process), assumes UTF‑8, and encodes to bytes
so PSGI always receives valid output.

In `strict_utf8` mode, this is a fatal error instead of a guess.
This prevents "HTTP::Message content must be bytes" failures for apps
that return character data (e.g. `YAML::Dump`) without a charset.

* Add `Response->log_cb` and `strict_utf8` support, plus a one‑time
  guard for the "no charset" warning.
* Route warning through the app logger (+ weaken).
* Pass `strict_utf8`/`log_cb` when building the response in
  `Core::App`.
Make charset default to UTF‑8, restoring the practical pre‑change
behavior where responses were UTF‑8 unless explicitly overridden.

This prevents "characters with no charset" errors.

Update documentation to state UTF‑8 as the default charset and
describe the opt‑out behavior.
@xsawyerx xsawyerx force-pushed the sawyer/utf8-once-and-for-all branch from 2fc1029 to 1fc4149 Compare February 8, 2026 12:47
@xsawyerx
Copy link
Member Author

xsawyerx commented Feb 8, 2026

Rebased again after merging #1767 but still lots of noisy tests...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

to_json is double encoding UTF8?

3 participants