Fix/tokenize and concatenate invalid token #1179
Merged
jlarson4 merged 5 commits into TransformerLensOrg:dev (Feb 19, 2026)
Conversation
Collaborator
Thanks for all these little bug fixes @evcyen! I am going to merge this one. If you've got the time to rebase the other two PRs to
Description
`tokenize_and_concatenate` can produce token IDs outside the model's vocabulary when the tokenizer has no padding token and the dataset contains fewer tokens than one full sequence. In that case the function pads the tokens to form at least one batch, but it used a synthetic pad token ID (added only for internal chunk tokenization) in the returned tensor, so the user received invalid token IDs and could hit errors when passing the result to the model.

What changed: we now record whether the tokenizer already had a pad token before we add one. When padding in the short-sequence branch, we use `tokenizer.eos_token_id` instead of `tokenizer.pad_token_id` if we added the pad token ourselves, so the returned tensor only contains token IDs from the model's original vocabulary.

Fixes #1139
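A minimal sketch of the padding rule described above. The function name and signature here are hypothetical (not the actual TransformerLens code); it only illustrates the choice between the two fill IDs:

```python
def pad_short_sequence(tokens, seq_len, had_pad_token, pad_token_id, eos_token_id):
    """Pad a too-short token list up to seq_len using an in-vocab token ID.

    If the tokenizer originally had no pad token, pad_token_id refers to a
    synthetic token outside the model's vocabulary, so we fall back to
    eos_token_id, which is always a real vocab entry.
    """
    fill_id = pad_token_id if had_pad_token else eos_token_id
    return tokens + [fill_id] * (seq_len - len(tokens))


# Example with GPT-2-style IDs: eos_token_id=50256 is in-vocab, while a
# synthetic pad token added afterwards would get ID 50257 (out of vocab).
padded = pad_short_sequence(
    [5, 7], seq_len=4, had_pad_token=False,
    pad_token_id=50257, eos_token_id=50256,
)
# → [5, 7, 50256, 50256]
```

With `had_pad_token=True` the original `pad_token_id` is still used, so tokenizers that ship their own pad token behave exactly as before.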
Type of change
Checklist: