Fix/tokenize and concatenate invalid token #1179
Merged
jlarson4 merged 5 commits into TransformerLensOrg:dev (Feb 19, 2026)
Conversation
Collaborator
Thanks for all these little bug fixes @evcyen! I am going to merge this one. If you've got the time to rebase the other two PRs to
Description
`tokenize_and_concatenate` can produce token IDs outside the model's vocabulary when the tokenizer has no padding token and the dataset contains fewer tokens than one full sequence. In that case the function pads the tokens to form at least one batch, but it used a synthetic pad token ID (added only for internal chunk tokenization) in the returned tensor, so the user received invalid token IDs and could hit errors when passing the result to the model.

What changed: we now record whether the tokenizer already had a pad token before we add one. When padding in the short-sequence branch, we use `tokenizer.eos_token_id` instead of `tokenizer.pad_token_id` if we added the pad token ourselves, so the returned tensor only contains token IDs from the model's original vocabulary.

Fixes #1139
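A minimal sketch of the padding rule described above. The function name and signature here are hypothetical (not the actual TransformerLens code); it only illustrates the choice between the two fill IDs:

```python
def pad_short_sequence(tokens, seq_len, had_pad_token, pad_token_id, eos_token_id):
    """Pad a too-short token list up to seq_len using an in-vocab token ID.

    If the tokenizer originally had no pad token, pad_token_id refers to a
    synthetic token outside the model's vocabulary, so we fall back to
    eos_token_id, which is always a real vocab entry.
    """
    fill_id = pad_token_id if had_pad_token else eos_token_id
    return tokens + [fill_id] * (seq_len - len(tokens))


# Example with GPT-2-style IDs: eos_token_id=50256 is in-vocab, while a
# synthetic pad token added afterwards would get ID 50257 (out of vocab).
padded = pad_short_sequence(
    [5, 7], seq_len=4, had_pad_token=False,
    pad_token_id=50257, eos_token_id=50256,
)
# → [5, 7, 50256, 50256]
```

With `had_pad_token=True` the original `pad_token_id` is still used, so tokenizers that ship their own pad token behave exactly as before.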
Type of change
Checklist: