Fix/tokenize and concatenate invalid token #1179

Merged
jlarson4 merged 5 commits into TransformerLensOrg:dev from evcyen:fix/tokenize-and-concatenate-invalid-token
Feb 19, 2026

Conversation

@evcyen evcyen commented Feb 19, 2026

Description

tokenize_and_concatenate can produce token IDs that are outside the model’s vocabulary when the tokenizer has no padding token and the dataset has fewer tokens than one full sequence. In that case the function pads to form at least one batch. Previously, that padding used the synthetic pad token ID (added only for internal chunk tokenization), so the returned tensor contained invalid token IDs and the user could hit errors when passing the result to the model.

What changed: We now record whether the tokenizer already had a pad token before we add one. In the short-sequence branch, if we added the pad token ourselves, we pad with tokenizer.eos_token_id instead of tokenizer.pad_token_id, so the returned tensor only contains token IDs that are in the model’s original vocab.
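
For readers who want the idea in isolation, here is a minimal sketch of the padding choice described above. It is not the code from this PR: `ToyTokenizer`, `add_pad_token`, and `pad_short_sequence` are hypothetical names made up for illustration, and the toy tokenizer only assumes that a newly added pad token gets an ID just past the original vocabulary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer (illustration only)."""

    vocab_size: int
    eos_token_id: int
    pad_token_id: Optional[int] = None

    def add_pad_token(self) -> None:
        # Assumption: a synthetic pad token lands just past the original vocab,
        # so its ID is not valid for the model's embedding table.
        self.pad_token_id = self.vocab_size


def pad_short_sequence(
    tokens: List[int], seq_len: int, tokenizer: ToyTokenizer
) -> List[int]:
    """Pad `tokens` up to `seq_len` using only IDs from the original vocab."""
    # Record whether a pad token existed BEFORE adding a synthetic one.
    had_pad_token = tokenizer.pad_token_id is not None
    if not had_pad_token:
        tokenizer.add_pad_token()  # used only for internal chunk tokenization

    # If the pad token is synthetic, fall back to EOS for the returned data.
    fill_id = tokenizer.pad_token_id if had_pad_token else tokenizer.eos_token_id
    missing = max(0, seq_len - len(tokens))
    return tokens + [fill_id] * missing


tok = ToyTokenizer(vocab_size=100, eos_token_id=99)
padded = pad_short_sequence([1, 2, 3], seq_len=8, tokenizer=tok)
# Every returned ID stays inside the original vocabulary.
assert all(0 <= t < tok.vocab_size for t in padded)
```

A tokenizer that already ships with a pad token is left alone in this sketch: its own pad_token_id is used for padding, which matches the behavior described above.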

Fixes #1139

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have not rewritten tests relating to key interfaces which would affect backward compatibility

@jlarson4
Collaborator

Thanks for all these little bug fixes @evcyen! I am going to merge this one. If you've got the time to rebase the other two PRs to dev, they should pass CI. Assuming everything passes, I will merge them as well.

@jlarson4 jlarson4 merged commit ebbb965 into TransformerLensOrg:dev Feb 19, 2026
13 checks passed
