I am currently setting up a safety steering experiment for African languages (Yoruba) using HookedTransformer with GPT-2-medium.
I've observed that characters carrying Yoruba diacritics (e.g., 'ọ' in 'Atọwọda') trigger extreme fragmentation compared to unaccented ASCII text.
Example:
Input: "Oye Atọwọda"
Output: ['O', 'ye', 'ĠAt', 'á', '»', 'į', 'w', 'á', '»', 'į', 'da']
The single word "Atọwọda" is split into 9 tokens, six of which are byte-level fallbacks (each 'ọ' decomposes into its three UTF-8 bytes).
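A quick self-contained check (no model needed) confirms where the fragmentation comes from: the dotted vowels are multi-byte in UTF-8, and GPT-2's byte-level BPE has no merge rules covering those byte sequences, so each byte surfaces as its own fallback token:

```python
# 'ọ' (U+1ECD) occupies three UTF-8 bytes; GPT-2's byte-level BPE
# falls back to one token per byte when it has no merge rule.
word = "Atọwọda"
for ch in word:
    print(ch, hex(ord(ch)), f"{len(ch.encode('utf-8'))} byte(s)")

# Two 'ọ' characters at 3 bytes each account for 6 of the 9 tokens;
# the remaining ASCII runs merge into 'ĠAt', 'w', and 'da'.
```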
Question:
For activation patching, this fragmentation makes it difficult to isolate a single "semantic" token position. Is there a recommended heuristic in TransformerLens for pooling activations across these fragmented byte-tokens (e.g., taking the mean over the byte-span)? Or is the standard practice to simply ignore the byte-level noise and patch positions individually?
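For context on what I mean by mean-pooling: TransformerLens has no built-in helper for this as far as I know, so this is only a sketch of the approach I'm considering. The span indices here are hypothetical and would come from inspecting `model.to_str_tokens` on the prompt:

```python
import torch

def pool_byte_span(resid: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Mean-pool residual-stream activations over the contiguous token
    span [start, end), treating the fragmented word as one unit.
    `resid` has shape (batch, seq, d_model); returns (batch, d_model)."""
    return resid[:, start:end, :].mean(dim=1)

# Hypothetical usage with a TransformerLens cache (not run here):
#   logits, cache = model.run_with_cache("Oye Atọwọda")
#   # positions 2..11 cover 'ĠAt' + the six byte fallbacks + 'w' + 'da'
#   word_repr = pool_byte_span(cache["resid_post", layer], 2, 11)
```

The open question for me is whether patching this pooled vector back in (broadcast across the span, or written to one position) is principled, or whether per-position patching over the span is the cleaner experiment.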