
Newer LLMs like GPT-5 and Claude 4.5 excel at character-level tasks, surpassing earlier models in counting, manipulating individual characters, and solving complex encoding challenges.
Recently, I’ve been diving into how the latest generations of large language models (LLMs) handle character-level tasks. Specifically, I tested their ability to count characters, manipulate characters within sentences, and solve encoding and cipher puzzles. The results were surprising: the newest models, such as GPT-5 and Claude 4.5, have made significant strides in these areas compared to their predecessors.
One of the key challenges for LLMs has been handling individual characters. The model never sees raw characters: all text is first encoded as tokens by the model’s tokenizer, and each token typically represents a cluster of characters or even a full word (especially words that are common in the training dataset). Because the model operates on these tokens, tasks that require finer, character-level control are difficult.
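To make the tokenization point concrete, here is a minimal sketch of a toy greedy longest-match tokenizer. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical and invented purely for illustration; real tokenizers learn much larger vocabularies from data and may split this word differently.

```python
# Hypothetical toy vocabulary, invented for illustration only.
# Real tokenizers (for example BPE-based ones) learn theirs from training data.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


pieces = toy_tokenize("strawberry")
token_ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)     # ['straw', 'berry']
print(token_ids)  # [0, 1]
```

Under this toy vocabulary the model would receive only the two IDs `[0, 1]`. Nothing in that input says that the word contains three instances of the letter 'r', which is why character-level questions are harder than they look.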
To illustrate this, I tested several OpenAI models with the prompt: "Replace all letters 'r' in the sentence 'I really love a ripe strawberry' with the letter 'l', and then convert all letters 'l' to 'r'." Here’s how they performed:
For this test, I disabled reasoning for the GPT-5 models to ensure a fair comparison. Reasoning can help significantly with tasks like this (some models use chain of thought directly in the output), but I wanted to focus on the generational improvements from raw model enhancements. Notably, GPT-5 Nano is the only new-generation model that made a mistake, likely due to its smaller size. Starting with GPT-4.1, models consistently completed this task without issues. For context, Claude Sonnet 4 from Anthropic was also able to handle this task around the same time as GPT-4.1.
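For reference, the expected answer to the replacement prompt above can be computed deterministically. The sketch below assumes the sequential reading of the prompt (first 'r' to 'l', then every 'l' to 'r'). The simultaneous swap is shown only for contrast, as a plausible misreading, and is not what the prompt asks for.

```python
sentence = "I really love a ripe strawberry"

# Step 1: replace every 'r' with 'l'.
step1 = sentence.replace("r", "l")

# Step 2: convert every 'l' to 'r'. This includes the 'l's created in step 1,
# so the original 'r's come back and the original 'l's become 'r' as well.
step2 = step1.replace("l", "r")

# For contrast: a simultaneous swap of 'r' and 'l' gives a different string.
swapped = sentence.translate(str.maketrans("rl", "lr"))

print(step1)    # I leally love a lipe stlawbelly
print(step2)    # I rearry rove a ripe strawberry
print(swapped)  # I learry rove a lipe stlawbelly
```

The net effect of the two sequential steps is that every original 'l' turns into 'r' while the original 'r's are restored, so a model has to track the intermediate string rather than apply a single substitution.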
Counting characters is another area where LLMs have historically struggled. I tested several models on the sentence: "I wish I could come up with a better example sentence." The goal was to count the total number of characters in the sentence accurately. Here’s what I found:

This highlights a significant improvement with GPT-4.1 and newer generations. The ability to accurately count characters is crucial for various applications, from data validation to text processing tasks that require precise character-level control.
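The ground truth for the counting test is easy to compute outside the model. This sketch assumes "total number of characters" means every character in the string, including spaces and the final period. The other two counts are included only because a model might reasonably interpret the question that way.

```python
sentence = "I wish I could come up with a better example sentence."

# Every character, including spaces and the trailing period.
total = len(sentence)

# Alternative readings of "characters" that a model might fall back on.
non_space = sum(not ch.isspace() for ch in sentence)
letters = sum(ch.isalpha() for ch in sentence)

print(total, non_space, letters)  # 54 44 43
```

A deterministic check like this is also a cheap way to validate a model's answer whenever an application depends on an exact count.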
These improvements in character manipulation and counting are not just academic curiosities; they have practical implications for developers and researchers working with LLMs. For instance:
The latest generations of LLMs are showing significant improvements in handling character-level tasks. This progress is likely driven by advances in model architecture and training techniques, and it makes it easier for practitioners to use these models for precise text manipulation and processing. As the field continues to evolve, we can expect even more sophisticated capabilities from future models.
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
15 October 2025