Welcome to Deep Reading. I'm Lance Cummings, and this is where we take short, focused dives into AI research that actually matters for content professionals.
A few weeks ago, I posted on LinkedIn about some fascinating research on how AI systems process content—specifically, how we should be "chunking" documents for optimal AI performance. The response was immediate and eye-opening.
"I don't need a study to tell me the ideal chunk size. We solved this in the 1990s with topic models."
Michael Iantosca, a content and knowledge engineer, replied with something that stopped me in my tracks:
"I don't need a study to tell me the ideal chunk size. We solved this in the 1990s with topic models."
That comment crystallized something I'd been wrestling with. What if all this cutting-edge AI research isn't discovering new principles—but rather providing empirical validation for content design practices we've known for decades?
Today, let's dig into what recent chunking research actually tells us, and why it validates what content pros have known for a while.
This Deep Reading analysis is also available as a podcast episode. Add "Cyborgs Writing" to your favorite podcast app to hear these in bite-sized, biweekly episodes (more or less).
The Chunking Problem
First, let's establish what we're talking about.
When AI systems process documents, they can't handle everything at once—they need to break content into smaller pieces called "chunks."
Think of it like this.
If you're trying to help someone understand a cookbook, you wouldn't read the entire thing aloud in one breath. You'd break it into logical sections—ingredients here, technique there, troubleshooting tips over there.
It's the same with AI.
Most AI systems don't chunk content the way a human would. Instead, they use what researchers call "fixed-size chunking"—essentially counting words or tokens and cutting every 500 or 1000 words, regardless of where topics begin and end.
It's a bit like cutting that cookbook recipe right in the middle of the ingredients list, then wondering why the AI can't understand what flour is for.
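To make that concrete, here's a minimal sketch of fixed-size chunking in Python. It's illustrative only, not any particular vendor's implementation; the 500-word window and 50-word overlap are placeholder defaults.

```python
def fixed_size_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size word windows, ignoring structure.

    Cut points land wherever the counter says, even mid-sentence
    or mid-ingredient-list.
    """
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), step)
    ]
```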
Researchers like Juvekar and Purwar and Yepes et al. have been testing different approaches to this chunking problem. And their findings are fascinating: not because they're revolutionary, but because they're validating principles content professionals have been practicing for decades.
What the Research Actually Shows
Let me walk you through the key findings, because they tell a story that should sound very familiar.
First discovery: Juvekar and Purwar (2024) found that AI systems perform better when they use 40-70% of their available context window, not 100%.
In other words, less information, carefully curated, beats more information thrown together. This flies in the face of the "more data is better" assumption that dominates AI discussions.
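As a rough sketch of what acting on that finding might look like: cap retrieval at a target fraction of the window instead of filling it. The 60% target is illustrative, and word count stands in here for real tokenization.

```python
def budget_context(ranked_chunks, context_window=8192, utilization=0.6):
    """Add retrieved chunks, best match first, until we reach a target
    fraction of the model's context window, then stop.

    Word count approximates token count; a real system would use
    the model's own tokenizer.
    """
    budget = int(context_window * utilization)
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if used + cost > budget:
            break
        selected.append(chunk)
        used += cost
    return selected
```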
Second discovery: When Yepes et al. (2024) tested "element-based chunking"—respecting headings, sections, tables, and natural document structure—it consistently outperformed arbitrary text division.
They found 53% accuracy versus 50% for basic methods, while using roughly half as many chunks (62,529 versus 112,155).
Simply replace "element-based chunking" with "structured content design" and you've just described what technical writers have been doing since the rise of structured authoring in the 1990s.
(If you're not familiar with structured authoring, think of it as structuring all information like recipes, at scale.)
When we create content with clear headings, logical section breaks, and focused paragraphs, we're essentially pre-chunking for comprehension.
We're saying: "This is one complete thought. This is where a new concept begins. This table contains related information that should stay together."
The third insight comes from Bhat et al. (2025), whose multi-dataset research showed that optimal chunk size depends entirely on the type of questions users ask and the kind of answers they need.
Their testing across six different datasets revealed that fact-based questions work better with smaller, focused chunks of 64-128 tokens, while complex explanations require larger contextual chunks of 512-1024 tokens.
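In code, that finding might translate into per-intent chunking profiles. The token ranges below come from the study; the lookup itself is a hypothetical illustration, not Bhat et al.'s implementation.

```python
# Token ranges reported by Bhat et al. (2025); the mapping is an
# illustrative sketch, not their code.
CHUNK_PROFILES = {
    "factoid": (64, 128),        # short, fact-based questions
    "explanatory": (512, 1024),  # complex, multi-step answers
}

def chunk_size_for(query_type):
    """Pick a chunk size for an intent; the midpoint is a simple default."""
    low, high = CHUNK_PROFILES[query_type]
    return (low + high) // 2
```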
But again—isn't this exactly what we do when we design content?
We write concise definitions for reference material, but detailed procedures for complex tasks. We structure FAQs differently than troubleshooting guides.
We've always known that rhetorical purpose should drive information architecture.
Why This Matters Now
So why am I spending a Deep Reading episode on research that essentially proves what we already knew?
Because validation matters—especially when it comes with hard numbers and empirical evidence.
For years, content professionals have advocated for structured, topic-oriented design. We've argued for clear information architecture, logical chunking, and user-centered organization. But these principles often get dismissed as "nice to have" or subjective preferences.
Now we have research showing that these practices don't just improve human comprehension—they're essential for AI performance. And given that AI-assisted content systems are becoming standard tools, this research provides a clear business case for content strategy.
Here are three takeaways you can act on immediately:
First: Audit your content for natural breaking points.
Are you writing in focused, topically coherent blocks? Or are you creating dense walls of text that challenge both human readers and AI systems? (A quick audit sketch follows these takeaways.)
Second: Invest in your information architecture.
Those headers, bullet points, and section breaks you carefully craft aren't just visual formatting—they're semantic signals that guide both human understanding and AI processing.
Third: Match your content structure to user intent.
Quick reference material should be tightly focused. Complex procedures need more context. Let the rhetorical purpose drive the chunking, not arbitrary technical constraints.
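Here's that quick audit sketch: a rough first pass that flags Markdown sections running long without a structural break. The 300-word threshold is an arbitrary starting point for review, not a number from the research.

```python
import re

def audit_breaking_points(markdown_text, max_words=300):
    """Flag sections whose word count suggests a wall of text."""
    flagged, title, count = [], "(intro)", 0
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6} ", line):
            if count > max_words:
                flagged.append((title, count))
            title, count = line.lstrip("# ").strip(), 0
        else:
            count += len(line.split())
    if count > max_words:
        flagged.append((title, count))
    return flagged
```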
Respect the Content Architecture
The most encouraging thing about this research isn't that it's discovering new truths about content design. It's that it offers empirical validation for principles our profession has been practicing all along.
The path forward for AI-optimized content isn't better algorithms or smarter chunking strategies. It's respecting the content architecture that professionals have been building for decades.
Thanks for taking this deep dive with me. I'm Lance Cummings, and this has been Deep Reading. Until next time, keep exploring how the intersection of human insight and AI capability can make our work more effective.
References
Bhat, S. R., Rudat, M., Spiekermann, J., & Flores-Herr, N. (2025). Rethinking chunk size for long-document retrieval: A multi-dataset analysis. arXiv preprint arXiv:2505.21700v2. https://arxiv.org/abs/2505.21700
Juvekar, K., & Purwar, A. (2024). Introducing a new hyper-parameter for RAG: Context window utilization. arXiv preprint arXiv:2407.19794v2. https://arxiv.org/abs/2407.19794
Yepes, A. J., You, Y., Milczek, J., Laverde, S., & Li, L. (2024). Financial report chunking for effective retrieval augmented generation. arXiv preprint arXiv:2402.05131v3. https://arxiv.org/abs/2402.05131