
What is RAG ... No Really, What is It?

Deep Reading, Episode 5

Welcome to Deep Reading. I’m Lance Cummings from Cyborgs Writing and today I’m digging into some research on RAG. Many of us (including myself) know vaguely what RAG is … But details matter.

I’ve been doing a lot of research lately on RAG, or retrieval-augmented generation … I mean really digging into the research papers. If you’re creating content for AI systems, you need to understand how these systems actually work.

Not just the marketing version. The real mechanics.

And here’s what I’ve discovered: most explanations of RAG are technically correct but can be misleading.

They’ll tell you “RAG retrieves relevant documents from a database and uses them to generate answers.”

True. But that definition hides crucial details that completely change how you should think about creating AI-ready content.

Today, I want to walk you through what RAG really does, step by step, drawing on the foundational research and recent surveys, because understanding these details will shift how you approach content structure.

By the way, I’m revising and re-platforming my course on Writing with Machines. As I work through this redesign, the course will be available for paid subscribers to preview and provide feedback.

If you’re interested in being part of that process, consider subscribing. A separate message will go out to paid subscribers soon.

The Chunking Revelation

Let me start with something that surprised me. Your RAG system never actually sees your documents.

Here’s what happens.

When you upload a user manual or a knowledge base article to a RAG system, before anyone even asks a question, the system immediately breaks that document into chunks.

For more on chunking, check out my last deep reading.

The 2024 survey by Gao, et al. on RAG for large language models notes that these chunks typically contain 100 to 500 tokens—roughly 75 to 375 words each. Each chunk gets converted into a numerical representation called an embedding and stored in a database as a completely separate, independent item.

From that moment forward, the RAG system only retrieves and works with those individual chunks.

It has zero awareness that chunk 23 and chunk 24 came from the same manual. It doesn’t know they were originally adjacent. It doesn’t know they both came from the installation section.

Think about what this means. Your carefully structured 50-page troubleshooting guide becomes 100 disconnected fragments. The system treats each one like a separate document.

This is why I say the basic definition is misleading. When people say “RAG retrieves relevant documents,” what they really mean is “RAG retrieves relevant chunks.” And chunks are not documents. They’re fragments that the system created, often by just counting words and cutting wherever it hits the limit.
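
To make this concrete, here’s a minimal sketch of that kind of naive fixed-size chunking in Python. It’s a hypothetical illustration, not any specific platform’s pipeline: `chunk_document` just counts words and cuts at a limit, and `index_document` stores each chunk as an independent record, with `embed` standing in for whatever embedding model the system actually uses.

```python
# Hypothetical sketch of naive fixed-size chunking and indexing.
# Not a specific product's pipeline; "embed" is a placeholder for
# whatever embedding model the RAG system calls.

def chunk_document(text: str, max_words: int = 300) -> list[str]:
    """Cut a document into chunks purely by word count."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

def index_document(text: str, embed) -> list[dict]:
    """Store each chunk as its own independent record.

    Note what is missing: nothing records that chunk 23 and chunk 24
    were adjacent, or that both came from the installation section.
    """
    return [
        {"chunk_id": i, "text": chunk, "embedding": embed(chunk)}
        for i, chunk in enumerate(chunk_document(text))
    ]
```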

If you write a procedure and it gets split across three chunks, the system might retrieve chunk one without chunk two, giving an incomplete answer.

The beautiful document structure you created? Gone the moment it gets chunked.

The Multi-Stage Pipeline

Now let’s look at how RAG actually retrieves those chunks, because this is where it gets really interesting.

The original RAG paper by Lewis, et al. back in 2020 introduced the foundational architecture with the goal of combining what they call parametric memory, which is the model’s internal knowledge, with non-parametric memory, which is your external database.

But what’s evolved since then is how sophisticated the retrieval process has become.

Most people think of RAG as: search, find, answer. But Gao’s 2024 survey identifies what they call “Advanced RAG” and “Modular RAG” approaches that use multi-stage pipelines.

Stage one is initial retrieval. When you ask a question, the system searches through all those chunks and pulls back a broad set of candidates—typically 20 to 100 chunks.

This stage is fast but imprecise. It’s like casting a wide net. The system uses hybrid search, combining keyword matching (with algorithms like BM25) and semantic similarity (using dense vector embeddings).

In other words, it’s looking for your exact words AND for content that means the same thing using different words.
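
Here’s a rough sketch of what that first stage might look like. It assumes the `rank_bm25` package for keyword scoring and reuses the hypothetical `embed` function and chunk records from the chunking sketch above; the 50/50 blend of the two scores is an arbitrary illustrative choice, not a standard.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # assumed installed: pip install rank-bm25

def hybrid_retrieve(query: str, chunks: list[dict], embed,
                    top_k: int = 50, alpha: float = 0.5) -> list[dict]:
    """Stage one: cast a wide net with keyword + semantic scores."""
    # Keyword side: BM25 over whitespace-tokenized chunk text.
    bm25 = BM25Okapi([c["text"].split() for c in chunks])
    keyword_scores = np.array(bm25.get_scores(query.split()))

    # Semantic side: cosine similarity between query and chunk embeddings.
    q = np.array(embed(query))
    semantic_scores = np.array([
        np.dot(q, c["embedding"]) /
        (np.linalg.norm(q) * np.linalg.norm(c["embedding"]) + 1e-9)
        for c in chunks
    ])

    # Normalize each signal, then blend them (alpha is illustrative only).
    def minmax(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = alpha * minmax(keyword_scores) + (1 - alpha) * minmax(semantic_scores)

    # Return the broad candidate set for the re-ranker to narrow down.
    top = np.argsort(combined)[::-1][:top_k]
    return [chunks[i] for i in top]
```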

Stage two is re-ranking. Now the system gets more careful. Research by Nogueira, et al. on multi-stage document ranking showed that cross-encoders, which jointly encode the query and each chunk together, can significantly outperform the initial retrieval.

This stage takes those 20 to 100 candidates and examines each one more closely. It looks at your question and each chunk together, as a pair, and scores how well they actually match.

This is computationally expensive but much more accurate. This stage narrows the results down to maybe 2 to 6 chunks—just the best ones.
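
A sketch of that second stage, using the CrossEncoder class from the sentence-transformers library. The specific model checkpoint named here is an assumption for illustration; any cross-encoder trained for passage ranking would play the same role.

```python
from sentence_transformers import CrossEncoder  # assumed installed

def rerank(query: str, candidates: list[dict], top_k: int = 4) -> list[dict]:
    """Stage two: score each (query, chunk) pair jointly, keep the best few."""
    # Illustrative checkpoint; swap in whatever cross-encoder you actually use.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, c["text"]) for c in candidates])

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```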

Stage three is generation. Only those final ranked chunks get passed to the language model to actually generate your answer.
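
In code, this last stage is mostly prompt assembly. A minimal, model-agnostic sketch, where `ask_llm` is a placeholder for whatever chat or completion client you use:

```python
def generate_answer(query: str, top_chunks: list[dict], ask_llm) -> str:
    """Stage three: the model only ever sees the final ranked chunks."""
    context = "\n\n".join(c["text"] for c in top_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
    )
    return ask_llm(prompt)  # placeholder for your actual LLM call
```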

Why This Matters

Now why does this three-stage process matter for content creators?

Because different content optimizations help at different stages.

  1. Clear, descriptive titles help stage one’s keyword matching. If your title is vague or generic, the initial retrieval might miss relevant content entirely.

  2. Consistent terminology helps stage two’s semantic scoring—if you use three different terms for the same concept, you’re making it harder for the re-ranker to recognize relevance.

  3. And focused, single-topic chunks help stage three generation, because the language model can work with content that’s directly relevant rather than having to extract the useful parts from a mixed-topic chunk.

This is why understanding the pipeline matters. You’re not optimizing for a single search; you’re optimizing for a three-stage process, and each stage has different needs.

Practical Implications

So what does this mean for how you create content?

First: Think in chunks from the beginning. Don’t create long documents and hope the chunking works out.

Design each section to be a complete, standalone unit—one topic, clearly titled, with all necessary context. Because that’s what the RAG system will work with. If your content naturally maps to good chunks, you’re starting with an advantage.

Second: Optimize for each stage of the pipeline. That means descriptive titles for retrieval, consistent terminology throughout for re-ranking, and focused topics for generation.

A chunk titled “Overview” tells the retrieval system nothing. A chunk titled “Installing the Database on Linux Servers” tells it exactly what’s inside.

Third: Test how your content actually chunks. If you’re serious about AI-ready content, you need to see what happens when it gets chunked.

Does a procedure get split? Does the context get separated from the steps? Understanding this lets you adjust your structure before problems appear.
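
One low-tech way to do that, assuming the naive `chunk_document` sketch from earlier (real platforms will cut differently, so treat this as a rough preview), is to run a document through the chunker and eyeball where the boundaries land:

```python
# Quick inspection: where do the chunk boundaries fall in your document?
# "troubleshooting_guide.txt" is a stand-in for your own content file.
with open("troubleshooting_guide.txt", encoding="utf-8") as f:
    doc = f.read()

for i, chunk in enumerate(chunk_document(doc, max_words=300)):
    print(f"chunk {i:03d} | {len(chunk.split()):4d} words | starts: {chunk[:60]!r}")
    # Look for procedures whose steps land in different chunks, or steps
    # that end up separated from the context that explains them.
```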

RAG isn’t magic.

It’s a mechanical process with specific steps, and understanding those steps completely changes how you think about content structure.

You’re not writing for human readers who can flip back and forth through a document. You’re writing for a system that will fragment your content, search through those fragments, and reassemble pieces into answers.

That requires a different approach to organization, and that’s what I’m exploring with structured content and microcontent principles.

What’s Next

So the next time someone tells you they’re optimizing content for RAG, ask them how, and for what kind of RAG. Because those details matter.

I’m currently developing testing protocols to verify some of these AI-ready content methods—actually putting structured content approaches through systematic evaluation with RAG systems.

If you’re working on similar research or want to compare notes, reach out. I’m exploring this stuff out loud because I think we’re all trying to figure out what good content looks like in this new context.

Until next time—keep reading deeply.


Sources

Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. Transactions of the Association for Computational Linguistics, 12, 1-25. https://arxiv.org/abs/2312.10997

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474. https://arxiv.org/abs/2005.11401

Nogueira, R., Yang, W., Cho, K., & Lin, J. (2019). Multi-stage document ranking with BERT. arXiv preprint. https://arxiv.org/abs/1910.14424
