What 17 Student Chatbots Showed Me About Structured Content
An informal classroom experiment with information types

Over the last year, content professionals have been telling me a version of the same thing.
The way content is structured affects how AI systems behave with it. Cleaner, more deliberate content seems to produce cleaner, more deliberate answers.
We all see this in practice. We all feel it. But pinning it down is harder than it should be. Proving it to a CEO, a CTO, or a colleague who hasn’t lived in the work is difficult.
This past semester I had an informal opportunity to look at the question. My Writing with AI students partnered with Dr. Yang Song's machine learning students through UNCW's SAIL initiative (Seahawks Advancing Interdisciplinary Learning), a program designed to support cross-disciplinary student collaboration.
Our goal was to design AI chatbots in partnership with New Hanover Disaster Coalition, focused on emergency communication for community members. The English students built the knowledge bases and system prompts, while the machine learning students focused on model integration and testing infrastructure.
By the end of the semester, we had 17 working chatbots across 17 distinct disaster-preparedness topics … and an external assessment of each one.
Teams that built narrow, curated knowledge bases organized by information type consistently outperformed teams that scraped pages or pasted in lists of URLs.
And while this wasn’t a controlled experiment, the pattern was visible enough that I want to share it.
The question content professionals keep asking
A lot of the content world is in the middle of a quiet identity crisis right now.
Many technical writers, content designers, knowledge engineers, and documentation leads have spent a decade building competence in topic-based authoring, structured documentation, modular content, and taxonomies.
Then generative AI arrived, and a lot of our colleagues started asking: does any of that still matter? Can’t we just point a model at a SharePoint folder and call it a day?
People who actually do this work know the answer. Of course it matters.
Of course the model doesn’t magically organize unstructured content.
Of course the same RAG pipeline will produce wildly different outputs depending on what you feed it.
But the question isn’t whether we know this. The question is how to demonstrate it to people who don’t.
The challenge is that most of the evidence is anecdotal. A senior content strategist describes their internal experience improving a chatbot at their company, and a skeptic shrugs because it’s one company, one product, one set of conditions.
Most of the controlled studies in this space are narrowly technical (which embedding model performs best, which chunk size optimizes retrieval). It’s not the kind of evidence that lands with a content team trying to convince a leadership group to invest in content infrastructure.
What I had this semester was something different. A setting where 17 teams worked on the same kind of problem, using comparable tools, with comparable assessment afterward. I could see what each team had actually done with their content.
The project
The goal of this project wasn’t really to build AI chatbots … but to use that project to give students an interdisciplinary experience that would help them understand AI from both perspectives … computer engineering and professional writing.
The foundation for this spring’s work was laid the previous fall by two other faculty and their students. Dr. Ian Weaver, in the Professional Writing Program, taught an advanced professional writing course that consulted directly with the New Hanover Disaster Coalition to map the community network a system like this would actually serve.
His class built on the research of Dr. Jill Waity, Professor of Sociology and chair of the Sociology and Criminology Department at UNCW, whose community-engaged work on disaster preparedness and resilience in rural and regional communities established the empirical foundation for understanding user needs.
Ian’s students ran usability tests on existing disaster-preparedness materials from agencies like NC State Extension, Ready.gov, and the American Red Cross, and built a taxonomy of disaster-preparedness topics relevant to local needs.
Working in parallel that fall, Dr. Gulustan Dogan’s machine learning class produced what we might call a first draft of a knowledge graph for the domain, to support initial tests on the existing content.
The two fall classes then handed off the project to our classes this spring. The work moved through four classes across two semesters, with each phase building on the previous. Most of what I’m reporting here is the visible piece of a longer chain of student and faculty work that started before my students arrived.
This semester, each team was assigned a node on an adaptation of that taxonomy. Seventeen teams, 17 topics, all roughly comparable in scope.
Each team had to do three things:
Build a knowledge base for their topic,
Write a system prompt that defined the chatbot’s behavior and voice, and
Design test questions to evaluate their bot.
Most teams used some combination of LM Studio, AnythingLLM, or Google AI Studio. Then I asked teams to structure their knowledge bases using a microcontent approach.
Specifically, I introduced them to Precision Content’s authoring methodology, which sits at the layer underneath things like DITA and XML that many enterprise content systems rely on. The core idea is that you look at a topic and organize it around information types: a concept is one thing, a process is another, a principle is another, a procedure is another. Each has a recognizable shape and serves a different reader need.
I use information types with students because it’s accessible. You can engage with structured content as a way of thinking before you have to engage with it as a way of encoding.
Most undergraduates can’t be productively dropped into a DITA toolchain in a semester. But they can absolutely learn to look at a body of content and ask: which parts of this are concepts, which are processes, which are principles?
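For readers who think in code, here is a minimal sketch of what "typing" a piece of content might look like before it goes into a knowledge base. The four type names come from the description above. Everything else (the `Microcontent` class, its field names, the `to_chunk` helper, and the example text) is hypothetical and for illustration only; it is not Precision Content's schema or what any student team actually built.

```python
from dataclasses import dataclass
from enum import Enum


class InfoType(Enum):
    """The four information types named in the text."""
    CONCEPT = "concept"
    PROCESS = "process"
    PRINCIPLE = "principle"
    PROCEDURE = "procedure"


@dataclass
class Microcontent:
    """One small, typed block of content (hypothetical structure)."""
    info_type: InfoType
    title: str
    body: str
    source: str  # where the content came from, so answers stay grounded


def to_chunk(item: Microcontent) -> str:
    """Render one typed block as a labeled text chunk for a knowledge base."""
    return (
        f"[{item.info_type.value.upper()}] {item.title}\n"
        f"{item.body}\n"
        f"Source: {item.source}"
    )


# Placeholder content, invented for illustration only.
example = Microcontent(
    info_type=InfoType.PRINCIPLE,
    title="Example principle title",
    body="Example body text stating one rule the reader should follow.",
    source="Example source name",
)

print(to_chunk(example))
```

The point of the sketch is the habit, not the format: each chunk carries one kind of information, an explicit label, and a source, so whatever retrieves it later gets a self-contained unit rather than a fragment of a scraped page.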
Not every team did this. I asked them to. I gave them the tools. But the assignment structure didn’t enforce it, and several teams defaulted to what was easier: scraping URLs, dumping in PDFs, pasting links to UNCW or county emergency management pages.
By the end of the project, the 17 knowledge bases ranged from carefully typed microcontent on one end to lightly organized link lists on the other.
One technical note about the conditions. Yang's lab machines have 8GB of VRAM, which meant most teams were running small, quantized Llama 3 models rather than the frontier systems people associate with conversational AI.
(For non-technical folk, that just means students were using smaller, less capable models.)
We're probably two years behind well-funded labs on hardware, and students had to develop extra craftiness to make their bots work within those limits.
The constraint was also informative, though. Smaller models with shorter context windows make the prompt and knowledge variables more visible than they would be at scale. The patterns I'm describing happen with frontier models too, just less obviously.
And in the broader practitioner conversation, the case for lighter models paired with well-structured knowledge is getting stronger. It's more cost-efficient, more accessible to organizations without GPU budgets, and often gets you closer to a usable result faster than scaling the model alone would.
How we assessed the chatbots
We used Perplexity to evaluate each team’s chatbot, because it wasn’t the model the students used to build their own bots. It came in fresh, with no exposure to the student work, and it had access to its own retrieval over the open web. That gave us a meaningful external check rather than asking a model to grade itself.
For each team, we fed Perplexity their test questions, their bot’s responses, and their own human-written reference answers. Perplexity scored two things on a 1–5 scale: correctness and comprehensiveness. It also produced qualitative feedback for each response, explaining where the bot landed and where it fell short.
We also ran an embedding similarity comparison between each chatbot’s response and the team’s reference answer, hoping to get a third measurement that captured semantic alignment independent of Perplexity’s judgment.
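As a rough illustration of what an embedding similarity comparison involves, here is a minimal sketch using cosine similarity. This is an assumption: the text above does not say which similarity measure or embedding model was used, and cosine similarity is simply a common choice. The two vectors are toy placeholders standing in for real embeddings of a bot response and a reference answer.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (made up for illustration).
bot_response_vec = [0.2, 0.7, 0.1]
reference_vec = [0.25, 0.65, 0.05]

score = cosine_similarity(bot_response_vec, reference_vec)
print(f"similarity: {score:.3f}")
```

A score near 1.0 suggests the two texts point in a similar semantic direction. It says nothing about whether a specific fact, such as a phone number, is right, which is one reason to treat it as a supplement to the scored assessment rather than a replacement.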
Dr. Song and I then met to talk through the Perplexity results team by team, comparing the assessment against what we’d seen of each team’s content and prompt design.
Before I go further, this wasn’t really a controlled study. The variables weren’t held constant. Different teams used different models, different RAG setups, and different styles of test question. Team capacity varied. Topic difficulty varied. Perplexity is a real evaluator but not a calibrated assessment instrument. It has its own retrieval biases and its own way of weighing comprehensiveness, and at times we saw it run overly critical on supporting details when the core content was actually solid.
But the pattern was striking enough that I think it’s worth replicating with a more rigorous design.
What we saw
The clearest contrast came from a coincidence in topic assignment. Three teams ended up working on closely related topics around food safety during power outages.
All three had access to the same source material: CDC, FDA, USDA, and Red Cross guidance on the 40°F/2-hour rule, perishable food handling, refrigerator and freezer timelines.
One of those teams scored a perfect 5/5 on correctness across all seven of its test questions. Perplexity’s assessment described the bot as “fully aligning with FEMA, CDC, and utility best practices” and “delivering critical, life-saving guidance.” That team had built a narrow, deliberately curated knowledge base. Its test questions matched what the knowledge base was designed to answer. Its system prompt kept the bot pulling cleanly from the structured content.
A second team working on essentially the same topic averaged 3.6 on correctness and 2.8 on comprehensiveness. The assessment called several responses “partially correct but falls short in both factual precision and practical completeness, especially for a life-critical topic.” Same source material was available. Same kinds of tools. Same model class. The difference was in how the content was organized before it got to the model.
A third team in this cluster scored well overall but had one telling error. The bot confidently gave incorrect refreezing guidance for thawed meat. Their knowledge base was structured but had a gap, and the bot filled the gap with confident-sounding false information.
The other vivid case involved phone numbers. Two teams built bots focused on accessing help: one for mental health resources, one for general post-disaster assistance. Both teams built their knowledge bases primarily from links to UNCW and local resource pages, expecting the RAG pipeline to surface the right information when asked. Both bots hallucinated.
When asked about UNCW counseling services, the mental health bot generated a phone number where the last four digits were wrong. The accessing-help bot gave UNCW’s main switchboard number as the number for the campus police department.
Neither of these is a small failure. These are help-seeking bots. The user dialing the number the bot provides is, by definition, someone in distress. Hallucinated contact information is the most consequential failure a chatbot in this domain can have, because it’s the failure that translates most directly into a real-world safety problem.
And it’s the kind of failure that scraping URLs produces. The bot has seen “UNCW police” and a phone number near each other in the source material, and it generates a plausible-looking answer that happens to be wrong.
Teams that built around information types didn’t make these errors as often. Their bots failed in different ways, but they tended to be merely incomplete rather than confidently wrong.
The team that scored highest in the entire cohort built around clear, source-grounded content about pet-friendly shelter access. Five out of six responses scored 5/5. Perplexity described them as “drawing directly from NC Emergency Management and United Way protocols.” That team’s knowledge base was small, narrow, and rigorously sourced. Their bot reflected that.
What this argues
I’ll restate the caveat. This wasn’t science. The variables weren’t held constant. Seventeen undergraduate teams working under real time pressure with novice technical skills is not a controlled environment. I am not making a claim that the structured-content advantage I’m describing here is statistically established.
But in a setting with 17 comparable teams working on comparable problems, the teams that organized their content by information type before encoding it into a knowledge base consistently produced more accurate, less hallucinated, less confidently-wrong chatbots than the teams that scraped URLs or dropped in link lists.
This is what the content professionals I talk to have been telling me about for a while. The model isn’t doing the structuring work for you. The model can only reflect what it’s been given.
Content that is already organized by meaning gives the retrieval layer something coherent to retrieve. Content that hasn’t done that work yet gives the retrieval layer raw material it can only guess at.
This is also, I think, the right argument for taking microcontent seriously as a methodology, separate from any specific encoding standard.
You don’t need DITA to start. You don’t need an XML toolchain. What you need is to start asking, for each piece of content you’re feeding a system, what kind of information this is and what shape it should take.
Precision Content’s typed-information approach is the one I’m using with students because it’s accessible to people who aren’t ready for the deeper technical apparatus, and because the gains it produces are real, even at the entry level.
What I’d want to test next
Seventeen student projects in a single semester is not where this story should end. A few things I’d want to look at, ideally with someone who can design the experiment properly:
A genuine controlled comparison. Same topic, same model, same testing framework, with three knowledge bases built three ways (scraped, lightly organized, typed-microcontent). Measure performance directly.
Specific information types. Which contribute most to AI performance? Is the gain mostly in concepts because they anchor retrieval? In processes because they answer the “how do I” questions users actually ask? In principles because they shape the bot’s voice and reasoning? I have hypotheses; I don’t have data.
Scaling effects. Does this hold at larger knowledge base sizes? My intuition is that the structured advantage compounds, but I’d want to see it.
Cross-domain. Does the pattern hold beyond disaster preparedness? I suspect it does, but disaster preparedness has unusually well-established authoritative sources. A domain with messier sources might tell a different story.
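To make the first of these concrete, here is a minimal sketch of how the trial plan for a controlled comparison might be laid out. The three knowledge-base conditions come from the list above. The `Trial` class, the `build_trials` helper, and the topic, model, and question strings are hypothetical placeholders; the sketch only builds the plan and does not run or score anything.

```python
from dataclasses import dataclass
from itertools import product

# The three knowledge-base conditions named above.
KB_CONDITIONS = ["scraped", "lightly_organized", "typed_microcontent"]


@dataclass(frozen=True)
class Trial:
    """One question asked against one knowledge-base condition."""
    topic: str
    model: str
    kb_condition: str
    question: str


def build_trials(topic: str, model: str, questions: list[str]) -> list[Trial]:
    """Cross every question with every knowledge-base condition.

    Topic and model are held fixed, so the knowledge base is the only
    thing that varies between trials.
    """
    return [
        Trial(topic, model, kb, q)
        for kb, q in product(KB_CONDITIONS, questions)
    ]


# Placeholder values, invented for illustration.
trials = build_trials(
    topic="example-topic",
    model="example-model",
    questions=["Example question 1?", "Example question 2?"],
)

for t in trials:
    print(t.kb_condition, "|", t.question)
```

Laying the plan out this way makes the design choice visible: every condition sees exactly the same questions under the same model, so any difference in scores can be attributed to how the knowledge base was built.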
If you’re working in this space and have done something more rigorous, I’d be glad to compare notes. The professional intuition is solid. We need the demonstrations.
➡️ And if you want to learn how to build chatbots like this, check out my course on Writing with AI. This is the same process my students used and is currently available free for paid subscribers for a limited time.



