Testing as Rhetorical Proof
How the Library of Alexandria might judge good prompts
Last semester, my students and I built a writing feedback chatbot for our technical communication course. In testing, it worked beautifully. Clear, specific feedback that maintained professional warmth. We deployed it.
Within two weeks, students started reporting inconsistent experiences. The same submission structure that earned detailed feedback on Monday produced superficial responses on Wednesday.
One student showed me screenshots. The chatbot had praised her conclusion as “effectively synthesized” in the morning, then flagged the identical paragraph as “needing stronger connections” that afternoon. Same prompt. Same model version. Same student text.
This isn’t a bug. It’s inherent to how language models generate text. And it’s why prompt testing requires something more rigorous than “try it and see if it looks good.”
Evaluation as Craft
The ancient Greeks had a term for what we need: kritikē technē … Or the art of judgment. The word kritikē comes from krinō (to separate, to decide), and technē means a teachable craft. Together, they describe a disciplined practice for evaluating the worth, correctness, and fitness of language.
The grammarians of the Library of Alexandria developed kritikē into a systematic discipline with repeatable procedures for evaluating texts. Working with multiple manuscript copies of Homer, they faced a problem familiar to anyone testing AI outputs: variant versions of the same content, with no obvious way to determine which was best.
Their solution was to operationalize judgment. They established criteria, collected evidence across variants, applied standards consistently, and recorded their decisions so others could follow the reasoning. Judgment became a craft that could be taught, reproduced, and improved.
This is exactly what prompt testing requires. When we evaluate AI outputs, we’re not asking “does this sound good?” We’re asking whether the outputs meet specific criteria reliably enough for a particular purpose.
That question demands a method—articulated standards, systematic procedures, transparent documentation.

Why Structured Prompting Demands Systematic Testing
When people claim structured prompting is dead, they’re usually working within single interactions or more open-ended, dialogic collaborations: ask a question, get an answer, move on (or keep iterating within that one session). In that context, casual prompting often works fine.
But the moment you’re building something that needs to perform reliably across users, sessions, and contexts, you’re no longer in single-interaction territory.
This could be:
a classroom assistant
a documentation helper, or
a content generation workflow.
You’re building a system. And systems require consistency that casual prompting can’t guarantee.
My research on prompt format bears this out. When I tested the same complex task across four different structures, the outputs varied dramatically. Not just in efficiency (processing time ranged from 64 to 120 seconds) but in character. The unstructured prompt produced exploratory, wandering responses. JSON triggered mechanical, compliance-document prose. Natural structure with clear sections generated focused, efficient communication.
Each format created different statistical conditions for the model’s token prediction. JSON tokens co-occur with technical documentation patterns in training data, so generating JSON-formatted input increases the probability of formal, exhaustive output patterns. Unstructured conversational input co-occurs with exploratory discussion, so the model follows those statistical tendencies.
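To make that concrete, here’s a rough sketch of what “the same task in different structures” can look like. The task, wording, and constraints below are hypothetical stand-ins rather than the prompts from my study; the point is that the content stays identical while the statistical surface changes.

```python
# Hypothetical sketch: one task expressed in three structural formats,
# so each variant can be sent to the same model and compared.
import json
import textwrap

TASK = "Summarize the attached usability report for a non-technical manager."

prompt_variants = {
    # Conversational phrasing, no explicit structure
    "unstructured": f"Hey, could you {TASK.lower()} Keep it friendly.",

    # JSON-formatted input co-occurs with technical documentation patterns
    "json": json.dumps(
        {"task": TASK, "audience": "non-technical manager",
         "tone": "professional", "length": "about 300 words"},
        indent=2,
    ),

    # Natural structure with labeled sections
    "sectioned": textwrap.dedent(f"""\
        ## Task
        {TASK}

        ## Audience
        Non-technical manager

        ## Constraints
        - Professional tone
        - About 300 words
        """),
}

for name, prompt in prompt_variants.items():
    # In a real comparison, send each variant to the same model under
    # identical settings and record timing plus output character.
    print(f"--- {name} ({len(prompt)} chars) ---\n{prompt}\n")
```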
Structure gives you leverage over consistency, but only if you verify that your structure actually produces the consistency you need. And structure isn’t the only variable. Temperature settings, model selection, and context length all affect output character. (I explore temperature’s effects in a separate lesson.) Testing helps you understand how these variables interact for your specific use case.
Testing Prompts vs. Testing Code
When we test software, we’re verifying logical operations. Given input X, does the function return output Y? The relationship is deterministic. Run the test a thousand times, get the same result a thousand times.
Prompt testing operates on different principles. We’re examining rhetorical reliability, not verifying logic. Does the relationship we’ve established between human intention and machine interpretation remain stable across time, context, and varied inputs?
This distinction matters because it changes what we’re looking for. Code either passes or fails. Prompts exist on a spectrum of reliability, and our job is to understand where on that spectrum a given prompt sits for a given purpose.
The writing feedback chatbot didn’t “fail” in any binary sense. It produced plausible feedback every time. The question was whether that feedback remained consistent enough to be pedagogically useful … And whether students could trust that the evaluation criteria were being applied reliably rather than arbitrarily.
That’s a question of judgment, not logic. And answering it requires a method for judgment.
What We’re Judging
When you test a prompt systematically, you’re evaluating three aspects of the human-AI collaboration you’ve created.
Stance stability. Every prompt establishes a rhetorical stance, or a position from which the AI speaks. “You are a writing tutor who provides constructive feedback focused on argument structure and evidence use” isn’t just an instruction. It’s establishing a consistent voice and perspective.
But does that stance actually persist?
With our classroom chatbot, testing revealed that the constructive-tutor stance held firm for the first few exchanges in a conversation, then gradually drifted toward generic encouragement.
This could happen because the model’s training data contains patterns where tutoring interactions soften over time, or because earlier instructions lose influence as the context window fills with conversation.
Whatever the mechanism, the effect was measurable: stance drift under extended use. Testing helped us identify where drift occurred so we could add stabilizing elements—periodic reinforcement of the evaluative criteria, structural markers that maintained the rigorous-feedback pattern.
Interpretive framework reliability. Your prompt doesn’t just tell the AI what to do. It shapes how inputs get processed. When our chatbot prompt said “evaluate based on the technical communication rubric criteria,” we were creating conditions where rubric-related language would influence the output. But those conditions had gaps we didn’t anticipate.
The rubric worked well for standard assignments because the model had seen similar patterns. But when students submitted creative approaches, like an infographic, the statistical patterns broke down. The model couldn’t match rubric language to unfamiliar input formats, so it defaulted to surface-level observations about grammar and formatting. Testing with diverse input types revealed these blind spots. The fix wasn’t clarifying instructions—it was providing examples of the rubric applied to non-standard formats, giving the model patterns to match against.
Collaborative boundaries. Every prompt creates what I think of as a collaborative space, or the zone where human intention and machine capability overlap productively. Testing maps the edges of this space.
For the classroom chatbot, we needed to know: What types of student writing produce useful feedback? Where does the feedback quality drop off? What submission characteristics cause confusion or generic responses? Which edge cases does the prompt handle gracefully, and which break it entirely?
These boundaries aren’t obvious from the prompt text. They emerge only through running varied inputs through the system and observing where reliability holds and where it fractures.
Knowing what to judge is only half the challenge. The Alexandrian grammarians understood this. They didn’t just identify what made a text authentic or well-formed. They also developed procedures for making those judgments systematically: comparing variants, marking uncertainties, documenting reasoning.
Prompt testing requires both dimensions. We need criteria for evaluation—what counts as stable stance, reliable interpretation, appropriate boundaries. And we need procedures for applying those criteria.
The Rhetorical Appeals as Evaluation Criteria
The Alexandrian grammarians faced a problem we might recognize: they had no original to compare against. When scholars assessed a line of Homer, they weren’t checking it against some authoritative master copy—none existed. Homer was oral tradition committed to writing centuries after composition, and every manuscript was a copy of copies, each with its own variants and corruptions.
So how did these scholars develop criteria for judgment? By immersion in the corpus itself. They studied patterns across many manuscripts, inferring what Homeric diction typically looked like, identifying metrical conventions from the poems themselves, developing a sense of stylistic consistency through deep familiarity with the work. Their standards emerged from the body of texts, then got applied back to evaluate individual passages.
We’re doing something similar with AI outputs. There’s no “ideal response” to compare against—just multiple outputs from which we infer what “good” looks like for a particular purpose. Our criteria emerge from examining what works, identifying patterns that characterize successful responses, and then applying those standards to evaluate new outputs.
But rhetoric offers a framework that accelerates this process: the three appeals. Aristotle identified ethos (credibility), pathos (emotional engagement), and logos (reasoning) as the fundamental dimensions of persuasive communication. These aren’t just persuasion techniques—they’re categories for evaluating whether communication works.
We’ve applied them to speeches, text, digital media … And now AI outputs.
When we adapt them for prompt testing, they become three lenses for examining output quality.
Ethos Testing: Can the Output Be Trusted?
Ethos in classical rhetoric establishes the speaker’s credibility and character. For AI outputs, we’re not assessing whether the model has credibility (it doesn’t, inherently), but whether the outputs are trustworthy enough for the intended purpose.
Trustworthiness breaks down into two components: consistency and accuracy.
Consistency asks whether the same prompt produces comparable outputs across multiple runs. This matters because inconsistent outputs can’t be trusted for systematic use. If a documentation prompt generates comprehensive coverage on one run and superficial summaries on the next, you can’t build a workflow around it.
Testing for consistency is straightforward: run the same prompt with the same input multiple times and compare the outputs. But “same” doesn’t mean identical. The question is whether variation falls within acceptable bounds for your purpose.
Consider a blog title generator. Testing the same article summary five times might produce five different titles—but if all five maintain brand voice, include relevant keywords, and target the right audience, that variation is a feature for brainstorming purposes. The prompt has sufficient ethos for generating options.
Contrast that with a product description prompt. If testing reveals 30% variation in which technical specifications get mentioned, the prompt lacks the consistency required for that task. Product descriptions need completeness, not creativity. The prompt would need explicit checklists and verification steps until testing shows reliable coverage of required elements.
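Here’s a minimal sketch of that kind of consistency check, assuming a stand-in generate() function in place of whatever model call you actually use; the product and spec list are hypothetical.

```python
# Minimal coverage-consistency check across multiple runs of the same prompt.
from collections import Counter

REQUIRED_SPECS = ["battery life", "weight", "warranty", "connectivity"]

def generate(prompt: str) -> str:
    """Stand-in for a real model call; replace with your API of choice."""
    return "Lightweight design with 12-hour battery life and a 2-year warranty."

def coverage_report(prompt: str, runs: int = 5) -> Counter:
    """Count how many runs mention each required element."""
    counts = Counter()
    for _ in range(runs):
        output = generate(prompt).lower()
        for spec in REQUIRED_SPECS:
            if spec in output:
                counts[spec] += 1
    return counts

report = coverage_report("Write a product description for the X200 headset.")
for spec in REQUIRED_SPECS:
    print(f"{spec}: mentioned in {report[spec]}/5 runs")
```

If a required element shows up in only three of five runs, you have evidence of the 30% problem rather than a hunch about it.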
Accuracy asks whether the outputs are factually correct and appropriately grounded. This is particularly critical for prompts that draw on domain knowledge or make claims that could be verified.
Testing for accuracy requires reference points—either human expert review or comparison against known-correct information. For our classroom chatbot, we tested accuracy by having instructors evaluate whether the AI’s feedback aligned with how they would assess the same submissions. Where the AI and instructors diverged significantly, we examined whether the prompt’s criteria were unclear or whether the model was introducing its own evaluation standards.
The ethos question for any prompt is: Can I trust this output enough to use it for its intended purpose? Testing answers that question with evidence rather than hope.
Pathos Testing: Is the Emotional Register Appropriate?
Pathos in classical rhetoric involves emotional appeal—engaging the audience’s feelings appropriately for the context. For AI outputs, we’re testing whether the tone and emotional register remain appropriate across different inputs and contexts.
This matters more than many practitioners realize. Tone inconsistency can undermine otherwise solid content. A customer service prompt that sounds helpful for simple questions but becomes condescending for complex ones will damage relationships regardless of how accurate the information is.
Imagine an automated feedback system for student writing. The prompt might maintain an encouraging tone when reviewing strong work but shift to patronizing reassurance for weaker submissions. Phrases like “You tried your best” and “Don’t worry, writing is hard” appearing only in responses to struggling students would unintentionally signal that the system had already judged them as less capable.
In this scenario, the prompt’s ethos could be fine—consistent, accurate feedback. But its pathos would be off, treating different students with different levels of respect based on submission quality.
Testing pathos requires diverse inputs that trigger different emotional contexts. For a feedback system, this means testing with:
Strong submissions (does it avoid excessive praise that might seem hollow?)
Weak submissions (does it maintain respect while identifying problems?)
Frustrated student language (does it respond with patience rather than matching the frustration?)
Confused questions (does it clarify without condescension?)
For a customer service prompt, you’d test across complaint types, customer tones, and issue severity. For documentation, you might test whether the prompt maintains appropriate professional distance when explaining both mundane features and exciting new capabilities.
The pathos question is: Does the emotional register remain appropriate across the full range of likely inputs? Testing reveals where tone calibration breaks down.
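One way to make those probes repeatable is to write them down as a small test matrix before you start evaluating. Everything below is a hypothetical example; the useful move is pairing each input type with the specific tonal failure you’re watching for.

```python
# Hypothetical pathos test matrix for a feedback prompt.
pathos_probes = [
    ("strong submission", "A polished draft with a clear argument and evidence.",
     "hollow or excessive praise"),
    ("weak submission", "A draft missing a thesis and supporting evidence.",
     "patronizing reassurance ('you tried your best')"),
    ("frustrated student", "I've rewritten this three times and it's still wrong?!",
     "matching the frustration instead of de-escalating"),
    ("confused question", "I don't get what a 'claim' even is here.",
     "condescension while clarifying"),
]

for label, test_input, failure_mode in pathos_probes:
    # Generate a response for each probe and review it against the failure mode;
    # tone judgments still need a human reader or a carefully validated rubric.
    print(f"{label}: watch for {failure_mode}")
```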
Logos Testing: Is the Reasoning Sound?
Logos in classical rhetoric involves logical argument, or the structure and validity of reasoning. For AI outputs, we’re testing whether the logical framework established in the prompt actually governs how outputs get generated.
This goes beyond checking factual accuracy (that’s ethos). Logos testing examines whether the prompt’s stated priorities, decision rules, and evaluation criteria actually shape the output—or whether they get overridden by other patterns in the model’s training.
Consider a documentation prompt that claims to prioritize accuracy but consistently chooses simpler explanations over precise ones. The prompt might include both “maintain technical accuracy” and “explain in accessible language.” In practice, accessibility could win out every time, with the AI sacrificing precision for readability without letting the user know.
This wouldn’t be a failure of the model. It would be a logical contradiction in the prompt that testing reveals. “Accurate and accessible” sounds reasonable until you encounter cases where accuracy requires technical precision that isn’t accessible. Without guidance for resolving that tension, the model could default to patterns from its training data, where accessible explanations are more common than technically precise ones.
Testing for logos means deliberately creating inputs that force your prompt’s priorities into conflict:
If your prompt says “be concise but thorough,” test with topics that can’t be covered both concisely and thoroughly. Which wins?
If your prompt prioritizes “user benefit” and “technical accuracy,” test with features where the accurate description doesn’t sound beneficial. What happens?
If your prompt establishes an evaluation hierarchy (“first check X, then Y, then Z”), test with inputs where X and Y suggest different conclusions. Does the hierarchy hold?
The logos question is: When the prompt’s instructions compete, does the output resolve conflicts the way I intend? Testing surfaces hidden contradictions and reveals which instructions actually govern behavior.
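Here’s a sketch of how those conflict probes might be recorded, so every run gets checked against the resolution you intended. The conflicts and resolutions below are illustrative assumptions, not a universal set.

```python
# Hypothetical logos probes: inputs that force stated priorities into conflict,
# paired with the resolution the prompt is supposed to produce.
logos_probes = [
    {
        "conflict": "concise vs. thorough",
        "input": "Explain our entire authentication flow in one paragraph.",
        "intended_resolution": "thoroughness wins; flag that one paragraph is insufficient",
    },
    {
        "conflict": "user benefit vs. technical accuracy",
        "input": "Describe the breaking change in the new API version.",
        "intended_resolution": "accuracy wins; state the breakage plainly",
    },
    {
        "conflict": "evaluation hierarchy (check X, then Y)",
        "input": "A submission where criterion X passes but Y clearly fails.",
        "intended_resolution": "hierarchy holds; the Y failure is still reported",
    },
]

for probe in logos_probes:
    # Run each input through the prompt several times and note whether the
    # outputs resolve the conflict the way you intended.
    print(f"{probe['conflict']}: expect -> {probe['intended_resolution']}")
```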
Combining the Three Lenses
Most prompt testing requires all three lenses, but their relative weight depends on purpose.
For a research summarization prompt, logos dominates. You need the reasoning structure to govern output reliably. Ethos matters for accuracy, but pathos is less critical since emotional register in research summaries is relatively narrow.
For a customer-facing chatbot, pathos may matter most. Users will forgive minor inconsistencies or occasional reasoning gaps if the tone feels right. They won’t forgive condescension or inappropriate cheerfulness when they’re frustrated.
For a compliance documentation prompt, ethos is paramount. Consistency and accuracy are non-negotiable. Pathos and logos matter, but trustworthiness is the threshold requirement.
When designing your testing approach, identify which appeals are critical for your use case and weight your testing accordingly. A prompt can have strong ethos but weak pathos (consistent and accurate but tonally inappropriate), or strong logos but weak ethos (sound reasoning but inconsistent execution). Testing across all three reveals the full picture.
From Criteria to Procedures
Knowing what to evaluate doesn’t tell you how to evaluate it. The Alexandrian grammarians understood this. They developed not just standards for judgment but systematic procedures for applying those standards: methods for comparing variants, marking uncertainties, and documenting reasoning so others could follow or challenge their conclusions.
These procedures translate surprisingly well to prompt testing. The Alexandrians were solving a version of our problem: multiple variant texts, no definitive original, and the need for judgments that could be taught, reproduced, and defended.
Recension: Multi-Run Comparison
The Alexandrians framed this work as diorthōsis and ekdosis. They collated multiple manuscript “witnesses” to identify variants, marked doubtful lines, and recorded their comparative reasoning in commentaries. Rather than trusting any single copy, they corrected the text and then issued a stabilized edition, choosing readings based on consistent patterns across the evidence.
For prompt testing, this becomes multi-run comparison. Never evaluate a prompt based on a single output. Run the same prompt with the same input multiple times and compare results.
This sounds obvious, but it’s surprisingly rare in practice. Most prompt development follows a pattern: write prompt, test once, adjust if the output looks wrong, test once more, deploy. That’s like an Alexandrian grammarian examining a single manuscript and declaring it authoritative.
Multi-run comparison reveals what single tests hide. When I tested my four prompt formats, I didn’t just run each once. Each variation ran under controlled conditions, with metrics tracked across runs. The patterns emerged from comparison, not from any single output.
For practical testing, I recommend a minimum of three runs for informal evaluation and five or more for anything you’ll deploy in production. Compare outputs looking for:
Structural consistency. Do the outputs follow the same organization? If your prompt specifies a format, does that format hold across runs, or does it drift?
Coverage variation. Do all runs address the same key points, or do some outputs omit information that others include? For the blog title generator, variation in titles is fine. For product descriptions, variation in which features get mentioned is a problem.
Tonal range. Do all outputs stay within the same emotional register, or do some runs produce noticeably different tones? This is your pathos check.
Priority adherence. When the prompt contains competing instructions, do all runs resolve the conflict the same way? This is your logos check.
The goal isn’t identical outputs. That’s neither possible nor desirable. The goal is understanding the range of variation your prompt produces and determining whether that range falls within acceptable bounds for your purpose.
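Parts of that comparison can be made mechanical. Here’s a minimal sketch, assuming you’ve already collected the outputs from several runs; the heading check and required-points list are simplifications, and tonal range and priority adherence still need a human read.

```python
# Recension-style comparison across multiple outputs ("witnesses") of one prompt.
import re

def compare_runs(outputs: list[str], required_points: list[str]) -> dict:
    """Compare runs for structural consistency and coverage variation."""
    # Structural consistency: do all runs use the same section headings?
    heading_sets = [tuple(re.findall(r"^#+\s*(.+)$", o, flags=re.M)) for o in outputs]
    structure_stable = len(set(heading_sets)) == 1

    # Coverage variation: which required points does each run mention?
    coverage = [
        {p for p in required_points if p.lower() in o.lower()} for o in outputs
    ]
    always_covered = set.intersection(*coverage) if coverage else set()
    sometimes_missing = set(required_points) - always_covered

    return {
        "structure_stable": structure_stable,
        "always_covered": sorted(always_covered),
        "sometimes_missing": sorted(sometimes_missing),
    }

runs = [
    "# Summary\nCovers battery life and warranty.",
    "# Summary\nCovers battery life only.",
]
print(compare_runs(runs, ["battery life", "warranty"]))
```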
Athetesis: Marking Uncertainty
When Alexandrian grammarians encountered lines they suspected were spurious or corrupted, they didn’t simply delete them. They marked them with an obelos—a horizontal line indicating doubt. This kept them visible in the text. Future scholars could see the judgment, assess the reasoning, and reach their own conclusions.
This practice, called athetesis, prioritized transparency over tidiness. A marked line told readers: “This is questionable, but I’m preserving it so you can evaluate my judgment.”
For prompt testing, this becomes uncertainty flagging. When you identify problems in AI outputs, mark them explicitly rather than silently fixing them or discarding the output entirely.
This matters for two reasons. First, patterns of uncertainty reveal prompt weaknesses. If you’re consistently flagging the same type of problem, you’ve identified where your prompt needs revision. Silent fixes hide these patterns.
Second, flagged outputs become training data for your own judgment. Over time, a collection of marked outputs teaches you (and your team) what to watch for. The Alexandrians built scholia (or commentary traditions) around their marked texts. You can build similar institutional knowledge around flagged AI outputs.
A simple flagging system might include markers like:
H for hallucination (unsupported claims or fabricated details)
T for tone problems (inappropriate emotional register)
I for incompleteness (missing required elements)
C for contradiction (conflicts with prompt instructions or internal inconsistency)
D for drift (departure from established stance or format)
The specific markers matter less than consistent use. Pick a system and apply it across all your testing so patterns become visible.
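For what it’s worth, here’s one way to make a flagging system concrete without building any infrastructure around it. The markers mirror the list above; the record format is just an illustration.

```python
# A small flagging vocabulary plus a record that keeps the flagged output
# visible rather than silently fixing or discarding it.
FLAGS = {
    "H": "hallucination (unsupported claims or fabricated details)",
    "T": "tone problem (inappropriate emotional register)",
    "I": "incompleteness (missing required elements)",
    "C": "contradiction (conflicts with instructions or itself)",
    "D": "drift (departure from established stance or format)",
}

def flag_output(output: str, flags: list[str], note: str = "") -> dict:
    """Attach flags to an output while preserving the output itself."""
    unknown = [f for f in flags if f not in FLAGS]
    if unknown:
        raise ValueError(f"Unknown flags: {unknown}")
    return {"output": output, "flags": flags, "note": note}

flagged = flag_output(
    "Your conclusion effectively synthesizes the sources cited in Chapter 9.",
    flags=["H"],
    note="Submission has no Chapter 9; fabricated reference.",
)
print(flagged)
```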
Scholia: Documenting Your Reasoning
The Alexandrians didn’t just mark problems—they explained their judgments. Marginal notes called scholia documented why a line was suspect, what alternatives existed, and how the editor had reasoned through the decision. These annotations accumulated over generations, creating a scholarly conversation around the text.
For prompt testing, this becomes documented evaluation. Don’t just record that an output passed or failed—record why.
This is where most testing falls apart. Teams run prompts, glance at outputs, make a gut judgment, and move on. Nothing gets written down. A month later, no one remembers why certain prompt versions were rejected or what problems the current version was designed to solve.
Documented evaluation doesn’t require elaborate systems. A simple log capturing the following for each test proves valuable:
The input used. What specific content did you feed the prompt? Save it so tests can be reproduced.
The output received. Keep the full output, not just a summary or judgment.
Your assessment. Did it pass or fail on ethos, pathos, logos? What specific problems did you identify? Use your flagging system.
Your reasoning. Why did you judge it this way? What would have made it better? This is the scholia—the part that teaches future evaluators (including future you) how to think about the prompt’s performance.
When you revise a prompt based on testing, document what you changed and why. Link the revision to the specific test failures that motivated it. This creates a trail that makes prompt development cumulative rather than circular.
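A single log entry might look something like this. The field names and values are hypothetical; what matters is that the input, the full output, the flags, and the reasoning get captured together and tied to a prompt version.

```python
# Hypothetical test-log entry: enough structure to make evaluations
# reproducible without building a whole system around them.
from dataclasses import dataclass, field, asdict
from datetime import date
import json

@dataclass
class TestRecord:
    prompt_version: str                              # which revision of the prompt was tested
    test_input: str                                  # the exact input, saved for reproducibility
    output: str                                      # the full output, not a summary
    flags: list[str] = field(default_factory=list)   # H/T/I/C/D markers
    assessment: str = ""                             # pass/fail per appeal (ethos, pathos, logos)
    reasoning: str = ""                              # the scholia: why you judged it this way
    run_date: str = field(default_factory=lambda: date.today().isoformat())

record = TestRecord(
    prompt_version="feedback-tutor-v3",
    test_input="(student draft submitted as an infographic)",
    output="(full model output, saved verbatim)",
    flags=["I"],
    assessment="ethos: pass; pathos: pass; logos: fail",
    reasoning="Rubric criteria ignored for the non-standard format; motivated the v4 examples.",
)
print(json.dumps(asdict(record), indent=2))
```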
Putting Procedures into Practice
These Alexandrian procedures provide the methodological foundation. But you still need practical workflows for implementing them. The approach you choose depends on your technical resources and scale.