
Building an AI reviewer that works at the speed of content

How I turned a content governance gap into an AI-powered solution.

AI strategy
My role Content design lead
Timeline 2026 – ongoing
Deliverables AI agent, evaluation framework, governance model

Project overview

As our content design team grew, I saw an opportunity to put the content design system we'd spent years building to a new use. Having watched the same content governance issues play out when I was lead editor for the technical writing team, I knew static documentation wouldn't be enough to keep our standards consistent at scale.

I got the team on board and started building a custom AI agent trained on our own guidelines, centred on trustworthiness from the start. The challenge was maintaining project momentum with constraints that were difficult to control: limited tooling and a content design system that we were still building out.

The problem

Our team had invested years into building a rigorous content design system: a style guide, a glossary, and a growing library of content components. We had everything documented, but it just wasn't being used consistently.

This wasn't a behavioural issue. We were expecting writers to cross-reference guidelines as they drafted copy, but that's not how writing works in practice. Without context to guide writers towards the relevant guidelines, the rules just didn't stick.

The result? Inconsistent terminology, minor style deviations, and a review bottleneck that was costing us. The other senior writer and I were each spending an estimated 5 hours a week on style and terminology review, time we weren't spending on strategic work like domain modelling and content frameworks. The review itself was necessary, particularly for the glossary we were building, but the bottleneck was slowing us down.

That got me thinking:

How could we enforce our content rules and guidelines without depending on people's memory?

The biggest constraint

The first thing that came to mind was the stringent security restrictions we had to abide by. We weren't allowed to use any native third-party integrations, Figma plugins, or any software that wasn't explicitly approved by IT.

If there was a solution, it would have to come from within.

The breakthrough

The turning point came when the user research team demoed the CLI version of our internal AI agent to us. They were using it to consolidate research insights, and we immediately saw that it was capable of far more than we originally thought. In our earlier experiments with the web version, we had encountered a relatively limited feature set.

We ran an experiment in which we manually exported frames from Figma to test whether the agent could ingest images, and it could.

Now, the agent could review UI copy in context, not just as raw strings, so it could evaluate content the way users would encounter it.

The end-to-end approach

After identifying the problem and discovering a solution, the team and I divided the work into phases:

1

Converting the knowledge base from unstructured to structured data

The obvious first input was our style guide, but it was written for humans, not machines. I asked the other senior writer to convert it to a machine-readable format that would be optimized for consistent interpretation by the agent.

We also experimented with different chunking strategies so that each embedding would carry just the right amount of context for the agent to parse, making the output it generated genuinely useful to the writer.
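As a rough sketch of the kind of chunking we tried, a heading-scoped splitter keeps each rule with its section heading so the embedding retains that context. (The heading pattern and size limit below are illustrative, not our production values.)

```python
import re

def chunk_style_guide(markdown_text, max_chars=1200):
    """Split a markdown style guide into heading-scoped chunks for embedding."""
    # Split on level 1-3 headings; the capture group keeps the headings.
    parts = re.split(r"(?m)^(#{1,3} .+)$", markdown_text)
    chunks = []
    # parts alternates: [preamble, heading, body, heading, body, ...]
    for heading, body in zip(parts[1::2], parts[2::2]):
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        current = heading
        for para in paragraphs:
            # Start a new chunk when the section runs long, repeating
            # the heading so every chunk keeps its context.
            if current != heading and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = heading
            current += "\n\n" + para
        if current != heading:
            chunks.append(current)
    return chunks
```

Strategies like this trade chunk size against context: too small and a rule loses the heading that explains when it applies; too large and retrieval gets noisy.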

2

Sequencing the content architecture

The style guide was only the start. We defined a roadmap to embed our full content design system as training data for our custom agent:

  • Style guide (complete) — Voice, tone, grammar, and formatting rules
  • Glossary (complete) — ~250 terms with definitions, usage notes, and stop words
  • Content component library (planned) — Reusable patterns to replace ad-hoc string creation

The goal was for each layer to govern a different aspect of our content: how we write, what we call things, and how content is structured. Together, these three layers give the agent a complete picture of our content standards.
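For illustration, a machine-readable glossary entry might look something like the sketch below; the field names, the example term, and the lookup helper are all hypothetical, not our actual schema.

```python
# Hypothetical schema for one glossary entry; the term and fields
# are illustrative, not taken from the real glossary.
GLOSSARY = {
    "sign in": {
        "definition": "Authenticate and start a session.",
        "usage": "Use as a verb: 'Sign in to your account.'",
        "stop_words": ["log in", "login", "sign on"],
    },
}

def flag_stop_words(copy, glossary):
    """Return (preferred_term, stop_word) pairs found in a string of UI copy."""
    lowered = copy.lower()
    return [
        (term, stop)
        for term, entry in glossary.items()
        for stop in entry["stop_words"]
        if stop in lowered
    ]
```

A call like `flag_stop_words("Log in to continue", GLOSSARY)` would surface the pair `("sign in", "log in")`, giving the agent a deterministic terminology check to pair with its model-based review.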

3

Building the evaluation framework

We knew that this governance system would only be as good as its calibration, so I designed a human-in-the-loop evaluation process alongside another writer on the team.

Every month, we batch-review 50+ strings, independently assessing whether we agree with the agent's output; where we disagree, we document our rationale. We then calculate inter-rater reliability using Cohen's kappa, which gives us a chance-corrected measure of how consistently the agent's judgments align with ours.
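Cohen's kappa corrects raw percent agreement for the agreement two raters would reach by chance. A minimal sketch for two raters (the labels below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters on the same items, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: share of items where both raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if p_e == 1:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# e.g. human vs agent verdicts on six reviewed strings
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
agent = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(human, agent), 2))  # 0.67
```

A kappa of 1 is perfect agreement and 0 is no better than chance, which is why it is a sounder calibration signal than raw accuracy.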

Why design the evaluation framework this way? Trust. We knew we needed a data-driven evidence base to show the team and our stakeholders that the agent could review our content to a high standard.

The outcomes

The agent is in active development, with the evaluation framework already running. Here's what we've seen so far:

1

Governance that scales without extra headcount

Once fully trained, the agent operates as a persistent reviewer, available at any point in the writing process and without competing priorities.

2

Senior writers unblocked

Getting routine style and terminology review off our plates frees us up for strategic work like content strategy and upstream product involvement, and lets us focus on the judgment calls an AI agent can't make.

3

An evaluation framework designed to catch what metrics miss

Based on the team's early feedback on initial output, we designed the evaluation framework to highlight the gap between rule-following and intent, specifically to catch what a pure accuracy metric wouldn't reveal. That calibration data will be instrumental in making the agent trustworthy enough to actually use.