Observed Analytical Engine
Building a methodology-governed AI research platform on Claude — purpose-built for public-interest organisational accountability.
Accountability infrastructure. Built to withstand scrutiny.
Observed is a public-interest research platform designed to produce rigorous, defensible analysis of organisational workplace conditions. The system compares publicly available signals against a validated eight-domain benchmark framework drawn from peer-reviewed academic literature and regulatory standards.
The challenge was not building an AI that could produce analysis. It was building one that could produce analysis that a lawyer, a journalist, or a regulator could pick apart — and not find a single unsupportable claim.
Eight validated benchmark domains
Each domain is grounded in peer-reviewed literature and regulatory frameworks. No general AI knowledge is used for benchmark claims.
Nine analytical passes per matter
Every analysis runs the same sequence: source collection, privacy scan, legal risk scan, classification, comparison, and the remaining passes, in order, every time.
Zero findings of wrongdoing
The system produces benchmark comparisons, not verdicts. Every output is legally conscious by design.
Equal weight to positive signals
The engine is explicitly designed not to default to finding problems. Strong performance is named and evidenced with the same rigour as concerns.
Accountability research has a reliability problem.
Organisations operating in the public interest — charities, healthcare providers, community services, schools — are subject to growing public and media scrutiny of their internal culture and governance. But the tools available for that scrutiny are blunt: anonymous review platforms with no methodology, media coverage driven by individual complaints, and formal regulatory processes that move slowly and narrowly.
The result is a landscape where serious organisational harm can go undocumented, and serious misrepresentation can go unchallenged. Both failures are costly: to workers, to organisations, and to public trust.
What didn’t exist was a repeatable, evidence-governed framework that could assess organisational conditions consistently, defend its conclusions, and give equal analytical weight to positive and negative signals.
The constraints that shaped the build
- Every finding had to be traceable to a specific, public, dated source
- No natural person could be named as a subject of findings
- No output could constitute — or imply — a legal determination
- The engine had to be incapable of simply finding what an operator wanted it to find
- Positive organisational performance had to be surfaced with the same confidence as risk signals
- The system had to stop itself at defined points and wait for a human
- Right of response had to be structurally embedded, not optional
- The methodology had to be fully disclosed in every output
System architecture
The design problem was not capability — it was constraint. The goal was an AI system that was genuinely difficult to misuse, even by the person operating it.
Benchmark library development
Eight domain documents were built from primary academic and regulatory sources, covering topics that include workplace bullying, psychosocial risk, psychological safety, leadership, worker wellbeing, legal frameworks, and cross-sector context. These documents are the only permitted source for benchmark claims; the AI cannot substitute its general knowledge.
The nine-pass analytical pipeline
Rather than a single prompt-to-output process, Observed runs every analysis through nine discrete passes: source collection and deduplication, privacy scan, legal risk, signal classification, benchmark comparison, gap analysis, confidence rating, contradiction check, and human review gate. The passes run in that order, with no skipping. The sequence is the quality control.
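The fixed ordering described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Observed implementation: the pass names come from the description, while `run_pipeline`, the `handlers` mapping, and the `state` dictionary are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List

# Pass names follow the description above; everything else is hypothetical.
PASSES: List[str] = [
    "source_collection_and_deduplication",
    "privacy_scan",
    "legal_risk",
    "signal_classification",
    "benchmark_comparison",
    "gap_analysis",
    "confidence_rating",
    "contradiction_check",
    "human_review_gate",
]

Handler = Callable[[dict], dict]


def run_pipeline(handlers: Dict[str, Handler], state: dict) -> dict:
    """Run every pass in the fixed order; refuse to start if any pass is missing."""
    missing = [name for name in PASSES if name not in handlers]
    if missing:
        raise ValueError(f"cannot skip passes: {missing}")
    completed = []
    for name in PASSES:
        # Each pass receives the state produced by the pass before it.
        state = handlers[name](state)
        completed.append(name)
    return {**state, "completed_passes": completed}
```

Because the order lives in one list and the runner rejects an incomplete handler set before doing any work, a caller cannot reorder or drop a pass without changing the code itself.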
Source type hierarchy and weighting
A tiered source weighting system applies consistently across every analysis. ERA decisions and WorkSafe enforcement notices sit at the top. Anonymous review platforms sit at the bottom. The same weighting logic applies regardless of whether a signal is positive or negative — the system cannot treat a favourable review platform signal as more credible than an unfavourable one.
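One way to make the weighting independent of signal direction is to key it on source type alone. The sketch below assumes this approach; the numeric weights and the identifiers (`SOURCE_WEIGHTS`, `Signal`, `weight`) are placeholders for illustration, and only the top and bottom tiers named above are included.

```python
from dataclasses import dataclass

# Hypothetical weights. Only the top and bottom tiers are named in the text;
# the numbers are placeholders, not the platform's real values.
SOURCE_WEIGHTS = {
    "era_decision": 1.0,
    "worksafe_enforcement_notice": 1.0,
    "anonymous_review_platform": 0.2,
}


@dataclass(frozen=True)
class Signal:
    source_type: str
    polarity: str  # "positive" or "negative"


def weight(signal: Signal) -> float:
    """Return the weight for a signal; polarity is deliberately never consulted."""
    return SOURCE_WEIGHTS[signal.source_type]
```

Since `weight` never reads `polarity`, a favourable and an unfavourable signal from the same source type always receive the same weight.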
Compliant language architecture
The operating manual contains a non-negotiable language framework. The engine compares signals against benchmarks — it never makes findings. Every statement is structurally reviewed against a set of prohibited phrasings and required reformulations before output. This is not stylistic guidance; it is the mechanism that keeps the platform on the right side of defamation law.
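A language review of this kind could be approximated with a pattern table. The sketch below is an assumption about how such a check might look, not the manual's actual framework: the two prohibited phrases and their reformulations are invented examples, and `review_language` is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Invented examples only. The real prohibited phrasings and required
# reformulations are defined in the operating manual and are not shown here.
PROHIBITED: List[Tuple[str, str]] = [
    (r"\bis guilty of\b", "signals are inconsistent with the benchmark for"),
    (r"\bbreached the law\b", "public signals diverge from the regulatory benchmark"),
]


def review_language(text: str) -> List[Tuple[str, str]]:
    """Return (matched phrase, suggested reformulation) pairs found in the text."""
    hits = []
    for pattern, reformulation in PROHIBITED:
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.group(0), reformulation))
    return hits
```

A draft would only move forward when `review_language` returns an empty list; any hit sends the sentence back for rewording.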
Forced human checkpoints
Three passes require a complete stop and explicit operator confirmation before the system proceeds. The AI cannot work around these gates — they are structural, not advisory. At Pass 9, the AI presents a full draft and a 16-item human review checklist. Nothing is approved for publication until the operator has worked through every item and typed one of three specific responses.
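The Pass 9 gate can be pictured as a function that returns nothing useful until both conditions are met. This is a sketch under assumptions: the checklist size comes from the text, but the three response strings are hypothetical (the source does not name them), as are `review_gate` and its parameters.

```python
from typing import Set

CHECKLIST_SIZE = 16  # from the text; the item wording is not reproduced here

# Hypothetical response strings. The source says there are three specific
# responses but does not state what they are.
ALLOWED_RESPONSES = {"APPROVE", "REVISE", "REJECT"}


def review_gate(reviewed_items: Set[int], response: str) -> str:
    """Return the operator's response only if every item was reviewed and the response is valid."""
    missing = sorted(set(range(1, CHECKLIST_SIZE + 1)) - reviewed_items)
    if missing:
        raise ValueError(f"checklist items not reviewed: {missing}")
    if response not in ALLOWED_RESPONSES:
        raise ValueError("response must exactly match one of the allowed options")
    return response
```

Exact string matching is a deliberate choice in this sketch: a loose "ok" or "yes" does not pass the gate.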
Equal-weight positive signal framework
A deliberate design decision addressed the default failure mode of accountability research: finding what you expect to find. The V1.1 update embedded explicit positive signal language, outcome evidence for strong performance, and active checks that flag when the engine has undersold well-evidenced positive findings. The system is designed to say clearly when an organisation is doing well.
What the system delivers
A research platform that can be published, defended, and trusted — by operators, by organisations being analysed, and by the public reading the outputs.
Legally defensible outputs
Every output is structured to avoid findings of wrongdoing, named individual subjects, and implied legal conclusions. The methodology disclosure block is mandatory on every report.
Consistent analytical standard
The nine-pass pipeline applies identically regardless of the organisation being analysed, its size, sector, or political context. The same benchmark. The same sequence. The same confidence rating definitions.
Full output pack per analysis
Each approved matter produces a full analysis report, blog article, three NotebookLM source packs for audio and video, an SEO pack, a featured image brief, and a right of response letter.
Operator accountability built in
The system records the benchmark library version, the human review date, the reviewer identity, and the right of response outcome on every published report. There is no way to publish without this audit trail.
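The audit trail requirement can be sketched as a record type that publication depends on. The four recorded facts come from the description above; the field names, the `AuditRecord` class, and the `publish` function are hypothetical and shown only to illustrate the idea.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class AuditRecord:
    # Field names are hypothetical; the four recorded facts come from the text.
    benchmark_library_version: str
    human_review_date: date
    reviewer_identity: str
    right_of_response_outcome: str


def publish(report_body: str, audit: AuditRecord) -> dict:
    """Attach the audit record to a report, refusing if any field is empty."""
    for f in fields(audit):
        if not getattr(audit, f.name):
            raise ValueError(f"missing audit field: {f.name}")
    return {"body": report_body, "audit": audit}
```

Making the record a required argument means there is no call signature that publishes a report without it.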
Structural resistance to bias
The engine cannot receive private communications, suppressed proceedings, or the operator’s own professional opinion as source material. These are not policy preferences — they are hard exclusions written into the operating manual.
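In code terms, a hard exclusion is a check with no override parameter. The sketch below assumes sources are tagged with a kind at intake; `SourceKind`, `EXCLUDED`, and `admit` are hypothetical names, and the three excluded kinds are the ones listed above.

```python
from enum import Enum


class SourceKind(Enum):
    PUBLIC_DOCUMENT = "public_document"
    PRIVATE_COMMUNICATION = "private_communication"
    SUPPRESSED_PROCEEDING = "suppressed_proceeding"
    OPERATOR_OPINION = "operator_opinion"


# The three exclusions named in the text. There is intentionally no flag
# or argument that lets a caller bypass this set.
EXCLUDED = frozenset({
    SourceKind.PRIVATE_COMMUNICATION,
    SourceKind.SUPPRESSED_PROCEEDING,
    SourceKind.OPERATOR_OPINION,
})


def admit(kind: SourceKind) -> bool:
    """Return True only for source kinds that are not hard-excluded."""
    return kind not in EXCLUDED
```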
Self-stopping on risk
Seven defined conditions cause the engine to pause and refuse to continue — from unlawfully obtained material to active legal correspondence. The stop condition and the required action are stated clearly to the operator.
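A stop condition that names both the reason and the required action maps naturally onto an exception type. The sketch below is illustrative only: it includes just the two conditions named above (the other five are not listed in this document), and the required-action wording, `EngineStopped`, and `check_stop` are assumptions.

```python
from enum import Enum
from typing import Set


class StopCondition(Enum):
    # Only these two of the seven conditions are named in the text.
    UNLAWFULLY_OBTAINED_MATERIAL = "unlawfully obtained material"
    ACTIVE_LEGAL_CORRESPONDENCE = "active legal correspondence"


# Hypothetical required actions, for illustration only.
REQUIRED_ACTION = {
    StopCondition.UNLAWFULLY_OBTAINED_MATERIAL: "remove the material and restart intake",
    StopCondition.ACTIVE_LEGAL_CORRESPONDENCE: "pause the matter and seek legal advice",
}


class EngineStopped(Exception):
    """Raised when a stop condition is detected; carries the reason and the action."""

    def __init__(self, condition: StopCondition, required_action: str):
        super().__init__(f"Stopped: {condition.value}. Required action: {required_action}")
        self.condition = condition
        self.required_action = required_action


def check_stop(flags: Set[StopCondition]) -> None:
    """Raise EngineStopped for the first detected condition; return None otherwise."""
    for condition in StopCondition:
        if condition in flags:
            raise EngineStopped(condition, REQUIRED_ACTION[condition])
```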
Capability was never the hard part.
Building an AI system that can produce high-quality analytical writing is not difficult in 2026. Building one that cannot be manipulated into producing something harmful, legally exposed, or analytically dishonest is a very different engineering problem.
The most important design decisions in Observed are not technical. They are structural. The forced stops. The language framework. The source exclusions. The equal-weight positive signal requirement. These are the mechanisms that make the output trustworthy — not the quality of the prose.
A system like this is only as reliable as the constraints written into it. And the hardest constraints to write are the ones that protect the platform from the person running it.
What made this work
- Designing for accountability, not just capability
- Treating the operating manual as both a functional specification and a legal instrument
- Building explicit failure states — the system knows when to stop
- Separating benchmark knowledge from the AI’s general training to prevent drift
- Designing positive signal detection with the same rigour as risk signal detection
- Embedding the methodology disclosure as a non-negotiable output component
- Using version control on the benchmark library itself
- Building right of response into the pipeline — not as an afterthought, but as a gate
Have a question about custom AI system design?
Common questions about building methodology-governed, accountability-grade AI systems on Claude.
What does it mean to build a custom analytical engine on Claude?
Rather than using Claude as a general-purpose assistant, Observed uses it as a tightly constrained specialist system. The operating manual — a detailed system prompt — defines the engine’s identity, its rules, its analytical sequence, its language framework, and its stop conditions. Claude executes the methodology; the methodology governs what Claude can and cannot do.
Why build this on Claude specifically?
The system requires long, structured analytical outputs with consistent language behaviour across a multi-pass pipeline. It also requires a model that can follow complex, layered instructions without drifting from them mid-analysis. Claude’s instruction-following capability and extended context handling make it well suited to this kind of methodology-governed work.
How do you stop the AI from making things up?
Two mechanisms work together. First, all benchmark claims must be drawn from the domain library documents — not from the model’s general training data. Second, every source used in an analysis must be specific, named, dated, and publicly accessible. The human review checklist includes an explicit check for both of these. If a claim can’t be traced to a source document, it doesn’t get published.
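The traceability rule can be expressed as a filter over claims. This is a minimal sketch assuming each claim carries a reference to its source; `Source`, `Claim`, and `unpublishable` are hypothetical names, and the criteria (named, dated, publicly accessible) are the ones stated above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Source:
    name: str
    published: Optional[date]
    publicly_accessible: bool


@dataclass(frozen=True)
class Claim:
    text: str
    source: Optional[Source]


def unpublishable(claims: List[Claim]) -> List[Claim]:
    """Return the claims that lack a named, dated, publicly accessible source."""
    return [
        c for c in claims
        if c.source is None
        or not c.source.name
        or c.source.published is None
        or not c.source.publicly_accessible
    ]
```

A report would be held back whenever `unpublishable` returns anything, which gives the human reviewer a concrete list to work through.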
What stops an operator from misusing the system?
The source exclusions are structural — private communications, suppressed proceedings, and the operator’s own opinion simply cannot be submitted as source material. The human review checklist requires the operator to actively confirm that no steps were skipped, no sources were double-counted, and that positive signals were given equal weight. The system also stops itself and names the reason when it detects conditions that suggest misuse.
Can this kind of system be built for other analytical or research applications?
Yes. The design pattern — a validated knowledge base, a structured analytical pipeline, enforced human checkpoints, and a non-negotiable language framework — applies to any context where you need AI to produce defensible, methodology-governed outputs. Risk assessment, compliance analysis, due diligence, regulatory research, and investigative journalism all have similar requirements.
How do you handle updates to the benchmark framework?
The benchmark library is version-controlled. Every published report records the library version used. When the framework is updated, prior analyses are not retroactively affected — they were conducted against the version in force at the time. This creates a clean audit trail and prevents retrospective reinterpretation of published findings.
How long does it take to build something like this?
The operating manual for Observed — which is the core of the system — took multiple iterations to get right. The benchmark library development, the language framework, the confidence rating definitions, and the human review checklist each required careful drafting and testing. This is not a prompt engineering exercise. It is a system design exercise that uses prompt engineering as one of its tools.
Start with a free Decision Clarity Session.
A Decision Clarity Session is a no-obligation conversation where we listen to where you are, what you are trying to achieve, and what is getting in the way. If you are thinking about building a governed AI system — for accountability, research, compliance, or analysis — you will leave with a clearer view of what that actually takes.