Observed — Analytical Engine Case Study | Changeable

Observed Analytical Engine

This case study shows how Changeable built Observed, a methodology-governed AI research platform for public-interest organisational accountability, defensible evidence review and human-approved publication.

Service: Custom AI System Design & Claude Development

Client: Observed, Public Interest Research Platform, NZ

Date: May 2026

How the system works

Validated benchmark framework: Eight domains grounded in peer-reviewed literature anchor every analysis

Nine-pass analytical sequence: Structured pipeline with enforced human checkpoints

Signal classification engine: Sources weighted, mapped, and compared against the benchmark

Human review gate: No output leaves the system without operator sign-off

Project overview

Accountability infrastructure. Built to withstand scrutiny.

Observed is a public-interest research platform designed to produce rigorous, defensible analysis of organisational workplace conditions. The system compares publicly available signals against a validated eight-domain benchmark framework informed by peer-reviewed literature, regulatory standards and relevant WorkSafe New Zealand guidance.

The challenge was not building an AI that could produce analysis. It was designing a system where material claims could be traced to specific sources, tested against the methodology and reviewed before publication.

Eight validated benchmark domains

Each domain is grounded in peer-reviewed literature and regulatory frameworks. No general AI knowledge is used for benchmark claims.

Nine analytical passes per matter

Every analysis runs the same sequence. Source collection, privacy scan, legal scan, classification, comparison — in order, every time.

Zero findings of wrongdoing

The system produces benchmark comparisons, not verdicts. Every output is legally conscious by design.

Equal weight to positive signals

The engine is explicitly designed not to default to finding problems. Strong performance is named and evidenced with the same rigour as concerns.

The problem

Accountability research has a reliability problem.

Organisations operating in the public interest — charities, healthcare providers, community services, schools — are subject to growing public and media scrutiny of their internal culture and governance. But the tools available for that scrutiny are blunt: anonymous review platforms with no methodology, media coverage driven by individual complaints, and formal regulatory processes that move slowly and narrowly.

The result is a landscape where serious organisational harm can go undocumented, but so can serious misrepresentation. Both failures are costly — to workers, to organisations, and to public trust.

What didn’t exist was a repeatable, evidence-governed framework that could assess organisational conditions consistently, defend its conclusions, and give equal analytical weight to positive and negative signals.

The constraints that shaped the build

Every finding had to be traceable to a specific, public, dated source
No natural person could be named as a subject of findings, with privacy controls informed by the Office of the Privacy Commissioner
No output could constitute — or imply — a legal determination
The engine had to be incapable of simply finding what an operator wanted it to find
Positive organisational performance had to be surfaced with the same confidence as risk signals
The system had to stop itself at defined points and wait for a human
Right of response had to be structurally embedded, not optional
The methodology had to be fully disclosed in every output

How the analytical engine was structured

The design problem was not capability but constraint. The goal was a governed AI system that was difficult to misuse, supported by defined AI governance, source controls, human checkpoints and publication rules.

Benchmark library development

Eight domain documents were built from primary academic and regulatory sources covering workplace bullying, psychosocial risk, psychological safety, leadership, worker wellbeing, legal frameworks and cross-sector context. This controlled benchmark data model is the permitted source for benchmark claims, preventing the system from substituting unverified general knowledge.

The nine-pass analytical pipeline

Rather than a single prompt-to-output process, Observed runs every analysis through nine discrete passes: source collection and deduplication, privacy scan, legal risk review, signal classification, benchmark comparison, gap analysis, confidence rating, contradiction checking and a human review gate. The fixed sequence is part of the quality control.
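
The fixed sequence can be sketched in a few lines of Python. The pass names are taken from the paragraph above; the `run_pipeline` function and the handler signature are hypothetical and only illustrate one way an ordering could be enforced, not how Observed itself is implemented.

```python
from typing import Callable, Dict, List

# Pass names come from the case study text; everything else is illustrative.
PASSES: List[str] = [
    "source collection and deduplication",
    "privacy scan",
    "legal risk review",
    "signal classification",
    "benchmark comparison",
    "gap analysis",
    "confidence rating",
    "contradiction checking",
    "human review gate",
]

Handler = Callable[[dict], dict]


def run_pipeline(matter: dict, handlers: Dict[str, Handler]) -> dict:
    """Run every pass in the fixed order; refuse to start if any pass lacks a handler."""
    missing = [name for name in PASSES if name not in handlers]
    if missing:
        raise ValueError(f"no handler for passes: {missing}")
    state = dict(matter)
    completed: List[str] = []
    for name in PASSES:
        state = handlers[name](state)
        completed.append(name)
    # Record the order actually executed alongside the final state.
    return {**state, "completed": completed}
```

Because the order lives in one list rather than in the caller, a pass cannot be skipped or reordered without changing the definition itself.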

Source type hierarchy and weighting

A tiered source weighting system applies consistently across every analysis. ERA decisions and WorkSafe enforcement notices sit at the top. Anonymous review platforms sit at the bottom. The same weighting logic applies regardless of whether a signal is positive or negative — the system cannot treat a favourable review platform signal as more credible than an unfavourable one.
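
As a rough illustration of polarity-blind weighting, the sketch below looks up a weight from the source type alone. The tier order follows the text; the numeric weights, the `Signal` class and the source type identifiers are assumptions made for the example.

```python
from dataclasses import dataclass

# Tier order follows the text (ERA decisions and WorkSafe enforcement notices
# at the top, anonymous review platforms at the bottom).
# The numeric values are invented for illustration.
TIER_WEIGHTS = {
    "era_decision": 1.0,
    "worksafe_enforcement_notice": 1.0,
    "anonymous_review_platform": 0.2,
}


@dataclass(frozen=True)
class Signal:
    source_type: str
    polarity: str  # "positive" or "negative"


def weight(signal: Signal) -> float:
    """Weight depends only on the source type, never on the polarity."""
    return TIER_WEIGHTS[signal.source_type]
```

Since `weight` never reads `polarity`, a favourable and an unfavourable signal from the same tier always receive the same weight.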

Compliant language architecture

The operating manual contains a non-negotiable language framework. The engine compares signals against benchmarks — it never makes findings. Every statement is structurally reviewed against a set of prohibited phrasings and required reformulations before output. This is not stylistic guidance; it is the mechanism that keeps the platform on the right side of defamation law.
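
One way such a review could be automated is a pattern scan over the draft before it reaches the operator. The patterns below are hypothetical placeholders; the actual prohibited phrasings and their required reformulations sit in the operating manual and are not reproduced in this case study.

```python
import re
from typing import List

# Hypothetical examples only; the real list is defined in the operating manual.
PROHIBITED = [
    r"\bis guilty of\b",
    r"\bhas breached\b",
    r"\bfindings? of wrongdoing\b",
]


def prohibited_phrases(text: str) -> List[str]:
    """Return every prohibited pattern that matches the draft text."""
    return [p for p in PROHIBITED if re.search(p, text, flags=re.IGNORECASE)]
```

A draft that returns an empty list passes this check; any match would send the sentence back for reformulation in benchmark-comparison language.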

Forced human checkpoints

Three passes require a complete stop and explicit operator confirmation before the system proceeds. The AI cannot work around these gates — they are structural, not advisory. At Pass 9, the AI presents a full draft and a 16-item human review checklist. Nothing is approved for publication until the operator has worked through every item and typed one of three specific responses.
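
The Pass 9 gate might be modelled as below. The 16-item checklist size comes from the text; the three response strings are placeholders, because the case study does not name the responses the operator must type.

```python
from typing import Dict

CHECKLIST_SIZE = 16  # from the text

# Placeholder values: the three accepted responses are not named in the case study.
ALLOWED_RESPONSES = {"APPROVE", "REVISE", "REJECT"}


def review_gate(checklist: Dict[int, bool], response: str) -> bool:
    """Return True only if all 16 items are confirmed and the operator typed APPROVE."""
    if response not in ALLOWED_RESPONSES:
        raise ValueError(f"response must be one of {sorted(ALLOWED_RESPONSES)}")
    expected_items = set(range(1, CHECKLIST_SIZE + 1))
    all_confirmed = set(checklist) == expected_items and all(checklist.values())
    return all_confirmed and response == "APPROVE"
```

Free-text replies such as "looks fine" raise an error rather than being interpreted, which is what makes the gate structural rather than advisory in this sketch.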

Equal-weight positive signal framework

A deliberate design decision addressed the default failure mode of accountability research: finding what you expect to find. The V1.1 update embedded explicit positive signal language, outcome evidence for strong performance, and active checks that flag when the engine has undersold well-evidenced positive findings. The system is designed to say clearly when an organisation is doing well.

What the system delivers

A research platform that can be published, defended, and trusted — by operators, by organisations being analysed, and by the public reading the outputs.

Legally defensible outputs

Every output is structured to avoid findings of wrongdoing, named individual subjects, and implied legal conclusions. The methodology disclosure block is mandatory on every report.

Consistent analytical standard

The nine-pass pipeline applies identically regardless of the organisation being analysed, its size, sector, or political context. The same benchmark. The same sequence. The same confidence rating definitions.

Full output pack per analysis

Each approved matter produces a full analysis report, blog article, three NotebookLM source packs for audio and video, an SEO pack, a featured image brief, and a right of response letter.

Operator accountability built in

The system records the benchmark library version, the human review date, the reviewer identity, and the right of response outcome on every published report. There is no way to publish without this audit trail.
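
A minimal sketch of a publish step that cannot run without the audit trail, assuming a simple record type. The four fields mirror the ones listed above; the `AuditRecord` class and `publish` function are hypothetical names.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class AuditRecord:
    benchmark_library_version: str
    human_review_date: date
    reviewer_identity: str
    right_of_response_outcome: str


def publish(report: str, audit: AuditRecord) -> dict:
    """Attach the audit record to the report; refuse if any audit field is empty."""
    empty = [f.name for f in fields(audit) if not getattr(audit, f.name)]
    if empty:
        raise ValueError(f"cannot publish, missing audit fields: {empty}")
    return {"report": report, "audit": audit}
```

Making the audit record a required argument means there is no code path in this sketch that produces a published report without one.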

Structural resistance to bias

The engine cannot receive private communications, suppressed proceedings, or the operator’s own professional opinion as source material. These are not policy preferences — they are hard exclusions written into the operating manual.
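
In code, a hard exclusion of this kind could look like an intake check that rejects the excluded categories before they reach any analytical pass. The three categories come from the text; the identifiers, the `admit_source` function and the exception name are assumptions for illustration.

```python
# The three excluded categories are named in the text; the identifiers are invented.
EXCLUDED_KINDS = frozenset(
    {"private_communication", "suppressed_proceeding", "operator_opinion"}
)


class ExcludedSourceError(ValueError):
    """Raised when a submitted source falls into a hard-excluded category."""


def admit_source(kind: str, reference: str) -> dict:
    """Admit a source for analysis unless its kind is hard-excluded."""
    if kind in EXCLUDED_KINDS:
        raise ExcludedSourceError(f"'{kind}' cannot be used as source material")
    return {"kind": kind, "reference": reference}
```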

Self-stopping on risk

Seven defined conditions cause the engine to pause and refuse to continue — from unlawfully obtained material to active legal correspondence. The stop condition and the required action are stated clearly to the operator.
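
A stop condition that names both its reason and the required action could be sketched as an exception. Only two of the seven conditions are named in this case study, so the enum below lists just those two; the required-action wording and all identifiers are hypothetical.

```python
from enum import Enum
from typing import Dict, Set


class StopCondition(Enum):
    # Only these two of the seven conditions are named in the case study.
    UNLAWFULLY_OBTAINED_MATERIAL = "unlawfully obtained material"
    ACTIVE_LEGAL_CORRESPONDENCE = "active legal correspondence"


class EngineStopped(Exception):
    """Carries the stop condition and the action the operator must take."""

    def __init__(self, condition: StopCondition, required_action: str) -> None:
        super().__init__(
            f"Stopped: {condition.value}. Required action: {required_action}"
        )
        self.condition = condition
        self.required_action = required_action


def check_stop(
    detected: Set[StopCondition], required_actions: Dict[StopCondition, str]
) -> None:
    """Raise EngineStopped for the first detected condition; return None otherwise."""
    for condition in StopCondition:
        if condition in detected:
            raise EngineStopped(condition, required_actions[condition])
```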

Design lessons

Capability was never the hard part.

Building an AI system that can produce high-quality analytical writing is not difficult in 2026. Building one that cannot be manipulated into producing something harmful, legally exposed, or analytically dishonest is a very different engineering problem.

The most important design decisions in Observed are not technical. They are structural. The forced stops. The language framework. The source exclusions. The equal-weight positive signal requirement. These are the mechanisms that make the output trustworthy — not the quality of the prose.

A system like this is only as reliable as the constraints written into it. And the hardest constraints to write are the ones that protect the platform from the person running it.

What made this work

Designing for accountability, not just capability
Treating the operating manual as both a functional specification and a legal instrument
Building explicit failure states — the system knows when to stop
Separating benchmark knowledge from the AI’s general training to prevent drift
Designing positive signal detection with the same rigour as risk signal detection
Embedding the methodology disclosure as a non-negotiable output component
Using version control on the benchmark library itself
Building right of response into the pipeline — not as an afterthought, but as a gate

Questions

Have a question about custom AI system design?

Common questions about building methodology-governed, accountability-grade AI systems on Claude.

What does it mean to build a custom analytical engine on Claude?

Rather than using Claude as a general-purpose assistant, Observed uses it as a tightly constrained specialist system. The operating manual — a detailed system prompt — defines the engine’s identity, its rules, its analytical sequence, its language framework, and its stop conditions. Claude executes the methodology; the methodology governs what Claude can and cannot do.

Why build this on Claude specifically?

The system requires long, structured analytical outputs with consistent language behaviour across a multi-pass pipeline. It also requires a model that can follow complex, layered instructions without drifting from them mid-analysis. Claude’s instruction-following capability and extended context handling make it well-suited to this kind of methodology-governed work.

How do you stop the AI from making things up?

Two mechanisms work together. First, all benchmark claims must be drawn from the domain library documents — not from the model’s general training data. Second, every source used in an analysis must be specific, named, dated, and publicly accessible. The human review checklist includes an explicit check for both of these. If a claim can’t be traced to a source document, it doesn’t get published.
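
The traceability rule can be illustrated with a small check that flags any claim lacking a named, dated, publicly accessible source. The `Source` and `Claim` types and the `untraceable_claims` function are assumptions made for this sketch, not part of the Observed system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Source:
    name: str
    published: Optional[date]
    publicly_accessible: bool


@dataclass(frozen=True)
class Claim:
    text: str
    source_id: Optional[str]


def untraceable_claims(claims: List[Claim], sources: Dict[str, Source]) -> List[Claim]:
    """Return the claims that cannot be tied to a named, dated, public source."""
    flagged = []
    for claim in claims:
        source = sources.get(claim.source_id) if claim.source_id else None
        traceable = (
            source is not None
            and bool(source.name)
            and source.published is not None
            and source.publicly_accessible
        )
        if not traceable:
            flagged.append(claim)
    return flagged
```

Anything this function returns would be removed or re-sourced before the draft goes to human review.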

What stops an operator from misusing the system?

The source exclusions are structural — private communications, suppressed proceedings, and the operator’s own opinion simply cannot be submitted as source material. The human review checklist requires the operator to actively confirm that no steps were skipped, no sources were double-counted, and that positive signals were given equal weight. The system also stops itself and names the reason when it detects conditions that suggest misuse.

Can this kind of system be built for other analytical or research applications?

Yes. The design pattern — a validated knowledge base, a structured analytical pipeline, enforced human checkpoints, and a non-negotiable language framework — applies to any context where you need AI to produce defensible, methodology-governed outputs. Risk assessment, compliance analysis, due diligence, regulatory research, and investigative journalism all have similar requirements.

How do you handle updates to the benchmark framework?

The benchmark library is version-controlled. Every published report records the library version used. When the framework is updated, prior analyses are not retroactively affected — they were conducted against the version in force at the time. This creates a clean audit trail and prevents retrospective reinterpretation of published findings.
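
Version pinning of this kind can be sketched with an immutable report record. The class names and version strings below are illustrative assumptions; the point is only that a report keeps the library version it was produced against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedReport:
    matter_id: str
    library_version: str  # fixed at publication time


class BenchmarkLibrary:
    def __init__(self, version: str) -> None:
        self.version = version

    def publish(self, matter_id: str) -> PublishedReport:
        """Stamp the report with the library version in force right now."""
        return PublishedReport(matter_id, self.version)

    def update(self, new_version: str) -> None:
        """Move the library forward; existing reports are untouched."""
        self.version = new_version
```

Because `PublishedReport` is frozen, attempting to change `library_version` after publication raises an error instead of silently rewriting the record.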

How long does it take to build something like this?

The operating manual for Observed — which is the core of the system — took multiple iterations to get right. The benchmark library development, the language framework, the confidence rating definitions, and the human review checklist each required careful drafting and testing. This is not only a prompt engineering exercise. It is a system design exercise that uses prompt engineering as one component alongside methodology design, governance, testing and human review.

Start with a free Decision Clarity Session.

A Decision Clarity Session is a no-obligation conversation where we listen to where you are, what you are trying to achieve, and what is getting in the way. If you are thinking about building a governed AI system — for accountability, research, compliance, or analysis — you will leave with a clearer view of what that actually takes.
