Trustworthy AI for high-stakes work — starting with evidence

Trustworthy AI for high-stakes evidence review.

Evidence Synthesis AI screens studies the way a careful expert would — deciding confidently where the call is clear, flagging the genuine judgment calls for a human, and logging its reasoning for every decision. Built for the teams who can’t afford a confident mistake: systematic reviewers, and the drug-safety groups monitoring literature for adverse events.

~80

systematic reviews published every day

>1 yr

average review, from registration to publication

>8×

swing in a model's confident-error rate between reviews

01 · The problem

Screening is the bottleneck. It’s also the trap.

Title-and-abstract screening is the single largest time sink in evidence work. The obvious fix is to automate it — but a naive AI screener makes high-stakes review worse, not faster. An overconfident one corrupts every downstream step. An over-cautious one wipes out the time savings that justified using AI at all. The question was never whether AI can screen. It’s whether you can trust how it decides — and prove it afterwards.

02 · How it works

Decisive where it’s sure. Careful where it isn’t.

1

Handles the confident calls on its own

Clear includes and excludes decided automatically where the system is well-calibrated.

2

Surfaces only the genuine judgment calls

Ambiguous studies are routed to a human, so reviewer attention goes where it actually matters.

3

Shows the reasoning behind every decision

Each include, exclude, or deferral comes with the rationale the system used to get there.

4

Override anything, audit everything

Reviewers can change any decision, and the full log is the trail regulators expect.
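
The four steps above can be sketched as a simple triage loop. This is a minimal illustration, not the product's implementation: the threshold values, the `Decision` record, and the `triage` and `override` helpers are all hypothetical names, and it assumes the screener exposes a calibrated inclusion probability for each study.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds, chosen only for illustration.
INCLUDE_THRESHOLD = 0.90
EXCLUDE_THRESHOLD = 0.10


@dataclass(frozen=True)
class Decision:
    study_id: str
    label: str                      # "include", "exclude", or "defer"
    rationale: str                  # reasoning logged with every decision
    overridden_by: Optional[str] = None


def triage(study_id: str, p_include: float, rationale: str) -> Decision:
    """Decide the clear calls automatically; defer the ambiguous ones."""
    if p_include >= INCLUDE_THRESHOLD:
        label = "include"
    elif p_include <= EXCLUDE_THRESHOLD:
        label = "exclude"
    else:
        label = "defer"             # routed to a human reviewer
    return Decision(study_id, label, rationale)


def override(decision: Decision, new_label: str, reviewer: str) -> Decision:
    """Return a new record so the original decision stays in the audit log."""
    return Decision(decision.study_id, new_label, decision.rationale, reviewer)


# Append-only log: automated decisions and human overrides both stay visible.
audit_log = [
    triage("study-001", 0.97, "RCT in the target population"),
    triage("study-002", 0.03, "Animal study, out of scope"),
    triage("study-003", 0.55, "Population unclear from abstract"),
]
audit_log.append(override(audit_log[2], "include", "reviewer_a"))
```

Keeping overrides as new records rather than edits in place is one way to make the log double as an audit trail; a real system would also need timestamps and persistent storage.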

See the full product

03 · Why you can trust it

Built on deference-aware evaluation.

Most AI metrics reward confident answers and penalise hesitation — the wrong incentive when overconfidence carries real cost. Deference-aware evaluation measures whether a system recognises the limits of its own competence and steps back when it should. It credits considered deferral as correct, separates it from genuine confident error, and surfaces a class of failures that more data and bigger models won’t fix.
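
A rough sketch of the distinction, under simplifying assumptions: the function name and the metric definitions below are illustrative, not the ones defined in the white paper, and every deferral is treated as a considered one. It compares a conventional accuracy score, which counts a deferral as a miss, with a score that credits deferral and reports confident errors separately.

```python
from collections import Counter


def deference_aware_summary(records):
    """Summarise screening outcomes.

    records: iterable of (decision, truth) pairs, where decision is
    "include", "exclude", or "defer" and truth is "include" or "exclude".
    All metric definitions here are illustrative assumptions.
    """
    counts = Counter()
    for decision, truth in records:
        if decision == "defer":
            counts["deferred"] += 1
        elif decision == truth:
            counts["confident_correct"] += 1
        else:
            counts["confident_error"] += 1

    total = sum(counts.values())
    if total == 0:
        raise ValueError("no records to summarise")

    return {
        # Conventional view: anything short of a correct answer is a miss.
        "naive_accuracy": counts["confident_correct"] / total,
        # Deference-aware view: stepping back is not counted as an error.
        "deference_aware_accuracy":
            (counts["confident_correct"] + counts["deferred"]) / total,
        # The failure that matters most: wrong and sure of it.
        "confident_error_rate": counts["confident_error"] / total,
        "deferral_rate": counts["deferred"] / total,
    }


# Made-up outcomes for illustration only.
example = [
    ("include", "include"),
    ("exclude", "exclude"),
    ("defer", "include"),
    ("exclude", "include"),
]
summary = deference_aware_summary(example)
```

On these four made-up outcomes the naive score is 0.5, while the deference-aware view reports 0.75 accuracy, a 0.25 confident-error rate, and a 0.25 deferral rate. Tracking the deferral rate alongside the others shows when a system is avoiding errors only by deferring everything.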

Validation

Validated across 6 frontier models and 5 medical domains — 2,729 studies, 16,374 screening decisions.

6

frontier models

5

medical domains

2,729

studies

16,374

screening decisions

04 · Who it’s for

The teams who can’t afford a confident mistake.

Systematic review teams

Title-and-abstract screening without the year-long grind, with an audit trail that holds up to peer scrutiny.

Pharmacovigilance

Continuous literature monitoring for adverse events, with the documented decision trail regulators expect.

Research consultancies

Evidence work at speed, with rigour you can defend to a client or regulator.

05 · Research

White Paper · 2026 · Hopperlace Research · DOI: 10.17605/OSF.IO/A69YH

Poster · Workshop on Technical AI Governance Research (TAIGR), ICML 2026

Deference-Aware Evaluation for Human-in-the-Loop AI Systems

A framework for evaluating AI systems on their capacity to recognise the limits of their own competence and defer when appropriate, alongside standard accuracy. The paper identifies two failure modes that conventional metrics conflate — penalised conservatism and genuine confident errors — and introduces deference-aware metrics that distinguish them. A cross-domain audit of six frontier models across five medical domains (2,729 studies, 16,374 screening decisions) shows that no single model is uniformly safe, and isolates a structural class of failures that calibration, ensembling, and model scaling cannot fix.

Read on OSF

“A model’s confident-error rate can swing more than eightfold from one review to the next — which is why screening needs an evaluation layer that knows when its own judgments can be trusted.”

06 · Team

Who we are

MW

Martin Walker, MPH

Co-founder, Evidence Synthesis

Background in evidence-based health and evidence synthesis for systematic reviews; brings the domain experience that keeps the system honest about clinical reality.

YS

Yuyu Shen

Founder

A decade building production AI across Meta, Walmart, Beamery, and Cleo; founded Hopperlace to close a gap that kept reappearing — AI deployed in high-stakes work without the means to know when its outputs can be trusted.

07 · The bigger picture

Behind the product is one conviction: you should be able to see who and what stands behind an AI system before you rely on it — how it behaves, and who makes and backs it.

Independent · public-interest

That principle is why we build evidence tools that prove their own trustworthiness. It’s also why we’re building Value Compass — an independent project that brings together what’s known about who makes and funds AI tools, and the values they operate by, so people can weigh that alongside whether a tool does the job.

Contact

Get in touch

Running a systematic review or pharmacovigilance team? We’re onboarding early pilots.

hello@hopperlace.ai