Validity & Parity Studies

We test ourselves in public.

Most synthetic-persona companies will tell you their personas are realistic. We’d rather show you our numbers — including the ones that don’t flatter us. Every parity study we’ve run is on this page, with its full methodology and replication code.

Studies shipped: 1
Personas tested: 100
Ground-truth cells: 2,000
Published openly: 100%

Completed studies

100 Synthicant Personas vs 100 Real Humans

Status: Published
License: CC BY 4.0
Participants: 100 sampled from 2,058
Questions: 5 classic heuristics-and-biases questions, 20 numeric responses each
Headline numbers
Likert task accuracy (within 1 point on a 7-point scale): 70.2%
GPT-4.1-mini on the same dataset (published): 71.7%
Gemini-2.5 Flash on the same dataset (published): 69.4%
Variance recovery (Synthicant SD ÷ human SD): 32.9%
Marble MC (denominator neglect): 62% vs 50% chance
Vaccine MC (omission/framing): 26% vs 25% chance
What it tells us

Synthicant matches frontier-model accuracy on rating tasks but compresses individual variance to roughly one-third of the real human population's. This is a fundamental LLM-as-twin limitation, not a Synthicant-specific bug — GPT and Gemini show the same pattern.
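
For concreteness, here is a minimal sketch of how the two rating-task metrics can be computed from per-participant paired ratings. The function names, data layout, and toy numbers are illustrative; the actual computation lives in the replication script that ships with the study.

```python
# Minimal sketch of the two rating-task metrics, assuming paired 1-7 ratings
# where human[i] and persona[i] refer to the same participant and question.
import numpy as np

def likert_accuracy_within_1(human: np.ndarray, persona: np.ndarray) -> float:
    """Fraction of persona ratings within +/-1 point of the matched human rating."""
    return float(np.mean(np.abs(human - persona) <= 1))

def variance_recovery(human: np.ndarray, persona: np.ndarray) -> float:
    """Persona SD divided by human SD; 1.0 would mean variance is fully recovered."""
    return float(np.std(persona, ddof=1) / np.std(human, ddof=1))

# Toy data with matching means but collapsed persona variance.
human   = np.array([1, 2, 3, 4, 5, 6, 7, 2, 6, 4])
persona = np.array([3, 4, 4, 4, 4, 5, 5, 3, 4, 4])

print(likert_accuracy_within_1(human, persona))  # 0.6
print(variance_recovery(human, persona))         # ~0.33: SD compressed to a third
```

The toy arrays are built so that accuracy looks respectable while the persona SD is roughly a third of the human SD, the same pattern the study reports.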

Coming next

Longitudinal test-retest comparison

Twin-2K-500 includes a wave-4 test-retest in which participants re-answered earlier questions, with a published human ceiling of 81.7%. A follow-up parity study using that wave-4 data, or a similar published longitudinal dataset, would let us measure Synthicant's between-session consistency directly against the human test-retest ceiling. If you know of a candidate dataset, send it our way.
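
As a rough sketch of how that comparison could be scored, assuming the same within-1 agreement definition used for the Likert task (the actual wave-4 metric in Twin-2K-500 may be defined differently):

```python
# Hypothetical scoring for a future test-retest parity study: run the same
# personas on the same items in two sessions, measure agreement, and compare
# against the published human ceiling (81.7%).
import numpy as np

def test_retest_consistency(first_session: np.ndarray, second_session: np.ndarray) -> float:
    """Fraction of repeated items answered within +/-1 point of the original answer."""
    return float(np.mean(np.abs(first_session - second_session) <= 1))

wave_a = np.array([4, 6, 2, 7, 5, 3])  # toy first-session answers
wave_b = np.array([4, 5, 2, 6, 5, 4])  # toy second-session answers
print(test_retest_consistency(wave_a, wave_b))  # 1.0 for this toy input
```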

Two-step rating prompt validation

The variance-collapse finding from the Twin-2K-500 study is driving a roadmap change: a two-step rating prompt that asks the persona to articulate its general rating tendency in words before producing a number. The next parity study will run with the new prompt to test whether it recovers the missing variance.
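
A minimal sketch of what that two-step prompt could look like. `call_model` is a stand-in for whatever completion call the persona pipeline uses, and the wording is illustrative rather than the shipped prompt:

```python
# Sketch of a two-step rating prompt: describe the rating tendency in words
# first, then ask for the number conditioned on that self-description.
def two_step_rating(call_model, persona_context: str, question: str) -> tuple[str, str]:
    # Step 1: the persona articulates its general rating disposition before it
    # ever commits to a numeric scale, rather than defaulting to the midpoint.
    tendency = call_model(
        f"{persona_context}\n"
        f"Question: {question}\n"
        "In one or two sentences, describe how you generally tend to rate "
        "statements like this (for example, cautious and middling, or strongly opinionated)."
    )
    # Step 2: only now ask for the 1-7 rating, with the stated tendency in context.
    rating = call_model(
        f"{persona_context}\n"
        f"Question: {question}\n"
        f"Your stated rating tendency: {tendency}\n"
        "Now answer with a single integer from 1 to 7."
    )
    return tendency, rating
```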

How we run a parity study

Public datasets only

Every parity study runs against a published, peer-reviewed or peer-released human dataset. No private benchmarks. No customer data.

Per-participant ground truth

We compare Synthicant's responses to specific human responses, not just to group means. Means alone hide variance collapse and other failure modes.
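
A toy illustration of the point: the persona answers below are just a shuffle of the human answers, so the group mean and SD match exactly even though almost no individual is matched.

```python
# Group statistics can match perfectly while per-participant agreement is near
# chance, which is why we never score against group means alone.
import numpy as np

rng = np.random.default_rng(0)
human = rng.integers(1, 8, size=200)    # 200 ratings on a 1-7 scale
persona = rng.permutation(human)        # same distribution, shuffled across participants

print(human.mean(), persona.mean())                  # identical means
print(human.std(ddof=1), persona.std(ddof=1))        # identical SDs
print(float(np.mean(np.abs(human - persona) <= 1)))  # within-1 agreement near chance
```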

Full attribution

Every dataset is credited to its authors with full citations and license terms. We do not redistribute restricted materials.

Replication code

Every study ships with the runner script. Same seed, same questions, same persona-builder pipeline. Anyone can re-run it.

Failures included

If Synthicant performs poorly on a benchmark, that becomes a section of the report and a roadmap item. We do not cherry-pick favorable studies.

Replicate, suggest, or try Synthicant

Want to re-run any study above? We’ll send the replication bundle. Know a dataset we should test against next? Tell us — we will credit you in the resulting post, even if Synthicant doesn’t come out looking great.