100 Synthicant Personas vs 100 Real Humans: What Matched, What Didn't
William Jones · 14 min read

parity study · validity · behavioral economics · twin-2k-500 · methodology

Most synthetic-persona companies will tell you their personas are "realistic." Almost none of them will run an open parity test against published human data and tell you what didn't work.

We will. This is the first entry in a standing series of public parity studies: comparisons of Synthicant personas against real, published human survey responses, with full attribution and the code shared openly.

For this one, Columbia Business School and Prolific gave us an unusually clean benchmark: the Twin-2K-500 dataset. In it, 2,058 US adults answered 256 survey questions across personality, cognition, economic preferences, and behavioral economics. Every participant's individual responses are published, alongside pre-computed digital-twin simulations from GPT-4.1-mini and Gemini-2.5 Flash, and the whole dataset is released under CC BY 4.0 specifically so anyone can re-run this kind of comparison.

So we did. 100 personas, 5 classic heuristics-and-biases questions, 20 numeric responses each. 2,000 cells of human ground truth to compare against.

This post explains what we found in plain English, publishes the code so anyone can replicate it, and gives full attribution to the Twin-2K-500 team whose dataset made it possible.

Why we publish this when our competitors don't

Every synthetic-persona company on the market claims their personas are "realistic." Almost none of them publish a head-to-head comparison against published human data. None of them publish the code. None of them tell you what their tool can't do.

We do all three. This is the first public parity study in what will be a standing series. Every study will be against a different published dataset, with full attribution, full code, and full results — including the failures. Our position is that if you're going to make product decisions based on synthetic data, you deserve to know exactly where it works, exactly where it fails, and exactly what we did to find that out — in code you can re-run yourself.

That's the deal we're offering. Authenticity over hype. Validity studies over slogans. We'd rather lose a sale to a skeptic who reads our limitations than win one because we hid them.

What Synthicant actually is, in one paragraph

Synthicant is a product-research platform that lets teams build synthetic user personas in under a minute — either by setting Big-5 personality sliders directly or by uploading real customer call transcripts, NPS open-ends, surveys, and support tickets so the persona inherits a real customer's voice. You then interview those personas in real-time chat sessions the way you'd interview a recruited user. Except 50 of them, in parallel, in ten minutes, for the cost of a few API calls. The rest of this post is about how well those personas actually match the humans they're trying to represent.

The setup, without the jargon

Behavioral economists have a small canon of "trick" questions designed to expose how humans reason under uncertainty. We picked five from the Twin-2K-500 catalog that all 2,058 participants had answered:

  1. Two product-rating questions — rate bicycles, alcohol, chemical plants, and pesticides on benefit (1 to 7) and risk (1 to 7). This tests the affect heuristic: do people who think something is beneficial also think it's safe?

  2. The flu vaccine question — would you take a vaccine with a 5% chance of killing you to avoid a flu with a 10% chance of killing you? Participants answer on a four-point scale from "definitely not" to "definitely yes." This tests omission bias: people often prefer inaction even when action is statistically safer.

  3. The marble tray question — would you rather draw from a small tray with 1 black marble in 10, or a large tray with 8 black marbles in 100? The small tray actually offers better odds (10% vs 8%), yet most people pick the large tray because it has more black marbles. This tests denominator neglect.

  4. The policy estimation slider — guess what percentage of Americans support 10 different policies (carbon tax, Medicare for All, etc.). This tests false consensus: people anchor their estimates to their own views.

We sampled 100 of the 2,058 real participants at random, built a Synthicant persona for each one using their published profile (demographics, Big-5 scores, cognitive scores — but no leakage of the answers we were about to test), and asked the personas the same five questions. Then we compared each persona's answers to the real human's actual answers, one cell at a time.

(This is the same dynamic-persona pipeline our customers use to build personas from their own customer call transcripts, NPS open-ends, or support tickets. We just pointed it at a public dataset for this study.)

That's 2,000 head-to-head comparisons.
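To make that pipeline concrete, here is a minimal sketch of the persona-construction step. The field names and prompt wording are hypothetical stand-ins, not our production prompt module (that ships in the replication bundle described below); what it demonstrates is that only profile data goes in, never the answers under test.

```python
# Illustrative sketch of persona construction from a published profile.
# Field names and wording are hypothetical stand-ins for the production
# prompt-construction module included in the replication bundle.

def build_persona_summary(profile: dict) -> str:
    """Render demographics and psychometrics into a system prompt.

    The five heuristics-and-biases answers we test against are
    deliberately never part of this summary.
    """
    big5 = ", ".join(
        f"{trait}: {score}" for trait, score in profile["big5"].items()
    )
    return (
        f"You are a {profile['age']}-year-old from {profile['state']} "
        f"with {profile['education']} education.\n"
        f"Big-5 profile: {big5}.\n"
        f"Cognitive reflection (CRT, 0-3): {profile['crt']}.\n"
        "Answer every survey question in character, as this person would."
    )
```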

How often did Synthicant match the real human?

This is the question most product teams actually want answered. Here it is, broken down by question type:

| Question type | Synthicant match rate | What "match" means here |
|---|---|---|
| Rating something on a 1–7 scale | 70% | Within 1 point of the real person's rating |
| Picking between 2 options (the marble question) | 62% | Same answer (50% would be a coin flip) |
| Picking between 4 options (the vaccine question) | 26% | Same answer (25% would be random) |
| Estimating a percentage from 0 to 100 | 31% | Within 10 points of the real person's estimate |
| Overall, all 20 cells averaged | 48% | Mixed |

The honest summary: somewhere between "as good as a coin flip" and "70% accurate," depending on what you're asking.
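In code, the match rules from the table are just three comparisons. A minimal sketch (the `kind` labels are our shorthand for this post, not the bundle's schema):

```python
def is_match(kind: str, human, synth) -> bool:
    """Match definitions from the table above."""
    if kind == "rating_1_7":    # 1-7 scale: within 1 point counts
        return abs(human - synth) <= 1
    if kind == "choice":        # multiple choice: exact agreement only
        return human == synth
    if kind == "slider_0_100":  # 0-100 estimate: within 10 points counts
        return abs(human - synth) <= 10
    raise ValueError(f"unknown question kind: {kind}")
```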

For context, the Twin-2K-500 paper reports that GPT-4.1-mini got 71.7% on a different (and easier) task subset. Gemini-2.5-Flash got 69.4%. The human test-retest ceiling — real humans answering the same question twice — was 81.7%. Random guessing scores 59%. So even humans only agree with their past selves 82% of the time on questions like these.

Synthicant matches the frontier-model accuracy on the comparable subset

This is the technical claim worth pausing on. On the rating-scale tasks where direct comparison is fair, Synthicant scored 70.2%. Here is how that stacks up against the published baselines from the Twin-2K-500 paper:

  • GPT-4.1-mini (OpenAI): 71.7%
  • Gemini-2.5 Flash (Google): 69.4%
  • Synthicant (Claude Opus 4.6, our pipeline): 70.2%
  • Human test-retest ceiling: 81.7%
  • Random guessing: 59.2%

Synthicant lands within a percentage point of the OpenAI model and ahead of the Google one, on a benchmark designed by an independent research lab specifically for digital-twin validity. A vertical product built by a small team is matching the validity of frontier general-purpose models from companies a thousand times its size — and it does so while giving you a purpose-built persona library, OCEAN-driven personality controls, and a research workflow those generic models don't have.

That's the technical floor we're standing on. The rest of this post is about the ceiling we're still climbing toward.

The finding that actually matters

Match rates are interesting but they bury the more useful insight. Here's the one to remember.

Across all 20 questions, Synthicant captured only 33% of the variation in real human responses.

Translated: when 100 real humans answer "what percentage of Americans support a carbon tax?" their guesses span the full 0-100 range. Some say 15%, some say 80%, with a typical spread of about 23 percentage points. When 100 Synthicant personas — built from those same humans' demographics and personalities — answer the same question, their guesses cluster tightly around the same number. The typical spread is about 3 percentage points.

The personas know they're a Republican from Texas or a college-educated Democrat from Vermont. They use that to pick the right direction. But within each demographic group, the diversity of opinion that exists in real humans is mostly missing from the synthetic responses.

It's not a Synthicant-specific problem either. The published numbers from GPT-4.1-mini and Gemini-2.5 Flash on this same dataset show the same pattern. Variance collapse appears to be a fundamental limitation of LLM-as-twin methodology in 2026, not a bug in any one product. But it is a real, measurable, replicable limitation — and product teams using synthetic personas need to know about it before they trust an individual rating.
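If you want to compute the collapse yourself, here is one plausible reading of the metric: a sketch that treats variance recovery as the ratio of synthetic to human standard deviation per item, which is consistent with the ~3-point versus ~23-point spreads quoted above. The column names are illustrative, not the exact variance_comparison.csv schema.

```python
import pandas as pd

# Sketch: per-item variance recovery, read as the ratio of synthetic to
# human standard deviation. That reading is our assumption; it matches
# the ~3 vs ~23 point spreads quoted above. Column names are illustrative.

def variance_recovery(df: pd.DataFrame) -> pd.Series:
    human_sd = df.groupby("item")["human_response"].std()
    synth_sd = df.groupby("item")["synth_response"].std()
    return synth_sd / human_sd  # 1.0 means the full human spread survives
```

Plugging in the carbon-tax example: 3 / 23 ≈ 13%, squarely inside the 9-25% slider-question range reported below.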

Where Synthicant looked good

A few results were genuinely strong.

The marble question. 62% match against a 50% chance baseline. More importantly, Synthicant correctly used each persona's CRT score (a measure of cognitive reflection that the dataset includes) to predict who would fall for the bias and who would see through it. Personas with high CRT picked the small tray more often, mirroring decades of cognitive psychology research.
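You can check that CRT pattern on a replication run with a few lines of pandas. The column names here are hypothetical, so map them onto whatever per_participant.csv actually calls them:

```python
import pandas as pd

# Hypothetical column names; adjust to the actual per_participant.csv schema.
df = pd.read_csv("results/per_participant.csv")

# Did high-CRT personas resist denominator neglect more often?
high_crt = df["crt_score"] >= 2
small_tray = df["marble_choice_synth"] == "small_tray"
print(small_tray.groupby(high_crt).mean())  # expect a higher rate for True
```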

Direction of the affect heuristic. For all four rated products, both the humans and the Synthicant personas showed the predicted negative correlation between perceived benefit and perceived risk. The personas got the underlying psychological pattern right.

Group means. When you average all 100 Synthicant responses against all 100 real human responses, the means are close on every question. If your research need is "what does this group think on average?", Synthicant's answer matches the human answer.

Where Synthicant looked weak

The vaccine question. 26% match against 25% random chance. Real humans split roughly 8/17/41/34 across the four options (definitely not, probably not, probably yes, definitely yes). Synthicant collapsed onto "probably yes" almost universally. The decision turns on individual risk tolerance and loss aversion, which the persona summary describes only indirectly.

Affect heuristic magnitude. The correlation ran in the right direction for all four products, but on the emotionally loaded items (chemical plants, pesticides) Synthicant showed roughly twice the negative correlation that real humans did. The personas behave like textbook examples of the bias rather than like actual messy humans, who are a noisy mixture of bias and idiosyncrasy.

Slider precision. When asked to estimate public support for a policy, real humans use the full range. Synthicant clusters tightly around the perceived national average. Variance recovery on slider questions ranged from 9% to 25% — meaning Synthicant is missing 75-90% of the real human variation.

What this means for using Synthicant in real product research

Our results and the published Twin-2K-500 baselines, across very different question types, converge on the same picture. Here's the practical translation.

Use Synthicant for "what does this group prefer / notice / want?" questions. Group means and modal preferences are reliable. If five out of ten Synthicant personas spontaneously complain about the same thing, that complaint is real and worth fixing.

Use Synthicant for differentiated qualitative output. OCEAN scores and demographics steer prose responses faithfully. A young entrepreneur from Brooklyn sounds like a young entrepreneur from Brooklyn; a retired schoolteacher from Phoenix sounds like a retired schoolteacher from Phoenix. This is where the personality system is doing real work.

Don't use Synthicant for individual numerical ratings. SUS scores, NPS scores, satisfaction Likerts, and probability estimates from a single persona are not reliable. The average across many personas is fine; any one persona's number is not.

For multiple-choice behavioral questions, accuracy depends on whether the relevant trait is in the persona spec. Cognitive reflection score predicts denominator-neglect resistance — Synthicant gets that right. Individual risk tolerance for medical decisions is harder to capture — Synthicant gets that wrong.

The variance-collapse problem is solvable but not solved yet. The most likely fix is a two-step prompt that asks the persona to describe their general rating tendency in words first ("I almost never give 5s; that's reserved for things I'd actively champion"), then fill in the scale. Single-step ratings collapse; staged ratings should preserve more of the real distribution. We're testing this next.
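Here is a minimal sketch of what that staged prompt could look like with the Anthropic Python SDK. The model name and the prompt wording are placeholders for an experiment we haven't shipped, not production code:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"       # placeholder model name

def staged_rating(persona_summary: str, question: str) -> str:
    # Step 1: have the persona articulate its rating tendency in words,
    # before it ever sees a number line.
    tendency = client.messages.create(
        model=MODEL,
        max_tokens=200,
        system=persona_summary,
        messages=[{
            "role": "user",
            "content": "In one or two sentences, describe how you "
                       "personally tend to use rating scales.",
        }],
    ).content[0].text

    # Step 2: ask for the number, with the self-described tendency
    # sitting in the transcript to anchor the response.
    return client.messages.create(
        model=MODEL,
        max_tokens=10,
        system=persona_summary,
        messages=[
            {"role": "user", "content": "Describe your rating tendency."},
            {"role": "assistant", "content": tendency},
            {"role": "user", "content": question + " Answer with a single number."},
        ],
    ).content[0].text
```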

How to replicate this study yourself

The Twin-2K-500 dataset is published under CC BY 4.0, which means anyone can re-run this comparison with proper attribution. Total cost is about $15 in Anthropic API spend and 15 minutes of runtime on a laptop.

We package the full replication as a self-contained 50 KB bundle. Email hello@synthicant.com with the subject "Replication bundle request" and we'll send you the zip. Inside you'll find a runner script, a snapshot of Synthicant's production prompt-construction module (the only Synthicant code in the bundle — everything else stays private), the report, and a requirements.txt. Then:

```bash
# 1. Unzip and install dependencies
unzip synthicant-parity-bundle.zip
cd replication-bundle
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 2. Set your Anthropic API key
export ANTHROPIC_API_KEY=sk-ant-xxxx

# 3. Run the parity study. Auto-downloads ~50 MB of dataset slices from
#    Hugging Face on first run, samples 100 participants with seed=42,
#    runs all 100 sessions concurrently, joins responses to human ground
#    truth. Runtime ~15 min on a laptop, ~$15 in Anthropic API spend.
python run_parity.py
```

After the run, four result files land in results/; the sketch after this list shows one way to inspect them:

  • raw_sessions.json — every LLM response in full
  • per_participant.csv — 100 rows × ~80 columns with each persona's responses joined to the matching human's ground truth
  • variance_comparison.csv — per-item standard deviation comparison (the variance-collapse table)
  • summary.json — headline accuracy and correlation stats
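A minimal way to sanity-check a finished run. Only the file names are taken from the list above; the exact JSON keys and CSV columns in this sketch are assumptions:

```python
import json
import pandas as pd

# Headline accuracy and correlation stats.
with open("results/summary.json") as f:
    print(json.dumps(json.load(f), indent=2))

# Per-item standard deviation comparison (the variance-collapse table).
variance = pd.read_csv("results/variance_comparison.csv")
print(variance.head())

# Persona responses joined to human ground truth, one row per participant.
per = pd.read_csv("results/per_participant.csv")
print(per.shape)  # expect (100, ~80)
```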

Sources of variation if your numbers differ from ours: the Anthropic Claude model version (we ran on Claude Opus 4.6 in April 2026), any tweaks to the persona-builder prompt, and changes Twin-2K-500 makes to their dataset. Beyond that, the pipeline is fixed — same seed, same questions, same persona summaries — so, model-sampling noise aside, replications should land within a few percentage points of our numbers.

If you replicate the study and get materially different numbers, we want to know. The whole point of publishing parity studies is to make the field's claims falsifiable. Honest disagreement is more useful than vague consensus.

What this study does not prove

A few things worth being explicit about.

We tested five questions out of 256. The other 251 would surface different patterns. We picked the heuristics-and-biases (H&B) questions specifically because their answers are deliberately distributed (split decisions, varied opinions) — these are harder to predict than a random sample of survey items. The published Twin-2K-500 baselines are computed on a different 17-task subset, so the numbers aren't directly comparable.

We ran one model output per persona. The Twin-2K-500 paper shows that adding chain-of-thought reasoning, repeating questions, or running multiple temperatures changes the LLM accuracy meaningfully. We didn't explore those variations. A more thorough parity would.

The persona summary contains psychometric scores (CRT, fluid intelligence, Big-5) that are themselves correlated with the H&B responses. Some of Synthicant's accuracy is coming from these correlations rather than from genuine generative persona reasoning. This is true of every digital-twin study — there's no clean way to separate "the persona predicted X because of demographics" from "the persona predicted X because of personality." We note it as a caveat, not a critique.

Why this is now part of how Synthicant works

This is not a one-off marketing exercise. Public parity studies are now a standing commitment of the company. Every quarter, against a new published dataset, with full attribution, full code, full results — including the failures.

We do this for two reasons. First, because the synthetic-persona industry has a credibility problem and someone has to start fixing it from the inside. Second, because every study surfaces a concrete improvement we can ship. The variance-collapse finding here is already driving a roadmap change: a two-step rating prompt that asks the persona to articulate their general rating tendency in words before producing a number. We expect that to recover meaningful variance on the next round, and we'll publish the next parity study with the new prompt in place to prove it.

If you know of a published dataset with per-participant responses we should run against next, send it over. We will run it, publish the results, and credit you in the post — even (especially) if Synthicant doesn't come out looking great.

Three things we won't do

  • We won't claim individual-level rating accuracy. The variance-collapse data above is exactly why. Trust Synthicant for group-level signal; don't trust any one persona's number.
  • We won't hide pricing behind "contact sales." Plans are listed on the pricing page. $49/month Starter, $149/month Pro, 14-day free trial, cancel anytime. No demo gating, no quote calls.
  • We won't trap your personas in a proprietary format. Every persona can be exported to JSON or CSV from its detail page. Your data is yours.

Each of those is a deliberate choice. Each of them is a thing the synthetic-persona industry routinely does the other way.

Attribution

This study was built on the Twin-2K-500 dataset, published by the Columbia Business School Digital Twin Lab and collaborators. Full attribution:

Toubia, O., Gui, G.M., Peng, T., et al. (2025). Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions. arXiv:2505.17479. Dataset released under Creative Commons Attribution 4.0 International (CC BY 4.0). Hugging Face: LLM-Digital-Twin/Twin-2K-500.

We are grateful to the Twin-2K-500 team for releasing this dataset openly and specifically for the purpose of enabling external replication and comparison. The pre-computed GPT-4.1-mini and Gemini-2.5-Flash digital-twin simulations published alongside the human data made this comparison possible without us needing to run those models ourselves.

References

Toubia, O., Gui, G.M., Peng, T., et al. (2025). Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions. arXiv:2505.17479. — The source dataset and benchmark.

Slovic, P., Finucane, M.L., Peters, E., MacGregor, D.G. (2007). "The affect heuristic." European Journal of Operational Research, 177(3), 1333-1352. — Original framing of the negative benefit-risk correlation that QID288/289 tests.

Reyna, V.F. & Brainerd, C.J. (2008). "Numeracy, ratio bias, and denominator neglect." Learning and Individual Differences, 18(1), 89-107. — Background for the marble tray ratio-bias paradigm in QID196.

Ritov, I. & Baron, J. (1990). "Reluctance to vaccinate: Omission bias and ambiguity." Journal of Behavioral Decision Making, 3(4), 263-277. — Original framing of the vaccine-question paradigm in QID291.

Serapio-García, G., Safdari, M., Crepy, C., et al. (2023). Personality Traits in Large Language Models. arXiv:2307.00184. — First rigorous measurement of Big-5 traits in LLMs; relevant to the variance-collapse finding.

Sorokovikova, A., Yampolsky, S.V., et al. (2024). "LLMs simulate Big Five personality traits." EACL 2024. — Replicated stable but model-specific personality profiles in LLMs across multiple architectures.

I'm William Jones, founder of Synthicant. I built this company because the user-research feedback loop is too slow for how fast modern teams ship, and because the existing synthetic-persona vendors don't show their work. If you ever want to talk about whether Synthicant fits your research workflow, I take that meeting personally — william@williamjones.info.

Synthicant is the synthetic-persona company that publishes its validity studies, its code, and its failure modes — because product teams making real decisions deserve real data, not marketing copy. $49/month Starter with a 14-day free trial. No sales call. No "contact us for pricing." Build your first persona in under a minute and decide if it's worth your money before we ever speak. Track every parity study we ship on the validity page.