Stanford Simulated 1,000 Real People with AI. They Were 85% as Accurate as the Real Thing.
William Jones · 11 min read

research · generative agents · Stanford · synthetic users · validation

In 2023, Stanford researchers built a virtual town with 25 AI agents who formed relationships, threw parties, and ran for office. It was a fascinating proof of concept. But it was still fiction. Made-up characters doing made-up things.

In 2025, the same lead researcher — Joon Sung Park — submitted his doctoral dissertation. It didn't just refine the original concept. It answered the question that matters: can you simulate real people accurately enough to trust the results?

He simulated 1,052 real individuals. Measured the agents against the real people's own survey responses. And found they were 85% as accurate as the humans themselves.

That number should stop you in your tracks. But the how is even more interesting than the what.

What they actually did

The research team recruited 1,052 participants representative of the U.S. population across age, gender, race, region, education, and political ideology. Each participant sat through a two-hour qualitative interview covering everything from life stories to views on social issues, following the protocol from the American Voices Project.

The interviews were conducted by an AI interviewer agent — itself a novel contribution. The AI interviewer had a reflection module that summarized and inferred insights from the ongoing conversation, allowing it to generate better follow-up questions in real time. Think of it as an interviewer that gets smarter as the conversation progresses.
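
To make that concrete, here's a minimal sketch of what such a reflection loop could look like. Everything in it — the `llm` placeholder, the prompts, the function names — is our own illustration, not code from the dissertation.

```python
# Illustrative sketch of a reflecting interviewer loop, not the paper's code.

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def interview(topics: list[str], get_answer) -> list[tuple[str, str]]:
    transcript: list[tuple[str, str]] = []
    insights = ""  # running reflection over the conversation so far
    for topic in topics:
        question = llm(
            f"Insights so far: {insights}\n"
            f"Topic to cover next: {topic}\n"
            "Write the next interview question, including any follow-up "
            "angle the insights suggest."
        )
        answer = get_answer(question)  # the participant responds
        transcript.append((question, answer))
        # Reflection module: summarize and infer from the dialogue so far,
        # so later questions get sharper as the interview progresses.
        insights = llm(
            "Summarize what we have learned about this participant and "
            f"list open threads worth probing:\n{transcript}"
        )
    return transcript
```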

Then they built a generative agent for each person. But not by simply dumping the transcript into a prompt. The architecture was more sophisticated than that.

Expert reflection: the technique that made it work

Here's the part most coverage of this research misses.

Park didn't just inject raw interview transcripts into an LLM. He introduced a technique called "expert reflection" — a structured analysis layer between the raw data and the simulation.

For each participant's transcript, the system generates analysis from four domain expert personas:

Psychologist — personality traits, autonomy needs, emotional patterns, interpersonal dynamics. For one participant, the psychologist noted: "He values his independence and expresses a clear preference for autonomy, particularly highlighted by his enjoyment of traveling for his job and his frustration with his mother's overprotectiveness."

Behavioral Economist — financial goals, risk preferences, decision-making patterns. The same participant: "His aspiration to save for a relaxing vacation and possibly advance to a managerial position indicates a blending of practical financial goals with personal leisure aspirations."

Political Scientist — ideological positions, policy views, political identity. "He identifies as a Republican and espouses strong support for the party's views, particularly around immigration and drug policy. However, he also expresses specific support for traditionally Democratic positions on issues like abortion rights and the legalization of marijuana."

Demographer — occupation, income, education, household structure. "He works as an inventory specialist and earns between $3,000 to $5,000 monthly, contributing to a household income of around $7,000 per month."

Each expert generates up to 20 observations per transcript. When the agent needs to answer a question, the system first classifies which domain expert is most relevant, retrieves that expert's reflections, and appends them to the full transcript before prompting the LLM.
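
As a hedged reconstruction, the pipeline fits together roughly like the sketch below. The dissertation describes these steps; the prompts, function names, and `llm` placeholder are ours.

```python
# Sketch of the expert-reflection pipeline; prompts and names are invented.

EXPERTS = ["psychologist", "behavioral economist",
           "political scientist", "demographer"]

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call."""
    raise NotImplementedError

def build_reflections(transcript: str) -> dict[str, str]:
    """One analysis pass per expert persona, up to 20 observations each."""
    return {
        expert: llm(
            f"You are a {expert}. From the interview below, write up to 20 "
            f"observations about this person.\n\n{transcript}"
        )
        for expert in EXPERTS
    }

def answer_as_agent(transcript: str, reflections: dict[str, str],
                    question: str) -> str:
    # Step 1: classify which expert domain is most relevant to the question.
    expert = llm(
        f"Which of {EXPERTS} is most relevant to this question?\n"
        f"Question: {question}\nAnswer with exactly one role."
    ).strip().lower()
    notes = reflections.get(expert, "")  # fall back to no notes on a miss
    # Step 2: append that expert's reflections to the full transcript.
    return llm(
        f"Interview transcript:\n{transcript}\n\n"
        f"Expert observations ({expert}):\n{notes}\n\n"
        f"Answering as this person, respond to: {question}"
    )
```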

This isn't just "read the transcript and respond." It's "analyze the transcript through multiple professional lenses, extract latent insights the person never explicitly stated, then use those insights to predict behavior."

The prompting strategy itself uses a four-step chain of thought: describe what kind of person would choose each response option, reason why this specific participant might choose each option, synthesize the reasoning, then make the prediction. Structured deliberation, not gut-feel generation.
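
A rough reconstruction of that scaffolding is below. Only the four steps come from the dissertation; the wording is ours.

```python
# Hedged reconstruction of the four-step chain-of-thought prompt template.
COT_TEMPLATE = """\
Survey question: {question}
Options: {options}

Step 1: For each option, describe what kind of person would choose it.
Step 2: Reason about why this specific participant, given the interview
        and expert observations above, might choose each option.
Step 3: Synthesize the reasoning from steps 1 and 2.
Step 4: State the single option this participant would most likely pick.
"""

prompt = COT_TEMPLATE.format(
    question="Should marijuana be made legal?",
    options="1 = Legal, 2 = Not legal",
)
```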

The benchmarks were rigorous

The team tested their agents against four established social science instruments:

The General Social Survey — the gold standard for measuring Americans' attitudes, beliefs, and behaviors. The agents predicted participants' responses with 85% normalized accuracy. That means the AI replicated what the real person would say 85% as well as the person replicated their own answers when retaking the survey two weeks later. (A worked example of this normalization appears below.)

The Big Five Inventory — the 44-item personality assessment measuring openness, conscientiousness, extraversion, agreeableness, and neuroticism. The agents achieved 80% normalized correlation with real participants' OCEAN scores.

Five behavioral economic games — the dictator game, both roles of the trust game, the public goods game, and the prisoner's dilemma. Here the agents scored 66% normalized correlation, performing similarly to simpler approaches. Economic behavior is harder to simulate than attitudes.

Five social science experiments — including studies on how perceived intent affects blame and how fairness influences emotions. The agents agreed with real participants on the replication results of all five studies.

The critical comparison: interview-based agents scored 14 to 15 percentage points higher than demographic-based or persona-based agents using the same LLMs. Demographics alone aren't enough. You need the qualitative depth.
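
The normalization used across these benchmarks is straightforward to compute. A toy example, with invented responses:

```python
# Normalized accuracy = agent-vs-human agreement divided by the human's
# own test-retest agreement. All numbers below are invented.

def accuracy(a: list[int], b: list[int]) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

wave1 = [1, 2, 2, 3, 1, 4, 2, 2]  # participant, first sitting
wave2 = [1, 2, 3, 3, 1, 4, 2, 1]  # same participant, two weeks later
agent = [1, 2, 2, 3, 2, 4, 1, 1]  # generative agent's predictions

raw = accuracy(agent, wave1)       # 0.625
ceiling = accuracy(wave2, wave1)   # 0.75, the test-retest reliability
normalized = raw / ceiling         # ~0.83: agent is 83% as accurate as
                                   # the person is consistent with themselves
print(f"raw={raw:.2f} ceiling={ceiling:.2f} normalized={normalized:.2f}")
```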

Why interview-based agents actually work

Park didn't stop at reporting accuracy numbers. He ran a series of robustness checks to identify the specific mechanisms behind the performance gap. This is the most valuable part of the dissertation for anyone building in this space.

Mechanism 1: Direct retrieval. The LLM can find verbatim answers to survey questions buried in the interview transcript, even when the questions differ in wording. Someone asked about their health might mention "I'm disabled. My health is so bad that I can't even work" — which directly answers the GSS question "Are you employed?" without ever being asked about employment.

Mechanism 2: Inference. Even when no direct answer exists, the LLM reasons from other information in the transcript. A participant who says "I'm enrolled in school" is probably not employed full-time, and therefore unlikely to have a work supervisor. The model makes the connection.

Here's the finding that matters most: even after removing all questions answerable by either mechanism, interview-based agents still outperformed demographic and persona-based agents. After stripping out the 30 questions most likely to have direct answers and another 30 most likely to be answerable by inference, the interview agents still beat demographic agents by 8 percentage points.
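
In code terms, that ablation amounts to something like this sketch; the data structures are invented for illustration.

```python
# Re-measure the interview-vs-demographic gap after dropping the questions
# most answerable by direct retrieval or by inference. Illustrative only.

def gap_after_removal(
    results: dict[str, tuple[bool, bool]],  # qid -> (interview ok, demo ok)
    retrieval_qs: set[str],
    inference_qs: set[str],
) -> float:
    kept = {q: v for q, v in results.items()
            if q not in retrieval_qs | inference_qs}
    interview_acc = sum(ok for ok, _ in kept.values()) / len(kept)
    demographic_acc = sum(ok for _, ok in kept.values()) / len(kept)
    # Park reports roughly an 8-point gap surviving this removal.
    return interview_acc - demographic_acc
```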

Something else is going on beyond retrieval and inference. Park calls it an open question for future research. But the implication is clear: qualitative interviews give the model a holistic understanding of the person that transcends any individual fact or inference chain.

Fragments work too

One finding from the robustness analysis deserves its own section because it directly validates a key Synthicant design decision.

Park tested what happens when you systematically remove portions of the interview transcript. Even with 80% of the interview utterances removed, agents still achieved 0.79 accuracy on the GSS — compared to 0.71 for demographic-only agents.

Read that again. You can throw away four-fifths of a two-hour interview and still get better results than using demographics alone.

This matters because in product research, you almost never have a two-hour qualitative interview with each customer. You have a five-minute support call. A few survey responses. A couple of paragraphs from a feedback form. The Stanford data says that's enough. Fragments of real qualitative data consistently outperform comprehensive demographic profiles.
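
The experiment itself is easy to picture. A sketch, with our own function names:

```python
# Randomly drop a fraction of interview utterances, then rebuild the agent
# from what survives. Stanford found agents built on 20% of the interview
# still hit 0.79 on the GSS versus 0.71 for demographic-only agents.
import random

def ablate_transcript(utterances: list[str], keep_fraction: float,
                      seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    n_keep = max(1, round(len(utterances) * keep_fraction))
    kept = sorted(rng.sample(range(len(utterances)), n_keep))
    return [utterances[i] for i in kept]

# e.g. keep only one-fifth of a two-hour interview:
# fragments = ablate_transcript(all_utterances, keep_fraction=0.2)
```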

How Synthicant implements this

Synthicant's dynamic persona pipeline is built on the same principle as Park's architecture, with adaptations for product research contexts.

Expert reflection maps to persona analysis. When you upload documents to a dynamic persona, Synthicant's analyzer extracts personality traits (OCEAN scores with confidence weights), speaking patterns, beliefs, biases, and direct quotes. It's the same concept as Park's four-expert approach — multiple analytical lenses applied to the same source material to extract insights the person never explicitly stated.

Synthicant aggregates across multiple sources. Stanford used a single two-hour interview per person. In product research, you rarely have one perfect transcript. You have five customer interviews, a dozen support tickets, and a few survey responses. Synthicant's pipeline analyzes each document independently, then uses confidence-weighted aggregation to build a composite persona. More data makes the persona sharper.
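
A toy version of confidence-weighted aggregation shows the idea; this is our own illustration, not Synthicant's internals.

```python
# Combine per-document trait estimates, weighting by analyzer confidence.

def aggregate_trait(estimates: list[tuple[float, float]]) -> float:
    """estimates: (score, confidence) pairs from independent documents."""
    total = sum(conf for _, conf in estimates)
    return sum(score * conf for score, conf in estimates) / total

# Three sources estimate the same persona's extraversion:
extraversion = aggregate_trait([
    (0.72, 0.9),  # hour-long interview: high confidence
    (0.60, 0.4),  # short support ticket: low confidence
    (0.68, 0.6),  # survey free-text: medium confidence
])  # ~0.68, dominated by the richest source
```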

Synthicant redacts PII before any data touches the LLM. The Stanford team acknowledged privacy as a major risk — they chose not to release their agents publicly for exactly this reason. They built an elaborate access-control framework with API-gated queries, aggregated responses for public access, and individual responses only through a review process. Synthicant takes a different approach: Microsoft Presidio strips names, emails, phone numbers, and addresses at the point of upload. Sensitive data never enters the system in the first place.
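
Presidio is open source, so the redaction step can be shown with real library calls. This is the standard analyzer-plus-anonymizer pattern; Synthicant's exact configuration may differ.

```python
# PII redaction with Microsoft Presidio: detect entities, then replace them.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Call Jane Doe at 212-555-0123 or jane@example.com."
findings = analyzer.analyze(
    text=text,
    entities=["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    language="en",
)
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)
# -> "Call <PERSON> at <PHONE_NUMBER> or <EMAIL_ADDRESS>."
```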

Synthicant is built for product decisions, not social science. Stanford validated against the General Social Survey and economic games. Those are useful benchmarks. But product teams need personas that can answer "Would you pay $20/month for this feature?" and "What would make you switch from our competitor?" The same principle applies: richer input data produces more reliable answers.

The bias finding matters too

One result from the study deserves more attention. Interview-based agents consistently reduced predictive bias across political ideology, race, and gender compared to demographic-based agents.

This makes intuitive sense. When you simulate someone using only their demographic profile, the LLM falls back on stereotypes associated with that demographic. It produces an average response for that group, not an individual response for that person. When you give the model actual interview data, it has real evidence to work with instead of learned generalizations.

Park puts it directly in the dissertation: conditioning on demographics alone risks stereotyping. A description like "30-year-old Asian graduate student" should not, by itself, license generic or biased inferences about daily life. Rich, self-reported context should define the person — not a demographic category.

For product research, this means demographic-only personas will systematically misrepresent the people you most need to understand: the outliers, the edge cases, the customers whose behavior doesn't match the category average. Interview-grounded personas fix that.

Social simulacra: prototyping communities with AI

The dissertation includes a third project worth noting. Social Simulacra is a system that generates entire simulated communities — not just individual personas, but populations of agents interacting within a hypothetical social space.

Park built SimReddit, which populates a new subreddit with AI-generated users, posts, and conversations based on nothing more than a community goal and a set of rules. In evaluation, participants could distinguish real subreddit conversations from SimReddit's generated ones only 59% of the time — barely above chance.

The practical application: designers used SimReddit to anticipate failure modes — trolling, off-topic posts, rule violations — and iterate on community rules before any real users showed up. It's prototyping for social dynamics.

What the study got right about risks

Park identifies three risks that directly apply to anyone building in this space, and his dissertation is admirably honest about them:

Overreliance. Generative agents are not oracles. They're approximations. An 85% accuracy rate means 15% of the time, the simulation gives you the wrong answer. Park's advice: "Start with the decision you want to influence and choose cases where feedback will arrive. Simulations are at their best when iteration compounds learning, not when a one-off forecast is expected to be definitive."

Privacy. Interview data is sensitive. The Stanford team built a two-tier access system: open access to aggregated responses on fixed tasks, restricted access to individual responses requiring review and IRB approval. At Synthicant, PII is redacted at the point of upload, so sensitive data never enters the system.

Anthropomorphic drift. Park warns against "treating token predictors as people." The agents are remarkably accurate, but they're still statistical models, not conscious beings. Regular reality checks — ablations, counterfactual probes, comparison to real user data — keep interpretations grounded.

The bottom line

Park's dissertation traces an arc he calls "from capability to credibility." The 2023 Smallville study proved capability — AI agents can sustain believable behavior. The 1,000-person study establishes credibility — AI agents grounded in qualitative data can reliably predict real people's attitudes and decisions.

The difference between a useful synthetic persona and an expensive random number generator is the quality of the data you feed it. Not demographics. Not personality labels. Actual qualitative data from real people — their stories, opinions, frustrations, and aspirations.

Even fragments of that data outperform comprehensive demographic profiles. Even shortened interviews beat the baseline. The richer the qualitative input, the more faithful the simulation.

Upload your transcripts. Ground your personas in reality. Let the LLM do what it's actually good at: synthesizing complex qualitative data into coherent, queryable human models.

The research says it works. We built the tool that makes it practical.

References

Park, J.S. (2025). "Generative Agent Simulations of Human Behavior." Doctoral Dissertation, Stanford University. — The full dissertation unifying the Smallville study, the 1,000-person simulation, and Social Simulacra into a single framework with novel contributions on expert reflection, robustness analysis, and trustworthy simulation design.

Park, J.S., Zou, C.Q., Shaw, A., Hill, B.M., Cai, C.J., Morris, M.R., Willer, R., Liang, P., & Bernstein, M.S. (2024). "Generative Agent Simulations of 1,000 People." arXiv. — The paper form of the 1,000-person study. Demonstrates that LLMs paired with two-hour interview transcripts replicate real individuals' survey responses at 85% of human test-retest reliability.

Park, J.S., O'Brien, J.C., Cai, C.J., Morris, M.R., Liang, P., & Bernstein, M.S. (2023). "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023. — The predecessor study that demonstrated 25 AI agents sustaining believable behavior in a simulated environment over multiple days.

Costa, P.T. & McCrae, R.R. (1992). "Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual." Psychological Assessment Resources. — The foundational Big Five personality inventory that Synthicant's OCEAN model is built on, and one of the benchmarks used in the Stanford study.

Jiang, H., Zhang, X., Cao, X., Breazeal, C., Roy, D., & Kabbara, J. (2024). "PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits." Findings of NAACL 2024. — Demonstrated that assigned Big Five personas hold in LLMs with large effect sizes, supporting the theoretical basis for personality-grounded simulation.

Serapio-García, G., Safdari, M., Crépy, C., Sun, L., Fitz, S., Romero, P., Abdulhai, M., Faust, A., & Matarić, M. (2023). "Personality Traits in Large Language Models." arXiv. — First rigorous measurement of Big Five personality traits in LLMs, establishing that AI models have stable, measurable personality profiles.

Further reading

This is the latest in our series exploring the research behind synthetic user personas. Read our earlier deep dive on the original Generative Agents study or learn how Synthicant's dynamic persona pipeline works. Ready to build research-grade personas from your own data? Start your free trial.