
Different LLMs Have Different Personalities — And That's a Problem
Here's a fact that most AI persona platforms don't want you to think about: the model you use has its own personality. And it's different from every other model.
Sorokovikova et al. reported this in 2024. They administered a standardized Big Five personality inventory to several large language models and found that each model produces a stable, distinct personality profile. GPT-4 scores differently from Claude. Claude scores differently from Llama. These aren't random fluctuations. They're consistent, replicable personality signatures baked into each model.
This means that if you switch the underlying model, your persona's personality changes — even if you keep the prompt identical.
The model personality problem
Serapio-García and colleagues first documented this in 2023. They gave LLMs the same validated personality assessments used in human psychology research, instruments like the Big Five Inventory that have decades of normative data behind them. The models didn't score randomly. They produced coherent, interpretable personality profiles.
The 2024 follow-up by Sorokovikova et al. confirmed two critical properties. First, these profiles are stable: the same model, tested at different times, produces consistent scores. Second, they're model-specific: different models produce systematically different profiles.
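To make "stable" and "model-specific" concrete, here is a minimal sketch of how Likert responses become trait scores and how the spread across repeated administrations can be checked. The item key is hypothetical (it is not the key of the Big Five Inventory or any published instrument), and the function names are illustrative rather than taken from either paper.

```python
from statistics import mean, pstdev

TRAITS = ("O", "C", "E", "A", "N")

# Hypothetical item key: (trait, reverse_keyed) for each item, in order.
# This is NOT the scoring key of a real inventory.
ITEM_KEY = [
    ("A", False), ("A", True),
    ("C", False), ("C", True),
    ("E", False), ("E", True),
    ("N", False), ("N", True),
    ("O", False), ("O", True),
]


def score_inventory(responses, scale_max=5):
    """Average 1..scale_max Likert responses per trait, flipping reverse-keyed items."""
    if len(responses) != len(ITEM_KEY):
        raise ValueError("expected one response per item")
    by_trait = {t: [] for t in TRAITS}
    for (trait, reverse), answer in zip(ITEM_KEY, responses):
        if not 1 <= answer <= scale_max:
            raise ValueError(f"response {answer!r} is outside 1..{scale_max}")
        by_trait[trait].append(scale_max + 1 - answer if reverse else answer)
    return {t: mean(values) for t, values in by_trait.items()}


def trait_spread(runs):
    """Standard deviation of each trait score across repeated administrations."""
    scored = [score_inventory(run) for run in runs]
    return {t: pstdev([s[t] for s in scored]) for t in TRAITS}
```

A spread near zero for one model across runs is what "stable" looks like in this sketch. Comparing the per-trait means of two different models is where a model-specific difference would show up.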
In practical terms, every LLM starts with a personality baseline before you add a single word to the system prompt. GPT-4 tends toward higher agreeableness. Some open-source models score lower on conscientiousness. Claude has its own distinct profile. These baselines act as a gravitational pull on every persona you build on top of them.
When your persona prompt says "you are a disagreeable, confrontational skeptic" and the underlying model has a strong agreeableness baseline, those two forces compete. The result is a persona that's less disagreeable than you intended — the model's innate personality dilutes the prompt.
Why "model diversity" makes it worse
Some platforms treat model switching as a feature. "We use multiple models for diverse perspectives." This sounds reasonable until you understand what it actually means.
If you interview the same persona using GPT-4 on Monday and Claude on Tuesday, you haven't gained diversity of perspective. You've introduced uncontrolled personality variation. The persona on Tuesday has a different agreeableness baseline, different conscientiousness tendencies, and different openness patterns than the persona on Monday — not because you changed any parameters, but because the underlying model changed.
This is the equivalent of swapping out your interview participant mid-study without telling the research team. The transcript looks continuous, but the personality generating the responses shifted underneath.
Worse, because model personality differences are consistent but not intuitive, you can't predict which direction the variation will go. You might get more pushback on your pricing page from the Tuesday persona — not because the persona is different, but because the model is. That's noise masquerading as signal.
Research demands controlled variation. You want to vary personality deliberately and measure the effect. Uncontrolled variation from model switching makes that impossible.
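A toy additive model makes the confound visible. Everything in it is an assumption for illustration: the `pushback` function, the scoring scale, and the per-model offsets are invented numbers, not measurements of any real model.

```python
# Hypothetical per-model baseline offsets on a "pushback" score (invented values).
MODEL_OFFSET = {"model_a": 0, "model_b": 8}


def pushback(agreeableness: int, model: str) -> int:
    """Toy score: lower agreeableness means more pushback, plus a model offset."""
    return (5 - agreeableness) * 10 + MODEL_OFFSET[model]


def observed_effect(before, after) -> int:
    """Difference in pushback between two (agreeableness, model) conditions."""
    return pushback(*after) - pushback(*before)


# Controlled: only the agreeableness parameter changes.
controlled = observed_effect((4, "model_a"), (2, "model_a"))

# Confounded: the parameter changes and the model is swapped at the same time.
confounded = observed_effect((4, "model_a"), (2, "model_b"))

# Pure model swap: no parameter changed, yet the score still moves.
model_only = observed_effect((2, "model_a"), (2, "model_b"))

print(controlled, confounded, model_only)  # 20 28 8
```

In the confounded case the observed difference (28) mixes the parameter effect (20) with the model offset (8), and nothing in the transcript tells you how to split the two.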
The Synthicant approach: personality from parameters, not from the model
Synthicant uses a single model — Claude — with explicit OCEAN parameterization in the system prompt. Here's why this matters.
When you set the Agreeableness slider to 2 out of 5, the system prompt builder translates that into specific behavioral instructions: the persona will push back on claims, express dissatisfaction directly, and avoid cooperative hedging. This explicit parameterization overrides the model's baseline personality for that specific dimension.
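A minimal sketch of what such a translation step could look like follows. Synthicant's actual prompt builder is not shown in this article, so `OceanProfile`, `build_system_prompt`, the level wording, and the prompt layout are all hypothetical. Only the Agreeableness 2 instruction is taken from the description above.

```python
from dataclasses import asdict, dataclass

LEVEL_WORDS = {1: "very low", 2: "low", 3: "moderate", 4: "high", 5: "very high"}

# Behavioral instructions keyed by (trait, level). Only this one entry mirrors
# the article's example; a real table would need wording for every pair.
BEHAVIORS = {
    ("agreeableness", 2): (
        "Push back on claims, express dissatisfaction directly, "
        "and avoid cooperative hedging."
    ),
}


@dataclass(frozen=True)
class OceanProfile:
    openness: int
    conscientiousness: int
    extraversion: int
    agreeableness: int
    neuroticism: int

    def __post_init__(self):
        # Reject slider values outside the 1-5 range.
        for trait, level in asdict(self).items():
            if level not in LEVEL_WORDS:
                raise ValueError(f"{trait} must be 1-5, got {level!r}")


def build_system_prompt(profile: OceanProfile) -> str:
    """Render the profile as one explicit instruction line per trait."""
    lines = ["You are a research persona with this Big Five profile:"]
    for trait, level in asdict(profile).items():
        line = f"- {trait.capitalize()}: {LEVEL_WORDS[level]} ({level}/5)."
        behavior = BEHAVIORS.get((trait, level))
        if behavior:
            line += " " + behavior
        lines.append(line)
    return "\n".join(lines)
```

Because the function is deterministic, the same slider values always yield the same prompt text, which is the property the rest of this section relies on.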
The personality comes from the structured system prompt, not from the model's innate tendencies. This means:
- Reproducibility. Interview the same persona on Monday and Friday and get consistent personality-driven behavior. The model hasn't changed, the parameters haven't changed, so the behavioral patterns don't change.
- Calibrated variation. When you move the Neuroticism slider from 1 to 5, you know the change in behavior comes from that parameter shift, not from a model swap, a temperature change, or a random seed difference. You're measuring one variable at a time.
- Predictable extremes. A persona with Agreeableness set to 1 will be reliably confrontational because the system prompt explicitly instructs that behavior. The model's agreeableness baseline is overridden, not relied upon.
This is the same principle that makes controlled experiments work in any field. Hold everything constant except the variable you're studying. In persona research, the model is a confounding variable. Fix it.
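One way to picture "hold everything constant except the variable you're studying" is a one-factor-at-a-time sweep. The sketch below only builds run configurations and does not call any model. The names `FIXED_RUN`, `BASELINE`, and `one_factor_sweep`, along with the model identifier, temperature, and seed values, are placeholders chosen for illustration.

```python
# Settings held constant across every run (placeholder values).
FIXED_RUN = {"model": "single-fixed-model", "temperature": 0.7, "seed": 1234}

# Neutral starting profile on a 1-5 slider.
BASELINE = {
    "openness": 3,
    "conscientiousness": 3,
    "extraversion": 3,
    "agreeableness": 3,
    "neuroticism": 3,
}


def one_factor_sweep(baseline, trait, levels=range(1, 6)):
    """Return one run config per level, varying a single trait only."""
    if trait not in baseline:
        raise KeyError(f"unknown trait: {trait}")
    return [
        {**FIXED_RUN, "ocean": {**baseline, trait: level}}
        for level in levels
    ]


def changed_keys(a, b):
    """List the keys whose values differ between two dicts with the same keys."""
    return sorted(k for k in a if a[k] != b[k])


sweep = one_factor_sweep(BASELINE, "neuroticism")
print(changed_keys(sweep[0]["ocean"], sweep[-1]["ocean"]))  # ['neuroticism']
```

If `changed_keys` ever reports more than one entry between two runs you intend to compare, the comparison is no longer measuring a single variable.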
But doesn't a single model limit diversity?
This is the obvious objection, and it has a straightforward answer: no, because personality diversity comes from the OCEAN parameters, not from the model.
The Big Five framework gives you five independent dimensions, each on a continuous scale. That's a five-dimensional personality space with effectively infinite granularity. Two personas with different OCEAN profiles will behave differently — reliably and predictably — even though they run on the same model.
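Even a coarse discretization shows how large this space is. The sketch below assumes a 1-to-5 slider per trait, as in the slider examples earlier, and uses plain Euclidean distance as one possible (assumed, not prescribed) way to say how far apart two profiles are.

```python
from itertools import product
from math import dist

TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)
LEVELS = range(1, 6)


def all_profiles():
    """Enumerate every combination of slider levels across the five traits."""
    return [dict(zip(TRAITS, combo)) for combo in product(LEVELS, repeat=len(TRAITS))]


def profile_distance(a, b):
    """Euclidean distance between two profiles in the five-dimensional space."""
    return dist([a[t] for t in TRAITS], [b[t] for t in TRAITS])


profiles = all_profiles()
print(len(profiles))  # 3125, i.e. 5 ** 5
```

A finer slider or a truly continuous scale only makes the space larger, all without touching the underlying model.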
Using multiple models to get "diversity" is like using different thermometers to get different temperatures. You don't want instrument variation. You want measurement precision.
The 2024 PersonaLLM study by Jiang et al. showed that a single model can express assigned Big Five personality profiles, with large effect sizes across all five dimensions. Human evaluators identified some of the assigned personality traits with up to 80% accuracy. The implication is that a single model, properly parameterized, can produce more reliable personality diversity than multiple models with vague persona prompts.
What this means for evaluating persona platforms
If you're evaluating tools for synthetic user research, ask one question about the model architecture: is personality variation controlled or accidental?
Platforms that shuffle between models introduce accidental variation. You can't separate "this persona reacted negatively because of its personality configuration" from "this persona reacted negatively because the underlying model has a different agreeableness baseline today."
Platforms that use a single model with explicit personality parameterization give you controlled variation. Every difference in persona behavior traces back to a deliberate parameter choice. That's the difference between research and noise.
The model underneath is infrastructure. It should be invisible and consistent. The personality parameters are the research instrument. They should be visible and adjustable.
Controlled personality is better than random personality. Every time.
References
Sorokovikova, A., Fedorova, N., Rezagholi, S., & Yamshchikov, I. P. (2024). "LLMs Simulate Big Five Personality Traits: Further Evidence." arXiv preprint arXiv:2402.01765. Replicated and extended earlier findings, confirming that AI personality profiles are stable across repeated measurements and differ systematically between models. The primary evidence for the model personality problem.
Serapio-García, G., Safdari, M., Crepy, C., et al. (2023). "Personality Traits in Large Language Models." arXiv preprint arXiv:2307.00184. First rigorous measurement of Big Five traits in LLMs using standardized personality inventories. Established that models produce consistent, interpretable personality profiles rather than random noise.
Jiang, H., Zhang, X., Cao, X., et al. (2024). "PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits." Findings of the Association for Computational Linguistics: NAACL 2024. Demonstrated that a single model can express assigned Big Five personality profiles with large effect sizes, supporting the claim that explicit parameterization produces reliable personality diversity without requiring multiple models.
Costa, P.T. & McCrae, R.R. (1992). "Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Professional Manual." Psychological Assessment Resources. The foundational instrument for Big Five measurement that the LLM personality studies are benchmarked against.
Further reading
- Sorokovikova et al. — LLMs Simulate Big Five Personality Traits (2024)
- Serapio-García et al. — Personality Traits in LLMs (2023)
- Jiang et al. — PersonaLLM (2024)
This is the tenth article in our research foundations series. Want to see how controlled personality parameterization produces more reliable research than model diversity? Try building the same persona twice with different OCEAN scores and compare the results.