PII Redaction in AI Products: How We Keep Your Customer Data Safe
William Jones · 5 min read


privacy · security · PII · enterprise

When you upload a customer interview transcript to an AI tool, that file probably contains names, email addresses, phone numbers, and maybe even financial information. Most AI products ingest all of that and send it straight to an LLM.

Synthicant doesn't. Here's why, and how we built PII redaction into the core architecture.

The problem with "trust us"

Most AI products handle customer data with a terms of service and a promise. "We take security seriously." "Your data is encrypted at rest." "We don't use your data to train models."

These statements may be true, but they miss the point. The question isn't whether the AI company will misuse your data. The question is: should the AI ever see personal information in the first place?

If an AI persona is supposed to represent a customer segment, it doesn't need to know that the interview was with "Sarah Chen from Acme Corp at sarah.chen@acme.com." It needs to know what Sarah said about the product, how she expressed her concerns, and what her decision-making process looks like.

The personal identifiers add no value and create real risk.

How Synthicant handles PII

Every text file uploaded to Synthicant passes through Microsoft Presidio — an open-source PII detection and anonymization framework used by enterprises worldwide.

The pipeline works like this:

1. Upload

You upload a file — a transcript, a CSV, a PDF, a Word document. The file hits our backend.

2. Detection

Presidio's analyzer scans the text and identifies PII entities:

  • Names — First names, last names, full names
  • Email addresses — Any email pattern
  • Phone numbers — Domestic and international formats
  • Physical addresses — Street addresses, zip codes
  • Financial data — Credit card numbers, bank accounts
  • Government IDs — SSNs, passport numbers
  • Medical data — Medical record numbers, health conditions
  • URLs — Personal websites, social media profiles

Each detection includes a confidence score. Presidio uses a combination of rule-based matching, named entity recognition (NER), and contextual analysis to minimize false positives.
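The rule-based half of that combination can be pictured with a deliberately simplified sketch: each pattern carries a baseline confidence, and every match becomes a scored finding. Note that `detect_pii` and `PATTERNS` are illustrative names, not Presidio's API — the real framework exposes this through `AnalyzerEngine.analyze`, layering NER and contextual analysis on top of patterns like these.

```python
import re

# Toy rule-based detectors with per-pattern confidence scores.
# Real-world patterns are far more robust; these are illustrative only.
PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"), 0.7),
    "US_SSN": (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), 0.85),
}

def detect_pii(text):
    """Return (entity_type, start, end, score) for each pattern match."""
    findings = []
    for entity, (pattern, score) in PATTERNS.items():
        for m in pattern.finditer(text):
            findings.append((entity, m.start(), m.end(), score))
    return sorted(findings, key=lambda f: f[1])  # order by position in text
```

A context-aware analyzer would then raise or lower each score based on surrounding words ("call me at …" boosts a phone match) before anything below a threshold is discarded.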

3. Anonymization

Detected PII is replaced with placeholder tokens:

| Original | Redacted |
|----------|----------|
| Sarah Chen | [PERSON] |
| sarah.chen@acme.com | [EMAIL_ADDRESS] |
| (415) 555-0123 | [PHONE_NUMBER] |
| 123 Market Street | [LOCATION] |

The redacted text preserves the structure and meaning of the original while removing all identifying information.
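The replacement step itself is mechanical: swap each detected span for its placeholder token, working right to left so earlier character offsets stay valid. This is a minimal sketch of the idea — Presidio's `AnonymizerEngine` performs this step for real (and also supports masking and hashing operators); `redact_spans` is an illustrative name, not part of Presidio.

```python
def redact_spans(text, spans):
    """Replace each (entity_type, start, end) span with a [ENTITY] token.

    Spans are applied in reverse order of position so that substituting
    one span never shifts the offsets of the spans before it.
    """
    for entity, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{entity}]" + text[end:]
    return text
```

For example, `redact_spans("Sarah Chen <sarah.chen@acme.com>", [("PERSON", 0, 10), ("EMAIL_ADDRESS", 12, 31)])` yields `"[PERSON] <[EMAIL_ADDRESS]>"` — the sentence structure survives while the identifiers do not.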

4. Embedding and storage

Only the redacted text is chunked and embedded. Only the redacted text is stored in the vector database. Only the redacted text is retrieved during chat sessions.

The original, unredacted text is never stored, embedded, or sent to any AI model.

Why Presidio?

We chose Microsoft Presidio for several reasons:

  • Open source — The code is publicly auditable. No black box.
  • Enterprise adoption — Used by organizations with strict compliance requirements (HIPAA, GDPR, CCPA)
  • Extensibility — We can add custom recognizers for industry-specific PII patterns
  • Accuracy — Combines multiple detection methods to minimize both false positives and false negatives
  • Language support — Works across multiple languages via spaCy NLP models

We run Presidio in our own infrastructure. Your data never leaves our pipeline to reach a third-party PII detection service.

What about media files?

Images, audio, and video require a different approach. You can't run regex patterns on a photo.

When you upload media to Synthicant, Google Gemini Flash generates a text description of the content. This description captures what's in the media (a product screenshot, an interview recording, a demo video) without including personal information that might be visible or audible.

The description — not the raw media — is what gets embedded and stored. If a photo contains a person's face or a whiteboard with names on it, the description focuses on the relevant product context, not the personal details.

Architectural guarantees

PII redaction in Synthicant isn't a feature flag. It's a hard architectural constraint:

  1. The upload endpoint redacts before any downstream processing
  2. The embedding service only receives redacted text
  3. The vector store only contains redacted chunks
  4. The chat system only retrieves redacted content
  5. There is no code path that sends unredacted customer text to an LLM
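One way to make "no code path" concrete is to enforce the ordering at the type level: downstream functions accept only a redacted wrapper type, never a raw string. The sketch below is hypothetical — every name in it (`RedactedText`, `redact`, `embed`) is illustrative, and the email regex stands in for the full Presidio pipeline — but it shows the shape of the constraint.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class RedactedText:
    """Wrapper type: the only way to construct one is via redact()."""
    value: str

# Stand-in for the full Presidio pipeline; real detection is far broader.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(raw: str) -> RedactedText:
    """The single boundary where raw text stops existing."""
    return RedactedText(_EMAIL.sub("[EMAIL_ADDRESS]", raw))

def embed(text: RedactedText) -> list[float]:
    """Embedding accepts only RedactedText, so a raw str never reaches it."""
    if not isinstance(text, RedactedText):
        raise TypeError("embed() requires redacted text")
    return [float(len(text.value))]  # stand-in for a real embedding call
```

With this shape, passing unredacted text to `embed` is a type error rather than a runtime policy decision — the guarantee lives in the architecture, not in a flag.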

This means that even if there were a bug in our chat system, a prompt injection attack, or a misconfigured API endpoint, the AI simply cannot access your customers' personal information. The data doesn't exist in any system the AI can reach.

What this means for your compliance team

If you're evaluating Synthicant for an organization with data handling requirements:

  • GDPR — PII is redacted before processing. No personal data is stored in AI systems.
  • HIPAA — Medical record numbers and health conditions are detected and redacted by Presidio.
  • CCPA — Consumer personal information is anonymized before embedding.
  • SOC 2 — Data isolation per persona namespace. No cross-contamination between organizations.

We're not a compliance product, but PII redaction is built into the architecture rather than bolted on after the fact.

The tradeoff

PII redaction isn't free. Occasionally, Presidio will flag something as PII that isn't (a product name that looks like a person's name, for example). This can remove useful context from the embedded data.

We consider this an acceptable tradeoff. A false positive means slightly less context for the persona. A false negative means your customer's personal information is stored in a vector database. One of these is clearly worse than the other.


Your customer data deserves better than "trust us." Try Synthicant and see privacy-first AI research in action.