You're building an AI-powered HR tool. An employee asks the chatbot: "What's my current salary and when does my dental coverage reset?" The bot queries the database and answers correctly. Great UX.
Now consider what may have just happened: the employee's record (SSN, salary, address, phone number, benefits details, and dependent names) passed through an API call to a third-party AI provider's servers. The employee never consented to that. Your privacy policy doesn't mention it. The AI provider's terms of service say they can use inputs to improve their models.
You just created a data breach. You just don't know it yet.
What Counts as Sensitive PII
Before we get into the legal weeds, let's be clear about what we're talking about. Sensitive personally identifiable information (PII) includes:
- Government identifiers: Social Security numbers, passport numbers, driver's license numbers, tax IDs
- Contact details: Home addresses, phone numbers, personal email addresses
- Employment and financial data: Salary, pay stub details, bank account info, benefits enrollment, 401(k) contributions, stock options
- Family and dependent data: Spouse and children's names, dates of birth, SSNs for dependents
- Health information: Insurance plan details, medical conditions, prescription history, disability accommodations
- Authentication data: Passwords, security questions, biometric data
Each of these categories carries its own legal weight. An SSN in the wrong hands enables identity theft. A salary disclosure can violate employment agreements. Health information is covered by an entirely separate federal statute. But when you route any of this through an AI provider's API, you're creating overlapping legal risk across all of them simultaneously.
How AI Providers Actually Handle Your Data
Here's the uncomfortable truth that most developers discover too late: the terms of service you agreed to when you signed up for an API key may not protect you the way you think.
OpenAI's policies for consumer-facing ChatGPT state that user conversations may be used to train models unless the user opts out. For API customers (developers using the API to build products), the policy is different: OpenAI states that API inputs and outputs are not used for training by default. But:
- Enterprise agreements vary. The "not used for training" guarantee applies to the standard API, not necessarily to every model or every product offering.
- Anthropic's Claude API has similar policies, but the specifics change with updates.
- Smaller providers and open-weight models hosted by third parties often have less clear policies.
- The "zero data retention" option, when available, typically requires a separate agreement or approval and may cost extra.
The key question is never "does my AI provider promise not to train on my data?" It's "what happens if a bug, a breach, or a policy change exposes the data I sent them?"
The Real-World Incidents
Samsung's ChatGPT Ban
In 2023, Samsung engineers pasted proprietary source code into ChatGPT to debug issues. After discovering that the code had been sent to OpenAI's servers, where it could be retained and potentially used for training, Samsung banned employees from using generative AI tools on company devices. Other major companies, including Apple, JPMorgan Chase, and Amazon, imposed similar restrictions around the same time.
This wasn't about PII exactly, but it demonstrated the core problem: employees and automated systems will route sensitive data through AI providers without understanding the implications. When the data is code, it's a trade secret issue. When the data is an SSN, it's a compliance catastrophe.
The ChatGPT Data Exposure Bug
In March 2023, OpenAI disclosed a bug that let some ChatGPT users see the titles of other users' conversations and, in some cases, the first message of a newly created conversation. The same bug exposed payment-related information for approximately 1.2% of ChatGPT Plus subscribers who were active during a nine-hour window, including names, email addresses, billing addresses, the last four digits of credit card numbers, and expiration dates.
If employees or HR staff had been pasting records into ChatGPT at the time, conversation titles or opening messages containing SSNs, salaries, or home addresses could have been shown to random other users. This is the scenario that compliance officers have nightmares about: not a hack, just a bug in the AI provider's infrastructure.
The Training Data Problem
Research has repeatedly demonstrated that large language models memorize and can reproduce training data, including PII. In one widely cited study, researchers extracted verbatim names, phone numbers, email addresses, and physical addresses from GPT-2 by sampling the model and filtering its outputs. The model had ingested this information from web-scraped training data and could reproduce it when prompted.
When you send sensitive employee data through an AI API, the exposure depends on what the provider does with it. If the provider trains on it, now or after a policy change, the model may memorize fragments and reproduce them in responses to other users, and that is very difficult to detect or prove after the fact. If the provider does not train on it, the model's weights are unaffected, but the data can still sit in request logs, abuse-monitoring stores, and backups.
The Legal Framework: What Laws Apply
This is where it gets complicated, because sending sensitive PII through an AI provider can trigger violations of multiple laws simultaneously.
HIPAA: Health Information
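If your AI tool handles health-related data (insurance plan details, medical leave, disability accommodations, prescription information), the Health Insurance Portability and Accountability Act may apply. HIPAA governs covered entities, such as health plans (including employer-sponsored group health plans) and healthcare providers, along with their business associates. Health details an employer holds purely in its role as an employer generally fall outside HIPAA, though other laws still protect them.
Where HIPAA does apply, protected health information (PHI) may be disclosed to a vendor acting on your behalf only if that vendor is a "business associate" that has signed a Business Associate Agreement (BAA). Many AI providers will not sign a BAA for their default tiers. If you send PHI through their API without one, you've violated HIPAA. Statutory penalties were set at $100 to $50,000 per violation, with annual maxima up to $1.5 million for willful neglect, and the amounts are adjusted upward for inflation each year.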
Some AI providers have started offering HIPAA-compliant tiers (OpenAI has a BAA for enterprise healthcare customers), but these come with restrictions and additional costs. The default API tier is almost never HIPAA-compliant.
GLBA: Financial Data
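The Gramm-Leach-Bliley Act requires financial institutions to protect consumers' "nonpublic personal information." If your company is a financial institution under GLBA and your AI tool handles customers' account details or financial planning information, the Safeguards Rule applies. Payroll and direct-deposit data about your own employees is generally outside GLBA's scope, though it remains sensitive under state law. The FTC enforces the Safeguards Rule for the institutions under its jurisdiction, and has been increasingly aggressive about AI-related data practices.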
FERPA: Education Records
Student records — including financial aid data, enrollment information, and grades — are protected by the Family Educational Rights and Privacy Act. AI tutoring tools and chatbots used in educational settings have already triggered FERPA compliance questions. The Department of Education issued guidance in 2024 warning schools to carefully vet AI tools that access student data.
State Privacy Laws
Beyond federal statutes, every state with a comprehensive privacy law has something to say about this:
- California (CPRA): Requires disclosure of data sharing with third parties, including "service providers." An AI provider processing employee data counts. CPRA also grants consumers the right to delete their data and opt out of automated decision-making.
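- Virginia (VCDPA), Colorado (CPA), Connecticut (CTDPA), and others: Similar requirements around consent, disclosure, and data minimization. Unlike California, most of these laws exempt data processed in an employment context, so they matter most when your tool also touches customer data. Most of these laws give state attorneys general enforcement authority.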
- Illinois (BIPA): If your AI processes biometric data — facial recognition for employee verification, voice analysis — BIPA requires written consent and data retention policies. Clearview AI's $51.8 million settlement shows how expensive violations get.
The EU Data Protection Framework
If any of your employees or customers are in the EU, GDPR applies. Under GDPR, sending personal data to an AI provider triggers several obligations:
- Legal basis for processing: You need a valid legal basis — consent, contractual necessity, or legitimate interest. "It's convenient for our chatbot" doesn't qualify.
- Data Processing Agreement (DPA): You need a written agreement with the AI provider governing their data handling. The standard API terms of service may not satisfy this.
- Cross-border data transfer: If the AI provider's servers are outside the EU, you need to ensure adequate data transfer mechanisms are in place.
- Data minimization: You must only send the minimum data necessary. Sending an entire employee record when the chatbot only needs to answer a question about benefits enrollment violates this principle.
- Right to erasure: If an employee asks to have their data deleted, you need to be able to demonstrate that the AI provider has also deleted it. That is hard to verify for logs and backups, and effectively impossible if the data has been used to train a model.
What "Shipped Through an AI Provider" Actually Means
Let's be precise about the risk scenarios, because they matter:
Scenario 1: PII in the Prompt
You send: Employee John Smith (SSN: 123-45-6789, salary: $85,000) asks about dental coverage reset date.
The AI provider receives this as part of the API call. It's processed, stored temporarily (and sometimes permanently), and may be logged. Even if the provider doesn't train on it, the data exists on their servers, subject to their security practices, not yours.
Scenario 2: RAG with PII
You build a retrieval-augmented generation system that searches your HR database and includes the results in the LLM prompt. The system works great — employees get instant answers. But the retrieval step sends SSNs, salaries, and addresses to the AI provider as context. The LLM never "sees" the database directly, but it receives the extracted data in every query.
This is arguably worse than Scenario 1, because it's systematic. Every query through the RAG system routes fresh PII to the AI provider. It's not a one-time mistake — it's an architectural decision that creates continuous compliance risk.
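One way to shrink that continuous exposure is to project retrieved rows onto a per-topic allowlist before they become prompt context. The sketch below is a minimal illustration, not a complete control: the `ALLOWED_COLUMNS` mapping, the column names, and the sample plan name and date are all made up for the example, and it assumes your retrieval step returns rows as dictionaries.

```python
from typing import Any

# Hypothetical mapping from question topic to the only columns the model needs.
ALLOWED_COLUMNS: dict[str, set[str]] = {
    "benefits": {"plan_name", "coverage_reset_date"},
    "payroll": {"pay_frequency", "next_pay_date"},
}


def build_context(rows: list[dict[str, Any]], topic: str) -> str:
    """Project retrieved rows onto the allowlist for this topic.

    Unknown topics get an empty allowlist, so nothing is sent by default.
    """
    allowed = ALLOWED_COLUMNS.get(topic, set())
    lines = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k in allowed}
        if kept:
            lines.append("; ".join(f"{k}={v}" for k, v in sorted(kept.items())))
    return "\n".join(lines)


# Illustrative record; the plan name and reset date are invented sample values.
retrieved = [{
    "name": "John Smith",
    "ssn": "123-45-6789",
    "salary": 85000,
    "plan_name": "Dental Plus",
    "coverage_reset_date": "2025-01-01",
}]
context = build_context(retrieved, "benefits")
print(context)  # coverage_reset_date=2025-01-01; plan_name=Dental Plus
```

The allowlist fails closed: a column nobody thought about is dropped rather than forwarded, which is the safer default when the database schema changes.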
Scenario 3: Embedding Generation
You use an AI provider's embedding API to vectorize employee documents for semantic search. The documents contain PII. The embeddings are stored in a vector database — which may be hosted by the same AI provider, or a different third party. The raw text may or may not be retained, but the embeddings themselves can sometimes be reversed or contain enough information to reconstruct sensitive data.
Scenario 4: Autonomous Agent Actions
This is the frontier risk. An AI coding agent that has access to your codebase might encounter environment variables with database credentials, config files with API keys, or documentation that includes employee contact information. If the agent's operations are processed through an external provider, all of this context becomes visible to them.
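A partial safeguard is to mask likely secrets before any environment or config data is handed to an agent. The following is a rough sketch under a naming-convention assumption: it masks variables whose names look like credentials, and the `SECRET_NAME` pattern and sample values are hypothetical. It will not catch secrets stored under innocuous names or embedded in file contents.

```python
import re

# Assumed convention: variables whose names contain these words hold secrets.
SECRET_NAME = re.compile(
    r"(KEY|TOKEN|SECRET|PASSWORD|PASSWD|CREDENTIAL|DATABASE_URL)",
    re.IGNORECASE,
)


def scrub_env(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with likely-secret values masked."""
    return {
        name: "[REDACTED]" if SECRET_NAME.search(name) else value
        for name, value in env.items()
    }


# Made-up sample environment for illustration.
sample_env = {
    "DATABASE_URL": "postgres://hr_app:hunter2@db.internal/hr",
    "OPENAI_API_KEY": "sk-example",
    "LOG_LEVEL": "info",
}
safe_env = scrub_env(sample_env)
```

Name matching over-redacts (a variable called `MONKEY_COUNT` would be masked too), which is the preferable failure mode here. In practice you would pass `safe_env` rather than `os.environ` to whatever builds the agent's context.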
Practical Risk Mitigation
If you're building AI tools that interact with sensitive data, here's what you need to do:
1. Data Classification First
Before you touch an AI provider, classify your data:
- Red (never route externally): SSNs, passwords, full financial account numbers, health diagnoses, biometric data
- Yellow (minimize and anonymize): Salaries, addresses, phone numbers, benefit details. Use only what's needed, and strip identifiers when possible.
- Green (acceptable for AI processing): Policy summaries, general procedure descriptions, non-personalized content
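To make the scheme enforceable rather than aspirational, it helps to encode it where code can check it. Here is a minimal sketch, assuming payloads are flat dictionaries keyed by field name; the field names in `FIELD_SENSITIVITY` are hypothetical and would come from your own schema.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    GREEN = 0   # acceptable for AI processing
    YELLOW = 1  # minimize and anonymize first
    RED = 2     # never route externally


# Hypothetical field names, tiered according to the Red/Yellow/Green scheme.
FIELD_SENSITIVITY = {
    "ssn": Sensitivity.RED,
    "password": Sensitivity.RED,
    "bank_account_number": Sensitivity.RED,
    "diagnosis": Sensitivity.RED,
    "salary": Sensitivity.YELLOW,
    "home_address": Sensitivity.YELLOW,
    "phone": Sensitivity.YELLOW,
    "policy_summary": Sensitivity.GREEN,
}


def classify(payload: dict) -> Sensitivity:
    """Return the highest sensitivity of any field; unknown fields count as RED."""
    return max(
        (FIELD_SENSITIVITY.get(field, Sensitivity.RED) for field in payload),
        default=Sensitivity.GREEN,
    )
```

Treating unknown fields as Red is a deliberate choice in this sketch: a new column added to the HR database stays inside your infrastructure until someone classifies it.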
2. Build a PII Stripper
Before any data reaches an AI provider, run it through a PII detection and redaction layer. Tools like Microsoft Presidio, Google Cloud DLP, or open-source alternatives can detect and mask SSNs, phone numbers, addresses, and other sensitive patterns. The AI processes redacted data, and you re-identify the response on your side.
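The redact-then-re-identify round trip can be sketched with plain regular expressions. This is a simplified stand-in for a real detector, not the API of Presidio or Cloud DLP: the patterns cover only a few US-style formats, the function names are made up for this example, and names or street addresses would need NER-based detection that regexes cannot provide.

```python
import re

# Simplified US-format patterns, for illustration only.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[ .-]\d{3}[ .-]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected values with numbered placeholders.

    Returns the redacted text and a placeholder -> original mapping,
    which stays on your side and is never sent to the provider.
    """
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS.items():
        def substitute(match: re.Match, label: str = label) -> str:
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


def reidentify(text: str, mapping: dict[str, str]) -> str:
    """Restore original values in the provider's response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text


# The phone number is an invented sample value.
prompt = "Employee SSN 123-45-6789, phone 555-123-4567, asks about dental coverage."
redacted, mapping = redact(prompt)
print(redacted)
# Employee SSN <SSN_1>, phone <PHONE_2>, asks about dental coverage.
```

Only `redacted` goes to the provider. Calling `reidentify(response_text, mapping)` on the way back restores the values, provided the model echoes the placeholders unchanged, which is worth verifying for your model and prompt.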
3. Choose the Right Provider Tier
Most AI providers offer different tiers with different data handling:
- Consumer/free tiers: Data may be used for training. Never route PII here.
- Standard API: Usually not used for training, but data is retained temporarily. Read the specific terms carefully.
- Enterprise/zero-retention tiers: Data is not retained after processing and not used for training. These exist but cost more and may have latency implications.
- Self-hosted models: The only way to guarantee data never leaves your infrastructure. Open-weight models like Llama, Mistral, or Phi can run on your own servers.
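Tier choice can be made per request instead of per project. The sketch below shows one possible routing policy, assuming each request has already been labeled "green", "yellow", or "red"; the `Tier` names and the mapping are illustrative assumptions, not a recommendation for any particular provider.

```python
from enum import Enum


class Tier(Enum):
    STANDARD_API = "standard_api"
    ZERO_RETENTION = "zero_retention"
    SELF_HOSTED = "self_hosted"
    # Consumer/free tiers are deliberately absent: nothing should route there.


# Assumed policy: the labels and the mapping are illustrative.
ROUTING_POLICY = {
    "green": Tier.STANDARD_API,
    "yellow": Tier.ZERO_RETENTION,
    "red": Tier.SELF_HOSTED,
}


def choose_tier(sensitivity: str) -> Tier:
    """Pick a deployment tier; anything unrecognized stays in-house."""
    return ROUTING_POLICY.get(sensitivity.lower(), Tier.SELF_HOSTED)
```

For example, `choose_tier("yellow")` returns `Tier.ZERO_RETENTION`, while a typo or an unlabeled request falls back to `Tier.SELF_HOSTED`.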
4. Get the Agreements Right
For any tier that processes personal data, you need:
- A Data Processing Agreement (DPA) under GDPR
- A Business Associate Agreement (BAA) if handling health data
- Clear contractual guarantees about data retention, sub-processor disclosure, and breach notification timelines
5. Audit and Document
Maintain a data flow map that shows exactly what personal data goes where, including through AI providers. Regulators increasingly expect this level of documentation. If you can't explain where an employee's SSN went when they asked the chatbot about their 401(k), you're already in trouble.
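A data flow map is easier to keep current if every outbound call writes its own entry. Below is a minimal sketch of such a record, assuming JSON-lines logging; the function name, the record fields, and the sample provider and payload are hypothetical.

```python
import json
from datetime import datetime, timezone


def data_flow_record(request_id: str, provider: str, payload: dict, purpose: str) -> str:
    """Build one JSON line describing what left your infrastructure.

    Only field names are recorded, never values, so the audit log
    does not become a second copy of the PII.
    """
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "provider": provider,
        "purpose": purpose,
        "fields_sent": sorted(payload),
    }
    return json.dumps(record)


# Invented sample values for illustration.
line = data_flow_record(
    request_id="req-001",
    provider="example-llm-provider",
    payload={"plan_name": "Dental Plus", "coverage_reset_date": "2025-01-01"},
    purpose="benefits question",
)
```

With entries like this appended to a log, answering "where did this employee's SSN go?" becomes a query over `fields_sent` rather than an archaeology project.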
The Bottom Line
Sending sensitive personal information through an AI provider isn't just a privacy concern; it's a multi-jurisdictional compliance minefield. A single employee record in a prompt can implicate HIPAA, GLBA, state privacy laws, and GDPR at once, depending on who you are and what the record contains. The AI provider's infrastructure, not just yours, becomes part of the attack surface.
The good news: the industry is maturing. Self-hosted models are increasingly capable. PII detection tools are getting better. Provider tiers with zero retention are becoming standard. But the gap between what developers build and what compliance requires is still wide, and the penalties for falling through it are getting larger.
Every employee who asks your AI chatbot about their salary, benefits, or personal information is trusting you with data that regulators, class action attorneys, and state AGs are increasingly watching for. Make sure that trust is warranted.