Your AI Is Lying to Your Customers (And Prompt Engineering Won't Fix It)
I'm Ali, building Provia — an AI-powered sales platform — from Gaza. This is the hardest lesson I learned building AI for production.
The Azure Storm Hoodie That Never Existed
A customer messages your store's AI chatbot at 11 PM on a Tuesday:
"Hey, do you have any blue hoodies?"
The AI responds instantly, friendly and confident:
"Absolutely! Check out our Azure Storm Hoodie -- it's one of our best sellers. Premium cotton blend, available in sizes S through XL. Right now it's $49.99 and we only have a few left in stock. Want me to add one to your cart?"
There is one problem. The Azure Storm Hoodie does not exist. It has never existed. Your store has never sold a product with that name, at that price, in those sizes. The AI made the whole thing up -- the name, the price, the material, the scarcity angle. Every single detail was fabricated.
And it sounded perfect.
This is not a hypothetical. This is what happened when I built Provia, an AI-powered sales chatbot platform for e-commerce stores. The AI was connected to a real product database. It had a system prompt explicitly telling it to only recommend products from the catalog. And it still invented products out of thin air, confidently, fluently, and convincingly enough that customers tried to buy them.
If you are building any AI system that references real-world data -- product catalogs, documentation, inventory, pricing -- you need to read this. Because the fix is not what you think.
The Prompt Engineering Arms Race
When I first discovered the hallucination problem, I did what every developer does: I rewrote the system prompt.
Attempt 1: The Polite Instruction
```
You are a sales assistant for this store. Only recommend products from the database.
```
Result: The AI followed this instruction about 80% of the time. The other 20%, it cheerfully invented products, especially when the customer asked for something specific that was not in the catalog. Instead of saying "we don't carry that," it created something plausible.
Attempt 2: The Stern Warning
```
IMPORTANT: Never make up product names. Never invent prices. Only reference
products that exist in the catalog. If a product is not in the database,
say you don't have it.
```
Result: Better. Maybe 90% compliance. But the remaining 10% was worse -- the AI got creative. Instead of inventing whole products, it would take a real product name and "adjust" it. A real product called "Classic Tee" might become "Classic Premium Tee" at a slightly different price. Close enough to seem real, wrong enough to cause problems.
Attempt 3: The Nuclear Option
```
CRITICAL RULE - ZERO TOLERANCE:
You MUST NOT, under ANY circumstances, mention ANY product that is not
EXPLICITLY provided in the search results. If you mention a product name
that was not in the data provided to you, you are FAILING at your job.
When in doubt, say "let me check our catalog" and search again.
```
Result: 95% compliance. The AI almost always stuck to real products. But "almost always" is not good enough when real customers are trying to spend real money. Five hallucinated product recommendations per hundred conversations means that if your store handles 500 conversations a day, twenty-five customers are being told about products that do not exist. Every single day.
Why 95% Is Not Good Enough
I want to sit with that number for a second. Ninety-five percent accuracy sounds impressive until you calculate the cost.
Five percent failure rate. At 500 conversations a day, that is twenty-five conversations a day with fabricated product recommendations. A customer gets excited about a product, tries to find it, cannot, contacts support, gets confused, loses trust. Some percentage of those customers never come back. At scale, you are bleeding revenue from a wound you cannot see unless you are monitoring every conversation.
And that is the optimistic case. The pessimistic case is a customer who buys something based on a hallucinated description -- the right product name but wrong specs, wrong price, wrong availability. Now you have a customer service nightmare, a potential chargeback, and depending on your jurisdiction, a legal liability.
Why Prompt Engineering Fundamentally Cannot Solve This
After months of iteration, I stopped trying to fix the prompt and started thinking about why prompt engineering fails for this class of problem. The answer is structural, not a matter of finding the right words.
LLMs Are Probabilistic, Not Rule-Following
A system prompt is not a set of rules. It is a statistical bias. When you write "never invent product names," you are pushing the probability distribution toward compliance, but you are not setting it to zero. The model does not have a boolean flag called follow_instructions that you can set to true. It has billions of parameters that collectively determine what token comes next, and "the next plausible token" sometimes means inventing a product name.
This is not a bug. It is how the technology works. You cannot prompt your way out of it any more than you can ask a river to flow uphill by putting up a sign.
Helpfulness Is the Enemy
LLMs are trained to be helpful. When a customer asks "do you have blue hoodies?" the model is under enormous pressure -- from its training, from RLHF, from everything it has learned about being a good assistant -- to say yes. Saying "I don't see any blue hoodies in our catalog" feels like failure to the model. Saying "Check out our Azure Storm Hoodie!" feels like success.
The more specific the customer's question, the stronger this pressure becomes. Vague questions ("what do you sell?") are easy to handle with real data. Specific questions ("do you have a size 10 navy waterproof hiking boot under $80?") create a scenario where the model desperately wants to find a match, and if the real data does not provide one, the model's next best option is to create one.
You Cannot Unit Test Prompt Compliance
This is the part that should terrify you. With traditional code, you write a function, you write tests, you know it works or it does not. With prompt engineering, you cannot write a test that guarantees the model will never hallucinate. You can test a thousand inputs and get perfect results, then the thousand-and-first input triggers a hallucination you never anticipated.
You cannot achieve deterministic behavior from a non-deterministic system through instructions alone.
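The best you can do is measure. A useful rule of thumb from statistics (the "rule of three"): if you observe zero failures in n trials, the 95% upper confidence bound on the true failure rate is still roughly 3/n. A minimal sketch of that estimate, in plain TypeScript (the function name is mine, not from any eval framework):

```typescript
// Estimate an upper bound on the hallucination rate from eval runs.
// You can never prove the rate is zero -- only bound it.
function upperBound95(trials: number, failures: number): number {
  // "Rule of three": zero observed failures in n trials still only
  // bounds the true rate at about 3/n with 95% confidence.
  if (failures === 0) return 3 / trials;
  // Normal approximation for non-zero failure counts.
  const p = failures / trials;
  return p + 1.96 * Math.sqrt((p * (1 - p)) / trials);
}

// A thousand clean test runs only prove the rate is likely below 0.3% --
// at 500 conversations a day, that still allows a failure every day or so.
```

In other words: a perfect eval suite does not mean a perfect system. It means your sample was too small to catch the failure mode yet.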
Context Window Pollution
Here is a subtlety that took me several sessions to discover. Even if the AI starts a conversation by correctly searching the database, as the conversation grows longer, the original search results get pushed further back in the context window. The AI starts "remembering" the general vibe of the products rather than the specific details. Product names drift. Prices shift. Features get mixed between products. The longer the conversation, the more likely the AI is to hallucinate -- not because it is ignoring your prompt, but because the real data is being diluted by tokens of conversation history.
The Architectural Solution: Removing the Ability to Lie
The breakthrough came when I stopped thinking about what I told the AI and started thinking about what I allowed the AI to do.
The core insight: prompt engineering controls tone; architecture controls behavior.
Instead of instructing the AI "don't make things up," I removed its ability to make things up. The mechanism: OpenAI function calling (tool use).
How It Works
You define a tool that the AI must call to get product information:
```typescript
const tools = [{
  type: "function",
  function: {
    name: "search_products",
    description: "Search the store's product catalog. MUST be called before mentioning any product.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "What the customer is looking for",
        },
        max_price: {
          type: "number",
          description: "Maximum budget if specified",
        },
        min_price: {
          type: "number",
          description: "Minimum price if specified",
        },
      },
      required: ["query"],
    },
  },
}];
```
The flow becomes:
1. Customer asks about products.
2. The AI must call `search_products` -- it is the only tool available for product data.
3. `search_products` queries the real database (PostgreSQL with pgvector for semantic search).
4. Real results come back as tool response messages.
5. The AI formulates its response using only the returned data.
Here is the critical difference: if a product does not exist in the database, it cannot appear in the search results, which means the AI cannot reference it. The hallucination is not suppressed by instruction -- it is prevented by architecture. The AI literally does not have the information needed to fabricate a product, because it only gets product data through the controlled pipeline.
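The loop that enforces this can be sketched in a few lines. This is a simplified outline, not Provia's actual code: the `ChatClient` interface stands in for the OpenAI SDK call, the real API also threads a `tool_call_id` through tool messages, and the cap of five steps is an arbitrary safety limit.

```typescript
type ToolCall = { name: string; arguments: string };
type ChatTurn =
  | { content: string; toolCalls?: undefined }
  | { content?: undefined; toolCalls: ToolCall[] };

// Stand-in for the OpenAI chat completions call with `tools` attached.
interface ChatClient {
  complete(messages: object[]): Promise<ChatTurn>;
}

async function answerCustomer(
  client: ChatClient,
  searchProducts: (query: string) => Promise<object>,
  userMessage: string
): Promise<string> {
  const messages: object[] = [
    { role: "system", content: "Only discuss products returned by search_products." },
    { role: "user", content: userMessage },
  ];
  // Loop until the model answers in text instead of requesting a tool call.
  for (let step = 0; step < 5; step++) {
    const turn = await client.complete(messages);
    if (!turn.toolCalls) return turn.content ?? "";
    for (const call of turn.toolCalls) {
      // Product data enters the context ONLY through this tool result.
      const result = await searchProducts(JSON.parse(call.arguments).query);
      // Simplified: the real API also requires the matching tool_call_id here.
      messages.push({ role: "tool", name: call.name, content: JSON.stringify(result) });
    }
  }
  return "Let me check our catalog and get back to you.";
}
```

The important property is structural: the only code path that puts product data into the context is the tool-result push, and that data comes straight from the database.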
The Search Pipeline
The search function itself uses a fallback chain to maximize the chance of finding relevant real products:
```typescript
async function searchProducts(storeId: string, query: string) {
  // 1. Semantic search with pgvector (cosine similarity)
  const embedding = await generateEmbedding(query);
  const { data: results } = await supabase.rpc("search_products", {
    query_embedding: embedding,
    match_threshold: 0.3,
    store_id: storeId,
  });
  if (results?.length) return { status: "found", products: results };

  // 2. Fallback: text match on name and description
  const { data } = await supabase
    .from("products")
    .select("*")
    .eq("store_id", storeId)
    .or(`name.ilike.%${query}%,description.ilike.%${query}%`);
  if (data?.length) return { status: "found", products: data };

  // 3. Final fallback: return available categories
  const categories = await getStoreCategories(storeId);
  return {
    status: "no_matches",
    message: "No matching products found",
    categories: categories,
  };
}
```
The semantic search (step 1) handles fuzzy matching -- a customer asking for "blue hoodie" will match a product called "Ocean Pullover Sweatshirt" because the embeddings capture meaning, not just keywords. The text fallback (step 2) catches exact matches the embedding might miss. And the category fallback (step 3) gives the AI something useful to say even when there genuinely is no match: "We don't have blue hoodies, but we do carry jackets, sweaters, and accessories. Want me to show you what we have?"
No fabrication. No hallucination. Just real data or an honest acknowledgment of absence.
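For intuition about what that `match_threshold: 0.3` actually does: pgvector computes a cosine similarity between the query embedding and each stored product embedding, and only rows above the threshold come back. Here is the same logic sketched in plain TypeScript — illustrative only, since the real comparison runs inside Postgres:

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Keep only products whose embedding clears the threshold, best first.
// Mirrors the match_threshold of 0.3 used in the pgvector call above.
function semanticMatches<T extends { embedding: number[] }>(
  queryEmbedding: number[],
  products: T[],
  threshold = 0.3
): T[] {
  return products
    .map(p => ({ p, score: cosineSimilarity(queryEmbedding, p.embedding) }))
    .filter(x => x.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .map(x => x.p);
}
```

A low threshold like 0.3 is deliberately permissive: better to return a loosely related real product than to return nothing and tempt the model to improvise.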
The Evolution: Four Sessions of Hard Lessons
This solution did not appear fully formed. It evolved over multiple development sessions, each one teaching something about how AI systems behave in production.
Session 1: Naive Chat
The initial implementation was a basic chat completion call with a system prompt and conversation history. The AI had the store's product list injected into the system prompt as a JSON blob. This worked for small catalogs (under 20 products) but fell apart with larger ones -- the context window could not hold the entire catalog, and even when it could, the AI would mix up details between products. Hallucination rate: roughly 20%.
Session 3: Function Calling
Introducing function calling was the turning point. Instead of pre-loading products into the prompt, the AI had to actively search for them. Hallucination of non-existent products dropped to effectively zero. The AI could still occasionally get details wrong (misquoting a price from the results), but it could no longer invent products wholesale.
Session 5: Token Optimization
With function calling working, a new problem emerged: cost. Every search call added tokens. Long conversations meant long context windows. History limits and prompt compression brought costs under control without sacrificing accuracy. The key optimization was limiting conversation history to the most recent messages rather than sending the entire thread.
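The history bound itself is a few lines. A minimal sketch, assuming a simple message shape — the cutoff of 8 turns is illustrative, not Provia's actual limit:

```typescript
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

// Keep the system prompt(s) plus only the most recent turns.
// Older turns are dropped entirely rather than summarized in this sketch.
function boundHistory(messages: Message[], maxTurns = 8): Message[] {
  const system = messages.filter(m => m.role === "system");
  const rest = messages.filter(m => m.role !== "system");
  return [...system, ...rest.slice(-maxTurns)];
}
```

Beyond cost, this also limits context pollution: stale product details from twenty turns ago simply fall out of the window.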
Session 6: Two-Context Architecture
The final refinement was splitting the AI into two separate contexts:
- Search context: Zero conversation history. Receives only the customer's current message. Decides what to search for. This prevents context pollution -- the search decision is based purely on what the customer just said, not on a drifting conversation.
- Response context: Receives bounded conversation history plus search results. Formulates the actual reply.
This separation eliminated the last category of errors: the AI "remembering" products from earlier in the conversation and subtly misquoting them.
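The split can be sketched as a single orchestrating function. Everything here is illustrative — `decideSearch`, `search`, and `respond` stand in for the two model calls and the database query:

```typescript
async function handleMessage(
  decideSearch: (msg: string) => Promise<string>,  // search context: ZERO history
  search: (query: string) => Promise<object>,      // real database lookup
  respond: (history: string[], results: object) => Promise<string>,
  history: string[],
  currentMessage: string
): Promise<string> {
  // The search decision sees only the current message -- no drift possible.
  const query = await decideSearch(currentMessage);
  const results = await search(query);
  // The response sees bounded history plus fresh, authoritative results.
  const bounded = history.slice(-10);
  return respond([...bounded, currentMessage], results);
}
```

The asymmetry is the point: the context that decides *what data to fetch* is kept pristine, while the context that decides *how to phrase the reply* gets conversational memory.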
The Analogy That Makes It Click
Prompt engineering is like putting a "Please Don't Steal" sign in a retail store. Most people will respect it. Some will not. And you have no way to guarantee compliance.
Architecture -- function calling with controlled data access -- is like putting the merchandise behind a counter. The customer has to ask a clerk for what they want. The clerk can only hand over items that are physically on the shelves. The customer cannot grab something that does not exist because the store's inventory is the single source of truth.
The sign might work 95% of the time. The counter works 100% of the time. When real money is on the line, you need the counter.
The Monitoring That Caught It
One detail worth calling out: I only discovered the hallucination problem because Provia includes an admin panel where store owners can read chat transcripts. A store owner noticed a customer asking about a product that was not in the catalog — and the AI confidently recommending it.
Without that monitoring, this failure would have been invisible. The customer would have gotten confused, maybe left, and we would have seen a dip in conversion rates without understanding why.
Build monitoring from day one. Every AI response that references real-world data should be auditable. If you cannot trace every product recommendation back to a real database record, you have a hallucination problem that you simply have not found yet.
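Even a crude automated audit beats reading transcripts by hand. A sketch of the idea — the capitalized-phrase heuristic is deliberately naive (real extraction would use structured output from the model), and all names here are illustrative:

```typescript
type TranscriptMessage = { role: string; content: string };
type CatalogProduct = { name: string; price: number };

// Flag assistant messages mentioning product-like phrases with no catalog record.
function findHallucinatedMentions(
  transcript: TranscriptMessage[],
  catalog: CatalogProduct[]
): string[] {
  const names = new Set(catalog.map(p => p.name.toLowerCase()));
  const flagged: string[] = [];
  for (const msg of transcript) {
    if (msg.role !== "assistant") continue;
    // Naive heuristic: runs of two or more capitalized words look like product names.
    const candidates = msg.content.match(/\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b/g) ?? [];
    for (const c of candidates) {
      if (!names.has(c.toLowerCase())) flagged.push(c);
    }
  }
  return flagged;
}
```

Run something like this over every day's transcripts and alert on anything flagged; false positives are cheap, invisible hallucinations are not.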
Beyond Chatbots: Where This Pattern Applies
This is not just about chatbots. The same architectural principle applies anywhere an AI generates content that references real data:
- Documentation bots that answer questions about your API. Without tool-gated access to the actual docs, the AI will invent endpoints, parameters, and response formats.
- Customer support agents that reference order history. Without forced database lookups, the AI will fabricate order statuses and tracking numbers.
- Content generation that cites statistics. Without tool access to the real data source, the AI will generate plausible-sounding but completely made-up numbers.
- Internal tools that query dashboards or reports. Without architectural constraints, the AI will synthesize data that feels right but is not.
The pattern is always the same: if the AI can generate a plausible-sounding answer without consulting the real data, it sometimes will. The fix is always the same: make the real data the only source the AI can draw from.
The Cost Argument (It's Negligible)
A common objection: "Function calling adds latency and cost." Let me address this with real numbers.
A single function call adds roughly one extra API round-trip. In practice, this means:
- Latency: 200-500ms additional per search call. For a conversational chatbot, this is imperceptible -- customers expect a brief pause while the "agent" checks the catalog.
- Token cost: The tool definition adds about 150 tokens to each request. At current API pricing, that is approximately $0.00001 per message. Even at 100,000 messages per month, the overhead is roughly a dollar.
Compare that cost to one customer who tries to buy a hallucinated product, contacts support, leaves a bad review, and never returns. The architectural approach is not just more reliable -- it is cheaper than dealing with the consequences of hallucination.
Is YOUR AI Architecturally Safe? A Checklist
If you are building an AI system that references real-world data, run through this list:
Data Access
- [ ] Can the AI generate responses about real entities (products, orders, docs) without querying the actual data source?
If yes, you have a hallucination risk, regardless of your prompt.
Tool Design
- [ ] Is every real-world data access gated behind a function call / tool?
- [ ] Does the AI receive data ONLY through tool responses, never pre-loaded in the system prompt?
- [ ] Are tool responses the single source of truth for entity-specific information?
Failure Handling
- [ ] When a search returns no results, does the AI have a graceful fallback (categories, suggestions) instead of being tempted to fabricate?
- [ ] Is the "no results" path explicitly designed and tested?
Context Management
- [ ] Is conversation history bounded to prevent context pollution?
- [ ] Are search decisions isolated from conversation drift?
- [ ] Are old tool results excluded from the context to prevent stale data references?
Monitoring
- [ ] Can you read every AI-generated response that references real data?
- [ ] Can you trace each entity mention back to a real database record?
- [ ] Are you actively looking for hallucinations, or waiting for customers to report them?
If your answer in the "Data Access" section was yes, you have work to do.
The Uncomfortable Truth
Here is what I wish someone had told me before I spent weeks iterating on prompts:
You cannot instruct your way to reliability.
Prompt engineering is essential for controlling tone, personality, conversation flow, and response format. It is the right tool for shaping how the AI communicates. But it is the wrong tool for constraining what the AI communicates when "what" needs to be grounded in reality.
For that, you need architecture. You need to design systems where the AI physically cannot reference data it did not receive from a trusted source. Function calling is one implementation of this principle. RAG with strict citation requirements is another. The specific mechanism matters less than the principle: do not rely on instructions to constrain behavior that has real-world consequences.
Your AI is not lying to your customers out of malice. It is lying because you gave it the ability to speak without the constraint of truth. Take away the ability, and the lying stops.
Not sometimes. Not 95% of the time. Completely.
I'm documenting my entire journey building an AI sales platform from Gaza. Follow me @AliMAfana for more real lessons from production AI.
Previous article: My AI Kept Recommending Pajamas for Date Night — Here's Why