
How I Built a Copilot AI at VTEX: RAG, Multi-Agent, and What I Learned

The real story of building an AI copilot for enterprise e-commerce — from a third-party chatbot that couldn't handle B2B complexity to a custom RAG system with reranking and multi-agent orchestration.

AI RAG Multi-Agent VTEX BuildingInPublic

None of Us Had Done This Before

I joined VTEX to work on their AI copilot — an assistant designed to help merchants manage their e-commerce operations. The team was small, cross-functional, and scrappy. We weren’t a team of AI researchers. Nobody had built a RAG pipeline before. Nobody had shipped a multi-agent system.

But this was late 2024. Most engineering teams hadn’t. LLM tooling was still maturing, agentic frameworks were in their early days, and “AI engineer” wasn’t a standard job title yet.

What we did have was something I’ve learned matters more than expertise: the refusal to stop until the thing works. Every person on that team had the same instinct — figure it out, learn fast, ship faster.

The mission was deceptively simple: build an AI assistant that helps merchants manage their e-commerce operations.

Simple to say. Brutal to execute.

Why Enterprise B2B Makes This Hard

VTEX is a publicly traded enterprise e-commerce platform. It’s not a consumer app where every user does roughly the same thing. Every B2B client is a world of its own:

  • A fashion retailer in Brazil uses the platform completely differently than an electronics distributor in Mexico
  • A merchant with 50 SKUs has different needs than one with 500,000
  • Some clients use the platform for their entire operation; others use it as one piece in a complex tech stack

There’s no “typical user.” Every question has different context, different expectations, different levels of platform knowledge. An AI assistant that gives generic answers is useless here.

This is what makes enterprise AI fundamentally harder than consumer AI, and it shaped every architectural decision we made.

Version Zero: The Third-Party Experiment

We moved fast. Within two weeks of kicking off, we had something running.

The first version wasn’t custom-built at all. We embedded a third-party chatbot solution and pointed it at our documentation. It was the fastest path to getting something in front of users and learning from it.

It worked for about a week. Here’s a representative example of why it didn’t survive:

Merchant asks: “My international customers are seeing the wrong currency at checkout.”

What the chatbot answered:

“To change your store’s currency, go to Admin > Store Settings > Currency. You can set the default currency for your store there.”

Generic. Useless. This isn’t a “default currency” problem — it’s a multi-currency checkout issue that depends on trade policy routing, payment gateway configuration, and geo-location rules. The merchant knew that. The chatbot didn’t. It just pattern-matched “currency” and surfaced the first vaguely relevant doc.

What we needed it to answer: Identify the actual problem domain (trade policy routing), retrieve the specific documentation for multi-currency checkout, ask which country/region is affected, and suggest checking whether the geo-routing rules match the customer’s location to the correct sales channel.

That gap — between a keyword match and actual understanding — is the entire reason we had to build it ourselves.

The pattern

Off-the-shelf chatbot solutions are built for simple support operations and small-to-medium customers. They can’t handle multi-step reasoning, can’t access client-specific context, and give the same surface-level answers regardless of who’s asking. Enterprise B2B merchants saw through it immediately.

But that experiment wasn’t a failure — it was the most valuable two weeks of the project. It taught us exactly what the real system needed:

  1. Ingest massive documentation — Enterprise platforms have thousands of pages of docs, APIs, and guides. The system needed to consume all of it.
  2. Handle different complexity levels — Simple questions should get fast, cheap answers. Complex questions should trigger deeper reasoning with more capable models.
  3. Persistent memory — Merchants don’t ask one question and leave. They have conversations. The system needed to remember context across messages.
  4. Execute tools — Not just answer questions, but actually do things on the platform.

Two weeks in, and we already had a clear spec for v1. Time to build it ourselves.

Building v1: The Architecture

The Stack

We chose our tools based on one criterion: what lets us ship the fastest without locking us in.

  • Agent orchestration framework — We needed stateful, graph-based workflows that could handle the multi-step reasoning the off-the-shelf chatbot couldn’t. This was the backbone of the entire system.
  • Managed vector store — For retrieval. Fast, managed, one less thing to operate ourselves.
  • Multi-provider LLM strategy — We didn’t marry a single AI provider. The system routes each query to the best model for its complexity level across multiple providers. This also gives resilience — if one provider has an outage, the system degrades gracefully instead of going dark.
  • Reranking — This was the single biggest improvement to retrieval quality. Raw vector search gets you in the neighborhood. Reranking gets you to the right house.

                      User Query
                          │
                          ▼
                 ┌──────────────────┐
                 │      Intent      │
                 │  Classification  │
                 └────────┬─────────┘
           ┌──────────────┼──────────────┐
           ▼              ▼              ▼
      ┌──────────┐  ┌──────────┐  ┌──────────┐
      │  SIMPLE  │  │  MEDIUM  │  │ COMPLEX  │
      │          │  │          │  │          │
      │ Fast LLM │  │ Mid-tier │  │ Frontier │
      │ 1 pass   │  │ LLM      │  │ LLM      │
      │ <1s      │  │ Multi-   │  │ Agent +  │
      │          │  │ source   │  │ Tools    │
      └────┬─────┘  └────┬─────┘  └────┬─────┘
           │             │             │
           │        ┌────┴─────┐  ┌────┴─────┐
           │        │Retrieval │  │Retrieval │
           │        │+ Rerank  │  │+ Rerank  │
           │        └────┬─────┘  │+ Memory  │
           │             │        └────┬─────┘
           │             │             │
           ▼             ▼             ▼
      ┌────────────────────────────────────┐
      │          Context Assembly          │
      │  (docs + metadata + conversation)  │
      └─────────────────┬──────────────────┘
                        │
                        ▼
                    Response
               (streamed to user)
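The multi-provider piece of this architecture is the easiest to underestimate. A minimal sketch of the fallback chain, assuming a generic `Provider` interface (all names here are illustrative, not VTEX's actual code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    generate: Callable[[str], str]  # prompt -> completion

def generate_with_fallback(prompt: str, providers: list[Provider]) -> str:
    """Try each provider in order; degrade gracefully instead of going dark."""
    errors = []
    for provider in providers:
        try:
            return provider.generate(prompt)
        except Exception as exc:  # in production: catch provider-specific errors
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))

# Usage: a flaky primary falls through to a working backup.
def flaky(prompt: str) -> str:
    raise TimeoutError("upstream timeout")

def backup(prompt: str) -> str:
    return f"answer to: {prompt}"

result = generate_with_fallback("hello", [Provider("primary", flaky),
                                          Provider("backup", backup)])
# result == "answer to: hello"
```

The real system also routes by model capability per tier, but the resilience property comes from this same try-next-provider shape.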

The Documentation Pipeline

This was one of the hardest problems and the one nobody talks about.

Enterprise documentation is massive: API references, merchant guides, developer docs, release notes, help center articles — across multiple languages. None of it was structured in a way that’s friendly to RAG ingestion.

We built a pipeline to normalize everything into an internal standard:

  1. Crawl — Pull documentation from multiple sources
  2. Parse — Strip formatting noise, extract meaningful sections
  3. Chunk — Semantic chunking that respects document boundaries. Not fixed 512-token windows — actual sections that make sense on their own.
  4. Enrich — Add metadata: document type, product area, language, last updated date
  5. Embed + Index — Generate embeddings and push to the vector store

The key insight: garbage in, garbage out. We spent more time on this pipeline than on the LLM layer itself. If your chunks are bad, no amount of prompt engineering saves you.
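The heading-aware chunking step can be sketched in a few lines. This is a toy version of the idea (split on markdown headings, fall back to paragraph breaks for oversized sections), not the production pipeline:

```python
import re

def chunk_by_headings(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split a markdown doc at heading boundaries instead of fixed windows.

    Each chunk is a section that makes sense on its own; oversized
    sections are split on paragraph breaks as a fallback.
    """
    # Split just before every markdown heading (#, ##, ###, ...).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
        else:
            # Fallback: split an oversized section on blank lines.
            buf = ""
            for para in section.split("\n\n"):
                if buf and len(buf) + len(para) > max_chars:
                    chunks.append(buf.strip())
                    buf = ""
                buf += para + "\n\n"
            if buf.strip():
                chunks.append(buf.strip())
    return chunks

doc = "# Checkout\nIntro text.\n\n## Multi-currency\nDetails here."
print(chunk_by_headings(doc))
# Two chunks: the "# Checkout" section and the "## Multi-currency" section
```

The enrichment step then attaches metadata (document type, product area, language, last-updated date) to each chunk before embedding.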

Published benchmarks from companies like Cohere and Pinecone consistently show that semantic chunking combined with reranking can push retrieval hit rates from the ~58% range to over 90%, with NDCG improvements of 60%+. Our experience building this pipeline confirmed those numbers — the jump in quality after fixing the ingestion layer was dramatic.

The chunking lesson

Fixed-size chunking is the default in every tutorial. It’s also wrong for most real use cases. Documentation has natural boundaries — headings, sections, code blocks. Respecting those boundaries dramatically improves retrieval relevance.

Routing by Complexity

Not every question needs an expensive frontier model call. We built a routing layer that classifies queries by complexity and routes to the right model:

  • Simple — “What’s the API endpoint for product search?” → Smaller, faster model, single retrieval pass. Sub-second response. Cost: fractions of a cent.
  • Medium — “How do I set up my catalog import?” → Mid-tier model, multi-source retrieval, step-by-step guidance.
  • Complex — “Why did my checkout conversion drop after the last deployment?” → Full agent with the most capable model available, tool access, multiple retrieval passes, cross-referencing data sources.

This wasn’t just about cost — though industry data shows intelligent routing can cut average cost per query by 30-40% by sending 90%+ of queries to lighter models without quality loss. It was about latency. Simple questions should feel instant. Nobody wants to wait 8 seconds for an API endpoint URL.
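The routing layer reduces to classify-then-dispatch. A toy version, with a keyword heuristic standing in for the real classifier (tier names and rules are illustrative; in practice the classification itself was a cheap model call):

```python
from enum import Enum

class Tier(Enum):
    SIMPLE = "fast-model"       # single retrieval pass, sub-second
    MEDIUM = "mid-tier-model"   # multi-source retrieval, step-by-step guidance
    COMPLEX = "frontier-model"  # full agent with tools + memory

def classify(query: str) -> Tier:
    """Toy heuristic classifier; stands in for a cheap LLM/classifier call."""
    q = query.lower()
    if any(w in q for w in ("why", "drop", "broken", "debug")):
        return Tier.COMPLEX
    if any(w in q for w in ("how do i", "set up", "configure")):
        return Tier.MEDIUM
    return Tier.SIMPLE

def route(query: str) -> str:
    # Each tier maps to a model + retrieval strategy; here we return the model.
    return classify(query).value

print(route("What's the API endpoint for product search?"))  # fast-model
print(route("How do I set up my catalog import?"))           # mid-tier-model
print(route("Why did my checkout conversion drop?"))         # frontier-model
```

The dispatch shape matters more than the classifier: once queries carry a tier, cost, latency budget, and retrieval depth all hang off the same decision.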

Memory That Actually Works

Conversational memory in AI systems sounds simple until you try to build it for production.

The challenge: merchants have conversations that span dozens of messages. They reference things they said 15 messages ago. They switch topics and come back. A naive “stuff the last N messages into context” approach runs into token limits fast and dilutes relevance.

The approach that worked for us: structured memory that separates what the user is trying to do from the raw conversation history. The system maintains a running summary of the current task, the key entities mentioned, and the open questions — independent of the raw message log.
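In miniature, structured memory looks something like this (field names are illustrative; the real system updates the summary and entities with model calls rather than by hand):

```python
from dataclasses import dataclass, field

@dataclass
class ConversationMemory:
    """Separates what the user is trying to do from raw message history."""
    task_summary: str = ""                                   # running goal summary
    entities: dict[str, str] = field(default_factory=dict)   # SKUs, regions, ...
    open_questions: list[str] = field(default_factory=list)
    recent_messages: list[str] = field(default_factory=list)
    max_recent: int = 6

    def add_message(self, message: str) -> None:
        self.recent_messages.append(message)
        # Only the last few raw messages stay in context; older turns
        # survive only through the summary, entities, and open questions.
        self.recent_messages = self.recent_messages[-self.max_recent:]

    def to_context(self) -> str:
        """Compact context block sent to the model instead of the full log."""
        return "\n".join([
            f"Task: {self.task_summary}",
            f"Entities: {self.entities}",
            f"Open questions: {self.open_questions}",
            "Recent messages:",
            *self.recent_messages,
        ])

mem = ConversationMemory(task_summary="Fix wrong currency at checkout")
mem.entities["region"] = "Mexico"
for i in range(10):
    mem.add_message(f"message {i}")
print(mem.to_context())  # only the last 6 raw messages appear
```

The point is that context size stays bounded no matter how long the conversation runs, while the task state survives topic switches.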

The Hard Parts Nobody Warns You About

Operating AI in Production

Building the system was the fun part. Operating it was the hard part.

SLAs matter. When your AI copilot is embedded in a customer support flow, “it’s usually fast” isn’t good enough. You need consistent latency, reliable uptime, and graceful degradation when things go wrong.

An unoptimized RAG pipeline easily takes 6-10 seconds end-to-end. The breakdown is roughly: embedding ~200ms, vector search ~50ms, reranking ~150ms, LLM generation 4-8 seconds, plus network overhead and any agent or tool steps on top. You have to fight for every millisecond.

What “things going wrong” looks like in an AI system:

  • Embedding API has a latency spike → Retrieval times double → Response times breach SLA
  • A documentation update introduces contradictory content → The model confidently gives wrong answers
  • A user finds a creative way to make the system go in circles → Token costs spike on a single conversation

Here’s something counterintuitive we learned: reranking adds ~150ms to retrieval but reduces total response time. How? By sending fewer, more relevant chunks to the LLM, generation drops significantly. The net result is faster, not slower.
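The retrieve-then-rerank pattern is just a wide cheap pass followed by a narrow precise one. A sketch with stand-in functions (a real system would query a vector store and call a cross-encoder reranker where these stubs are):

```python
def vector_search(query: str, index: list[str], k: int = 25) -> list[str]:
    """Stand-in for a vector store query; returns candidate chunks."""
    return index[:k]  # pretend these are the top-k nearest neighbors

def rerank_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder; here: crude term overlap."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / max(len(q_terms), 1)

def retrieve(query: str, index: list[str], final_k: int = 4) -> list[str]:
    # Wide, cheap candidate pass...
    candidates = vector_search(query, index)
    # ...then a precise rerank, keeping only the few best chunks.
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:final_k]  # fewer, more relevant chunks -> faster generation

index = [
    "trade policy routing for multi-currency checkout",
    "how to upload a product image",
    "catalog import file format",
]
top = retrieve("multi-currency checkout", index, final_k=1)
print(top)  # the trade-policy chunk ranks first
```

The latency win comes from the last line of `retrieve`: shrinking `final_k` shrinks the prompt, and generation time falls with it.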

Each of these failure modes required monitoring, alerting, and fallback strategies that we built iteratively as we hit them in production.

The Evaluation Problem

How do you know if the AI is actually good?

User feedback is noisy. Automated metrics miss nuance. Manual review doesn’t scale.

The layered approach that worked:

  • Retrieval quality — Are we finding the right documents? Measured by hit rate and NDCG against human-curated golden sets.
  • Response quality — Is the answer correct, complete, and well-structured? LLM-as-judge with human spot-checks. Published research consistently shows RAG reduces hallucination rates from ~15% to near-zero on grounded domain-specific tasks.
  • Task completion — Did the user actually solve their problem? This is the metric that matters most. Industry benchmarks for AI support resolution rates typically progress from 30-40% in month one to 55-70% as the system matures.

No single metric tells the full story. You need all three.
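The retrieval-quality layer is the easiest of the three to automate. A binary-relevance sketch of hit rate and NDCG against a golden set (per-query; in practice you average these over the whole set):

```python
import math

def hit_rate(retrieved: list[str], relevant: set[str]) -> float:
    """1.0 if at least one relevant doc was retrieved for this query, else 0.0.
    Averaged over a human-curated golden set to get the headline metric."""
    return 1.0 if any(doc in relevant for doc in retrieved) else 0.0

def ndcg(retrieved: list[str], relevant: set[str]) -> float:
    """Binary-relevance NDCG: rewards putting relevant docs near the top."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved) if doc in relevant)
    ideal_hits = min(len(relevant), len(retrieved))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

golden = {"doc-a", "doc-b"}
print(hit_rate(["doc-x", "doc-a"], golden))          # 1.0
print(ndcg(["doc-a", "doc-x", "doc-b"], golden))     # < 1.0: doc-b ranked low
print(ndcg(["doc-a", "doc-b"], golden))              # 1.0: perfect ordering
```

Response quality and task completion don't reduce to formulas like this, which is exactly why they need LLM-as-judge plus human review.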

What I Learned

Start With the Dumbest Thing That Works

The third-party chatbot experiment taught me more in two weeks than a month of design docs would have. Ship the simplest version, watch it fail, and let the failures write your spec.

The Data Pipeline Is the Product

Everyone wants to talk about the LLM, the agent framework, the fancy orchestration. The actual differentiator is the data pipeline. How you ingest, chunk, enrich, and index your knowledge base determines 80% of your system’s quality.

Small Teams With the Right Culture Win

A small team that didn’t have AI experience shipped a production system that handles real enterprise traffic. The advantage wasn’t technical expertise — it was the willingness to learn everything from scratch and the refusal to ship something mediocre.

The Model Is the Least Important Decision You’ll Make

This is the take that gets me side-eyes, but I believe it deeply.

Everyone obsesses over which LLM to use. GPT-4 vs Claude vs Gemini. The debates go on for weeks. Architecture review meetings. Benchmarks. Red-teaming.

None of that matters if your retrieval is bad. None of that matters if your chunks are wrong. None of that matters if your documentation is outdated. We swapped models multiple times during development and the quality difference was marginal compared to the day we fixed our chunking strategy. That one change — going from naive fixed-size chunks to semantic boundaries — did more for response quality than any model upgrade ever did.

If you’re spending 80% of your time choosing a model and 20% on your data pipeline, flip those numbers.

Stop Gatekeeping AI Engineering

I’ve seen teams with PhDs in machine learning fail to ship an AI product, and I’ve seen teams of general-purpose engineers with zero AI background ship one in weeks. The difference is never expertise — it’s execution speed and the willingness to be bad at something before being good at it.

The AI field has a gatekeeping problem. People talk about these systems like you need a decade of ML research to build them. You don’t. You need good engineering instincts, the ability to read documentation fast, and the stubbornness to debug things you don’t fully understand yet.

If you’re a backend engineer wondering if you can build AI systems: you can. We’re proof.

AI Engineering Is Still Engineering

The fundamentals don’t change: observability, testing, SLAs, graceful degradation, clear abstractions. The models are new, the problems are old.

The uncomfortable truth

Most AI projects fail not because the models aren’t good enough, but because the engineering around them isn’t rigorous enough. Treat your AI system like any other production service: monitor it, test it, have runbooks for when it breaks.

Looking Back

I started this project not knowing what a vector embedding was. I ended it having built a production RAG system that handles real enterprise traffic every day.

The thing that surprises me most isn’t the technical complexity — it’s how learnable all of this is. The patterns — RAG pipelines, agent orchestration, documentation ingestion, complexity routing, structured memory — aren’t magic. They’re engineering problems with engineering solutions. The hardest part was never the AI. It was the same thing that’s always hard: understanding what users actually need and building something reliable enough that they trust it.

If you’re an engineer staring at the AI space feeling like you’re already behind — you’re not. The field is moving fast, but the gap between “I’ve never built this” and “I shipped this to production” is shorter than you think. It took us weeks, not years. It took stubbornness, not PhDs.

Ship the dumb version. Let real users break it. Then build the real thing.

That’s how it’s always worked. AI doesn’t change that.


Have questions about RAG, multi-agent systems, or any of the patterns I discussed here? I’m always down to talk AI architecture. Find me on LinkedIn or GitHub.
