Become an AI Expert.
Not just a user. An expert. The person who understands what the model is doing, why a prompt works, and how to build systems others cannot.
How This Works
Tokens, Embeddings & Context Windows
Before you can understand why LLMs behave the way they do, you need to understand the basic unit they operate on — and it is not a word.
By the end of this week you will be able to:
What Is a Token?
When you type a message to an LLM, the first thing that happens is your text gets tokenised — broken into chunks called tokens. Here is the critical thing most people get wrong: tokens are not words.
A token is roughly 3–4 characters of text on average. Common short words like "the" or "is" are single tokens. Longer or unusual words get split. "unhappiness" might become ["un", "happi", "ness"]. Numbers are especially fragmented — "1234567" could be 4+ tokens.
Think of tokens like musical notes. A song is not made of "songs within a song" — it is made of individual notes. An LLM does not read sentences. It reads a stream of tokens, one at a time, and its job is to predict the next note in the melody.
The practical implications are significant for anyone building AI tools. Token count drives your cost — every API call bills by tokens in plus tokens out. Non-English text, code, and structured data (JSON, CSV) typically use 20–40% more tokens than equivalent English prose. This affects how you design prompts and what you include in context.
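A rough budget check can be sketched from the characters-per-token rule of thumb above. This is a heuristic, not a tokenizer: `estimateTokens` and `estimateCostUSD` are hypothetical helpers, the prices in the usage line are made up, and your provider's real tokenizer will give different counts.

```javascript
// Rough token estimate: about 4 characters per token for English prose.
// Real counts vary by model, language, and content type.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Estimate the cost of one call. Prices are passed in as USD per million
// tokens because they differ by provider and model.
function estimateCostUSD(inputTokens, outputTokens, pricePerMInput, pricePerMOutput) {
  return (inputTokens * pricePerMInput + outputTokens * pricePerMOutput) / 1_000_000;
}

// Usage with invented prices (3 USD/M input, 15 USD/M output).
const prompt = "Summarise the following support ticket in two sentences.";
const inputTokens = estimateTokens(prompt);
console.log(inputTokens, estimateCostUSD(inputTokens, 100, 3, 15));
```

For billing-grade numbers, count tokens with the tokenizer that matches your model instead of this estimate.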
What Is a Context Window?
Every LLM has a context window — the maximum number of tokens it can process at once. Think of it as the model's working memory. Everything the model can "see" when generating its next token must fit inside this window.
This includes your system prompt, the entire conversation history, any documents you injected, and the model's own previous output. When the window fills up, older content gets cut off — the model literally cannot see it anymore.
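One common way to handle a full window is to keep the system prompt and drop the oldest messages first. The sketch below assumes that policy; `fitToWindow` is a hypothetical helper, and the default token counter is the rough 4-characters-per-token heuristic, which you would replace with a real tokenizer.

```javascript
// Keep the system prompt plus as many of the most recent messages as fit.
// countTokens is injectable; the default is only a rough estimate.
function fitToWindow(systemPrompt, history, maxTokens,
                     countTokens = (t) => Math.ceil(t.length / 4)) {
  let used = countTokens(systemPrompt);
  const kept = [];
  // Walk backwards so the newest messages are kept first.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = countTokens(history[i].content);
    if (used + cost > maxTokens) break; // older messages fall out of the window
    kept.unshift(history[i]);
    used += cost;
  }
  return [{ role: "system", content: systemPrompt }, ...kept];
}
```

Dropping old turns is the simplest policy. Summarising them into a shorter message is an alternative when early context still matters.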
Context window sizes have grown dramatically. The original GPT-3 (2020) had a 2,048-token window. Claude now supports up to 200K tokens. But more context ≠ better performance. Models tend to pay more attention to the beginning and end of context (the "lost in the middle" problem). Relevant information buried in the middle of a huge context gets attended to less reliably.
| Model | Context Window | Approximate Pages |
|---|---|---|
| GPT-3 (2020) | 2,048 tokens | ~1.5 pages |
| GPT-4 (2023) | 8K / 32K tokens | ~6 / 24 pages |
| Claude 3.5 (2024) | 200,000 tokens | ~150 pages |
| Gemini 1.5 Pro | 1,000,000 tokens | ~750 pages |
Embeddings — How Meaning Becomes Math
Before the transformer processes tokens, each token gets converted into an embedding — a list of numbers (a vector) that represents its meaning. A typical embedding might be 1,536 numbers long. Every token in the model's vocabulary has its own embedding vector.
The remarkable thing is that these vectors capture semantic meaning spatially. Words that mean similar things, or that are used in similar contexts, have vectors that are close together in that high-dimensional space. Unrelated words are far apart. One counterintuitive consequence: opposites such as "hot" and "cold" often sit fairly close together, because they appear in the same kinds of sentences.
Imagine a massive 3D room where every concept in the English language has a physical location. "King" and "Queen" are near each other. "Dog" and "Cat" are close. "Dog" and "Invoice" are far apart. The embedding space is that room, but with 1,536 dimensions instead of 3. This is how the model understands that "dog" and "puppy" are related even if they never appeared in the same sentence during training.
Your SaaS chatbot injects 50 tool descriptions into every request, potentially 3,000–5,000 tokens before the user says a word. Understanding tokens shows you where that cost comes from: using RAG to retrieve only the relevant tools can cut input tokens by 60% or more, depending on how many tools each query actually needs.
Transformer Architecture & Attention
The transformer is the engine behind every major LLM. You do not need to implement one, but understanding how it works changes how you think about every model behaviour you see.
By the end of this week you will be able to:
The Transformer — Plain Language
The transformer was introduced in a 2017 paper called "Attention Is All You Need." Before it, language models processed text sequentially — one word at a time, left to right. The transformer changed everything by allowing the model to process all tokens simultaneously and let each token directly attend to every other token in the sequence.
At its core, a transformer does one thing repeatedly: it takes a sequence of token embeddings and transforms them through many layers. Each layer refines the representation of every token based on its relationship to all other tokens. After enough layers, each token's representation has been enriched by the full context of every other token it appeared with.
Imagine you have a room full of experts. You give each expert one word from a sentence. In the first round, each expert talks to every other expert in the room simultaneously. In the next round, they talk again, but now they have richer information from the first round. After 96 rounds (for a large model), each expert's understanding of their word is deeply enriched by the full context of the sentence. Each round is one transformer layer, each expert stands for one token position, and the conversations between experts are the work the attention heads do.
Attention — The Core Mechanism
Attention answers one question: when predicting the next token, which previous tokens should I pay the most attention to?
In the sentence "The trophy didn't fit in the suitcase because it was too big," what does "it" refer to? Humans immediately know it refers to "trophy." Attention is the mechanism that allows the transformer to resolve this too — it learns to assign high attention weight between "it" and "trophy" in this context.
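The weighting described above can be sketched as scaled dot-product attention for a single head. This is a toy version under stated assumptions: the query, key, and value vectors are passed in directly (a real model computes them from learned projections), and there is no causal mask, so every token can attend to every other token.

```javascript
// Turn raw scores into weights that sum to 1.
function softmax(xs) {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max)); // subtract max for stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const dot = (a, b) => a.reduce((s, ai, i) => s + ai * b[i], 0);

// queries, keys, values: arrays of vectors, one per token.
// Returns the attention weights and one output vector per token.
function attention(queries, keys, values) {
  const scale = Math.sqrt(keys[0].length);
  // Row i holds how much token i attends to each token j.
  const weights = queries.map((q) => softmax(keys.map((k) => dot(q, k) / scale)));
  // Each output is the weighted average of the value vectors.
  const outputs = weights.map((row) =>
    values[0].map((_, d) => row.reduce((s, w, j) => s + w * values[j][d], 0))
  );
  return { weights, outputs };
}
```

In the trophy example, a high weight between "it" and "trophy" means the output vector for "it" is built mostly from the value vector of "trophy".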
Parameters, Scale & What Training Means
When we say a model has "7 billion parameters," we mean it has 7 billion numbers — the weights in every attention head, every feed-forward layer, every embedding. These numbers are what the model learned from training data. They encode, in aggregate, an enormous amount of world knowledge, language structure, and reasoning patterns.
Training is the process of adjusting these numbers. You show the model a piece of text, ask it to predict the next token, compare its prediction to the actual next token, measure how wrong it was (the loss), and adjust all parameters slightly in the direction that reduces the loss. Do this billions of times across trillions of tokens, and the parameters converge to values that encode a surprisingly coherent model of language and world knowledge.
| Model Size | Parameter Count | Capability Level |
|---|---|---|
| Small | 1B–7B | Fast, cheap, good for focused tasks |
| Medium | 13B–70B | Strong reasoning, most use cases |
| Large | 100B+ | Complex reasoning, research, frontier tasks |
| GPT-4 / Claude | Undisclosed (est. 1T+) | State of the art across domains |
When you generate multiple variations of content, the attention mechanism is what lets the model understand the relationship between your brand niche, target audience, and style — all at once.
Training, RLHF & Alignment
Why does Claude refuse some requests? Why does it hedge? Why does GPT-4 feel different from Llama? The answer is in how these models were trained after their base pre-training — and that process shapes everything you interact with.
By the end of this week you will be able to:
Pre-Training vs Fine-Tuning
There are two distinct phases in creating a modern LLM. Pre-training is where the base model is created. The model sees trillions of tokens of internet text, books, code, and scientific papers. Its only job: predict the next token. After this phase, you have a powerful but raw text-completion engine. It will complete your sentences, but not necessarily in a helpful or safe way.
Fine-tuning is where personality, helpfulness, and safety emerge. The pre-trained model is further trained on curated datasets of good (and bad) responses. This is where Claude's politeness, GPT-4's formatting style, and Llama's different tendencies all originate.
Pre-training is like spending 20 years reading every book ever written. You emerge with vast knowledge but no particular direction. Fine-tuning is like doing a professional apprenticeship — you learn how to apply that knowledge in a specific, useful, appropriate way. The raw knowledge stays; what changes is the disposition.
RLHF — Why Models Have Personalities
RLHF stands for Reinforcement Learning from Human Feedback. It is the technique that transformed raw language models into assistants you can have a coherent conversation with. Here is how it works:
Step 1: Supervised Fine-Tuning (SFT)
Human contractors write examples of ideal prompt-response pairs. The model is fine-tuned to imitate these. This gives it a starting point for helpful behaviour.
Step 2: Reward Model Training
The model generates multiple responses to the same prompt. Humans rank these responses from best to worst. A separate "reward model" is trained to predict these human preferences — it learns to score responses the way a human rater would.
Step 3: RL Optimisation
The main model is then trained using reinforcement learning, with the reward model as the signal. It learns to generate responses that get high scores. Over many iterations, it learns to be helpful, harmless, and honest — because those responses score highest with human raters.
The Alignment Tax
RLHF is not free. The process of training a model to be safe and helpful sometimes reduces its raw capability in specific areas. This tradeoff is called the alignment tax.
A base model asked to write a persuasive essay will write one — no caveats, no hedging, no "however, the other side argues..." An aligned model hedges, adds disclaimers, and balances perspectives. In many real-world use cases, this is exactly what you want. But in others — like writing marketing copy or generating creative fiction — the hedging gets in the way.
RLHF is part of why Claude refuses to write spam emails even if you ask nicely: it was trained with human feedback to weigh potential harm alongside helpfulness instead of following instructions blindly. Understanding this helps you prompt more effectively.
Inference, Temperature & Sampling
You can control what kind of output you get by tuning the sampling parameters. Most people change temperature without knowing what it actually does. After this week, you will.
By the end of this week you will be able to:
How Text Generation Actually Works
After all the attention and transformer layers run, the model produces a probability distribution over its entire vocabulary — typically 50,000–100,000 tokens. For each possible next token, there is a probability. The token "the" might have 15% probability. "a" might have 8%. "therefore" might have 0.03%.
The model does not just pick the highest-probability token every time. It samples from this distribution. This is where the randomness comes from — and why the same prompt can produce different responses.
Temperature — The Randomness Dial
Temperature scales the probability distribution before sampling. Low temperature (0.1–0.3) makes high-probability tokens even more dominant — the output becomes more deterministic and predictable. High temperature (0.8–1.5) flattens the distribution — lower-probability tokens get more of a chance, producing more creative and varied output. Temperature = 0 always picks the single highest-probability token (greedy decoding).
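In practice, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch with toy logits; `softmaxWithTemperature` is a hypothetical name, and treating temperature 0 as "pick the maximum" mirrors the greedy-decoding behaviour described above.

```javascript
// Convert logits to probabilities after scaling by temperature.
function softmaxWithTemperature(logits, temperature) {
  if (temperature <= 0) {
    // Greedy decoding: all probability on the highest-scoring token.
    const best = logits.indexOf(Math.max(...logits));
    return logits.map((_, i) => (i === best ? 1 : 0));
  }
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map((s) => Math.exp(s - max)); // subtract max for stability
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// The same three logits at different temperatures.
const logits = [2, 1, 0];
for (const t of [0.2, 1, 2]) {
  console.log(t, softmaxWithTemperature(logits, t).map((p) => p.toFixed(3)));
}
```

Running it shows the top token's share shrinking as temperature rises, which is the flattening effect.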
Top-P, Top-K & the KV Cache
Top-K limits sampling to the K most probable tokens. Top-K = 50 means only the 50 highest-probability tokens are considered for sampling, regardless of their probabilities. Top-P (nucleus sampling) is more dynamic — it includes enough of the top tokens to sum to probability P. Top-P = 0.9 includes however many tokens it takes to reach 90% cumulative probability. These two parameters work together to prevent the model from sampling highly improbable tokens.
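Both filters can be sketched over a toy distribution. The token list and probabilities below are invented for illustration, and `topK`, `topP`, and `renormalise` are hypothetical helper names, not a specific library's API.

```javascript
// probs: array of { token, p } entries.

// Keep the k most probable tokens.
function topK(probs, k) {
  return [...probs].sort((a, b) => b.p - a.p).slice(0, k);
}

// Keep the smallest set of top tokens whose probabilities sum to at least p.
function topP(probs, p) {
  const sorted = [...probs].sort((a, b) => b.p - a.p);
  const kept = [];
  let cumulative = 0;
  for (const entry of sorted) {
    kept.push(entry);
    cumulative += entry.p;
    if (cumulative >= p) break;
  }
  return kept;
}

// After filtering, rescale so the kept probabilities sum to 1 before sampling.
function renormalise(entries) {
  const total = entries.reduce((s, e) => s + e.p, 0);
  return entries.map((e) => ({ token: e.token, p: e.p / total }));
}

const dist = [
  { token: "the", p: 0.5 },
  { token: "a", p: 0.3 },
  { token: "an", p: 0.15 },
  { token: "zebra", p: 0.05 },
];
console.log(topP(dist, 0.9).map((e) => e.token)); // "zebra" is cut off
```

Top-P adapts to the shape of the distribution: when the model is confident it keeps very few tokens, and when it is unsure it keeps many.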
KV-Cache is a speed optimisation. During inference, the attention mechanism needs to look back at all previous tokens. Recomputing the key-value pairs for every previous token on every new token generation would be extremely slow. The KV-cache stores these computed pairs so they only need to be computed once. Each new token still attends to everything before it, so per-token cost grows gradually with length, but the model never has to reprocess the whole sequence from scratch. It is also why API providers can offer "prompt caching": if your system prompt is the same across many calls, they can cache its KV-pairs and charge you less.
| Parameter | Controls | Use Low When | Use High When |
|---|---|---|---|
| temperature | Overall randomness | Consistency needed | Creativity needed |
| top_k | Vocabulary cutoff (count) | Focused, on-topic output | Diverse vocabulary |
| top_p | Vocabulary cutoff (probability) | Predictable phrasing | Natural, varied text |
| max_tokens | Response length ceiling | Concise answers needed | Long-form content |
Your agent uses temperature 0.4 for drafting tool pages: low enough to stay factual and structured, but not so low that every review sounds identical. This was a deliberate choice.
Instruction Clarity & Role Framing
Most prompts fail not because the model is incapable, but because the instruction is ambiguous. This week you learn to write prompts that leave no room for the model to guess what you want.
By the end of this week you will be able to:
The Four-Layer Prompting Stack
Every prompt operates across four layers simultaneously. Beginners think about layer one. Experts think about all four before writing a single word.
Layer 1 — Instruction
What you want the model to do. The verb matters enormously. "Write" vs "Summarise" vs "Critique" produces fundamentally different outputs even with identical context. Always use a specific action verb.
Layer 2 — Context
What the model needs to know that it does not already know. Your background, your audience, your constraints. The model has no memory of previous conversations. If it matters, it must be in the prompt.
Layer 3 — Format
How the output should be structured. Bullet list vs prose vs JSON vs table. Specify length. Specify structure. If you do not, the model picks the path of least resistance — which is rarely what you want.
Layer 4 — Constraints
What the model must avoid. No jargon. No preamble. Keep under 100 words. Constraints are where most prompts fail — people specify what they want but not what they do not want.
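One way to make the four layers hard to forget is to assemble every prompt from them explicitly. A minimal sketch; `buildPrompt` and the section labels are hypothetical choices, not a required format.

```javascript
// Assemble a prompt from the four layers so none is left out by accident.
function buildPrompt({ instruction, context, format, constraints }) {
  return [
    `Instruction: ${instruction}`,
    `Context: ${context}`,
    `Format: ${format}`,
    `Constraints:\n${constraints.map((c) => `- ${c}`).join("\n")}`,
  ].join("\n\n");
}

console.log(buildPrompt({
  instruction: "Summarise the support ticket below.",
  context: "Readers are account managers with no engineering background.",
  format: "Three bullet points.",
  constraints: ["No jargon", "No preamble", "Keep under 100 words"],
}));
```

Because each layer is a named field, a missing one shows up as `undefined` in the output instead of silently disappearing.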
Role Framing — Shifting the Prior
When you tell the model "You are a senior growth marketer with 10 years experience in SaaS," you are not pretending. You are statistically shifting which part of the model's training distribution gets activated. All those marketing books, case studies, and expert interviews in the training data become more relevant.
The four-layer prompting stack (instruction, context, format, constraints), usually preceded by a role frame, is what a production system prompt is built from. Role: [specific role]. Context: [your data]. Instruction: [action]. Format: concise. Constraints: [what to avoid].
Chain of Thought & Structured Reasoning
LLMs do not "think" before they answer — they generate tokens left to right. But you can force reasoning to happen by making it part of the generation process itself.
By the end of this week you will be able to:
Why "Think Step by Step" Works
When you say "think step by step," you are telling the model to write out the reasoning before giving the answer. Each reasoning step becomes tokens in the context window — and those tokens are then available as context for generating the next reasoning step. You are forcing computation to happen in the output itself.
Compare solving a complex maths problem entirely in your head with writing it out on paper. On paper, each step you write down is available to inform the next. "Think step by step" gives the model paper to work on.
XML Tags for Structured Output
When you need to parse model output programmatically, XML tags are one of the more robust options. Models have seen large amounts of XML and HTML in their training data and usually place content inside the tags you specify, which makes the output easy to extract.
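Extracting tagged content can be as small as one regular expression. In this sketch `extractTag` is a hypothetical helper, the tag names are examples, and the tag is assumed to be a plain name with no attributes or nesting.

```javascript
// Return the text inside the first <tag>...</tag> pair, or null if absent.
// Assumes tag is a simple name such as "answer" (no regex special characters).
function extractTag(output, tag) {
  const match = output.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return match ? match[1].trim() : null;
}

const output = "<reasoning>Step 1: compare prices.</reasoning>\n<answer>Tool B</answer>";
console.log(extractTag(output, "answer"));
```

Returning `null` for a missing tag lets the caller treat it as a format failure and retry, instead of passing malformed text downstream.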
Chain of Thought is why agents score and evaluate data before drafting — forcing the model to reason systematically first dramatically improves the quality of tool identification.
Few-Shot Examples & Output Control
Telling a model what you want is good. Showing it is better. Few-shot prompting is the single most reliable technique for getting consistent, correctly-formatted output.
By the end of this week you will be able to:
Few-Shot Prompting — Show, Don't Just Tell
A few-shot prompt includes 2–5 examples of ideal input-output pairs before the actual task. Examples do not just show what to produce — they demonstrate the level of detail, vocabulary, tone, structure, and reasoning approach you expect. A good example is worth a paragraph of instructions.
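With a chat-style API, one common way to supply examples is as alternating user and assistant turns placed before the real input. The sketch below assumes that message format; `buildFewShotMessages` is a hypothetical helper and the two example posts are invented for illustration.

```javascript
// Turn input/output examples into alternating user/assistant messages,
// followed by the real input the model should answer.
function buildFewShotMessages(systemPrompt, examples, input) {
  const messages = [{ role: "system", content: systemPrompt }];
  for (const ex of examples) {
    messages.push({ role: "user", content: ex.input });
    messages.push({ role: "assistant", content: ex.output });
  }
  messages.push({ role: "user", content: input });
  return messages;
}

// Invented examples: benefit-led opener, a specific number, three hashtags.
const examples = [
  {
    input: "Product: invoicing app for freelancers",
    output: "Get paid sooner. Send an invoice in 30 seconds from your phone. #freelance #invoicing #getpaid",
  },
  {
    input: "Product: habit tracker",
    output: "Build the habit in 5 minutes a day. One tap to log, no guilt if you miss. #habits #productivity #focus",
  },
];

const messages = buildFewShotMessages(
  "Write one short social post for the product.",
  examples,
  "Product: meeting scheduler"
);
console.log(messages.length);
```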
Notice what the examples communicate beyond the obvious: benefit-led opener, conversational but punchy tone, specific number when possible, 3 hashtags at the end. None of this was stated in instructions — the examples showed it.
Negative Examples & Anti-Patterns
Negative examples are underused and extremely effective. If there is a specific failure mode your model keeps hitting, showing it an example of what bad looks like (labelled bad) is often faster than writing instructions that try to prevent it.
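Your scoring prompt gets a similar effect through its scoring rubric (9-10: buying intent, 7-8: pain point, and so on). Strictly speaking this is a rubric, not few-shot prompting, because it contains no worked input-output pairs, but it anchors the model's output in the same way, and it works.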
Prompt Debugging & Evaluation
Most people iterate prompts by feel. Experts iterate by data. This week you build a system for diagnosing failures and measuring improvement — the same discipline that powers production AI.
By the end of this week you will be able to:
The 5 Prompt Failure Modes
Every bad output falls into one of five categories. Name the failure mode first — then the fix becomes obvious.
1. Hallucination
Confident false information. Fix: ground the model with real data in the prompt, or add "if you are not certain, say so explicitly."
2. Instruction Drift
Starts following instructions, then drifts away mid-response. Fix: repeat the most critical constraints at the end of the prompt, not just the beginning.
3. Format Collapse
Ignores your specified format. Fix: use XML tags, provide a concrete format example, or add "return ONLY the specified format, no other text."
4. Sycophancy
Agrees with whatever the user says, even if wrong. Fix: explicitly instruct "If my premise is wrong, correct it first. Prioritise accuracy over agreement."
5. Over-Refusal
Refuses a legitimate task or hedges excessively. Fix: provide clearer legitimate context in the system prompt. Role framing and explicit use-case statements reduce this significantly.
Building a Golden Dataset
A golden dataset is a set of input-output pairs representing what "correct" looks like for your specific use case. You write these manually, based on your expert judgement. Even 20 examples is enough to start. Once you have it, you can evaluate any prompt change objectively.
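Evaluating a prompt against a golden dataset can be sketched as a loop that calls the model on each input and compares the result. Here `evaluate` is a hypothetical helper, `generate` stands in for your prompt plus model call, and exact string match is only the simplest possible comparison.

```javascript
// Run every golden example through generate() and report the pass rate.
// matches() is injectable because exact match is too strict for free text.
async function evaluate(goldenSet, generate,
                        matches = (actual, expected) => actual.trim() === expected.trim()) {
  const failures = [];
  for (const { input, expected } of goldenSet) {
    const actual = await generate(input);
    if (!matches(actual, expected)) failures.push({ input, expected, actual });
  }
  return {
    passRate: (goldenSet.length - failures.length) / goldenSet.length,
    failures,
  };
}
```

Run it before and after each prompt change: a higher pass rate on the same dataset is evidence the change helped, and the `failures` list tells you which failure mode to look at next.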
When your agent produces invalid output, that is an eval failure. The fix is to validate output before using it, and to run evals so these issues are caught before they hit production.
APIs, SDKs & Production Basics
You are already calling LLM APIs. This week you will learn to do it correctly — handling errors, managing costs, streaming responses, and building the reliability layer that separates hobby projects from production systems.
By the end of this week you will be able to:
Production API Pattern — Retry & Fallback
Most people learn to call LLM APIs by copying a quickstart example. Production code is different. It handles the failure cases — which in real-world usage are routine events, not edge cases.
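A minimal sketch of the retry-and-fallback pattern, assuming exponential backoff between attempts and an ordered list of models to try. `callWithRetry` is a hypothetical helper, and `callModel(model, prompt)` stands in for your own API call, which is assumed to throw on failures such as timeouts or 429/5xx responses.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Try each model in order. Retry each one a few times with exponential
// backoff before moving on to the next model in the list.
async function callWithRetry(callModel, prompt, models,
                             { retries = 2, baseDelayMs = 500 } = {}) {
  let lastError = new Error("no models provided");
  for (const model of models) {
    for (let attempt = 0; attempt <= retries; attempt++) {
      try {
        return { model, result: await callModel(model, prompt) };
      } catch (err) {
        lastError = err;
        // Wait 500ms, then 1000ms, ... but not after the final attempt.
        if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError; // every model failed
}
```

A production version would usually retry only on transient errors and give up immediately on things like an invalid API key, which no amount of waiting will fix.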
Cost Architecture — Thinking in Tokens
| Cost Driver | What It Is | How to Optimise |
|---|---|---|
| Input tokens | Every token in system prompt + user message | Trim system prompts; use prompt caching |
| Output tokens | Every token the model generates | Set max_tokens; specify concise output format |
| Prompt caching | Reusing computed KV-pairs for repeated prompts | Put static context first; mark it cacheable where the API supports it (e.g. Anthropic's cache_control field) |
| Model choice | Larger models cost 10–50x more per token | Use smallest model that meets quality bar |
| Batch API | 50% discount for non-realtime workloads | Use for your CSV bulk pin exports |
A production AI system calls LLM APIs, handles errors, retries with fallback models, and persists structured output. Understanding these patterns is what this week teaches.
RAG: Retrieval Augmented Generation
RAG is the most important architecture pattern in applied AI right now. It solves the two biggest LLM problems — hallucination and knowledge cutoff — by grounding generation in real retrieved data.
By the end of this week you will be able to:
The RAG Pipeline — Step by Step
Phase 1 — Indexing (Done Once)
Take your documents (your 50 SaaS tool descriptions). Split them into chunks. Convert each chunk into an embedding vector. Store vectors in a vector database alongside the original text. This is your searchable knowledge base.
Phase 2 — Retrieval (Every Query)
When a user asks a question, convert it into an embedding vector. Search the vector database for the closest chunks. Retrieve the top-K matches. Inject them into the LLM context window. Generate the answer grounded in retrieved context.
Chunking — The Most Underrated RAG Decision
| Content Type | Chunk Strategy | Reasoning |
|---|---|---|
| Tool descriptions (your case) | One tool per chunk | Each tool is a self-contained unit; splitting within loses context |
| Long articles / blog posts | 300–500 tokens per chunk | Paragraph-level chunks preserve semantic coherence |
| Technical docs / code | Function or section level | Code blocks are the natural semantic unit |
| FAQs | One Q&A pair per chunk | The question is the retrieval signal; keep it with the answer |
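For the long-article row, paragraph-level chunking under a token budget could look like the sketch below. `chunkByParagraph` is a hypothetical helper, token counts use the rough 4-characters-per-token estimate, and paragraphs are assumed to be separated by blank lines.

```javascript
// Group whole paragraphs into chunks of at most maxTokens (estimated).
// A single paragraph larger than the budget becomes its own oversized chunk.
function chunkByParagraph(text, maxTokens = 400) {
  const estimate = (s) => Math.ceil(s.length / 4);
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks = [];
  let current = [];
  let used = 0;
  for (const p of paragraphs) {
    const cost = estimate(p);
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (current.length > 0 && used + cost > maxTokens) {
      chunks.push(current.join("\n\n"));
      current = [];
      used = 0;
    }
    current.push(p);
    used += cost;
  }
  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}
```

Splitting only at paragraph boundaries keeps each chunk readable on its own, which matters because the retrieved chunk is all the model will see of that document.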
RAG is why your chatbot could know about all 50 tools without stuffing them all into every prompt. Instead of injecting 5,000 tokens upfront, RAG retrieves the 3-5 most relevant tools per query.
Tool Use, Function Calling & Agents
An agent is an LLM that can take actions — calling APIs, searching the web — and loop until a task is complete. Agents are a powerful pattern. This week you understand their architecture deeply enough to build anything.
By the end of this week you will be able to:
The ReAct Loop — How Agents Think
Function Calling — Defining Tools for the Model
The tool description is a prompt. Write it like one — specifically, clearly, with context about when to use it. A vague description leads to incorrect tool selection. A precise description leads to reliable, predictable agent behaviour.
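A tool definition and the loop around it can be sketched together. The shape below follows the JSON Schema style most function-calling APIs use, but the exact envelope differs by provider. Everything named here is hypothetical: the `get_tool_pricing` tool, its stubbed `run` function, and `decide(history)`, which stands in for the model call and returns either a tool request or a final answer.

```javascript
// One tool: a description written like a prompt, a parameter schema,
// and a stubbed implementation so the sketch runs without a network.
const tools = {
  get_tool_pricing: {
    description:
      "Look up the current pricing tier for a SaaS tool by exact name. " +
      "Use when the user asks what a specific tool costs.",
    parameters: {
      type: "object",
      properties: { name: { type: "string", description: "Exact tool name" } },
      required: ["name"],
    },
    run: async ({ name }) => `${name}: pricing lookup not implemented in this sketch`,
  },
};

// The agent loop: ask the model what to do, run the tool it picks,
// feed the observation back, and stop when it returns an answer.
async function runAgent(task, decide, maxSteps = 5) {
  const history = [{ role: "user", content: task }];
  for (let step = 0; step < maxSteps; step++) {
    const decision = await decide(history); // { tool, args } or { answer }
    if (decision.answer !== undefined) return decision.answer;
    const tool = tools[decision.tool];
    const observation = tool
      ? await tool.run(decision.args)
      : `Unknown tool: ${decision.tool}`;
    history.push({ role: "tool", content: observation });
  }
  throw new Error(`No answer after ${maxSteps} steps`);
}
```

The `maxSteps` cap matters: without it, a model that keeps requesting tools would loop, and spend tokens, indefinitely.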
A real AI agent has tools, a loop, and autonomous decision-making.
Evaluation, Debugging & Production Systems
The final week. Evaluation is how you know your system is actually working — and how you make it better systematically rather than by gut feel. This is what separates builders who ship once from builders who compound.
By the end of this week you will be able to:
LLM-as-Judge — Scaling Evaluation
You cannot manually review every output your agent produces. But an LLM can review outputs at scale — evaluating them against your criteria automatically. The key is writing the judge prompt with specific, measurable criteria, not "is this good?"
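A judge prompt with measurable criteria, plus a strict parser for its verdict, might look like the sketch below. `buildJudgePrompt`, `parseVerdict`, and the JSON shape are assumptions chosen for illustration. The model call itself is left out.

```javascript
// Build a grading prompt from explicit pass/fail criteria.
function buildJudgePrompt(question, answer, criteria) {
  const rubric = criteria.map((c, i) => `${i + 1}. ${c}`).join("\n");
  return [
    "You are grading an AI answer. For each criterion, decide pass or fail.",
    `Criteria:\n${rubric}`,
    `Question:\n${question}`,
    `Answer:\n${answer}`,
    'Return ONLY JSON: {"results":[{"criterion":1,"pass":true,"reason":"..."}]}',
  ].join("\n\n");
}

// Parse the judge's reply. Returns null if the reply is not valid JSON in
// the expected shape, so a malformed verdict is never counted as a pass.
function parseVerdict(raw, criteriaCount) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  const results = parsed?.results;
  if (!Array.isArray(results) || results.length !== criteriaCount) return null;
  if (!results.every((r) => r && typeof r.pass === "boolean")) return null;
  return { passed: results.filter((r) => r.pass).length, total: criteriaCount };
}
```

Criteria such as "mentions pricing" or "under 100 words" work well here because two graders would agree on them. It is worth spot-checking the judge against your own ratings before trusting it at scale.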
Observability — You Cannot Fix What You Cannot See
Course Complete
You have covered every foundational concept across all three pillars. The gap between you and most AI users is now significant. The compounding starts from here — every project, every failure you debug, every paper you read accelerates faster because the foundation is solid.
Next level: contribute to an open-source AI project, write about what you learned, or build something that does not exist yet.
Production systems use API logs (Observability tab) as eval — you read logs to catch pipeline failures. That is production monitoring in practice.
Build a SaaS Review Bot
In this week you build a complete AI-powered SaaS review page generator, a powerful pattern for AI applications. By the end, you have a working bot that discovers tools, scrapes their websites, and drafts professional review pages automatically.
Architecture Overview
The system has three components: a Discovery layer (data search), a Research layer (Jina Reader for web scraping), and a Generation layer (LLM for drafting HTML). These run in sequence as a pipeline.
// Step 1 (Discover): fetch Reddit posts about SaaS needs
const posts = await searchReddit("looking for app alternative to");
// Step 2 (Score): filter for high-intent posts (score >= 7)
const hotPosts = await scoreFunction(posts);
for (const post of hotPosts) {
  // Step 3 (Identify): find the best tool for this post's need
  const toolInfo = await identifyFunction(post.userIntent);
  // Step 4 (Validate): check the tool website is reachable
  const head = await fetch(toolInfo.url, { method: "HEAD" });
  if (!head.ok) continue; // skip tools whose site does not respond
  // Step 5 (Research): scrape the tool homepage via Jina Reader
  const page = await fetch(`https://r.jina.ai/${toolInfo.url}`);
  const content = await page.text(); // fetch resolves to a Response, not text
  // Step 6 (Generate): draft the full HTML review page
  const html = await draftFunction(toolInfo, content);
}
The Scoring Prompt
The key to quality output is the scoring step. By asking the LLM to evaluate post intent before drafting, you filter out noise and only process genuine buying-intent signals.
const scoringPrompt = `Score these Reddit posts for a SaaS discovery site.
Score = how likely the author needs a SaaS tool recommendation:
9-10: Directly asking for tool / clear buying intent
7-8: Pain point a SaaS tool clearly solves
5-6: Tangentially related
1-4: Not relevant
Return ONLY a JSON array:
[{"index":1,"relevanceScore":8.5,"matchedCategory":"Email Marketing","userIntent":"Needs email automation"}]`;
const res = await fetch("https://api.example.com/v1/chat/completions", {
method: "POST",
headers: { "Authorization": `Bearer ${LLM_API_KEY}`, "Content-Type": "application/json" },
body: JSON.stringify({
model: "meta-llama/llama-4-scout-17b-16e-instruct",
max_tokens: 800,
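// Append the posts to score; the prompt text alone contains none.
// `posts` is assumed to be the array from the discovery step.
messages: [{ role: "user", content: `${scoringPrompt}\n\nPosts:\n${JSON.stringify(posts)}` }]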
})
});
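The reply to this request still has to be parsed and checked before the pipeline trusts it. A sketch of that validation step, assuming the model returns the JSON array the prompt asks for, possibly wrapped in extra text. `parseScores` is a hypothetical helper, and the field names come from the example in the scoring prompt.

```javascript
// Extract the JSON array from the model's reply, drop malformed entries,
// and keep only posts at or above the intent threshold.
function parseScores(raw, minScore = 7) {
  // Models sometimes wrap JSON in prose or code fences; take the outermost array.
  const start = raw.indexOf("[");
  const end = raw.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  let items;
  try {
    items = JSON.parse(raw.slice(start, end + 1));
  } catch {
    return []; // not valid JSON: treat as "no usable scores"
  }
  if (!Array.isArray(items)) return [];
  return items.filter((it) =>
    it &&
    Number.isInteger(it.index) &&
    typeof it.relevanceScore === "number" &&
    it.relevanceScore >= 1 && it.relevanceScore <= 10 &&
    typeof it.userIntent === "string" &&
    it.relevanceScore >= minScore
  );
}
```

Returning an empty list on bad output means one malformed reply skips a run instead of crashing the whole pipeline. Logging the raw reply in that case makes the failure visible.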
Deploying on Serverless Infrastructure
Serverless platforms are a good fit for this bot: there are no servers to manage, cron triggers handle the scheduling (often on a free tier), and key-value storage persists the drafts. The entire pipeline runs without managing servers.
name = "saas-review-bot"
main = "src/worker.js"
compatibility_date = "2024-11-01"
[triggers]
crons = ["0 8 * * *"] # runs daily at 8am UTC
[[kv_namespaces]]
binding = "DRAFTS_KV"
id = "YOUR_KV_NAMESPACE_ID"
# Add secrets via dashboard:
# LLM_API_KEY
This pattern is used by many agents: cron-triggered + LLM scoring + web scraping, then saves structured output for human review.
Build an Agentic Content Pipeline
An agentic content pipeline runs without human input at each step. It discovers opportunities, researches them, generates content, and publishes — all autonomously. This week you design and build one from scratch.
The 4-Stage Pipeline
Every production content pipeline has four stages: Signal (what to write about), Research (gather information), Generate (create the content), and Distribute (publish or store). Each stage is a function you can test independently.
async function runContentPipeline(env) {
// Stage 1: Signal — what topics are trending?
const signals = await discoverSignals({
sources: ["reddit", "producthunt", "hackernews"],
minScore: 7.0,
maxResults: 10
});
// Stage 2: Research — gather facts for each topic
const researched = await Promise.all(
signals.map(s => researchTopic(s))
);
// Stage 3: Generate — create content for each topic
const content = await Promise.all(
researched.map(r => generateContent(r, {
format: "blog_post",
wordCount: 800,
tone: "informative"
}))
);
// Stage 4: Distribute — save drafts for review
await Promise.all(
content.map(c => saveDraft(c, env.CONTENT_KV))
);
return { drafted: content.length };
}
Multi-Agent vs Single Agent
A single agent handles all steps sequentially. A multi-agent system assigns specialized agents to each stage: a Scout Agent for signals, a Research Agent for facts, a Writer Agent for content. Multi-agent costs more because it makes more model calls, but it can produce higher-quality output because each agent has a narrower job.
// Run multiple research agents in parallel
const researchResults = await Promise.allSettled([
redditResearchAgent(topic), // Reddit sentiment + discussions
webResearchAgent(topic), // Jina scrape top 3 results
competitorResearchAgent(topic) // Check existing content gaps
]);
// Merge successful results
const facts = researchResults
.filter(r => r.status === "fulfilled")
.map(r => r.value)
.join("\n");
// Single writer agent synthesizes all research
const article = await writerAgent(topic, facts);
A newsletter agent follows this pipeline: Signal (trending data) → Research (context) → Generate (draft) → Distribute (delivery). Each stage can be a separate function.
Build a RAG-Powered Chatbot
RAG (Retrieval Augmented Generation) lets your chatbot answer questions about your specific content — without fine-tuning. This week you build a chatbot that knows about your specific set of tools and recommends the right one for each user query.
Step 1 — Build the Knowledge Base
First, convert your tool data into embeddings and store them in a vector store. Each tool description becomes a vector that captures its meaning.
// Load your saas-data.json
const tools = await fetch('/saas-data.json').then(r => r.json());
// Create embedding for each tool
async function embedTool(tool) {
const text = `${tool.name}: ${tool.description}.
Category: ${tool.category}.
Best for: ${tool.bestFor}.
Pricing: ${tool.pricing}.`;
const res = await fetch('https://api.example.com/v1/embeddings', {
method: 'POST',
headers: { 'Authorization': `Bearer ${LLM_API_KEY}`, 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'nomic-embed-text-v1_5',
input: text
})
});
const data = await res.json();
return { tool, embedding: data.data[0].embedding };
}
// Embed all tools
const embeddings = await Promise.all(tools.map(embedTool));
Step 2 — Retrieve Relevant Tools
When a user asks a question, embed their query, find the most similar tool embeddings using cosine similarity, and pass only the relevant items — not all of them.
function cosineSimilarity(a, b) {
const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
return dot / (magA * magB);
}
async function retrieveRelevantTools(query, embeddings, topK = 3) {
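// Embed the user query. embedQuery is not defined in this snippet; it is
// assumed to call the same embeddings endpoint as embedTool and return the vector.
const queryEmbedding = await embedQuery(query);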
// Score each tool by similarity
const scored = embeddings.map(({ tool, embedding }) => ({
tool,
score: cosineSimilarity(queryEmbedding, embedding)
}));
// Return top K most similar tools
return scored
.sort((a, b) => b.score - a.score)
.slice(0, topK)
.map(s => s.tool);
}
Step 3 — Generate the Answer
Pass the retrieved items as context to your LLM, then generate a recommendation. The model only sees the relevant tools — keeping the context window small and the answer focused.
async function ragAnswer(query, relevantTools) {
const context = relevantTools.map(t =>
`${t.name}: ${t.description} (${t.pricing})`
).join('\n');
const res = await fetch('https://api.example.com/v1/chat/completions', {
method: 'POST',
headers: { 'Authorization': `Bearer ${LLM_API_KEY}`, 'Content-Type': 'application/json' },
body: JSON.stringify({
model: 'llama-3.3-70b-versatile',
messages: [
{ role: 'system', content: `You are a SaaS expert. Only recommend from these tools:
${context}` },
{ role: 'user', content: query }
]
})
});
return (await res.json()).choices[0].message.content;
}
A typical chatbot injects all items as context. Upgrading to RAG would reduce token usage, improve answer relevance, and allow scaling without hitting context limits.
Deploy to Production on Cloudflare
Building AI features locally is one thing. Running them reliably in production — with monitoring, error handling, cost controls, and zero downtime — is another. This final week covers everything you need to ship AI to real users.
Production Checklist
Before going live, every AI feature needs to pass this checklist: secrets management, error handling, rate limiting, logging, and a fallback plan.
export default {
async fetch(request, env) {
// 1. Never expose API keys in code — use env secrets
const apiKey = env.LLM_API_KEY; // set via wrangler secret put
// 2. Always validate input
const body = await request.json().catch(() => null);
if (!body?.question) return new Response(JSON.stringify({ error: "question required" }), { status: 400, headers: { "Content-Type": "application/json" } });
// 3. Try primary model, fall back if it fails
let response;
try {
response = await callLLM(apiKey, body.question, "llama-3.3-70b-versatile");
} catch (e) {
// Fallback to faster model
response = await callLLM(apiKey, body.question, "llama-3.1-8b-instant");
}
// 4. Log for debugging (visible in Cloudflare Observability)
console.log("Request processed:", { model: response.model, tokens: response.usage?.total_tokens });
return new Response(JSON.stringify({ text: response.text }), {
headers: { "Content-Type": "application/json", "Access-Control-Allow-Origin": "*" }
});
}
};
Monitoring & Observability
In production, you can't debug by looking at the screen. You need logs. On serverless platforms, the observability or logging view shows real-time logs for every request. This is how you catch failures like "Reddit returned 403" or "invalid output from your LLM" before they become user-facing bugs.
// Instead of: console.log("done")
// Do this — structured logs you can filter in Cloudflare dashboard:
console.log(JSON.stringify({
event: "tool_drafted",
tool: toolInfo.name,
category: post.matchedCategory,
score: post.relevanceScore,
tokens_used: llmResponse.usage?.total_tokens,
duration_ms: Date.now() - startTime
}));
// Log errors with full context
console.error(JSON.stringify({
event: "draft_failed",
tool: toolInfo.name,
error: err.message,
step: "jina_scrape"
}));
Cost Control
AI costs compound fast. Three rules: cap tokens per request, limit batch sizes, and cache responses where possible. Free-tier limits vary by provider, so check yours; as a working budget, stay under 1,000 requests/day and 500K tokens/day.
// 1. Cap token output
const res = await callLLM(key, prompt, { max_tokens: 600 }); // not 4096
// 2. Cache repeated requests in KV
const cacheKey = `cache:${hashPrompt(prompt)}`;
const cached = await env.KV.get(cacheKey);
if (cached) return cached; // skip LLM call entirely
const result = await callLLM(key, prompt);
await env.KV.put(cacheKey, result, { expirationTtl: 86400 }); // 24h cache
// 3. Truncate scraped content before sending to LLM
const truncated = scrapedContent.slice(0, 2500); // not full page
Every technique in this week represents production best practices: secrets via environment variables, structured logging, rate limiting for cost control, and content truncation before LLM processing. Production AI is just disciplined engineering.
Course Complete — All 16 Weeks
You've gone from tokens and embeddings to building and deploying real production AI systems. You understand LLMs from the inside out and write prompts that actually work.
Share what you built. Write about what you learned. Teach someone else — that's when the knowledge really sticks.