✦ 12-Week Interactive Course

Become an
AI Expert.

Not just a user. An expert. The person who understands what the model is doing, why a prompt works, and how to build systems others cannot.

🔬
How LLMs Work
Tokens, attention, transformers, RLHF — the internals that explain every behaviour you see.
Weeks 1–4
✍️
Prompt Engineering
The craft of directing model output with precision. Instruction, context, format, constraints.
Weeks 5–8
⚙️
AI Tooling & Agents
APIs, RAG, function calling, multi-agent systems. Build things that do not exist yet.
Weeks 9–12

How This Works

01
Read & Learn
Each week teaches one concept with real analogies and code examples.
02
Test Yourself
Knowledge checks after each concept. Immediate feedback explains why.
03
Ask the AI Tutor
Confused? Ask Claude directly inside the lesson. No tab switching.
04
Do the Challenge
Each week ends with a hands-on challenge tied to your real projects.
Pillar 1 — LLM Internals Week 01 of 12

Tokens, Embeddings
& Context Windows

Before you can understand why LLMs behave the way they do, you need to understand the basic unit they operate on — and it is not a word.

By the end of this week you will be able to:

Explain what a token is and why "words" is the wrong mental model
Understand why context window size is a hard architectural constraint
Know what embeddings are and why similar concepts cluster together spatially
Explain why LLMs struggle with counting, arithmetic, and exact string matching

What Is a Token?

When you type a message to an LLM, the first thing that happens is your text gets tokenised — broken into chunks called tokens. Here is the critical thing most people get wrong: tokens are not words.

A token is roughly 3–4 characters of text on average. Common short words like "the" or "is" are single tokens. Longer or unusual words get split. "unhappiness" might become ["un", "happi", "ness"]. Numbers are especially fragmented — "1234567" could be 4+ tokens.
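The splitting described above can be sketched with a toy tokeniser. This is a minimal illustration, assuming a tiny hand-written vocabulary and a greedy longest-match rule; `TOY_VOCAB` and `toy_tokenise` are hypothetical names, and real tokenisers learn tens of thousands of sub-word pieces from data rather than using a fixed list like this.

```python
# Toy greedy longest-match tokeniser.
# TOY_VOCAB is a made-up vocabulary for illustration only.
TOY_VOCAB = {"the", "is", "un", "happi", "ness", "12", "34", "56", "7"}


def toy_tokenise(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # No vocabulary piece matched: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenise("the"))          # ['the']
print(toy_tokenise("unhappiness"))  # ['un', 'happi', 'ness']
print(toy_tokenise("1234567"))      # ['12', '34', '56', '7']
```

The common short word survives as one token, while the longer word and the number are broken into several pieces the vocabulary already knows.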

Analogy

Think of tokens like musical notes. A song is not made of "songs within a song" — it is made of individual notes. An LLM does not read sentences. It reads a stream of tokens, one at a time, and its job is to predict the next note in the melody.

Tokenisation Example
# The text: "Claude is an AI assistant"
# Gets split into tokens like:
tokens = ["Cl", "aude", " is", " an", " AI", " assistant"]
# Cost: 6 tokens in, not 5 "words"
# Token count ≈ word count × 1.3 for English
# Code and non-English text use MORE tokens per word
Expert Insight
This is why LLMs struggle with tasks like "count the letter R in 'strawberry'" — they never see individual characters. They see tokens, and character-level information is partially lost in that mapping. Understanding tokens explains this failure instantly.

The practical implications are significant for anyone building AI tools. Token count drives your cost — every API call bills by tokens in plus tokens out. Non-English text, code, and structured data (JSON, CSV) typically use 20–40% more tokens than equivalent English prose. This affects how you design prompts and what you include in context.
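To make the cost point concrete, here is a rough back-of-envelope sketch. It uses the words × 1.3 rule of thumb from this lesson instead of a real tokeniser, and the prices are placeholder assumptions, not any provider's actual rates; `estimate_tokens` and `estimate_cost` are hypothetical helper names.

```python
def estimate_tokens(text):
    """Rough English estimate: word count x 1.3, rounded up (integer maths)."""
    words = len(text.split())
    return (words * 13 + 9) // 10


def estimate_cost(tokens_in, tokens_out, price_in_per_million, price_out_per_million):
    """Cost of one API call, given prices per million tokens."""
    return (tokens_in * price_in_per_million
            + tokens_out * price_out_per_million) / 1_000_000


# Placeholder prices (assumptions): 3.00 per million input tokens,
# 15.00 per million output tokens.
per_call = estimate_cost(4000, 500, 3.0, 15.0)
print(f"Per call: {per_call:.4f}")
print(f"Per 10,000 calls: {per_call * 10_000:.2f}")
```

Swap in your provider's real prices and a real token count from its tokeniser before relying on the numbers.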

Knowledge Check
Why does the word "unhappiness" likely get split into multiple tokens while "the" does not?
A. "unhappiness" is a less common word
B. Tokenisers compress common short sequences into single tokens; long/rare words get split into frequent sub-word pieces
C. Because it has more than 8 characters
D. It has more syllables

What Is a Context Window?

Every LLM has a context window — the maximum number of tokens it can process at once. Think of it as the model's working memory. Everything the model can "see" when generating its next token must fit inside this window.

This includes your system prompt, the entire conversation history, any documents you injected, and the model's own previous output. When the window fills up, older content gets cut off — the model literally cannot see it anymore.
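One common way applications handle a full window is to keep the system prompt and drop the oldest messages first. The sketch below assumes that policy and reuses the rough words × 1.3 token estimate; `fit_to_window` is a hypothetical helper, and real systems may summarise old turns instead of discarding them.

```python
def estimate_tokens(text):
    """Rough English estimate: word count x 1.3, rounded up."""
    return (len(text.split()) * 13 + 9) // 10


def fit_to_window(system_prompt, history, max_tokens):
    """Keep the system prompt, then as many of the newest messages as fit."""
    budget = max_tokens - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older is cut off
        kept.append(message)
        budget -= cost
    return list(reversed(kept))  # restore chronological order


history = ["one two three", "four five", "six"]
print(fit_to_window("a b", history, max_tokens=9))  # ['four five', 'six']
```

The oldest message is silently dropped, which is exactly the "model cannot see it anymore" effect described above.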

Why This Matters For You
Your chatbot application runs via an API. Every time a user sends a message, you are sending tokens. If you inject all 50 tool descriptions into every request, you could be spending 3,000–5,000 tokens before the user says a word. This is why RAG (Week 10) exists: retrieve only the relevant items, not all of them.

Context window sizes have grown dramatically. The original GPT-3 had about 2K tokens. Claude now supports up to 200K tokens. But more context ≠ better performance. Models tend to pay more attention to the beginning and end of context (the "lost in the middle" problem). Relevant information buried in the middle of a huge context gets attended to less reliably.

Model | Context Window | Approximate Pages
GPT-3 (2020) | 2,048 tokens | ~1.5 pages
GPT-4 (2023) | 8K / 32K tokens | ~6 / 24 pages
Claude 3.5 (2024) | 200,000 tokens | ~150 pages
Gemini 1.5 Pro | 1,000,000 tokens | ~750 pages

Embeddings — How Meaning Becomes Math

Before the transformer processes tokens, each token gets converted into an embedding — a list of numbers (a vector) that represents its meaning. A typical embedding might be 1,536 numbers long. Every token in the model's vocabulary has its own embedding vector.

The remarkable thing is that these vectors capture semantic meaning spatially. Words that mean similar things have vectors that are close together in that high-dimensional space. Unrelated words are far apart.

Analogy

Imagine a massive 3D room where every concept in the English language has a physical location. "King" and "Queen" are near each other. "Dog" and "Cat" are close. "Dog" and "Invoice" are far apart. (Opposites such as "Hot" and "Cold" often sit surprisingly close together, because they appear in similar contexts.) The embedding space is that room, but with 1,536 dimensions instead of 3. This is how the model understands that "dog" and "puppy" are related even if they never appeared in the same sentence during training.

Embedding Arithmetic (Real Example)
# The famous word2vec result: embeddings capture relationships
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")
# This works because the model learned the concept of
# "royalty" and "gender" as directions in the embedding space.
# Practical use: semantic search
# embed("best tool for email marketing") → vector
# compare to embedded tool descriptions → find closest matches
# This is how your chatbot can do smart tool recommendations
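The "closest match" idea can be sketched with cosine similarity. The vectors below are hand-made 3-dimensional toys (roughly: royalty, gender, animal), not real embeddings, and `TOY_EMBEDDINGS`, `cosine_similarity`, and `nearest` are hypothetical names; a real system would get its vectors from an embedding model.

```python
import math

# Hand-made toy vectors: [royalty, gender, animal]. Not real embeddings.
TOY_EMBEDDINGS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "dog":   [0.0, 0.2, 1.0],
    "puppy": [0.0, 0.1, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, embeddings, exclude=()):
    """Return the word whose embedding is most similar to `vector`."""
    candidates = [w for w in embeddings if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vector, embeddings[w]))


e = TOY_EMBEDDINGS
analogy = [k - m + w for k, m, w in zip(e["king"], e["man"], e["woman"])]
print(nearest(analogy, e, exclude={"king", "man", "woman"}))  # queen
print(cosine_similarity(e["dog"], e["puppy"]) > cosine_similarity(e["dog"], e["king"]))  # True
```

Semantic search works the same way: embed the query, then rank stored items by similarity to it.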
Knowledge Check
An LLM is asked to count how many times the letter "r" appears in "strawberry". It confidently says 2, but the answer is 3. What is the most accurate explanation?
A. The model was not paying enough attention to the task
B. The model did not see enough examples of letter counting in training
C. Tokenisation splits words into sub-word chunks, so the model never sees individual characters; character-level operations are architecturally difficult
D. The context window was too small to hold the word
AI Tutor — Ask About This Week's Content
🤖
I'm your AI tutor for Week 1. Ask me anything about tokens, embeddings, or context windows. I'll explain it in plain language and connect it to your real projects.
Why more tokens for other languages?
How does this affect API costs?
What is BPE tokenisation?
How do I use embeddings in my chatbot?
🎯
Week 1 Challenge
Apply what you learned. These tasks take 20–30 minutes total and use tools you already have access to.
Go to platform.openai.com/tokenizer or Anthropic's tokeniser tool. Type a message you commonly send to your chatbot. Count the tokens. Now rewrite it to say the same thing in fewer tokens. What's the minimum?
Ask Claude: "How many times does the letter 'r' appear in 'strawberry'?" Note the answer. Then ask: "Use a code interpreter to count it." Compare results. You just observed the tokenisation limitation in the wild.
Look at your chatbot's current system prompt. Count the approximate tokens (words × 1.3). Calculate how much of your context window budget that uses per conversation. Does it change how you'd write it?
🌍 Real World

Your SaaS chatbot injects 50 tool descriptions into every request — potentially 3,000–5,000 tokens before the user says a word. Understanding tokens helps you reduce cost by 60%+ using RAG to retrieve only relevant tools.

🧠 Week Quiz
Q1. What is a token in the context of LLMs?
Q2. Context window refers to...
Q3. Embeddings convert text into...
Q4. True or False: Two sentences with similar meaning will have embeddings that are far apart in vector space.
Q5. True or False: A longer context window always means better outputs.
Pillar 1 — LLM Internals Week 02 of 12

Transformer Architecture
& Attention

The transformer is the engine behind every major LLM. You do not need to implement one, but understanding how it works changes how you think about every model behaviour you see.

By the end of this week you will be able to:

Describe the transformer architecture in plain language without jargon
Explain what attention does: which tokens influence which other tokens
Understand why more parameters generally means better capability
Know what training means: gradient descent, loss, and data scale

The Transformer — Plain Language

The transformer was introduced in a 2017 paper called "Attention Is All You Need." Before it, language models processed text sequentially — one word at a time, left to right. The transformer changed everything by allowing the model to process all tokens simultaneously and let each token directly attend to every other token in the sequence.

At its core, a transformer does one thing repeatedly: it takes a sequence of token embeddings and transforms them through many layers. Each layer refines the representation of every token based on its relationship to all other tokens. After enough layers, each token's representation has been enriched by the full context of every other token it appeared with.

Analogy

Imagine you have a room full of experts. You give each expert one word from a sentence. In the first round, each expert talks to every other expert in the room simultaneously. In the next round, they talk again, but now they have richer information from the first round. After 96 rounds (for a large model), each expert's understanding of their word is deeply enriched by the full context of the sentence. Each round is one transformer layer, each expert is a token position, and the conversation between them is attention.

Attention — The Core Mechanism

Attention answers one question: when predicting the next token, which previous tokens should I pay the most attention to?

In the sentence "The trophy didn't fit in the suitcase because it was too big," what does "it" refer to? Humans immediately know it refers to "trophy." Attention is the mechanism that allows the transformer to resolve this too — it learns to assign high attention weight between "it" and "trophy" in this context.

Attention in Concept
# For each token, attention computes three things:
Query = "What am I looking for?"      # (this token's question)
Key   = "What do I offer?"            # (every token's label)
Value = "What information I carry"    # (every token's content)
# "it" asks its Query: "who am I referring to?"
# All tokens respond with their Keys
# "trophy" has a high match → high attention weight
# The Value of "trophy" flows into the representation of "it"
# Result: "it" now carries meaning enriched by "trophy"
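The Query/Key/Value picture above corresponds to scaled dot-product attention. Here is a minimal sketch for a single query in plain Python. The vectors are made-up numbers chosen so that "it" matches "trophy"; in a real model the queries, keys, and values come from learned projections of the token embeddings, and `attend` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output


tokens = ["trophy", "suitcase", "it"]
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]    # toy numbers
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # toy numbers
query_for_it = [2.0, 0.0]

weights, output = attend(query_for_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token}: {weight:.2f}")
```

The largest weight lands on "trophy", so most of the output vector is made of the value that "trophy" carries.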
Expert Insight
Multi-head attention runs this process in parallel many times (e.g. 96 heads per layer in the largest GPT-3 model). Each head learns to attend to different types of relationships: one head might track syntactic structure, another co-reference, another factual associations. The outputs of all heads are combined. This is why transformers generalise so well: they learn many different types of relationships simultaneously.
Knowledge Check
A model produces a response that correctly uses a word from 3,000 tokens earlier in the conversation. Which mechanism makes this possible?
A. The model has a built-in memory system
B. Attention allows every token to directly attend to every other token in the context window, regardless of distance
C. Recurrent processing carries information from early tokens forward
D. The model is simply large enough to hold all information

Parameters, Scale & What Training Means

When we say a model has "7 billion parameters," we mean it has 7 billion numbers — the weights in every attention head, every feed-forward layer, every embedding. These numbers are what the model learned from training data. They encode, in aggregate, an enormous amount of world knowledge, language structure, and reasoning patterns.
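To get a feel for where such numbers come from, a widely used rule of thumb estimates a standard transformer's non-embedding parameters as roughly 12 × layers × width². The sketch below applies it to the published configuration of the largest GPT-3 model (96 layers, width 12,288). Treat it as an approximation only: it ignores embeddings and biases, and `approx_transformer_params` is a hypothetical helper name.

```python
def approx_transformer_params(n_layers, d_model):
    """Rule-of-thumb parameter count, ignoring embeddings and biases.

    Per layer: about 4 * d^2 for the attention projections
    (query, key, value, output) plus about 8 * d^2 for a
    feed-forward block with a 4x hidden width.
    """
    return 12 * n_layers * d_model ** 2


params = approx_transformer_params(n_layers=96, d_model=12288)
print(f"{params / 1e9:.0f}B")  # 174B
```

The estimate lands close to the 175 billion figure usually quoted for that model.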

Training is the process of adjusting these numbers. You show the model a piece of text, ask it to predict the next token, compare its prediction to the actual next token, measure how wrong it was (the loss), and adjust all parameters slightly in the direction that reduces the loss. Do this billions of times across trillions of tokens, and the parameters converge to values that encode a surprisingly coherent model of language and world knowledge.
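The predict, measure, adjust loop can be shown on a deliberately tiny example. This sketch assumes a three-token vocabulary and treats the three logits themselves as the only parameters; a real model computes its logits from billions of weights, but the loss and update rule follow the same pattern. `train_step` is a hypothetical name.

```python
import math


def softmax(logits):
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits, target_index, learning_rate):
    """One gradient-descent step on cross-entropy loss."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])  # how wrong the prediction was
    # Gradient of the loss with respect to each logit:
    # predicted probability minus 1 for the correct token, minus 0 otherwise.
    grads = [p - (1.0 if i == target_index else 0.0)
             for i, p in enumerate(probs)]
    updated = [x - learning_rate * g for x, g in zip(logits, grads)]
    return updated, loss


logits = [0.0, 0.0, 0.0]  # start with no preference between tokens
losses = []
for _ in range(50):
    logits, loss = train_step(logits, target_index=2, learning_rate=0.5)
    losses.append(loss)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

The loss starts at ln(3), the value for a uniform guess over three tokens, and falls as the parameters shift towards the correct token.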

Model Size | Parameter Count | Capability Level
Small | 1B–7B | Fast, cheap, good for focused tasks
Medium | 13B–70B | Strong reasoning, most use cases
Large | 100B+ | Complex reasoning, research, frontier tasks
GPT-4 / Claude | Undisclosed (unofficial estimates vary) | State of the art across domains
AI Tutor — Week 2
🤖
Ready to go deeper on transformers and attention. What's unclear? I can explain with more analogies, connect it to your setup, or walk through the math conceptually if you want.
Encoder vs decoder models?
Why do large models suddenly get new abilities?
What is positional encoding?
🎯
Week 2 Challenge
Observe attention and scale in the wild using models you already have access to.
Give Claude a long document (paste any article). Ask it to find a specific piece of information from the beginning after discussing something from the end. Observe that it can do this — that is attention spanning the full context.
Run the same complex reasoning task through a smaller model and a larger model (Claude). Document where the small model fails and where it succeeds. You are observing the effect of scale on capability.
Find the Andrej Karpathy "Let's build GPT from scratch" video on YouTube. Watch the first 20 minutes. You do not need to follow the code — just absorb the conceptual walkthrough of next-token prediction.
🌍 Real World

When you generate multiple variations of content, the attention mechanism is what lets the model understand the relationship between your brand niche, target audience, and style — all at once.

🧠 Week Quiz
Q1. What does the attention mechanism do?
Q2. Transformers process tokens...
Q3. More parameters in a model generally means...
Q4. True or False: The Transformer architecture was introduced in the paper "Attention Is All You Need".
Q5. True or False: Self-attention allows each token to look at every other token in the sequence.
Pillar 1 — LLM Internals Week 03 of 12

Training, RLHF
& Alignment

Why does Claude refuse some requests? Why does it hedge? Why does GPT-4 feel different from Llama? The answer is in how these models were trained after their base pre-training — and that process shapes everything you interact with.

By the end of this week you will be able to:

Explain the difference between pre-training and fine-tuning
Understand what RLHF is and why it shapes model personality and behaviour
Know what the "alignment tax" means and when it matters for your use case
Explain why different models behave differently on the same prompt

Pre-Training vs Fine-Tuning

There are two distinct phases in creating a modern LLM. Pre-training is where the base model is created. The model sees trillions of tokens of internet text, books, code, and scientific papers. Its only job: predict the next token. After this phase, you have a powerful but raw text-completion engine. It will complete your sentences, but not necessarily in a helpful or safe way.

Fine-tuning is where personality, helpfulness, and safety emerge. The pre-trained model is further trained on curated datasets of good (and bad) responses. This is where Claude's politeness, GPT-4's formatting style, and Llama's different tendencies all originate.

Analogy

Pre-training is like spending 20 years reading every book ever written. You emerge with vast knowledge but no particular direction. Fine-tuning is like doing a professional apprenticeship — you learn how to apply that knowledge in a specific, useful, appropriate way. The raw knowledge stays; what changes is the disposition.

RLHF — Why Models Have Personalities

RLHF stands for Reinforcement Learning from Human Feedback. It is the technique that transformed raw language models into assistants you can have a coherent conversation with. Here is how it works:

Step 1: Supervised Fine-Tuning (SFT)

Human contractors write examples of ideal prompt-response pairs. The model is fine-tuned to imitate these. This gives it a starting point for helpful behaviour.

Step 2: Reward Model Training

The model generates multiple responses to the same prompt. Humans rank these responses from best to worst. A separate "reward model" is trained to predict these human preferences — it learns to score responses the way a human rater would.

Step 3: RL Optimisation

The main model is then trained using reinforcement learning, with the reward model as the signal. It learns to generate responses that get high scores. Over many iterations, it learns to be helpful, harmless, and honest — because those responses score highest with human raters.
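Step 2 is often described with a pairwise preference loss: the reward model is penalised whenever it scores the human-preferred response lower than the rejected one. The sketch below shows that commonly described formulation (a Bradley-Terry style loss). It is an illustration under that assumption, not a statement of how any particular lab trains its models, and `preference_loss` is a hypothetical name.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written as log(1 + e^(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))


# The reward model agrees with the human ranking: small loss.
print(f"{preference_loss(2.0, 0.0):.3f}")
# The reward model disagrees with the human ranking: large loss.
print(f"{preference_loss(0.0, 2.0):.3f}")
```

Minimising this loss over many ranked pairs pushes the reward model to score responses the way the raters did.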

Expert Insight
This is why Claude sounds like Claude and not like a raw text completer. The "personality" you experience — the hedging, the helpfulness, the refusals — is not programmed as rules. It is the statistical residue of thousands of human raters expressing their preferences. Different companies, different raters, different preferences → different model personalities.
Knowledge Check
You notice that an open-source base model (pre-training only, no RLHF) completes your story prompt with violent content, while Claude declines. What explains this difference?
A. Claude was trained on cleaner data
B. Claude has undergone RLHF which trained it to avoid content human raters scored negatively, while the base model simply predicts the statistically likely continuation regardless of content
C. Claude has hard-coded rules that block certain content types
D. Claude is a larger model

The Alignment Tax

RLHF is not free. The process of training a model to be safe and helpful sometimes reduces its raw capability in specific areas. This tradeoff is called the alignment tax.

A base model asked to write a persuasive essay will write one — no caveats, no hedging, no "however, the other side argues..." An aligned model hedges, adds disclaimers, and balances perspectives. In many real-world use cases, this is exactly what you want. But in others — like writing marketing copy or generating creative fiction — the hedging gets in the way.

Practical Application
When your content generator produces output that feels overly cautious or neutral, that is the alignment tax showing up. The fix is not switching models — it is using system prompts to explicitly grant the model permission to be direct and match your brand voice. You are not overriding safety; you are shifting the model's prior about what kind of output is appropriate in your context.
AI Tutor — Week 3
🤖
This week covers why I behave the way I do. You can ask me anything — including questions about my own training and why I sometimes refuse things or hedge. I'll be as transparent as I can.
What is Constitutional AI?
How do I reduce hedging in my tool?
RLHF vs DPO?
🎯
Week 3 Challenge
Observe RLHF effects directly by comparing model behaviours across the same prompts.
Ask several different LLMs the exact same controversial question (e.g. "Write a strongly one-sided argument for X"). Document the differences in hedging, refusals, and caveats. You are observing different RLHF choices.
Find a case in your current tools where the model output feels "over-aligned" — too cautious, too hedged. Rewrite the system prompt to grant explicit context and permission. Measure whether the output quality improves.
Read Anthropic's model card for Claude (anthropic.com/model-card). Identify three specific training choices they made and why. Connect each choice back to what you learned about RLHF this week.
🌍 Real World

RLHF is why Claude refuses to write spam emails even if you ask nicely: it was trained with human feedback to weigh harm avoidance alongside helpfulness rather than follow instructions blindly. Understanding this helps you prompt more effectively.

🧠 Week Quiz
Q1. What is fine-tuning?
Q2. RLHF stands for...
Q3. The "alignment tax" refers to...
Q4. True or False: Pre-training uses labeled, task-specific data.
Q5. True or False: RLHF can make a model refuse requests it would otherwise comply with.
Pillar 1 — LLM Internals Week 04 of 12

Inference, Temperature
& Sampling

You can control what kind of output you get by tuning the sampling parameters. Most people change temperature without knowing what it actually does. After this week, you will.

By the end of this week you will be able to:

Understand what temperature, top-p, and top-k actually do to model output
Know when to use low temperature vs high temperature for different tasks
Explain why the same prompt can produce different outputs each run
Understand what KV-cache is and why it matters for speed and cost

How Text Generation Actually Works

After all the attention and transformer layers run, the model produces a probability distribution over its entire vocabulary — typically 50,000–100,000 tokens. For each possible next token, there is a probability. The token "the" might have 15% probability. "a" might have 8%. "therefore" might have 0.03%.

The model does not just pick the highest-probability token every time. It samples from this distribution. This is where the randomness comes from — and why the same prompt can produce different responses.

Temperature — The Randomness Dial

Temperature scales the probability distribution before sampling. Low temperature (0.1–0.3) makes high-probability tokens even more dominant — the output becomes more deterministic and predictable. High temperature (0.8–1.5) flattens the distribution — lower-probability tokens get more of a chance, producing more creative and varied output. Temperature = 0 always picks the single highest-probability token (greedy decoding).

Temperature Effect on Probabilities (illustrative values)
# Original probabilities for next token:
"the": 40% | "a": 20% | "some": 5% | "many": 2%
# At temperature = 0.1 (very deterministic):
"the": >99% | "a": <1% | "some": ≈0% | "many": ≈0%
# At temperature = 1.5 (more creative):
"the": 28% | "a": 18% | "some": 12% | "many": 9%
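You can reproduce this effect yourself. The sketch below assumes a toy four-token vocabulary and applies temperature by raising each probability to the power 1/T and renormalising, which is equivalent to dividing the logits by T before the softmax. `apply_temperature` and `sample` are hypothetical helper names.

```python
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: p ** (1 / T), then renormalise."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding for 0")
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


def sample(probs, rng):
    """Draw one token at random according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy distribution over a four-token vocabulary (sums to 1).
base = {"the": 0.60, "a": 0.25, "some": 0.10, "many": 0.05}

for t in (0.1, 1.0, 1.5):
    dist = apply_temperature(base, t)
    print(t, {tok: round(p, 3) for tok, p in dist.items()})

rng = random.Random(0)  # seeded so repeated runs draw the same tokens
print([sample(apply_temperature(base, 1.5), rng) for _ in range(5)])
```

Low temperature piles almost all the probability onto "the"; high temperature shares it out, so the sampled tokens vary more.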
Expert Insight
For your content generator, you want moderate-high temperature (0.7–1.0): enough variety to make each piece of content feel fresh, but not so high that the output becomes incoherent. For your chatbot answering factual questions, use low temperature (0.1–0.3): you want consistent, reliable answers, not creative variation.
Knowledge Check
You are building a legal document summariser. Lawyers need consistent, reliable output that does not vary between runs. Which temperature setting should you use and why?
A. Very low (0.0–0.2), because low temperature concentrates probability mass on the most likely tokens, producing consistent, predictable output
B. Medium (0.7), to balance creativity and accuracy
C. High (1.5), to generate more diverse summaries
D. Temperature does not matter for summarisation tasks

Top-P, Top-K & the KV Cache

Top-K limits sampling to the K most probable tokens. Top-K = 50 means only the 50 highest-probability tokens are considered for sampling, regardless of their probabilities. Top-P (nucleus sampling) is more dynamic — it includes enough of the top tokens to sum to probability P. Top-P = 0.9 includes however many tokens it takes to reach 90% cumulative probability. These two parameters work together to prevent the model from sampling highly improbable tokens.
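Both filters are short enough to sketch directly. The example assumes a toy four-token distribution and renormalises the surviving tokens so they sum to 1 again; `top_k_filter` and `top_p_filter` are hypothetical helper names, and production implementations work on logits over the full vocabulary.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens, then renormalise."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


probs = {"the": 0.50, "a": 0.30, "some": 0.15, "many": 0.05}
print(list(top_k_filter(probs, 2)))    # ['the', 'a']
print(list(top_p_filter(probs, 0.9)))  # ['the', 'a', 'some']
```

Top-K always keeps a fixed number of tokens, while Top-P keeps more tokens when the model is unsure and fewer when one token dominates.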

KV-Cache is a speed optimisation. During inference, the attention mechanism needs to look back at all previous tokens. Recomputing the key-value pairs for every previous token on every new token generation would be extremely slow. The KV-cache stores these computed pairs so they only need to be computed once. This is why generating each new token does not require reprocessing the whole sequence from scratch. It is also why API providers can offer "prompt caching": if your system prompt is the same across many calls, they can cache its KV-pairs and charge you less.
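A simple count shows how much work the cache saves. This sketch only tallies how many key-value pairs get computed (per layer and per head) while generating; it is a simplified model of the bookkeeping, not an implementation of attention, and `kv_computations` is a hypothetical name.

```python
def kv_computations(prompt_tokens, new_tokens, use_cache):
    """Count key-value pairs computed while generating `new_tokens` tokens."""
    total = 0
    cached = 0
    for step in range(new_tokens):
        context_length = prompt_tokens + step  # tokens visible at this step
        if use_cache:
            total += context_length - cached   # only positions not yet stored
            cached = context_length
        else:
            total += context_length            # recompute every position
    return total


print(kv_computations(1000, 100, use_cache=False))  # 104950
print(kv_computations(1000, 100, use_cache=True))   # 1099
```

With the cache, the prompt is processed once and each later step adds a single new pair, which is the same reason a shared system prompt can be cached across calls.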

Parameter | Controls | Use Low When | Use High When
temperature | Overall randomness | Consistency needed | Creativity needed
top_k | Vocabulary cutoff (count) | Focused, on-topic output | Diverse vocabulary
top_p | Vocabulary cutoff (probability) | Predictable phrasing | Natural, varied text
max_tokens | Response length ceiling | Concise answers needed | Long-form content
AI Tutor — Week 4
🤖
Week 4 wraps up Pillar 1. Ask me anything about temperature, sampling, or inference — or about how to apply these settings in your tools. This is very practical knowledge.
Best temperature for content generation?
How does prompt caching save money?
Top-p vs top-k in practice?
🎯
Week 4 Challenge — Pillar 1 Capstone
You have now finished Pillar 1. This challenge tests your full understanding of LLM internals.
In your application, locate the LLM API call. Add temperature as an explicit parameter. Test content generation at temp 0.3, 0.7, and 1.2. Write one sentence describing the difference you observe in the output quality and variety.
Without looking at your notes, explain to someone (or write it down) how an LLM generates a response — covering: tokenisation → embedding → attention → probability distribution → sampling → output. If you can do this clearly, you understand Pillar 1.
Enable prompt caching on your most-used API. Measure the latency difference between a cached and uncached first request. Document the cost saving on your monthly estimate.
🌍 Real World

In your agent, temperature 0.4 is used for drafting tool pages: low enough to stay factual and structured, but not so low that every review sounds identical. This was a deliberate choice.

🧠 Week Quiz
Q1. Higher temperature in LLM inference means...
Q2. Top-P sampling controls...
Q3. The KV Cache speeds up inference by...
Q4. True or False: Temperature 0 always produces the same output for the same input.
Q5. True or False: Top-K and Top-P are mutually exclusive — you can only use one.
Pillar 2 — Prompt Engineering Week 05 of 12

Instruction Clarity
& Role Framing

Most prompts fail not because the model is incapable, but because the instruction is ambiguous. This week you learn to write prompts that leave no room for the model to guess what you want.

By the end of this week you will be able to:

Write every instruction with a verb, object, and success criterion
Use role framing to shift model behaviour without changing the task
Know when to use system prompts vs user messages vs injected context
Build a personal prompting test habit: stimulus → expected → actual → delta

The Four-Layer Prompting Stack

Every prompt operates across four layers simultaneously. Beginners think about layer one. Experts think about all four before writing a single word.

Layer 1 — Instruction

What you want the model to do. The verb matters enormously. "Write" vs "Summarise" vs "Critique" produces fundamentally different outputs even with identical context. Always use a specific action verb.

Layer 2 — Context

What the model needs to know that it does not already know. Your background, your audience, your constraints. The model has no memory of previous conversations. If it matters, it must be in the prompt.

Layer 3 — Format

How the output should be structured. Bullet list vs prose vs JSON vs table. Specify length. Specify structure. If you do not, the model picks the path of least resistance — which is rarely what you want.

Layer 4 — Constraints

What the model must avoid. No jargon. No preamble. Keep under 100 words. Constraints are where most prompts fail — people specify what they want but not what they do not want.

The Test
Before sending any prompt ask: "Could a competent person misinterpret this and produce something technically correct but completely wrong for my needs?" If yes, add more context or constraints.
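One way to make the four layers a habit is to build prompts from a template that forces you to fill in each one. The sketch below is a minimal example of that idea; `build_prompt` and its section labels are hypothetical choices, not a required format.

```python
def build_prompt(instruction, context, output_format, constraints):
    """Assemble a prompt with all four layers made explicit."""
    if not instruction.strip():
        raise ValueError("instruction is required")
    sections = [
        f"Instruction: {instruction}",
        f"Context: {context}",
        f"Format: {output_format}",
        "Constraints:\n" + "\n".join(f"- {c}" for c in constraints),
    ]
    return "\n\n".join(sections)


prompt = build_prompt(
    instruction="Summarise the customer email below in plain language.",
    context="The reader is a support manager with two minutes to spare.",
    output_format="Three bullet points, each under 20 words.",
    constraints=["No jargon", "No preamble", "Keep under 100 words"],
)
print(prompt)
```

If you cannot fill in one of the four arguments, that is usually the layer your prompt is missing.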

Role Framing — Shifting the Prior

When you tell the model "You are a senior growth marketer with 10 years' experience in SaaS," you are not pretending. You are statistically shifting which part of the model's training distribution gets activated. All those marketing books, case studies, and expert interviews in the training data become more relevant.

Role Framing: Before vs After
# BEFORE (no role framing):
"Write a description for a project management SaaS tool."
# Result: Generic, bland, reads like a Wikipedia stub.

# AFTER (with role framing):
"You are a conversion copywriter specialising in B2B SaaS.
Your copy is direct, benefit-led, and avoids feature-dumping.
Write a 3-sentence product description for a project management
tool aimed at engineering teams. Lead with the outcome."
# Result: Sharp, opinionated, actually useful.
Applied to Your Work
Your content generation system prompt should not start with "You are an AI assistant." It should start with "You are a [role]. Your [output] is [brand voice], [key characteristic]."
Knowledge Check
You send the prompt: "Help me with my email." The model produces a generic email template. What is the primary failure?
A. The model is not capable enough
B. The prompt is missing all four layers: no specific verb, no context, no format, no constraints
C. Temperature was set too high
D. The prompt needed a role framing prefix
AI Tutor — Week 5
🤖
Week 5 is where prompting gets practical. Paste any prompt you are currently using and I will critique it against the four-layer framework.
Best role framing phrases?
Writing effective constraints
System prompt vs user message
🎯
Week 5 Challenge
Rewrite three real prompts you use today using the four-layer framework.
Take your content generator's current prompt. Rewrite it with all four layers explicit. Run both versions and compare output quality.
Split your SaaS chatbot system prompt cleanly into system (persona + rules) vs user (dynamic query). Test that the split works correctly.
Write 5 different role framings for the same task. Run all 5. Document which role produced the best output and why that role's training distribution was most relevant.
🌍 Real World

The four-layer prompting stack, combined with role framing, is what shapes a production system prompt: role: [specific role], instruction: [action], context: [your data], format: concise, constraints: [what to avoid].

🧠 Week Quiz
Q1. Role framing in a prompt works by...
Q2. Which prompt is better?
Q3. Output format instructions should be...
Q4. True or False: Adding "think step by step" to a prompt can improve reasoning accuracy.
Q5. True or False: System prompts are visible to end users by default.
Pillar 2 — Prompt Engineering Week 06 of 12

Chain of Thought
& Structured Reasoning

LLMs do not "think" before they answer — they generate tokens left to right. But you can force reasoning to happen by making it part of the generation process itself.

By the end of this week you will be able to:

Understand why "think step by step" works at the architectural level
Build multi-step prompts that decompose hard tasks into verifiable sub-steps
Use XML tags to create reliable, parseable output structures
Know when chain of thought helps vs when it is overkill

Why "Think Step by Step" Works

When you say "think step by step," you are telling the model to write out the reasoning before giving the answer. Each reasoning step becomes tokens in the context window — and those tokens are then available as context for generating the next reasoning step. You are forcing computation to happen in the output itself.

Analogy

Trying to solve a complex maths problem entirely in your head vs writing it on paper. On paper, each step you write down is available to inform the next. "Think step by step" gives the model paper to work on.

Chain of Thought: Pattern
# WEAK (forces answer without reasoning):
"Should I add RAG or fine-tuning to my chatbot? Answer: "

# STRONG (forces reasoning first):
"Should I add RAG or fine-tuning to my SaaS chatbot?
Think through step by step:
1. What problem am I actually trying to solve?
2. What are the tradeoffs of each approach?
3. What constraints matter (cost, maintenance, data freshness)?
4. Given the above, what is the recommendation and why?"
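The pattern above can be wrapped in a small helper so that every chain-of-thought prompt has the same shape and the final answer is easy to pull out. This is a minimal sketch: the function names `buildCotPrompt` and `extractAnswer`, and the `ANSWER:` marker convention, are assumptions made for illustration rather than part of any API.

```javascript
// Build a chain-of-thought prompt: numbered reasoning steps first, answer last.
function buildCotPrompt(question, steps) {
  const numbered = steps.map((s, i) => `${i + 1}. ${s}`).join("\n");
  return [
    question,
    "Think through step by step:",
    numbered,
    "After the reasoning, give the final recommendation on its own line, prefixed with ANSWER:"
  ].join("\n");
}

// Pull the final answer out of the model's response (null if the marker is missing).
function extractAnswer(responseText) {
  const match = responseText.match(/^ANSWER:\s*(.+)$/m);
  return match ? match[1].trim() : null;
}
```

Asking for the answer after the reasoning matters: the reasoning tokens only help if they are generated before the answer tokens.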

XML Tags for Structured Output

When you need to parse model output programmatically, XML tags are one of the most robust options. Models have seen large amounts of XML-style markup in their training data and usually place content inside the tags you specify.

XML Output Extraction Pattern
system: "Analyse the SaaS tool and respond ONLY in this format:
<category>[CRM|Analytics|DevTools|Marketing|Finance]</category>
<score>[1-10 fit score for small business]</score>
<reason>[one sentence]</reason>
No other text."

// In your API handler:
const category = text.match(/<category>(.*?)<\/category>/s)?.[1];
const score = text.match(/<score>(.*?)<\/score>/s)?.[1];
const reason = text.match(/<reason>(.*?)<\/reason>/s)?.[1];
Expert Insight
For your automation agents, structured XML output is essential. Free-form text makes downstream parsing fragile. Tagged XML makes parsing deterministic and your agents dramatically more reliable.
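Extraction alone does not tell you whether the model followed the format, so it helps to validate each field before passing it downstream. The sketch below assumes the three tags and the category list from the pattern above; `extractTag` and `parseToolAnalysis` are hypothetical helper names.

```javascript
// Extract the text inside <tag>...</tag>, or null if the tag is absent.
function extractTag(text, tag) {
  const match = text.match(new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`));
  return match ? match[1].trim() : null;
}

const CATEGORIES = ["CRM", "Analytics", "DevTools", "Marketing", "Finance"];

// Parse and validate the three fields; throws with a specific message on failure.
function parseToolAnalysis(text) {
  const category = extractTag(text, "category");
  const score = Number(extractTag(text, "score"));
  const reason = extractTag(text, "reason");
  if (!CATEGORIES.includes(category)) throw new Error(`Invalid category: ${category}`);
  if (!Number.isInteger(score) || score < 1 || score > 10) throw new Error(`Invalid score: ${score}`);
  if (!reason) throw new Error("Missing reason");
  return { category, score, reason };
}
```

A thrown error here is a useful signal: you can retry the request, log the raw output, or route it to manual review instead of silently saving bad data.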
Knowledge Check
Without chain of thought, a model says "yes" when asked if a tool fits your database. With chain of thought it says "no" after reasoning. Why does reasoning change the answer?
A
The model used different temperature settings
B
The chain of thought version had more information
C
Each reasoning step generates tokens that become context for the next — enabling multi-step logic that a direct answer cannot perform
D
The direct answer was too short to be accurate
AI Tutor — Week 6
🤖
Chain of thought is one of the highest-leverage prompting techniques. Ask me to help you build a CoT prompt for any real task you are working on — I will structure it with you.
CoT prompt for tool categorisation
When does CoT hurt performance?
Zero-shot vs few-shot CoT?
🎯
Week 6 Challenge
Build a structured, parseable output pipeline for one of your agents.
Rewrite your agent output prompt to return structured XML with tags for: name, category, score, reason. Test that you can extract each field reliably.
Find a decision your chatbot makes poorly. Add a chain of thought step before the final recommendation. Document the before/after quality difference.
Write a CoT prompt that evaluates a Reddit post for affiliate opportunity in 4 steps: (1) identify the pain, (2) match to a tool category, (3) check if we have a relevant tool, (4) draft a comment hook. Test on 3 real posts.
🌍 Real World

Chain of Thought is why agents score and evaluate data before drafting — forcing the model to reason systematically first dramatically improves the quality of tool identification.

🧠 Week Quiz
Q1. Chain of Thought prompting works by...
Q2. XML tags in prompts are useful for...
Q3. Structured output (JSON) from LLMs is best achieved by...
Q4. True or False: Zero-shot Chain of Thought uses the phrase "think step by step".
Q5. True or False: LLMs always produce valid JSON when asked.
Pillar 2 — Prompt Engineering Week 07 of 12

Few-Shot Examples
& Output Control

Telling a model what you want is good. Showing it is better. Few-shot prompting is the single most reliable technique for getting consistent, correctly-formatted output.

By the end of this week you will be able to:

Build few-shot prompts that demonstrate the exact output pattern needed
Use negative examples to define what you explicitly do not want
Control output length, tone, and format through explicit specification
Know the difference between format control and semantic control

Few-Shot Prompting — Show, Don't Just Tell

A few-shot prompt includes 2–5 examples of ideal input-output pairs before the actual task. Examples do not just show what to produce — they demonstrate the level of detail, vocabulary, tone, structure, and reasoning approach you expect. A good example is worth a paragraph of instructions.

Few-Shot Pattern: Content Generation
"Generate a content piece for a product.

EXAMPLE 1:
Tool: Notion
Pin: Stop juggling 12 tabs. Notion puts your docs, tasks, and wikis in one place, so you actually finish projects instead of managing them. #ProductivityTools #NotionApp #RemoteWork

EXAMPLE 2:
Tool: Zapier
Pin: If you copy data between apps manually, you lose hours weekly. Zapier automates 5,000+ connections with zero code. #Automation #WorkSmarter #NoCode

NOW DO:
Tool: [tool name]"

Notice what the examples communicate beyond the obvious: benefit-led opener, conversational but punchy tone, specific number when possible, 3 hashtags at the end. None of this was stated in instructions — the examples showed it.
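If you keep your examples in a small library, the few-shot prompt can be assembled in code. This is a minimal sketch under stated assumptions: the `EXAMPLES` array, the `category` field, and the helper names are hypothetical, and preferring same-category examples is a design choice, not a rule.

```javascript
// Hypothetical example library: in practice, your own hand-written pins.
const EXAMPLES = [
  { tool: "Notion", category: "Productivity", pin: "Stop juggling 12 tabs. Notion puts your docs, tasks, and wikis in one place. #ProductivityTools #NotionApp #RemoteWork" },
  { tool: "Zapier", category: "Automation", pin: "If you copy data between apps manually, you lose hours weekly. Zapier automates 5,000+ connections with zero code. #Automation #WorkSmarter #NoCode" },
];

// Prefer examples from the same category, then fill up with the rest.
function pickExamples(library, category, count = 3) {
  const matched = library.filter(e => e.category === category);
  const others = library.filter(e => e.category !== category);
  return [...matched, ...others].slice(0, count);
}

// Assemble the prompt in the same layout as the pattern above.
function buildFewShotPrompt(examples, toolName) {
  const shots = examples
    .map((e, i) => `EXAMPLE ${i + 1}:\nTool: ${e.tool}\nPin: ${e.pin}`)
    .join("\n\n");
  return `Generate a content piece for a product.\n\n${shots}\n\nNOW DO:\nTool: ${toolName}\nPin:`;
}
```

Ending the prompt with `Pin:` nudges the model to continue in exactly the format the examples established.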

Negative Examples & Anti-Patterns

Negative examples are underused and extremely effective. If there is a specific failure mode your model keeps hitting, showing it an example of what bad looks like (labelled bad) is often faster than writing instructions that try to prevent it.

Negative Example Pattern
"Generate a content piece.

BAD (do not do this):
'Notion is a powerful all-in-one workspace that offers a wide range of features including notes, databases, kanban boards...'
(Too long, feature-led, no hook.)

GOOD (do this):
'Stop juggling 12 tabs. Notion puts everything in one place.'
(Short, benefit-led, hooks in 2 seconds.)

Now write a GOOD pin for: [tool]"
Expert Insight
The mark of expert prompt engineering is not writing long prompts. It is writing the shortest prompt that reliably produces the right output. Few-shot examples often let you delete 80% of your written instructions.
Knowledge Check
You add 3 examples to your pin prompt, but all are for CRM tools and you are now generating pins for DevTools. Output quality drops. What went wrong?
A
Too few examples
B
Domain mismatch — the model learned CRM-specific patterns and applied them to a different category
C
Temperature was too high
D
The model did not understand the output format
AI Tutor — Week 7
🤖
Give me your current output quality problem and I will help you build the right examples to fix it. What is your most frustrating inconsistent output right now?
Build few-shot examples for my content
How many examples and does order matter?
Controlling output length reliably
🎯
Week 7 Challenge
Build a few-shot library for your most important prompt.
Write 10 example pins manually — 1 per SaaS category. Test that using 3 category-matched examples produces better pins than 3 random-category examples.
Write one BAD negative example showing your worst model failure, and one GOOD example showing what you wanted. Add both to the prompt. Does the failure mode disappear?
Build a Variation Studio prompt that takes one good pin and generates 5 variations with different angles: pain-led, curiosity-led, social proof, number-led, question-led.
🌍 Real World

Your scoring prompt guides the model with a rubric (9-10: buying intent, 7-8: pain point, etc.) rather than worked examples. The rubric plays a similar role to few-shot examples: it anchors each score band to a concrete description, and it works.

🧠 Week Quiz
Q1. Few-shot prompting means...
Q2. Output control via constraints means...
Q3. When few-shot examples conflict with instructions, models typically...
Q4. True or False: More few-shot examples always improve performance.
Q5. True or False: You can use negative examples (what NOT to do) in few-shot prompting.
Pillar 2 — Prompt Engineering Week 08 of 12

Prompt Debugging
& Evaluation

Most people iterate prompts by feel. Experts iterate by data. This week you build a system for diagnosing failures and measuring improvement — the same discipline that powers production AI.

By the end of this week you will be able to:

Identify and name the 5 core prompt failure modes
Build a personal prompt testing framework: stimulus → expected → actual → delta
Build a golden dataset for your most important prompt
Use temperature as a diagnostic tool, not just a creative lever

The 5 Prompt Failure Modes

Every bad output falls into one of five categories. Name the failure mode first — then the fix becomes obvious.

1. Hallucination

Confident false information. Fix: ground the model with real data in the prompt, or add "if you are not certain, say so explicitly."

2. Instruction Drift

Starts following instructions, then drifts away mid-response. Fix: repeat the most critical constraints at the end of the prompt, not just the beginning.

3. Format Collapse

Ignores your specified format. Fix: use XML tags, provide a concrete format example, or add "return ONLY the specified format, no other text."

4. Sycophancy

Agrees with whatever the user says, even if wrong. Fix: explicitly instruct "If my premise is wrong, correct it first. Prioritise accuracy over agreement."

5. Over-Refusal

Refuses a legitimate task or hedges excessively. Fix: provide clearer legitimate context in the system prompt. Role framing and explicit use-case statements reduce this significantly.

Building a Golden Dataset

A golden dataset is a set of input-output pairs representing what "correct" looks like for your specific use case. You write these manually, based on your expert judgement. Even 20 examples is enough to start. Once you have it, you can evaluate any prompt change objectively.

Simple Golden Dataset Structure
// golden_pins.json
[
  {
    "input": { "tool": "Notion", "category": "Productivity" },
    "must_contain": ["benefit_led", "hashtags", "under_150_chars"],
    "must_not_contain": ["feature_list", "passive_voice"]
  }
  // ... 19 more examples
]

// Score a prompt version:
const score = goldenSet.filter(ex =>
  meetsAllCriteria(newOutput(ex.input), ex.must_contain, ex.must_not_contain)
).length / goldenSet.length; // → 0.0 to 1.0
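The scoring snippet above calls `meetsAllCriteria` without defining it. One possible shape is a registry that maps each criterion name to a simple predicate over the generated text. The checks below are deliberately crude heuristics and are assumptions for illustration; criteria such as `benefit_led` usually need an LLM judge rather than a regex.

```javascript
// Each criterion name maps to a predicate over the generated text.
const CHECKS = {
  hashtags: text => /#\w+/.test(text),
  under_150_chars: text => text.length < 150,
  feature_list: text => /\b(features? include|including)\b/i.test(text),
};

// True when every required criterion passes and no forbidden one does.
function meetsAllCriteria(output, mustContain, mustNotContain) {
  const has = name => {
    const check = CHECKS[name];
    if (!check) throw new Error(`No check defined for "${name}"`);
    return check(output);
  };
  return mustContain.every(has) && !mustNotContain.some(has);
}

// Fraction of golden examples whose (pre-generated) output passes.
function scoreOutputs(goldenSet, outputs) {
  const passed = goldenSet.filter((ex, i) =>
    meetsAllCriteria(outputs[i], ex.must_contain, ex.must_not_contain)
  ).length;
  return passed / goldenSet.length;
}
```

Throwing on an unknown criterion name is intentional: a typo in the dataset should fail loudly instead of silently passing every output.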
Knowledge Check
Your pin generator keeps writing feature lists instead of benefit statements. You add "focus on benefits not features" to the instruction but the problem persists 40% of the time. What is the most effective next step?
A
Switch to a different model
B
Write a longer description of what a benefit is
C
Add a BAD example showing the exact feature-led failure and a GOOD example showing benefit-led copy — make the contrast concrete
D
Lower the temperature
AI Tutor — Week 8
🤖
Pillar 2 ends this week. Bring me your hardest prompt failure — describe what you are seeing and I will help you diagnose which failure mode it is and what to try first.
Build an eval script for my content
LLM-as-judge evaluation pattern
What is prompt regression?
🎯
Week 8 Challenge — Pillar 2 Capstone
Build a real evaluation system for your most important prompt.
Write 20 golden examples for your pin generator — 2 per SaaS category. For each define must_contain and must_not_contain criteria. This is your eval dataset permanently.
Run your current prompt against all 20 golden inputs. Score it: what percentage meet your criteria? This is your baseline. Now iterate and beat it.
Identify which failure mode each failing output falls into. Apply the corresponding fix. Re-score. Document the improvement percentage.
🌍 Real World

When your agent produces invalid output, that is an eval failure. In practice the fix is output validation: check every response against the expected format before using it. Evals catch these issues before they reach production.

🧠 Week Quiz
Q1. Prompt debugging starts with...
Q2. A good eval for an LLM output checks...
Q3. Regression testing prompts means...
Q4. True or False: LLM outputs are deterministic at temperature 0.
Q5. True or False: You should test prompts on edge cases, not just happy paths.
Pillar 3 — Tooling & Agents Week 09 of 12

APIs, SDKs
& Production Basics

You are already calling LLM APIs. This week you will learn to do it correctly — handling errors, managing costs, streaming responses, and building the reliability layer that separates hobby projects from production systems.

By the end of this week you will be able to:

Call any major LLM API confidently with correct error handling
Handle rate limits, retries with exponential backoff, and model fallback chains
Understand cost structure: tokens in vs out, caching, batch APIs
Build a model fallback chain that degrades gracefully under load

Production API Pattern — Retry & Fallback

Most people learn to call LLM APIs by copying a quickstart example. Production code is different. It handles the failure cases — which in real-world usage are routine events, not edge cases.

Production API: Retry + Fallback
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

async function callLLM(prompt, env) {
  const models = [
    'llama-3.3-70b-versatile', // primary
    'llama-3.1-8b-instant',    // fallback 1
    'gemma2-9b-it'             // fallback 2
  ];
  for (const model of models) {
    for (let attempt = 0; attempt < 3; attempt++) {
      try {
        const res = await fetch('https://api.example.com/v1/chat/completions', {
          method: 'POST',
          headers: {
            'Authorization': `Bearer ${env.LLM_KEY}`,
            'Content-Type': 'application/json'
          },
          body: JSON.stringify({ model, messages: [{ role: 'user', content: prompt }], max_tokens: 512 })
        });
        if (res.status === 429) {
          await sleep(1000 * (2 ** attempt)); // exponential backoff: 1s, 2s, 4s
          continue;
        }
        if (!res.ok) throw new Error(`HTTP ${res.status}`); // other errors count as a failed attempt
        const data = await res.json();
        return { text: data.choices[0].message.content, model };
      } catch (e) {
        // Network or server error: retry, then fall through to the next model
        console.warn(`${model} attempt ${attempt + 1} failed: ${e.message}`);
      }
    }
  }
  throw new Error('All models failed');
}
Expert Insight
Many applications stop at a simple fallback chain. The pattern above adds exponential backoff and returns which model actually served the request, so you can log it. Without that logging, you have no visibility into how often your primary model fails.
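Fixed delays of 1s, 2s, 4s can cause many clients to retry at the same moment. A common refinement is to add random jitter and to respect the standard HTTP `Retry-After` header when the provider sends one. The sketch below assumes the header arrives as a number of seconds (the HTTP-date form is not handled); `backoffDelayMs` and `retryDelayMs` are hypothetical helper names.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped,
// with "full jitter" so that clients do not all retry at the same instant.
function backoffDelayMs(attempt, { baseMs = 1000, maxMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Use Retry-After (in seconds) when present and numeric, else fall back to backoff.
function retryDelayMs(retryAfterHeader, attempt, options) {
  const seconds = Number(retryAfterHeader);
  if (retryAfterHeader != null && Number.isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  return backoffDelayMs(attempt, options);
}
```

Inside a retry loop this would replace the fixed sleep, for example `await sleep(retryDelayMs(res.headers.get('Retry-After'), attempt))`. The `random` option exists only to make the function testable.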

Cost Architecture — Thinking in Tokens

Cost Driver | What It Is | How to Optimise
Input tokens | Every token in system prompt + user message | Trim system prompts; use prompt caching
Output tokens | Every token the model generates | Set max_tokens; specify concise output format
Prompt caching | Reusing computed KV-pairs for repeated prompts | Put static context first; mark cacheable blocks with cache_control
Model choice | Larger models can cost 10–50x more per token | Use smallest model that meets quality bar
Batch API | 50% discount for non-realtime workloads | Use for your CSV bulk pin exports
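To compare the options in the table, it helps to turn token counts into a monthly figure. This is a minimal sketch: `estimateMonthlyCost` is a hypothetical helper, prices are expressed per million tokens, and the numbers in the usage note are made up, so substitute your provider's current rates.

```javascript
// Estimate monthly spend from average tokens per request.
// inputPricePerM / outputPricePerM are prices per one million tokens.
function estimateMonthlyCost({
  inputTokens,
  outputTokens,
  requestsPerDay,
  inputPricePerM,
  outputPricePerM,
  days = 30,
}) {
  const perRequest =
    (inputTokens / 1e6) * inputPricePerM +
    (outputTokens / 1e6) * outputPricePerM;
  return perRequest * requestsPerDay * days;
}
```

With illustrative prices of 1 per million input tokens and 2 per million output tokens, 1,000 requests a day at 5,000 input and 200 output tokens comes to about 162 per month; trimming the input to 300 tokens brings that to about 21. Input tokens dominate whenever the system prompt is large.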
Knowledge Check
Your API gets a 429 (rate limit) error from your LLM provider. What is the correct production response?
A
Immediately retry the same request
B
Return an error to the user immediately
C
Apply exponential backoff (1s, 2s, 4s), then fall through to the next model if retries are exhausted
D
Wait 60 seconds before retrying
AI Tutor — Week 9
🤖
Pillar 3 starts here. Ask me anything about API patterns, cost optimisation, or production reliability. I can review your existing Worker code and suggest specific improvements.
Streaming in production APIs
How to enable prompt caching
When to use the Batch API
🎯
Week 9 Challenge
Harden your most-used API endpoints with production-grade reliability.
Add exponential backoff + model fallback to your primary Worker. Log which model served each request. Run 50 requests and check what percentage hit the fallback.
Audit your most expensive Worker: count tokens in your system prompt, calculate monthly cost at current volume, find 3 ways to reduce input tokens without losing quality.
Enable Anthropic prompt caching on your chatbot Worker. Compare latency and cost for cached vs uncached requests. Document the real improvement numbers.
🌍 Real World

A production AI system calls LLM APIs, handles errors, retries with fallback models, and persists structured output. Understanding these patterns is what this week teaches.

🧠 Week Quiz
Q1. Rate limiting in production AI means...
Q2. Streaming API responses are useful when...
Q3. Exponential backoff in API calls means...
Q4. True or False: You should store API keys in client-side JavaScript.
Q5. True or False: Serverless functions are a good choice for proxying AI API calls.
Pillar 3 — Tooling & Agents Week 10 of 12

RAG — Retrieval
Augmented Generation

RAG is the most important architecture pattern in applied AI right now. It solves the two biggest LLM problems — hallucination and knowledge cutoff — by grounding generation in real retrieved data.

By the end of this week you will be able to:

Explain what RAG is and why it solves hallucination and knowledge cutoff
Build a simple RAG pipeline: embed → store → retrieve → inject → generate
Understand chunking strategies and why chunk size dramatically affects quality
Know when RAG is the right solution vs fine-tuning vs longer context

The RAG Pipeline — Step by Step

Phase 1 — Indexing (Done Once)

Take your documents (your 50 SaaS tool descriptions). Split them into chunks. Convert each chunk into an embedding vector. Store vectors in a vector database alongside the original text. This is your searchable knowledge base.

Phase 2 — Retrieval (Every Query)

When a user asks a question, convert it into an embedding vector. Search the vector database for the closest chunks. Retrieve the top-K matches. Inject them into the LLM context window. Generate the answer grounded in retrieved context.

RAG for Your SaaS Chatbot
// INDEXING (run once when you update your tool database)
for (const tool of saasData.tools) {
  const text = `${tool.name}: ${tool.description}. Category: ${tool.category}.`;
  const embedding = await embed(text); // Gemini text-embedding-004 (free)
  await vectorDB.insert({ id: tool.id, vector: embedding, metadata: tool });
}

// RETRIEVAL (every user query)
const queryVec = await embed(userQuery);
const topTools = await vectorDB.search(queryVec, { topK: 3 });
const context = topTools.map(t => t.metadata.description).join('\n');
const prompt = `Based only on these tools:\n${context}\n\nAnswer: ${userQuery}`;
Direct Application
Your chatbot currently injects all 50 tool descriptions into every request. With RAG you inject only the 3 most relevant tools — reducing input tokens by ~94% per query and making answers more accurate because the model is not distracted by irrelevant tools.
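For 50 tools you do not strictly need a vector database to see how retrieval works. The sketch below shows what a call like `vectorDB.search(queryVec, { topK: 3 })` does conceptually: score every stored vector by cosine similarity and keep the best matches. The function names and the in-memory array are assumptions for illustration; real vector databases use approximate indexes to do this at scale.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every item, sort descending, keep the top k.
function topK(queryVec, items, k = 3) {
  return items
    .map(item => ({ ...item, score: cosineSimilarity(queryVec, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

Logging the `score` of each retrieved item is a cheap way to debug wrong retrievals: if the top result barely beats the rest, the embeddings are not separating your tools well.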

Chunking — The Most Underrated RAG Decision

Content Type | Chunk Strategy | Reasoning
Tool descriptions (your case) | One tool per chunk | Each tool is a self-contained unit; splitting within loses context
Long articles / blog posts | 300–500 tokens per chunk | Paragraph-level chunks preserve semantic coherence
Technical docs / code | Function or section level | Code blocks are the natural semantic unit
FAQs | One Q&A pair per chunk | The question is the retrieval signal; keep it with the answer
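For the long-article row, a simple way to stay within 300–500 tokens while respecting paragraph boundaries is to pack whole paragraphs greedily. This is a minimal sketch: `chunkByParagraph` is a hypothetical helper, and the 4-characters-per-token estimate is a rough assumption, not a real tokenizer.

```javascript
// Rough token estimate (assumption: about 4 characters per token).
const approxTokens = text => Math.ceil(text.length / 4);

// Greedily pack whole paragraphs into chunks of at most maxTokens.
// A single paragraph larger than the budget becomes its own oversized chunk.
function chunkByParagraph(text, maxTokens = 400) {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(Boolean);
  const chunks = [];
  let current = "";
  for (const para of paragraphs) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && approxTokens(candidate) > maxTokens) {
      chunks.push(current); // budget exceeded: close the current chunk
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Never splitting inside a paragraph trades exact chunk sizes for coherence, which is usually the better deal for retrieval quality.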
Free RAG Stack for Your Site
Embedding: free embedding APIs (millions of tokens/month available). Vector store: multiple free or low-cost options. Total cost at most scales: $0–20/month. This is production-grade RAG at little or no cost.
Knowledge Check
Your RAG chatbot retrieves 3 tools for "email marketing software" but the top result is a project management tool. Most likely root cause?
A
The LLM hallucinated the wrong tool
B
The tool description has overlapping vocabulary with the query in embedding space — the embeddings are semantically too close
C
The chunk size is too large
D
The context window is too small
AI Tutor — Week 10
🤖
RAG is where your application gets dramatically better. Ask me to help you design a production RAG pipeline for your use case.
Design a RAG pipeline for my app
RAG vs fine-tuning?
Adding metadata filters to vector retrieval
🎯
Week 10 Challenge
Build a working RAG prototype for your SaaS chatbot.
Set up a vector database. Write a script that reads your data, generates embeddings for each item using a free embedding API, and inserts them. Verify you can query it.
Update your chatbot Worker to: (1) embed the user query, (2) retrieve top 3 tools from Vectorize, (3) inject only those 3 tools. Compare quality vs "inject all 50" on 10 test queries.
Find 2 queries where RAG retrieves the wrong tool. Diagnose: vocabulary overlap, missing keywords, or embedding model weakness? Apply a fix and re-test.
🌍 Real World

RAG is why your chatbot could know about all 50 tools without stuffing them all into every prompt. Instead of injecting 5,000 tokens upfront, RAG retrieves the 3-5 most relevant tools per query.

🧠 Week Quiz
Q1. RAG stands for...
Q2. The retrieval step in RAG uses...
Q3. A vector database stores...
Q4. True or False: RAG eliminates the need for fine-tuning in all cases.
Q5. True or False: Chunking strategy affects RAG retrieval quality significantly.
Pillar 3 — Tooling & Agents Week 11 of 12

Tool Use, Function Calling
& Agents

An agent is an LLM that can take actions (calling APIs, searching the web) and loop until a task is complete. Agents are a powerful pattern. This week you will learn their architecture deeply enough to build your own.

By the end of this week you will be able to:

Understand the ReAct loop: reason, act, observe, repeat
Define tools for LLM APIs using JSON schema correctly
Build a tool-using agent that calls APIs and processes their results
Understand multi-agent patterns: router, specialist, critic, synthesiser

The ReAct Loop — How Agents Think

ReAct Loop: Conceptual
// Task: "Find top 3 new CRM SaaS tools launched this week"
Thought: "I need to search for recent launches. I'll use web_search."
Action:  web_search("new CRM SaaS tools launched this week 2026")
Observe: [5 candidates found]
Thought: "Now check if any are already in our database."
Action:  check_database(candidates)
Observe: [2 exist already, 3 are new]
Thought: "Need pricing and affiliate program for the 3 new ones."
Action:  fetch_tool_details(new_tools)
Observe: [pricing and affiliate data retrieved]
Final:   "Here are 3 new CRM tools with affiliate potential: ..."
Expert Insight
Production agents follow this exact loop. Making it explicit (clear Thought/Action/Observe steps in the prompt) makes agents dramatically more reliable — the model knows what phase it is in and what to do next.
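In code, the loop is a bounded `for` that alternates between asking the model what to do and running the chosen tool. The sketch below is a minimal version under stated assumptions: `decide(history)` stands in for the LLM call and returns either `{ action, input }` or `{ final }`, and `tools` is a plain object of functions. Both shapes are hypothetical, not a specific SDK's API.

```javascript
// Run a reason/act/observe loop with a hard iteration limit.
async function runAgent(task, decide, tools, maxIterations = 10) {
  const history = [{ role: "task", content: task }];
  for (let i = 0; i < maxIterations; i++) {
    const step = await decide(history); // Thought + chosen Action
    if (step.final !== undefined) return { answer: step.final, history };
    const tool = tools[step.action];
    const observation = tool
      ? await tool(step.input) // Act
      : `Error: unknown tool "${step.action}"`;
    history.push({ role: "action", content: step });
    history.push({ role: "observation", content: observation }); // Observe
  }
  throw new Error(`No final answer after ${maxIterations} iterations`);
}
```

Returning an error string as the observation for an unknown tool, rather than throwing, gives the model a chance to correct itself on the next turn. The iteration limit is what prevents an infinite loop.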

Function Calling — Defining Tools for the Model

Anthropic Tool Definition Pattern
const tools = [{
  name: "search_saas_database",
  description: "Search our SaaS tool database by category or keyword. Use this when the user asks for tool recommendations.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "Search query or category" },
      max_results: { type: "number", description: "Number of results (1-5)" }
    },
    required: ["query"]
  }
}];

// Model returns: { type:"tool_use", name:"search_saas_database", input:{query:"CRM"} }
// Your code executes it and returns results as tool_result back to the model

The tool description is a prompt. Write it like one — specifically, clearly, with context about when to use it. A vague description leads to incorrect tool selection. A precise description leads to reliable, predictable agent behaviour.
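The other half of function calling is your own code: receive the tool request, check it, run it, and package the result. The sketch below assumes the request and result shapes follow the Anthropic Messages API as I understand it (`tool_use` blocks carry an `id`; results go back as `tool_result` with `tool_use_id`), so verify the field names against the current documentation. The `handlers` object and its fake search results are hypothetical stand-ins for your real implementation.

```javascript
// Local implementations keyed by tool name (hypothetical stand-in data).
const handlers = {
  search_saas_database: ({ query, max_results = 3 }) =>
    [`${query} tool A`, `${query} tool B`, `${query} tool C`].slice(0, max_results),
};

// Turn a tool_use block from the model into a tool_result block to send back.
function handleToolUse(toolUse, toolDefs) {
  const def = toolDefs.find(t => t.name === toolUse.name);
  const required = def ? (def.input_schema.required || []) : [];
  const missing = required.filter(key => !(key in toolUse.input));
  let content;
  let isError = false;
  if (!def || !handlers[toolUse.name]) {
    content = `Unknown tool: ${toolUse.name}`;
    isError = true;
  } else if (missing.length > 0) {
    content = `Missing required input: ${missing.join(", ")}`;
    isError = true;
  } else {
    content = JSON.stringify(handlers[toolUse.name](toolUse.input));
  }
  return { type: "tool_result", tool_use_id: toolUse.id, content, is_error: isError };
}
```

Reporting problems back as an error result, instead of crashing, lets the model see what went wrong and try again with corrected input.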

Knowledge Check
Your agent should call search_web first, then check_database. But it sometimes calls check_database first and gets no results. What is the most likely fix?
A
Add more tools to the agent
B
Lower the temperature
C
Update the check_database tool description to specify it must only be called after search_web returns candidates — preconditions must be explicit
D
Switch to a more capable model
AI Tutor — Week 11
🤖
You are already building agents. Ask me to help design an agent for your use case — I can suggest specific structural improvements based on the ReAct framework.
Adding a critic agent
Router vs specialist agents
Preventing agent infinite loops
🎯
Week 11 Challenge
Refactor one existing agent to use an explicit ReAct loop.
Map your agent's current flow to explicit ReAct steps. Rewrite the system prompt to guide the model through Thought/Action/Observe phases explicitly. Test that reliability improves.
Add a critic step: after agent generates a tool recommendation, have a second LLM call review it — "APPROVE or REJECT with reason." Only save APPROVEd results.
Add a max_iterations guard (max 10 rounds). Log every Thought/Action/Observe step. After 20 runs, review logs — which steps are most often wrong? That tells you which tool description to improve first.
🌍 Real World

A real AI agent has tools, a loop, and autonomous decision-making.

🧠 Week Quiz
Q1. Tool use in LLMs means...
Q2. An AI agent differs from a chatbot because it...
Q3. Function calling in LLM APIs works by...
Q4. True or False: Agents always complete tasks successfully on the first try.
Q5. True or False: A well-designed agent should handle tool failures gracefully and retry.
Pillar 3 — Tooling & Agents Week 12 of 12

Evaluation, Debugging
& Production Systems

The final week. Evaluation is how you know your system is actually working — and how you make it better systematically rather than by gut feel. This is what separates builders who ship once from builders who compound.

By the end of this week you will be able to:

Use LLM-as-judge to evaluate open-ended outputs at scale
Build an eval harness: golden dataset, automated scoring, regression detection
Instrument your system: trace every LLM call, log inputs/outputs, track cost
Know the failure modes of agentic systems and how to diagnose them

LLM-as-Judge — Scaling Evaluation

You cannot manually review every output your agent produces. But an LLM can review outputs at scale — evaluating them against your criteria automatically. The key is writing the judge prompt with specific, measurable criteria, not "is this good?"

LLM-as-Judge Pattern
const judgePrompt = `You are evaluating generated content.

CRITERIA (each scored 0 or 1):
1. benefit_led: Does it lead with a user benefit, not a product feature?
2. has_hook: Does the first sentence create curiosity or urgency?
3. correct_length: Is it between 80 and 150 characters?
4. has_hashtags: Does it end with 2-4 relevant hashtags?

PIN: "${pinText}"
TOOL: ${toolName}

Respond ONLY as JSON:
{"benefit_led":0|1,"has_hook":0|1,"correct_length":0|1,"has_hashtags":0|1,"total":0-4}`;

// Run on every generated pin. Aggregate → your system quality score.
// Change a prompt → re-run → compare scores. That is your eval loop.
Expert Insight
LLM-as-judge is not perfect — the judge has its own biases. But it is 100x faster than human review and catches systematic failures reliably. The goal is not perfect measurement; it is directionally correct measurement that lets you iterate with confidence.
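The judge's JSON needs the same scepticism as any other model output. The sketch below parses each response, rejects anything that is not a clean 0 or 1, recomputes the total instead of trusting the model's arithmetic, and averages per criterion. The criterion names come from the judge prompt above; `parseJudgeResult` and `aggregate` are hypothetical helper names.

```javascript
const CRITERIA = ["benefit_led", "has_hook", "correct_length", "has_hashtags"];

// Parse one judge response. Returns null if it is not valid JSON or any
// score is not exactly 0 or 1. The total is recomputed, not trusted.
function parseJudgeResult(raw) {
  let parsed;
  try { parsed = JSON.parse(raw); } catch { return null; }
  if (parsed === null || typeof parsed !== "object") return null;
  const scores = {};
  for (const name of CRITERIA) {
    if (parsed[name] !== 0 && parsed[name] !== 1) return null;
    scores[name] = parsed[name];
  }
  scores.total = CRITERIA.reduce((sum, name) => sum + scores[name], 0);
  return scores;
}

// Mean score per criterion plus mean total, skipping unparseable responses.
function aggregate(rawResults) {
  const valid = rawResults.map(parseJudgeResult).filter(Boolean);
  const mean = key =>
    valid.length ? valid.reduce((s, r) => s + r[key], 0) / valid.length : null;
  const summary = {
    evaluated: valid.length,
    invalid: rawResults.length - valid.length,
    total: mean("total"),
  };
  for (const name of CRITERIA) summary[name] = mean(name);
  return summary;
}
```

Keeping the per-criterion means is what makes a regression diagnosable: a drop in the total tells you something broke, while a drop in `has_hook` tells you what.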

Observability — You Cannot Fix What You Cannot See

Minimal Observability Pattern
async function tracedLLMCall(prompt, context, env) {
  const start = Date.now();
  const traceId = crypto.randomUUID();
  try {
    const result = await callLLM(prompt, env); // env carries the API key callLLM expects
    console.log(JSON.stringify({
      traceId, context,
      inputTokens: estimateTokens(prompt), // estimateTokens: your own helper, e.g. characters / 4
      outputTokens: estimateTokens(result.text),
      latencyMs: Date.now() - start,
      model: result.model,
      success: true
    }));
    return result;
  } catch (e) {
    console.error(JSON.stringify({ traceId, context, error: e.message, success: false }));
    throw e;
  }
}
What to Track at Minimum
Every LLM call should log: trace ID, context (what was the user doing), input token count, output token count, latency, which model served the request, and success/failure. After one week of real traffic, these logs will tell you exactly where to focus optimisation effort.
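Logs only pay off once you summarise them. The sketch below assumes you have already parsed the JSON log lines into objects with the fields listed above; `summariseTraces` and `percentile` are hypothetical helper names, and the percentile uses the simple nearest-rank method.

```javascript
// Nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarise parsed log entries; primaryModel is the first model in your chain.
function summariseTraces(entries, primaryModel) {
  const ok = entries.filter(e => e.success);
  return {
    calls: entries.length,
    errorRate: (entries.length - ok.length) / entries.length,
    fallbackRate: ok.filter(e => e.model !== primaryModel).length / ok.length,
    p95LatencyMs: percentile(ok.map(e => e.latencyMs), 95),
    totalTokens: ok.reduce((s, e) => s + e.inputTokens + e.outputTokens, 0),
  };
}
```

A rising `fallbackRate` is often the first visible sign that the primary model is being rate limited, well before users notice slower or weaker answers.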
Knowledge Check
You update your pin generation prompt. Your LLM-as-judge score drops from 3.2/4 to 2.8/4 across your golden dataset. What is the right response?
A
The difference is within noise — deploy the new prompt anyway
B
The judge criteria are too strict — relax them
C
This is prompt regression — revert, diagnose which criterion dropped and why, fix the specific failure, then re-evaluate before deploying
D
Rewrite the judge prompt — it is scoring incorrectly
AI Tutor — Week 12
🤖
Final week. You have covered all three pillars. Ask me anything — this week's content, applying what you have learned across your whole system, or what to build next. You have earned the right to ask hard questions.
Design an eval system for my Advisor
Common agentic failure modes
What should I build next?
🏆
Week 12 — Final Capstone
The capstone challenge proves mastery by connecting all three pillars across a single system.
Build an LLM-as-judge evaluator for your pin generator with 4 specific criteria. Run it on your golden dataset. Get a baseline score. This is your system quality score going forward.
Add tracing to your most-used Worker. Every LLM call logs traceId, tokens, latency, model, and success. Let it run 48 hours. Review the logs — find one thing to optimise based on real data.
Write a 200-word summary of what you now understand that you did not know 12 weeks ago. Reference tokens, attention, RLHF, RAG, ReAct, and evaluation specifically. Articulation is the final proof of expertise.
🎓

Course Complete

You have covered every foundational concept across all three pillars. The gap between you and most AI users is now significant. The compounding starts from here — every project, every failure you debug, every paper you read accelerates faster because the foundation is solid.

Next level: contribute to an open-source AI project, write about what you learned, or build something that does not exist yet.

🌍 Real World

Production systems use API logs (Observability tab) as eval — you read logs to catch pipeline failures. That is production monitoring in practice.

🧠 Week Quiz
Q1. LLM evaluation metrics include...
Q2. A/B testing prompts in production means...
Q3. Hallucination in LLMs refers to...
Q4. True or False: Human evaluation is always better than automated metrics for LLM outputs.
Q5. True or False: Logging model inputs and outputs in production helps debug failures.
Week 13 of 16 🏗️ Pillar 4 — Build Real Projects

Build a SaaS Review Bot

In this week you build a complete AI-powered SaaS review page generator, a powerful pattern for AI applications. By the end, you have a working bot that discovers tools, scrapes their websites, and drafts professional review pages automatically.

Architecture Overview

The system has three components: a Discovery layer (data search), a Research layer (Jina Reader for web scraping), and a Generation layer (LLM for drafting HTML). These run in sequence as a pipeline.

javascript — Pipeline flow
// Step 1: Discover. Fetch Reddit posts about SaaS needs
const posts = await searchReddit("looking for app alternative to");

// Step 2: Score. Filter for high-intent posts (score >= 7)
const hotPosts = await scoreFunction(posts);

for (const post of hotPosts) {
  // Step 3: Identify. Find the best tool for this need
  const toolInfo = await identifyFunction(post.userIntent);

  // Step 4: Validate. Check the tool website is reachable
  const check = await fetch(toolInfo.url, { method: "HEAD" });
  if (!check.ok) continue; // skip tools whose site does not respond

  // Step 5: Research. Scrape the tool homepage via Jina Reader
  const page = await fetch(`https://r.jina.ai/${toolInfo.url}`);
  const content = await page.text();

  // Step 6: Generate. Draft the full HTML review page
  const html = await draftFunction(toolInfo, content);
}

The Scoring Prompt

The key to quality output is the scoring step. By asking the LLM to evaluate post intent before drafting, you filter out noise and only process genuine buying-intent signals.

javascript — Scoring prompt
const scoringPrompt = `Score these Reddit posts for a SaaS discovery site.

Score = how likely the author needs a SaaS tool recommendation:
9-10: Directly asking for tool / clear buying intent
7-8: Pain point a SaaS tool clearly solves  
5-6: Tangentially related
1-4: Not relevant

Return ONLY a JSON array:
[{"index":1,"relevanceScore":8.5,"matchedCategory":"Email Marketing","userIntent":"Needs email automation"}]`;

const res = await fetch("https://api.example.com/v1/chat/completions", {
  method: "POST",
  headers: { "Authorization": `Bearer ${LLM_API_KEY}`, "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    max_tokens: 800,
    messages: [{ role: "user", content: scoringPrompt }]
  })
});
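The scoring request only becomes a filter once the reply is parsed. Models sometimes wrap the JSON in prose or code fences despite "Return ONLY", so the sketch below slices from the first `[` to the last `]` before parsing and keeps posts at or above the threshold. `parseScores` is a hypothetical helper name; the field names come from the prompt above.

```javascript
// Extract the JSON array from the model's reply and keep high-intent posts.
// Returns an empty array if nothing parseable is found.
function parseScores(replyText, minScore = 7) {
  const start = replyText.indexOf("[");
  const end = replyText.lastIndexOf("]");
  if (start === -1 || end <= start) return [];
  let rows;
  try {
    rows = JSON.parse(replyText.slice(start, end + 1));
  } catch {
    return [];
  }
  if (!Array.isArray(rows)) return [];
  return rows.filter(
    r => r && typeof r.relevanceScore === "number" && r.relevanceScore >= minScore
  );
}
```

With an OpenAI-style response body, the reply text would typically be `data.choices[0].message.content`, the same field the Week 9 retry pattern reads.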

Deploying on Serverless Infrastructure

Serverless platforms are a good fit for this bot: they provide cron triggers for automation and persistent key-value storage for saving drafts, so the entire pipeline runs without managing servers.

toml — wrangler.toml
name = "saas-review-bot"
main = "src/worker.js"
compatibility_date = "2024-11-01"

[triggers]
crons = ["0 8 * * *"]  # runs daily at 8am UTC

[[kv_namespaces]]
binding = "DRAFTS_KV"
id = "YOUR_KV_NAMESPACE_ID"

# Add secrets via dashboard:
# LLM_API_KEY
🌍 Real World

This pattern is used by many agents: cron-triggered + LLM scoring + web scraping, then saves structured output for human review.
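
To connect the cron trigger and the KV binding from the config above, a Worker exposes a `scheduled` handler. The sketch below is one possible shape, under stated assumptions: `runPipeline` is a hypothetical stand-in for the six pipeline steps, the `draft:` key prefix is an arbitrary choice, and in a real Worker the `worker` object would be the module's default export.

```javascript
// Hypothetical stand-in for the six pipeline steps. A real version would
// search, score, identify, validate, research, and draft.
async function runPipeline(env) {
  return [{ slug: "example-tool", html: "<h1>Example Tool</h1>" }];
}

const worker = {
  // The platform calls scheduled() for each cron trigger in wrangler.toml.
  async scheduled(controller, env, ctx) {
    const drafts = await runPipeline(env);
    for (const draft of drafts) {
      // Store each draft under a predictable key for later human review.
      await env.DRAFTS_KV.put(`draft:${draft.slug}`, JSON.stringify(draft));
    }
    console.log(JSON.stringify({ event: "cron_run_complete", drafted: drafts.length }));
  },
};
```

Because the handler only touches `env.DRAFTS_KV.put`, it can be exercised locally by passing a small in-memory object in place of the real KV namespace.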

🧠 Week 13 Quiz
Q1. Why do we split the agent into /run and /process steps?
Q2. Why validate the tool URL with a HEAD request before drafting?
Q3. Jina Reader (r.jina.ai) is used for...
Q4. True or False: CORS proxies are needed because Reddit blocks direct server-side requests.
Q5. True or False: Key-value storage in serverless systems persists data across requests.
Week 14 of 16 🏗️ Pillar 4 — Build Real Projects

Build an Agentic Content Pipeline

An agentic content pipeline runs without human input at each step. It discovers opportunities, researches them, generates content, and publishes — all autonomously. This week you design and build one from scratch.

The 4-Stage Pipeline

A production content pipeline typically has four stages: Signal (what to write about), Research (gather information), Generate (create the content), and Distribute (publish or store). Each stage is a function you can test independently.

javascript — 4-stage pipeline skeleton
async function runContentPipeline(env) {
  // Stage 1: Signal — what topics are trending?
  const signals = await discoverSignals({
    sources: ["reddit", "producthunt", "hackernews"],
    minScore: 7.0,
    maxResults: 10
  });

  // Stage 2: Research — gather facts for each topic
  const researched = await Promise.all(
    signals.map(s => researchTopic(s))
  );

  // Stage 3: Generate — create content for each topic
  const content = await Promise.all(
    researched.map(r => generateContent(r, {
      format: "blog_post",
      wordCount: 800,
      tone: "informative"
    }))
  );

  // Stage 4: Distribute — save drafts for review
  await Promise.all(
    content.map(c => saveDraft(c, env.CONTENT_KV))
  );

  return { drafted: content.length };
}
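
To illustrate what "independently testable" can look like, here is a sketch of the selection logic that might sit inside Stage 1. `selectSignals` and the sample data are hypothetical; the `minScore` and `maxResults` options mirror the skeleton above.

```javascript
// Hypothetical core of Stage 1: given scored candidates, keep the strongest.
function selectSignals(candidates, { minScore = 7.0, maxResults = 10 } = {}) {
  return candidates
    .filter((c) => c.score >= minScore) // drop weak signals
    .sort((a, b) => b.score - a.score)  // highest score first
    .slice(0, maxResults);              // respect the per-run cap
}

const sample = [
  { topic: "invoice automation", source: "reddit", score: 8.2 },
  { topic: "meme generators", source: "hackernews", score: 3.1 },
  { topic: "email warmup tools", source: "producthunt", score: 9.0 },
];
console.log(selectSignals(sample, { maxResults: 1 }).map((s) => s.topic));
// [ 'email warmup tools' ]
```

Keeping the filtering pure (no network calls) means this part of the stage can be checked with plain sample data, while the fetching code is tested separately.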

Multi-Agent vs Single Agent

A single agent handles all steps sequentially. A multi-agent system assigns a specialized agent to each stage: a Scout Agent for signals, a Research Agent for facts, a Writer Agent for content. Multi-agent setups cost more, since they make more LLM calls, but can produce higher-quality output because each agent has a narrower job.

javascript — Parallel research agents
// Run multiple research agents in parallel
const researchResults = await Promise.allSettled([
  redditResearchAgent(topic),    // Reddit sentiment + discussions
  webResearchAgent(topic),       // Jina scrape top 3 results
  competitorResearchAgent(topic) // Check existing content gaps
]);

// Merge successful results
const facts = researchResults
  .filter(r => r.status === "fulfilled")
  .map(r => r.value)
  .join("\n\n");

// Single writer agent synthesizes all research
const article = await writerAgent(topic, facts);
🌍 Real World

A newsletter agent follows this pipeline: Signal (trending data) → Research (context) → Generate (draft) → Distribute (delivery). Each stage can be a separate function.
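
The parallel research step relies on `Promise.allSettled()` so that one failing agent does not discard the others' work. The sketch below shows that behaviour with stand-in agents; `gatherFacts` and the three agent functions are hypothetical and make no real network calls.

```javascript
// Stand-in agents for illustration: two succeed, one fails.
const redditAgent = async (topic) => `Reddit: users discuss ${topic}`;
const webAgent = async (topic) => `Web: top results for ${topic}`;
const flakyAgent = async () => {
  throw new Error("competitor site timed out");
};

async function gatherFacts(topic, agents) {
  // allSettled waits for every agent and never rejects.
  const results = await Promise.allSettled(agents.map((agent) => agent(topic)));

  const facts = [];
  const failures = [];
  for (const r of results) {
    if (r.status === "fulfilled") facts.push(r.value);
    else failures.push(String(r.reason?.message ?? r.reason));
  }
  // Return failures alongside the facts instead of dropping them silently.
  return { facts: facts.join("\n\n"), failures };
}

gatherFacts("CRM tools", [redditAgent, webAgent, flakyAgent]).then((out) => {
  console.log(out.failures); // [ 'competitor site timed out' ]
});
```

With `Promise.all()` the same run would reject as soon as the flaky agent failed, and the two successful results would be lost.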

🧠 Week 14 Quiz
Q1. What is the main advantage of multi-agent over single-agent pipelines?
Q2. Promise.allSettled() vs Promise.all() — which is better for agent pipelines?
Q3. The "Signal" stage of a content pipeline is responsible for...
Q4. True or False: Agentic pipelines should have human review gates before publishing.
Q5. True or False: Each stage of a pipeline should be independently testable.
Week 15 of 16 🏗️ Pillar 4 — Build Real Projects

Build a RAG-Powered Chatbot

RAG (Retrieval-Augmented Generation) lets your chatbot answer questions about your own content without fine-tuning. This week you build a chatbot that knows your catalogue of tools and recommends the right one for each user query.

Step 1 — Build the Knowledge Base

First, convert your tool data into embeddings and store them in a vector store. Each tool description becomes a vector that captures its meaning.

javascript — Embed your tool data
// Load your saas-data.json
const tools = await fetch('/saas-data.json').then(r => r.json());

// Create embedding for each tool
async function embedTool(tool) {
  const text = `${tool.name}: ${tool.description}. 
    Category: ${tool.category}. 
    Best for: ${tool.bestFor}. 
    Pricing: ${tool.pricing}.`;

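  const res = await fetch('https://api.example.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${LLM_API_KEY}`,
      'Content-Type': 'application/json' // tell the API the body is JSON
    },
    body: JSON.stringify({
      model: 'nomic-embed-text-v1_5',
      input: text
    })
  });
  // Fail loudly on HTTP errors instead of reading a missing embedding
  if (!res.ok) throw new Error(`Embedding request failed: ${res.status}`);
  const data = await res.json();
  return { tool, embedding: data.data[0].embedding };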
}

// Embed all tools
const embeddings = await Promise.all(tools.map(embedTool));

Step 2 — Retrieve Relevant Tools

When a user asks a question, embed the query, find the most similar tool embeddings using cosine similarity, and pass only those top matches to the LLM rather than the whole catalogue.

javascript — Cosine similarity retrieval
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

async function retrieveRelevantTools(query, embeddings, topK = 3) {
  // Embed the user query
  const queryEmbedding = await embedQuery(query);
  
  // Score each tool by similarity
  const scored = embeddings.map(({ tool, embedding }) => ({
    tool,
    score: cosineSimilarity(queryEmbedding, embedding)
  }));

  // Return top K most similar tools
  return scored
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(s => s.tool);
}
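
A worked example with toy vectors may make the scoring concrete. The 3-dimensional "embeddings" and tool names below are made up for illustration (real embeddings have hundreds of dimensions), and the zero-vector guard is an addition for this sketch rather than part of the function above.

```javascript
function cosineSimilarity(a, b) {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  // Guard added for this sketch: a zero vector has no direction.
  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}

const query = [1, 0, 1];
const catalogue = [
  { tool: "Tool A", embedding: [2, 0, 2] },   // same direction as the query
  { tool: "Tool B", embedding: [0, 1, 0] },   // orthogonal to the query
  { tool: "Tool C", embedding: [-1, 0, -1] }, // opposite direction
];

const ranked = catalogue
  .map(({ tool, embedding }) => ({ tool, score: cosineSimilarity(query, embedding) }))
  .sort((a, b) => b.score - a.score);

for (const { tool, score } of ranked) console.log(tool, score.toFixed(2));
// Tool A 1.00
// Tool B 0.00
// Tool C -1.00
```

Note that Tool A scores 1 even though its vector is twice as long as the query: cosine similarity compares direction, not magnitude.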

Step 3 — Generate the Answer

Pass the retrieved items as context to your LLM, then generate a recommendation. The model only sees the relevant tools — keeping the context window small and the answer focused.

javascript — RAG answer generation
async function ragAnswer(query, relevantTools) {
  const context = relevantTools.map(t =>
    `${t.name}: ${t.description} (${t.pricing})`
  ).join('\n');

  const res = await fetch('https://api.example.com/v1/chat/completions', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${LLM_API_KEY}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama-3.3-70b-versatile',
      messages: [
        { role: 'system', content: `You are a SaaS expert. Only recommend from these tools:
${context}` },
        { role: 'user', content: query }
      ]
    })
  });
  return (await res.json()).choices[0].message.content;
}
🌍 Real World

A simple chatbot often injects every item into the context on each request. Upgrading to RAG reduces token usage, improves answer relevance, and lets the catalogue grow without hitting context limits.
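
A back-of-the-envelope comparison shows why retrieval helps with token usage. This sketch assumes the rough heuristic of about 4 characters per token and a made-up catalogue of 200 tools with placeholder descriptions; real numbers depend on your tokenizer and data.

```javascript
// Rough heuristic only: about 4 characters per token for English text.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Hypothetical catalogue: 200 tools, each with a 120-character description.
const tools = Array.from({ length: 200 }, (_, i) => ({
  name: `Tool ${i + 1}`,
  description: "x".repeat(120),
}));

// Same context format for both approaches: one "name: description" per line.
const toContext = (items) =>
  items.map((t) => `${t.name}: ${t.description}`).join("\n");

const allTokens = estimateTokens(toContext(tools));             // inject everything
const ragTokens = estimateTokens(toContext(tools.slice(0, 3))); // top 3 only

console.log({ allTokens, ragTokens, ratio: Math.round(allTokens / ragTokens) });
```

The full-catalogue cost grows with every tool you add, while the RAG context stays roughly constant at topK items.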

🧠 Week 15 Quiz
Q1. In RAG, the retrieval step happens...
Q2. Cosine similarity returns a value between...
Q3. Why use topK=3 instead of passing all items?
Q4. True or False: The user query must be embedded using the same model as the documents.
Q5. True or False: RAG requires fine-tuning the base LLM.
Week 16 of 16 🏗️ Pillar 4 — Build Real Projects

Deploy to Production on Cloudflare

Building AI features locally is one thing. Running them reliably in production — with monitoring, error handling, cost controls, and zero downtime — is another. This final week covers everything you need to ship AI to real users.

Production Checklist

Before going live, every AI feature needs to pass this checklist: secrets management, error handling, rate limiting, logging, and a fallback plan.

javascript — Production-ready Worker pattern
export default {
  async fetch(request, env) {
    // 1. Never expose API keys in code — use env secrets
    const apiKey = env.LLM_API_KEY; // set via wrangler secret put

    // 2. Always validate input
    const body = await request.json().catch(() => null);
    if (!body?.question) {
      return new Response(JSON.stringify({ error: "question required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" }
      });
    }

    // 3. Try primary model, fall back if it fails
    let response;
    try {
      response = await callLLM(apiKey, body.question, "llama-3.3-70b-versatile");
    } catch (e) {
      // Fallback to faster model
      response = await callLLM(apiKey, body.question, "llama-3.1-8b-instant");
    }

    // 4. Log for debugging (visible in Cloudflare Observability)
    console.log("Request processed:", { model: response.model, tokens: response.usage?.total_tokens });

    return new Response(JSON.stringify({ text: response.text }), {
      headers: { "Content-Type": "application/json", "Access-Control-Allow-Origin": "*" }
    });
  }
};
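
The try/catch above handles one fallback, but the fallback call itself can also fail. One way to generalise the pattern is a small helper that walks a list of models in order. This is a sketch under assumptions: `withFallback` is a hypothetical name, and `call` stands for whatever function actually talks to your LLM provider.

```javascript
// Hypothetical helper: try each model in order until one succeeds.
async function withFallback(models, call) {
  const errors = [];
  for (const model of models) {
    try {
      return await call(model);
    } catch (err) {
      // Remember why this model failed, then move on to the next one.
      errors.push(`${model}: ${err.message}`);
    }
  }
  // Every model failed: raise one error that records each attempt.
  throw new Error(`All models failed (${errors.join("; ")})`);
}

// Stand-in for a real LLM call: the primary model is "down".
const fakeCall = async (model) => {
  if (model === "primary-model") throw new Error("503 overloaded");
  return { model, text: "ok" };
};

withFallback(["primary-model", "backup-model"], fakeCall).then((r) =>
  console.log(r.model) // backup-model
);
```

In the Worker, the caller would catch the final error and return a 502 or 503 response rather than letting the request crash.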

Monitoring & Observability

In production, you can't debug by looking at the screen. You need logs. Serverless platforms provide observability and logging dashboards that show real-time logs for every request. This is how you catch failures like "Reddit returned 403" or "invalid output from your LLM" before they become user-facing bugs.

javascript — Structured logging pattern
// Instead of: console.log("done")
// Do this instead: structured logs you can filter in the Cloudflare dashboard
console.log(JSON.stringify({
  event: "tool_drafted",
  tool: toolInfo.name,
  category: post.matchedCategory,
  score: post.relevanceScore,
  tokens_used: llmResponse.usage?.total_tokens,
  duration_ms: Date.now() - startTime
}));

// Log errors with full context
console.error(JSON.stringify({
  event: "draft_failed",
  tool: toolInfo.name,
  error: err.message,
  step: "jina_scrape"
}));

Cost Control

AI costs compound fast. Three rules: cap tokens per request, limit batch sizes, and cache responses where possible. Free-tier limits vary by provider, so check yours; as a rough budget, aim to stay under about 1,000 requests/day and 500K tokens/day.

javascript — Cost control patterns
// 1. Cap token output
const res = await callLLM(key, prompt, { max_tokens: 600 }); // not 4096

// 2. Cache repeated requests in KV
const cacheKey = `cache:${hashPrompt(prompt)}`;
const cached = await env.KV.get(cacheKey);
if (cached) return cached; // skip LLM call entirely

const result = await callLLM(key, prompt);
await env.KV.put(cacheKey, result, { expirationTtl: 86400 }); // 24h cache

// 3. Truncate scraped content before sending to LLM
const truncated = scrapedContent.slice(0, 2500); // not full page
🌍 Real World

Every technique this week is a production best practice: secrets via environment variables, structured logging, rate limiting for cost control, and content truncation before LLM processing. Production AI is just disciplined engineering.
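
The "limit batch sizes" rule can be sketched as a helper that caps how many items a run may touch and how many are in flight at once. `processInBatches` and its option names are hypothetical, and the default numbers are placeholders to tune against your own provider limits.

```javascript
// Hypothetical helper: process items in fixed-size batches, with a hard cap
// on how many items a single run may touch.
async function processInBatches(items, handler, { batchSize = 5, maxItems = 20 } = {}) {
  const capped = items.slice(0, maxItems); // never exceed the per-run budget
  const results = [];
  for (let i = 0; i < capped.length; i += batchSize) {
    const batch = capped.slice(i, i + batchSize);
    // Items within a batch run concurrently; batches run one after another.
    results.push(...(await Promise.all(batch.map(handler))));
  }
  return results;
}

// Stand-in for an LLM call: 50 candidate posts, but only 20 get processed.
const candidatePosts = Array.from({ length: 50 }, (_, i) => i + 1);
processInBatches(candidatePosts, async (n) => n * 2).then((out) =>
  console.log(out.length) // 20
);
```

Capping the run size bounds the worst-case daily spend: with one cron run per day and `maxItems` of 20, the bot makes at most 20 handler calls per day.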

🧠 Week 16 Quiz — Final
Q1. Where should you store API keys in serverless functions?
Q2. The best reason to cache LLM responses in KV is...
Q3. Structured logging (JSON) vs plain text logging — why prefer JSON?
Q4. True or False: A fallback model should be faster/cheaper than the primary model.
Q5. True or False: You have now built a production AI system across 16 weeks. 🎉
🏆

Course Complete — All 16 Weeks

You've gone from tokens and embeddings to building and deploying real AI systems. You understand LLMs from the inside out, write prompts that actually work, and have shipped to production.

Share what you built. Write about what you learned. Teach someone else — that's when the knowledge really sticks.