Understand the engine behind ChatGPT, Claude, and every AI tool — from tokens to text generation.
A Large Language Model (LLM) is a neural network trained on massive amounts of text data. Its core job is deceptively simple: given a sequence of text, predict the most likely next word (token).
That's it. Every impressive thing you've seen — writing code, answering questions, reasoning about math — emerges from this single mechanism, scaled up enormously.
Think of your phone's autocomplete, but trained on trillions of words from books, code, websites, and conversations. It doesn't just predict the next word — it has learned patterns of logic, grammar, facts, and reasoning from all that data.
LLMs don't read words like humans do. They break text into tokens — chunks that might be a word, part of a word, or even a single character. This is called tokenization.
"hello" → [hello], "the" → [the] (simple words are kept whole)
"extraordinary" → [extra][ordinary], "tokenization" → [token][ization] (longer words are split into subword pieces)
"12345" → [123][45] (numbers are split into arbitrary chunks — this is why LLMs struggle with exact math!)
A rough rule of thumb: 1 token is about 4 characters or ¾ of a word in English.
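That rule of thumb is easy to turn into a quick budgeting helper. The sketch below is just the ~4-characters-per-token heuristic from above — a real tokenizer (e.g. OpenAI's tiktoken library) gives exact, model-specific counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    Only for back-of-envelope budgeting; real tokenizers give exact counts.
    """
    return max(1, round(len(text) / 4))

prompt = "Explain Python decorators in 3 sentences."
print(estimate_tokens(prompt))  # → 10 (the prompt is 41 characters)
```

Handy for sanity-checking whether a prompt will fit a context window before you pay for an API call.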
Every LLM has a token limit (context window). Claude 3.5 Sonnet has ~200K tokens. GPT-4o has ~128K tokens. Your prompt + the model's response must fit within this limit. Longer prompts = more cost, more latency. Token efficiency is a core skill.
The context window is the model's "working memory" — the total amount of text (in tokens) it can see at once. This includes your system prompt, conversation history, and the response being generated.
The model has no memory between conversations. Each API call starts fresh. Everything the model "knows" about your conversation must be inside the context window. This is why chat apps send the entire conversation history with every message.
When the model predicts the next token, it doesn't just pick the single best one. It calculates a probability distribution over all possible next tokens. Then it samples from that distribution. The temperature parameter controls how.
| Parameter | Value | Effect |
|---|---|---|
| temperature | 0.0 | Always picks the most probable token. Deterministic, focused, repetitive. |
| temperature | 0.7 | Balanced — creative but coherent. Good default for most tasks. |
| temperature | 1.0+ | Very random. Creative writing, brainstorming. Can be incoherent. |
| top_p | 0.0–1.0 | Only sample from the smallest set of tokens whose cumulative probability reaches P (nucleus sampling). Alternative to temperature. |
| max_tokens | integer | Maximum number of tokens the model will generate in its response. |
Given the prompt "The best programming language is", temperature changes the completion: at 0.0 the model returns the same most-probable continuation every time, while at 1.0+ it samples less likely tokens and the answer varies from run to run.
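You can sketch this mechanism in plain Python: divide the logits (raw next-token scores) by the temperature, then softmax them into probabilities. The token names and scores below are made up for illustration — a real model has one logit per vocabulary entry:

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature and softmax into probabilities."""
    if temperature == 0:  # greedy decoding: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Python", "JavaScript", "Rust", "COBOL"]
logits = [3.0, 2.0, 1.5, -1.0]  # hypothetical next-token scores

for t in (0.0, 0.7, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [f"{tok}: {p:.2f}" for tok, p in zip(tokens, probs)])
```

Dividing by a small temperature exaggerates the gaps between logits (the top token dominates); dividing by a large one flattens them (unlikely tokens get real chances of being sampled).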
Since you know Python, let's see what an actual LLM API call looks like. Here's how to call Claude using the Anthropic SDK:
```python
# Install: pip install anthropic
import anthropic

# Create client (uses ANTHROPIC_API_KEY env variable)
client = anthropic.Anthropic()

# Make an API call
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    temperature=0.7,
    system="You are a helpful coding tutor.",
    messages=[
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ]
)

print(message.content[0].text)

# Also useful:
print(f"Tokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
```
And here's the equivalent with OpenAI's API for comparison:
```python
# Install: pip install openai
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1024,
    messages=[
        {"role": "system", "content": "You are a helpful coding tutor."},
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ]
)

print(response.choices[0].message.content)
```
Both APIs share the same core structure: model (which LLM), messages (the conversation), and parameters (temperature, max_tokens). This pattern is universal across all LLM providers.
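As a sketch of that shared shape, you can factor the common fields into one plain dict. The function below is illustrative, not any provider's actual wire format (note that Anthropic's SDK takes `system` as a separate argument rather than a message):

```python
# The universal LLM request shape: model + messages + sampling parameters.
def build_request(model, system, user, temperature=0.7, max_tokens=1024):
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

req = build_request(
    "gpt-4o",
    "You are a helpful coding tutor.",
    "Explain Python decorators in 3 sentences.",
)
print(req["model"], len(req["messages"]))  # → gpt-4o 2
```

Once you recognize this shape, switching providers is mostly a matter of renaming fields.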
Before we move to prompt engineering techniques, internalize these mental models. They'll shape how you think about every prompt you write.
LLMs are stochastic — the same prompt can produce different outputs. This isn't a bug, it's a feature you can control with temperature.
The quality of the output is directly proportional to the quality of the input. Vague prompts → vague answers.
The model only knows what's in the context window. It has no memory, no files, no internet access (unless you give it tools).
LLMs are trained to complete text. If you set up a scenario (system prompt), it will play that role. This is the foundation of prompt engineering.
1. What does an LLM fundamentally do?
2. If you want a deterministic, focused output (e.g., JSON parsing), what temperature should you use?
3. Why do LLMs sometimes struggle with math like "What is 7,391 × 4,528?"
Here's what you've learned:
LLMs are next-token predictors trained on massive text data. They process text as tokens (~4 chars each) within a fixed context window. You control output randomness with temperature (0 = deterministic, 1+ = creative). The model has no memory between calls — everything must be in the context window.
Next up → Topic 2: Model Selection & Comparison
You'll learn how to evaluate and choose the right LLM for your task — from GPT to Claude to open-source models.