Understand the engine behind ChatGPT, Claude, and every AI tool — from tokens to text generation.
A Large Language Model (LLM) is a neural network trained on massive amounts of text data. Its core job is deceptively simple: given a sequence of text, predict the most likely next word (token).
That's it. Every impressive thing you've seen — writing code, answering questions, reasoning about math — emerges from this single mechanism, scaled up enormously.
Think of your phone's autocomplete, but trained on trillions of words from books, code, websites, and conversations. It doesn't just predict the next word — it has learned patterns of logic, grammar, facts, and reasoning from all that data.
LLMs don't read words like humans do. They break text into tokens — chunks that might be a word, part of a word, or even a single character. This is called tokenization.
"hello" → [hello], "the" → [the] (simple words are kept whole)
"extraordinary" → [extra][ordinary], "tokenization" → [token][ization] (longer words are split into subword pieces)
"12345" → [123][45] (numbers are split into arbitrary chunks — this is why LLMs struggle with exact math!)
A rough rule of thumb: 1 token is about 4 characters or ¾ of a word in English.
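That rule of thumb is easy to turn into a quick budgeting helper. The sketch below is just the ~4-characters-per-token heuristic from above — a real tokenizer (e.g. OpenAI's tiktoken library) gives exact, model-specific counts:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    Only for back-of-envelope budgeting; real tokenizers give exact counts.
    """
    return max(1, round(len(text) / 4))

prompt = "Explain Python decorators in 3 sentences."
print(estimate_tokens(prompt))  # → 10 (the prompt is 41 characters)
```

Handy for sanity-checking whether a prompt will fit a context window before you pay for an API call.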
Every LLM has a token limit (context window). Claude 3.5 Sonnet has ~200K tokens. GPT-4o has ~128K tokens. Your prompt + the model's response must fit within this limit. Longer prompts = more cost, more latency. Token efficiency is a core skill.
The context window is the model's "working memory" — the total amount of text (in tokens) it can see at once. This includes your system prompt, conversation history, and the response being generated.
The model has no memory between conversations. Each API call starts fresh. Everything the model "knows" about your conversation must be inside the context window. This is why chat apps send the entire conversation history with every message.
When the model predicts the next token, it doesn't just pick the single best one. It calculates a probability distribution over all possible next tokens. Then it samples from that distribution. The temperature parameter controls how.
| Parameter | Value | Effect |
|---|---|---|
| temperature | 0.0 | Always picks the most probable token. Deterministic, focused, repetitive. |
| temperature | 0.7 | Balanced — creative but coherent. Good default for most tasks. |
| temperature | 1.0+ | Very random. Creative writing, brainstorming. Can be incoherent. |
| top_p | 0.0–1.0 | Only sample from the smallest set of tokens whose cumulative probability reaches P (nucleus sampling). Alternative to temperature. |
| max_tokens | integer | Maximum number of tokens the model will generate in its response. |
Given the prompt "The best programming language is", temperature changes the completion: at 0.0 the model returns the same most-probable continuation every time, while at 1.0+ it samples less likely tokens and the answer varies from run to run.
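You can sketch this mechanism in plain Python: divide the logits (raw next-token scores) by the temperature, then softmax them into probabilities. The token names and scores below are made up for illustration — a real model has one logit per vocabulary entry:

```python
import math

def apply_temperature(logits, temperature):
    """Rescale logits by temperature and softmax into probabilities."""
    if temperature == 0:  # greedy decoding: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["Python", "JavaScript", "Rust", "COBOL"]
logits = [3.0, 2.0, 1.5, -1.0]  # hypothetical next-token scores

for t in (0.0, 0.7, 1.5):
    probs = apply_temperature(logits, t)
    print(t, [f"{tok}: {p:.2f}" for tok, p in zip(tokens, probs)])
```

Dividing by a small temperature exaggerates the gaps between logits (the top token dominates); dividing by a large one flattens them (unlikely tokens get real chances of being sampled).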
Since you know Python, let's see what an actual LLM API call looks like. Here's how to call Claude using the Anthropic SDK:
```python
# Install: pip install anthropic
import anthropic

# Create client (uses ANTHROPIC_API_KEY env variable)
client = anthropic.Anthropic()

# Make an API call
message = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    temperature=0.7,
    system="You are a helpful coding tutor.",
    messages=[
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ]
)

print(message.content[0].text)

# Also useful:
print(f"Tokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
```
And here's the equivalent with OpenAI's API for comparison:
```python
# Install: pip install openai
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.7,
    max_tokens=1024,
    messages=[
        {"role": "system", "content": "You are a helpful coding tutor."},
        {"role": "user", "content": "Explain Python decorators in 3 sentences."}
    ]
)

print(response.choices[0].message.content)
```
Both APIs share the same core structure: model (which LLM), messages (the conversation), and parameters (temperature, max_tokens). This pattern is universal across all LLM providers.
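As a sketch of that shared shape, you can factor the common fields into one plain dict. The function below is illustrative, not any provider's actual wire format (note that Anthropic's SDK takes `system` as a separate argument rather than a message):

```python
# The universal LLM request shape: model + messages + sampling parameters.
def build_request(model, system, user, temperature=0.7, max_tokens=1024):
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

req = build_request(
    "gpt-4o",
    "You are a helpful coding tutor.",
    "Explain Python decorators in 3 sentences.",
)
print(req["model"], len(req["messages"]))  # → gpt-4o 2
```

Once you recognize this shape, switching providers is mostly a matter of renaming fields.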
Before we move to prompt engineering techniques, internalize these mental models. They'll shape how you think about every prompt you write.
LLMs are stochastic — the same prompt can produce different outputs. This isn't a bug, it's a feature you can control with temperature.
The quality of the output is directly proportional to the quality of the input. Vague prompts → vague answers.
The model only knows what's in the context window. It has no memory, no files, no internet access (unless you give it tools).
LLMs are trained to complete text. If you set up a scenario (system prompt), it will play that role. This is the foundation of prompt engineering.
1. What does an LLM fundamentally do?
2. If you want a deterministic, focused output (e.g., JSON parsing), what temperature should you use?
3. Why do LLMs sometimes struggle with math like "What is 7,391 × 4,528?"
Here's what you've learned:
LLMs are next-token predictors trained on massive text data. They process text as tokens (~4 chars each) within a fixed context window. You control output randomness with temperature (0 = deterministic, 1+ = creative). The model has no memory between calls — everything must be in the context window.
Next up → Topic 2: Model Selection & Comparison
You'll learn how to evaluate and choose the right LLM for your task — from GPT to Claude to open-source models.