Live · in-browser LLM · WebGPU

A real model. Right here. Right now.

Five demos for software developers. Same model, five lenses on what's actually new.

Model

engine: WebLLM · WebGPU: checking… · no model loaded · VRAM: —

Tip: pick the smallest model on weak Wi-Fi (Qwen 0.5B downloads in ~30 s). Llama 3.2 3B gives the best quality in this size class but is ~1.8 GB on first load. Once loaded, the weights stay in the browser cache until that cache is cleared or evicted.

Where do model files come from? · 💾 local · 📦 cache · 🌐 internet

Loading order: 1) local folder ./models/<model-id>/ if present, 2) browser Cache API (already-downloaded weights), 3) Hugging Face CDN. The progress text above tells you which path the loader took.
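The lookup order can be sketched as a small fallback chain. This is an illustration only, not the page's actual loader: `resolveModelSource` and the two probe functions are hypothetical names, and the probes are injected so the sketch runs without a browser.

```javascript
// Hypothetical sketch of the three-step lookup described above.
// `probes` is an assumed object with two async checks supplied by the caller:
//   hasLocalFolder(modelId): true if ./models/<model-id>/ is being served
//   isCached(modelId):       true if the weights are already in the browser cache
async function resolveModelSource(modelId, probes) {
  // 1) A local folder wins, so the demo can run fully offline.
  if (await probes.hasLocalFolder(modelId)) {
    return { source: "local", icon: "💾", base: `./models/${modelId}/` };
  }
  // 2) Otherwise reuse weights downloaded on an earlier visit.
  if (await probes.isCached(modelId)) {
    return { source: "cache", icon: "📦", base: null };
  }
  // 3) Last resort: download from the internet.
  return { source: "internet", icon: "🌐", base: null };
}
```

In a browser, `hasLocalFolder` could be a `fetch` for a known file under ./models/<model-id>/, and `isCached` could query the Cache API; both are left to the caller here.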

To pre-download into the local folder (works fully offline, no browser cache dependency):

./download-models.sh                         # default: Llama-3.2-1B + Qwen3-0.6B
./download-models.sh all                     # all 8 models
./download-models.sh Qwen3-1.7B-q4f16_1-MLC  # specific id

# Then serve over http (file:// blocks fetch in Chrome):
python3 -m http.server 8000
# Open: http://localhost:8000/live-demo.html

Beat 03a · Show, don't tell

Streaming chat — first-token latency & tok/s

A real model, generating one token at a time, on this laptop. Watch first-token latency and tokens/sec. Both are real numbers, not animations.
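Both numbers can be measured from any stream of text chunks. The sketch below is a minimal illustration, not the page's own code: `measureStream` is a hypothetical helper, it counts one streamed chunk as one token (an approximation), and it computes throughput over the whole generation, including the wait for the first token.

```javascript
// Consumes an async iterable of text chunks and reports timing.
// Assumption: each non-empty chunk counts as one token, which is only
// approximately true for real streaming APIs.
async function measureStream(chunks, now = () => performance.now()) {
  const start = now();
  let firstTokenMs = null;
  let tokens = 0;
  let text = "";
  for await (const chunk of chunks) {
    if (!chunk) continue; // skip empty deltas
    if (firstTokenMs === null) firstTokenMs = now() - start; // first-token latency
    tokens += 1;
    text += chunk;
  }
  const totalMs = now() - start;
  // Throughput over the full run, in tokens per second.
  const tokPerSec = totalMs > 0 ? tokens / (totalMs / 1000) : 0;
  return { text, tokens, firstTokenMs, totalMs, tokPerSec };
}
```

With WebLLM, the chunks would typically come from a streaming `engine.chat.completions.create({ messages, stream: true })` call, reading `chunk.choices[0]?.delta?.content` from each item.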

first-token: — · throughput: — · tokens out: 0

Beat 03b · The cost knob

Reasoning effort — same prompt, three system prompts

Frontier labs ship a single reasoning_effort knob. Small open models don't have it natively — but you can fake it with system prompts. Same model, three settings. Watch tokens out and time go up; watch quality go up too.

Test prompt

Low effort: just answer

— tokens · — time · — tok/s

Medium effort: think step by step

— tokens · — time · — tok/s

High effort: plan, draft, critique, answer

— tokens · — time · — tok/s
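One way to emulate the knob is to keep the user prompt fixed and swap only the system message. The sketch below assumes the common chat-message shape (`role` / `content`); the prompt wording and the names `EFFORT_PROMPTS` and `buildMessages` are illustrative, not the text the demo actually sends.

```javascript
// Illustrative system prompts for the three effort levels (assumed wording).
const EFFORT_PROMPTS = new Map([
  ["low", "Just answer. One or two sentences, no explanation."],
  ["medium", "Think step by step, then give the answer."],
  ["high", "Plan your approach, write a draft, critique the draft, then give the final answer."],
]);

// Builds the message list for one run: same user prompt, different system prompt.
function buildMessages(effort, userPrompt) {
  const system = EFFORT_PROMPTS.get(effort);
  if (system === undefined) throw new Error(`unknown effort level: ${effort}`);
  return [
    { role: "system", content: system },
    { role: "user", content: userPrompt },
  ];
}
```

Running the same test prompt through all three message lists and timing each run gives the tokens, time, and tok/s comparison shown in the panels.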

Beat 03c · System prompts shape behaviour

Same model, same prompt — different role

A reminder to the audience: the model is one weights file. The behaviour they see in production is mostly the system prompt. Pick a persona and type a prompt: the same weights answer like a Senior SRE, a Skeptical Reviewer, or a 10-year-old.

Beat 03d · The agent loop, real

Tool calling — model emits JSON, browser executes, result fed back

This is the single most important architectural shift. The model decides: “I need to call calculator with these args.” The browser runs the JS. The result is fed back. The loop continues until the model returns a final answer. These presets are intentionally simple so 1–3B models can copy the JSON pattern reliably.
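The loop described above can be sketched in a few lines. Everything here is an assumption for illustration: the `{"tool": ..., "args": ...}` JSON shape, the `calculator` handler, and the `generate(messages)` function, which stands in for whatever call produces the model's reply. The page's real presets and handlers may differ.

```javascript
// Hypothetical tool registry: plain JS functions keyed by tool name.
const TOOLS = {
  calculator: ({ a, b, op }) => {
    switch (op) {
      case "add": return a + b;
      case "sub": return a - b;
      case "mul": return a * b;
      case "div": return a / b;
      default: throw new Error(`unknown op: ${op}`);
    }
  },
};

// Returns the parsed call if the reply is JSON with a string "tool" field,
// otherwise null (the reply is then treated as the final answer).
function parseToolCall(reply) {
  try {
    const parsed = JSON.parse(reply.trim());
    if (parsed && typeof parsed.tool === "string") return parsed;
  } catch {
    // Not JSON: fall through.
  }
  return null;
}

// `generate` is an assumed async function: messages in, reply text out.
async function runAgentLoop(generate, question, maxSteps = 5) {
  const messages = [{ role: "user", content: question }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await generate(messages);
    messages.push({ role: "assistant", content: reply });
    const call = parseToolCall(reply);
    if (!call) return { answer: reply, messages }; // final answer, stop
    let result;
    try {
      const known = Object.prototype.hasOwnProperty.call(TOOLS, call.tool);
      if (!known) throw new Error(`unknown tool: ${call.tool}`);
      result = TOOLS[call.tool](call.args ?? {}); // the browser runs the JS
    } catch (err) {
      result = `error: ${err.message}`; // let the model see the failure
    }
    // Feed the result back so the next turn can use it.
    messages.push({ role: "user", content: `Tool result: ${JSON.stringify(result)}` });
  }
  throw new Error("agent loop exceeded maxSteps");
}
```

The `maxSteps` cap is a design choice worth keeping in any version of this loop: a small model that keeps emitting tool calls would otherwise never terminate.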

Available tools (handlers below)

Pick a preset or type a question to start the loop.

Beat 03e · Agents are role-played by system prompts

Multi-agent orchestration — Planner → Researcher → Critic

One model. Three roles. Sequential calls. Each agent's output is the next agent's input. Claude Code, Devin, and Cursor Composer follow the same basic pattern, minus a thousand engineering details. The architecture is simple; that is the point.

Planner: break the problem into 3 questions

Researcher: answer each question concretely

Critic: find the weakest claim & one missing risk

Final answer: Planner re-merges with Critic's notes
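The four steps above amount to four sequential calls with different system prompts. A minimal sketch, assuming a `generate(messages)` function that returns the model's reply as text; the role prompts and the names `ROLES` and `runPipeline` are illustrative, not the demo's exact wording.

```javascript
// Illustrative system prompts, one per role (assumed wording).
const ROLES = {
  planner: "You are the Planner. Break the problem into 3 questions.",
  researcher: "You are the Researcher. Answer each question concretely.",
  critic: "You are the Critic. Find the weakest claim and one missing risk.",
  merger: "You are the Planner. Merge the research with the Critic's notes into a final answer.",
};

// `generate` is an assumed async function: messages in, reply text out.
async function runPipeline(generate, problem) {
  const ask = (system, content) =>
    generate([
      { role: "system", content: system },
      { role: "user", content },
    ]);

  // Each agent's output becomes the next agent's input.
  const plan = await ask(ROLES.planner, problem);
  const research = await ask(ROLES.researcher, plan);
  const critique = await ask(ROLES.critic, research);
  const final = await ask(
    ROLES.merger,
    `Research:\n${research}\n\nCritic's notes:\n${critique}`
  );
  return { plan, research, critique, final };
}
```

The same model serves every role; only the system message changes between calls.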

Beat 03f · Your own eval, your own data, your own answer

Evals — score multiple models on your own test set

The closing line of the talk says it: vendor benchmarks are marketing. Your eval is leverage. This tab lets you pick 2-3 models, pick a test set, run all questions sequentially against each model, score with a deterministic check, and compare. Same browser. No API. No spreadsheet. Click Run eval.
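The core of such an eval is a nested loop with a deterministic check per question. The sketch below is an assumption-laden illustration: each model is represented as an object with an `id` and an async `generate(question)` function, and each test item carries its own `check(answer)` predicate. None of these names come from the page.

```javascript
// Runs every question against every model, one model at a time,
// and scores each answer with the item's deterministic check.
async function runEval(models, testSet) {
  const scores = [];
  for (const model of models) {
    let passed = 0;
    for (const item of testSet) {
      const answer = await model.generate(item.question);
      if (item.check(answer)) passed += 1;
    }
    scores.push({ model: model.id, passed, total: testSet.length });
  }
  // Best score first, for a simple leaderboard.
  return scores.sort((a, b) => b.passed - a.passed);
}

// Example test set: checks are plain predicates, so results are repeatable
// for a given answer.
const exampleTestSet = [
  { question: "What is 2 + 2? Reply with the number only.", check: (a) => /\b4\b/.test(a) },
  { question: "What is the capital of France?", check: (a) => /paris/i.test(a) },
];
```

Because the check is deterministic, rerunning the eval only changes the scores if the model's answers change, which keeps comparisons between models and system prompts easy to reason about.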

Models to evaluate

Each selected model loads sequentially. Pre-cache at least one before the talk.

Prompt context (system prompt)

This is the context engineering axis. Pick 2-3 system prompts to show that prompting can move scores as much as model choice.

Test set

—

Pick models + a test set, then hit Run. Per-question results stream in here.