Live · in-browser LLM · WebGPU

A real model. Right here. Right now.

Five demos for software developers. Same model, five lenses on what's actually new.

engine: WebLLM · WebGPU: checking… · no model loaded · VRAM: -
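Tip: pick the smallest model on weak Wi-Fi (Qwen 0.5B downloads in ~30 s). Llama 3.2 3B gives the best quality in this size range but is ~1.8 GB on first load. Once loaded, the weights stay in the browser cache, so later visits skip the download unless the browser evicts them or you clear site data.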
Where do model files come from? · 💾 local · 📦 cache · 🌐 internet

Loading order: 1) local folder ./models/<model-id>/ if present, 2) browser Cache API (already-downloaded weights), 3) HuggingFace CDN. The progress text above tells you which path the loader took.
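The lookup order above can be sketched as a small pure function. This is a hypothetical illustration, not the page's actual loader: the names `Availability` and `pickSource` are made up here, and a real loader would have to probe the folder and the cache asynchronously before calling it.

```typescript
// Hypothetical sketch of the three-step lookup; not the page's real loader code.
type Source = "local" | "cache" | "internet";

// What a loader would have learned after probing each location for one model id.
interface Availability {
  localFolder: boolean;  // ./models/<model-id>/ is being served
  browserCache: boolean; // weights were already stored by an earlier download
}

function pickSource(found: Availability): Source {
  if (found.localFolder) return "local";   // 1) local folder wins
  if (found.browserCache) return "cache";  // 2) then the browser Cache API
  return "internet";                       // 3) otherwise fall back to the CDN
}
```

With both a local folder and cached weights present, `pickSource` returns `"local"`, which is why the pre-downloaded folder works without any browser cache dependency.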

To pre-download into the local folder (works fully offline, no browser cache dependency):

./download-models.sh                         # default: Llama-3.2-1B + Qwen3-0.6B
./download-models.sh all                     # all 8 models
./download-models.sh Qwen3-1.7B-q4f16_1-MLC  # specific id

# Then serve over http (file:// blocks fetch in Chrome):
python3 -m http.server 8000
# Open: http://localhost:8000/live-demo.html
Beat 03a · Show, don't tell

Streaming chat: first-token latency & tok/s

A real model, generating one token at a time, on this laptop. Watch first-token latency and tokens/sec. Both are real numbers, not animations.
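A rough sketch of how the two numbers can be derived from token arrival times. The definitions are assumptions, since the page does not spell them out: first-token latency is taken as the gap between sending the prompt and the first token, and throughput as tokens after the first divided by the time spent decoding them. `streamMetrics` is a hypothetical name.

```typescript
// Hypothetical helper; the metric definitions are assumptions, not the page's code.
interface StreamMetrics {
  firstTokenMs: number; // time from sending the prompt to the first token
  tokensPerSec: number; // decode throughput, excluding the wait for the first token
  tokensOut: number;    // total tokens received
}

// requestStartMs: timestamp when the prompt was sent.
// tokenTimesMs: arrival timestamp of each streamed token, same clock.
function streamMetrics(requestStartMs: number, tokenTimesMs: number[]): StreamMetrics | null {
  if (tokenTimesMs.length === 0) return null; // nothing streamed yet
  const first = Math.min(...tokenTimesMs);
  const last = Math.max(...tokenTimesMs);
  const decodeSec = (last - first) / 1000;
  const tokensPerSec = decodeSec > 0 ? (tokenTimesMs.length - 1) / decodeSec : 0;
  return { firstTokenMs: first - requestStartMs, tokensPerSec, tokensOut: tokenTimesMs.length };
}
```

For a prompt sent at 1000 ms with tokens arriving at 1200, 1300, 1400, 1500 and 1700 ms, this gives a first-token latency of 200 ms and 8 tok/s over 5 tokens.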

first-token: - · throughput: - · tokens out: 0
Beat 03b · The cost knob

Reasoning effort: same prompt, three system prompts

Frontier labs ship a single reasoning_effort knob. Small open models don't have it natively, but you can fake it with system prompts. Same model, three settings. Watch tokens out and time go up; quality usually goes up too.
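A minimal sketch of the trick, assuming the role/content message shape used by OpenAI-style chat APIs. The prompt wording, `EFFORT_PROMPTS` and `buildMessages` are illustrative guesses based on the three panel labels, not the page's real prompts.

```typescript
// Hypothetical sketch: one user prompt, three system prompts standing in for an effort knob.
type Effort = "low" | "medium" | "high";

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Illustrative wording only; the page's actual system prompts may differ.
const EFFORT_PROMPTS: Record<Effort, string> = {
  low: "Just answer. Reply in one or two sentences with no explanation.",
  medium: "Think step by step, then give the answer.",
  high: "Plan, write a draft, critique the draft, then give the final answer.",
};

function buildMessages(effort: Effort, userPrompt: string): ChatMessage[] {
  return [
    { role: "system", content: EFFORT_PROMPTS[effort] }, // the only part that changes
    { role: "user", content: userPrompt },               // identical across all three runs
  ];
}
```

Because only the system message differs, any change in token count, time, or answer quality between the three runs can be attributed to the effort setting rather than to the prompt.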

Test prompt

        
Low effort: just answer
tokens: - · time: - · tok/s: -

Medium effort: think step by step
tokens: - · time: - · tok/s: -

High effort: plan, draft, critique, answer
tokens: - · time: - · tok/s: -

          
Beat 03c · System prompts shape behaviour

Same model, same prompt, different role

A reminder to the audience: the model is one weights file. The behaviour they see in production is mostly the system prompt. Pick a persona and type a prompt: the same weights answer like a Senior SRE, a Skeptical Reviewer, or a 10-year-old.


        
Beat 03d · The agent loop, real

Tool calling: model emits JSON, browser executes, result fed back

This is the single most important architectural shift. The model decides: "I need to call calculator with these args." The browser runs the JS. The result is fed back. The loop continues until the model returns a final answer. These presets are intentionally simple so 1–3B models can copy the JSON pattern reliably.
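The loop described above can be sketched as follows. Everything here is an assumption made for illustration: the `{"tool": ..., "args": ...}` JSON shape, the `calculator` arguments, the step limit, and the `Generate` callback that stands in for the model. The page's real handlers and prompt format may differ.

```typescript
// Hypothetical sketch of the loop; the JSON shape and handler table are assumptions.
interface ToolCall {
  tool: string;
  args: { a: number; b: number; op: string };
}

// Stand-in for the model: takes the transcript so far, returns its next reply.
type Generate = (transcript: string[]) => string;

// Only "calculator" is named in the text; its argument names are made up here.
const handlers: Record<string, (args: ToolCall["args"]) => string> = {
  calculator: ({ a, b, op }) => String(op === "mul" ? a * b : a + b),
};

// Returns the parsed call, or null when the reply is plain text (a final answer).
function parseToolCall(reply: string): ToolCall | null {
  try {
    const parsed = JSON.parse(reply);
    const looksLikeCall = typeof parsed === "object" && parsed !== null && typeof parsed.tool === "string";
    return looksLikeCall ? (parsed as ToolCall) : null;
  } catch {
    return null; // not JSON at all
  }
}

function runAgentLoop(generate: Generate, question: string, maxSteps = 5): string {
  const transcript = [`user: ${question}`];
  for (let step = 0; step < maxSteps; step++) {
    const reply = generate(transcript);
    const call = parseToolCall(reply);
    if (call === null) return reply; // no tool requested: this is the final answer
    const handler = handlers[call.tool];
    const result = handler ? handler(call.args) : `error: unknown tool ${call.tool}`;
    transcript.push(`assistant: ${reply}`, `tool: ${result}`); // feed the result back
  }
  return "error: step limit reached";
}
```

The step limit is a design choice worth keeping in any version of this loop: a small model that keeps emitting tool calls would otherwise never terminate.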

Available tools (handlers below)
Pick a preset or type a question to start the loop.
Beat 03e · Agents are role-played by system prompts

Multi-agent orchestration: Planner → Researcher → Critic

One model. Three roles. Sequential calls. Each agent's output is the next agent's input. This is the same basic shape as Claude Code, Devin, and Cursor Composer, minus a thousand engineering details. The architecture is simple; that is the point.
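As a sketch, the whole pipeline is four sequential calls to the same model with different system prompts. The role prompts below paraphrase the panel labels on this page; `CallModel`, `ROLES` and `runPipeline` are hypothetical names, and the way the final step combines research and critique is an assumption.

```typescript
// Hypothetical sketch: one model, called four times with different system prompts.
type CallModel = (systemPrompt: string, input: string) => string;

// Illustrative wording paraphrasing the panel labels; not the page's real prompts.
const ROLES = {
  planner: "Break the problem into 3 questions.",
  researcher: "Answer each question concretely.",
  critic: "Find the weakest claim and one missing risk.",
  merger: "As the Planner, merge the research with the Critic's notes into a final answer.",
};

function runPipeline(callModel: CallModel, problem: string): string {
  const plan = callModel(ROLES.planner, problem);     // Planner sees the raw problem
  const research = callModel(ROLES.researcher, plan); // Researcher sees the plan
  const critique = callModel(ROLES.critic, research); // Critic sees the research
  // Final step: the Planner role re-merges the research with the critique.
  return callModel(ROLES.merger, `Research:\n${research}\n\nCritique:\n${critique}`);
}
```

Nothing here is agent-specific machinery: each "agent" is just a system prompt, and the orchestration is plain function composition.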

Planner: break the problem into 3 questions

Researcher: answer each question concretely

Critic: find the weakest claim & one missing risk

Final answer: Planner re-merges with Critic's notes

          
Beat 03f · Your own eval, your own data, your own answer

Evals: score multiple models on your own test set

The closing line of the talk says it: vendor benchmarks are marketing. Your eval is leverage. This tab lets you pick 2-3 models, pick a test set, run all questions sequentially against each model, score with a deterministic check, and compare. Same browser. No API. No spreadsheet. Click Run eval.
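A minimal sketch of such an eval run. The deterministic check used here, a case-insensitive substring match against an expected answer, is an assumption; the page may score differently. All names (`EvalCase`, `Answerer`, `passes`, `scoreModels`) are hypothetical, and `Answerer` stands in for a loaded model.

```typescript
// Hypothetical sketch of a sequential eval; the scoring rule is an assumption.
interface EvalCase {
  question: string;
  expected: string; // text the answer must contain to pass
}

// Stand-in for a loaded model: question in, answer text out.
type Answerer = (question: string) => string;

interface ModelUnderTest {
  name: string;
  answer: Answerer;
}

interface Score {
  name: string;
  passed: number;
  total: number;
}

// Deterministic check: same answer, same verdict, no judge model involved.
function passes(answer: string, expected: string): boolean {
  return answer.toLowerCase().indexOf(expected.toLowerCase()) !== -1;
}

// Runs every question against each model in turn and counts passes.
function scoreModels(models: ModelUnderTest[], testSet: EvalCase[]): Score[] {
  return models.map((m) => ({
    name: m.name,
    passed: testSet.filter((c) => passes(m.answer(c.question), c.expected)).length,
    total: testSet.length,
  }));
}
```

A substring check is crude, but it is reproducible, which is what makes scores comparable across models and across system prompts.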

Models to evaluate

Each selected model loads sequentially. Pre-cache at least one before the talk.

Prompt context (system prompt)

This is the context engineering axis. Pick 2-3 to prove that prompting can move scores as much as model choice.

Test set

Pick models + a test set, then hit Run. Per-question results stream in here.