AI inference is the process of running a trained AI model to generate outputs from new inputs. When you ask an AI a question and it responds, that is inference. It is distinct from training — the process of teaching the model, which has already happened before you interact with it.
Inference is computationally intensive but far lighter than training. Running a 70B-parameter LLM requires roughly 40-80GB of GPU memory, depending on precision. Leading inference providers (OpenAI, Anthropic, Groq) optimise inference throughput through specialised hardware (H100 GPUs, custom TPUs), quantisation (reducing model precision), and batching (processing multiple requests simultaneously).
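The memory figure comes down to bytes per parameter. A back-of-envelope sketch (the function name and the 1.2 overhead factor for activations and KV cache are illustrative assumptions, not measured values):

```python
def inference_memory_gb(params_billion: float, bytes_per_param: float,
                        overhead: float = 1.2) -> float:
    """Rough GPU memory needed to serve a model of the given size.

    overhead is a stand-in for activations and KV cache; real values
    vary with batch size and context length.
    """
    return params_billion * 1e9 * bytes_per_param * overhead / 1e9

# A 70B model at different precisions:
print(round(inference_memory_gb(70, 2.0)))  # 168 — fp16 (2 bytes/param)
print(round(inference_memory_gb(70, 0.5)))  # 42  — int4 (0.5 bytes/param)
```

The spread between the fp16 and int4 estimates is why the quoted 40-80GB range only holds once precision is reduced below full fp16.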
Every response Lucy generates is an inference run. Lucy OS1 uses OpenAI's inference infrastructure for GPT-4o-mini — one of the fastest and most reliable inference providers. Response tokens begin streaming within 100-200ms of query receipt.
Inference optimisation balances two metrics: latency (how fast does one request complete?) and throughput (how many requests per second can the system handle?). They are often in tension.
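The tension can be sketched with a toy cost model: assume each decode step pays a fixed cost plus a small per-request cost. The numbers here are invented for illustration, not benchmarks:

```python
def batch_metrics(batch_size: int,
                  step_ms_base: float = 30.0,
                  step_ms_per_req: float = 2.0):
    """Toy model of one decode step under batching.

    Bigger batches make each step slower (worse per-request latency)
    but amortise the fixed cost over more requests (better throughput).
    """
    step_ms = step_ms_base + step_ms_per_req * batch_size
    latency_ms = step_ms                         # each request waits for the whole step
    tokens_per_sec = batch_size * 1000.0 / step_ms  # aggregate across the batch
    return latency_ms, tokens_per_sec

for b in (1, 8, 32):
    print(b, batch_metrics(b))  # latency climbs, throughput climbs faster
```

In this toy model, batching 32 requests roughly triples per-token latency but yields about ten times the aggregate throughput, which is why serving stacks batch aggressively up to a latency budget.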
Quantisation: reducing model precision from float32 to int8 or int4, shrinking the model and accelerating inference with minimal quality loss. This makes large models deployable on consumer hardware.
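The core idea is small enough to show in code: store each weight as a low-precision integer plus one shared floating-point scale. A minimal pure-Python sketch of symmetric per-tensor int8 quantisation (toy weights, everything here is illustrative):

```python
def quantize_int8(weights):
    """Symmetric int8 quantisation: ints in [-127, 127] plus one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.8, -0.31, 0.05, -1.2]   # toy fp32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)  # [85, -33, 5, -127] — 1 byte each vs 4 bytes for fp32
# Rounding error per weight is bounded by half the scale:
print(max(abs(w - r) for w, r in zip(weights, restored)) <= scale / 2)  # True
```

Real quantisation schemes (per-channel scales, int4 grouping, activation-aware methods) are more involved, but the 4x storage reduction and bounded rounding error shown here are the essence of why quality loss stays small.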
Speculative decoding: using a small, fast model to predict multiple future tokens that a larger model then verifies in a single pass. This achieves large-model quality at small-model speed for many outputs.
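The control flow can be sketched with stand-in "models" as plain Python functions. Real systems accept draft tokens probabilistically against the target's distribution; this greedy version only illustrates the propose-verify loop:

```python
def speculative_decode(draft_next, target_next, k=4, max_tokens=6):
    """Toy greedy speculative decoding.

    The cheap draft model proposes k tokens; the expensive target model
    verifies them and keeps the longest matching prefix, then contributes
    one guaranteed-correct token per round.
    """
    out = []
    while len(out) < max_tokens:
        draft = []
        for _ in range(k):                       # small model speculates ahead
            draft.append(draft_next(out + draft))
        accepted = []
        for i, tok in enumerate(draft):          # target checks each position
            if target_next(out + draft[:i]) == tok:
                accepted.append(tok)
            else:
                break                            # first mismatch: discard the rest
        out += accepted
        out.append(target_next(out))             # target adds one correct token
    return out[:max_tokens]

# Stand-in "models": the target counts upward; the draft agrees except at step 3.
target = lambda seq: len(seq)
draft = lambda seq: 99 if len(seq) == 3 else len(seq)
print(speculative_decode(draft, target))  # [0, 1, 2, 3, 4, 5] — target-quality output
```

Note the output is exactly what the target model alone would have produced; the draft model only changes how many target evaluations are needed, not the result.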
Hardware acceleration: LLMs run far faster on GPUs (or TPUs) than on CPUs. On-device inference (phone, laptop) uses quantised models on NPUs for acceptable performance without a cloud dependency.
What is the difference between training and inference?
Training: teaching the model from data. Inference: using the trained model to make predictions. Training happens once (or periodically). Inference happens billions of times daily.
Why is inference expensive?
Large models (GPT-4, Claude) require expensive H100 GPUs to run. At scale, inference infrastructure is the primary cost of operating an AI product.
Can AI inference run locally on my device?
Yes, for smaller quantised models. Llama 3.1 8B runs on a modern M-series MacBook. For large models like GPT-4o, cloud inference is required — local hardware cannot run them at acceptable speed.
Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.
Start talking to Lucy →