Voice AI Glossary · 2026

What Is AI Inference?

AI inference is the process of running a trained AI model to generate outputs from new inputs. When you ask an AI a question and it responds, that is inference. It is distinct from training — the process of teaching the model, which has already happened before you interact with it.

Try Lucy OS1 →

Definition in Full

Inference is computationally intensive but far lighter than training: running a 70B-parameter LLM typically requires 40-80GB of GPU memory. Leading providers (OpenAI, Anthropic, Groq) optimise throughput through specialised hardware (H100 GPUs, custom TPUs), quantisation (reducing numerical precision), and batching (processing multiple requests simultaneously).
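The memory figures above follow directly from parameter count and numerical precision. A minimal back-of-envelope sketch (weights only — real deployments also need memory for the KV cache, activations, and framework overhead, so treat these as lower bounds):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

PARAMS_70B = 70e9

# Quantisation shrinks the footprint roughly in proportion to precision:
for name, bytes_pp in [("float16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS_70B, bytes_pp):.0f} GB")
```

At float16 a 70B model needs roughly 140GB for weights alone; int8 and int4 quantisation bring that into the 35-70GB range quoted above.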

How Lucy OS1 Uses AI Inference

Every response Lucy generates is an inference run. Lucy OS1 uses OpenAI's inference infrastructure for GPT-4o-mini — one of the fastest and most reliable inference providers. Response tokens begin streaming within 100-200ms of query receipt.
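Time to first token is the metric behind that 100-200ms figure. A hedged sketch of how you might measure it — the stream here is a stand-in generator, not Lucy's or OpenAI's actual API:

```python
import time

def time_to_first_token(stream):
    """Return the first streamed token and the latency (ms) until it arrived."""
    start = time.perf_counter()
    first = next(iter(stream))
    return first, (time.perf_counter() - start) * 1000.0

# Hypothetical stand-in for a real streaming response (e.g. an SSE token stream):
def fake_stream():
    time.sleep(0.05)   # pretend the provider takes ~50 ms to start generating
    yield "Hello"
    yield ", world"

token, ttft_ms = time_to_first_token(fake_stream())
print(f"first token {token!r} after {ttft_ms:.0f} ms")
```

Streaming matters because the user perceives the response as starting at the first token, not at the end of the full generation.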


Key Concepts

Latency vs throughput

Inference optimisation balances two metrics: latency (how fast does one request complete?) and throughput (how many requests per second can the system handle?). They are often in tension.

Quantisation

Reducing model precision from float32 to int8 or int4, shrinking model size and accelerating inference with minimal quality loss. Makes large models deployable on consumer hardware.
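A minimal sketch of the core idea, using per-tensor symmetric int8 quantisation (production schemes are more sophisticated — per-channel scales, calibration, int4 grouping — but the principle is the same):

```python
def quantize_int8(weights):
    """Map floats into [-127, 127] integers plus one shared scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.12, -0.50, 0.33, 0.01]
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each value now fits in 1 byte instead of 4, at a small precision cost.
```

The reconstructed values differ from the originals only in the low-order digits — the "minimal quality loss" the definition refers to.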

Speculative decoding

Using a small fast model to predict multiple future tokens that a larger model then verifies. Achieves large-model quality at small-model speed for many outputs.
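A toy illustration of the accept/reject loop (both "models" here are hypothetical stubs; a real implementation verifies the whole proposal in a single batched forward pass of the target model):

```python
def draft_propose(context, k):
    # Cheap draft model: guesses the next k tokens (here, a canned continuation).
    canned = ["the", "quick", "brown", "cat"]
    return canned[:k]

def target_next_token(context):
    # Expensive target model: the ground truth for one step (a lookup stub).
    truth = {(): "the", ("the",): "quick", ("the", "quick"): "brown",
             ("the", "quick", "brown"): "fox"}
    return truth.get(tuple(context))

def speculative_step(context, k=4):
    """Accept the draft's tokens while they match the target; on the first
    mismatch, fall back to the target's own token and stop."""
    proposal = draft_propose(context, k)
    accepted = []
    for tok in proposal:
        expected = target_next_token(context + accepted)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
    return accepted

print(speculative_step([]))
```

Here the draft agrees on three tokens before diverging, so four tokens are emitted for what would otherwise take four separate target-model decoding steps.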

GPU vs CPU inference

LLMs run far faster on GPUs (or TPUs) than CPUs. On-device inference (phone, laptop) uses quantised models on NPUs for acceptable performance without cloud dependence.

Frequently Asked Questions

What is the difference between training and inference?

Training: teaching the model from data. Inference: using the trained model to make predictions. Training happens once (or periodically). Inference happens billions of times daily.
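The split is visible even in the simplest possible "model" — a line fit to data. Training is the one-off expensive step that produces frozen weights; inference is the cheap step repeated for every new input:

```python
def train(xs, ys):
    """One-off: learn slope and intercept by closed-form least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # the trained "weights"

def infer(weights, x):
    """Cheap and repeated: apply the frozen weights to a new input."""
    slope, intercept = weights
    return slope * x + intercept

weights = train([1, 2, 3, 4], [2, 4, 6, 8])   # training: done once
print(infer(weights, 10))                     # inference: done on demand
```

An LLM is the same pattern at vastly larger scale: months of training produce the weights, and every chat message is one more call to `infer`.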

Why is inference expensive?

Large models (GPT-4, Claude) require expensive H100 GPUs to run. At scale, inference infrastructure is the primary cost of operating an AI product.
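A back-of-envelope sketch of why serving costs add up. Both constants are assumptions for illustration — not actual GPU rental prices or measured throughput for any particular model:

```python
GPU_COST_PER_HOUR = 4.00     # assumed hourly rate for one H100-class GPU
TOKENS_PER_SECOND = 1500.0   # assumed aggregate throughput (batched serving)

tokens_per_hour = TOKENS_PER_SECOND * 3600
cost_per_million_tokens = GPU_COST_PER_HOUR / tokens_per_hour * 1e6
print(f"~${cost_per_million_tokens:.2f} per million output tokens")
```

Under these assumptions a single GPU produces tokens at well under a dollar per million — but multiply by billions of daily requests and the fleet of GPUs needed to serve them, and inference dominates operating cost.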

Can AI inference run locally on my device?

Yes, for smaller quantised models. Llama 3.1 8B runs on a modern MacBook M-series. For large models like GPT-4o, cloud inference is required — local hardware cannot run them at acceptable speed.

Related Terms

Large Language Model (LLM) · AI Latency · Real-Time AI · Token

Experience AI Inference in Action

Lucy OS1 puts these concepts to work in a real, streaming voice AI pipeline — Deepgram STT, GPT-4o-mini, and Cartesia TTS delivering natural voice conversation.

Start talking to Lucy →
