← Back to glossary
+Suggest a term
Tool·Infrastructure·Added 1 month ago

Groq

Also known as: Groq AI, GroqCloud, LPU

A hardware and API company that runs open-source models on custom chips called LPUs (Language Processing Units), designed specifically for fast sequential inference. Builders use Groq primarily for its exceptionally low latency responses.

Groq is notable in the inference provider market for one specific reason: speed. Where standard GPU-based servers might take 800ms to 2 seconds before returning a first token, Groq routinely delivers responses in under 200ms. The company built custom silicon called LPUs (Language Processing Units) that eliminate many of the memory bottlenecks that make GPU inference slower for sequential generation tasks like text output.

Through the GroqCloud API, builders can run popular open-weight models (Llama, Qwen, Mistral, and others) with an OpenAI-compatible interface — meaning you can often swap Groq in by changing one line of code. It is now also available as an inference provider directly through the Hugging Face Hub.

Groq is not the right choice for every use case. It runs open-weight models, not proprietary ones, and availability of specific model versions can vary. But for applications where TTFT matters — voice agents, real-time coding assistants, low-latency chatbots — it is one of the most-referenced options in builder discussions. The tradeoff is choosing it for speed over the broader model selection and fine-tuning options you get with GPU clouds.

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
Related terms
InferenceTTFTModel servingGPU cloudOpen weights