Concept·Agents & Automation·Added 1 month ago

Guardrail

Also known as: guardrails, AI guardrail, safety guardrail, output guardrail, input guardrail

Any constraint or check you put on a model's inputs or outputs to prevent undesirable behavior. A guardrail could be a filter that blocks certain prompt types, a validator that rejects off-format outputs, or a policy layer that keeps the model on topic. In effect, a guardrail is you saying 'not that,' even when the model would otherwise go there.

Guardrails are the product safety layer around AI models. The model itself comes with safety behavior built in through training (for example, Anthropic's Constitutional AI or OpenAI's RLHF-based alignment work), but that is rarely enough on its own for production use. Guardrails are the additional checks you build: input filters that catch harmful or off-topic requests before they reach the model, output validators that verify responses meet your format and content requirements, and policy layers that restrict what the model is allowed to say in your specific context.
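The input-filter and output-validator layers described above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not a reference implementation: the names `check_input`, `check_output`, `guarded_call`, the blocked-term list, and the length limit are all hypothetical, and `model` stands in for whatever function actually calls your model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardResult:
    allowed: bool
    reason: str = ""


# Hypothetical block list; a real deployment would likely use a classifier.
BLOCKED_INPUT_TERMS = ("ignore previous instructions", "system prompt")
FALLBACK = "Sorry, I can't help with that request."


def check_input(prompt: str) -> GuardResult:
    """Input guardrail: reject prompts containing a blocked phrase."""
    lowered = prompt.lower()
    for term in BLOCKED_INPUT_TERMS:
        if term in lowered:
            return GuardResult(False, f"blocked input term: {term!r}")
    return GuardResult(True)


def check_output(text: str, max_chars: int = 500) -> GuardResult:
    """Output guardrail: reject empty or overly long responses."""
    if not text.strip():
        return GuardResult(False, "empty response")
    if len(text) > max_chars:
        return GuardResult(False, "response too long")
    return GuardResult(True)


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Run the input check, call the model, then run the output check."""
    if not check_input(prompt).allowed:
        return FALLBACK  # the model is never called
    response = model(prompt)
    if not check_output(response).allowed:
        return FALLBACK
    return response
```

Passing the model in as a plain callable keeps the guardrail logic independent of any particular provider, so the same checks can wrap a stub in tests and a real client in production.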

Common guardrail implementations include: topic filters (the customer service bot only talks about our products), format validators (the output must be valid JSON), PII (personally identifiable information) detectors that strip or flag sensitive data, and toxicity filters that block harmful content. Tools such as NVIDIA's NeMo Guardrails, along with built-in features in many AI gateways, make this layer easier to implement.
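Three of the implementations listed above (a topic filter, a JSON format validator, and a PII redactor) can be approximated with the standard library alone. The sketch below is deliberately simplistic and rests on assumptions: the function names are hypothetical, the regular expressions only catch email addresses and one common US phone format, and keyword matching is a crude stand-in for a real topic classifier.

```python
import json
import re

# Simplified patterns; production PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text):
    """PII guardrail: replace emails and phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return US_PHONE_RE.sub("[PHONE]", text)


def validate_json_output(raw, required_keys):
    """Format guardrail: return the parsed object, or None if it is invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if not required_keys.issubset(data):
        return None
    return data


def is_on_topic(prompt, allowed_keywords):
    """Topic guardrail: require at least one allowed keyword in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    return bool(words & allowed_keywords)
```

For example, `redact_pii("Mail jane.doe@example.com")` yields `"Mail [EMAIL]"`, and `validate_json_output("not json", {"answer"})` returns `None`, which the caller can treat as a signal to retry or fall back.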

For agentic systems, guardrails become more critical because agents take actions rather than just producing text. A guardrail on an agent might prevent it from writing to a database without human approval, from making API calls above a cost threshold, or from accessing files outside a specified directory. Both the 12-factor agents principles and OWASP's guidance on excessive agency (a risk category in its Top 10 for LLM Applications) emphasize guardrails as a core part of responsible agent design.
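The three agent guardrails mentioned above (directory confinement, a cost threshold, and human approval for writes) might look roughly like this. Everything here is a hypothetical sketch: `GuardrailViolation`, `ensure_within`, `CostBudget`, and `guarded_db_write` are illustrative names, and the `approve` and `execute` callables stand in for your own review step and database layer. `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


class GuardrailViolation(Exception):
    """Raised when an agent action breaks a guardrail."""


def ensure_within(base_dir, requested):
    """File guardrail: resolve the path and refuse anything outside base_dir."""
    base = Path(base_dir).resolve()
    target = (base / requested).resolve()  # collapses any '..' segments
    if not target.is_relative_to(base):
        raise GuardrailViolation(f"path escapes sandbox: {requested}")
    return target


class CostBudget:
    """Cost guardrail: refuse calls that would push spend past a limit."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, estimated_usd):
        if self.spent_usd + estimated_usd > self.limit_usd:
            raise GuardrailViolation("cost threshold exceeded")
        self.spent_usd += estimated_usd


def guarded_db_write(statement, approve, execute):
    """Approval guardrail: run the write only if the reviewer says yes."""
    if not approve(statement):
        raise GuardrailViolation("write rejected by human reviewer")
    return execute(statement)
```

Raising an exception rather than silently skipping the action lets the agent loop surface the refusal, log it, and decide whether to ask a human or try a different plan.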

This definition is AI-generated and refreshed weekly. It may contain inaccuracies. Use your own judgment, especially for production decisions.
