Safety on AI Knowledge Base

The EU's AI Act and How Companies Can Achieve Compliance

Practical guide to the EU AI Act — the world’s first comprehensive AI law — explaining risk categories, compliance requirements, and what companies need to do to prepare.

Designing a Responsible AI Program? Start with this Checklist

Eight critical questions organizations should answer before implementing enterprise-wide responsible AI programs to avoid rushing deployment and wasting resources.

Introduction to Responsible AI

Google’s concise introduction to responsible AI practices, covering fairness, interpretability, privacy, and security in AI systems. At just 30 minutes, it is a quick way to learn the principles that should guide any AI deployment. It pairs well with the AI Safety learning paths in this knowledge base, which give a more comprehensive view of safety across providers.

AI Safety & Guardrails

Master the techniques for building safe, responsible AI applications across providers. This path covers content moderation, jailbreak defense, hallucination reduction, and guardrail patterns, drawing on six sources to show how each provider approaches safety differently.

Cross-provider comparison is central to this path: OpenAI uses a separate moderation endpoint, Anthropic relies on the main model steered by prompt design, Mistral provides a moderation API with its own category scheme, and Cohere offers declarative safety modes. Understanding all four approaches lets you build defense-in-depth systems.

AI Safety & Risk for Decision-Makers

A concise path for executives, product leaders, and board members who need to understand AI risks and safety without technical depth. Covers data privacy, hallucination risk, safety engineering principles, regulatory compliance (EU AI Act), and responsible AI governance.

By the end of this 2-hour path, you’ll understand the key risks of deploying AI in production, what questions to ask your engineering team about safety measures, and how to establish an AI governance framework for your organization.

AI Risk & Safety for Technical Leaders

A bridge between engineering safety practices and board-level risk governance for technical leaders who need to evaluate their team’s safety architecture and communicate residual risks to stakeholders. This path covers hallucination mitigation, data protection, enterprise guardrails, regulatory compliance, and production readiness assessment.

By the end of this path, you’ll be able to: evaluate whether your engineering team has implemented defense-in-depth safety, assess compliance requirements under the EU AI Act, conduct technical due diligence on AI projects before production deployment, and translate engineering risk into board-level governance language.

Agents

Configure agent instructions, tools, guardrails, memory, and streaming behavior.

Classifier Factory

Create and fine-tune custom classification models for intent detection, moderation, sentiment analysis, and more using Mistral’s Classifier Factory.

Content Moderation

This guide covers using Claude as a content moderation layer, including prompt design for classification, handling ambiguous cases, and calibrating sensitivity thresholds for different content categories. The practical impact is significant: a well-tuned Claude moderation pipeline can replace or augment dedicated moderation APIs at lower cost while offering more nuanced category distinctions. Pay attention to the latency and cost tradeoffs between using Haiku for high-volume screening versus Sonnet for borderline cases that need deeper reasoning. Combining Claude moderation with traditional keyword filters as a first pass is a pattern worth adopting early to keep API costs manageable at scale.
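
The keyword-first, model-second pattern described above can be sketched roughly as follows. This is a minimal illustration, not the guide's implementation: the `BLOCKLIST` terms, the `moderate` function, the 0.8 threshold, and the two classifier callables (a cheap high-volume model and a stronger model for borderline cases) are all hypothetical stand-ins for whatever API calls you actually make.

```python
import re
from typing import Callable, Tuple

# Hypothetical blocklist; a real deployment would maintain its own terms.
BLOCKLIST = re.compile(r"\b(?:credit card dump|stolen passwords)\b", re.IGNORECASE)


def moderate(
    text: str,
    fast_model: Callable[[str], Tuple[str, float]],
    strong_model: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Return 'block' or 'allow' using three escalating stages."""
    # Stage 1: a free keyword pass catches obvious cases without any API call.
    if BLOCKLIST.search(text):
        return "block"
    # Stage 2: the cheap model screens the bulk of the traffic.
    label, confidence = fast_model(text)
    if confidence >= threshold:
        return label
    # Stage 3: only low-confidence (borderline) items reach the costlier model.
    return strong_model(text)
```

In this arrangement the threshold is the knob that trades cost against accuracy: raising it sends more traffic to the stronger model.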

Cybersecurity checks

OpenAI’s cybersecurity checks documentation details the built-in safeguards that prevent models from generating malicious code, exploitation techniques, or detailed vulnerability instructions. Understanding these guardrails matters when building security-adjacent applications like code review tools or penetration testing assistants, where legitimate security content can trigger false positives. Focus on the boundary between allowed and blocked content so you can design prompts that stay within policy while still being useful for defensive security work. If your application handles security-sensitive domains, read this before encountering unexpected refusals in production.

Data controls in the OpenAI platform

Your data is your data. An overview of how OpenAI uses your data, including retention and usage policies.

Evaluate using local scorers

Small language models that run locally to evaluate AI system safety and quality.

Example Gallery

Task-oriented examples that demonstrate agent loops, tool usage, guardrails, and integration patterns.

Guardrails

Implement safety checks and content filtering for your agents.

Guardrails

Define validators that run alongside the agent loop to enforce business and safety rules.

Guardrails and human review

Learn how to use guardrails and human review in the OpenAI Agents SDK for safer, more controlled workflows.

Hallucination Guardrail

Prevent and detect AI hallucinations in your CrewAI tasks.

Handle Streaming Refusals

Streaming refusals present a unique UX challenge: tokens have already been sent to the client before the model decides to refuse, so you cannot simply suppress the response. This guide covers detection strategies and graceful recovery patterns for when Claude mid-stream determines a request violates safety guidelines. Pay close attention to the stop reason codes and how they differ from normal completion events — your streaming parser needs to handle refusal signals without crashing or displaying partial unsafe content. Implement these patterns early in development rather than retrofitting them after users encounter jarring truncated responses in production.
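
One way to picture the recovery pattern is a parser that turns stream events into UI actions, so that a refusal replaces the partial text already shown instead of leaving it on screen. The sketch below is an assumption-laden illustration: the event dictionaries, the `render_actions` name, the action labels, and the `"refusal"` stop reason value are hypothetical, so check the provider's streaming documentation for the real event shapes and codes.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Assumed stop reason value; confirm the real code in the provider docs.
REFUSAL_STOP_REASON = "refusal"


def render_actions(
    events: Iterable[Dict[str, Any]],
    fallback: str = "Sorry, I can't help with that request.",
) -> Iterator[Tuple[str, str]]:
    """Translate stream events into (action, payload) pairs for the UI."""
    for event in events:
        kind = event.get("type")
        if kind == "text":
            # Normal token: the client appends it to the visible message.
            yield ("append", event.get("text", ""))
        elif kind == "stop":
            if event.get("stop_reason") == REFUSAL_STOP_REASON:
                # Partial text is already on screen, so replace it wholesale.
                yield ("replace_all", fallback)
            else:
                yield ("done", "")
            return
    # The stream ended without a stop event (for example, a dropped connection).
    yield ("interrupted", "")
```

Unknown event types are ignored rather than raising, which keeps the parser from crashing if the stream carries events this sketch does not model.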

Increase Consistency

Output consistency matters most when Claude powers automated pipelines where downstream code parses its responses. This guide covers techniques like temperature reduction, few-shot examples, structured output formats, and explicit schemas that make Claude’s responses more deterministic. The single biggest lever is providing concrete output examples in your prompt – this anchors the model’s formatting far more reliably than verbal instructions alone. Read this before building any system that pipes Claude output into JSON parsers, database inserts, or multi-step agent workflows.
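
As a rough illustration of anchoring the format with concrete examples, the sketch below builds a few-shot prompt and validates the reply before downstream code touches it. The ticket-classification task, the `EXAMPLES` data, the key names, and both function names are hypothetical and chosen only for the example.

```python
import json
from typing import Dict

# Hypothetical few-shot examples showing the exact output shape we expect.
EXAMPLES = [
    ("The checkout page crashes on submit.", {"category": "bug", "priority": "high"}),
    ("Please add a dark mode.", {"category": "feature", "priority": "low"}),
]
REQUIRED_KEYS = {"category", "priority"}


def build_prompt(ticket: str) -> str:
    """Assemble instructions, worked examples, and the new input."""
    parts = ["Classify the support ticket. Reply with JSON only, matching the examples."]
    for text, output in EXAMPLES:
        parts.append(f"Ticket: {text}\nJSON: {json.dumps(output, sort_keys=True)}")
    parts.append(f"Ticket: {ticket}\nJSON:")
    return "\n\n".join(parts)


def parse_response(raw: str) -> Dict[str, str]:
    """Parse the model reply and reject anything that drifts from the schema."""
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected response shape: {raw!r}")
    return data
```

Validating at the boundary means a formatting drift surfaces as one clear error instead of a failed database insert several steps later.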

Mitigate Jailbreaks

Jailbreak mitigation is essential for any production deployment where Claude interacts with untrusted user input. This guide covers defense-in-depth strategies including system prompt hardening, input validation, and output filtering. A common pitfall is relying solely on system prompt instructions for safety – attackers routinely bypass single-layer defenses, so layering multiple techniques is critical. Read this alongside the harmlessness screens documentation to understand how Anthropic’s built-in protections complement your application-level guardrails.

Model customization

Learn how to customize LLMs for your application with system prompts, fine-tuning, and moderation layers.

Moderation

Mistral’s moderation API detects harmful content across multiple categories using AI-powered classification for text and conversations.

Moderation

OpenAI’s free moderation endpoint for detecting harmful content across categories like hate, violence, self-harm, and sexual content in both text and images.

PII Redaction for Traces

Automatically redact sensitive data from crew and flow execution traces.

Python v2 SDK Migration Guide

Migrate from Together Python v1 to v2, the new Together AI Python SDK with improved type safety and a modern architecture.

Red teaming

Learn how red teaming fits into AI evaluation, including Promptfoo open source and OpenAI Red Teaming for enterprise teams.

Reduce Hallucinations

Hallucination reduction is arguably the most impactful guardrail topic for practitioners building retrieval-augmented or factual applications with Claude. The guide covers grounding techniques such as providing source documents, instructing the model to quote directly, and asking it to flag uncertainty. A key gotcha is that simply telling Claude “don’t hallucinate” is far less effective than structuring prompts so the model can cite or decline – give it an explicit escape hatch like “say I don’t know if the answer isn’t in the provided context.” Pair this with the evaluation techniques in the testing docs to measure hallucination rates systematically.
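
The grounding ideas above (supply the source, ask for direct quotes, offer an escape hatch) might look like this in practice. It is a sketch under stated assumptions: the prompt wording, the `FALLBACK` string, and both function names are hypothetical, and the quote check is a simple substring test rather than a full hallucination evaluation.

```python
import re
from typing import List

# Hypothetical escape-hatch phrase the model is told to use.
FALLBACK = "I don't know"


def build_grounded_prompt(context: str, question: str) -> str:
    """Give the model source text, a quoting rule, and a way to decline."""
    return (
        "Answer using only the document below. Support each claim with a "
        "direct quote in double quotes.\n"
        f'If the answer is not in the document, reply exactly: "{FALLBACK}"\n\n'
        f"<document>\n{context}\n</document>\n\n"
        f"Question: {question}"
    )


def unsupported_quotes(answer: str, context: str) -> List[str]:
    """Return quoted spans in the answer that do not appear in the context."""
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if q != FALLBACK and q not in context]
```

A non-empty result from `unsupported_quotes` flags an answer whose quoted evidence cannot be found verbatim in the source, which is a cheap signal to log when measuring hallucination rates.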

Reduce Latency

Latency optimization directly impacts user experience and cost in production Claude deployments. This guide walks through techniques like prompt length reduction, streaming, model selection trade-offs, and caching strategies that can cut response times significantly. Start with the quick wins – enabling streaming and trimming unnecessary context from prompts – before moving to architectural changes like prompt caching. Be aware that some latency reduction techniques (such as using smaller models or shorter prompts) trade off against output quality, so always measure both metrics together.

Reduce Prompt Leak

Prompt leakage is one of the most common security concerns in production LLM applications, and this guide provides concrete techniques for preventing Claude from revealing system prompts to end users. Focus on the layered defense approach — no single technique is sufficient, so you need to combine prompt structure, output filtering, and behavioral instructions. A frequent mistake is relying solely on “do not reveal your instructions” directives, which are trivially bypassed by indirect extraction attacks. Read this alongside the general guardrails documentation to build a comprehensive safety posture before shipping user-facing agents.
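
The output-filtering layer could be approximated by checking each response for verbatim runs of words from the system prompt before it reaches the user. The sketch below is illustrative only: `leaks_prompt` and the six-word window are hypothetical choices, and a verbatim check will miss paraphrased, translated, or encoded leaks, which is why it is meant as one layer among several.

```python
def leaks_prompt(response: str, system_prompt: str, window: int = 6) -> bool:
    """Return True if the response repeats any `window` consecutive prompt words."""
    words = system_prompt.lower().split()
    if not words:
        return False
    # Normalize whitespace and case so trivial reformatting does not hide a leak.
    haystack = " ".join(response.lower().split())
    # For prompts shorter than the window, the single check covers the whole prompt.
    for start in range(max(1, len(words) - window + 1)):
        phrase = " ".join(words[start:start + window])
        if phrase in haystack:
            return True
    return False
```

A caller would typically swap a flagged response for a generic refusal and log the event for review.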

Safety and Security for AI Agents

Safety best practices

Comprehensive safety practices for responsible AI deployment — covering moderation, adversarial testing, human oversight, prompt engineering for safety, and production monitoring.

Safety in building agents

Minimize prompt injections and other risks in agent workflows — covering input validation, tool call authorization, and sandboxing strategies specific to agentic systems.

Safety Modes

The safety modes documentation describes how to use the default and strict modes to exercise additional control over model output.

Scorers As Guardrails

Learn how to use scorers as guardrails with W&B Weave.

Set up guardrails

Ensure LLM safety and measure output quality in production applications.

Under 18 API Guidance

This guide outlines OpenAI’s requirements and recommendations for API applications that may serve users under 18, covering content filtering, age-gating, and compliance considerations. If you are building an education, tutoring, or family-oriented product on the OpenAI API, this is mandatory reading – non-compliance can result in API access revocation. Focus on the specific content categories that require additional filtering for minors, as these go beyond the default safety settings. The guidance here also has implications for your application’s terms of service and privacy policy, so loop in your legal and product teams early.