<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Safety on AI Knowledge Base</title><link>https://learn-ai.blindshot.kz/topics/safety/</link><description>Recent content in Safety on AI Knowledge Base</description><generator>Hugo</generator><language>en-us</language><atom:link href="https://learn-ai.blindshot.kz/topics/safety/index.xml" rel="self" type="application/rss+xml"/><item><title>The EU's AI Act and How Companies Can Achieve Compliance</title><link>https://learn-ai.blindshot.kz/docs/ai-strategy/risk/ai-governance-leaders/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/ai-strategy/risk/ai-governance-leaders/</guid><description>Practical guide to the EU AI Act — the world&amp;rsquo;s first comprehensive AI law — explaining risk categories, compliance requirements, and what companies need to do to prepare.</description></item><item><title>Designing a Responsible AI Program? Start with this Checklist</title><link>https://learn-ai.blindshot.kz/docs/ai-strategy/risk/managing-ai-risk/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/ai-strategy/risk/managing-ai-risk/</guid><description>Eight critical questions organizations should answer before implementing enterprise-wide responsible AI programs to avoid rushing deployment and wasting resources.</description></item><item><title>Introduction to Responsible AI</title><link>https://learn-ai.blindshot.kz/courses/google-responsible-ai/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/courses/google-responsible-ai/</guid><description>&lt;p&gt;Google&amp;rsquo;s concise introduction to responsible AI practices — covers fairness, interpretability, privacy, and security in AI systems. At just 30 minutes, this is the fastest way to understand the principles that should guide any AI deployment. 
Pairs well with the AI Safety learning paths in this knowledge base for a more comprehensive view of safety across providers.&lt;/p&gt;</description></item><item><title>AI Safety &amp; Guardrails</title><link>https://learn-ai.blindshot.kz/paths/ai-safety/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/paths/ai-safety/</guid><description>&lt;p&gt;Master the techniques for building safe, responsible AI applications across providers. This path covers content moderation, jailbreak defense, hallucination reduction, and guardrail patterns — drawing from 6 different sources to show how each provider approaches safety differently.&lt;/p&gt;
&lt;p&gt;Cross-provider comparison is central to this path: OpenAI uses a separate moderation endpoint, Anthropic leverages the main model with prompt design, Mistral has its own categorization, and Cohere offers declarative safety modes. Understanding all approaches lets you build defense-in-depth systems.&lt;/p&gt;</description></item><item><title>AI Safety &amp; Risk for Decision-Makers</title><link>https://learn-ai.blindshot.kz/paths/ai-safety-decision-makers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/paths/ai-safety-decision-makers/</guid><description>&lt;p&gt;A concise path for executives, product leaders, and board members who need to understand AI risks and safety without technical depth. Covers data privacy, hallucination risk, safety engineering principles, regulatory compliance (EU AI Act), and responsible AI governance.&lt;/p&gt;
&lt;p&gt;By the end of this 2-hour path, you&amp;rsquo;ll understand the key risks of deploying AI in production, what questions to ask your engineering team about safety measures, and how to establish an AI governance framework for your organization.&lt;/p&gt;</description></item><item><title>AI Risk &amp; Safety for Technical Leaders</title><link>https://learn-ai.blindshot.kz/paths/ai-risk-technical-leaders/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/paths/ai-risk-technical-leaders/</guid><description>&lt;p&gt;A bridge between engineering safety practices and board-level risk governance for technical leaders who need to evaluate their team&amp;rsquo;s safety architecture and communicate residual risks to stakeholders. This path covers hallucination mitigation, data protection, enterprise guardrails, regulatory compliance, and production readiness assessment.&lt;/p&gt;
&lt;p&gt;By the end of this path, you&amp;rsquo;ll be able to: evaluate whether your engineering team has implemented defense-in-depth safety, assess compliance requirements under the EU AI Act, conduct technical due diligence on AI projects before production deployment, and translate engineering risk into board-level governance language.&lt;/p&gt;</description></item><item><title>Agents</title><link>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/agents/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/agents/</guid><description>Configure agent instructions, tools, guardrails, memory, and streaming behavior.</description></item><item><title>Classifier Factory</title><link>https://learn-ai.blindshot.kz/docs/mistral/docs/capabilities/finetuning/classifier-factory/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/mistral/docs/capabilities/finetuning/classifier-factory/</guid><description>Create and fine-tune custom classification models for intent detection, moderation, sentiment analysis, and more using Mistral&amp;rsquo;s Classifier Factory</description></item><item><title>Content Moderation</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/about-claude/use-case-guides/content-moderation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/about-claude/use-case-guides/content-moderation/</guid><description>&lt;p&gt;This guide covers using Claude as a content moderation layer, including prompt design for classification, handling ambiguous cases, and calibrating sensitivity thresholds for different content categories. The practical impact is significant: a well-tuned Claude moderation pipeline can replace or augment dedicated moderation APIs at lower cost while offering more nuanced category distinctions. 
Pay attention to the latency and cost tradeoffs between using Haiku for high-volume screening and Sonnet for borderline cases that need deeper reasoning. Running traditional keyword filters as a first pass ahead of Claude moderation is a pattern worth adopting early to keep API costs manageable at scale.&lt;/p&gt;</description></item><item><title>Cybersecurity checks</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-checks/cybersecurity/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-checks/cybersecurity/</guid><description>&lt;p&gt;OpenAI&amp;rsquo;s cybersecurity checks documentation details the built-in safeguards that prevent models from generating malicious code, exploitation techniques, or detailed vulnerability instructions. Understanding these guardrails matters when building security-adjacent applications like code review tools or penetration testing assistants, where legitimate security content can trigger false positives. Focus on the boundary between allowed and blocked content so you can design prompts that stay within policy while still being useful for defensive security work. If your application handles security-sensitive domains, read this before encountering unexpected refusals in production.&lt;/p&gt;</description></item><item><title>Data controls in the OpenAI platform</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/your-data/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/your-data/</guid><description>Your data is your data. 
An overview of how OpenAI uses your data, including retention and usage policies.</description></item><item><title>Evaluate using local scorers</title><link>https://learn-ai.blindshot.kz/docs/wandb/weave/guides/evaluation/weave_local_scorers/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/wandb/weave/guides/evaluation/weave_local_scorers/</guid><description>Small language models that run locally to evaluate AI system safety and quality</description></item><item><title>Example Gallery</title><link>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/examples/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/examples/</guid><description>Task-oriented examples that demonstrate agent loops, tool usage, guardrails, and integration patterns.</description></item><item><title>Guardrails</title><link>https://learn-ai.blindshot.kz/docs/langchain/oss/javascript/langchain/guardrails/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/langchain/oss/javascript/langchain/guardrails/</guid><description>Implement safety checks and content filtering for your agents</description></item><item><title>Guardrails</title><link>https://learn-ai.blindshot.kz/docs/langchain/oss/python/langchain/guardrails/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/langchain/oss/python/langchain/guardrails/</guid><description>Implement safety checks and content filtering for your agents</description></item><item><title>Guardrails</title><link>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/guardrails/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/agents-sdk/guardrails/</guid><description>Define validators that run alongside the agent loop to enforce business and safety rules.</description></item><item><title>Guardrails and human 
review</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/agents/guardrails-approvals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/agents/guardrails-approvals/</guid><description>Learn how to use guardrails and human review in the OpenAI Agents SDK for safer, more controlled workflows.</description></item><item><title>Hallucination Guardrail</title><link>https://learn-ai.blindshot.kz/docs/crewai/en/enterprise/features/hallucination-guardrail/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/crewai/en/enterprise/features/hallucination-guardrail/</guid><description>Prevent and detect AI hallucinations in your CrewAI tasks</description></item><item><title>Handle Streaming Refusals</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/handle-streaming-refusals/</guid><description>&lt;p&gt;Streaming refusals present a unique UX challenge: tokens have already been sent to the client before the model decides to refuse, so you cannot simply suppress the response. This guide covers detection strategies and graceful recovery patterns for when Claude determines mid-stream that a request violates safety guidelines. Pay close attention to the stop reason codes and how they differ from normal completion events; your streaming parser needs to handle refusal signals without crashing or displaying partial unsafe content. 
Implement these patterns early in development rather than retrofitting them after users encounter jarring truncated responses in production.&lt;/p&gt;</description></item><item><title>Increase Consistency</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/increase-consistency/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/increase-consistency/</guid><description>&lt;p&gt;Output consistency matters most when Claude powers automated pipelines where downstream code parses its responses. This guide covers techniques like temperature reduction, few-shot examples, structured output formats, and explicit schemas that make Claude&amp;rsquo;s responses more deterministic. The single biggest lever is providing concrete output examples in your prompt &amp;ndash; this anchors the model&amp;rsquo;s formatting far more reliably than verbal instructions alone. Read this before building any system that pipes Claude output into JSON parsers, database inserts, or multi-step agent workflows.&lt;/p&gt;</description></item><item><title>Mitigate Jailbreaks</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/mitigate-jailbreaks/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/mitigate-jailbreaks/</guid><description>&lt;p&gt;Jailbreak mitigation is essential for any production deployment where Claude interacts with untrusted user input. This guide covers defense-in-depth strategies including system prompt hardening, input validation, and output filtering. A common pitfall is relying solely on system prompt instructions for safety &amp;ndash; attackers routinely bypass single-layer defenses, so layering multiple techniques is critical. 
Read this alongside the harmlessness screens documentation to understand how Anthropic&amp;rsquo;s built-in protections complement your application-level guardrails.&lt;/p&gt;</description></item><item><title>Model customization</title><link>https://learn-ai.blindshot.kz/docs/mistral/docs/getting-started/model_customization/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/mistral/docs/getting-started/model_customization/</guid><description>Learn how to customize LLMs for your application with system prompts, fine-tuning, and moderation layers</description></item><item><title>Moderation</title><link>https://learn-ai.blindshot.kz/docs/mistral/docs/capabilities/moderation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/mistral/docs/capabilities/moderation/</guid><description>Mistral&amp;rsquo;s moderation API detects harmful content across multiple categories using AI-powered classification for text and conversations</description></item><item><title>Moderation</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/moderation/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/moderation/</guid><description>OpenAI&amp;rsquo;s free moderation endpoint for detecting harmful content across categories like hate, violence, self-harm, and sexual content in both text and images.</description></item><item><title>PII Redaction for Traces</title><link>https://learn-ai.blindshot.kz/docs/crewai/en/enterprise/features/pii-trace-redactions/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/crewai/en/enterprise/features/pii-trace-redactions/</guid><description>Automatically redact sensitive data from crew and flow execution traces</description></item><item><title>Python v2 SDK Migration 
Guide</title><link>https://learn-ai.blindshot.kz/docs/together-ai/docs/pythonv2-migration-guide/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/together-ai/docs/pythonv2-migration-guide/</guid><description>Migrate from Together Python v1 to v2 - the new Together AI Python SDK with improved type safety and modern architecture.</description></item><item><title>Red teaming</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/red-teaming/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/red-teaming/</guid><description>Learn how red teaming fits into AI evaluation, including Promptfoo open source and OpenAI Red Teaming for enterprise teams.</description></item><item><title>Reduce Hallucinations</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-hallucinations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-hallucinations/</guid><description>&lt;p&gt;Hallucination reduction is arguably the most impactful guardrail topic for practitioners building retrieval-augmented or factual applications with Claude. The guide covers grounding techniques such as providing source documents, instructing the model to quote directly, and asking it to flag uncertainty. 
A key gotcha is that simply telling Claude &amp;ldquo;don&amp;rsquo;t hallucinate&amp;rdquo; is far less effective than structuring prompts so the model can either cite a source or decline to answer &amp;ndash; give it an explicit escape hatch like &amp;ldquo;say &amp;lsquo;I don&amp;rsquo;t know&amp;rsquo; if the answer isn&amp;rsquo;t in the provided context.&amp;rdquo; Pair this with the evaluation techniques in the testing docs to measure hallucination rates systematically.&lt;/p&gt;</description></item><item><title>Reduce Latency</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-latency/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-latency/</guid><description>&lt;p&gt;Latency optimization directly impacts user experience and cost in production Claude deployments. This guide walks through techniques like prompt length reduction, streaming, model selection trade-offs, and caching strategies that can cut response times significantly. Start with the quick wins &amp;ndash; enabling streaming and trimming unnecessary context from prompts &amp;ndash; before moving to architectural changes like prompt caching. 
Be aware that some latency reduction techniques (such as using smaller models or shorter prompts) trade off against output quality, so always measure both metrics together.&lt;/p&gt;</description></item><item><title>Reduce Prompt Leak</title><link>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-prompt-leak/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/anthropic/platform/test-and-evaluate/strengthen-guardrails/reduce-prompt-leak/</guid><description>&lt;p&gt;Prompt leakage is one of the most common security concerns in production LLM applications, and this guide provides concrete techniques for preventing Claude from revealing system prompts to end users. Focus on the layered defense approach — no single technique is sufficient, so you need to combine prompt structure, output filtering, and behavioral instructions. A frequent mistake is relying solely on &amp;ldquo;do not reveal your instructions&amp;rdquo; directives, which are trivially bypassed by indirect extraction attacks. 
Read this alongside the general guardrails documentation to build a comprehensive safety posture before shipping user-facing agents.&lt;/p&gt;</description></item><item><title>Safety and Security for AI Agents</title><link>https://learn-ai.blindshot.kz/docs/google/adk/safety/_overview/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/google/adk/safety/_overview/</guid><description/></item><item><title>Safety best practices</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-best-practices/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-best-practices/</guid><description>Comprehensive safety practices for responsible AI deployment — covering moderation, adversarial testing, human oversight, prompt engineering for safety, and production monitoring.</description></item><item><title>Safety in building agents</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/agent-builder-safety/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/agent-builder-safety/</guid><description>Minimize prompt injections and other risks in agent workflows — covering input validation, tool call authorization, and sandboxing strategies specific to agentic systems.</description></item><item><title>Safety Modes</title><link>https://learn-ai.blindshot.kz/docs/cohere/docs/safety-modes/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/cohere/docs/safety-modes/</guid><description>The safety modes documentation describes how to use default and strict modes in order to exercise additional control over model output.</description></item><item><title>Scorers As Guardrails</title><link>https://learn-ai.blindshot.kz/docs/wandb/weave/cookbooks/scorers_as_guardrails/</link><pubDate>Mon, 01 Jan 0001 00:00:00 
+0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/wandb/weave/cookbooks/scorers_as_guardrails/</guid><description>Learn how to use scorers as guardrails with W&amp;amp;B Weave</description></item><item><title>Set up guardrails</title><link>https://learn-ai.blindshot.kz/docs/wandb/weave/guides/evaluation/guardrails/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/wandb/weave/guides/evaluation/guardrails/</guid><description>Ensure LLM safety and measure output quality in production applications</description></item><item><title>Under 18 API Guidance</title><link>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-checks/under-18-api-guidance/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://learn-ai.blindshot.kz/docs/openai/api/api/docs/guides/safety-checks/under-18-api-guidance/</guid><description>&lt;p&gt;This guide outlines OpenAI&amp;rsquo;s requirements and recommendations for API applications that may serve users under 18, covering content filtering, age-gating, and compliance considerations. If you are building an education, tutoring, or family-oriented product on the OpenAI API, this is mandatory reading &amp;ndash; non-compliance can result in API access revocation. Focus on the specific content categories that require additional filtering for minors, as these go beyond the default safety settings. The guidance here also has implications for your application&amp;rsquo;s terms of service and privacy policy, so loop in your legal and product teams early.&lt;/p&gt;</description></item></channel></rss>