AI Glossary

Practitioner-level definitions of key AI and LLM concepts, linked to documentation across providers in this catalog. Each term includes links to the most relevant artifacts and cross-references to related concepts.

Quick Navigation#

Foundation Concepts: Agent | Alignment | Chain-of-Thought | Context Window | Embedding | Few-Shot Learning | Grounding | Hallucination | Inference | Large Language Model (LLM) | Multimodal | Prompt | Reasoning | System Prompt | Temperature | Token

Model Training & Optimization: Distillation | Fine-Tuning | Mixture of Experts (MoE) | Pre-Training | Quantization | RLHF | Transfer Learning

Architecture Patterns: Function Calling | GroupChat | Guardrails | Harness Engineering | Memory (Agent) | Multi-Agent System | Orchestration | Pipeline | Retrieval-Augmented Generation (RAG) | Streaming | Structured Output | Tool Use | Workflow | Workflow Agent

Embeddings & Search: Chunking | Cosine Similarity | Hybrid Search | Reranking | Semantic Search | Similarity Search | Vector Database

Protocols & Standards: A2A Protocol | AG-UI Protocol | API | JSON Schema | Model Context Protocol (MCP)

Evaluation & Quality: Benchmark | Evaluation (Evals) | Faithfulness | Precision & Recall | Prompt Engineering | Red Teaming | Relevance

Infrastructure & Operations: Batching | Caching | Cost Optimization | Deployment | Model Serving | Observability | Rate Limiting | Safetensors

Safety: Content Moderation | Jailbreaking | Safety | Toxicity


A2A Protocol#

The Agent-to-Agent (A2A) protocol is an open standard initiated by Google for communication between AI agents across different platforms and frameworks. It defines how agents discover each other’s capabilities, negotiate tasks, and exchange results, enabling interoperability in multi-agent systems regardless of the underlying implementation.

Key resources: A2A Protocol Docs

See also: Agent, Multi-Agent System, Model Context Protocol (MCP)


AG-UI Protocol#

The Agent-User Interaction (AG-UI) protocol standardizes how AI agents communicate with frontend user interfaces. It defines event types, message formats, and interaction patterns for streaming agent responses, tool invocations, and state updates to client applications in real time.

Key resources: AG-UI Concepts: Agents | AG-UI Protocol Docs

See also: Agent, Streaming, Model Context Protocol (MCP)


Agent#

An AI system that can autonomously plan, reason, and take actions to accomplish goals. Unlike a simple chatbot that responds to single prompts, an agent maintains state across steps, uses tools to interact with external systems, and iterates toward objectives with minimal human intervention. Agent architectures range from single-loop ReAct patterns to complex multi-agent orchestrations.

Key resources: Agents (OpenAI) | Agents Introduction (Mistral) | Anthropic Agent SDK Overview | Agents (CrewAI)

See also: Tool Use, Multi-Agent System, Orchestration, Memory (Agent)


Alignment#

The process of ensuring an AI system’s behavior matches intended human values, goals, and preferences. Alignment techniques include RLHF, constitutional AI, and safety training. In practice, alignment shows up as a model refusing harmful requests, following instructions accurately, and producing helpful rather than deceptive outputs.

Key resources: Safety Best Practices (OpenAI) | Reduce Hallucinations (Anthropic)

See also: RLHF, Safety, Guardrails


API#

Application Programming Interface — a defined contract for how software components communicate. In the AI ecosystem, APIs are the primary way developers access model capabilities: sending prompts and receiving completions over HTTP. Each provider (OpenAI, Anthropic, Mistral, Cohere) exposes REST APIs with provider-specific request/response formats, authentication, and rate limits.

Key resources: OpenAI API Concepts | Anthropic Platform Docs | Mistral Docs

See also: Rate Limiting, Streaming, JSON Schema


Batching#

Processing multiple requests together in a single API call or scheduled job rather than one at a time. Batch APIs let you submit large sets of prompts for asynchronous processing at reduced cost and higher throughput, with the trade-off of higher latency per individual request. Useful for offline tasks like classification, summarization, or embedding generation over large datasets.

Key resources: Anthropic Platform Docs

See also: Cost Optimization, Inference, Rate Limiting


Benchmark#

A standardized test suite used to measure and compare AI model performance across specific capabilities. Common benchmarks include MMLU (knowledge), HumanEval (code generation), GSM8K (math reasoning), and MTEB (embeddings). Benchmarks provide objective comparison points but can be gamed and don’t always reflect real-world task performance.

Key resources: Model Selection (OpenAI) | Models Overview (Anthropic) | Evaluation Guide (OpenAI)

See also: Evaluation (Evals), Relevance


Caching#

Storing and reusing previously computed results to reduce latency and cost. In LLM applications, two forms dominate: prompt caching stores the processed prefix of repeated system prompts so the model doesn’t reprocess them on every request, and KV caching stores the key-value attention states during generation to speed up autoregressive decoding. Prompt caching can cut input costs substantially for applications with long, stable system prompts: providers typically discount cached input tokens by roughly 50-90%, so the actual saving depends on the provider and on how much of each request is a cache hit.

Key resources: Prompt Caching (Anthropic) | KV Cache (DeepSeek)

See also: Cost Optimization, Context Window, Token


Chain-of-Thought#

A prompting technique where the model is instructed to show its reasoning step-by-step before arriving at a final answer. Chain-of-thought (CoT) improves accuracy on math, logic, and multi-step reasoning tasks by forcing the model to decompose problems rather than jumping to conclusions. Can be elicited with phrases like “think step by step” or through structured reasoning formats.

Key resources: Prompt Engineering Overview (Anthropic) | Reasoning Best Practices (OpenAI)

See also: Reasoning, Prompt Engineering, Prompt


Chunking#

Splitting documents into smaller segments before embedding and indexing them in a vector database. Chunk size affects retrieval quality: too large and relevant information gets diluted by surrounding context; too small and chunks lose coherence. Common strategies include fixed-size windows with overlap, sentence-boundary splitting, and semantic chunking based on topic shifts.
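
As a minimal sketch of the fixed-size-window strategy, the function below splits text into character windows that overlap. The name `chunk_text` and the default sizes are illustrative assumptions; production chunkers usually count tokens or sentences rather than characters.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; sizes are in characters, not tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of indexing some text twice.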

Key resources: Pinecone Concepts | RAG Complete Example (Cohere)

See also: Embedding, Vector Database, Retrieval-Augmented Generation (RAG)


Content Moderation#

Automated detection and filtering of harmful, toxic, or policy-violating content in model inputs or outputs. Moderation APIs classify text across categories like hate speech, violence, sexual content, and self-harm. Used both as a safety layer on user inputs before they reach the model and as a filter on model outputs before they reach users.

Key resources: Content Moderation (Anthropic) | Moderation (OpenAI) | Moderation (Mistral)

See also: Safety, Guardrails, Toxicity


Context Window#

The maximum number of tokens a model can process in a single request, including both the input (system prompt, conversation history, documents) and the generated output. Context windows range from 4K tokens to over 1M tokens depending on the model. Managing context effectively — deciding what to include, when to summarize, and how to compress — is a core challenge in production LLM applications.

Key resources: Context Windows (Anthropic) | Token Counting (OpenAI) | Models Overview (Anthropic)

See also: Token, Caching, Prompt


Cosine Similarity#

The cosine of the angle between two vectors, used to quantify how semantically similar two pieces of text are after embedding. Values range from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), with higher scores indicating greater similarity. Cosine similarity is the default metric in many vector databases because it ignores vector magnitude and compares direction only, so two embeddings that point the same way score as similar regardless of their norm.
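
The definition can be written out directly as the dot product divided by the product of the two vector norms. This is a plain-Python sketch for illustration; the function name is an assumption, and real systems would use a vectorized library.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```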

Key resources: Embeddings (Cohere) | Pinecone Concepts | Embeddings Guide (OpenAI)

See also: Embedding, Similarity Search, Vector Database


Cost Optimization#

Strategies for reducing the financial cost of running LLM-powered applications in production. Key levers include: choosing smaller models for simpler tasks, using prompt caching to avoid reprocessing stable prefixes, batching requests for offline workloads, reducing token usage through concise prompts, and routing between model tiers based on query complexity.

Key resources: Pricing (Anthropic) | Token Counting (OpenAI)

See also: Token, Caching, Batching, Model Serving


Deployment#

The process of making an AI application available to users in a production environment. Encompasses infrastructure decisions (cloud vs. on-premise, GPU allocation), serving strategy (serverless vs. dedicated endpoints), scaling (auto-scaling, load balancing), and operational concerns (monitoring, rollback, A/B testing). Self-hosted open-source models add complexity around quantization, hardware selection, and model loading.

Key resources: Deployment (Mistral) | Dedicated Inference (Together AI)

See also: Model Serving, Observability, Cost Optimization


Distillation#

Training a smaller “student” model to reproduce the behavior of a larger “teacher” model. The student learns from the teacher’s output probabilities (soft labels) rather than just the ground-truth labels, capturing nuanced patterns that the teacher has learned. Distillation produces models that are cheaper and faster to run while retaining much of the teacher’s capability. Used to create production-optimized versions of large research models.
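
To make the soft-label idea concrete, the sketch below computes one common form of distillation loss: the KL divergence between the teacher’s and the student’s temperature-softened output distributions. The function names and the temperature of 2.0 are assumptions for illustration; real training code would also scale the term, combine it with a ground-truth loss, and run on tensors.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: list[float],
                      teacher_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over the softened distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)


if __name__ == "__main__":
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
```

The loss is zero when the student matches the teacher exactly and grows as their distributions diverge.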

Key resources: Fine-Tuning Best Practices (OpenAI)

External references: Knowledge Distillation (Hugging Face)

See also: Fine-Tuning, Quantization, Transfer Learning


Embedding#

A dense numerical vector representation of text, images, or other data that captures semantic meaning in a continuous vector space. Embeddings are the foundation of similarity search, clustering, and retrieval systems. Text with similar meanings produces embeddings that are geometrically close, enabling operations like “find documents similar to this query” without keyword matching.

Key resources: Embeddings (Cohere) | Embeddings Overview (Mistral) | Embeddings Guide (OpenAI) | Embeddings Overview (Together AI)

See also: Vector Database, Cosine Similarity, Chunking, Semantic Search


Evaluation (Evals)#

Systematic measurement of AI model or application quality using defined metrics and test cases. Evaluations range from automated metrics (accuracy, F1, BLEU) to model-graded assessments where an LLM judges another LLM’s output. Building robust eval suites is widely regarded as one of the most important practices for production AI applications: without them, you can’t reliably detect regressions or compare approaches.

Key resources: Evals Guide (OpenAI) | Evaluation Getting Started (OpenAI) | RAGAS Docs | W&B Docs

See also: Benchmark, Faithfulness, Relevance, Precision & Recall


Faithfulness#

An evaluation metric measuring whether a model’s generated response is factually consistent with the provided source material. A faithful response only contains claims supported by the context — it doesn’t fabricate, contradict, or extrapolate beyond what the sources state. Critical for RAG systems where users expect answers grounded in retrieved documents.

Key resources: RAGAS Docs | RAG Citations (Cohere)

See also: Hallucination, Grounding, Retrieval-Augmented Generation (RAG), Evaluation (Evals)


Few-Shot Learning#

Providing a small number of input-output examples in the prompt to demonstrate the desired behavior, format, or reasoning pattern. The model generalizes from these examples without any weight updates. Zero-shot means no examples (just instructions), one-shot means one example, and few-shot typically means 2-10 examples. Few-shot prompting is one of the most reliable techniques for controlling output format and improving task accuracy.
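
A few-shot prompt is ultimately just a string with the examples laid out before the real query. The helper below is a hypothetical sketch; the `Input:`/`Output:` labels are one possible layout, not a format any provider requires.

```python
def build_few_shot_prompt(instruction: str,
                          examples: list[tuple[str, str]],
                          query: str) -> str:
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for example_input, example_output in examples:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # Leave the final output empty so the model completes it.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_few_shot_prompt(
        "Classify the sentiment as positive or negative.",
        [("great product", "positive"), ("awful service", "negative")],
        "works fine",
    ))
```

Passing an empty `examples` list yields a zero-shot prompt with the same layout.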

Key resources: Prompt Engineering Overview (Anthropic) | Glossary (Anthropic)

See also: Prompt, Prompt Engineering, Transfer Learning


Fine-Tuning#

Adapting a pre-trained model for a specific task or domain by training it on additional labeled data. Fine-tuning updates the model’s weights to specialize its behavior — improving accuracy on domain-specific tasks, adjusting output style, or teaching it formats that prompting alone can’t reliably achieve. Methods range from full-weight fine-tuning to parameter-efficient approaches like LoRA that only update a small subset of weights.

Key resources: Supervised Fine-Tuning (OpenAI) | Fine-Tuning (Mistral) | Fine-Tuning Best Practices (OpenAI) | Reinforcement Fine-Tuning (OpenAI)

See also: RLHF, Distillation, Pre-Training, Transfer Learning


Function Calling#

A model capability where the LLM outputs structured requests to invoke specific functions or APIs rather than generating plain text. The model doesn’t execute the function itself: it produces a JSON object specifying the function name and arguments, your code executes the function, and the result is returned to the model. OpenAI calls this “function calling”; Anthropic calls it “tool use.” The underlying pattern is the same, but the request/response formats differ.
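
The application side of this loop can be sketched as a small dispatcher. The JSON shape used here (`name` plus an `arguments` object), the `get_weather` tool, and the `TOOLS` registry are all assumptions for illustration; each provider wraps the call in its own envelope.

```python
import json


def get_weather(city: str) -> dict:
    # Hypothetical tool; a real one would query a weather service.
    return {"city": city, "forecast": "sunny"}


# Registry mapping the function names the model may request to real callables.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_call: str) -> str:
    """Parse a model-produced call, run the matching function, return JSON."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # Report the problem back to the model instead of crashing.
        return json.dumps({"error": f"unknown function: {name}"})
    result = TOOLS[name](**call["arguments"])
    return json.dumps(result)


if __name__ == "__main__":
    model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
    print(execute_tool_call(model_output))
```

The returned string is what would be sent back to the model as the tool result on the next turn.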

Key resources: Function Calling (Mistral) | Tool Use Overview (Cohere) | Tools (OpenAI)

See also: Tool Use, Structured Output, Agent, JSON Schema


Grounding#

Anchoring a model’s responses in verifiable source material rather than relying solely on its parametric knowledge. Grounded responses cite specific documents, data, or retrieved passages. RAG is the most common grounding technique, but grounding also includes web search augmentation, database lookups, and tool-verified facts. Grounding is the primary defense against hallucination.

Key resources: RAG (Cohere) | RAG Citations (Cohere)

See also: Hallucination, Retrieval-Augmented Generation (RAG), Faithfulness


GroupChat#

A multi-agent coordination pattern pioneered by AutoGen where agents collaborate through managed conversation turns in a shared chat. A GroupChat manager determines which agent speaks next using a selection strategy — round-robin, random, LLM-driven (based on agent descriptions and conversation context), or custom logic. This is a fundamentally different model from handoffs (explicit transfer) or graph-based orchestration (state machines) — it treats multi-agent coordination as a conversation management problem.
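
The round-robin selection strategy can be illustrated without any framework. This is a hypothetical sketch, not AutoGen’s API: agents are modelled as plain callables that take the shared history and return a reply.

```python
from itertools import cycle
from typing import Callable

History = list[tuple[str, str]]  # (speaker name, message)


def run_group_chat(agents: dict[str, Callable[[History], str]],
                   opening_message: str,
                   max_turns: int) -> History:
    """Let agents take turns in a shared conversation, round-robin."""
    history: History = [("user", opening_message)]
    speakers = cycle(agents)  # cycles over agent names in insertion order
    for _ in range(max_turns):
        name = next(speakers)
        reply = agents[name](history)  # every agent sees the full shared chat
        history.append((name, reply))
    return history


if __name__ == "__main__":
    agents = {
        "researcher": lambda history: f"finding #{len(history)}",
        "reviewer": lambda history: f"comment on turn {len(history) - 1}",
    }
    for speaker, message in run_group_chat(agents, "Investigate the bug.", 4):
        print(f"{speaker}: {message}")
```

Replacing `cycle` with a function that inspects the history is where random, LLM-driven, or custom selection would plug in.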

Key resources: GroupChat Pattern (AutoGen) | Speaker Selection (AutoGen)

See also: Multi-Agent System, Orchestration, Agent


Guardrails#

Programmatic constraints that prevent AI systems from producing harmful, off-topic, or policy-violating outputs. Guardrails operate at multiple levels: input validation (blocking prompt injections), output filtering (content moderation), structural constraints (enforcing JSON schemas), and behavioral boundaries (limiting which tools an agent can call). Effective guardrails are defense-in-depth — no single layer catches everything.

Key resources: Guardrails (OpenAI Agents SDK) | Guardrails (LangChain) | Strengthen Guardrails (Anthropic) | Hallucination Guardrail (CrewAI)

See also: Safety, Content Moderation, Alignment


Harness Engineering#

The practice of structuring a codebase — through documentation (AGENTS.md, WORKFLOW.md), CI pipelines, custom linters, and architectural constraints — so that autonomous coding agents can operate effectively without constant human supervision. Coined by OpenAI in the context of their Symphony and Codex projects. The core insight: the bottleneck shifts from writing code to designing environments, feedback loops, and control systems that enable agents to build and maintain software reliably at scale.

Key resources: Symphony Overview | Symphony Specification | WORKFLOW.md Configuration

See also: Orchestration, Agent, Workflow Agent


Hallucination#

When a model generates text that is factually incorrect, fabricated, or unsupported by its training data or provided context. Hallucinations range from subtle inaccuracies (wrong dates, non-existent citations) to entirely invented information presented with high confidence. Causes include training data gaps, over-generalization, and the autoregressive nature of text generation. Mitigation strategies include RAG, grounding, lower temperature, and explicit instructions to say “I don’t know.”

Key resources: Reduce Hallucinations (Anthropic) | RAG (Cohere) | RAG (LangChain)

See also: Grounding, Faithfulness, Retrieval-Augmented Generation (RAG)


Hybrid Search#

Combining keyword-based (lexical) search with vector-based (semantic) search to get the benefits of both. Lexical search excels at exact matches, proper nouns, and rare terms; semantic search handles paraphrases and conceptual similarity. Hybrid search typically fuses the two result lists, for example with Reciprocal Rank Fusion or a weighted combination of scores, and often outperforms either method alone for RAG retrieval.
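
Reciprocal Rank Fusion, mentioned above, is short enough to sketch in full: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k = 60 is the commonly used default; the function name and string document IDs are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    lexical = ["a", "b", "c"]    # e.g. keyword search results, best first
    semantic = ["b", "c", "d"]   # e.g. vector search results, best first
    print(reciprocal_rank_fusion([lexical, semantic]))
```

Because only ranks are used, the two retrievers’ raw scores never need to be put on a common scale.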

Key resources: Pinecone Concepts | Embeddings (Cohere)

See also: Semantic Search, Similarity Search, Reranking, Vector Database


Inference#

The process of generating predictions or outputs from a trained model given new inputs. In the LLM context, inference means processing a prompt through the model to produce a completion. Inference throughput is measured in tokens per second, and cost is typically quoted in dollars per million tokens. Factors affecting inference performance include model size, quantization level, hardware (GPU type and count), batching strategy, and KV cache efficiency.

Key resources: Reasoning (OpenAI) | Dedicated Inference (Together AI)

See also: Token, Model Serving, Cost Optimization, Batching


Jailbreaking#

Techniques for bypassing a model’s safety training and content filters to produce outputs the model was designed to refuse. Methods include role-playing prompts, encoding schemes, multi-turn escalation, and adversarial suffixes. Understanding jailbreaking is essential for building robust guardrails — you can’t defend against attacks you don’t know about.

Key resources: Mitigate Jailbreaks (Anthropic) | Safety Best Practices (OpenAI)

See also: Safety, Guardrails, Red Teaming, Content Moderation


JSON Schema#

A vocabulary for annotating and validating the structure of JSON data. In AI applications, JSON Schema is used to define the expected output format for structured generation — you provide a schema, and the model constrains its output to match. This enables reliable extraction of typed data (names, dates, categories) from unstructured text. Used by OpenAI’s structured outputs, Anthropic’s tool use, and Instructor’s response models.

Key resources: Structured Outputs (OpenAI) | Structured Outputs (Cohere) | Instructor Docs

See also: Structured Output, Function Calling, Tool Use


Large Language Model (LLM)#

A neural network trained on massive text corpora to predict the next token in a sequence. LLMs (GPT-4, Claude, Mistral, Command) demonstrate emergent capabilities in reasoning, code generation, translation, and instruction following that scale with model size and data. Modern LLMs are transformer-based architectures with billions of parameters, typically pre-trained on trillions of tokens and then fine-tuned for instruction following and safety.

Key resources: Models Overview (Anthropic) | Model Selection (OpenAI) | Glossary (Mistral) | Glossary (Anthropic)

See also: Token, Pre-Training, Fine-Tuning, Context Window


Memory (Agent)#

The mechanism by which an AI agent retains information across interactions or steps. Short-term memory holds the current conversation context (bounded by the context window). Long-term memory persists information across sessions using external storage — vector databases, key-value stores, or structured summaries. Memory design determines whether an agent can learn from past interactions, recall user preferences, and build on prior work.

Key resources: Conversation State (OpenAI) | Context (LangChain)

See also: Agent, Context Window, Vector Database


Mixture of Experts (MoE)#

A model architecture that routes each input token to a subset of specialized “expert” sub-networks rather than processing it through the entire model. This allows models to have very large total parameter counts while only activating a fraction for each token, dramatically reducing inference cost. Mistral’s models (Mixtral) popularized this approach. A 47B-parameter MoE model might only use 13B parameters per token.
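
The routing step can be sketched in a few lines: pick the top-k experts by router score, renormalize their weights, and mix only those experts’ outputs. Here the router logits are passed in directly and the experts are toy functions, both assumptions for illustration; in a real model a learned layer produces the logits per token and each expert is a feed-forward network.

```python
import math
from typing import Callable


def top_k_route(router_logits: list[float], k: int = 2) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax over just those."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x: float,
              experts: list[Callable[[float], float]],
              router_logits: list[float],
              k: int = 2) -> float:
    """Weighted sum of the selected experts; unselected experts never run."""
    weights = top_k_route(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in weights.items())


if __name__ == "__main__":
    experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
    print(moe_layer(1.0, experts, router_logits=[0.0, 5.0, 5.0, -1.0], k=2))
```

With four experts and k = 2, half of the expert parameters are skipped for this token, which is the source of the inference savings.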

Key resources: Glossary (Mistral)

External references: Mixture of Experts (Hugging Face)

See also: Large Language Model (LLM), Inference, Quantization


Model Context Protocol (MCP)#

An open protocol that standardizes how AI applications provide context to LLMs. MCP defines the communication layer between AI assistants (clients) and external tools, data sources, and services (servers). It uses a host/client/server architecture with typed capabilities (tools, resources, prompts) and supports multiple transport mechanisms (stdio, SSE, streamable HTTP). Created by Anthropic and widely adopted across the ecosystem.

Key resources: MCP Architecture | MCP Docs

See also: Tool Use, Agent, A2A Protocol, AG-UI Protocol


Model Serving#

The infrastructure and software layer that makes a trained model available for inference at production scale. Serving systems handle model loading, request routing, batching, GPU memory management, auto-scaling, and health monitoring. Options range from managed API endpoints (OpenAI, Anthropic) to self-hosted solutions (vLLM, TGI, Triton) for open-source models.

Key resources: Dedicated Inference (Together AI) | Self-Deployment with TGI (Mistral)

See also: Deployment, Inference, Quantization, Cost Optimization


Multi-Agent System#

An architecture where multiple specialized AI agents collaborate to solve complex tasks. Each agent has a defined role (researcher, coder, reviewer) and capability set. Agents communicate through structured messages, share context, and coordinate work — either through a central orchestrator or peer-to-peer protocols. Multi-agent systems excel when tasks require diverse expertise or parallel workstreams.

Key resources: CrewAI Docs | Agents (OpenAI) | A2A Protocol

See also: Agent, Orchestration, A2A Protocol, Workflow


Multimodal#

AI models or systems that can process and generate multiple types of data — text, images, audio, video, or code — within a single interaction. A multimodal model can analyze an image and answer questions about it, transcribe audio, or generate images from text descriptions. Most frontier models (GPT-4o, Claude, Gemini) are natively multimodal.

Key resources: Image Inputs (Cohere) | Multimodal Embeddings (Chroma) | Multimodal Agents (CrewAI)

See also: Embedding, Large Language Model (LLM)


Observability#

Monitoring, logging, and tracing of AI application behavior in production. Observability tools capture prompt/completion pairs, latency, token usage, error rates, and cost per request. Trace-level observability records the full execution path through multi-step agent workflows — which tools were called, what was retrieved, and where failures occurred. Essential for debugging, optimization, and quality assurance.

Key resources: W&B Docs | LangChain Docs

See also: Evaluation (Evals), Deployment, Cost Optimization


Orchestration#

Coordinating multiple AI agents, tools, and retrieval systems into coherent workflows. The orchestration landscape spans several paradigms: handoffs (OpenAI Agents SDK — explicit control transfer), group chat (AutoGen — agents take turns in managed conversation), state machines (LangGraph — directed graphs with typed state), crews (CrewAI — role-based teams with sequential/hierarchical processes), workflow agents (Google ADK — deterministic sequential/parallel/loop patterns), enterprise graphs (MS Agent Framework — graph workflows with executors and edges), and autonomous runs (Symphony — issue-to-PR with workspace isolation). Each makes different trade-offs between flexibility, predictability, and developer control.

Key resources: Google ADK Agents | Symphony Overview | MS Agent Framework Workflows | LangGraph | CrewAI Crews

See also: Agent, Workflow, Workflow Agent, GroupChat, Harness Engineering, Pipeline, Multi-Agent System


Pipeline#

A sequence of processing steps where the output of one step feeds into the next. In AI applications, pipelines chain operations like: retrieve documents → rerank → generate answer → validate output. RAGAS defines evaluation pipelines; LangChain and DSPy provide pipeline abstractions for building multi-step AI workflows. Pipelines enforce structure and make complex systems testable at each stage.

Key resources: RAGAS Docs | DSPy Docs | LangChain Docs

See also: Orchestration, Workflow, Retrieval-Augmented Generation (RAG)


Precision & Recall#

Two fundamental evaluation metrics. Precision measures the fraction of retrieved or generated items that are relevant (of the things you found, how many are correct?). Recall measures the fraction of all relevant items that were successfully retrieved (of the correct things, how many did you find?). In RAG evaluation, context precision measures whether retrieved chunks are relevant, and context recall measures whether all necessary information was retrieved.
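
Both metrics reduce to set arithmetic over retrieved and relevant item IDs. A minimal sketch, with a hypothetical function name and the convention (an assumption) that an empty denominator yields 0.0:

```python
def precision_recall(retrieved: list[str], relevant: list[str]) -> tuple[float, float]:
    """Return (precision, recall) for a set of retrieved item IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)  # relevant items that were retrieved
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # 4 chunks retrieved, 3 chunks actually relevant, 2 in common.
    precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
    print(f"precision={precision:.2f} recall={recall:.2f}")
```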

Key resources: RAGAS Docs | Evaluation Best Practices (OpenAI)

See also: Evaluation (Evals), Faithfulness, Relevance


Pre-Training#

The initial training phase where a model learns general language understanding from a massive, diverse text corpus. During pre-training, the model learns to predict the next token across trillions of tokens from books, websites, code, and other text. This builds the model’s foundational knowledge, reasoning ability, and language fluency. Pre-training is extremely expensive (millions of dollars in compute) and is performed by model providers, not end users.

Key resources: Glossary (Mistral) | Glossary (Anthropic)

See also: Large Language Model (LLM), Fine-Tuning, Transfer Learning, Token


Prompt#

The input text sent to a language model to elicit a response. A prompt can range from a simple question to a complex multi-part instruction including system context, examples, constraints, and output format specifications. Prompt quality is often the biggest single lever for output quality: a well-crafted prompt with clear instructions, relevant context, and good examples consistently outperforms vague or ambiguous requests.

Key resources: Prompt Engineering Overview (Anthropic) | Glossary (Mistral)

See also: System Prompt, Prompt Engineering, Few-Shot Learning, Token


Prompt Engineering#

The practice of designing, testing, and iterating on prompts to get optimal model behavior for a specific task. Techniques include providing clear instructions, using structured formats, adding examples (few-shot), chain-of-thought reasoning, role assignment, and output constraints. Prompt engineering is often the fastest and cheapest way to improve AI application quality before resorting to fine-tuning or architectural changes.

Key resources: Prompt Engineering Overview (Anthropic) | Prompt Engineering Concepts (LangSmith)

See also: Prompt, Few-Shot Learning, Chain-of-Thought, System Prompt


Quantization#

Reducing the numerical precision of a model’s weights (e.g., from 16- or 32-bit floating point to 8-bit or 4-bit integers) to decrease memory usage and increase inference speed. Quantization makes it possible to run large models on smaller GPUs. Common methods and formats include GPTQ, AWQ, and GGUF. The trade-off is a usually small degradation in output quality that grows at lower bit widths and tends to show up first on reasoning-heavy tasks.
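
The core idea can be shown with symmetric 8-bit quantization of a single weight list: store small integers plus one floating-point scale, and multiply back at load time. This is a deliberately simple sketch with hypothetical function names; methods such as GPTQ and AWQ choose scales far more carefully.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    quantized, scale = quantize_int8(weights)
    restored = dequantize(quantized, scale)
    for original, approx in zip(weights, restored):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

Each restored weight differs from the original by at most half a quantization step, which is the quality loss the definition refers to.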

Key resources: Self-Deployment with TGI (Mistral)

External references: Quantization (Hugging Face)

See also: Model Serving, Deployment, Inference


Rate Limiting#

Controls that cap the number of API requests or tokens a client can consume within a time window. Rate limits protect provider infrastructure, ensure fair access, and prevent runaway costs. Limits are typically expressed as requests per minute (RPM) and tokens per minute (TPM). Applications must handle rate limit errors (HTTP 429) gracefully with retry logic, exponential backoff, or request queuing.
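
Retry with exponential backoff can be sketched as follows. `RateLimitError` is a hypothetical stand-in for whatever exception a client library raises on HTTP 429, and the delay values are illustrative defaults; many providers also return a retry-after hint that should take precedence when present.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response."""


def call_with_backoff(request: Callable[[], T],
                      max_retries: int = 5,
                      base_delay: float = 1.0,
                      max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call request(), doubling the wait after each rate-limit error."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out clients that were throttled at the same moment.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request() -> str:
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError()
        return "ok"

    print(call_with_backoff(flaky_request, base_delay=0.01))
```

The `sleep` parameter exists only so the waiting can be replaced in tests.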

Key resources: Glossary (Anthropic)

See also: API, Batching, Cost Optimization


Reasoning#

A model’s ability to perform logical, mathematical, or multi-step deduction to arrive at conclusions. Reasoning capabilities have improved dramatically with dedicated reasoning models (OpenAI o-series, DeepSeek-R1) that use extended “thinking” time before responding. These models trade latency for accuracy on complex problems like math proofs, code debugging, and scientific analysis.

Key resources: Reasoning (OpenAI) | Reasoning Best Practices (OpenAI) | Reasoning Overview (Together AI)

See also: Chain-of-Thought, Large Language Model (LLM), Inference


Red Teaming#

Deliberately attempting to make an AI system fail, produce harmful outputs, or behave outside its intended boundaries. Red teaming involves creative adversarial testing: crafting jailbreak prompts, probing edge cases, testing for bias, and verifying that safety guardrails hold under pressure. Essential practice before deploying AI systems to production. Findings are used to improve system prompts, guardrails, and training data.

Key resources: Safety Best Practices (OpenAI) | Agent Builder Safety (OpenAI)

See also: Safety, Jailbreaking, Guardrails, Evaluation (Evals)


Reranking#

A second-pass ranking step applied to search results to improve relevance. After initial retrieval (vector search, keyword search, or hybrid), a reranking model scores each result against the original query, typically using a cross-encoder architecture that considers the query-document pair jointly. Reranking often improves retrieval quality in RAG systems noticeably; gains in answer accuracy on the order of 5-15% over retrieval alone are commonly cited, though the actual improvement depends on the corpus and the first-stage retriever.

Key resources: Rerank Overview (Cohere) | Rerank Overview (Together AI)

See also: Retrieval-Augmented Generation (RAG), Semantic Search, Hybrid Search


Relevance#

An evaluation metric measuring whether retrieved documents or generated responses address the user’s actual question. In RAG evaluation, answer relevance checks if the response answers the query (regardless of correctness), while context relevance checks if retrieved documents are pertinent to the question. Distinguished from faithfulness, which measures factual accuracy rather than topical alignment.

Key resources: RAGAS Docs | Evaluation Best Practices (OpenAI)

See also: Faithfulness, Precision & Recall, Evaluation (Evals)


Retrieval-Augmented Generation (RAG)#

An architecture pattern that enhances LLM responses by first retrieving relevant documents from an external knowledge base, then including those documents in the model’s context as grounding material. RAG solves the knowledge cutoff problem (models don’t know recent facts), reduces hallucination (answers are grounded in sources), and enables domain specialization without fine-tuning. The standard RAG pipeline is: embed query → search vector database → retrieve top-k chunks → inject into prompt → generate response.
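
The retrieval and prompt-assembly steps of that pipeline can be sketched end to end with a toy stand-in for the embedding model. Everything here is an assumption for illustration: `embed` is a bag-of-words counter rather than a real embedding, the in-memory list replaces a vector database, and the final generation call to an LLM is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(query_vector, embed(c)),
                    reverse=True)
    return ranked[:k]


def build_rag_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Inject the retrieved chunks into a prompt as grounding context."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(query, chunks, k))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


if __name__ == "__main__":
    knowledge_base = [
        "the eiffel tower is in paris",
        "bananas are yellow",
        "paris is the capital of france",
    ]
    print(build_rag_prompt("where is the eiffel tower", knowledge_base))
```

The resulting prompt is what would be sent to the model in the generate step.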

Key resources: RAG (Cohere) | RAG (LangChain) | Contextual RAG (Together AI) | Pinecone Docs

See also: Embedding, Vector Database, Chunking, Reranking, Grounding


RLHF#

Reinforcement Learning from Human Feedback — a training technique where human evaluators rank model outputs by quality, and this preference data is used to train a reward model that guides further model optimization. RLHF is the primary method used to align pre-trained LLMs with human preferences: making them helpful, honest, and harmless. It bridges the gap between “predict the next token” (pre-training objective) and “produce responses humans prefer.”

Key resources: Glossary (Anthropic) | Reinforcement Fine-Tuning (OpenAI)

External references: RLHF Explained (Hugging Face)

See also: Alignment, Fine-Tuning, Pre-Training


Safety#

The broad discipline of ensuring AI systems operate within intended boundaries and don’t cause harm. Safety encompasses alignment (models follow instructions), content filtering (blocking harmful outputs), robustness (resisting adversarial attacks), and transparency (explaining model behavior). Every major provider publishes safety guidelines and offers safety-specific APIs, and responsible deployment requires layered safety measures throughout the application stack.

Key resources: Safety Best Practices (OpenAI) | Strengthen Guardrails (Anthropic) | Safety Modes (Cohere)

See also: Alignment, Guardrails, Content Moderation, Red Teaming


Safetensors#

A file format for storing model weights designed to be safe, fast, and simple. Unlike pickle-based formats (which can execute arbitrary code on load), safetensors files cannot contain executable code, preventing supply-chain attacks through malicious model files. The format also supports memory-mapped loading for faster model initialization. It is widely adopted as the standard weight format on the Hugging Face Hub.
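
To make "cannot contain executable code" concrete, the sketch below builds and reads the layout described in the safetensors documentation: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw bytes. The function names are hypothetical and the example handles a single 1-D float32 tensor; real files should be read and written with the official library.

```python
import json
import struct
from typing import Any, Dict, List


def encode_tensor(name: str, values: List[float]) -> bytes:
    """Serialize one 1-D float32 tensor: header length, JSON header, raw data."""
    data = struct.pack(f"<{len(values)}f", *values)
    header = {
        name: {"dtype": "F32", "shape": [len(values)], "data_offsets": [0, len(data)]}
    }
    header_bytes = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def decode_header(blob: bytes) -> Dict[str, Any]:
    """Read only the JSON header; nothing in the file is executed to do this."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    return json.loads(blob[8 : 8 + header_len].decode("utf-8"))
```

Loading involves parsing JSON and slicing bytes, which is why the format avoids the code-execution risk of unpickling.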

Key resources: Safetensors Docs

See also: Model Serving, Deployment


Semantic Search#

Finding documents or passages based on meaning rather than exact keyword matches. Semantic search works by embedding both the query and the document corpus into the same vector space, then finding documents whose embeddings are closest to the query embedding. This handles synonyms, paraphrases, and conceptual similarity that keyword search misses entirely.

Key resources: Embeddings (Cohere) | Pinecone Concepts

See also: Embedding, Cosine Similarity, Vector Database, Hybrid Search


Similarity Search#

Querying a collection of vectors to find the k items most similar to a query vector, typically measured by cosine similarity or Euclidean distance. This is the core operation of vector databases and the retrieval step in RAG. Approximate nearest neighbor (ANN) algorithms like HNSW and IVF make similarity search fast even over millions of vectors, trading a small amount of recall for orders-of-magnitude speed improvements.
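
For reference, the exact (brute-force) version of this operation is short. The sketch below scores every stored vector by cosine similarity and keeps the k best; the data and function names are illustrative. ANN indexes such as HNSW or IVF replace the exhaustive loop with an index lookup that may miss a few true neighbors.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[float, str]]:
    """Exact k-nearest-neighbor search: compare the query against every vector."""
    scored = ((cosine_similarity(query, vec), key) for key, vec in vectors.items())
    return heapq.nlargest(k, scored)
```

The cost grows linearly with the number of stored vectors, which is the motivation for approximate indexes at scale.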

Key resources: Pinecone Concepts | Chroma Docs

See also: Cosine Similarity, Vector Database, Semantic Search


Streaming#

Delivering model output incrementally as tokens are generated rather than waiting for the complete response. Streaming dramatically improves perceived latency: users typically see the first token within a fraction of a second rather than waiting seconds for the full response. It is usually implemented via Server-Sent Events (SSE) or WebSocket connections. Each provider’s streaming format differs in structure, and handling tool calls mid-stream adds complexity.
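
A rough sketch of the client side of an SSE stream follows. The framing (`data:` lines, events terminated by a blank line) comes from the Server-Sent Events format, but the payload shape is an assumption made for illustration: the `{"delta": ...}` JSON object and the `[DONE]` sentinel are hypothetical, and each provider defines its own event structure.

```python
import json
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each SSE event; a blank line ends an event."""
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield "\n".join(buffer)
            buffer = []


def stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Turn events into text fragments (assumed payload: {"delta": "..."})."""
    for payload in iter_sse_data(lines):
        if payload == "[DONE]":  # hypothetical end-of-stream sentinel
            break
        yield json.loads(payload)["delta"]
```

A UI would append each yielded fragment to the display as it arrives instead of waiting for the full response.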

Key resources: Streaming (Cohere) | Streaming Overview (LangChain)

See also: API, AG-UI Protocol, Token


Structured Output#

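Constraining a model’s output to conform to a predefined schema (typically JSON Schema) rather than free-form text. The goal is reliably parseable responses: no malformed JSON, no missing fields, no unexpected types. There are two broad approaches. Constrained decoding, as in OpenAI’s structured outputs, restricts token sampling so the output conforms to the schema by construction. Validation-based libraries (Instructor, Pydantic AI) parse the response against the schema and retry or raise on failure, so conformance is checked rather than guaranteed. Tool-use interfaces such as Anthropic’s can also carry a schema, though whether arguments are strictly enforced depends on the provider and mode. Critical for any application that needs to extract typed data from LLM responses.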
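
The post-processing validation approach can be sketched with the standard library alone. The schema here (`EXPECTED_FIELDS`) is a hypothetical field-to-type mapping, far simpler than JSON Schema, and libraries such as Instructor or Pydantic AI do considerably more (nested models, retries, coercion).

```python
import json
from typing import Any, Dict

# Hypothetical schema for illustration: required field name -> expected Python type.
EXPECTED_FIELDS: Dict[str, type] = {"name": str, "age": int}


def parse_structured(raw: str, expected: Dict[str, type] = EXPECTED_FIELDS) -> Dict[str, Any]:
    """Parse model output as JSON and check required fields and types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, field_type in expected.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        is_stray_bool = isinstance(value, bool) and field_type is not bool
        if is_stray_bool or not isinstance(value, field_type):
            raise ValueError(f"wrong type for field: {field}")
    return data
```

A caller would typically catch `ValueError`, send the error message back to the model, and retry a bounded number of times.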

Key resources: Structured Outputs (OpenAI) | Structured Outputs (Cohere) | Instructor Docs | Pydantic AI Docs

See also: JSON Schema, Function Calling, Tool Use


System Prompt#

A special instruction block provided to the model before the user’s message that sets behavioral context: persona, constraints, output format, tone, and domain knowledge. The system prompt persists across all turns of a conversation and shapes every response. Well-designed system prompts are the foundation of production AI applications — they define what the model should and shouldn’t do, how to handle edge cases, and what format to use.

Key resources: Prompt Engineering Overview (Anthropic)

See also: Prompt, Prompt Engineering, Context Window


Temperature#

A parameter controlling the randomness of a model’s output by rescaling token probabilities before sampling. Temperature 0 makes decoding greedy (always choosing the highest-probability token), which is repeatable in principle, although hosted APIs can still show small run-to-run differences. Moderate values (0.7-1.0) increase diversity and creativity, and values above 1.0 flatten the distribution further and produce increasingly random output. Use low temperature for factual tasks (classification, extraction, code) and higher temperature for creative tasks (brainstorming, writing).
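
The effect is easiest to see numerically. The sketch below applies the usual temperature-scaled softmax to a made-up list of logits; treating temperature 0 as "all probability on the top token" is a convention chosen here to avoid dividing by zero, not something a particular provider is confirmed to do.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Greedy limit: all probability mass on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For the same logits, a low temperature concentrates probability on the top token and a high temperature spreads it across alternatives.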

Key resources: Glossary (Mistral) | Glossary (Anthropic)

See also: Inference, Large Language Model (LLM), Prompt


Token#

The fundamental unit of text that LLMs process. Tokenizers split text into subword pieces: common words usually become single tokens, while rare words are split into several. With a typical tokenizer, “Embedding” may be a single token while “uncharacteristically” might be three; exact splits vary between tokenizers. Token counts determine context window usage, API costs, and processing time. A rough rule of thumb: 1 token ≈ 4 characters in English, or about ¾ of a word.
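
The rule of thumb above translates into a quick budget estimate. This is an approximation for English text only, with hypothetical helper names; exact counts require the tokenizer of the specific model.

```python
import math


def estimate_tokens_from_chars(text: str) -> int:
    """Rule of thumb: about 4 characters per token in English."""
    return math.ceil(len(text) / 4)


def estimate_tokens_from_words(text: str) -> int:
    """Rule of thumb: a token is about 3/4 of a word, so tokens ≈ words * 4 / 3."""
    return math.ceil(len(text.split()) * 4 / 3)
```

Estimates like these are useful for checking whether a document plausibly fits in a context window before paying for an exact count.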

Key resources: Token Counting (OpenAI) | Tokenization (Mistral) | Glossary (Mistral)

See also: Context Window, Cost Optimization, Large Language Model (LLM)


Tool Use#

The capability for LLMs to interact with external systems by requesting tool invocations. When enabled, the model can decide during generation to call a tool (search engine, calculator, database, API) by emitting a structured request. The application executes the call and returns the result, which the model then incorporates into its response. This extends LLM capabilities beyond text generation to real-world actions. Anthropic calls this “tool use”; OpenAI calls it “function calling.” Both use JSON-based schemas to define available tools.
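
The application side of this loop can be sketched as a small dispatcher. The shapes used here are assumptions for illustration: the `TOOLS` registry, the `{"name": ..., "arguments": ...}` request format, and the result dictionary are hypothetical and do not match any one provider's wire format.

```python
from typing import Any, Dict


def add(a: float, b: float) -> float:
    return a + b


# Hypothetical registry: a JSON-Schema-style description plus the Python handler.
TOOLS: Dict[str, Dict[str, Any]] = {
    "add": {
        "description": "Add two numbers.",
        "input_schema": {
            "type": "object",
            "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
            "required": ["a", "b"],
        },
        "handler": add,
    }
}


def execute_tool_call(call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the tool the model asked for and package the result to send back."""
    name = call["name"]
    if name not in TOOLS:
        return {"tool": name, "error": "unknown tool"}
    spec = TOOLS[name]
    arguments = call.get("arguments", {})
    missing = [field for field in spec["input_schema"]["required"] if field not in arguments]
    if missing:
        return {"tool": name, "error": f"missing arguments: {missing}"}
    return {"tool": name, "result": spec["handler"](**arguments)}
```

The model only ever produces the request; returning errors as data rather than raising lets the model see what went wrong and try again.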

Key resources: Tool Use Overview (Anthropic) | Tool Use Overview (Cohere) | Tools (OpenAI)

See also: Function Calling, Agent, Model Context Protocol (MCP), Structured Output


Toxicity#

Harmful, offensive, or inappropriate content generated by or input to AI models. Toxicity categories include hate speech, harassment, threats, explicit content, and discriminatory language. Toxicity detection is a classification task handled by specialized moderation models. Building safe AI applications requires both input-side toxicity filtering (blocking harmful prompts) and output-side monitoring (catching harmful generations).

Key resources: Content Moderation (Anthropic) | Moderation (OpenAI)

See also: Content Moderation, Safety, Guardrails


Transfer Learning#

The technique of using knowledge gained from training on one task or dataset to improve performance on a different but related task. In the LLM ecosystem, the entire paradigm of pre-training then fine-tuning is a form of transfer learning: general language knowledge transfers to specific tasks. Transfer learning is why fine-tuning on a few thousand examples works — the model already understands language; it just needs to learn your specific task.

Key resources: Glossary (Anthropic)

External references: Transfer Learning (Google ML Glossary)

See also: Fine-Tuning, Pre-Training, Few-Shot Learning


Vector Database#

A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings. Vector databases enable fast similarity search over millions of embeddings — the retrieval backbone of RAG systems. Key features include approximate nearest neighbor (ANN) indexing, metadata filtering, and real-time upserts. Leading options include Pinecone (managed), Chroma (open-source), Weaviate, Qdrant, and pgvector (PostgreSQL extension).
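
Two of the features named above, upserts and metadata filtering, can be illustrated with a toy in-memory store. `TinyVectorStore` is a hypothetical class written for this glossary, not the API of Pinecone, Chroma, or any other product, and it scans every row instead of using an ANN index.

```python
import math
from typing import Any, Dict, List, Optional, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyVectorStore:
    """Toy store: exact search over a dict, with upsert and equality filters."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[List[float], Dict[str, Any]]] = {}

    def upsert(self, item_id: str, vector: Sequence[float], metadata: Dict[str, Any]) -> None:
        # Insert a new row or overwrite the existing row with the same id.
        self._rows[item_id] = (list(vector), dict(metadata))

    def query(
        self, vector: Sequence[float], k: int = 3, where: Optional[Dict[str, Any]] = None
    ) -> List[str]:
        where = where or {}
        hits = []
        for item_id, (stored, metadata) in self._rows.items():
            # Metadata filter: every requested key must match exactly.
            if all(metadata.get(key) == value for key, value in where.items()):
                hits.append((_cosine(vector, stored), item_id))
        hits.sort(reverse=True)
        return [item_id for _, item_id in hits[:k]]
```

Filtering before scoring mirrors how metadata filters narrow the candidate set in RAG retrieval, for example restricting results to one language or tenant.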

Key resources: Pinecone Docs | Chroma Docs | Pinecone Concepts

See also: Embedding, Similarity Search, Retrieval-Augmented Generation (RAG), Cosine Similarity


Workflow#

A defined sequence or graph of steps that an AI application executes to complete a task. Workflows can be deterministic (fixed step order) or dynamic (agent-driven, with branching based on model output). Workflow engines manage state, handle retries, and coordinate between LLM calls, tool invocations, and human approvals. LangGraph, CrewAI processes, and OpenAI’s agent loops are all workflow implementations with different levels of agent autonomy.

Key resources: CrewAI Docs | LangChain Docs | Agents (OpenAI)

See also: Orchestration, Pipeline, Agent, Multi-Agent System, Workflow Agent


Workflow Agent#

A deterministic agent that follows a predefined execution pattern — sequential, parallel, or loop — rather than using an LLM for control flow decisions. Google ADK introduced this taxonomy to distinguish workflow agents (predictable, no LLM overhead for routing) from LLM agents (flexible, model-driven decisions). The key insight: use workflow agents for orchestration logic that doesn’t require reasoning, and LLM agents for decisions that do. This mirrors the broader principle that not every step in a multi-agent system needs to involve a language model.
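
The three execution patterns can be sketched as plain function combinators with no model call anywhere in the control flow. This is an illustration of the idea only: `sequential`, `parallel`, and `loop` are hypothetical helpers, not the Google ADK API, and steps are modeled as functions from a state dictionary to a state dictionary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

State = Dict[str, int]
Step = Callable[[State], State]


def sequential(steps: List[Step]) -> Step:
    """Run steps in a fixed order, passing the state along."""
    def run(state: State) -> State:
        for step in steps:
            state = step(state)
        return state
    return run


def parallel(steps: List[Step]) -> Step:
    """Run independent steps concurrently on copies of the state, then merge results."""
    def run(state: State) -> State:
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda step: step(dict(state)), steps))
        merged = dict(state)
        for result in results:  # later steps win if keys collide
            merged.update(result)
        return merged
    return run


def loop(step: Step, should_stop: Callable[[State], bool], max_iterations: int = 10) -> Step:
    """Repeat a step until the stop condition holds or the iteration cap is reached."""
    def run(state: State) -> State:
        for _ in range(max_iterations):
            if should_stop(state):
                break
            state = step(state)
        return state
    return run
```

Any individual step could itself wrap an LLM agent; the routing between steps stays predictable because it is ordinary code.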

Key resources: Workflow Agents (Google ADK) | Sequential Agents (Google ADK) | Parallel Agents (Google ADK)

See also: Orchestration, Workflow, Agent