Long-Form Text Strategies with Cohere


Summary: This discusses ways of getting Cohere's LLM platform to perform well in generating long-form text.

Original Documentation


Large Language Models (LLMs) are becoming increasingly capable of comprehending text and excel at tasks such as document analysis. Cohere's Command A model has a context length of 256k tokens, which makes it particularly effective for such tasks. Nevertheless, even with this extended context window, some documents are too long to fit in full.

In this cookbook, we’ll explore techniques to address cases when relevant information doesn’t fit in the model context window.

We’ll show you three potential mitigation strategies: truncating the document, query-based retrieval, and a “text rank” approach we use internally at Cohere.

Summary#

Approach	Description	Pros	Cons	When to use?
Truncation	Truncate the document to fit the context window.	- Simplicity of implementation (does not rely on external infrastructure)	- Loses information at the end of the document	Utilize when all relevant information is contained at the beginning of the document.
Query Based Retrieval	Utilize semantic similarity to retrieve text chunks that are most relevant to the query.	- Focuses on sections directly relevant to the query	- Relies on a semantic similarity algorithm. - Might lose broader context	Employ when seeking specific information within the text.
Text Rank	Apply graph theory to generate a cohesive set of chunks that effectively represent the document.	- Preserves the broader picture.	- Might lose detailed information.	Utilize in summaries and when the question requires broader context.

Getting Started [#getting-started]#

%%capture
!pip install cohere
!pip install python-dotenv
!pip install tokenizers
!pip install langchain
!pip install nltk
!pip install networkx
!pip install pypdf2

import os
import requests
from collections import deque
from typing import List, Tuple

import cohere

import numpy as np

import PyPDF2
from dotenv import load_dotenv

from tokenizers import Tokenizer

import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization
from nltk.tokenize import sent_tokenize

import networkx as nx
from getpass import getpass

[nltk_data] Downloading package punkt to
[nltk_data]     /home/anna_cohere_com/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

co_model = 'command-a-03-2025'
co_api_key = getpass("Enter your Cohere API key: ")
co = cohere.Client(api_key=co_api_key)

def load_long_pdf(file_path):
    """
    Load a long PDF file and extract its text content.

    Args:
        file_path (str): The path to the PDF file.

    Returns:
        str: The extracted text content of the PDF file.
    """
    with open(file_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)
        full_text = ''
        for page_num in range(num_pages):
            page = pdf_reader.pages[page_num]
            full_text += page.extract_text()
    return full_text

def save_pdf_from_url(pdf_url, save_path):
    try:
        # Send a GET request to the PDF URL
        response = requests.get(pdf_url, stream=True)
        response.raise_for_status()  # Raise an exception for HTTP errors

        # Open the local file for writing in binary mode
        with open(save_path, 'wb') as file:
            # Write the content of the response to the local file
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

        print(f"PDF saved successfully to '{save_path}'")
    except requests.exceptions.RequestException as e:
        print(f"Error downloading PDF: {e}")

In this example we use the Proposal for a Regulation of the European Parliament and of the Council defining rules on Artificial Intelligence from 26 January 2024. The URL of the PDF appears in the code below.

# Download the PDF file from the URL
pdf_url = 'https://data.consilium.europa.eu/doc/document/ST-5662-2024-INIT/en/pdf'
save_path = 'example.pdf'
save_pdf_from_url(pdf_url, save_path)

# Load the PDF file and extract its text content
long_text = load_long_pdf(save_path)
long_text = long_text.replace('\n', ' ')

# Print the length of the document
print("Document length - #tokens:", len(co.tokenize(text=long_text, model=co_model).tokens))

PDF saved successfully to 'example.pdf'
Document length - #tokens: 134184
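
As a rough check, the token count printed above can be compared against the space actually left for the document: the window size minus whatever is reserved for the prompt template and the generated answer. The sketch below assumes that input and output share one window; `document_budget` and `fits` are hypothetical helpers (not part of the Cohere SDK), and the window size, output length, and template overhead are illustrative numbers.

```python
def document_budget(context_window: int, max_output_tokens: int, template_tokens: int = 0) -> int:
    """Tokens left for the document after reserving room for the output and the prompt template."""
    budget = context_window - max_output_tokens - template_tokens
    return max(budget, 0)


def fits(document_tokens: int, context_window: int, max_output_tokens: int, template_tokens: int = 0) -> bool:
    """True if the document fits in what remains of the window."""
    return document_tokens <= document_budget(context_window, max_output_tokens, template_tokens)


# Assumed values for illustration: a 40,000-token working window,
# 300 output tokens, and a prompt template of about 50 tokens.
doc_tokens = 134184  # token count reported above
print(document_budget(40000, 300, 50))   # 39650
print(fits(doc_tokens, 40000, 300, 50))  # False
```

Reserving the output and template tokens up front avoids filling the window completely with document text and then failing at request time.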

Summarizing the text#

def generate_response(message, max_tokens=300, temperature=0.2, k=0):
  """
  A wrapper around the Cohere API to generate a response based on a given prompt.

  Args:
    message (str): The input message for generating the response.
    max_tokens (int, optional): The maximum number of tokens in the generated response. Defaults to 300.
    temperature (float, optional): Controls the randomness of the generated response. Higher values (e.g., 1.0) make the output more random, while lower values (e.g., 0.2) make it more deterministic. Defaults to 0.2.
    k (int, optional): Top-k sampling. Only the k most likely tokens are considered at each generation step; 0 disables top-k filtering. Defaults to 0.

  Returns:
    str: The generated response.

  """
  response = co.chat(
    model=co_model,
    message=message,
    max_tokens=max_tokens,
    temperature=temperature,
    k=k,  # previously documented but never forwarded to the API
    )
  return response.text

# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()

Running the cell below raises an error whenever the prompt exceeds the model's context window. Therefore, in the following sections, we will explore some techniques to address this limitation.

Error: CohereAPIError: too many tokens:

prompt = prompt_template.format(document=long_text)
# print(generate_response(message=prompt))


Approach 1 - Truncate [#approach-1]#

First we try to truncate the document so that it meets the length constraints. This approach is simple to implement and understand. However, it drops potentially important information contained towards the end of the document.

# Command A has a context limit of 256k tokens. However, for the purpose of this exercise, we will assume a smaller context window.
# A smaller context window also reduces the cost per request when billing is based on the number of tokens.

MAX_TOKENS = 40000

def truncate(long: str, max_tokens: int) -> str:
    """
    Shortens `long` by brutally truncating it to the first `max_tokens` tokens.
    This can break up sentences, passages, etc.
    """

    tokenized = co.tokenize(text=long, model=co_model).token_strings
    truncated = tokenized[:max_tokens]
    short = "".join(truncated)
    return short

short_text = truncate(long_text, MAX_TOKENS)

prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt))

The document discusses the impact of a specific protein, p53, on the process of angiogenesis, which is the growth of new blood vessels. Angiogenesis plays a critical role in various physiological processes, including wound healing and embryonic development. The presence of the p53 protein can inhibit angiogenesis by regulating the expression of certain genes and proteins. This inhibition can have significant implications for tumor growth, as angiogenesis is essential for tumor progression. Therefore, understanding the role of p53 in angiogenesis can contribute to our knowledge of tumor suppression and potential therapeutic interventions.

Additionally, the document mentions that the regulation of angiogenesis by p53 occurs independently of the protein’s role in cell cycle arrest and apoptosis, which are other key functions of p53 in tumor suppression. This suggests that p53 has a complex and multifaceted impact on cellular processes.
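
Because `truncate` cuts at an exact token position, the last sentence is often left incomplete. One possible refinement, sketched below, keeps only whole sentences until the budget is used up. This is not part of the original notebook: the regex sentence splitter and the whitespace-based `count_tokens_whitespace` are simplifying assumptions, and in practice the counter could be swapped for one backed by `co.tokenize`.

```python
import re
from typing import Callable, List


def count_tokens_whitespace(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def truncate_on_sentences(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> str:
    """Keep whole sentences from the start of `text` until the next one would exceed `max_tokens`."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences: List[str] = re.split(r"(?<=[.!?])\s+", text.strip())
    kept: List[str] = []
    used = 0
    for sentence in sentences:
        cost = count_tokens(sentence)
        if used + cost > max_tokens:
            break
        kept.append(sentence)
        used += cost
    return " ".join(kept)


sample = "First sentence here. Second sentence is a bit longer. Third one."
print(truncate_on_sentences(sample, max_tokens=9))
# First sentence here. Second sentence is a bit longer.
```

The trade-off is that the result can be somewhat shorter than the budget allows, since a sentence that does not fit is dropped entirely.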

Approach 2: Query Based Retrieval [#appoach-2]#

In this section we show how a query-based retrieval approach can be used to generate an answer to the following question: What does the report say about biometric identification?

The solution is outlined below and can be broken down into four functional steps.

1. Chunk the text into units
   - Here we employ a simple chunking algorithm; more sophisticated chunking strategies exist.
2. Use a ranking algorithm to rank chunks against the query
   - We leverage another Cohere endpoint, co.rerank, to rank each chunk against the query.
3. Keep the most relevant chunks until the context limit is reached
   - co.rerank returns a relevance score for each chunk, which we use to select the most pertinent chunks.
4. Put the condensed text back in its original order
   - Finally, we arrange the chosen chunks in the order in which they appear in the document.

See query_based_retrieval function for the starting point.
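
To make the four steps concrete without calling any API, here is a minimal offline sketch. It is not the notebook's implementation: the hypothetical `overlap_score` (fraction of query words found in a chunk) stands in for co.rerank, sentences are split with a naive regex, and tokens are approximated by word counts.

```python
import re
from typing import List, Tuple


def chunk_sentences(text: str, n_per_chunk: int = 2) -> List[str]:
    """Step 1: split into sentences (naive regex) and group them into chunks."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + n_per_chunk]) for i in range(0, len(sentences), n_per_chunk)]


def overlap_score(query: str, chunk: str) -> float:
    """Step 2 stand-in (assumption): fraction of query words that occur in the chunk."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    return len(q & c) / len(q) if q else 0.0


def retrieve(text: str, query: str, max_tokens: int, n_per_chunk: int = 2) -> str:
    chunks = chunk_sentences(text, n_per_chunk)
    # Step 2: rank chunk indices by score, best first.
    ranked = sorted(range(len(chunks)), key=lambda i: overlap_score(query, chunks[i]), reverse=True)
    # Step 3: keep the best chunks while they fit (tokens approximated by word count).
    selected: List[Tuple[int, str]] = []
    used = 0
    for i in ranked:
        cost = len(chunks[i].split())
        if used + cost > max_tokens:
            break
        selected.append((i, chunks[i]))
        used += cost
    # Step 4: restore document order by sorting on the chunk index.
    return " ".join(chunk for _, chunk in sorted(selected))


text = (
    "Cats sleep a lot. Dogs like walks. "
    "Biometric systems identify people. Biometric data is sensitive. "
    "Trains run on rails. Planes fly high."
)
print(retrieve(text, "cats biometric data sensitive", max_tokens=15))
# Cats sleep a lot. Dogs like walks. Biometric systems identify people. Biometric data is sensitive.
```

In this toy run the biometric chunk ranks first and the cats chunk second, yet the output lists them in document order, which is the point of step 4.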

Query based retrieval implementation#

def split_text_into_sentences(text) -> List[str]:
    """
    Split the input text into a list of sentences.
    """
    sentences = sent_tokenize(text)

    return sentences

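def group_sentences_into_passages(sentence_list, n_sentences_per_passage=5):
    """
    Group sentences into passages of n_sentences_per_passage sentences.
    The last passage may be shorter if the sentences do not divide evenly.
    """
    passages = []
    passage = ""
    for i, sentence in enumerate(sentence_list):
        passage += sentence + " "
        if (i + 1) % n_sentences_per_passage == 0:
            passages.append(passage)
            passage = ""
    # Keep any trailing sentences that did not fill a whole passage
    if passage:
        passages.append(passage)
    return passages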

def build_simple_chunks(text, n_sentences=5):
    """
    Build chunks of text from the input text.
    """
    sentences = split_text_into_sentences(text)
    chunks = group_sentences_into_passages(sentences, n_sentences_per_passage=n_sentences)
    return chunks

sentences = split_text_into_sentences(long_text)
passages = group_sentences_into_passages(sentences, n_sentences_per_passage=5)
print('Example sentence:', np.random.choice(np.asarray(sentences), size=1, replace=False))
print()
print('Example passage:', np.random.choice(np.asarray(passages), size=1, replace=False))

Example sentence: ['The European Data Protection Supervisor may also establish an AI regulatory sandbox for  the EU institutions, bodies and agencies and exercise the roles and the tasks of national  competent authorities in accordance with this chapter.']

Example passage: ['This flexibility could mean, for example a decision  by the provider to integrate a part of the necessary testing and reporting processes,  information and documentation required under this Regulation into already existing  documentation and procedu res required under the existing Union harmonisation legislation  listed in Annex II, Section A. This however should not in any way undermine the  obligation of the provider to comply with all the applicable requirements. (42a)   The risk management system shou ld consist of a continuous, iterative process that is  planned and run throughout the entire lifecycle of a high - risk AI system. This process  should be aimed at identifying and mitigating the relevant risks of artificial intelligence  systems on health, safe ty and fundamental rights. The risk management system should be  regularly reviewed and updated to ensure its continuing effectiveness, as well as  justification and documentation of any significant decisions and actions taken subject to  this Regulation. ']

def _add_chunks_by_priority(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
) -> List[Tuple[int, str]]:
    """
    Given chunks of text and their indices sorted by priority (highest priority first), this function
    fills the model context window with as many highest-priority chunks as possible.

    The output is a list of (index, chunk) pairs, ordered by priority. To stitch back the chunks into
    a cohesive text that preserves chronological order, sort the output on its index.
    """

    selected = []
    num_tokens = 0
    idcs_queue = deque(idcs_sorted_by_priority)

    while num_tokens < max_tokens and len(idcs_queue) > 0:
        next_idx = idcs_queue.popleft()
        num_tokens += len(co.tokenize(text=chunks[next_idx], model=co_model).tokens)
        # keep index and chunk, to reorder chronologically
        selected.append((next_idx, chunks[next_idx]))
    if num_tokens > max_tokens:
        selected.pop()

    return selected
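
Note that the loop above stops at the first chunk that overflows the budget, even if a smaller, lower-priority chunk would still fit. One possible variation, which is not what this notebook does, keeps scanning and adds any later chunk that fits. The sketch below is self-contained; the default `count_tokens` (whitespace word count) is a stand-in assumption for a real tokenizer.

```python
from typing import Callable, List, Tuple


def add_chunks_skipping_oversized(
    chunks: List[str],
    idcs_sorted_by_priority: List[int],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[Tuple[int, str]]:
    """Greedy fill that skips a chunk if it does not fit and keeps trying later ones."""
    selected: List[Tuple[int, str]] = []
    used = 0
    for idx in idcs_sorted_by_priority:
        cost = count_tokens(chunks[idx])
        if used + cost <= max_tokens:
            selected.append((idx, chunks[idx]))
            used += cost
    return selected


chunks = ["a b c", "d e f g h i", "j k"]
print(add_chunks_skipping_oversized(chunks, [0, 1, 2], max_tokens=5))
# [(0, 'a b c'), (2, 'j k')]
```

Skipping uses the budget more fully, at the cost of sometimes admitting low-priority chunks ahead of a more important one that was too large.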

def query_based_retrieval(
    long: str,
    max_tokens: int,
    query: str,
    n_sentences_per_passage: int = 5,
) -> str:
    """
    Performs query-based retrieval on a long text document.
    """
    # 1. Chunk text into units
    chunks = build_simple_chunks(long, n_sentences_per_passage)

    # 2. Use co.rerank to rank chunks vs. query
    chunks_reranked = co.rerank(query=query, documents=chunks, model="rerank-english-v3.0")
    idcs_sorted_by_relevance = [
        chunk.index for chunk in sorted(chunks_reranked.results, key=lambda c: c.relevance_score, reverse=True)
    ]

    # 3. Add chunks back in order of relevance
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_relevance, max_tokens)

    # 4. Put condensed text back in original order
    separator = " "
    short = separator.join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])
    return short

# Example prompt
prompt_template = """
## Instruction
{query}

## Document
{document}

## Answer
""".strip()

query = "What does the report say about biometric identification? Answer only based on the document."
short_text = query_based_retrieval(long_text, MAX_TOKENS, query)
prompt = prompt_template.format(query=query, document=short_text)
print(generate_response(message=prompt, max_tokens=300))

The report outlines several key points regarding biometric identification within the context of the proposed Artificial Intelligence Act:

1. **Prohibition of Real-Time Biometric Identification in Public Spaces**: The report proposes a ban on real-time biometric identification by law enforcement authorities in publicly accessible spaces, with specific exceptions. These exceptions are detailed in Article 5(1)(d) and are subject to safeguards, including monitoring, oversight, and limited reporting obligations at the EU level.

2. **Exceptions to the Prohibition**: The exceptions to the ban on real-time biometric identification include:
   - Search for victims of specific crimes (e.g., abduction, trafficking, sexual exploitation).
   - Prevention of imminent threats to life or physical safety, including terrorist attacks.
   - Localization or identification of suspects for serious criminal offenses (as defined in Annex IIa) punishable by a custodial sentence of at least four years.

3. **Safeguards and Conditions**: The use of real-time biometric identification systems in these exceptional cases must comply with specific safeguards and conditions, including:
   - A fundamental rights impact assessment.
   - Registration of the system in a database.
   - Prior authorization by a judicial or independent administrative authority, except in urgent situations where authorization can be sought within 24 hours.
   - Limitation to what is strictly necessary in terms of time, geography, and personal scope.

4. **Post-Remote Biometric Identification**: The use of post-remote biometric identification

Approach 3: Text rank [#approach-3]#

In the final section we show how to use graph theory to select chunks based on their centrality. Centrality is a graph-theoretic measure of how connected a node is: the higher the centrality, the more strongly the node is connected to the other nodes. Here, a chunk with high centrality is similar to many other chunks, which makes it a good representative of the document as a whole.
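
As a small worked example of this idea, the sketch below computes a weighted degree centrality by hand: each node's score is the sum of the similarity weights on its edges. The matrix values are made up, and leaving out the diagonal (a chunk's similarity to itself) is a simplification chosen for this illustration.

```python
from typing import List


def weighted_degree_centrality(similarity: List[List[float]]) -> List[float]:
    """Sum of edge weights per node, ignoring the diagonal (self-similarity)."""
    n = len(similarity)
    return [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]


# Toy symmetric similarity matrix for three chunks (made-up values).
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
scores = weighted_degree_centrality(sim)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [1, 0, 2]
```

Chunk 1 ranks first because it is fairly similar to both other chunks, while chunk 2 is only weakly related to the rest.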

The solution presented in this document can be broken down into five functional steps:

1. Break the document into chunks.
   - This mirrors the first step in Approach 2.
2. Embed each chunk using an embedding model and construct a similarity matrix.
   - We use co.embed to compute the embeddings.
3. Compute the centrality of each chunk.
   - We use the NetworkX package. It constructs a graph where the chunks are nodes and the similarity score between two chunks is the weight of the edge between them. We then calculate the centrality of each chunk as the sum of the weights of the edges adjacent to its node.
4. Retain the highest-centrality chunks until the context limit is reached.
   - This step follows the same approach as in Approach 2.
5. Reassemble the shortened text by putting the chunks back in their original order.
   - This step mirrors the last step in Approach 2.

See text_rank as the starting point.
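
Before the actual implementation, here is an offline sketch of steps 2 to 5 on pre-chunked text. It makes several assumptions that differ from the notebook: bag-of-words counts stand in for co.embed embeddings, similarity is cosine similarity with self-similarity excluded, and tokens are approximated by word counts. `text_rank_sketch` is a hypothetical name used only here.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(chunk: str) -> Counter:
    """Stand-in embedding (assumption): word counts instead of a learned embedding."""
    return Counter(re.findall(r"\w+", chunk.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def text_rank_sketch(chunks: List[str], max_tokens: int) -> str:
    vectors = [bag_of_words(c) for c in chunks]  # step 2: embed
    n = len(chunks)
    # Step 3: weighted degree centrality, skipping self-similarity.
    centrality = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i) for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    # Step 4: fill the budget with the most central chunks.
    selected: List[int] = []
    used = 0
    for i in ranked:
        cost = len(chunks[i].split())
        if used + cost > max_tokens:
            break
        selected.append(i)
        used += cost
    # Step 5: put the kept chunks back in document order.
    return " ".join(chunks[i] for i in sorted(selected))


chunks = [
    "The weather was sunny today.",
    "AI systems must manage risk.",
    "High risk AI systems need oversight.",
    "Providers of AI systems document risk.",
]
print(text_rank_sketch(chunks, max_tokens=17))
# AI systems must manage risk. High risk AI systems need oversight. Providers of AI systems document risk.
```

The off-topic weather chunk shares no words with the others, so its centrality is zero and it is the first to be left out when the budget is tight.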

Text rank implementation#

def text_rank(text: str, max_tokens: int, n_sentences_per_passage: int) -> str:
    """
    Shortens text by extracting key units of text from it based on their centrality.
    The output is the concatenation of those key units, in their original order.
    """

    # 1. Chunk text into units
    chunks = build_simple_chunks(text, n_sentences_per_passage)

    # 2. Embed and construct similarity matrix
    embeddings = np.array(
        co.embed(
            texts=chunks,
            model="embed-v4.0",
            input_type="clustering",
        ).embeddings
    )
    similarities = np.dot(embeddings, embeddings.T)

    # 3. Compute centrality and sort chunks by centrality
    # Easiest to use networkx's `degree` function with similarity as weight
    g = nx.from_numpy_array(similarities, edge_attr="weight")
    centralities = g.degree(weight="weight")
    idcs_sorted_by_centrality = [node for node, degree in sorted(centralities, key=lambda item: item[1], reverse=True)]

    # 4. Add chunks back in order of centrality
    selected = _add_chunks_by_priority(chunks, idcs_sorted_by_centrality, max_tokens)

    # 5. Put condensed text back in original order
    short = " ".join([chunk for index, chunk in sorted(selected, key=lambda item: item[0], reverse=False)])

    return short

# Example summary prompt.
prompt_template = """
## Instruction
Summarize the following Document in 3-5 sentences. Only answer based on the information provided in the document.

## Document
{document}

## Summary
""".strip()

short_text = text_rank(long_text, MAX_TOKENS, 5)
prompt = prompt_template.format(document=short_text)
print(generate_response(message=prompt, max_tokens=600))

The document outlines the European Union's regulatory framework for artificial intelligence (AI) systems, focusing on high-risk AI applications. It establishes rules for placing AI systems on the market, including prohibitions on certain practices, requirements for high-risk systems, and transparency obligations. The regulation defines high-risk AI systems based on their intended use and potential risks to health, safety, and fundamental rights. Providers of high-risk AI systems must comply with specific requirements, such as risk management, data governance, and human oversight. The regulation also mandates conformity assessments, registration in an EU database, and post-market monitoring. It emphasizes the importance of AI literacy, prohibits manipulative or exploitative AI practices, and ensures compliance through market surveillance and enforcement mechanisms. Additionally, the regulation addresses general-purpose AI models, requiring providers to meet specific obligations, especially for models with systemic risks. The framework aims to promote trustworthy AI while safeguarding public interests and supporting innovation.

Summary#

In this notebook we presented three useful methods to overcome the limitations of context window size. In the following blog post, we talk more about how these methods can be evaluated.

Link last verified June 7, 2026.

Source: Cohere Docs

Link last verified: 2026-02-26