Evaluating Text Summarization Models ↗
noOriginal Documentation
title: Evaluating Text Summarization Models slug: /page/summarization-evals description: This page discusses how to evaluate a model’s text summarization. image: type: fileId value: ‘https://files.buildwithfern.com/cohere.docs.buildwithfern.com/8ba30b46486ea7bfab24f3e8856d7411d1b745b26e9026abff3ee62af52ce268/assets/images/f1cc130-cohere_meta_image.jpg' keywords: ‘Cohere, text summarization’#
In this cookbook, we will be demonstrating an approach we use for evaluating summarization tasks using LLM evaluation.
You’ll need a Cohere API key to run this notebook. If you don’t have a key, head to https://cohere.com/ to generate your key.
!pip install cohere datasets --quietimport json
import random
import re
from typing import List, Optional
import cohere
from getpass import getpass
from datasets import load_dataset
import pandas as pd
co_api_key = getpass("Enter your Cohere API key: ")
co_model = "command-r"
co = cohere.Client(api_key=co_api_key)As test data, we’ll use transcripts from the QMSum dataset. Note that in addition to the transcripts, this dataset also contains reference summaries – we will use only the transcripts as our approach is reference-free.
qmsum = load_dataset("MocktaiLEngineer/qmsum-processed", split="validation")
transcripts = [x for x in qmsum["meeting_transcript"] if x is not None]Generating train split: 0%| | 0/1095 [00:00<?, ? examples/s]
Generating validation split: 0%| | 0/237 [00:00<?, ? examples/s]
Generating test split: 0%| | 0/244 [00:00<?, ? examples/s]Construct the evaluation dataset [#dataset]#
We are interested in evaluating summarization in real-world, enterprise use cases, which typically have two distinguishing features as compared to academic summarization benchmarks:
- Enterprise use cases often focus on specific summarization objectives, e.g. “summarize action items”.
- Enterprise use cases often feature specific instruction constraints, e.g. “summarize in bullets with each bullet under 20 words”.
Therefore, we must first create a dataset that contains diverse summarization prompts. We will do this programmatically by building prompts from their components, as defined below:
- Prompt = text (e.g. transcript to be summarized) + instruction
- Instruction = instruction objective (e.g. “summarize action items”) + modifiers
- Modifiers = format/length modifiers (e.g. “use bullets”) + style/tone modifiers (e.g. “do not mention names”) + …
First, we define the prompt that combines the text and instructions. Here, we use a very basic prompt:
prompt_template = """## meeting transcript
{transcript}
## instructions
{instructions}"""Next, we build the instructions. Because each instruction may have a different objective and modifiers, we track them using metadata. This will later be required for evaluation (i.e. to know what the prompt is asking).
instruction_objectives = {
"general_summarization": "Summarize the meeting based on the transcript.",
"action_items": "What are the follow-up items based on the meeting transcript?",
}
format_length_modifiers = {
"paragraphs_short": {
"text": "In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
"objectives": ["general_summarization"],
"eval_metadata": {
"format": "paragraphs",
"min_length": 10,
"max_length": 50,
},
},
"paragraphs_medium": {
"text": "Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
"objectives": ["general_summarization"],
"eval_metadata": {
"format": "paragraphs",
"min_length": 50,
"max_length": 200,
},
},
"bullets_short_3": {
"text": "Format your answer in the form of bullets. Use exactly 3 bullets. Each bullet should be at least 10 words and at most 20 words.",
"objectives": ["general_summarization", "action_items"],
"eval_metadata": {
"format": "bullets",
"number": 3,
"min_length": 10,
"max_length": 20,
},
},
"bullets_medium_2": {
"text": "In bullets, output your response. Make sure to use exactly 2 bullets. Make sure each bullet is between 20 and 80 words long.",
"objectives": ["general_summarization", "action_items"],
"eval_metadata": {
"format": "bullets",
"number": 2,
"min_length": 20,
"max_length": 80,
},
},
}Let’s combine the objectives and format/length modifiers to finish building the instructions.
instructions = []
for obj_name, obj_text in instruction_objectives.items():
for mod_data in format_length_modifiers.values():
for mod_obj in mod_data["objectives"]:
if mod_obj == obj_name:
instruction = {
"instruction": f"{obj_text} {mod_data['text']}",
"eval_metadata": mod_data["eval_metadata"],
"objective": obj_name,
}
instructions.append(instruction)
print(json.dumps(instructions[:2], indent=4))[
{
"instruction": "Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.",
"eval_metadata": {
"format": "paragraphs",
"min_length": 10,
"max_length": 50
},
"objective": "general_summarization"
},
{
"instruction": "Summarize the meeting based on the transcript. Return the answer in the form of paragraphs. Make sure your answer is between 50 and 200 words long.",
"eval_metadata": {
"format": "paragraphs",
"min_length": 50,
"max_length": 200
},
"objective": "general_summarization"
}
]Finally, let’s build the final prompts by semi-randomly pairing the instructions with transcripts from the QMSum dataset.
data = pd.DataFrame(instructions)
transcripts = sorted(transcripts, key=lambda x: len(x), reverse=True)[:int(len(transcripts) * 0.25)]
random.seed(42)
random.shuffle(transcripts)
data["transcript"] = transcripts[:len(data)]
data["prompt"] = data.apply(lambda x: prompt_template.format(transcript=x["transcript"], instructions=x["instruction"]), axis=1)data["transcript_token_len"] = [len(x) for x in co.batch_tokenize(data["transcript"].tolist(), model=co_model)]print(data["prompt"][0])## meeting transcript
PhD F: As opposed to the rest of us
PhD D: Well comment OK I I remind that me my first objective eh in the project is to to study difference parameters to to find a a good solution to detect eh the overlapping zone in eh speech recorded But eh tsk comment ehhh comment In that way comment I I I begin to to study and to analyze the ehn the recorded speech eh the different session to to find and to locate and to mark eh the the different overlapping zone And eh so eh I was eh I am transcribing the the first session and I I have found eh eh one thousand acoustic events eh besides the overlapping zones eh I I I mean the eh breaths eh aspiration eh eh talk eh eh clap eh comment I do not know what is the different names eh you use to to name the the pause n speech
Grad G: Oh I do not think we ve been doing it at that level of detail So
PhD D: Eh I I I do I do not need to to to mmm to m to label the the different acoustic but I prefer because eh I would like to to study if eh I I will find eh eh a good eh parameters eh to detect overlapping I would like to to to test these parameters eh with the another eh eh acoustic events to nnn to eh to find what is the ehm the false eh the false eh hypothesis eh nnn which eh are produced when we use the the ehm this eh parameter eh I mean pitch eh eh difference eh feature
PhD A: You know I think some of these that are the nonspeech overlapping events may be difficult even for humans to tell that there s two there I mean if it s a tapping sound you would not necessarily or you know something like that it would be it might be hard to know that it was two separate events
Grad G: Well You were not talking about just overlaps were you ? You were just talking about acoustic events
PhD D: I I I I t I t I talk eh about eh acoustic events in general but eh my my objective eh will be eh to study eh overlapping zone Eh ? comment n Eh in twelve minutes I found eh eh one thousand acoustic events
Professor E: How many overlaps were there in it ? No no how many of them were the overlaps of speech though ?
PhD D: How many ? Eh almost eh three hundred eh in one session in five eh in forty five minutes Alm Three hundred overlapping zone With the overlapping zone overlapping speech speech what eh different duration
Postdoc B: Does this ? So if you had an overlap involving three people how many times was that counted ?
PhD D: three people two people Eh I would like to consider eh one people with difference noise eh in the background be
Professor E: No no but I think what she s asking is pause if at some particular for some particular stretch you had three people talking instead of two did you call that one event ?
PhD D: Oh Oh I consider one event eh for th for that eh for all the zone This th I I I con I consider I consider eh an acoustic event the overlapping zone the period where three speaker or eh are talking together
Grad G: So let s say me and Jane are talking at the same time and then Liz starts talking also over all of us How many events would that be ?
PhD D: So I do not understand
Grad G: So two people are talking comment and then a third person starts talking Is there an event right here ?
PhD D: Eh no No no For me is the overlapping zone because because you you have s you have more one eh more one voice eh eh produced in a in in a moment
Grad G: So i if two or more people are talking
Professor E: OK So I think We just wanted to understand how you are defining it So then in the region between since there there is some continuous region in between regions where there is only one person speaking And one contiguous region like that you are calling an event Is it Are you calling the beginning or the end of it the event or are you calling the entire length of it the event ?
PhD D: I consider the the nnn the nnn nnn eh the entirety eh eh all all the time there were the voice has overlapped This is the idea But eh I I do not distinguish between the the numbers of eh speaker I m not considering eh the the ehm eh the fact of eh eh for example what did you say ? Eh at first eh eh two talkers are eh speaking and eh eh a third person eh join to to that For me it s eh it s eh all overlap zone with eh several numbers of speakers is eh eh the same acoustic event Wi but without any mark between the zone of the overlapping zone with two speakers eh speaking together and the zone with the three speakers
Postdoc B: That would j just be one
PhD D: Eh with eh a beginning mark and the ending mark Because eh for me is the is the zone with eh some kind of eh distortion the spectral I do not mind By the moment by the moment
Grad G: Well but But you could imagine that three people talking has a different spectral characteristic than two
PhD D: I I do not but eh but eh I have to study comment What will happen in a general way
Grad G: So You had to start somewhere
PhD C: So there s a lot of overlap
PhD D: I I do not know what eh will will happen with the
Grad G: That s a lot of overlap
Professor E: So again that s that s three three hundred in forty five minutes that are that are speakers just speakers
Postdoc B: But a a a th
Professor E: So that s about eight per minute
Postdoc B: But a thousand events in twelve minutes that s
PhD C: But that can include taps
Postdoc B: Well but a thousand taps in eight minutes is a l in twelve minutes is a lot
PhD D: I I con I consider I consider acoustic events eh the silent too
Grad G: Silence starting or silence ending
PhD D: silent ground to bec to detect eh because I consider acoustic event all the things are not eh speech In ge in in in a general point of view
Professor E: OK so how many of those thousand were silence ?
PhD F: Not speech not speech or too much speech
Professor E: Right So how many of those thousand were silence silent sections ?
PhD D: silent I I I I do not I I have not the eh I I would like to to do a stylistic study
## instructions
Summarize the meeting based on the transcript. In paragraph form, output your response. Use at least 10 words and at most 50 words in total.Build the evaluation framework [#eval-framework]#
We now setup the tools we will use for evaluation.
We use three criteria that are graded using LLMs:
- Completeness: checks if the summary includes all the important information from the original text that it should include
- Correctness: checks if there are any hallucinations or factual inaccuracies in the summary
- Conciseness: checks if the summary includes any unnecessary information or wordiness
In this cookbook, we will use Command-R to grade the completions. However, note that in practice, we typically use an ensemble of multiple LLM evaluators to reduce any bias.
grading_prompt_template = """You are an AI grader that given a prompt, a completion, and a criterion, grades the completion based on the prompt and criterion. Below is a prompt, a completion, and a criterion with which to grade the completion. You need to respond according to the criterion instructions.
PROMPT:
{prompt}
COMPLETION:
{completion}
CRITERION:
{criteria}
Do not provide an explanation or respond with additional text. Only respond with the final score in the format SCORE: <final_score>
SCORE:"""
criteria_completeness = """Please judge the above completion on whether or not all the completion fully addresses the task posed in the prompt. The completion should contain all the information that is asked for by the prompt while also upholding the constraints mentioned in the prompt. Please return a grade between 1 and 5, where 1 means the completion does not address the prompt at all, 2 means the completion mostly does not address the prompt, 3 is neutral, 4 means the completion mostly addresses the prompt, and 5 means the completion fully addresses the prompt."""
criteria_correctness = """Please judge the above completion on whether or not all information was derived from the prompt and that no information is false. Return a grade between 1 and 5, where 1 means the answers are entirely untruthful relative to the prompt, 2 means the answers are somewhat untruthful relative to the prompt, 3 means neutral, 4 means the answers are somewhat truthful relative to the prompt, and 5 means the answers are entirely truthful relative to the prompt. Evaluate only the truthfulness of the answers, not whether or not they capture all the relevant information in the prompt."""
criteria_conciseness = """Please judge the above completion on whether or not the completion contains any unnecessary information or wordiness that does not help answer the specific instruction given in the prompt. Return a grade between 1 and 5, where 1 means the completion contains many unnecessary details and wordiness that do not answer the specific instruction given in the prompt, 2 means the completion contains some unnecessary details or wordiness, 3 means neutral, 4 means the completion contains few unnecessary details or wordiness, and 5 means the completion contains only necessary details that answer the specific instruction given in the prompt."""
def score_llm(prompt: str, completion: str, criteria: str) -> int:
"""
Score a completion based on a prompt and a criterion using LLM Because we
grade all completions on a scale of 1-5, we will normalize the scores by 5 so that the final score
is between 0 and 1.
"""
grading_prompt = grading_prompt_template.format(
prompt=prompt, completion=completion, criteria=criteria
)
# Use Cohere to grade the completion
completion = co.chat(message=grading_prompt, model=co_model, temperature=0.2).text
### Alternatively, use OpenAI to grade the completion (requires key)
# import openai
# completion = openai.OpenAI(api_key="INSERT OPENAI KEY HERE").chat.completions.create(
# model="gpt-4",
# messages=[{"role": "user", "content": grading_prompt}],
# temperature=0.2,
# ).choices[0].message.content
# Extract the score from the completion
score = float(re.search(r"[12345]", completion).group()) / 5
return scoreIn addition, we have two criteria that are graded programmatically:
- Format: checks if the summary follows the format (e.g. bullets) that was requested in the prompt
- Length: checks if the summary follows the length that was requested in the prompt.
def score_format(completion: str, format_type: str) -> int:
"""
Returns 1 if the completion is in the correct format, 0 otherwise.
"""
if format_type == "paragraphs":
return int(_is_only_paragraphs(completion))
elif format_type == "bullets":
return int(_is_only_bullets(completion))
return 0
def score_length(
completion: str,
format_type: str,
min_val: int,
max_val: int,
number: Optional[int] = None
) -> int:
"""
Returns 1 if the completion has the correct length for the given format, 0 otherwise. This
includes both word count and number of items (optional).
"""
# Split into items (each bullet for bullets or each paragraph for paragraphs)
if format_type == "bullets":
items = _extract_markdown_bullets(completion, include_bullet=False)
elif format_type == "paragraphs":
items = completion.split("\n")
# Strip whitespace and remove empty items
items = [item for item in items if item.strip() != ""]
# Check number of items if provided
if number is not None and len(items) != number:
return 0
# Check length of each item
for item in items:
num_words = item.strip().split()
if min_val is None and len(num_words) > max_val:
return 0
elif max_val is None and len(num_words) < min_val:
return 0
elif not min_val <= len(num_words) <= max_val:
return 0
return 1
def _is_only_bullets(text: str) -> bool:
"""
Returns True if text is only markdown bullets.
"""
bullets = _extract_markdown_bullets(text, include_bullet=True)
for bullet in bullets:
text = text.replace(bullet, "")
return text.strip() == ""
def _is_only_paragraphs(text: str) -> bool:
"""
Returns True if text is only paragraphs (no bullets).
"""
bullets = _extract_markdown_bullets(text, include_bullet=True)
return len(bullets) == 0
def _extract_markdown_bullets(text: str, include_bullet: bool = False) -> List[str]:
"""
Extracts markdown bullets from text as a list. If include_bullet is True, the bullet will be
included in the output. The list of accepted bullets is: -, *, +, •, and any number followed by
a period.
"""
if include_bullet:
return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.).*\w+.*$", text, flags=re.MULTILINE)
return re.findall(r"^[ \t]*(?:[-*+•]|[\d]+\.)(.*\w+.*)$", text, flags=re.MULTILINE)Run evaluations [#run-evals]#
Now that we have our evaluation dataset and defined our evaluation functions, let’s run evaluations!
First, we generate completions to be graded. We will use Cohere’s Command-R model, boasting a context length of 128K.
completions = []
for prompt in data["prompt"]:
completion = co.chat(message=prompt, model="command-r", temperature=0.2).text
completions.append(completion)
data["completion"] = completionsprint(data["completion"][0])PhD D is transcribing recorded sessions to locate overlapping speech zones and categorizing them as acoustic events. The team discusses the parameters PhD D should use and how to define these events, considering the number of speakers and silence.
Let’s grade the completions using our LLM and non-LLM checks.
data["format_score"] = data.apply(
lambda x: score_format(x["completion"], x["eval_metadata"]["format"]), axis=1
)
data["length_score"] = data.apply(
lambda x: score_length(
x["completion"],
x["eval_metadata"]["format"],
x["eval_metadata"].get("min_length"),
x["eval_metadata"].get("max_length"),
),
axis=1,
)
data["completeness_score"] = data.apply(
lambda x: score_llm(x["prompt"], x["completion"], criteria_completeness), axis=1
)
data["correctness_score"] = data.apply(
lambda x: score_llm(x["prompt"], x["completion"], criteria_correctness), axis=1
)
data["conciseness_score"] = data.apply(
lambda x: score_llm(x["prompt"], x["completion"], criteria_conciseness), axis=1
)data |
|---|
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
Finally, let’s print the average scores per critiera.
avg_scores = data[["format_score", "length_score", "completeness_score", "correctness_score", "conciseness_score"]].mean()
print(avg_scores)format_score 1.000000
length_score 0.833333
completeness_score 0.800000
correctness_score 1.000000
conciseness_score 0.800000
dtype: float64