Discourse Segmentation
1. Introduction to Discourse and Discourse Structure
Discourse refers to coherent sequences of sentences or utterances that form meaningful
communication beyond individual sentences. In computational linguistics and NLP, discourse
analysis focuses on the structure and organization of text or speech in extended communication
such as articles, conversations, and narratives.
A discourse structure organizes text into meaningful units or segments—such as paragraphs,
sections, or dialogue turns—each contributing to the overall communicative goal. These
segments are often related through coherence relations (e.g., cause-effect, contrast,
elaboration).
2. Importance of Segmenting Text into Coherent Units
Discourse segmentation is the process of dividing text into coherent segments, each typically
focusing on a single topic, subtopic, or communicative intention. It's critical for:
● Text Summarization: Extracting the most relevant content from each coherent unit.
● Question Answering (QA): Locating specific segments related to a question.
● Dialogue Systems: Understanding topic shifts or turns in a conversation.
● Information Retrieval: Improving relevance by segment-aware indexing.
● Sentiment Analysis: Detecting sentiment changes across different segments.
Example: In a news article discussing an election, segmentation can separate parts covering
candidate background, polling data, public opinion, and campaign events—enabling better
content understanding and retrieval.
3. Techniques and Algorithms for Discourse Segmentation
A. TextTiling Algorithm
Developed by Marti Hearst (1997), TextTiling is a pioneering unsupervised method for
segmenting expository text into topically coherent blocks.
Working Principle:
● The text is divided into token sequences (pseudo-sentences).
● For each pair of adjacent blocks, the cosine similarity of their word distributions is
computed.
● Valleys (drops) in similarity scores indicate potential segment boundaries.
Advantages:
● Language-independent and unsupervised.
● Works well for structured texts like essays or reports.
Real-time Example:
● Used in document summarizers to identify thematic units before summary extraction.
● Educational software uses it to segment chapters or lessons for adaptive learning.
B. Machine Learning Approaches
With the availability of annotated corpora, supervised machine learning techniques have
become popular for discourse segmentation.
Features Used:
● Lexical cues: Discourse markers like “however”, “on the other hand”, “furthermore”.
● Syntactic features: Part-of-speech tags, sentence lengths, punctuation.
● Semantic cues: Word embeddings or BERT-like contextual embeddings.
● Topic modeling: LDA or clustering to detect topic shifts.
Algorithms:
● SVMs, Decision Trees: Classical ML methods on structured features.
● CRFs (Conditional Random Fields): Useful for sequential segmentation tasks.
● Neural models (BiLSTM, BERT): Contextual deep learning models fine-tuned on
discourse data.
Real-time Example:
● In customer service chatbots, ML-based discourse segmentation helps detect new
intents or issues when a user shifts topics mid-conversation.
● In legal document analysis, ML models segment contracts into clauses (e.g., payment
terms, liability, termination).
4. Applications of Discourse Segmentation
A. Text Summarization
● Identifies relevant segments that represent key points across the document.
● Prevents inclusion of disjoint or off-topic sentences in summaries.
Example: News summarizers segment reports to isolate important sections like quotes, events,
and statistics before generating a summary.
B. Question Answering (QA)
● Narrows the search space by directing the QA system to specific discourse segments.
● Increases the precision of answer retrieval.
Example: In open-domain QA systems like Google’s passage-based search, discourse
segmentation improves locating relevant answer-containing passages.
C. Dialogue Systems
● Detects when speakers switch topics or intentions.
● Helps maintain coherence in multi-turn conversations.
Example: Virtual assistants (e.g., Siri, Alexa) segment user input into topics (e.g., “weather,”
“calendar”) to understand and respond appropriately across interactions.
Coherent sequences of sentences refer to a group of sentences that are logically connected
and flow smoothly together to express a unified idea, topic, or theme. In a coherent sequence,
each sentence relates meaningfully to the others, maintaining clarity and continuity throughout
the passage.
Key Characteristics of Coherent Sequences:
1. Logical Order: Ideas are presented in a logical progression (e.g., cause-effect,
chronological order).
2. Consistency: Maintains consistent subject, tense, and point of view.
3. Reference and Linkage: Uses devices like pronouns, conjunctions, and transition
words to link sentences (e.g., "however," "because," "this," "such as").
4. Topic Unity: All sentences focus on a single theme or central idea.
5. Smooth Transitions: There are no abrupt shifts in topic or structure.
Example of a Coherent Sequence:
"Air pollution is a growing concern in urban areas. Vehicles and industrial emissions
are the primary sources. To combat this, cities are investing in cleaner
transportation systems and stricter emission regulations. These efforts aim to
improve air quality and public health."
● Each sentence builds on the previous one.
● There is clear topic unity (air pollution).
● Pronouns like "this" and connectors like "to combat this" create cohesion.
In NLP and Discourse Analysis:
In computational linguistics, recognizing coherent sequences helps machines:
● Understand text structure
● Segment discourse meaningfully
● Summarize or answer questions accurately
Let me know if you’d like examples of incoherent sequences for contrast!