In natural language processing (NLP), we often want to split a large document into sentences, so that we can analyze the individual sentences and the relationships between them.
spaCy's pretrained neural models provide this functionality through their syntactic dependency parsers. spaCy also offers a rule-based Sentencizer, but it is far more likely to fail on complex sentences.
While spaCy's statistical sentence segmentation works quite well in most cases, there are still some odd cases where it fails. One of them is the handling of ’s tokens, which I noticed when using spaCy version 2.0.18 and model en_core_web_md version 2.0.0.
For example, given this sentence (the title of a news article from The Atlantic):
Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.
spaCy returns three sentences:
- Hong Kong Shows the Flaws in China
- ’s
- Zero-Sum Worldview.
Another example taken from a news article from the New York Times:
Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.
spaCy splits it into two sentences:
- Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China
- ’s military garrison descended into an evening of clashes, panic and widespread disruption.
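Sentence splits like these can be inspected by iterating over `doc.sents`. Here is a minimal sketch using the rule-based Sentencizer on a blank pipeline, so no model download is required (the `add_pipe("sentencizer")` call is spaCy 3 API; reproducing the exact failures above would instead require loading the statistical en_core_web_md model):

```python
import spacy

nlp = spacy.blank("en")
# The rule-based Sentencizer splits on sentence-final punctuation; a
# trained pipeline such as en_core_web_md would use its dependency
# parser instead, e.g. nlp = spacy.load("en_core_web_md").
nlp.add_pipe("sentencizer")

doc = nlp("Police officers fired tear gas. The day descended into clashes.")
for sent in doc.sents:
    print(sent.text)
```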
The problem seems to be somewhat alleviated in the latest 2.1.0 model, but the solution provided below should still be helpful.
According to spaCy's documentation, we can add custom rules as a custom pipeline component, placed before the dependency parser, that specifies the sentence boundaries. The dependency parser, which runs later in the pipeline, will respect the Token.is_sent_start attribute set by this component.
We want to make sure that ’s tokens never start a sentence. Here is how to do it:
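A minimal sketch of such a component (the component name here is my own; the registration decorator and string-based `add_pipe` are spaCy 3 API — in the spaCy 2.x versions discussed above, the bare function is passed directly, as in `nlp.add_pipe(func, before="parser")`). It checks both the straight 's and curly ’s forms:

```python
import spacy
from spacy.language import Language

@Language.component("prevent_possessive_sentence_start")
def prevent_possessive_sentence_start(doc):
    # Never allow an "'s" token to begin a sentence; the dependency
    # parser, which runs after this component, respects any
    # Token.is_sent_start values that are already set.
    for token in doc[1:]:
        if token.text in ("'s", "’s"):
            token.is_sent_start = False
    return doc

# With a trained pipeline, the component goes before the parser:
#   nlp = spacy.load("en_core_web_md")
#   nlp.add_pipe("prevent_possessive_sentence_start", before="parser")

# Demonstration on a blank English pipeline (no model download needed):
nlp = spacy.blank("en")
nlp.add_pipe("prevent_possessive_sentence_start")
doc = nlp("Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.")
print([(t.text, t.is_sent_start) for t in doc])
```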
Now spaCy will correctly identify each of the previous two examples as a single, full sentence.
20190822 Update: Added rules that improve the handling of curly quotes.