In natural language processing (NLP), we often want to split a large document into sentences, so that we can analyze the individual sentences and the relationships between them.
spaCy's pretrained neural models provide this functionality through their syntactic dependency parsers. spaCy also offers a rule-based Sentencizer, but it is far more likely to fail on complex sentences.
While spaCy's statistical sentence segmentation works quite well in most cases, there are still some odd cases where it fails. One of them is the handling of ’s tokens, which I noticed when using spaCy version 2.0.18 and model en_core_web_md version 2.0.0.
For example, given this sentence (the title of a news article from The Atlantic):
Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.
spaCy returns three sentences:
- Hong Kong Shows the Flaws in China
- ’s
- Zero-Sum Worldview.
Another example taken from a news article from the New York Times:
Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China’s military garrison descended into an evening of clashes, panic and widespread disruption.
spaCy splits it into two sentences:
- Police officers fired tear gas in several locations as a day that began with a show of peaceful defiance outside the headquarters of China
- ’s military garrison descended into an evening of clashes, panic and widespread disruption.
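Sentence splits like these can be inspected by iterating over `doc.sents`. Here is a minimal sketch using the rule-based Sentencizer on a blank pipeline, so no model download is required (the `add_pipe("sentencizer")` call is spaCy 3 API; reproducing the exact failures above would instead require loading the statistical en_core_web_md model):

```python
import spacy

nlp = spacy.blank("en")
# The rule-based Sentencizer splits on sentence-final punctuation; a
# trained pipeline such as en_core_web_md would use its dependency
# parser instead, e.g. nlp = spacy.load("en_core_web_md").
nlp.add_pipe("sentencizer")

doc = nlp("Police officers fired tear gas. The day descended into clashes.")
for sent in doc.sents:
    print(sent.text)
```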
The problem seems to be somewhat alleviated in the latest 2.1.0 model, but the solution provided below should still be helpful.
According to spaCy's documentation, we can add custom rules as a custom pipeline component, placed before the dependency parser, that specifies the sentence boundaries. The dependency parser, which runs later in the pipeline, will respect the Token.is_sent_start attribute set by this component.
We want to make sure that ’s tokens never start a sentence. Here is how to do it:
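A minimal sketch of such a component (the component name here is my own; the registration decorator and string-based `add_pipe` are spaCy 3 API — in the spaCy 2.x versions discussed above, the bare function is passed directly, as in `nlp.add_pipe(func, before="parser")`). It checks both the straight 's and curly ’s forms:

```python
import spacy
from spacy.language import Language

@Language.component("prevent_possessive_sentence_start")
def prevent_possessive_sentence_start(doc):
    # Never allow an "'s" token to begin a sentence; the dependency
    # parser, which runs after this component, respects any
    # Token.is_sent_start values that are already set.
    for token in doc[1:]:
        if token.text in ("'s", "’s"):
            token.is_sent_start = False
    return doc

# With a trained pipeline, the component goes before the parser:
#   nlp = spacy.load("en_core_web_md")
#   nlp.add_pipe("prevent_possessive_sentence_start", before="parser")

# Demonstration on a blank English pipeline (no model download needed):
nlp = spacy.blank("en")
nlp.add_pipe("prevent_possessive_sentence_start")
doc = nlp("Hong Kong Shows the Flaws in China’s Zero-Sum Worldview.")
print([(t.text, t.is_sent_start) for t in doc])
```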
Now spaCy will correctly identify each of the previous two examples as a single, full sentence.
20190822 Update: Added rules that improve the handling of curly quotes.