I need an automatic mechanism to remove texts in a dataset that are not Chinese. The dataset contains characters from Traditional Chinese, Simplified Chinese, English, and on rare occasions French, Arabic, and other languages.
General-purpose language detection packages (such as this one) produce far more false positives than expected. Texts mixing Chinese characters with Latin characters are often classified as other languages, and quite often Chinese texts are classified as Korean, which is curious because the dataset does not contain any Korean characters.
Since the task only requires a binary label (Chinese or not Chinese) for each input, I figure a better approach might be building my own algorithm that utilizes the block range information of Unicode.
This post documents what I’ve learned about CJK characters in Unicode, and presents a better but still imperfect solution.
CJK characters in Unicode
CJK is a commonly used acronym for "Chinese, Japanese, and Korean". The term "CJK character" generally refers to "Chinese characters", or more specifically, the Chinese (= Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally in Korean, and historically in Vietnam.
In Unicode, the characters shared among Chinese, Japanese, and Korean were identified and merged as "CJK Unified Ideographs". They include the characters used in the Chinese writing system, kanji in Japanese, and hanja in Korean.
So, given a character from the unified ideographs, can we tell whether it is a Chinese, Japanese, or Korean character? Sadly, the answer is no:
It’s basically impossible and largely meaningless. It’s the equivalent of asking if “a” is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it’s traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable.
The good news is that Korean rarely uses Chinese characters these days, and Japanese texts in most cases contain their own "hiragana" and "katakana" characters. The remaining cases, texts written entirely in Chinese characters, might unfortunately require probabilistic models that use language features. Distinguishing between Traditional and Simplified Chinese can also be quite difficult, and awaits further research.
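Because hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) occupy their own Unicode blocks, a quick check for kana can rule out most Japanese texts. A minimal sketch (the function name is my own):

```python
import re

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) are contiguous
KANA_RE = re.compile(r"[\u3040-\u30ff]")

def contains_kana(text):
    """Return True if the text contains any hiragana or katakana."""
    return KANA_RE.search(text) is not None

print(contains_kana("日本語のテキスト"))  # True: has kana
print(contains_kana("中文文本"))          # False: Han characters only
```

This fails only for Japanese written entirely in kanji, which is exactly the hard case discussed above.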
Respective Unicode Blocks
The "common" block of CJK Unified Ideographs should cover most cases; we should be able to ignore the extensions when classifying.
| Block | Range | Comment |
|---|---|---|
| CJK Unified Ideographs | 4E00-9FFF | Common |
| Hangul Syllables | AC00-D7A3 | Korean |
| CJK Unified Ideographs Extension A | 3400-4DBF | Rare |
| CJK Unified Ideographs Extension B | 20000-2A6DF | Rare, historic |
| CJK Unified Ideographs Extension C | 2A700-2B73F | Rare, historic |
| CJK Unified Ideographs Extension D | 2B740-2B81F | Uncommon, some in current use |
| CJK Unified Ideographs Extension E | 2B820-2CEAF | Rare, historic |
| CJK Compatibility Ideographs | F900-FAFF | Duplicates, unifiable variants, corporate characters |
| CJK Compatibility Ideographs Supplement | 2F800-2FA1F | Unifiable variants |
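The table above translates directly into code. A sketch using `ord()` range checks, with the bounds taken verbatim from the table (the names `CJK_BLOCKS` and `is_cjk_char` are my own):

```python
# Unicode block ranges from the table above (inclusive bounds)
CJK_BLOCKS = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs (common)
    (0x3400, 0x4DBF),    # Extension A (rare)
    (0x20000, 0x2A6DF),  # Extension B (rare, historic)
    (0x2A700, 0x2B73F),  # Extension C (rare, historic)
    (0x2B740, 0x2B81F),  # Extension D (uncommon)
    (0x2B820, 0x2CEAF),  # Extension E (rare, historic)
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # Compatibility Ideographs Supplement
]

def is_cjk_char(ch):
    """Return True if a single character falls in any CJK ideograph block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_BLOCKS)

print(is_cjk_char("漢"))  # True (U+6F22, in the common block)
print(is_cjk_char("a"))   # False
```

In practice only the first entry matters for classification, per the note above, but keeping the extensions listed documents what is deliberately ignored.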
Python 3 Implementation
I emphasize using Python 3 to simplify things. The default encoding for Python 3 source code is UTF-8, and the language's str type holds Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
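This matters because we want to reason about code points, not bytes. A quick illustration of the difference:

```python
s = "中文 rocks!"
print(len(s))              # 9: counted in characters (code points), not bytes
print(hex(ord(s[0])))      # 0x4e2d: the code point of 中, directly comparable to block ranges
print(len(s.encode("utf-8")))  # 13: the UTF-8 byte length differs
```

Comparing `ord()` values against block boundaries is only meaningful on a str; in Python 2 the same check on a byte string would silently operate on individual UTF-8 bytes.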
The following implementation uses re.search to look for characters in the specified block(s). Some simple test cases are supplied, along with some cases it gets wrong: Japanese texts containing only Chinese characters, and a bizarre but common usage of a Japanese character in Traditional Chinese texts.
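The original code listing did not survive here, so the following is a sketch of what such an implementation might look like: re.search over the common Han block, with kana and hangul checks to reject Japanese and Korean first. The function name `is_chinese` and the exact test strings are my own assumptions:

```python
import re

HAN_RE = re.compile(r"[\u4e00-\u9fff]")      # CJK Unified Ideographs (common block)
KANA_RE = re.compile(r"[\u3040-\u30ff]")     # Japanese hiragana + katakana
HANGUL_RE = re.compile(r"[\uac00-\ud7a3]")   # Hangul Syllables

def is_chinese(text):
    """Heuristic: text contains Han characters but no kana or hangul."""
    if KANA_RE.search(text) or HANGUL_RE.search(text):
        return False
    return HAN_RE.search(text) is not None

# Simple test cases
assert is_chinese("這是中文")                # Traditional Chinese
assert is_chinese("这是中文")                # Simplified Chinese
assert not is_chinese("This is English")
assert not is_chinese("これは日本語です")     # Japanese with kana
assert not is_chinese("한국어입니다")         # Korean hangul

# Known failure mode: Japanese written only in kanji passes as Chinese
print(is_chinese("東京大学"))  # True, yet this is also valid Japanese
```

The last case is exactly the imperfection acknowledged above: without kana or language-model features, kanji-only Japanese is indistinguishable from Chinese at the block level.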
- [Unicode.org] FAQ - Chinese and Japanese
- [Wikipedia] CJK Unified Ideographs
- [key-shortcut.com] Japanese
- [Wikipedia] Hangul Syllables
- [StackOverflow] What’s the complete range for Chinese characters in Unicode?
- [python.org] Unicode HOWTO