Detecting Chinese Characters in Unicode Strings
CJK in Unicode and Python 3 implementation
Apr 24, 2019 · 610 words · 3 minute read

Motivation
I needed an automatic mechanism to remove texts that are not in Chinese from a dataset. The dataset contains characters from Traditional Chinese, Simplified Chinese, and English, and on rare occasions French, Arabic, and other languages.
General-purpose language detection packages (such as this one) produce far more false positives than expected. Texts that mix Chinese characters with Latin characters are often classified as other languages, and quite often pure Chinese texts are classified as Korean, which is curious because the dataset does not contain any Korean characters.
Since the task only requires a binary label (Chinese or not Chinese) for each input, I figured a better approach might be to build my own algorithm that utilizes the block range information of Unicode.
This post documents what I’ve learned about CJK characters in Unicode, and presents a better but still imperfect solution.
CJK characters in Unicode
“CJK” is a commonly used acronym for “Chinese, Japanese, and Korean”. The term “CJK character” generally refers to “Chinese characters”, or more specifically, the Chinese (= Han) ideographs used in the writing systems of the Chinese and Japanese languages, occasionally for Korean, and historically in Vietnam. [1]
In Unicode, the characters shared among Chinese, Japanese, and Korean were identified and merged as “CJK Unified Ideographs”. They include the hanzi of the Chinese writing system, kanji in Japanese, and hanja in Korean. [2]
So, given a character from the unified ideographs, can we tell whether it is a Chinese, Japanese, or Korean character? Sadly, the answer is no:
It’s basically impossible and largely meaningless. It’s the equivalent of asking if “a” is an English letter or a French one. There are some characters where one can guess based on the source information in the Unihan Database that it’s traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable. [1]
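One quick way to see this for ourselves is to ask for a character’s Unicode name (a small illustration of my own, using only the standard library): Han characters get the same generic block-level name regardless of which language they came from.

```python
import unicodedata

# "中" (common in Chinese) and "車" (kuruma, common in Japanese) both
# resolve to generic "CJK UNIFIED IDEOGRAPH-XXXX" names -- nothing in
# Unicode itself marks them as belonging to one language.
print(unicodedata.name("中"))  # CJK UNIFIED IDEOGRAPH-4E2D
print(unicodedata.name("車"))  # CJK UNIFIED IDEOGRAPH-8ECA
print(unicodedata.name("a"))   # LATIN SMALL LETTER A
```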
The good news is that modern Korean rarely uses Chinese characters, and Japanese texts in most cases contain the language’s own hiragana and katakana characters. Texts written entirely in Chinese characters, unfortunately, might require us to fall back on probabilistic models that use language features. Distinguishing between Traditional and Simplified Chinese can also be quite difficult, and awaits further research.
Respective Unicode Blocks
The “common” block of CJK Unified Ideographs should cover most cases, so we should be able to ignore the extension blocks when classifying (see the lookup sketch after the table).
| Block | Range | Comment |
|---|---|---|
| CJK Unified Ideographs | 4E00-9FFF | Common |
| Hiragana | 3040-309F | Japanese [3] |
| Katakana | 30A0-30FF | Japanese [3] |
| Hangul Syllables | AC00-D7A3 | Korean [4] |
| CJK Unified Ideographs Extension A | 3400-4DBF | Rare [5] |
| CJK Unified Ideographs Extension B | 20000-2A6DF | Rare, historic |
| CJK Unified Ideographs Extension C | 2A700-2B73F | Rare, historic |
| CJK Unified Ideographs Extension D | 2B740-2B81F | Uncommon, some in current use |
| CJK Unified Ideographs Extension E | 2B820-2CEAF | Rare, historic |
| CJK Compatibility Ideographs | F900-FAFF | Duplicates, unifiable variants, corporate characters |
| CJK Compatibility Ideographs Supplement | 2F800-2FA1F | Unifiable variants |
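To make these ranges concrete, here is a small Python lookup sketch. The helper name and structure are my own; only the code point ranges come from the table above.

```python
# Hypothetical helper mapping a character to the blocks listed above.
CJK_BLOCKS = [
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0xAC00, 0xD7A3, "Hangul Syllables"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
    (0xF900, 0xFAFF, "CJK Compatibility Ideographs"),
    (0x2F800, 0x2FA1F, "CJK Compatibility Ideographs Supplement"),
]


def cjk_block(char):
    """Return the name of the CJK-related block containing char, or None."""
    code_point = ord(char)
    for start, end, name in CJK_BLOCKS:
        if start <= code_point <= end:
            return name
    return None


print(cjk_block("中"))  # CJK Unified Ideographs
print(cjk_block("の"))  # Hiragana
print(cjk_block("한"))  # Hangul Syllables
print(cjk_block("a"))   # None
```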
Python 3 Implementation
I use Python 3 here because it simplifies things. The default encoding for Python 3 source code is UTF-8, and the language’s str type contains Unicode characters, meaning any string created using “unicode rocks!”, ‘unicode rocks!’, or the triple-quoted string syntax is stored as Unicode [6].
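A minimal demonstration of this, assuming a UTF-8 source file (the Python 3 default):

```python
# str holds code points, not bytes, so ord() gives the Unicode value directly.
s = "中文 rocks!"
print(len(s))                  # 9 -- code points, not bytes
print(hex(ord(s[0])))          # 0x4e2d, inside the 4E00-9FFF block
print(len(s.encode("utf-8")))  # 13 -- the UTF-8 byte count differs
```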
The following implementation uses re.search to look for characters in the specified block(s). Some simple test cases are supplied, along with some known incorrect cases: Japanese texts written only in Chinese characters, and a bizarre but common usage of a Japanese character in Traditional Chinese texts.
```python
import re


def cjk_detect(texts):
    # Korean: Hangul syllables appear only in Korean, so check them first.
    if re.search("[\uac00-\ud7a3]", texts):
        return "ko"
    # Japanese: hiragana and katakana appear only in Japanese; check them
    # before the Han block, since Japanese texts also contain kanji.
    if re.search("[\u3040-\u30ff]", texts):
        return "ja"
    # Chinese: the CJK Unified Ideographs block, shared by all three languages.
    if re.search("[\u4e00-\u9fff]", texts):
        return "zh"
    return None


def test_cjk_detect():
    # Pure English
    assert cjk_detect(
        "Is Obstruction an Impeachable Offense? History Says Yes") is None
    # Pure French
    assert cjk_detect(
        "Damian Lillard a réussi un nouveau shoot de la victoire"
        " au buzzer à très longue distance") is None
    # Simplified Chinese
    assert cjk_detect(
        "2009年,波音公司(Boeing)在查尔斯顿附近的新厂破土动工时,曾宣扬这里是最先进的制造中心"
        ",将制造一款世界上最先进的飞机。但在接下来的十年里,这家生产787梦想客机的工厂一直受到做"
        "工粗糙和监管不力的困扰,危及航空安全。") == "zh"
    # Traditional Chinese
    assert cjk_detect(
        "北查爾斯頓工廠的安全漏洞已經引起了航空公司和監管機構的密切關注。") == "zh"
    # Japanese
    assert cjk_detect(
        "日産自動車は24日、2019年3月期の連結業績予想を下方修正した。") == "ja"
    # Korean
    assert cjk_detect(
        "투서로 뜨고 투서에 지나") == "ko"
    # Korean with a Chinese character
    assert cjk_detect(
        "北 외무성 간부 총살설 주민들 사이서 확산…하노이 회담 실패 때문") == "ko"


def print_incorrect_cases():
    # Japanese written entirely in kanji -- indistinguishable from Chinese here
    texts = "日産自動車、営業益45%減 前期下方修正"
    print(texts, "expected: ja actual:", cjk_detect(texts))
    # Traditional Chinese with Japanese hiragana
    texts = "健康の油切 好吃の涼麵"
    print(texts, "expected: zh actual:", cjk_detect(texts))
    # Traditional Chinese with a katakana middle dot (U+30FB)
    texts = "鐵腕・都鐸王朝(五):文藝復興最懂穿搭的高富帥——亨利八世"
    print(texts, "expected: zh actual:", cjk_detect(texts))


if __name__ == "__main__":
    # Correct cases
    test_cjk_detect()
    # Incorrect cases
    print_incorrect_cases()
```
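For completeness, a variant of the Han check that also covers the extension and compatibility blocks from the table is sketched below. This is my own extension, not part of the implementation above; Python 3.3+ handles \U-escaped astral-plane ranges in character classes directly.

```python
import re

# Extension C (2A700-2B73F), D (2B740-2B81F), and E (2B820-2CEAF) are
# contiguous, so they collapse into a single 2A700-2CEAF range here.
HAN_EXTENDED = re.compile(
    "[\u4e00-\u9fff"          # CJK Unified Ideographs
    "\u3400-\u4dbf"           # Extension A
    "\U00020000-\U0002a6df"   # Extension B
    "\U0002a700-\U0002ceaf"   # Extensions C, D, and E
    "\uf900-\ufaff"           # CJK Compatibility Ideographs
    "\U0002f800-\U0002fa1f]"  # CJK Compatibility Ideographs Supplement
)

print(bool(HAN_EXTENDED.search("中文")))         # True
print(bool(HAN_EXTENDED.search("\U00020000")))   # True -- U+20000, Extension B
print(bool(HAN_EXTENDED.search("hello")))        # False
```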