This repository provides a Cantonese dictionary designed to improve Cantonese word segmentation. It integrates with the jieba segmentation library, so you can load it as a custom dictionary for Cantonese text processing.
The development of this Cantonese dictionary has been informed by the following resources:
- Most Common Cantonese Words Frequency List
- Words.hk Faiman Analysis
- PyCantonese
- FastText Pretrained Vectors
To get started, download the dictionary file `wordlist.txt` from the repository.
You can integrate this dictionary into your jieba processing workflow as follows:
import jieba
jieba.load_userdict('wordlist.txt')
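Before loading the file, it can help to sanity-check its layout. jieba's user dictionary format is one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below parses lines in that format without importing jieba. The helper name `parse_userdict_lines` and the sample entries are hypothetical, the actual contents of `wordlist.txt` are not described here, and the parser assumes words contain no internal spaces.

```python
from typing import Iterable, List, Optional, Tuple

# (word, frequency or None, part-of-speech tag or None)
Entry = Tuple[str, Optional[int], Optional[str]]


def parse_userdict_lines(lines: Iterable[str]) -> List[Entry]:
    """Parse lines laid out as `word [freq] [tag]` (hypothetical helper)."""
    entries: List[Entry] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        word, rest = parts[0], parts[1:]
        # The frequency, if present, must come before the tag.
        freq = int(rest.pop(0)) if rest and rest[0].isdecimal() else None
        tag = rest.pop(0) if rest else None
        if rest:
            raise ValueError(f"line {line_no}: unexpected extra fields {rest!r}")
        entries.append((word, freq, tag))
    return entries


# Made-up sample entries for illustration; not taken from wordlist.txt.
sample_lines = ["呃like 10 n", "真係", "", "唔好 5"]
entries = parse_userdict_lines(sample_lines)
for word, freq, tag in entries:
    print(word, freq, tag)
```

To check the real file, pass an open file handle instead of the sample list, for example `parse_userdict_lines(open('wordlist.txt', encoding='utf-8'))`.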
The examples below compare segmentation with this custom dictionary against the default jieba dictionary:
- Example Sentence: "影相呃like真係一件好煩嘅事嚟㗎,不過呃like都唔一"
  - Custom Dictionary Output: [影相, 呃like, 真係, 一件, 好煩, 嘅, 事, 嚟, 不過, 呃like, 都, ...]
  - Jieba Default Dictionary Output: [影相, 呃, like, 真, 係, 一件, 好, 煩, 嘅, 事, 嚟, 㗎, ...]
- Example Sentence: "真係唔好怪人哋叫香港做投訴之都⋯⋯繼有餐廳"
  - Custom Dictionary Output: [真係, 唔好, 怪人, 哋, 叫, 香港, 做, 投訴, 之, 都, 繼, 有, 餐廳, ...]
  - Jieba Default Dictionary Output: [真, 係, 唔, 好, 怪人, 哋, 叫, 香港, 做, 投訴, 之, 都, ⋯⋯, ...]
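To make the difference between two outputs explicit, one option is to compare token boundaries and list the words that the custom dictionary keeps whole but the default dictionary splits. The sketch below is pure Python and does not call jieba. The helper names are hypothetical, the token lists are copied from the first example above up to the elided part, and the comparison is restricted to the text both lists share, since the two outputs are truncated differently.

```python
from typing import List, Set, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each token."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest shared prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def words_kept_whole(custom: List[str], default: List[str]) -> List[str]:
    """Tokens in `custom` whose boundaries do not appear in `default`.

    Only the part of the text covered by both lists is compared.
    """
    limit = common_prefix_len("".join(custom), "".join(default))
    default_spans: Set[Tuple[int, int]] = set(token_spans(default))
    return [
        tok
        for tok, (start, end) in zip(custom, token_spans(custom))
        if end <= limit and (start, end) not in default_spans
    ]


# Token lists from the first example, up to the elided part.
custom = ["影相", "呃like", "真係", "一件", "好煩", "嘅", "事", "嚟", "不過", "呃like", "都"]
default = ["影相", "呃", "like", "真", "係", "一件", "好", "煩", "嘅", "事", "嚟", "㗎"]

print(words_kept_whole(custom, default))  # ['呃like', '真係', '好煩']
```

Comparing offsets rather than token strings avoids false matches when the same word occurs more than once in a sentence.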