Cantonese Word Segmentation

This repository provides a Cantonese dictionary designed to improve Cantonese word segmentation. It can be loaded into the jieba segmentation library as a custom dictionary for Cantonese text processing.

Sources

The development of this Cantonese dictionary has been informed by the following resources:

Installation

To get started, download the dictionary file from the repository:

Download the file `wordlist.txt`

Usage

You can integrate this dictionary into your jieba processing workflow as follows:

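import jieba
# Load the Cantonese word list as a jieba user dictionary (assumes wordlist.txt is in the working directory)
jieba.load_userdict('wordlist.txt')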

Performance Comparison

The examples below compare segmentation with this custom dictionary against segmentation with jieba's default dictionary:

Example Sentence:

"影相呃like真係一件好煩嘅事嚟㗎，不過呃like都唔一"
- Custom Dictionary Output: [影相, 呃like, 真係, 一件, 好煩, 嘅, 事, 嚟, 不過, 呃like, 都, ...]
- Jieba Default Dictionary Output: [影相, 呃, like, 真, 係, 一件, 好, 煩, 嘅, 事, 嚟, 㗎, ...]

Example Sentence:

"真係唔好怪人哋叫香港做投訴之都⋯⋯繼有餐廳"
- Custom Dictionary Output: [真係, 唔好, 怪人, 哋, 叫, 香港, 做, 投訴, 之, 都, 繼, 有, 餐廳, ...]
- Jieba Default Dictionary Output: [真, 係, 唔, 好, 怪人, 哋, 叫, 香港, 做, 投訴, 之, 都, ⋯⋯, ...]
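
To make a comparison like the ones above easier to read, the following sketch lists the words that the custom dictionary keeps whole but the default dictionary splits. The function `merged_words` is a hypothetical helper, not part of this repository or of jieba, and the token lists are the leading tokens of the first example, typed in by hand.

```python
def merged_words(custom_tokens, default_tokens):
    """Return the tokens in custom_tokens that default_tokens splits into pieces."""
    # Both segmentations must spell out exactly the same text.
    if "".join(custom_tokens) != "".join(default_tokens):
        raise ValueError("token lists cover different text")

    # Character offsets where the default segmentation places a boundary.
    boundaries = set()
    pos = 0
    for tok in default_tokens:
        pos += len(tok)
        boundaries.add(pos)

    merged = []
    pos = 0
    for tok in custom_tokens:
        start, end = pos, pos + len(tok)
        # A default boundary strictly inside this token means the default split it.
        if any(start < b < end for b in boundaries):
            merged.append(tok)
        pos = end
    return merged


# Leading tokens of the first example sentence, copied from the outputs above.
custom = ["影相", "呃like", "真係", "一件", "好煩"]
default = ["影相", "呃", "like", "真", "係", "一件", "好", "煩"]
print(merged_words(custom, default))  # ['呃like', '真係', '好煩']
```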

wchan757/Cantonese_Word_Segmentation