Undertheseanlp/word_tokenize: Vietnamese Word Tokenize - GitHub


This repository was archived by the owner on Feb 15, 2023. It is now read-only.


Vietnamese Word Segmentation

A research project on the Vietnamese word segmentation problem, developed by underthesea, a Vietnamese natural language processing research group. It contains the source code of the experiments for data processing, model training, and model evaluation, and makes it easy to adapt the model to new datasets.

Authors

Vũ Anh ([email protected])
Bùi Nhật Anh ([email protected])
Đoàn Việt Dũng ([email protected])

Contributing

Please send any suggestions or requests for help to the project's Issues section. Discussions in Vietnamese are encouraged, to make communication easier.

If you have experience with this problem and would like to join the development team as a Developer, please read the Contribution Guide carefully.

Table of contents

System requirements
Environment setup
Usage
- Using the pretrained model
- Training a model
Experimental results
Citation
License

System requirements

Operating system: Linux (Ubuntu, CentOS), Mac
Python 3.6
Anaconda
languageflow==1.1.7

Environment setup

Download the project with git clone

$ git clone https://github.com/undertheseanlp/word_tokenize

Create a new environment, activate it, and install the required packages

$ cd word_tokenize
$ conda create -n word_tokenize python=3.6
$ source activate word_tokenize
$ pip install -r requirements.txt

Usage

Before running the experiments, make sure you have activated the word_tokenize environment. All commands are run from the project's root directory.

$ cd word_tokenize
$ source activate word_tokenize

Using the pretrained model

$ python word_tokenize.py --text "Chàng trai 9X Quảng Trị khởi nghiệp từ nấm sò"
$ python word_tokenize.py --fin tmp/input.txt --fout tmp/output.txt
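To call the script from other Python code, one option is to wrap the command line shown above. The following is a minimal sketch, not part of the repository: the helper names `build_command` and `run_tokenizer` are hypothetical, and the only flags used are the ones this README documents (`--fin`, `--fout`, `--model`).

```python
import subprocess
import sys


def build_command(fin, fout, model=None, script="word_tokenize.py"):
    """Assemble the argument list for the file-to-file mode of the script.

    `build_command` is a hypothetical helper; it only reuses the flags
    documented in this README.
    """
    cmd = [sys.executable, script, "--fin", fin, "--fout", fout]
    if model is not None:
        # --model points the script at a custom trained model file
        cmd += ["--model", model]
    return cmd


def run_tokenizer(fin, fout, model=None):
    """Run the script; call this from the project's root directory."""
    # check=True raises CalledProcessError if the script exits with an error
    subprocess.run(build_command(fin, fout, model), check=True)
```

For example, `run_tokenizer("tmp/input.txt", "tmp/output.txt")` would run the same command as the second line above, provided the word_tokenize environment is active.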

Training a model

Train a new model

$ python util/preprocess_vlsp2013.py
$ python train.py \
    --train tmp/vlsp2013/train.txt \
    --model tmp/model.bin
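Word segmentation is commonly framed for a CRF as tagging each syllable with whether it begins a word or continues one. The sketch below illustrates that framing only; it is not taken from `train.py` or the preprocessing script. It assumes that segmented text joins the syllables of a word with underscores, and the tag names `B-W` and `I-W` and both function names are hypothetical.

```python
def words_to_tags(segmented_sentence):
    """Convert underscore-joined words into (syllable, tag) pairs.

    Assumption: words are separated by spaces and the syllables of one
    word are joined by "_". "B-W" marks the first syllable of a word,
    "I-W" marks every following syllable.
    """
    tagged = []
    for word in segmented_sentence.split():
        syllables = [s for s in word.split("_") if s]
        for i, syllable in enumerate(syllables):
            tagged.append((syllable, "B-W" if i == 0 else "I-W"))
    return tagged


def tags_to_words(tagged):
    """Inverse of words_to_tags: rebuild the underscore-joined sentence."""
    words = []
    for syllable, tag in tagged:
        if tag == "B-W" or not words:
            words.append(syllable)  # start a new word
        else:
            words[-1] += "_" + syllable  # extend the current word
    return " ".join(words)
```

With a hand-written segmentation such as `"Chàng_trai 9X"` (an illustration, not actual model output), `words_to_tags` yields one `B-W`/`I-W` pair for `Chàng_trai` and a single `B-W` for `9X`.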

Test the newly trained model

$ python word_tokenize.py \
    --fin tmp/input.txt --fout tmp/output.txt \
    --model tmp/model.bin

Experimental results

Model	F1 (%)	Training time
CRF + full features	97.65
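For readers unfamiliar with the metric, word-level F1 for segmentation is usually computed by comparing the word spans of the predicted and the gold segmentation. The sketch below shows one common way to do this; the repository's own evaluation code may differ. It assumes underscore-joined syllables within a word, and the function names are hypothetical.

```python
def word_spans(segmented_sentence):
    """Return the set of (start, end) syllable-index spans, one per word."""
    spans = set()
    start = 0
    for word in segmented_sentence.split():
        length = len(word.split("_"))
        spans.add((start, start + length))
        start += length
    return spans


def segmentation_f1(gold_sentences, predicted_sentences):
    """Word-level F1 over paired gold and predicted sentences."""
    correct = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, predicted_sentences):
        g, p = word_spans(gold), word_spans(pred)
        correct += len(g & p)  # a word counts only if both boundaries match
        n_gold += len(g)
        n_pred += len(p)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)
```

A predicted word is counted as correct only when both its start and its end match a gold word, so splitting one gold word into two costs both precision and recall.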

Citation

Please cite the project if you use this source code

@online{undertheseanlp/word_tokenize,
  author = {Vu Anh and Bui Nhat Anh and Doan Viet Dung},
  year = {2018},
  title = {Xây dựng hệ thống tách từ tiếng Việt},
  url = {https://github.com/undertheseanlp/word_tokenize}
}

License

The project's source code is distributed under the GPL-3.0 license.

About

Vietnamese Word Tokenize

Topics

nlp natural-language-processing vietnamese word-segmentation vietnamese-nlp
