Remove Vietnamese Accents - Xoá Dấu Tiếng Việt In Python · GitHub

J2TEAM/remove_accents.py (forked from cinoss/remove_accents.py), created August 31, 2016 17:11

s1 = u'ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚÝàáâãèéêìíòóôõùúýĂăĐđĨĩŨũƠơƯưẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ'

s0 = u'AAAAEEEIIOOOOUUYaaaaeeeiioooouuyAaDdIiUuOoUuAaAaAaAaAaAaAaAaAaAaAaAaEeEeEeEeEeEeEeEeIiIiOoOoOoOoOoOoOoOoOoOoOoOoUuUuUuUuUuUuUuYyYyYyYy'

def remove_accents(input_str):
    # Each accented character in s1 maps to the character at the same index in s0.
    s = ''
    print(input_str)  # debug output of the input string
    for c in input_str:
        if c in s1:
            s += s0[s1.index(c)]
        else:
            s += c
    return s
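The per-character lookup above (`c in s1` followed by `s1.index(c)`) can also be expressed with Python 3's built-in `str.maketrans` and `str.translate`. The following is only a sketch under assumptions: `ACCENTED`, `PLAIN`, and `remove_accents_translate` are hypothetical names, and the mapping is deliberately abbreviated. The full `s1`/`s0` pair from the gist could be passed to `str.maketrans` in the same way, provided both strings have equal length.

```python
# Abbreviated, hypothetical mapping for illustration only; it is not the
# full table from the gist.
ACCENTED = "àáâãèéêđĐ"
PLAIN = "aaaaeeedD"

# str.maketrans(x, y) requires x and y to have the same length and maps
# x[i] to y[i].
TABLE = str.maketrans(ACCENTED, PLAIN)


def remove_accents_translate(text):
    """Replace every character found in TABLE; leave all others untouched."""
    return text.translate(TABLE)


print(remove_accents_translate("Đã è"))
```

Building the table once avoids rescanning the source string for every input character, which may matter for long inputs.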

tnq177 commented Apr 12, 2017

thanks

huseyin39 commented Jun 18, 2018

Thanks

nvlong198 commented Feb 15, 2019

many thanks <3

locchung commented Jun 24, 2019

genius

trieuhaivo commented Jul 23, 2019

An additional contribution:

import re

def no_accent_vietnamese(s):
    s = re.sub(r'[àáạảãâầấậẩẫăằắặẳẵ]', 'a', s)
    s = re.sub(r'[ÀÁẠẢÃĂẰẮẶẲẴÂẦẤẬẨẪ]', 'A', s)
    s = re.sub(r'[èéẹẻẽêềếệểễ]', 'e', s)
    s = re.sub(r'[ÈÉẸẺẼÊỀẾỆỂỄ]', 'E', s)
    s = re.sub(r'[òóọỏõôồốộổỗơờớợởỡ]', 'o', s)
    s = re.sub(r'[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]', 'O', s)
    s = re.sub(r'[ìíịỉĩ]', 'i', s)
    s = re.sub(r'[ÌÍỊỈĨ]', 'I', s)
    s = re.sub(r'[ùúụủũưừứựửữ]', 'u', s)
    s = re.sub(r'[ƯỪỨỰỬỮÙÚỤỦŨ]', 'U', s)
    s = re.sub(r'[ỳýỵỷỹ]', 'y', s)
    s = re.sub(r'[ỲÝỴỶỸ]', 'Y', s)
    s = re.sub(r'[Đ]', 'D', s)
    s = re.sub(r'[đ]', 'd', s)
    return s

if __name__ == '__main__':
    print(no_accent_vietnamese("Việt Nam Đất Nước Con Người"))
    print(no_accent_vietnamese("Welcome to Vietnam !"))
    print(no_accent_vietnamese("VIỆT NAM ĐẤT NƯỚC CON NGƯỜI"))

# Output
# Viet Nam Dat Nuoc Con Nguoi
# Welcome to Vietnam !
# VIET NAM DAT NUOC CON NGUOI

Alternatively, you can install and use the unidecode library:

pip install unidecode

from unidecode import unidecode

print(unidecode("Việt Nam Đất Nước Con Người"))
print(unidecode("Welcome to Vietnam !"))
print(unidecode("VIỆT NAM ĐẤT NƯỚC CON NGƯỜI"))

# Output
# Viet Nam Dat Nuoc Con Nguoi
# Welcome to Vietnam !
# VIET NAM DAT NUOC CON NGUOI

tmhung-nt commented Dec 21, 2019

thanks mates

vietvudanh commented Feb 3, 2020

pandas column version

def no_accent_vietnamese_col(df, col):
    # Apply the same character-class replacements to a whole pandas column.
    s = df[col]
    s = s.replace(r'[àáạảãâầấậẩẫăằắặẳẵ]', 'a', regex=True)
    s = s.replace(r'[ÀÁẠẢÃĂẰẮẶẲẴÂẦẤẬẨẪ]', 'A', regex=True)
    s = s.replace(r'[èéẹẻẽêềếệểễ]', 'e', regex=True)
    s = s.replace(r'[ÈÉẸẺẼÊỀẾỆỂỄ]', 'E', regex=True)
    s = s.replace(r'[òóọỏõôồốộổỗơờớợởỡ]', 'o', regex=True)
    s = s.replace(r'[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]', 'O', regex=True)
    s = s.replace(r'[ìíịỉĩ]', 'i', regex=True)
    s = s.replace(r'[ÌÍỊỈĨ]', 'I', regex=True)
    s = s.replace(r'[ùúụủũưừứựửữ]', 'u', regex=True)
    s = s.replace(r'[ƯỪỨỰỬỮÙÚỤỦŨ]', 'U', regex=True)
    s = s.replace(r'[ỳýỵỷỹ]', 'y', regex=True)
    s = s.replace(r'[ỲÝỴỶỸ]', 'Y', regex=True)
    s = s.replace(r'[Đ]', 'D', regex=True)
    s = s.replace(r'[đ]', 'd', regex=True)
    return s

truong0vanchien commented Aug 6, 2021

Thank you.

lacls commented Apr 7, 2022

Error case: no_accent_vietnamese("Nguyễn Võ Tấn Đạt") Output: 'Nguyẽn Võ Tán Dạt'
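One plausible cause of this output, stated as an assumption because the comment does not show the raw code points: the input stored its tone marks as separate combining characters (for example `ê` followed by U+0303 COMBINING TILDE) rather than as a single precomposed letter such as `ễ`. The regex then replaces the base letter and leaves the combining mark behind. The sketch below reproduces that behaviour with one rule from the regex approach and shows that normalizing to NFC first avoids it; `E_CLASS`, `strip_e`, and `mixed` are hypothetical names.

```python
import re
import unicodedata

# One rule from the regex approach; the pattern itself is normalized to NFC
# so that it is guaranteed to contain precomposed letters only.
E_CLASS = unicodedata.normalize("NFC", "[èéẹẻẽêềếệểễ]")


def strip_e(s):
    """Replace accented lowercase 'e' variants with a plain 'e'."""
    return re.sub(E_CLASS, "e", s)


# "Nguyễn" written as precomposed ê (U+00EA) plus a combining tilde (U+0303).
mixed = "Nguy\u00ea\u0303n"

# The ê is replaced, but the combining tilde survives and attaches to the 'e'.
print(ascii(strip_e(mixed)))

# After NFC normalization the pair becomes the single letter ễ (U+1EC5),
# which the character class does match.
print(strip_e(unicodedata.normalize("NFC", mixed)))
```

Under this assumption, calling `unicodedata.normalize("NFC", s)` at the top of `no_accent_vietnamese` would be a small change that covers this case.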

maycuatroi1 commented Aug 23, 2022

new method:

pip install unidecode

import unidecode

accented_string = u'Việt Nam đất nước con người'
# accented_string is of type 'unicode'
unaccented_string = unidecode.unidecode(accented_string)
print(unaccented_string)
# output: Viet Nam dat nuoc con nguoi

ejinguyen commented Jan 3, 2023

I'm also running into a case like the one below: "Phú Mỹ Hưng" "Phú Mỹ Hưng"

The two strings above look the same, but they are not the same once encoded! The character list should be extended to cover other encodings!
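Strings that render identically but encode differently are commonly the result of different Unicode normalization forms (precomposed NFC versus decomposed NFD); that is an assumption here, since the comment does not give the bytes. Rather than extending the character list, a hedged alternative is to decompose the text and drop the combining marks, which handles both forms. `strip_accents` is a hypothetical helper, not part of the gist, and it removes every combining mark, not only Vietnamese ones.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, drop combining marks, then map đ/Đ explicitly."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # đ and Đ have no canonical decomposition, so they need their own rule.
    return without_marks.replace("đ", "d").replace("Đ", "D")


name = "Phú Mỹ Hưng"
composed = unicodedata.normalize("NFC", name)    # one code point per letter
decomposed = unicodedata.normalize("NFD", name)  # base letter + combining marks

# The two forms look the same but compare unequal and have different lengths.
print(composed == decomposed)
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Both forms reduce to the same unaccented string.
print(strip_accents(composed), strip_accents(decomposed))
```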

phineas-pta commented May 23, 2023

I made another, more complete version that handles the failing cases above:

https://gist.github.com/phineas-pta/05cad38a29fea000ab6d9e13a6f7e623
