Remove Vietnamese Accents - Xoá Dấu Tiếng Việt In Python · GitHub

J2TEAM/remove_accents.py, forked from cinoss/remove_accents.py. Created August 31, 2016 17:11.
remove_accents.py
s1 = u'ÀÁÂÃÈÉÊÌÍÒÓÔÕÙÚÝàáâãèéêìíòóôõùúýĂăĐđĨĩŨũƠơƯưẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊịỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ'
s0 = u'AAAAEEEIIOOOOUUYaaaaeeeiioooouuyAaDdIiUuOoUuAaAaAaAaAaAaAaAaAaAaAaAaEeEeEeEeEeEeEeEeIiIiOoOoOoOoOoOoOoOoOoOoOoOoUuUuUuUuUuUuUuYyYyYyYy'
def remove_accents(input_str):
    # Replace every character listed in s1 with the character at the
    # same position in s0; all other characters are kept unchanged.
    s = ''
    for c in input_str:
        if c in s1:
            s += s0[s1.index(c)]
        else:
            s += c
    return s
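The loop above calls s1.index(c) for every character, which rescans the table each time. As a minimal sketch, the same positional lookup can be written with Python 3's built-in str.maketrans and str.translate. The names SRC, DST, TABLE and remove_accents_fast are hypothetical, and SRC and DST are shortened stand-ins for the gist's s1 and s0, not the full tables.

```python
# Shortened stand-ins for the gist's s1 and s0 (assumption: only a few
# characters are listed to keep the sketch small; both strings must have
# the same length, or str.maketrans raises ValueError).
SRC = 'ÀàÁáĐđỆệƯư'
DST = 'AaAaDdEeUu'

# Build the character -> replacement table once, at import time.
TABLE = str.maketrans(SRC, DST)

def remove_accents_fast(text):
    # str.translate maps each character through TABLE; characters that
    # are not in the table are passed through unchanged.
    return text.translate(TABLE)

print(remove_accents_fast('Đà Lạt'))
```

With the full s1 and s0 in place of SRC and DST, this should give the same result as remove_accents for precomposed input, since both rely on the same position-by-position pairing.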

tnq177 commented Apr 12, 2017

thanks

huseyin39 commented Jun 18, 2018

Thanks

nvlong198 commented Feb 15, 2019

many thanks <3

locchung commented Jun 24, 2019

genius

trieuhaivo commented Jul 23, 2019

An additional contribution:

import re

def no_accent_vietnamese(s):
    s = re.sub(r'[àáạảãâầấậẩẫăằắặẳẵ]', 'a', s)
    s = re.sub(r'[ÀÁẠẢÃĂẰẮẶẲẴÂẦẤẬẨẪ]', 'A', s)
    s = re.sub(r'[èéẹẻẽêềếệểễ]', 'e', s)
    s = re.sub(r'[ÈÉẸẺẼÊỀẾỆỂỄ]', 'E', s)
    s = re.sub(r'[òóọỏõôồốộổỗơờớợởỡ]', 'o', s)
    s = re.sub(r'[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]', 'O', s)
    s = re.sub(r'[ìíịỉĩ]', 'i', s)
    s = re.sub(r'[ÌÍỊỈĨ]', 'I', s)
    s = re.sub(r'[ùúụủũưừứựửữ]', 'u', s)
    s = re.sub(r'[ƯỪỨỰỬỮÙÚỤỦŨ]', 'U', s)
    s = re.sub(r'[ỳýỵỷỹ]', 'y', s)
    s = re.sub(r'[ỲÝỴỶỸ]', 'Y', s)
    s = re.sub(r'[Đ]', 'D', s)
    s = re.sub(r'[đ]', 'd', s)
    return s

if __name__ == '__main__':
    print(no_accent_vietnamese("Việt Nam Đất Nước Con Người"))
    print(no_accent_vietnamese("Welcome to Vietnam !"))
    print(no_accent_vietnamese("VIỆT NAM ĐẤT NƯỚC CON NGƯỜI"))
    # Output
    # Viet Nam Dat Nuoc Con Nguoi
    # Welcome to Vietnam !
    # VIET NAM DAT NUOC CON NGUOI

Alternatively, install and use the unidecode library:

pip install unidecode

from unidecode import unidecode

print(unidecode("Việt Nam Đất Nước Con Người"))
print(unidecode("Welcome to Vietnam !"))
print(unidecode("VIỆT NAM ĐẤT NƯỚC CON NGƯỜI"))
# Output
# Viet Nam Dat Nuoc Con Nguoi
# Welcome to Vietnam !
# VIET NAM DAT NUOC CON NGUOI

tmhung-nt commented Dec 21, 2019

thanks mates

vietvudanh commented Feb 3, 2020

pandas column version

def no_accent_vietnamese_col(df, col):
    s = df[col]
    s = s.replace(r'[àáạảãâầấậẩẫăằắặẳẵ]', 'a', regex=True)
    s = s.replace(r'[ÀÁẠẢÃĂẰẮẶẲẴÂẦẤẬẨẪ]', 'A', regex=True)
    s = s.replace(r'[èéẹẻẽêềếệểễ]', 'e', regex=True)
    s = s.replace(r'[ÈÉẸẺẼÊỀẾỆỂỄ]', 'E', regex=True)
    s = s.replace(r'[òóọỏõôồốộổỗơờớợởỡ]', 'o', regex=True)
    s = s.replace(r'[ÒÓỌỎÕÔỒỐỘỔỖƠỜỚỢỞỠ]', 'O', regex=True)
    s = s.replace(r'[ìíịỉĩ]', 'i', regex=True)
    s = s.replace(r'[ÌÍỊỈĨ]', 'I', regex=True)
    s = s.replace(r'[ùúụủũưừứựửữ]', 'u', regex=True)
    s = s.replace(r'[ƯỪỨỰỬỮÙÚỤỦŨ]', 'U', regex=True)
    s = s.replace(r'[ỳýỵỷỹ]', 'y', regex=True)
    s = s.replace(r'[ỲÝỴỶỸ]', 'Y', regex=True)
    s = s.replace(r'[Đ]', 'D', regex=True)
    s = s.replace(r'[đ]', 'd', regex=True)
    return s

truong0vanchien commented Aug 6, 2021

Thank you.

truong0vanchien commented Aug 6, 2021

Thank you.

lacls commented Apr 7, 2022

Error case: no_accent_vietnamese("Nguyễn Võ Tấn Đạt") Output: 'Nguyẽn Võ Tán Dạt'
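The output reported here is what one would expect if parts of the input are stored as base letters followed by combining marks, which precomposed character classes do not match. That is an assumption about the reporter's input; the thread does not confirm it. The following is a minimal sketch that sidesteps the issue with the standard unicodedata module: decompose the text, then drop the combining marks. The function name strip_vietnamese_accents is hypothetical.

```python
import unicodedata

def strip_vietnamese_accents(text):
    # đ/Đ are separate letters with no Unicode decomposition,
    # so they have to be mapped by hand.
    text = text.replace('đ', 'd').replace('Đ', 'D')
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Drop the combining marks (Unicode category Mn) and keep everything else.
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

# The same name, once fully precomposed (NFC) and once fully decomposed (NFD).
name_nfc = unicodedata.normalize('NFC', 'Nguyễn Võ Tấn Đạt')
name_nfd = unicodedata.normalize('NFD', name_nfc)

print(strip_vietnamese_accents(name_nfc))
print(strip_vietnamese_accents(name_nfd))
```

Because the text is decomposed before filtering, it does not matter which form the input arrives in. Note that this removes combining marks from any script, not only Vietnamese, which may or may not be wanted.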

maycuatroi1 commented Aug 23, 2022

new method:

pip install unidecode

import unidecode

accented_string = u'Việt Nam đất nước con người'
# accented_string is of type 'unicode'
unaccented_string = unidecode.unidecode(accented_string)
print(unaccented_string)
# output: Viet Nam dat nuoc con nguoi

ejinguyen commented Jan 3, 2023

I'm also running into a case like the one below: "Phú Mỹ Hưng" "Phú Mỹ Hưng"

The two strings above look identical, but they are not the same once encoded! The character list should be extended to cover other encodings!
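One plausible cause, assumed here rather than confirmed by the comment, is that the two strings use different Unicode normalization forms: one precomposed (NFC), the other decomposed (NFD). The sketch below shows how such strings differ and how normalizing both to NFC makes them comparable. The helper name same_text is hypothetical.

```python
import unicodedata

composed = unicodedata.normalize('NFC', 'Phú Mỹ Hưng')
decomposed = unicodedata.normalize('NFD', composed)

# The two strings render the same but hold different code points.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 11 14

def same_text(a, b):
    # Compare two strings after bringing both to the same normal form.
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)

print(same_text(composed, decomposed))  # True
```

Normalizing input to NFC before applying a precomposed character table is one way to make table-based approaches see a single spelling of each letter.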

phineas-pta commented May 23, 2023

I made another, more complete version that handles the failing cases above:

https://gist.github.com/phineas-pta/05cad38a29fea000ab6d9e13a6f7e623
