Unicode is great, but there are places where the conversion ends up with unintelligible characters. I first noticed this with curly quotes entered in forms on our site.
unicode_to_ascii converts compound characters to close ASCII approximations, for example u with umlaut (ü) to u, or the one-half fraction glyph (½) to 1/2. You can add further mappings to CHAR_REPLACEMENT.
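The fallback for characters that are not listed in the table relies on Unicode decomposition data. The following is a minimal Python 3 sketch of that idea, not the snippet itself; the helper name base_char is made up for illustration.

```python
import unicodedata

def base_char(ch):
    """Return the first code point of ch's canonical decomposition, else ch.

    base_char is a hypothetical helper for illustration only.
    """
    de = unicodedata.decomposition(ch)
    # Compatibility decompositions start with a tag such as "<fraction>";
    # this sketch leaves those characters alone.
    if not de or de.startswith("<"):
        return ch
    return chr(int(de.split(None, 1)[0], 16))

print(repr(unicodedata.decomposition("\u00fc")))  # u with umlaut: base letter plus combining mark
print(base_char("\u00fc"))
print(repr(unicodedata.decomposition("\u00e6")))  # ae ligature: empty, so it needs a table entry
```

Characters such as the ae ligature or the fraction glyphs have no canonical decomposition, which is why they need explicit table entries.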
import unicodedata, sys
# Translation dictionary.  Translation entries are added to this
# dictionary as needed.
CHAR_REPLACEMENT = {
    # latin-1 characters that don't have a unicode decomposition
    0xc6: u"AE", # LATIN CAPITAL LETTER AE
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
    0xdf: u"ss", # LATIN SMALL LETTER SHARP S
    0xe6: u"ae", # LATIN SMALL LETTER AE
    0xf0: u"d",  # LATIN SMALL LETTER ETH
    0xf8: u"oe", # LATIN SMALL LETTER O WITH STROKE
    0xfe: u"th", # LATIN SMALL LETTER THORN
    0x2018: u"'", # LEFT SINGLE QUOTATION MARK
    0x2019: u"'", # RIGHT SINGLE QUOTATION MARK
    0x201c: u'"', # LEFT DOUBLE QUOTATION MARK
    0x201d: u'"', # RIGHT DOUBLE QUOTATION MARK
    0x215D: u"5/8", # VULGAR FRACTION FIVE EIGHTHS
    0x215A: u"5/6", # VULGAR FRACTION FIVE SIXTHS
    0x2158: u"4/5", # VULGAR FRACTION FOUR FIFTHS
    0x215B: u"1/8", # VULGAR FRACTION ONE EIGHTH
    0x2155: u"1/5", # VULGAR FRACTION ONE FIFTH
    0x00BD: u"1/2", # VULGAR FRACTION ONE HALF
    0x00BC: u"1/4", # VULGAR FRACTION ONE QUARTER
    0x2159: u"1/6", # VULGAR FRACTION ONE SIXTH
    0x2153: u"1/3", # VULGAR FRACTION ONE THIRD
    0x215E: u"7/8", # VULGAR FRACTION SEVEN EIGHTHS
    0x215C: u"3/8", # VULGAR FRACTION THREE EIGHTHS
    0x2157: u"3/5", # VULGAR FRACTION THREE FIFTHS
    0x00BE: u"3/4", # VULGAR FRACTION THREE QUARTERS
    0x2156: u"2/5", # VULGAR FRACTION TWO FIFTHS
    0x2154: u"2/3", # VULGAR FRACTION TWO THIRDS
}
class unaccented_map(dict):
    """
    Maps a unicode character code (the key) to a replacement code
    (either a character code or a unicode string).
    """
    def mapchar(self, key):
        ch = self.get(key)
        if ch is not None:
            return ch
        
        de = unicodedata.decomposition(unichr(key))
        if key not in CHAR_REPLACEMENT and de:
            try:
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                ch = key
        else:
            ch = CHAR_REPLACEMENT.get(key, key)
        self[key] = ch
        return ch
    if sys.version_info >= (2, 5):
        # use __missing__ where available
        __missing__ = mapchar
    else:
        # otherwise, use standard __getitem__ hook (this is slower,
        # since it's called for each character)
        __getitem__ = mapchar
def unicode_to_ascii(unicodestring):
    """
    Convert a unicode string into an ASCII representation, converting non-ascii
    characters into close approximations where possible.
    
    Special thanks to http://effbot.org/zone/unicode-convert.htm
    
    @param Unicode String unicodestring  The string to translate
    @result String
    """
    charmap = unaccented_map()
    return unicodestring.translate(charmap).encode("ascii", "ignore")
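
The listing above targets Python 2 (unichr, u"" literals). As a rough usage sketch, here is how the same approach might look on Python 3. The names REPLACEMENTS, UnaccentedMap and to_ascii are illustrative assumptions, and the table is trimmed to a few entries.

```python
import unicodedata

# Trimmed, illustrative copy of the replacement table (assumption: the
# full table from the snippet would be used in practice).
REPLACEMENTS = {
    0xdf: "ss",                  # LATIN SMALL LETTER SHARP S
    0xe6: "ae",                  # LATIN SMALL LETTER AE
    0x2018: "'", 0x2019: "'",    # single curly quotes
    0x201c: '"', 0x201d: '"',    # double curly quotes
    0xbd: "1/2",                 # VULGAR FRACTION ONE HALF
}

class UnaccentedMap(dict):
    """Lazily maps a code point to its replacement and caches the result."""

    def __missing__(self, key):
        ch = REPLACEMENTS.get(key)
        if ch is None:
            de = unicodedata.decomposition(chr(key))
            try:
                # First field of the decomposition is the base character.
                ch = int(de.split(None, 1)[0], 16)
            except (IndexError, ValueError):
                # No decomposition, or a tagged one such as "<compat>".
                ch = key
        self[key] = ch
        return ch

def to_ascii(text):
    """Return an ASCII-only str; unmapped non-ASCII characters are dropped."""
    return text.translate(UnaccentedMap()).encode("ascii", "ignore").decode("ascii")

print(to_ascii("\u201cM\u00fcller\u201d ate \u00bd a cr\u00e8me br\u00fbl\u00e9e"))
```

Unlike the Python 2 version, this sketch decodes the result back to str; drop the final decode if you want bytes.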
Comments
Converting to HTML entities is fine if you are sending the resulting text to something that will decode it for the user. If you are passing raw text, seeing ”Hello“ looks a bit odd to people compared to "Hello".
#
Thanks so much for this snippet! I had hacked together a crude string-replacement script to achieve this, but your solution is much more elegant. Just two things I'd add:
At the beginning of the unicode_to_ascii function, I added 'unicodestring = unicode(unicodestring)' to also catch regular strings that might have unicode characters.
I also added some entries to the translation dict to account for Portuguese accented characters, as well as the cedilla (ç):
0xe0: u'a', 0xe1: u'a', 0xe3: u'a', 0xe8: u'e', 0xe9: u'e', 0xea: u'e', 0xec: u'i', 0xed: u'i', 0xf3: u'o', 0xf2: u'o', 0xf5: u'o', 0xfa: u'u', 0xf9: u'u', 0xe7: u'c',
Thanks again!
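A quick Python 3 check, offered as a sketch rather than a correction: the code points listed above all appear to have canonical decompositions, so the snippet's decomposition fallback should already reduce them to their base letters, and the extra entries are likely harmless but optional. The helper name first_decomposed is made up for illustration.

```python
import unicodedata

# Code points from the comment above (Portuguese accented letters and c-cedilla).
extra = [0xe0, 0xe1, 0xe3, 0xe8, 0xe9, 0xea, 0xec, 0xed,
         0xf3, 0xf2, 0xf5, 0xfa, 0xf9, 0xe7]

def first_decomposed(cp):
    """Return the base character from a canonical decomposition, or None."""
    de = unicodedata.decomposition(chr(cp))
    if not de or de.startswith("<"):
        return None
    return chr(int(de.split()[0], 16))

result = "".join(first_decomposed(cp) or "?" for cp in extra)
print(result)  # a "?" would mark a code point that needs an explicit entry
```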
#