Character encoding fix

April 25, 2008
django python unicode latin1 character encoding

There is a commonly encountered problem with Django and character sets. Windows applications such as Word and Outlook insert characters that are not valid ISO-8859-1, which causes errors when saving a Django model to a database with a Latin-1 encoding. Even if your database uses UTF-8, these characters should still be converted to avoid display issues for end users. The topic is well covered at Effbot, which includes a list of appropriate conversions for each of the problem characters.
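To make the symptom concrete, here is a minimal Python 3 sketch (the snippet on this page is Python 2, so this is an illustration rather than the original code). The sample bytes are made up; they stand in for text pasted from Word, which uses Windows-1252 (cp1252).

```python
# Hypothetical sample: curly quotes and an en dash as cp1252 bytes.
raw = b"\x93smart quotes\x94 and a dash \x96"

# Decoded with the right codec, the bytes become real punctuation.
as_cp1252 = raw.decode("cp1252")

# Decoded as ISO-8859-1, the same bytes become invisible C1 control
# characters (U+0093, U+0094, U+0096): the "gremlins".
as_latin1 = raw.decode("iso-8859-1")

# The correctly decoded text cannot be stored in a Latin-1 column.
try:
    as_cp1252.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

print(repr(as_cp1252), repr(as_latin1), fits_latin1)
```

Either outcome is a problem: the text fails to encode for a Latin-1 database, or it is stored with control characters that do not display.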

Correcting this for all of your Django models is another issue. Do you handle the re-encoding during the form validation? The save for each model? Create a base class that all your models need to inherit from?

The simplest solution I have come up with uses signals.

Combining the re-encoding method suggested at Effbot and the pre_save signal gives you the ability to convert all the problem characters right before the save occurs for any model.

The kill_gremlins method has been replaced with gabor's suggestion from the comments below:

from django.db import models
from django.db.models import signals
from django.dispatch import dispatcher


def kill_gremlins(text):
    # Re-read characters that were decoded as ISO-8859-1 as Windows-1252.
    return unicode(text).encode('iso-8859-1').decode('cp1252')


def charstrip(sender, instance):
    # Clean every non-empty CharField and TextField value on the instance.
    for field in instance._meta.fields:
        if type(field) in (models.TextField, models.CharField):
            value = getattr(instance, field.name)
            if value:
                setattr(instance, field.name, kill_gremlins(value))


# No sender is given, so charstrip runs before any model is saved.
dispatcher.connect(charstrip, signal=signals.pre_save)
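
For experimenting outside Django, the round trip can be tried on its own. This is a minimal Python 3 sketch, assuming `str` in place of Python 2's `unicode`; the input string is a made-up example of cp1252 text that was wrongly decoded as ISO-8859-1.

```python
def kill_gremlins(text):
    # Python 3 port (an assumption, not the original code):
    # str replaces Python 2's unicode built-in.
    return str(text).encode("iso-8859-1").decode("cp1252")


# Hypothetical input containing C1 control characters where
# curly quotes, an en dash and an apostrophe should be.
mangled = "\x93Hello\x94 \x96 it\x92s caf\xe9 time"
cleaned = kill_gremlins(mangled)
print(cleaned)
```

Ordinary Latin-1 characters such as the accented e pass through unchanged, because ISO-8859-1 and cp1252 agree outside the 0x80 to 0x9F range.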


gabor (on April 25, 2008):
very nice/clean approach with the signals,

but the kill_gremlins function seems to be a little over-complex to me.

i mean, cannot we achieve the same with:

    def kill_gremlins(text):
        return text.encode('iso-8859-1').decode('cp1252')

?

(assuming that we are dealing with mishandled unicode-strings.)


mrtron (on April 26, 2008):
Yes, that does appear to work correctly. I thought that route would drop the non iso compatible characters, but it appears to be correctly making the conversion. Very nice, I will update the method.


effbot (on July 16, 2008):
Note that encode("iso-8859-1") does not handle non-latin-1 characters in a Unicode string (obviously):

    >>> s = u"\u1234" # random unicode character
    >>> unicodedata.name(s)
    'ETHIOPIC SYLLABLE SEE'
    >>> s.encode("iso-8859-1").decode("cp1252")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u1234' in position 0: ordinal not in range(256)

Maybe the usecase for this snippet is more limited, but it's not a full replacement for my (rather dated) code.
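
One way around the limitation effbot describes is to remap only the characters in the U+0080 to U+009F range and leave everything else untouched. The following is a minimal Python 3 sketch of that idea; it is not effbot's code, and the name `kill_gremlins_tolerant` is made up for illustration.

```python
def kill_gremlins_tolerant(text):
    # Hypothetical variant: convert character by character so that
    # text outside Latin-1 never hits the iso-8859-1 encoder.
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range (0x81, for example) have
                # no cp1252 mapping; keep the character as it is.
                pass
        out.append(ch)
    return "".join(out)


# The Ethiopic character from the comment above no longer raises.
print(kill_gremlins_tolerant("\u1234 \x93quoted\x94"))
```

The trade-off is a per-character loop instead of a single codec round trip, which should only matter for very large text fields.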

