djangosnippets: Improved Pickled Object Field

Author:: taavi223
Posted:: August 20, 2009
Language:: Python
Version:: 1.1
Score:: 3 (after 3 ratings)

Update 10/10/09: Further development is now occurring on GitHub, thanks to Shrubbery Software.

Incredibly useful for storing just about anything in the database (provided it is Pickle-able, of course) when there isn't a 'proper' field for the job.

PickledObjectField is database-agnostic, and should work with any database backend you can throw at it. You can pass in any Python object and it will automagically be converted behind the scenes. You never have to manually pickle or unpickle anything. Also works fine when querying; supports exact, in, and isnull lookups. It should be noted, however, that calling QuerySet.values() will only return the encoded data, not the original Python object.
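
To illustrate the last point, here is a minimal, standard-library-only sketch of decoding rows shaped like the output of QuerySet.values(). The helpers `encode` and `decode` are stand-ins written for Python 3 that mirror what the snippet's dbsafe_encode and dbsafe_decode do without compression; the rows and the column name are made up for the example.

```python
from base64 import b64decode, b64encode
from pickle import dumps, loads


def encode(value):
    # Stand-in for dbsafe_encode (no compression): pickle, then base64.
    return b64encode(dumps(value, protocol=2)).decode("ascii")


def decode(text):
    # Stand-in for dbsafe_decode (no compression): base64, then unpickle.
    return loads(b64decode(text))


# Hypothetical rows, shaped like what QuerySet.values("id", "pickle_field")
# would hand back: the pickled column is still an encoded string.
rows = [
    {"id": 1, "pickle_field": encode({"colour": "red"})},
    {"id": 2, "pickle_field": encode([1, 2, 3])},
    {"id": 3, "pickle_field": None},  # None is stored as NULL, never encoded
]

decoded = []
for row in rows:
    raw = row["pickle_field"]
    # Leave NULLs alone; decode everything else back into a Python object.
    decoded.append(dict(row, pickle_field=None if raw is None else decode(raw)))
```

In a real project the same loop would call the dbsafe_decode function imported from fields.py, passing the field's compress setting as the second argument.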

Please note that this is supposed to be two files, one fields.py and one tests.py (if you don't care about the unit tests, just use fields.py).

This PickledObjectField has a few improvements over the one in snippet #513.

1. This one solves the DjangoUnicodeDecodeError problem when saving an object containing non-ASCII data by base64 encoding the pickled output stream. This ensures that all stored data is ASCII, eliminating the problem.
2. PickledObjectField will now optionally use zlib to compress (and uncompress) pickled objects on the fly. This can be set per-field using the keyword argument "compress=True". For most items this is probably not worth the small performance penalty, but for Models with larger objects, it can be a real space saver.
3. You can also now specify the pickle protocol per-field, using the protocol keyword argument. The default of 2 should always work, unless you are trying to access the data from outside of the Django ORM.
4. Worked around a rare issue when using cPickle and performing lookups of complex data types. In short, cPickle would sometimes output different streams for the same object depending on how it was referenced. This of course could cause lookups for complex objects to fail, even when a matching object exists. See the docstrings and tests for more information.
5. You can now use the isnull lookup and have it function as expected. A consequence of this is that by default, PickledObjectField has null=True set (you can of course pass null=False if you want to change that). With null=False (the usual default for Django fields), you would not be able to store a Python None value, since None values aren't pickled or encoded (this in turn is what makes the isnull lookup possible).
6. You can now pass in an object as the default argument for the field without it being converted to a unicode string first. If you pass in a callable though, the field will still call it. It will not try to pickle and encode it.
7. You can manually import dbsafe_encode and dbsafe_decode from fields.py if you want to encode and decode objects yourself. This is mostly useful for decoding values returned from calling QuerySet.values(), which are still encoded strings.
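
The encoding pipeline described in the list (pickle, optional zlib compression, then base64) can be sketched with the standard library alone. This is a rough Python 3 approximation, not the snippet's own code: `sketch_encode` and `sketch_decode` are made-up names, and the None check, which in the snippet lives in the field's get_db_prep_value and to_python methods, is folded into the helpers here for brevity.

```python
import pickle
import zlib
from base64 import b64decode, b64encode
from copy import deepcopy


def sketch_encode(value, compress_object=False, protocol=2):
    # None is passed through untouched so it can be stored as NULL.
    if value is None:
        return None
    # A fixed protocol keeps the byte stream stable across Python upgrades.
    data = pickle.dumps(deepcopy(value), protocol)
    if compress_object:
        data = zlib.compress(data)
    # base64 guarantees the stored text is plain ASCII.
    return b64encode(data).decode("ascii")


def sketch_decode(text, compress_object=False):
    if text is None:
        return None
    data = b64decode(text)
    if compress_object:
        data = zlib.decompress(data)
    return pickle.loads(data)


# Sample value with non-ASCII text and a highly repetitive payload.
value = {"name": "caf\u00e9", "padding": "x" * 500}
plain = sketch_encode(value)
packed = sketch_encode(value, compress_object=True)
```

Both strings decode back to the original dictionary, provided the same compress_object flag is used on the way out. For a repetitive value like this one the compressed form is much shorter; for small or random values it may not be.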

The tests have been updated to match the added features, but if you find any bugs, please post them in the comments. My goal is to make this an error-proof implementation.

Note: If you are trying to store other django models in the PickledObjectField, please see the comments for a discussion on the problems associated with doing that. The easy solution is to put django models into a list or tuple before assigning them to the PickledObjectField.

Update 9/2/09: Fixed the value_to_string method so that serialization should now work as expected. Also added deepcopy back into dbsafe_encode, fixing #4 above, since deepcopy had somehow managed to remove itself. This means that lookups should once again work as expected in all situations. Also made the field editable=False by default (which I swear I already did once before!) since it is never a good idea to have a PickledObjectField be user editable.

# --------------------------------------- fields.py  --------------------------------------- #

from copy import deepcopy
from base64 import b64encode, b64decode
from zlib import compress, decompress
try:
    from cPickle import loads, dumps
except ImportError:
    from pickle import loads, dumps

from django.db import models
from django.utils.encoding import force_unicode

class PickledObject(str):
    """
    A subclass of string so it can be told whether a string is a pickled
    object or not (if the object is an instance of this class then it must
    [well, should] be a pickled one).
    
    Only really useful for passing pre-encoded values to ``default``
    with ``dbsafe_encode``, not that doing so is necessary. If you
    remove PickledObject and its references, you won't be able to pass
    in pre-encoded values anymore, but you can always just pass in the
    python objects themselves.
    
    """
    pass

def dbsafe_encode(value, compress_object=False):
    """
    We use deepcopy() here to avoid a problem with cPickle, where dumps
    can generate different character streams for same lookup value if
    they are referenced differently. 
    
    The reason this is important is because we do all of our lookups as
    simple string matches, thus the character streams must be the same
    for the lookups to work properly. See tests.py for more information.
    """
    if not compress_object:
        value = b64encode(dumps(deepcopy(value)))
    else:
        value = b64encode(compress(dumps(deepcopy(value))))
    return PickledObject(value)

def dbsafe_decode(value, compress_object=False):
    if not compress_object:
        value = loads(b64decode(value))
    else:
        value = loads(decompress(b64decode(value)))
    return value

class PickledObjectField(models.Field):
    """
    A field that will accept *any* python object and store it in the
    database. PickledObjectField will optionally compress its values if
    declared with the keyword argument ``compress=True``.
    
    Does not actually encode and compress ``None`` objects (although you
    can still do lookups using None). This way, it is still possible to
    use the ``isnull`` lookup type correctly. Because of this, the field
    defaults to ``null=True``, as otherwise it wouldn't be able to store
    None values since they aren't pickled and encoded.
    
    """
    __metaclass__ = models.SubfieldBase
    
    def __init__(self, *args, **kwargs):
        self.compress = kwargs.pop('compress', False)
        self.protocol = kwargs.pop('protocol', 2)
        kwargs.setdefault('null', True)
        kwargs.setdefault('editable', False)
        super(PickledObjectField, self).__init__(*args, **kwargs)
    
    def get_default(self):
        """
        Returns the default value for this field.
        
        The default implementation on models.Field calls force_unicode
        on the default, which means you can't set arbitrary Python
        objects as the default. To fix this, we just return the value
        without calling force_unicode on it. Note that if you set a
        callable as a default, the field will still call it. It will
        *not* try to pickle and encode it.
        
        """
        if self.has_default():
            if callable(self.default):
                return self.default()
            return self.default
        # If the field doesn't have a default, then we punt to models.Field.
        return super(PickledObjectField, self).get_default()

    def to_python(self, value):
        """
        B64decode and unpickle the object, optionally decompressing it.
        
        If an error is raised in de-pickling and we're sure the value is
        a definite pickle, the error is allowed to propagate. If we
        aren't sure if the value is a pickle or not, then we catch the
        error and return the original value instead.
        
        """
        if value is not None:
            try:
                value = dbsafe_decode(value, self.compress)
            except:
                # If the value is a definite pickle and an error is raised in
                # de-pickling, it should be allowed to propagate.
                if isinstance(value, PickledObject):
                    raise
        return value

    def get_db_prep_value(self, value):
        """
        Pickle and b64encode the object, optionally compressing it.
        
        The pickling protocol is specified explicitly (by default 2),
        rather than as -1 or HIGHEST_PROTOCOL, because we don't want the
        protocol to change over time. If it did, ``exact`` and ``in``
        lookups would likely fail, since pickle would now be generating
        a different string. 
        
        """
        if value is not None and not isinstance(value, PickledObject):
            # We call force_unicode here explicitly, so that the encoded string
            # isn't rejected by the postgresql_psycopg2 backend. Alternatively,
            # we could have just registered PickledObject with the psycopg
            # marshaller (telling it to store it like it would a string), but
            # since both of these methods result in the same value being stored,
            # doing things this way is much easier.
            value = force_unicode(dbsafe_encode(value, self.compress))
        return value

    def value_to_string(self, obj):
        value = self._get_val_from_obj(obj)
        return self.get_db_prep_value(value)

    def get_internal_type(self): 
        return 'TextField'
    
    def get_db_prep_lookup(self, lookup_type, value):
        if lookup_type not in ['exact', 'in', 'isnull']:
            raise TypeError('Lookup type %s is not supported.' % lookup_type)
        # The Field model already calls get_db_prep_value before doing the
        # actual lookup, so all we need to do is limit the lookup types.
        return super(PickledObjectField, self).get_db_prep_lookup(lookup_type, value)

# --------------------------------------- tests.py  --------------------------------------- #

"""Unit testing for this module."""

from django.test import TestCase
from django.db import models
from fields import PickledObjectField

class TestingModel(models.Model):
    pickle_field = PickledObjectField()
    compressed_pickle_field = PickledObjectField(compress=True)
    default_pickle_field = PickledObjectField(default=({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))

class TestCustomDataType(str):
    pass

class PickledObjectFieldTests(TestCase):
    def setUp(self):
        self.testing_data = (
            {1:2, 2:4, 3:6, 4:8, 5:10},
            'Hello World',
            (1, 2, 3, 4, 5),
            [1, 2, 3, 4, 5],
            TestCustomDataType('Hello World'),
        )
        return super(PickledObjectFieldTests, self).setUp()
    
    def testDataIntegrity(self):
        """
        Tests that data remains the same when saved to and fetched from
        the database, whether compression is enabled or not.
        
        """
        for value in self.testing_data:
            model_test = TestingModel(pickle_field=value, compressed_pickle_field=value)
            model_test.save()
            model_test = TestingModel.objects.get(id__exact=model_test.id)
            # Make sure that both the compressed and uncompressed fields return
            # the same data, even though it's stored differently in the DB.
            self.assertEquals(value, model_test.pickle_field)
            self.assertEquals(value, model_test.compressed_pickle_field)
            model_test.delete()
        
        # Make sure the default value for default_pickle_field gets stored
        # correctly and that it isn't converted to a string.
        model_test = TestingModel()
        model_test.save()
        model_test = TestingModel.objects.get(id__exact=model_test.id)
        self.assertEquals(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]), model_test.default_pickle_field)


    def testLookups(self):
        """
        Tests that lookups can be performed on data once stored in the
        database, whether compression is enabled or not.
        
        One problem with cPickle is that it will sometimes output
        different streams for the same object, depending on how they are
        referenced. It should be noted though, that this does not happen
        for every object, but usually only with more complex ones.
                
        >>> from pickle import dumps
        >>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, \
        ... 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
        >>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, \
        ... 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))
        "((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5\ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
        >>> dumps(t)
        "((dp0\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np1\n(I1\nI2\nI3\nI4\nI5\ntp2\n(lp3\nI1\naI2\naI3\naI4\naI5\natp4\n."
        >>> # Both dumps() are the same using pickle.

        >>> from cPickle import dumps
        >>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
        >>> dumps(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]))
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
        >>> dumps(t)
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."
        >>> # But with cPickle the two dumps() are not the same!
        >>> # Both will generate the same object when loads() is called though.

        We can solve this by calling deepcopy() on the value before
        pickling it, as this copies everything to a brand new data
        structure.
        
        >>> from cPickle import dumps
        >>> from copy import deepcopy
        >>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
        >>> dumps(deepcopy(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])))
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
        >>> dumps(deepcopy(t))
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
        >>> # Using deepcopy() beforehand means that now both dumps() are identical.
        >>> # It may not be necessary, but deepcopy() ensures that lookups will always work.
        
        Unfortunately calling copy() alone doesn't seem to fix the
        problem as it lies primarily with complex data types.
        
        >>> from cPickle import dumps
        >>> from copy import copy
        >>> t = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
        >>> dumps(copy(({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])))
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\np2\n(I1\nI2\nI3\nI4\nI5\ntp3\n(lp4\nI1\naI2\naI3\naI4\naI5\nat."
        >>> dumps(copy(t))
        "((dp1\nI1\nI1\nsI2\nI4\nsI3\nI6\nsI4\nI8\nsI5\nI10\nsS'Hello World'\n(I1\nI2\nI3\nI4\nI5\nt(lp2\nI1\naI2\naI3\naI4\naI5\natp3\n."

        """
        for value in self.testing_data:
            model_test = TestingModel(pickle_field=value, compressed_pickle_field=value)
            model_test.save()
            # Make sure that we can do an ``exact`` lookup by both the
            # pickle_field and the compressed_pickle_field.
            model_test = TestingModel.objects.get(pickle_field__exact=value, compressed_pickle_field__exact=value)
            self.assertEquals(value, model_test.pickle_field)
            self.assertEquals(value, model_test.compressed_pickle_field)
            # Make sure that ``in`` lookups also work correctly.
            model_test = TestingModel.objects.get(pickle_field__in=[value], compressed_pickle_field__in=[value])
            self.assertEquals(value, model_test.pickle_field)
            self.assertEquals(value, model_test.compressed_pickle_field)
            # Make sure that ``isnull`` lookups are working.
            self.assertEquals(1, TestingModel.objects.filter(pickle_field__isnull=False).count())
            self.assertEquals(0, TestingModel.objects.filter(pickle_field__isnull=True).count())
            model_test.delete()
        
        # Make sure that lookups of the same value work, even when referenced
        # differently. See the above docstring for more info on the issue.
        value = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5])
        model_test = TestingModel(pickle_field=value, compressed_pickle_field=value)
        model_test.save()
        # Test lookup using an assigned variable.
        model_test = TestingModel.objects.get(pickle_field__exact=value)
        self.assertEquals(value, model_test.pickle_field)
        # Test lookup using direct input of a matching value.
        model_test = TestingModel.objects.get(
            pickle_field__exact = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]),
            compressed_pickle_field__exact = ({1: 1, 2: 4, 3: 6, 4: 8, 5: 10}, 'Hello World', (1, 2, 3, 4, 5), [1, 2, 3, 4, 5]),
        )
        self.assertEquals(value, model_test.pickle_field)
        model_test.delete()

Comments

jamesgpearce (on August 30, 2009):

I have a baffling problem with this. I am trying to save an (unconnected) model instance into the field.

It works when the instance is in a field in a new record: it gets nicely pickled in the INSERT and I can see it in the database.

But it does not work on update and NULL gets written to the database (regardless of my default). Programmatically, the field contains the instance prior to (and post) the save - but somewhere between there and the UPDATE SQL it goes missing.

(Everything works perfectly if the pickled instance is not of a subclass of models.Model. Even models.Manager can be pickled!)

I am no field-extension expert and I am having trouble tracking this down. In the meantime any thoughts?

jamesgpearce (on August 30, 2009):

To (mostly) answer my own question.

My issue is way down at the bottom of the code, just before the SQL execution of an UPDATE:

if hasattr(val, 'prepare_database_save'):
    val = val.prepare_database_save(field)
else:
    val = field.get_db_prep_save(val)

(It doesn't do this for INSERTs for reasons I don't quite understand... but that's why the inserts DO work)

Of course all models implement prepare_database_save (in order to get the ID for a foreign key relationship), and so the value turns into that key at the last minute (instead of going through your pickling code in get_db_prep_save).

And because my model is 'abstract' - in the sense that it hasn't gone into the database in the traditional way - it has no ID. Hence 'NULL' for the PickledObjectField value after an update.

Hard to find... not too hard to fix. (These 'picklable' models just need to derive from a super class that overrides that method to do get_db_prep_save instead).

Thought I'd go to the effort of writing it up, since I've seen at least one other person trying to do something similar (for an undo stack of model state).

Otherwise, a wonderful snippet.

taavi223 (on September 2, 2009):

James,

That's a nice find! I mainly use the field for storing dictionary data that is arbitrary and that I don't need to query against, so I probably never would have found that error.

I spent a little bit of time trying to find a true solution, but was unable to come up with one. An easy workaround however, is to wrap the model object inside of a list or tuple. Since the list/tuple would not have the prepare_database_save method, it will call the field's get_db_prep_value as usual. Not fully transparent, but it does prevent the problem from occurring.
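
The interaction James describes, and the wrapping workaround, can be mimicked without Django. In this sketch `prepare_for_update`, `FakeField` and `UnsavedModel` are made-up stand-ins, not Django APIs; only the `hasattr(val, 'prepare_database_save')` check is taken from the code quoted above.

```python
class FakeField:
    # Stand-in for a field; the real one would pickle and encode here.
    def get_db_prep_save(self, value):
        return "encoded:%r" % (value,)


class UnsavedModel:
    # Stand-in for a model instance that was never saved, so pk is None.
    pk = None

    def prepare_database_save(self, field):
        # Models hand back their primary key for foreign key columns.
        return self.pk

    def __repr__(self):
        return "<UnsavedModel>"


def prepare_for_update(val, field):
    # Same shape as the branch quoted from Django's UPDATE path.
    if hasattr(val, "prepare_database_save"):
        return val.prepare_database_save(field)
    return field.get_db_prep_save(val)


field = FakeField()
# A bare model short-circuits the field and yields its (missing) key.
bare = prepare_for_update(UnsavedModel(), field)
# A tuple has no prepare_database_save, so the field gets to encode it.
wrapped = prepare_for_update((UnsavedModel(),), field)
```

Here `bare` ends up as None, which corresponds to the NULL written on update, while `wrapped` goes through the field's own encoding path.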

Another possibility is to write a proxy class for the model you wish to store, like so:

from django.db import models
from fields import dbsafe_encode, PickledObjectField

class MyClass(models.Model):
    pickled_field = PickledObjectField()

class MyProxyClass(MyClass):

    def prepare_database_save(self, field):
        return dbsafe_encode(self, field.compress)

    class Meta:
        proxy=True

You can then use the proxy class when assigning a model to the PickledObjectField and it should work as expected (although I haven't tested this out explicitly). This probably won't work well if you're trying to store an arbitrary model though, since you'd need a proxy class for each and every model.

Let me know if you find any other problems; I'll do my best to help solve them.

In other news I've fixed a few bugs with the snippet. Despite my best efforts, a change I thought I made somehow wasn't included (although the docstrings mentioned it--so where did it go!?). To fix this, I've once again added the deepcopy function into dbsafe_encode, so now lookups should work in all cases.

Second, I fixed the snippet's value_to_string method, so that serialization should now actually work as expected. Before, serializing a model with a PickledObjectField would return not the encoded object as expected, but the encoded __repr__ of the object. Can't believe I missed that.

Finally, I've changed the field to now be editable=False by default. I had changed it to this earlier, but somehow (like with deepcopy) it managed to disappear. Having the object editable in the admin is a bad idea, since any stored object will be converted to a string for display and then upon save, the string will be written to the database instead of the original object.

ivankirigin (on September 13, 2009):

I just got this error: ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '})' at line 1")

I'm working with MySQL on OS X, and maybe a too old version of django. I got the same unicode errors others got with the original snippet, and I'm trying to solve it.

Any ideas on what the hell is going on?

Thanks

taavi223 (on September 13, 2009):

ivankirigin,

Can you post the actual traceback you're getting and what data you're trying to save into the PickledObjectField? Without some more details I don't know what the problem could be. Also, are you getting both the ProgrammingError and the DjangoUnicodeError? More details would really help with troubleshooting this...

erussell (on September 30, 2009):

I made a form field to edit PickledObjectFields as JSON in the admin. This doesn't work if you're storing objects in your pickled field that can't be JSON-encoded. But for simple objects like dictionaries, it works very well. Add the following to the PickledObjectField class:

def formfield(self, **kwargs):
    defaults = {'form_class': JSONField}
    defaults.update(kwargs)
    return super(PickledObjectField, self).formfield(**defaults)

Then add this code to fields.py:

from django import forms
from django.forms import widgets
from django.forms.util import flatatt, ValidationError
from django.utils import simplejson
from django.utils.safestring import mark_safe
from django.utils.html import conditional_escape

class JSONWidget (widgets.Widget):

    def __init__(self, attrs=None):
        self.attrs = {'cols': '84', 'rows': '5'}
        if attrs:
            self.attrs.update(attrs)

    def render (self, name, value, attrs=None, choices=()):
        if not isinstance(value, unicode):
            value = simplejson.dumps(value)
        final_attrs = self.build_attrs(attrs, name=name)
        return mark_safe(
                u'<textarea%s>%s</textarea>' % 
                    ( flatatt(final_attrs), conditional_escape(force_unicode(value)) ) 
            )

    def value_from_datadict(self, data, files, name):
        return data.get(name, u'{ }')

class JSONField (forms.Field):

    widget = JSONWidget

    default_error_messages = {
        'invalid': u'Enter a valid JSON string.'
    }

    def __init__(self, max_value=None, min_value=None, *args, **kwargs):
        super(JSONField, self).__init__(*args, **kwargs)

    def clean (self, value):
        super(JSONField, self).clean(value)
        if value is None or value == '':
            return { }
        try:
            value = simplejson.loads(value) 
        except ValueError:
            raise ValidationError(self.error_messages['invalid'])
        return value
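
Stripped of the Django form machinery, the validation in clean() amounts to the following. This is only a sketch: it uses the standard library's json module in place of django.utils.simplejson and a plain ValueError in place of ValidationError, and `clean_json` is a made-up name.

```python
import json


def clean_json(value):
    # Empty input is treated as an empty dictionary, as in JSONField.clean().
    if value is None or value == "":
        return {}
    try:
        return json.loads(value)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        raise ValueError("Enter a valid JSON string.")
```

Anything that parses is returned as the corresponding Python object, so the value assigned to the pickled field is a dict or list rather than the submitted text.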

jkafader (on November 26, 2009):

This may be obvious (to you) but the information may save somebody some time down the road: if you use erussell's code for turning on editing via JSON above, you must also delete the line

 kwargs.setdefault('editable', False)

in the original class from above, otherwise you'll get an error in the admin that is difficult to debug.

sachmonkey (on December 24, 2009):

I found an issue that I was hoping you could take a look at.

I have a model that has a PickledObjectField, which works fine. But then I run a QuerySet operation in which I defer() the PickledObjectField, then make a few edits to the model, and then perform a model.save().

But now when I attempt to read the PickledObjectField value in all future QuerySet operations, I get returned the raw base64 pickled string instead of a python object! It seems like the data has been corrupted somehow. But sometimes I can manually call dbsafe_decode to get back the python object, but I shouldn't have to do that. But even using dbsafe_decode only works sometimes.
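
One way to end up with exactly this symptom is for an already encoded string to be encoded a second time on save. get_db_prep_value skips re-encoding only for PickledObject instances, so a plain string would be pickled again; the field then decodes once and hands back the inner base64 string, and a manual dbsafe_decode recovers the object. Whether this is what actually happens with defer() is an untested guess. The sketch below only shows the mechanics, with `encode` and `decode` as Python 3 stand-ins for dbsafe_encode and dbsafe_decode.

```python
from base64 import b64decode, b64encode
from pickle import dumps, loads


def encode(value):
    # Stand-in for dbsafe_encode (no compression).
    return b64encode(dumps(value, protocol=2)).decode("ascii")


def decode(text):
    # Stand-in for dbsafe_decode (no compression).
    return loads(b64decode(text))


original = {"answer": 42}
stored_once = encode(original)      # what a normal save would write
stored_twice = encode(stored_once)  # hypothetical: the raw string saved again

first_pass = decode(stored_twice)   # a single decode yields the base64 text
second_pass = decode(first_pass)    # a manual second decode yields the object
```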

When I looked into the SQL queries being executed on the model.save(), it runs a SELECT to get the value of the pickledobjectfield from the db first so that it has the full model which is then saved. It appears that the full pickled object field string is appropriately saved, but for some reason it still isn't working.

Any help would be great!
