
Faster pagination / model object seeking (10x faster, in fact) for larger datasets (500k+)

Author:
sleepycal
Posted:
November 28, 2010
Language:
Python
Version:
1.2
Tags:
model dataset big object optimized large quicker pagination faster seeking
Score:
1 (after 1 rating)
ModelPagination
Designed and Coded by Cal Leeming
Many thanks to Harry Roberts for giving us a heads up on how to do this properly!

----------------------------------------------------------------------------

This is a heavily optimized way of paginating datasets of over 1 million records.
It uses MAX() rather than COUNT(), because MAX() on an indexed primary key is far faster.

EXAMPLE:
>>> _t = time.time(); x = Post.objects.aggregate(Max('id')); "Took %ss"%(time.time() - _t )
'Took 0.00103402137756s'
>>> _t = time.time(); x = Post.objects.aggregate(Count('id')); "Took %ss"%(time.time() - _t )
'Took 0.92404794693s'
>>>

This does mean that if you delete rows, the numbers won't be exact any more: delete
50 rows and your exact count() no longer agrees with MAX(id). For pagination that's
fine, and for SEO it's actually what we want: items stay on the page they were
originally crawled on. With offset-based pagination, deleting items shifts everything
backwards through the pages, so archive pages end up serving inconsistent content.
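That page-stability property can be sketched in a few lines of plain Python (the ids and sizes here are invented for illustration): with id-range pagination an item's page is a pure function of its id, so deleting other rows never moves it.

```python
def page_for(item_id, per_page=1000):
    # with id-range pagination, the page is derived from the id alone
    return (item_id - 1) // per_page + 1

ids = set(range(1, 5001))        # pretend these are Post ids
before = page_for(4321)          # the page item 4321 sits on

# physically delete 50 rows; OFFSET-based pagination would shift every
# later item backwards, but the id-derived page does not change
ids -= set(range(100, 150))
after = page_for(4321)

assert before == after == 5
```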

Now, the next thing we do is use id seeking rather than OFFSET, because again,
this is dramatically faster:

EXAMPLE:

>>> _t = time.time(); x = map(lambda x: x, Post.objects.filter(id__gte=400000, id__lt=400500).all()); print "Took %ss"%(time.time() - _t)
Took 0.0467309951782s
>>> _t = time.time(); _res = map(lambda x: x, Post.objects.all()[400000:400500]); print "Took %ss"%(time.time() - _t)
Took 1.05785298347s
>>>

By using this seeking method (which, by the way, can be applied to anything, not just
pagination) on a table with 5 million rows, we save roughly 0.92s on the row count and
1.01s on fetching the items. That may not sound like much, but with 1024 concurrent
users it makes a huge difference.
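The seek trick isn't Django-specific. Here is a minimal self-contained sketch using Python's stdlib sqlite3 (table name and sizes are invented): on a gap-free id column the two queries return identical rows, but the seek version lets the database jump straight to the id range instead of generating and discarding every skipped row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO post (id, title) VALUES (?, ?)",
                 [(i, "post %d" % i) for i in range(1, 2001)])

per_page, page_number = 500, 3
start = per_page * (page_number - 1)
end = per_page * page_number

# seek: the primary-key index takes us straight to the id range
seek = conn.execute("SELECT id FROM post WHERE id > ? AND id <= ? ORDER BY id",
                    (start, end)).fetchall()

# OFFSET: the database must walk past the first `start` rows
offset = conn.execute("SELECT id FROM post ORDER BY id LIMIT ? OFFSET ?",
                      (per_page, start)).fetchall()

assert seek == offset            # same page either way
```

With gaps (deleted rows), the seek pages simply hold fewer items, as the description above notes.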

If you have any questions or problems, feel free to contact me on
cal.leeming [at] simplicitymedialtd.co.uk
"""
    ModelPagination
    Designed and Coded by Cal Leeming
    Many thanks to Harry Roberts for giving us a heads up on how to do this properly!

    You may also notice the interface is almost exactly the same as Django's own
    Paginator, give or take :)
    http://docs.djangoproject.com/en/dev/topics/pagination/?from=olddocs
    So, in most cases, you can use this as a drop-in replacement.

    If you have any questions or problems, feel free to contact me on
    cal.leeming [at] simplicitymedialtd.co.uk

"""
from django.core.paginator import EmptyPage
from django.db.models import Max

class ModelPagination:
    def __init__(self, model, items_per_page):
        self.model = model
        self.items_per_page = items_per_page

        # MAX() on an indexed primary key is effectively free, unlike COUNT()
        self.count = self.model.aggregate(Max('id'))['id__max'] or 0

        # ceiling division, so an exact multiple doesn't create an empty extra page
        self.num_pages = max(1, (self.count + self.items_per_page - 1) // self.items_per_page)

        # built per instance; a class-level list would be shared between instances
        self.page_range = list(range(1, self.num_pages + 1))

    def page(self, page_number):
        if page_number > self.num_pages:
            raise EmptyPage("That page contains no results")

        if page_number <= 0:
            raise EmptyPage("That page number is less than 1")

        # seek by id range rather than OFFSET; page N covers ids (start, end]
        start = self.items_per_page * (page_number - 1)
        end = self.items_per_page * page_number

        object_list = self.model.filter(id__gt=start, id__lte=end)
        return ModelPaginationPage(object_list, page_number, self.count, start, end, self)

class ModelPaginationPage:
    def __init__(self, object_list, number, count, start, end, paginator):
        self.number = number
        self.count = count
        self.object_list = object_list
        self.start = start
        self.end = end
        self.paginator = paginator

    def __unicode__(self):
        return "<Page %s of %s>" % (self.number, self.paginator.num_pages)

    def has_next(self):
        return self.number < self.paginator.num_pages

    def has_previous(self):
        return self.number > 1

    def has_other_pages(self):
        # note the parentheses; the bare method objects are always truthy
        return self.has_next() or self.has_previous()

    def next_number(self):
        return self.number + 1

    def previous_number(self):
        return self.number - 1

    def start_index(self):
        return self.start

    def end_index(self):
        return self.end

###############################################################################
# OUR EXAMPLE USAGE
###############################################################################
import time

from django.template import RequestContext
from django.shortcuts import render_to_response

# `Post` and `make_title` come from the surrounding project

def archive(request, *args, **kwargs):
    _t = time.time()

    # 4chan
    if kwargs.get('feed') == '4chan':
        ret = Post.objects
        url = '/archive/4chan-page-'

    else:
        raise Exception("Invalid feed specified")

    # calculate what page we are on
    page_num = int(args[0]) if args and args[0] else 1

    # create the pagination object
    _items_per_page = 1000
    pagination = ModelPagination(ret, _items_per_page)
    
    # extract the items from the page
    page = pagination.page(page_num)

    items = map(lambda x: {
        'id' : x.get('id'),
        'username' : x.get('username'),
        'title' : make_title(x.get('message'), x.get('image_filename'), x.get('username')),
        'url' : "/fcp/%s-%s.html"%(make_title(x.get('message'), x.get('image_filename'), x.get('username')), x.get('id')),
        'partial_message' : x.get('message')[:256] if x.get('message') else None,
        'created': x.get('created'),
        'image_url' : x.get('image_url')

    }, page.object_list.values('id', 'username', 'message', 'image_filename', 'created', 'image_url'))

    context = RequestContext(request, {
        'url' : url,
        'page_num' : page_num,
        'loading_time' : time.time() - _t,
        'page' : page,
        'items' : items,
        'pagination' : pagination
    })

    return render_to_response('lazylittlegirl/archive/results.html', context_instance=context)


"""
<!-- Example usage in a template. This is a straight copy and paste out of one of our projects, not intended as a unit test. -->
    <div id="content">
        <ol>
            {% for item in items %}
                <li class="li1">
                    <div class="box1">
                        <a href="{{item.url}}" alt="{{item.title}}" title="{{item.title}}" target="_blank">Post #{{item.id}}</a> - {{item.created}} by {{item.username}} 
                    </div>
                </li>
            {% endfor %}
        </ol>

   <br />
   <hr />
   
    <div id="pagenumbers"><b>Pages :</b>
        {% for xpage in pagination.page_range %}
            {% if page.number == xpage %}
                [<b>{{xpage}}</b>]
            {% else %}
                <a title="Page {{xpage}} of {{pagination.num_pages}}" alt="Page {{xpage}} of {{pagination.num_pages}}" href="{{url}}{{xpage}}.html">{{xpage}}</a>
            {% endif %}
        {% endfor %}
    </div>
"""

Comments

gmandx (on November 29, 2010):

What if the models' PKs are not numeric, like UUIDs? Does this still work?


thurloat (on November 29, 2010):

Great snippet! Since you are focusing on performance, have you thought about using a list comprehension instead of map & lambda? Generally maps are quicker, but once lambdas are introduced they tend to fall behind.


You can accomplish the same thing with something like this:

items = [{"id": x.get('id'),
   "username": x.get('username'),
   "title": make_title(x.get('message'), x.get('image_filename'), x.get('username')),
   "partial_message": x.get('message')[:256] if x.get('message') else None,
  } for x in page.object_list.values('id', 'username', 'message', 'image_filename')]


sleepycal (on December 1, 2010):

@gmandx: Sadly, because UUIDs are not numerically incremental, this code definitely would not work. However, if you added a second column as an unsigned int(11) primary key (called _id or id2 or something), you could use that instead and it would work fine. If you delete data physically rather than flagging it, though, you can end up with some pages having fewer items than others. Hope this makes sense.

@thurloat: Ah, I still haven't come to terms with the fact that they are removing lambda, so I haven't used the newly recommended syntax ;( At some point I will do some benchmarks between the two, in an attempt to convince myself to ditch lambda ;p Thanks for letting me know!


sleepycal (on December 1, 2010):

I've added an example template to show how it would be used, similar to the docs :)

