
Web crawler/bot detection and blocking middleware

Author:
haloween
Posted:
January 13, 2010
Language:
Python
Version:
1.1
Score:
1 (after 1 ratings)

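Sets request.is_crawler on every request. Requests without a User-Agent header are rejected with a 403 response.

To lock bots out of certain URLs, add the view parameter 'deny_crawlers' to the URL pattern in your urlconf.

Example:

url(r'^foo/$', 'views.foo', {'deny_crawlers': True}, name='foo')

The view parameter is removed by the middleware before the view is called.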

from django.http import HttpResponseForbidden

# Substrings that identify known crawlers in the User-Agent header.
BotNames = ['Googlebot', 'Slurp', 'Twiceler', 'msnbot', 'KaloogaBot', 'YodaoBot', 'Baiduspider', 'googlebot', 'Speedy Spider', 'DotBot']
param_name = 'deny_crawlers'


class CrawlerBlocker:
    def process_request(self, request):
        # Set the default first so the attribute always exists on the request.
        request.is_crawler = False
        user_agent = request.META.get('HTTP_USER_AGENT', None)

        if not user_agent:
            return HttpResponseForbidden('Requests without a user agent are not supported. Sorry.')

        for botname in BotNames:
            if botname in user_agent:
                request.is_crawler = True
                break

    def process_view(self, request, view_func, view_args, view_kwargs):
        # Always remove the flag, even when it is False, so the view
        # never receives it as an unexpected keyword argument.
        deny = view_kwargs.pop(param_name, False)
        if deny and getattr(request, 'is_crawler', False):
            return HttpResponseForbidden('Address removed from crawling. Check robots.txt.')
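
The bot list above contains both 'Googlebot' and 'googlebot', which suggests the substring check is meant to tolerate different capitalizations. A minimal sketch of a case-insensitive variant follows, written without Django so it can be run on its own. The helper name is_crawler_user_agent and the shortened BOT_NAMES list are illustrative assumptions, not part of the original snippet.

```python
# Hypothetical helper; not part of the original middleware.
BOT_NAMES = ['Googlebot', 'Slurp', 'msnbot', 'Baiduspider', 'DotBot']


def is_crawler_user_agent(user_agent, bot_names=BOT_NAMES):
    """Return True if any bot name occurs in user_agent, ignoring case."""
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(name.lower() in lowered for name in bot_names)


if __name__ == '__main__':
    # A crawler-style header with lowercase 'googlebot' still matches.
    print(is_crawler_user_agent(
        'Mozilla/5.0 (compatible; googlebot/2.1; +http://www.google.com/bot.html)'))
    # An ordinary browser header does not match.
    print(is_crawler_user_agent('Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0'))
```

With a helper like this, the loop in process_request could be reduced to a single assignment, and the duplicate lowercase entry in the list would no longer be needed.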


Comments

myainab (on February 25, 2010):

Using the user-agent to block bots will only stop rookie/noob spammers. The user-agent can easily be changed to whatever the bot writer wants. I think a much better solution is to whitelist the IP address blocks of Google, MSN, and other safe bots, and block all other bots.
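
As a rough illustration of the IP-based approach suggested in this comment, the check could be sketched with Python's standard ipaddress module. The networks below are placeholder documentation ranges (RFC 5737), not real crawler address blocks, and the names ALLOWED_BOT_NETWORKS and is_allowed_bot_ip are assumptions made for this example. Real ranges would have to come from each search engine's own published information.

```python
import ipaddress

# Placeholder networks from the RFC 5737 documentation ranges.
# These are NOT real crawler ranges; substitute verified ones.
ALLOWED_BOT_NETWORKS = [
    ipaddress.ip_network('192.0.2.0/24'),
    ipaddress.ip_network('198.51.100.0/24'),
]


def is_allowed_bot_ip(remote_addr, networks=ALLOWED_BOT_NETWORKS):
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed or empty address strings are never allowed.
        return False
    return any(addr in net for net in networks)


if __name__ == '__main__':
    print(is_allowed_bot_ip('192.0.2.17'))    # inside the first placeholder range
    print(is_allowed_bot_ip('203.0.113.5'))   # outside both placeholder ranges
```

In a middleware, the address would typically be read from request.META['REMOTE_ADDR'], which may need adjusting when the site sits behind a proxy.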
