Web crawler/bot detection and blocking middleware

Author: haloween
Posted: January 13, 2010
Language: Python
Version: 1.1
Tags: googlebot slurp google cuil crawler bot yahoo middleware robot
Score: 1 (after 1 ratings)

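Sets request.is_crawler to True when the User-Agent header contains a known bot name, and to False otherwise.

Bots can be locked out of specific URLs: in the urlconf, add the view parameter 'deny_crawlers' to the entry.

Example:

url(r'^foo/$', 'views.foo', {'deny_crawlers': True}, name='foo')

The view parameter is removed by the middleware before the view is called.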
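The flow described above can be tried without Django. The following is only a sketch under stated assumptions: `dispatch`, `foo`, and the plain-string responses are hypothetical stand-ins for Django's URL dispatch, a view, and `HttpResponse` objects. It shows why the flag has to be removed before the view runs: a view with the signature `foo(request)` would otherwise be handed an unexpected keyword argument.

```python
from types import SimpleNamespace

PARAM_NAME = 'deny_crawlers'


def process_view(request, view_func, view_args, view_kwargs):
    # Remove the flag from the kwargs so the view never sees it.
    deny = view_kwargs.pop(PARAM_NAME, False)
    if deny and request.is_crawler:
        return '403 Forbidden'  # stand-in for HttpResponseForbidden
    return None  # None means "carry on and call the view"


def foo(request):
    # Hypothetical view; it accepts no 'deny_crawlers' argument.
    return '200 OK'


def dispatch(request, view_func, urlconf_kwargs):
    # Hypothetical, heavily simplified stand-in for Django's dispatch step.
    kwargs = dict(urlconf_kwargs)
    blocked = process_view(request, view_func, (), kwargs)
    if blocked is not None:
        return blocked
    return view_func(request, **kwargs)


bot = SimpleNamespace(is_crawler=True)
human = SimpleNamespace(is_crawler=False)
print(dispatch(bot, foo, {'deny_crawlers': True}))
print(dispatch(human, foo, {'deny_crawlers': True}))
```

With these stand-ins, the crawler request is answered by the middleware and the ordinary request reaches `foo` with no extra keyword argument.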

from django.http import HttpResponseForbidden

# Substrings looked for in the User-Agent header (matching is case-sensitive).
BotNames = ['Googlebot', 'Slurp', 'Twiceler', 'msnbot', 'KaloogaBot', 'YodaoBot', 'Baiduspider', 'googlebot', 'Speedy Spider', 'DotBot']
param_name = 'deny_crawlers'


class CrawlerBlocker(object):
    def process_request(self, request):
        user_agent = request.META.get('HTTP_USER_AGENT', None)

        # Requests without a User-Agent header are rejected outright.
        if not user_agent:
            return HttpResponseForbidden('Requests without a user agent are not supported. Sorry.')

        # Flag the request if any known bot name appears in the User-Agent.
        request.is_crawler = any(botname in user_agent for botname in BotNames)

    def process_view(self, request, view_func, view_args, view_kwargs):
        # Always remove the flag, even when it is False, so that the view
        # never receives it as an unexpected keyword argument.
        deny_crawlers = view_kwargs.pop(param_name, False)
        if deny_crawlers and request.is_crawler:
            return HttpResponseForbidden('Address removed from crawling. Check robots.txt.')
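The substring check in process_request can be exercised on its own, outside Django. The helper below is a hypothetical sketch, not part of the snippet: `looks_like_crawler` and `BOT_NAMES` are names invented here, and unlike the middleware it compares case-insensitively and simply returns False for a missing User-Agent instead of rejecting the request.

```python
# Hypothetical standalone helper mirroring the middleware's substring check.
BOT_NAMES = ['Googlebot', 'Slurp', 'Twiceler', 'msnbot', 'KaloogaBot',
             'YodaoBot', 'Baiduspider', 'Speedy Spider', 'DotBot']


def looks_like_crawler(user_agent, bot_names=BOT_NAMES):
    """Return True if any bot name occurs in the User-Agent string."""
    if not user_agent:
        # Assumption: treat a missing header as "not a known crawler".
        return False
    lowered = user_agent.lower()
    # Lower-casing both sides makes 'Googlebot' and 'googlebot' equivalent.
    return any(name.lower() in lowered for name in bot_names)


print(looks_like_crawler(
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))
print(looks_like_crawler('Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6'))
```

Comparing in lower case would make it unnecessary to list both 'Googlebot' and 'googlebot'.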

Comments

myainab (on February 25, 2010):

Using the user-agent to block bots will only stop rookie/noob spammers. The user-agent can easily be changed to whatever the bot writer wants. I think a much better solution is to whitelist the IP address blocks of Google, MSN and other safe bots, and block all other bots.
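The IP whitelist idea from this comment could be sketched roughly as follows, using the standard-library ipaddress module. Everything here is an assumption for illustration: `is_trusted_crawler_ip` and `TRUSTED_CRAWLER_NETWORKS` are hypothetical names, and the networks are the reserved documentation ranges from RFC 5737, not real crawler address blocks. Real ranges would have to come from each search engine's own published information.

```python
import ipaddress

# Placeholder networks (RFC 5737 documentation ranges), NOT real crawler ranges.
TRUSTED_CRAWLER_NETWORKS = [
    ipaddress.ip_network('192.0.2.0/24'),
    ipaddress.ip_network('198.51.100.0/24'),
]


def is_trusted_crawler_ip(remote_addr, networks=TRUSTED_CRAWLER_NETWORKS):
    """Return True if remote_addr falls inside one of the trusted networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed or missing addresses are never trusted.
        return False
    return any(addr in net for net in networks)


print(is_trusted_crawler_ip('192.0.2.10'))
print(is_trusted_crawler_ip('203.0.113.5'))
```

In a middleware, such a check would presumably be applied to request.META['REMOTE_ADDR'] for requests whose User-Agent claims to be a bot, so that a claimed bot coming from an unknown address can be treated as untrusted.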
