Web crawler/bot detection and blocking middleware

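from django.http import HttpResponseForbidden

# Substrings that identify known crawlers in the User-Agent header.
# Matching is case-sensitive, hence both 'Googlebot' and 'googlebot'.
BotNames = ['Googlebot', 'Slurp', 'Twiceler', 'msnbot', 'KaloogaBot', 'YodaoBot', 'Baiduspider', 'googlebot', 'Speedy Spider', 'DotBot']
# URLconf keyword argument that marks a view as closed to crawlers.
param_name = 'deny_crawlers'


class CrawlerBlocker:
    def process_request(self, request):
        user_agent = request.META.get('HTTP_USER_AGENT', None)

        # Reject requests that send no User-Agent header at all.
        if not user_agent:
            return HttpResponseForbidden('Requests without a user agent are not supported. Sorry.')

        # Flag the request if any known bot name appears in the User-Agent.
        request.is_crawler = False
        for botname in BotNames:
            if botname in user_agent:
                request.is_crawler = True
                break

    def process_view(self, request, view_func, view_args, view_kwargs):
        # Always strip the marker so the view never receives it,
        # even when it is set to a false value.
        deny = view_kwargs.pop(param_name, False)
        if deny and getattr(request, 'is_crawler', False):
            return HttpResponseForbidden('Address removed from crawling. Check robots.txt')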
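
The substring check in the middleware is case-sensitive, which is why the list carries both 'Googlebot' and 'googlebot'. A minimal sketch of a case-insensitive variant follows, using only the standard library. The helper name `is_crawler_agent` and the shortened `BOT_NAMES` list are illustrative assumptions, not part of the snippet.

```python
# Hypothetical helper, not part of the original snippet.
# Names are stored in lowercase so one entry covers every capitalization.
BOT_NAMES = ['googlebot', 'slurp', 'msnbot', 'baiduspider', 'dotbot']


def is_crawler_agent(user_agent, bot_names=BOT_NAMES):
    """Return True if any bot name occurs in the User-Agent, ignoring case."""
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(name.lower() in lowered for name in bot_names)


print(is_crawler_agent('Mozilla/5.0 (compatible; Googlebot/2.1)'))
print(is_crawler_agent('Mozilla/5.0 (X11; Linux x86_64) Firefox/3.6'))
```

Inside `process_request`, a call such as `request.is_crawler = is_crawler_agent(user_agent)` could stand in for the loop, assuming the helper is importable from the middleware module.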

Comments

Romain Hardouin (on January 13, 2010):

Kinda cool. One can complete the bot list here: http://www.robotstxt.org/db.html

#

myainab (on February 25, 2010):

Using the user-agent to block bots will only stop rookie/noob spammers. The user-agent can easily be changed to whatever the bot writer wants. I think a much better solution is to whitelist the IP address blocks of Google, MSN and other safe bots, and block all other bots.
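
The allow-list idea can be sketched with the standard `ipaddress` module. This is only an illustration: the function name `is_allowed_bot_ip` is hypothetical, and the networks below are placeholder documentation ranges, not the real address blocks of any search engine. Actual ranges would have to come from each crawler operator's own published data.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), NOT real crawler blocks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network('192.0.2.0/24'),
    ipaddress.ip_network('198.51.100.0/24'),
]


def is_allowed_bot_ip(remote_addr, networks=ALLOWED_NETWORKS):
    """Return True if remote_addr falls inside one of the allowed networks."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Missing or malformed addresses are never treated as allowed.
        return False
    return any(addr in net for net in networks)
```

In a middleware, the address would typically be read from `request.META.get('REMOTE_ADDR', '')`. Behind a proxy that value is the proxy's address, so the check would need adjusting.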

#
