HTML Validation Middleware

Author: adamcik
Posted: February 6, 2009
Language: Python
Version: 1.0
Score: 6 (after 6 ratings)

Development middleware to ensure that responses validate as HTML.

# encoding: utf-8
#
# Copyright (c) 2009 Thomas Kongevold Adamcik
#
# Snippet is released under the MIT License. So feel free to use it in other
# projects as long as the notice remains intact :)
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.

# See http://www.djangosnippets.org/snippets/1312/

'''
HTML Validation Middleware
==========================

Simple development middleware to ensure that responses validate as HTML.

Dependencies:
-------------

 - tidy (http://utidylib.berlios.de/)

Installation:
-------------

Assuming this file has been placed on your PYTHONPATH (e.g.
djangovalidation/middleware.py), simply add the following
to your middleware settings:

  'djangovalidation.middleware.HTMLValidationMiddleware',

Remember that the order of your middleware settings matters: this
middleware must process the response before e.g. GzipMiddleware,
djangologging and any other middleware that modifies the response's
content. Django applies response middleware in reverse order, so list it
after them in your settings.

Operation:
----------

Validation only kicks in under the following conditions:
 - DEBUG == True
 - HTML_VALIDATION_ENABLE == True (default)
 - REMOTE_ADDR in INTERNAL_IPS
 - 'html' in Content-Type
 - 'disable-validation' not in GET
 - request.is_ajax() == False
 - type(response) == HttpResponse
 - request.path doesn't match HTML_VALIDATION_URL_IGNORE

To bypass the check, append ?disable-validation to any URI.

Settings:
---------

 - HTML_VALIDATION_ENABLE     - Turns middleware on/off. Default: True

 - HTML_VALIDATION_ENCODING   - Default: 'utf8'

 - HTML_VALIDATION_DOCTYPE    - Default: 'strict'

 - HTML_VALIDATION_IGNORE     - Default: ['trimming empty <option>',
                                          '<table> lacks "summary" attribute']

 - HTML_VALIDATION_URL_IGNORE - List of regular expressions to check
                                request.path against when deciding if we should
                                process the request. Default: []

 - HTML_VALIDATION_XHTML      - Default: True

 - HTML_VALIDATION_OPTIONS    - Options that get passed to tidy, overrides
                                previous settings. Default: based on above
                                settings

For more information about settings use the source and consult tidy's
documentation.

History
-------

December 19, 2009:
 - Fix empty HTML_VALIDATION_URL_IGNORE. Thanks .iqqmuT

July 12, 2009:
 - Ignore ajax request
 - Add HTML_VALIDATION_URL_IGNORE settings

February 6, 2009:
 - Initial release
'''

import re
import tidy

from django.conf import settings
from django.core.exceptions import MiddlewareNotUsed
from django.http import HttpResponse, HttpResponseServerError
from django.template import Context, Template

class HTMLValidationMiddleware(object):
    '''
        Checks that the response is valid HTML with proper Unicode. In the
        event of a failed check we show a simple page listing the HTML source
        and which errors need to be fixed.
    '''

    # Validation errors to ignore. Can be overridden with HTML_VALIDATION_IGNORE setting
    ignore = [
        'trimming empty <option>',
        '<table> lacks "summary" attribute',
    ]

    # Options for tidy. Can be overridden with HTML_VALIDATION_OPTIONS setting
    options = {
        'doctype': getattr(settings, 'HTML_VALIDATION_DOCTYPE', 'strict'),
        'output_xhtml': getattr(settings, 'HTML_VALIDATION_XHTML', True),
        'input_encoding': getattr(settings, 'HTML_VALIDATION_ENCODING', 'utf8'),
    }

    def __init__(self):
        if not settings.DEBUG or not getattr(settings, 'HTML_VALIDATION_ENABLE', True):
            raise MiddlewareNotUsed

        self.options = getattr(settings, 'HTML_VALIDATION_OPTIONS', self.options)
        self.ignore = set(getattr(settings, 'HTML_VALIDATION_IGNORE', self.ignore))
        self.ignore_regexp = self._build_ignore_regexp(getattr(settings, 'HTML_VALIDATION_URL_IGNORE', []))
        self.template = Template(self.HTML_VALIDATION_TEMPLATE.strip())

    def process_response(self, request, response):
        if not self._should_validate(request, response):
            return response

        errors = self._validate(response)

        if not errors:
            return response

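        context = self._get_context(response, errors)
        # The template's <title> uses request.path_info, so expose the request.
        context['request'] = request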

        return HttpResponseServerError(self.template.render(context))

    def _build_ignore_regexp(self, urls):
        if not urls:
            return None

        urls = [r'(%s)' % url for url in urls]
        return re.compile(r'(%s)' % r'|'.join(urls))

    def _should_validate(self, request, response):
        return ('html' in response['Content-Type'] and
                'disable-validation' not in request.GET and
                not request.is_ajax() and
                (not self.ignore_regexp or
                 not self.ignore_regexp.search(request.path)) and
                request.META['REMOTE_ADDR'] in settings.INTERNAL_IPS and
                # Exact type check: HttpResponse subclasses are not validated.
                type(response) is HttpResponse)

    def _validate(self, response):
        errors = tidy.parseString(response.content, **self.options).errors
        return self._filter_errors(errors)

    def _filter_errors(self, errors):
        # Build a real list: process_response relies on `if not errors`.
        return [e for e in errors if e.message not in self.ignore]

    def _get_context(self, response, errors):
        lines = []
        error_dict = dict(map(lambda e: (e.line, e.message), errors))

        for i, line in enumerate(response.content.split('\n')):
            lines.append((line, error_dict.get(i + 1, False)))

        return Context({'errors': errors,
                        'lines': lines,})

    HTML_VALIDATION_TEMPLATE = """
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <title>HTML validation error at {{ request.path_info|escape }}</title>
  <meta name="robots" content="NONE,NOARCHIVE">
  <style type="text/css">
    html * { padding: 0; margin: 0; }
    body * { padding: 10px 20px; }
    body * * { padding: 0; }
    body { font: small sans-serif; background: #eee; }
    body>div { border-bottom: 1px solid #ddd; }
    h1 { font-weight: normal; margin-bottom: 0.4em; }
    table { border: none; border-collapse: collapse; width: 100%; }
    td, th { vertical-align: top; padding: 2px 3px; }
    th { width: 6em; text-align: right; color: #666; padding-right: 0.5em; }
    #info { background: #f6f6f6; }
    #info th { width: 3em; }
    #summary { background: #ffc; }
    #explanation { background: #eee; border-bottom: 0px none; }
    .meta { margin: 1em 0; }
    .error { background: #FEE }
  </style>
</head>
<body>
  <div id="summary">
    <h1>HTML validation error</h1>
    <p>
        Your HTML did not validate. If this page contains user content that
        might be the problem. Please fix the following:
    </p>
    <table class="meta">
      {% for error in errors %}
        <tr>
          <th>Line: <a href="#line{{ error.line }}">{{ error.line }}</a></th>
          <td>{{ error.message|escape }}</td>
        </tr>
      {% endfor %}
    </table>
    <p>
      If you want to bypass this warning, click <a href="?disable-validation">
      here</a>. Please note that this warning will persist until you fix the
      problems mentioned above.
    </p>
  </div>
  <div id="info">
    <table>
      {% for line,error in lines %}
        <tr{% if error %} class="error"{% endif %}>
          <th id="line{{ forloop.counter }}">
            {{ forloop.counter|stringformat:"03d" }}
          </th>
          <td{% if error %} title="{{ error }}"{% endif %}>
            <pre>{{ line }}</pre>
          </td>
        </tr>
      {% endfor %}
    </table>
  </div>

  <div id="explanation">
    <p>
      You're seeing this error because you have not set
      <code>HTML_VALIDATION_ENABLE = False</code> in your Django settings file.
      Set it to <code>False</code>, and the middleware will stop validating
      your HTML.
    </p>
  </div>
</body>
</html>"""
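
The installation and settings notes in the docstring can be summarized in a settings file. The following is a minimal sketch, not the author's configuration: `MIDDLEWARE_CLASSES` is assumed to be the middleware setting in use (the old-style setting that matches this snippet's `__init__(self)` signature), and the `INTERNAL_IPS` address and the `^/admin/` pattern are illustrative values.

```python
# settings.py (sketch; values marked "illustrative" are assumptions)

DEBUG = True                      # validation is skipped unless DEBUG is True
INTERNAL_IPS = ('127.0.0.1',)     # illustrative: REMOTE_ADDR must be listed here

# Response middleware runs bottom-up, so the validator is listed after
# GZipMiddleware in order to see the response before it is compressed.
MIDDLEWARE_CLASSES = (
    'django.middleware.gzip.GZipMiddleware',
    'django.middleware.common.CommonMiddleware',
    'djangovalidation.middleware.HTMLValidationMiddleware',
)

HTML_VALIDATION_ENABLE = True

# Illustrative: skip validation for every path under /admin/.
HTML_VALIDATION_URL_IGNORE = [r'^/admin/']

# This setting replaces the built-in ignore list, so the defaults are repeated.
HTML_VALIDATION_IGNORE = [
    'trimming empty <option>',
    '<table> lacks "summary" attribute',
]
```

Because `HTML_VALIDATION_IGNORE` replaces the built-in list rather than extending it, the two default messages are repeated here; any further messages would be appended to the same list.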

Comments

whiteinge (on February 6, 2009):

Really excellent and well thought out. Thanks!

This could benefit from an additional setting to ignore some URL patterns (since the Admin doesn't validate :) I added the following to the _should_validate() method, but this can probably be improved to be less convoluted:

and not [True for i in settings.HTML_VALIDATION_IGNORE_URLS if request.path.startswith(i)]
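
The same prefix check can be written with `any()`, which avoids building a throwaway list. A minimal sketch, where `path_is_ignored` is a hypothetical helper name and `HTML_VALIDATION_IGNORE_URLS` is the setting name used in the line above (the value shown is illustrative):

```python
def path_is_ignored(path, ignored_prefixes):
    """Return True if path starts with any of the ignored prefixes."""
    return any(path.startswith(prefix) for prefix in ignored_prefixes)


# Stand-in for settings.HTML_VALIDATION_IGNORE_URLS (illustrative value).
HTML_VALIDATION_IGNORE_URLS = ['/admin/']

admin_ignored = path_is_ignored('/admin/auth/user/', HTML_VALIDATION_IGNORE_URLS)
blog_ignored = path_is_ignored('/blog/', HTML_VALIDATION_IGNORE_URLS)
```

In `_should_validate()` the condition would then read `and not path_is_ignored(request.path, settings.HTML_VALIDATION_IGNORE_URLS)`.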

adamcik (on February 8, 2009):

Good to see that there are alternatives out there :)

Personally I prefer a KISS approach to these types of utilities. Not having to think about keeping state and adding development stuff to my URLconf is a big plus.

This of course comes at the cost of being in the developer's face all the time, potentially disturbing their workflow (then again, forcing the habit of writing valid HTML is a good thing).

Which one you prefer comes down to personal preference, given that both approaches have their merits :)

.iqqmuT (on August 21, 2009):

Thanks for a really nice snippet!

There is one minor bug, though: if HTML_VALIDATION_URL_IGNORE is empty, as it is by default, then validation will never happen. Here is a little fix suggestion for that:

$ diff --strip-trailing-cr /tmp/1312.py htmlvalidation.py
150a151,152
>         if len(urls) == 0:
>               return None
158c160
<                 not self.ignore_regexp.search(request.path) and
---
>               (not self.ignore_regexp or not
> self.ignore_regexp.search(request.path)) and

adamcik (on May 25, 2010):

Finally got around to updating the actual snippet with the provided patch :-)
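
Why the patch was needed can be seen with the standard `re` module alone. The sketch below is a simplified reconstruction, not the snippet's exact history: `build_ignore_regexp_unpatched` and `is_ignored` are hypothetical names, and the unpatched version is inferred from the diff above.

```python
import re


def build_ignore_regexp_unpatched(urls):
    # With an empty list the joined pattern is '', so this compiles '()',
    # which matches the empty string at the start of every path.
    urls = [r'(%s)' % url for url in urls]
    return re.compile(r'(%s)' % r'|'.join(urls))


def build_ignore_regexp(urls):
    # Patched behaviour: no patterns means "nothing is ignored".
    if not urls:
        return None
    urls = [r'(%s)' % url for url in urls]
    return re.compile(r'(%s)' % r'|'.join(urls))


def is_ignored(regexp, path):
    # Mirrors the guarded check in _should_validate().
    return regexp is not None and regexp.search(path) is not None


unpatched = build_ignore_regexp_unpatched([])
every_path_matches = unpatched.search('/any/path/') is not None
```

With the unpatched helper every request path counts as ignored, which is why validation never ran under the default empty `HTML_VALIDATION_URL_IGNORE`.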

gasull1 (on November 25, 2011):

It doesn't work with HTML5 right now. It isn't the fault of this snippet, but the fault of libtidy. Once libtidy is updated this snippet will be useful again.

zimnyx (on December 9, 2011):

There is a small bug in this snippet: the template expects a "request" variable, which is not passed in the Context object.
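
One way to address this is to include the request in the data handed to the template. The sketch below uses only the standard library so it runs without Django or tidy: `FakeRequest`, `TidyError` and `build_context_data` are hypothetical stand-ins, and in the middleware the resulting dict would be wrapped in a `Context`.

```python
from collections import namedtuple

# Stand-ins for Django's request and tidy's error objects (assumptions).
FakeRequest = namedtuple('FakeRequest', ['path_info'])
TidyError = namedtuple('TidyError', ['line', 'message'])


def build_context_data(request, content, errors):
    """Pair each source line with its error message and expose the request."""
    error_by_line = dict((e.line, e.message) for e in errors)
    lines = [(text, error_by_line.get(number, False))
             for number, text in enumerate(content.split('\n'), 1)]
    return {
        'request': request,   # needed for {{ request.path_info }} in <title>
        'errors': errors,
        'lines': lines,
    }


data = build_context_data(
    FakeRequest(path_info='/blog/'),
    '<html>\n<p>unclosed',
    [TidyError(line=2, message='example message')],
)
```

Passing the request explicitly keeps the template unchanged; the alternative would be to drop `request.path_info` from the title.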
