Sanitize text field HTML (here from the Dojo Toolkit Editor2 widget)

from django import newforms as forms
from BeautifulSoup import BeautifulSoup, Comment

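class Editor2Field(forms.CharField):
    """CharField rendered as a Dojo Editor2 widget that strips disallowed HTML."""

    widget = forms.widgets.Textarea(attrs={'dojoType': 'Editor2'})

    # Whitelists: any tag or attribute not listed here is dropped.
    valid_tags = 'p i strong b u a h1 h2 h3 pre br img'.split()
    valid_attrs = 'href src'.split()

    def clean(self, value):
        """
        Cleans non-allowed HTML from the input.
        """
        value = super(Editor2Field, self).clean(value)
        soup = BeautifulSoup(value)
        # Remove HTML comments entirely.
        for comment in soup.findAll(
            text=lambda text: isinstance(text, Comment)):
            comment.extract()
        for tag in soup.findAll(True):
            # Hide disallowed tags: the tag itself is not rendered,
            # but its contents still are.
            if tag.name not in self.valid_tags:
                tag.hidden = True
            # Keep only whitelisted attributes. Attribute values
            # (e.g. the URL in href) are not validated here.
            tag.attrs = [(attr, val) for attr, val in tag.attrs
                         if attr in self.valid_attrs]
        return soup.renderContents().decode('utf8')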


class TestForm(forms.Form):
    title = forms.CharField()
    content = Editor2Field()

Comments

guettli (on November 16, 2007):

Nice snippet!

marcink (on February 10, 2008):

This is nice, but you should also look into href attributes to make sure they don't contain javascript code.

akaihola (on April 21, 2008):

marcink: Thanks for the heads up. It's obviously a fatal mistake to have left out that check.
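
One possible way to add the check marcink describes is to validate the URL scheme of every href and src value before keeping the attribute. The following is a minimal sketch, not the snippet author's fix: the names `SAFE_SCHEMES`, `is_safe_url` and `filter_attrs` are hypothetical, the list of allowed schemes is an assumption, and it uses only the Python 3 standard library rather than the BeautifulSoup 3 API of the snippet. It also assumes attribute values have already been entity-decoded by the HTML parser.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list; adjust to the schemes your site actually needs.
SAFE_SCHEMES = {'http', 'https', 'mailto'}


def is_safe_url(url):
    """Return True if url is relative or uses an allow-listed scheme."""
    # Browsers tolerate whitespace and control characters around and
    # inside the scheme, so drop them before looking at it.
    cleaned = ''.join(ch for ch in url if ch > ' ' and ch != '\x7f')
    try:
        scheme = urlsplit(cleaned).scheme.lower()
    except ValueError:
        # Malformed URL: reject rather than guess.
        return False
    # An empty scheme means a relative (or protocol-relative) URL.
    return scheme == '' or scheme in SAFE_SCHEMES


def filter_attrs(attrs, valid_attrs=('href', 'src')):
    """Keep (name, value) pairs that are whitelisted and carry a safe URL."""
    return [(name, val) for name, val in attrs
            if name in valid_attrs and is_safe_url(val)]
```

In the snippet, a check like this could replace the list comprehension that rebuilds `tag.attrs`. Both whitelisted attributes hold URLs, so the sketch validates every value it keeps; a non-URL attribute added to the whitelist later would need its own rule.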
