Cleanup dirty HTML from a WYSIWYG editor

Author: denis
Posted: May 29, 2009
Language: Python
Version: 1.0
Score: 1 (after 1 rating)

My admin allows editing of some HTML fields using TinyMCE, so I end up with horrible code that contains lots of nested <p>, <div>, and <span> tags, plus style properties that destroy my layout and consistency.

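This template tag, based on lxml, tries to remove as many unneeded tags and style properties as possible. It drops elements that contain no text and strips unwanted CSS properties from style attributes. The stripped properties can be customized by adapting the regex to your needs.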

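from django.template import Library
from django.utils.safestring import mark_safe
from lxml import html, etree
import re

register = Library()

# Strips font*, padding*, margin* and line-height declarations.
# A declaration is only matched if it ends with a semicolon.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')


def _cleanup_elements(elem):
    """
    Removes empty elements from HTML (i.e. those without text inside).
    If the tag has a 'style' attribute, we remove the CSS properties we don't want.
    """
    # Note: text-less elements such as <img> and <br> count as empty too.
    if elem.text_content().strip() == '' and elem.getparent() is not None:
        # drop_tree() keeps the tail text; the root element cannot be dropped.
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        # Iterate over a copy, since children may be dropped during the loop.
        for sub in list(elem):
            _cleanup_elements(sub)


@register.simple_tag
def cleanup_html(string):
    """
    Makes generated HTML (i.e. output from the WYSIWYG) look almost decent.
    """
    try:
        elem = html.fromstring(string)
        _cleanup_elements(elem)
        # encoding='unicode' returns str rather than bytes on Python 3.
        html_string = html.tostring(elem, encoding='unicode')
        # Drop blank lines and trailing whitespace.
        lines = [line.rstrip() for line in html_string.splitlines() if line.strip()]
        # Django 1.9+ escapes simple_tag output, so the markup is marked safe.
        # Only appropriate for trusted, admin-authored HTML.
        return mark_safe('\n'.join(lines))
    except (etree.XMLSyntaxError, etree.ParserError):
        # Unparseable or empty input is returned unchanged.
        return string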
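
The regex customization mentioned in the description could look roughly like the sketch below. It only uses the standard `re` module, so it can be tried without Django or lxml. The `custom_cleanup_regex` name and the added `color` and `background` properties are assumptions for illustration and are not part of the snippet.

```python
import re

# Same pattern as in the snippet: font*, padding*, margin* and line-height.
css_cleanup_regex = re.compile(
    r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;'
)

# Hypothetical variant that also strips color and background* declarations.
custom_cleanup_regex = re.compile(
    r'((font|padding|margin|background)(-[^:]+)?|line-height|color):\s*[^;]+;'
)

style = 'font-size: 12px; color: red; margin-left: 4px; text-align: center;'

# Collapse the whitespace left behind by the removed declarations.
default_result = ' '.join(css_cleanup_regex.sub('', style).split())
custom_result = ' '.join(custom_cleanup_regex.sub('', style).split())

print(default_result)  # color: red; text-align: center;
print(custom_result)   # text-align: center;
```

Extending the alternation keeps everything in one pattern, which is probably the least invasive way to adapt the tag.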

Comments

andybak (on May 31, 2009):

Why not just use TinyMCE's 'valid_elements' option to control which tags it allows?
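
For anyone wanting to try that route from the Django side, here is a minimal sketch. It assumes the editor is integrated through the django-tinymce package, which the snippet does not mention, and the list of allowed tags is purely illustrative.

```python
# settings.py (sketch; assumes the django-tinymce package is in use)
TINYMCE_DEFAULT_CONFIG = {
    # TinyMCE 'valid_elements' syntax: comma-separated tag names, with the
    # allowed attributes in brackets. 'strong/b' converts <b> to <strong>.
    # The whitelist below is illustrative, not a recommendation.
    "valid_elements": "p,br,strong/b,em/i,ul,ol,li,a[href|title]",
}
```

This filters markup in the editor itself, so it may complement server-side cleanup of content that was saved earlier.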
