Cleanup dirty HTML from a WYSIWYG editor

Author:: denis
Posted:: May 29, 2009
Language:: Python
Version:: 1.0
Score:: 1 (after 1 ratings)

My admin allows editing of some HTML fields using TinyMCE, so I end up with horrible code that contains lots of nested <p>, <div>, and <span> tags, plus style properties that destroy my layout and consistency.

This template tag, based on lxml, tries to remove as many unneeded tags and style properties as possible. Which style properties get removed can be customized by adapting the regex to your needs.
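
As a rough illustration of what adapting the regex means, the sketch below applies the snippet's pattern to a sample style string and compares it with a widened variant. The names `custom_cleanup_regex` and `strip_styles`, the sample style string, and the extra `background`/`color` alternatives are assumptions made for this example only; they are not part of the snippet.

```python
import re

# Same pattern as the snippet: font*, padding*, margin* and line-height declarations.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')

# Hypothetical variant that also strips color and background* declarations.
custom_cleanup_regex = re.compile(
    r'((font|padding|margin|background)(-[^:]+)?|line-height|color):\s*[^;]+;'
)


def strip_styles(style, regex=css_cleanup_regex):
    """Remove matching declarations and collapse the leftover whitespace."""
    return ' '.join(regex.sub('', style).split())


style = 'font-family: Arial; color: red; margin-left: 10px; width: 50%;'

default_result = strip_styles(style)                       # keeps color and width
custom_result = strip_styles(style, custom_cleanup_regex)  # keeps only width

# A declaration without a trailing semicolon is not matched by either pattern.
unterminated_result = strip_styles('margin: 0')
```

Both patterns require a closing `;`, so a final declaration written without one survives the cleanup. If that matters for your content, the pattern would need to be adjusted accordingly.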

from django.template import Library
from lxml import html, etree
import re

register = Library()

# Matches font*, padding*, margin* and line-height declarations that end with ';'.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')

def _cleanup_elements(elem):
    """
    Removes empty elements from HTML (i.e. those without text inside).
    If the tag has a 'style' attribute, we remove the CSS properties we don't want.
    """
    if elem.text_content().strip() == '':
        # Note: this also drops text-less elements such as <img> or <br>.
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        # Iterate over a snapshot so that dropping a child cannot disturb the loop.
        for sub in list(elem):
            _cleanup_elements(sub)

@register.simple_tag
def cleanup_html(string):
    """
    Makes generated HTML (i.e. output from the WYSIWYG editor) look almost decent.
    """
    try:
        elem = html.fromstring(string)
        _cleanup_elements(elem)
        # encoding='unicode' returns text rather than bytes on Python 3.
        html_string = html.tostring(elem, encoding='unicode')
        # Strip trailing whitespace and drop blank lines.
        lines = [line.rstrip() for line in html_string.splitlines() if line.strip()]
        return '\n'.join(lines)
    except (etree.XMLSyntaxError, etree.ParserError):
        # Unparseable or empty input is returned unchanged.
        return string
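
To try the cleanup outside of a template, a Django-free sketch along the following lines may help. It assumes lxml is installed and reuses the snippet's logic under the hypothetical name `cleanup`, with Python 3 idioms (`in` instead of `has_key`, `encoding='unicode'`). The sample markup is made up for illustration.

```python
import re
from lxml import html

# Same pattern as the snippet.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')


def cleanup(elem):
    """Drop elements without text and strip unwanted style declarations."""
    if elem.text_content().strip() == '':
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        for sub in list(elem):
            cleanup(sub)


# Made-up editor output: one styled paragraph, one empty one, one holding only an image.
dirty = (
    '<div>'
    '<p style="font-size: 12px; color: red;">Hello</p>'
    '<p><span> </span></p>'
    '<p><img src="a.png"></p>'
    '</div>'
)

root = html.fromstring(dirty)
cleanup(root)
cleaned = html.tostring(root, encoding='unicode')
print(cleaned)
```

In this run the `font-size` declaration and the empty paragraph disappear, while `color: red` is kept. The paragraph that contains only an image is removed as well, because an `<img>` has no text content. Keep that in mind if your editors insert images.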

Comments

andybak (on May 31, 2009):

Why not just use TinyMCE's 'valid_elements' option to control which tags it allows?
