Cleanup dirty HTML from a WYSIWYG editor

Author:
denis
Posted:
May 29, 2009
Language:
Python
Version:
1.0
Tags:
html wysiwyg tinymce lxml dirty cleanup
Score:
1 (after 1 ratings)

My admin allows editing of some HTML fields using TinyMCE, so I end up with horrible markup that contains lots of nested <p>, <div>, and <span> tags, plus style properties that destroy my layout and consistency.

This template tag, based on lxml, tries to remove as many unneeded tags and style properties as possible. Empty elements are dropped, and the style properties to strip can be customized by adapting the regex to your needs.
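
To make the regex customization concrete, here is a small sketch using only the standard library. The first pattern is the one from the snippet; `build_css_cleanup_regex` is a hypothetical helper (not part of the original snippet) that builds the same kind of pattern from a list of property prefixes. Note that the pattern only matches declarations that end with a semicolon, so a final declaration without one is left untouched.

```python
import re

# The pattern from the snippet, written as a raw string.
css_cleanup_regex = re.compile(
    r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;'
)


def build_css_cleanup_regex(prefixes, exact=()):
    """Hypothetical helper: build a cleanup pattern of the same shape.

    `prefixes` match a property and its hyphenated variants
    (e.g. 'margin' also matches 'margin-top'); `exact` names match as-is.
    """
    parts = ['(?:%s)(?:-[^:]+)?' % '|'.join(re.escape(p) for p in prefixes)]
    parts.extend(re.escape(name) for name in exact)
    return re.compile(r'(?:%s):\s*[^;]+;' % '|'.join(parts))


# Default pattern: font and margin declarations are removed, color is kept.
style = 'font-size: 12px; color: red; margin: 0 auto;'
cleaned = css_cleanup_regex.sub('', style).strip()

# A trailing declaration without ';' does not match and is left alone.
unterminated = css_cleanup_regex.sub('', 'color: red; font-size: 12px')

# Custom pattern that also strips border declarations.
custom = build_css_cleanup_regex(['font', 'border'], exact=['line-height'])
custom_cleaned = custom.sub('', 'border-top: 1px solid; color: red;').strip()
```

The pattern is not anchored to the start of a property name, so a prefix such as `color` would also match the tail of `background-color`; pick prefixes with that in mind.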

import re

from django.template import Library
from lxml import html, etree

register = Library()

# Matches font*, padding*, margin* and line-height declarations that end with ';'.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')


def _cleanup_elements(elem):
    """
    Removes empty elements from HTML (i.e. those without text inside).
    If the tag has a 'style' attribute, we remove the CSS properties we don't want.
    """
    if elem.text_content().strip() == '':
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        # Iterate over a copy: drop_tree() removes children from elem mid-loop.
        for sub in list(elem):
            _cleanup_elements(sub)

@register.simple_tag
def cleanup_html(string):
    """
    Makes generated HTML (i.e. output from the WYSIWYG editor) look almost decent.
    """
    try:
        elem = html.fromstring(string)
        _cleanup_elements(elem)
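        # encoding='unicode' returns str rather than bytes on Python 3.
        html_string = html.tostring(elem, encoding='unicode')
        # Drop trailing whitespace and blank lines.
        lines = []
        for line in html_string.splitlines():
            line = line.rstrip()
            if line:
                lines.append(line)
        return '\n'.join(lines)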
    except (etree.ParserError, etree.XMLSyntaxError):
        # Unparseable or empty input: return it unchanged.
        return string

Comments

andybak (on May 31, 2009):

Why not just use TinyMCE's 'valid_elements' option to control which tags it allows?
