Clean up dirty HTML from a WYSIWYG editor

import re

from django.template import Library
from lxml import etree, html

register = Library()

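# Matches font*, padding*, margin* and line-height declarations (raw string so \s is a regex escape)
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')
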
def _cleanup_elements(elem):
    """
    Removes empty elements from HTML (i.e. those without text inside).
    If the tag has a 'style' attribute, we remove the css attributes we don't want.
    """
    if elem.text_content().strip() == '':
        elem.drop_tree()
    else:
        if 'style' in elem.attrib:
            elem.attrib['style'] = css_cleanup_regex.sub('', elem.attrib['style'])
        # Iterate over a copy of the children, since drop_tree() modifies the tree
        for sub in list(elem):
            _cleanup_elements(sub)

@register.simple_tag
def cleanup_html(string):
    """
    Makes generated HTML (i.e. output from the WYSIWYG) look almost decent.
    """
    try:
        elem = html.fromstring(string)
        _cleanup_elements(elem)
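        # encoding='unicode' makes tostring() return str rather than bytes
        html_string = html.tostring(elem, encoding='unicode')
        # Strip trailing whitespace and drop blank lines
        lines = [line.rstrip() for line in html_string.splitlines() if line.strip()]
        return '\n'.join(lines)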
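    # html.fromstring() raises ParserError on empty input; fall back to the raw string
    except (etree.ParserError, etree.XMLSyntaxError):
        return string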
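
To see what the style cleanup does on its own, here is a minimal sketch that applies the same pattern to a plain style string, without lxml or Django. The helper name `strip_unwanted_css` and the sample style values are made up for illustration and are not part of the snippet.

```python
import re

# Same pattern as in the snippet above, written as a raw string.
css_cleanup_regex = re.compile(r'((font|padding|margin)(-[^:]+)?|line-height):\s*[^;]+;')


def strip_unwanted_css(style):
    """Hypothetical helper: drop font, padding, margin and line-height declarations."""
    return css_cleanup_regex.sub('', style).strip()


# Declarations terminated by ';' are removed; unrelated ones are kept.
cleaned = strip_unwanted_css('font-family: Arial; color: red; margin-left: 10px;')

# A final declaration with no trailing ';' is not matched, so it survives.
unterminated = strip_unwanted_css('color: red; font-size: 12px')
```

Note that the pattern requires a closing semicolon, so a last declaration written without one is left in place. If that matters for your editor's output, one option is to append `;` to the style value before substituting.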

Comments

andybak (on May 31, 2009):

Why not just use TinyMCE's 'valid_elements' option to control which tags it allows?
