HTML to text filter

Author:: MasonM
Posted:: August 10, 2008
Language:: Python
Version:: .96
Score:: 0 (after 0 ratings)

This filter converts HTML to nicely formatted text using the text browser W3M. I use it to construct e-mail bodies, so I don't need two templates (one HTML, one plain text) for each detailed e-mail I want to send. Besides the obvious maintenance benefit, this is useful because Django's templating system isn't well suited to plain text, where whitespace and line breaks are significant.
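
As a rough illustration of the e-mail use case described above, the sketch below builds a two-part message from a single HTML body using only the Python 3 standard library. The names build_email and to_text are hypothetical and not part of the snippet; to_text stands for any HTML-to-text callable, such as the filter function on this page.

```python
from email.message import EmailMessage


def build_email(subject, sender, recipient, html_body, to_text):
    """Build a multipart/alternative message from a single HTML body.

    `to_text` is any callable that turns HTML into plain text (an
    assumption of this sketch, not something the snippet defines).
    """
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # The plain-text part is derived from the HTML, so only one template
    # has to be maintained.
    msg.set_content(to_text(html_body))
    # Adding an alternative turns the message into multipart/alternative.
    msg.add_alternative(html_body, subtype="html")
    return msg


if __name__ == "__main__":
    def fake_to_text(html):
        # Hypothetical stand-in converter so the example runs without W3M.
        return "Hello, world"

    message = build_email(
        "Greetings", "from@example.com", "to@example.com",
        "<h1>Hello, world</h1>", fake_to_text,
    )
    print(message.get_content_type())
```

Passing the converter in as an argument keeps the sketch independent of which text browser, if any, is installed.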

I chose W3M because it renders tables nicely and can take in HTML from STDIN (which Lynx can't do). An alternative is ELinks; to use it, change "cmd" to the following: elinks -force-html -stdin -dump -no-home

from subprocess import Popen, PIPE

from django import template

register = template.Library()

@register.filter
def html2text(value):
    """
    Pipes given HTML string into the text browser W3M, which renders it.
    Rendered text is grabbed from STDOUT and returned.
    """
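    try:
        cmd = "w3m -dump -T text/html -O ascii"
        # Run w3m directly rather than through a shell, so that a missing
        # binary raises OSError here instead of failing inside the shell.
        proc = Popen(cmd.split(), stdin=PIPE, stdout=PIPE)
        return proc.communicate(str(value))[0]
    except OSError:
        # w3m could not be started, so just return the input
        return value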


if __name__ == "__main__":
    from urllib import urlopen
    print html2text(urlopen("http://www.w3.org/TR/REC-html40/").read())
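
The description above mentions swapping the command for a different text browser. A minimal Python 3 sketch of that idea, outside Django, might take the command as a parameter. The names COMMANDS and render_html are hypothetical; the flags are copied from this page and not otherwise verified, and the UTF-8 encoding of the input is an assumption of the sketch.

```python
import subprocess

# Argument lists for the text browsers mentioned on this page.
COMMANDS = {
    "w3m": ["w3m", "-dump", "-T", "text/html", "-O", "ascii"],
    "elinks": ["elinks", "-force-html", "-stdin", "-dump", "-no-home"],
    "lynx": ["lynx", "-force_html", "-stdin", "-dump"],
}


def render_html(html, cmd):
    """Pipe `html` (a str) through the command `cmd` and return its output.

    Returns the input unchanged if the program cannot be started,
    mirroring the fallback in the original filter.
    """
    try:
        result = subprocess.run(
            cmd,
            input=html.encode("utf-8"),  # assumption: browser accepts UTF-8
            stdout=subprocess.PIPE,
        )
    except OSError:
        # The binary is missing or not executable.
        return html
    return result.stdout.decode("utf-8", errors="replace")
```

With this shape, render_html(page, COMMANDS["elinks"]) would select ELinks, assuming it is installed and on the PATH.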

Comments

whiteinge (on August 10, 2008):

cmd = "lynx -force_html -stdin -dump" works for me.

ludo (on August 12, 2008):

What's wrong with

from lxml import etree

def convert(text):
    return etree.tostring(
        etree.HTML(text),
        encoding='utf8', method='text'
    )

MasonM (on August 13, 2008):

cmd = "lynx -force_html -stdin -dump" works for me.

Right you are, but I don't see any reason to use Lynx. Both W3M and ELinks render HTML much better than Lynx.

What's wrong with ...

The etree.tostring() method is just a serialization method. It is not an HTML rendering engine, which is what W3M and ELinks provide.

Also, html2text

That script would be appropriate for very simple HTML, but ~400 lines of Python can't replace a complete rendering engine.
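
To illustrate the distinction drawn above between serializing text and rendering it, here is a small Python 3 sketch. It uses the standard library's HTMLParser as a stand-in for a text-only serializer (it is not lxml, and lxml's exact output is not shown here). The names TextOnly and strip_tags are hypothetical. Because nothing in it knows about layout, the table cells simply run together.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag, with no layout logic."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the concatenated text nodes of `html`."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


table = (
    "<table>"
    "<tr><td>Name</td><td>Qty</td></tr>"
    "<tr><td>Apple</td><td>3</td></tr>"
    "</table>"
)
print(strip_tags(table))  # prints "NameQtyApple3"
```

A rendering engine such as W3M would instead lay the same cells out as rows and columns.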

davenaff (on September 9, 2008):

Great idea. I'm trying this and am getting:

...
File "C:\Python25\lib\subprocess.py", line 885, in _communicate
  self.stdin.write(input)
IOError: [Errno 22] Invalid argument

After a little debugging, I haven't found a solution. Any ideas?
