HTML to text filter

from subprocess import Popen, PIPE

from django import template

register = template.Library()

@register.filter
def html2text(value):
    """
    Pipes given HTML string into the text browser W3M, which renders it.
    Rendered text is grabbed from STDOUT and returned.
    """
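    try:
        cmd = "w3m -dump -T text/html -O ascii"
        # Pass an argument list rather than shell=True: a missing w3m binary
        # then raises OSError instead of just yielding empty output.
        proc = Popen(cmd.split(), stdin=PIPE, stdout=PIPE)
        # communicate() takes and returns bytes; "-O ascii" keeps output ASCII.
        output = proc.communicate(str(value).encode("utf-8"))[0]
        return output.decode("ascii", "replace")
    except OSError:
        # w3m could not be started, so just return the input
        return value


if __name__ == "__main__":
    from urllib.request import urlopen
    page = urlopen("http://www.w3.org/TR/REC-html40/").read()
    print(html2text(page.decode("utf-8", "replace")))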
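
The docstring describes the core technique: write a string to a child process on STDIN and collect whatever it prints on STDOUT. The sketch below isolates that pattern so it can be tried without W3M installed. The helper name `pipe_through` and the stand-in child (a Python one-liner that upper-cases its input) are assumptions made for illustration, not part of the snippet.

```python
import sys
from subprocess import Popen, PIPE


def pipe_through(args, text, encoding="utf-8"):
    """Send text to a child process on STDIN and return what it writes to STDOUT."""
    proc = Popen(args, stdin=PIPE, stdout=PIPE)
    # communicate() writes the input, closes STDIN, and reads STDOUT to the
    # end, which avoids the deadlock that separate write()/read() calls can hit.
    output, _ = proc.communicate(text.encode(encoding))
    return output.decode(encoding, "replace")


# Hypothetical stand-in child: upper-cases whatever arrives on STDIN.
# With W3M installed, args could instead be the command from the snippet,
# split into a list: ["w3m", "-dump", "-T", "text/html", "-O", "ascii"].
UPPERCASE_CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(pipe_through(UPPERCASE_CHILD, "<p>hello</p>"))
```

Using `communicate()` rather than writing and reading the pipes by hand is the same choice the filter makes; it matters once the input is larger than the operating system's pipe buffer.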


Comments

whiteinge (on August 10, 2008):

cmd = "lynx -force_html -stdin -dump" works for me.
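
Since more than one text browser can fill this role, one option is to pick whichever is installed. This is a minimal sketch under assumptions: the helper `pick_renderer`, the `RENDERERS` list, and the preference order are hypothetical; only the two command lines come from this page.

```python
import shutil

# Candidate commands in order of preference. Both command lines appear on
# this page; the ordering and this helper are illustrative assumptions.
RENDERERS = [
    "w3m -dump -T text/html -O ascii",
    "lynx -force_html -stdin -dump",
]


def pick_renderer(candidates=RENDERERS, which=shutil.which):
    """Return the first command whose executable is found on PATH, else None."""
    for cmd in candidates:
        executable = cmd.split()[0]
        if which(executable) is not None:
            return cmd
    return None
```

A filter could call `pick_renderer()` once at import time and return its input unchanged when the result is `None`. The `which` parameter exists only so the lookup can be replaced in tests.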


ludo (on August 12, 2008):

What's wrong with

from lxml import etree

def convert(text):
    return etree.tostring(
        etree.HTML(text),
        encoding='utf8', method='text'
    )


baumer1122 (on August 12, 2008):

Also, html2text


MasonM (on August 13, 2008):

> cmd = "lynx -force_html -stdin -dump" works for me.

Right you are, but I don't see any reason to use Lynx. Both W3M and ELinks render HTML much better than Lynx.

> What's wrong with ...

The etree.tostring() method is just a serialization method. It is not an HTML rendering engine, which is what W3M and ELinks provide.

> Also, html2text

That script would be appropriate for very simple HTML, but ~400 lines of Python can't replace a complete rendering engine.
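
The difference between serializing text and rendering a page can be seen with a small experiment. The sketch below uses the standard library's `html.parser` as a stand-in for a text serializer; it is not lxml and may differ from `etree.tostring()` in detail. The class `TextOnly`, the function `strip_tags`, and the `SAMPLE` markup are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag, like a text serializer."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return only the text nodes of the markup, concatenated in order."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# A list followed by a one-row table, with no whitespace between tags.
SAMPLE = (
    "<ul><li>one</li><li>two</li></ul>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
)

if __name__ == "__main__":
    print(strip_tags(SAMPLE))  # prints "onetwoab"
```

All structure is lost: list items and table cells run together because nothing decides where lines or columns should go. A rendering engine makes those layout decisions, which is the distinction drawn above.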


davenaff (on September 9, 2008):

Great idea. I'm trying this and am getting:

...
File "C:\Python25\lib\subprocess.py", line 885, in _communicate
  self.stdin.write(input)
IOError: [Errno 22] Invalid argument

After a little debugging, I haven't found a solution. Any ideas?
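
One plausible explanation, not confirmed anywhere in this thread: on Windows, writing to the STDIN pipe of a child that has already exited fails with errno 22, which is what would happen if `w3m` is not installed or not on `PATH` and the shell gives up immediately. The sketch below avoids the shell so that a missing executable surfaces as `OSError`, and also falls back when the child exits with a non-zero status. The function name `render_or_passthrough` and its `args` parameter are assumptions for illustration.

```python
from subprocess import Popen, PIPE

# Command from the snippet, pre-split so that no shell is involved.
W3M_ARGS = ("w3m", "-dump", "-T", "text/html", "-O", "ascii")


def render_or_passthrough(html, args=W3M_ARGS):
    """Render html with the given command, or return it unchanged on failure."""
    try:
        # Without shell=True, a missing executable raises OSError right here.
        proc = Popen(list(args), stdin=PIPE, stdout=PIPE)
        output, _ = proc.communicate(html.encode("utf-8"))
    except OSError:
        return html
    if proc.returncode != 0:
        # The renderer started but reported an error; keep the original text.
        return html
    return output.decode("ascii", "replace")
```

If this version returns the input unchanged on the affected machine, that would point to the renderer not being found rather than to a problem in `subprocess` itself.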

