from subprocess import Popen, PIPE

from django import template

register = template.Library()


@register.filter
def html2text(value):
    """
    Pipes given HTML string into the text browser W3M, which renders it.
    Rendered text is grabbed from STDOUT and returned.
    """
    try:
        cmd = "w3m -dump -T text/html -O ascii"
        proc = Popen(cmd, shell=True, stdin=PIPE, stdout=PIPE)
        return proc.communicate(str(value))[0]
    except OSError:
        # something bad happened, so just return the input
        return value


if __name__ == "__main__":
    from urllib import urlopen
    print html2text(urlopen("http://www.w3.org/TR/REC-html40/").read())
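
The snippet above targets Python 2. Because it uses shell=True, a missing w3m binary is reported by the shell through its exit status rather than raised as OSError, so the fallback branch may never run. Below is a minimal Python 3 sketch of the same idea, not the author's code: the name render_html and its command parameter are hypothetical, and only the w3m flags are taken from the snippet. It passes the command as an argument list and returns the input unchanged if the program cannot be started or exits with an error.

```python
import subprocess


def render_html(value, command=("w3m", "-dump", "-T", "text/html", "-O", "ascii")):
    """Pipe an HTML string through an external text browser and return its output.

    `render_html` and `command` are illustrative names, not part of the
    original snippet. The input is returned unchanged if the command cannot
    be started or exits with a non-zero status.
    """
    try:
        result = subprocess.run(
            list(command),                     # argument list, no shell involved
            input=str(value).encode("utf-8"),  # sent to the child's stdin
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            check=True,                        # non-zero exit raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError):
        # Binary missing or rendering failed: fall back to the raw input.
        return value
    return result.stdout.decode("utf-8", errors="replace")
```

To use it as a template filter, the function could be registered with the same register.filter decorator as html2text; a different text browser can be swapped in through the command argument.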
Comments
cmd = "lynx -force_html -stdin -dump" works for me.
#
What's wrong with
#
Also, html2text
#
Right you are, but I don't see any reason to use Lynx. Both W3M and ELinks render HTML much better than Lynx.
The etree.tostring() method is just a serialization method. It is not an HTML rendering engine, which is what W3M and ELinks provide.
That script would be appropriate for very simple HTML, but ~400 lines of Python can't replace a complete rendering engine.
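
To illustrate the serialization point, here is a small sketch that uses the standard library's xml.etree.ElementTree as a stand-in; that choice is an assumption, since the comment above does not say which etree is meant. With method="text" the call only concatenates the text nodes, so headings and list items run together with no layout.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed fragment; real-world HTML is often not valid XML.
html = "<div><h1>Title</h1><ul><li>one</li><li>two</li></ul></div>"
root = ET.fromstring(html)

# Serialize only the text content: no line breaks, bullets, or indentation.
flat = ET.tostring(root, encoding="unicode", method="text")
print(repr(flat))  # prints 'Titleonetwo'
```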
#
Great idea. I'm trying this and am getting:
After a little debugging, I haven't found a solution. Any ideas?
#