A utility function I am currently using to clean up text pasted from Word into a TinyMCE-enabled text field.
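    import re

    def clean_word(txt, its):
        """Strip Word markup from HTML; `its` is the number of extra passes."""
        # Remove these tags outright, keeping whatever text they wrapped.
        for i in "font div span img hr table td tr".split():
            r = re.compile(r'</?%s[^>]*>' % i)
            txt = r.sub('', txt)
        # Remove comments, conditional blocks, <style> blocks, namespaced
        # elements such as <o:p>, class/align/style attributes, mso- rules,
        # and elements left empty.
        for i in [
            r'<!--.*?<![^>]*>',
            r'<.--\[if [^>]*>.*?<.\[endif]-->',
            r'<style>.*?</style>',
            r'<(\w:[^>]*?)>.*</\1>',
            r'class=".*?"',
            r'<.--.*?-->',
            r'<!--.*?-->',
            #r'<p[^>]*> </p[^>]*>',
            #r'<p[^>]*>\s*</p[^>]*>',
            r"""align=["'][^"']*["']""",
            r"""style=["'][^"']*["']""",
            r'{mso-[^}]*}',
            r'<[^>]*>(( )|\s*)</[^>]*>',
        ]:
            r = re.compile(i, re.DOTALL)
            txt = r.sub('', txt)
        # Run the passes again, since each removal can expose new empty elements.
        if its > 0:
            return clean_word(txt, its - 1)
        # Finally, turn each run of <br> tags into a paragraph break.
        r = re.compile(r'(<br\s?/?>\s*){1,9999}')
        txt = r.sub("</p><p>", txt)
        return txt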
Comments
The regexes are compiled again each time the function is called. They don't change between calls to clean_word, so why not move them to the module level and compile them only once there?
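A minimal sketch of that suggestion, assuming the same patterns as the snippet. The names `clean_word_precompiled`, `_TAG_PATTERNS`, `_JUNK_PATTERNS` and `_BR_RUN` are made up for illustration, and the recursion is written as a loop. Python's `re` module already keeps an internal cache of recently compiled patterns, so the saving may be modest in practice.

```python
import re

# Compiled once at import time rather than on every call.
_TAG_PATTERNS = [
    re.compile(r'</?%s[^>]*>' % tag)
    for tag in "font div span img hr table td tr".split()
]

_JUNK_PATTERNS = [
    re.compile(pattern, re.DOTALL)
    for pattern in (
        r'<!--.*?<![^>]*>',
        r'<.--\[if [^>]*>.*?<.\[endif]-->',
        r'<style>.*?</style>',
        r'<(\w:[^>]*?)>.*</\1>',
        r'class=".*?"',
        r'<.--.*?-->',
        r'<!--.*?-->',
        r"""align=["'][^"']*["']""",
        r"""style=["'][^"']*["']""",
        r'{mso-[^}]*}',
        r'<[^>]*>(( )|\s*)</[^>]*>',
    )
]

_BR_RUN = re.compile(r'(<br\s?/?>\s*){1,9999}')


def clean_word_precompiled(txt, its):
    """Hypothetical variant of clean_word that reuses module-level patterns."""
    # The original runs its passes once and then recurses while its > 0,
    # which amounts to max(its, 0) + 1 passes in total.
    for _ in range(max(its, 0) + 1):
        for pattern in _TAG_PATTERNS + _JUNK_PATTERNS:
            txt = pattern.sub('', txt)
    # Turn each run of <br> tags into a paragraph break, once, at the end.
    return _BR_RUN.sub("</p><p>", txt)


if __name__ == "__main__":
    sample = '<p class="MsoNormal">Hello <span style="color:red">world</span></p>'
    print(clean_word_precompiled(sample, 1))
```

The call signature matches the original, so it could be swapped in wherever clean_word(txt, its) is used.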