Python and Tidy.

2008 Dec 11 at 00:32 » Tagged as: python

The other day, I was having a spot of bother parsing HTML fragments using the Python DOM. Though I did overcome that problem, I immediately ran into others caused by the odd missing tag and the odd special character. Rather than get into a mess of regular expressions, I thought I'd try the HTML Tidy API for Python. This is not my first run-in with Tidy; I have used it with Java and PHP for a long time. Before you can make use of it, you have to install it, which can be done with the following commands.

yum install tidy

yum install python-tidy
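
Once both packages are installed, a quick sanity check is to confirm that Python can actually find the binding. The sketch below assumes the package exposes a module named tidy (the name used by the code in this post); the helper name tidy_available is made up for illustration.

```python
import importlib.util


def tidy_available(module_name="tidy"):
    """Return True if a top-level module with the given name can be located."""
    # find_spec returns None when no such top-level module is installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if tidy_available():
        print("tidy binding found")
    else:
        print("tidy binding not found; check the install step")
```

This only looks the module up without importing it, so it says nothing about whether the underlying Tidy shared library loads correctly.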

Then it's time to visit the documentation page, which isn't very helpful. Here is how you filter a document through Tidy.

import tidy
from xml.dom.minidom import parseString  # assuming minidom is the DOM parser in use

options = dict(output_xhtml=1, add_xml_decl=1, indent=1, tidy_mark=0)
tidyDoc = tidy.parse(basedir + file, **options)

Tidy's output can be used as input to create a DOM Document.

domDoc = parseString(str(tidyDoc))

One thing I found rather annoying about Tidy is that it doesn't recognize the <dt> tag.
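
To show the DOM step on its own, here is a minimal sketch that needs only the standard library. The XHTML string is a hand-written stand-in for what Tidy might emit, not real Tidy output, and the function name paragraph_texts is made up for illustration. It also assumes xml.dom.minidom is the DOM parser being used.

```python
from xml.dom.minidom import parseString

# Hand-written stand-in for Tidy's XHTML output (an assumption for illustration).
TIDIED_XHTML = """<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>"""


def paragraph_texts(xhtml):
    """Return the text of every <p> element in a well-formed XHTML string."""
    dom = parseString(xhtml)
    texts = []
    for p in dom.getElementsByTagName("p"):
        # Join the text nodes that are direct children of this <p>.
        parts = [node.data for node in p.childNodes
                 if node.nodeType == node.TEXT_NODE]
        texts.append("".join(parts).strip())
    return texts


if __name__ == "__main__":
    print(paragraph_texts(TIDIED_XHTML))
```

The same function raises an ExpatError on markup with a missing closing tag, which is exactly the kind of input that makes running the document through Tidy first worthwhile.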