• Python and Tidy.

    The other day I was having a spot of bother parsing HTML fragments using the Python DOM. Though I did overcome that problem, I immediately ran into others caused by the odd missing tag and the odd special character. Rather than get into a mess of regular expressions, I decided to try the HTML Tidy API for Python. This is not my first run-in with Tidy; I have used it with Java and PHP for a long time.

    Before you can use it you have to install it, which can be done with the following commands.

    yum install tidy

    yum install python-tidy

    Then it’s time to visit the documentation page, which isn’t really very useful. Here is how you filter a document through Tidy.

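    import tidy
    # basedir and file are assumed to be set earlier: the directory and name of the HTML file to clean
    options = dict(output_xhtml=1, add_xml_decl=1, indent=1, tidy_mark=0)
    tidyDoc = tidy.parse(basedir + file, **options)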

    Tidy’s output can be used as input to create a DOM Document.

    from xml.dom.minidom import parseString  # assuming the minidom parser
    domDoc = parseString(str(tidyDoc))
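
    The two steps can be put together roughly as follows. This is only a sketch: the helper names xhtml_to_dom and tidy_file_to_dom are made up for illustration, parseString is assumed to come from xml.dom.minidom, and SAMPLE_XHTML is a hand-written stand-in for Tidy's output rather than something Tidy actually produced. The tidy module is imported lazily, so the DOM half runs even on a machine where python-tidy is not installed.

```python
from xml.dom.minidom import parseString


def xhtml_to_dom(xhtml_text):
    """Parse well-formed XHTML text (such as Tidy's output) into a DOM Document."""
    return parseString(xhtml_text)


def tidy_file_to_dom(path):
    """Run a file through Tidy, then parse the result. Needs the tidy module."""
    import tidy  # imported here so the rest of the sketch works without it

    # Same options as in the post: XHTML output with an XML declaration.
    options = dict(output_xhtml=1, add_xml_decl=1, indent=1, tidy_mark=0)
    return xhtml_to_dom(str(tidy.parse(path, **options)))


# Hand-written stand-in for Tidy output (an assumption, not real Tidy output).
SAMPLE_XHTML = (
    '<?xml version="1.0"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example</title></head>'
    '<body><p>First</p><p>Second</p></body>'
    '</html>'
)

dom_doc = xhtml_to_dom(SAMPLE_XHTML)

# Collect the text of every <p> element through the DOM API.
paragraphs = [p.firstChild.data for p in dom_doc.getElementsByTagName("p")]
print(paragraphs)
```

    Once the document is in the DOM, the usual calls such as getElementsByTagName work on it, which is the whole point of cleaning the markup first.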

    One thing I found rather annoying about tidy is that it doesn’t recognize the <dt> tag.

    Thursday, December 11th, 2008 at 06:02