Parsing HTML with Python.

2008 Dec 8 at 01:08 » Tagged as: python

Needed to make a small change to my photoblog - a change that should be reflected on every single post. Thought I might be able to use Python's DOM support to get it done. With PHP it's quite easy; I have done that sort of thing many times in the past. Can't that be handled by the template? No, because this was a CSS change that really did involve changing each post's HTML. It was just a matter of removing an inline style (CSS) and replacing it with a CSS class.

But each post only holds a fragment of HTML, not a complete document with a single root element, so Python's minidom simply refused to parse it. That forced me to look at HTMLParser, which I didn't like because it's an old-fashioned event-driven parser. There was a time when I used to swear by event-driven parsers (expat, for example), but that was long ago. Thus I had no option but to go back to the minidom API and make my HTML fragment well-formed by adding a new start and end tag to enclose the whole fragment. These enclosing tags can be stripped out later, when I am saving the modified post back to the database. With this approach the code is short and sweet. Of course I still need to flesh it out by adding support for retrieving posts from the database and writing them back, instead of working with the hardcoded bit of HTML used during testing.
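A minimal sketch of the parsing problem described above, separate from the original script. The fragment is a shortened stand-in for a post body, and the helper name `is_well_formed` is hypothetical. It only checks whether minidom accepts the markup, with and without a temporary wrapper element.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

# Shortened stand-in for a post body: an element followed by loose text,
# so there is no single root element.
fragment = ('<p align="center"><img src="/images/comingup-t.jpg" /></p>'
            'Great Western, Nuwara Eliya at Dawn.')


def is_well_formed(markup):
    """Return True if minidom can parse the markup as a complete document."""
    try:
        parseString(markup)
    except ExpatError:
        return False
    return True


# The bare fragment is rejected; the same text inside one root element parses.
bare_ok = is_well_formed(fragment)
wrapped_ok = is_well_formed('<xml>' + fragment + '</xml>')
print(bare_ok, wrapped_ok)
```

The script below applies the same wrapper to the real markup from the post.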

#!/usr/bin/python

from xml.dom.minidom import parseString

# Wrap the fragment in a temporary <xml> element so that minidom will parse it.
domNode = parseString('<xml><p align="center"><a href="/images/comingup.jpg"><img src="/images/comingup-t.jpg" title="dawn" alt="Sunrise close to Nuwara Eliya" style="border-color: #505050; border-width: 7px" /></a></p>Great Western, Nuwara Eliya at Dawn.</xml>')

# Swap the inline style on the first image for a CSS class.
ele = domNode.getElementsByTagName('img')
ele.item(0).removeAttribute('style')
ele.item(0).setAttribute("class", "photo")

# The document's first child is the temporary wrapper; print only its children.
subNode = domNode.firstChild

if subNode.hasChildNodes():
    children = subNode.childNodes
    for child in children:
        print(child.toxml())
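The step of stripping the enclosing tags before saving could be folded into a small function, sketched here under some assumptions: the function name `restyle_images` and the sample `post` string are hypothetical, the database read and write are left out, and the function updates every image that carries an inline style rather than only the first one.

```python
from xml.dom.minidom import parseString


def restyle_images(fragment, css_class="photo"):
    """Swap the inline style on each <img> for a CSS class; return the fragment."""
    # Temporary root element so minidom accepts the fragment as a document.
    doc = parseString("<xml>" + fragment + "</xml>")
    for img in doc.getElementsByTagName("img"):
        # removeAttribute raises NotFoundErr if the attribute is missing,
        # so check first and leave unstyled images alone.
        if img.hasAttribute("style"):
            img.removeAttribute("style")
            img.setAttribute("class", css_class)
    # Serialising only the root's children leaves the wrapper behind.
    return "".join(child.toxml() for child in doc.documentElement.childNodes)


# Hypothetical sample post, shortened from the markup used above.
post = ('<p><img src="/images/comingup-t.jpg" style="border-width: 7px" /></p>'
        'Great Western, Nuwara Eliya at Dawn.')
updated = restyle_images(post)
print(updated)
```

The returned string is what would be written back to the database. Note that minidom re-serialises the markup, so details such as the spacing inside self-closing tags may differ from the stored original.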