Saturday, July 20, 2019

Problem with lxml.html.clean.Cleaner

If you are a Gmail user, have you ever wondered what Google does to the raw HTML of the endless variety of email messages so that they all display properly in your browser and on your mobile phone?

In case you don’t know, the email messages you receive from various senders contain many kinds of HTML tags, including style and script. On a webmail site such as Gmail, the HTML of an email is displayed inside the HTML of the parent webmail page, so the script and style tags in the message would interfere with the parent page. Gmail therefore needs to process the raw HTML of each email message, and so does Buytition Web Mail.

So the requirement is: given an HTML page, strip out interactive tags such as script and style but leave the remaining parts intact. The open source Python package that appeared to fit this best was lxml.html.clean.Cleaner, which is designed to do exactly that. We tried it as Solution 1 below, but a few months later we found a problem.

Solution 1: lxml.html.clean.Cleaner
Result: DID NOT WORK
This solution works in most cases, but in a few cases it makes damaging errors. One such case is when an entire table is wrapped in an a tag. The cleaner wrongly treats the a tag as unclosed and inserts a closing a tag immediately after the opening one, so the originally linked table is no longer linked in the cleaned HTML.
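A regression check can catch this kind of damage before it reaches users. The following is a minimal sketch using only the Python standard library. The names LinkedTableChecker and table_is_linked are made up for illustration, and the damaged string is written by hand to match the description above rather than copied from actual lxml output.

```python
from html.parser import HTMLParser


class LinkedTableChecker(HTMLParser):
    """Records whether any <table> opens while an <a> element is still open."""

    def __init__(self):
        super().__init__()
        self.open_anchors = 0
        self.linked_table_seen = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.open_anchors += 1
        elif tag == "table" and self.open_anchors > 0:
            self.linked_table_seen = True

    def handle_endtag(self, tag):
        if tag == "a" and self.open_anchors > 0:
            self.open_anchors -= 1


def table_is_linked(html):
    """Return True if at least one table in the markup sits inside a link."""
    checker = LinkedTableChecker()
    checker.feed(html)
    checker.close()
    return checker.linked_table_seen


# Hypothetical input: an entire table wrapped in a link.
original = '<a href="https://example.com"><table><tr><td>Offer</td></tr></table></a>'
# Hand-written stand-in for the damaged output: the link is closed too early.
damaged = '<a href="https://example.com"></a><table><tr><td>Offer</td></tr></table>'

print(table_is_linked(original))  # True
print(table_is_linked(damaged))   # False
```

Running the same check on the HTML before and after cleaning, and flagging any message where the answer changes from True to False, would surface this case automatically.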

Solution 2: BeautifulSoup
Result: DID NOT WORK
This solution did not work because its output did not retain the HTML tags.

Finally, we chose a third solution that used none of the open source packages. The lesson from this fire drill is that although many open source packages are available, their quality is usually unknown to developers, and finding out whether one fits your use case takes a lot of testing and effort on your part.
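To give a feel for what a cleaner built without third-party packages can look like, here is a small sketch based on Python's built-in html.parser module. It is an illustration under stated assumptions, not the production code behind the third solution. The names TagStripper, DROPPED_TAGS and strip_active_tags are invented for this example. It only drops script and style elements, discards comments and doctype declarations, and does nothing about other risks such as inline event-handler attributes.

```python
from html import escape
from html.parser import HTMLParser

# Assumption for this sketch: only these two elements are removed.
DROPPED_TAGS = {"script", "style"}


class TagStripper(HTMLParser):
    """Re-emits the input markup, skipping script/style elements and their contents."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True, so text arrives unescaped
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Keep the start tag exactly as written, attributes included.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag to emit.
        if tag not in DROPPED_TAGS and self.skip_depth == 0:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text was unescaped by the parser, so escape it again on output.
            self.parts.append(escape(data, quote=False))


def strip_active_tags(html):
    """Return the markup with script and style elements removed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


# Hypothetical message body, including a table wrapped in a link.
message = (
    '<div><script>alert(1)</script><style>p{color:red}</style>'
    '<a href="https://example.com"><table><tr><td>Offer &amp; more</td></tr></table></a></div>'
)
print(strip_active_tags(message))
```

Because this approach copies tags through as written instead of rebuilding a document tree, a link that wraps a table comes out still wrapping the table. The trade-off is that malformed markup is passed along unrepaired.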

If you run into similar problems or would like to learn more details of this story, please contact buytition@gmail.com
