If you are a Gmail user, have you ever wondered what Google does to the raw HTML of the endless variety of email messages so that they all display properly in your browser and on your mobile phone?
In case you didn't know, the email messages you receive from various senders contain many kinds of HTML tags, including style and script. On a webmail site such as Gmail, the HTML of an email is displayed inside the HTML of the parent webmail page, so the script and style tags in the message can interfere with the parent page. Gmail therefore has to process the raw HTML of each message, and so does Buytition Web Mail.
So the requirement is: given an HTML page, strip out interfering tags such as script and style but leave the rest intact. In the open source space, the Python package that seemed to fit best was lxml.html.clean.Cleaner, which is designed to do exactly this. We tried it as Solution 1 below, but a few months later we found a problem.
Solution 1: lxml.html.clean.Cleaner
Result: DID NOT WORK
This solution works in most cases, but in a few cases it makes damaging errors. One such case is an entire table wrapped in an a tag: Cleaner wrongly treats the a tag as unclosed and inserts a closing a tag immediately after the opening one, so the table that was linked in the original is no longer linked in the cleaned HTML.
Solution 2: BeautifulSoup
Result: DID NOT WORK
This solution did not work for us because it did not retain the HTML tags in the output.
In the end we chose a third solution that used none of the open source packages. The lesson from this fire drill is that although many open source packages are available, their quality is usually unknown to developers, and finding out whether one fits your use case takes a lot of testing and effort.
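The post does not describe the third solution, so the following is not our actual implementation. It is only a minimal sketch of how the stated requirement (drop script and style, keep everything else) could be met with Python's standard library alone. The names TagStripper, BLOCKED_TAGS and strip_blocked_tags are hypothetical, and the sketch silently drops comments and doctype declarations.

```python
from html.parser import HTMLParser

# Elements dropped together with their contents (an assumed list).
BLOCKED_TAGS = {"script", "style"}


class TagStripper(HTMLParser):
    """Re-emit the HTML it is fed, minus blocked elements and their contents."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; pass through unchanged.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.blocked_depth = 0  # greater than 0 while inside a blocked element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self.blocked_depth += 1
        elif self.blocked_depth == 0:
            # get_starttag_text() returns the tag exactly as written in the input.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>; they never open a blocked region.
        if tag not in BLOCKED_TAGS and self.blocked_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self.blocked_depth = max(0, self.blocked_depth - 1)
        elif self.blocked_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.blocked_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.blocked_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.blocked_depth == 0:
            self.out.append(f"&#{name};")


def strip_blocked_tags(html):
    """Return html without script/style elements; other tags are kept as-is."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the sketch only copies tags through and never tries to repair the markup, a table wrapped in an a tag comes out with the a tag still wrapped around it. The trade-off is that malformed input stays malformed.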
If you run into similar problems or would like to learn more details of this story, please contact buytition@gmail.com
Saturday, July 20, 2019
Sunday, July 14, 2019
Our Story with Flask, Google OAuth2 Library and Gunicorn
Earlier this year, we announced that buytition.com was integrated with Gmail. To achieve this integration, we had to roll out an API endpoint on our side to receive the call from Google after the user completes the authorization steps in the popup window.
Challenge: Flask Native Web Server vs Gunicorn
The API framework we chose was Flask because of its simplicity. I then had to choose between two web servers to serve the Flask API:
- Flask native web server
- a WSGI web server such as Gunicorn
It is widely said that choice 2 is better than choice 1 (see, for example, this article), since choice 1 is meant for development and choice 2 is more of a real web server. I still went with choice 1 first, because when facing unfamiliar technical choices one of my guiding principles is: never believe hearsay; go with the simplest, most straightforward solution first unless there is verifiable evidence that an alternative is better.
Problem with Flask Native Web Server
A few months after starting to use Flask's native web server, I observed a huge disadvantage: the API process may die after the error
IOError: [Errno 32] Broken pipe
The error usually comes up when multiple API requests are made at the same time. It is quite nasty because, first, I get no notice when it happens and, second, the Flask server has to be restarted manually, which is time-consuming.
Fire: Problem with Gunicorn
Naturally, the hearsay from a few months earlier had now proven to have some validity, and with this understanding I felt comfortable switching to choice 2, Gunicorn. However, nothing is perfect: a few weeks later, a nasty error that blocked users from linking their Gmail accounts started to surface. The Google OAuth2 server-side process consists of 5 steps of complex interactions among 3 parties: Google, the user and the application web server. The error happens at the last step, exchanging the authorization code for refresh and access tokens. The error is InsecureTransportError: (insecure_transport) OAuth 2 MUST utilize https. and it is raised at fetch_token in the following code:
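import flask
import google_auth_oauthlib.flow

# Inside the oauth2callback view: rebuild the flow with the state saved in the session.
state = flask.session['state']
flow = google_auth_oauthlib.flow.Flow.from_client_secrets_file(
    'client_secret.json',
    scopes=['https://www.googleapis.com/auth/youtube.force-ssl'],
    state=state)
flow.redirect_uri = flask.url_for('oauth2callback', _external=True)

# The full callback URL that Google redirected the user to.
authorization_response = flask.request.url
# InsecureTransportError is raised here when that URL starts with http://
flow.fetch_token(authorization_response=authorization_response)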
What was strange about this error is that it happened only under Gunicorn; the same code worked fine under Flask's native web server. I felt I had gone full circle, back to the initial challenge of Flask native web server vs Gunicorn. Both options have pros and cons, and now Gunicorn had a showstopper. Should I go back to the Flask web server?
Fire Drill
Out of fear of the IOError, I decided to stick with Gunicorn and tackle the InsecureTransportError.
The first challenge was to get visibility into the redirect URL passed to the fetch_token function. Knowing the content of this URL string is the key first step in diagnosing the problem, since the error indicates that the URL uses http rather than https. Strangely, for reasons I could not determine, logging calls such as print in this function did not print the string as they do in other Flask API functions, and the error could not be reproduced locally either. So I used an unconventional solution, writing the debug information to a DB table at the end of the API function call, and it worked.
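As an illustration of the DB-table logging idea, here is a minimal sketch. It assumes SQLite through Python's built-in sqlite3 module, which is not necessarily the database we used. The table name debug_log and the helpers log_debug and read_debug are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def log_debug(conn, label, value):
    """Store one debug record; the table is created on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS debug_log ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "logged_at TEXT NOT NULL, label TEXT NOT NULL, value TEXT)"
    )
    conn.execute(
        "INSERT INTO debug_log (logged_at, label, value) VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), label, str(value)),
    )
    conn.commit()


def read_debug(conn, label):
    """Return all values logged under label, oldest first."""
    rows = conn.execute(
        "SELECT value FROM debug_log WHERE label = ? ORDER BY id", (label,)
    ).fetchall()
    return [row[0] for row in rows]
```

In the callback, a call such as log_debug(conn, "authorization_response", authorization_response) just before fetch_token would record the URL even when nothing shows up on the console.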
With visibility into the URL string passed to fetch_token, I ran the Google OAuth2 process under both the Flask web server and Gunicorn and compared the values of that URL. To my surprise, the values were the same for both, something like http://0.0.0.0:5000/oauth2callback?state=.... Yet, again for some unknown reason, Gunicorn ran into InsecureTransportError while the Flask web server did not. I did not want to explore why the Flask web server tolerates this and Gunicorn does not, so I just went ahead and did the following:
authorization_response = authorization_response.replace(
"http://0.0.0.0:5000", "https://buytition.com")
After making the above replacement, I tested both options and both worked fine.
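The string replacement above only matches when the internal address is exactly http://0.0.0.0:5000. A slightly more general variant, sketched here as an assumption and not as what we deployed, rewrites the scheme and host with urllib.parse whatever address the server is bound to. The function name with_public_origin and its default arguments are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def with_public_origin(url, scheme="https", host="buytition.com"):
    """Return url with its scheme and host replaced; path, query and fragment are kept."""
    parts = urlsplit(url)
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))
```

It would be used the same way as the replacement: authorization_response = with_public_origin(flask.request.url) before calling fetch_token.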
If you run into similar problems or would like to learn more details of this story, please contact buytition@gmail.com