Google has long been a supporter of HTML5 and is also known as a proponent of open source. With that in mind, it’s not too surprising that the company recently open sourced its HTML parsing library, Gumbo. Written in C, Gumbo adheres to the HTML5 parsing algorithm and passes all html5lib-0.95 tests.
Google’s support for the latest HTML revision goes back a long way. One of the biggest steps came in 2011, when the company decided to support only browsers capable of handling HTML5. The goal was to help Web applications develop quickly to the point where they could compete with traditional software, a goal that has been more or less realized.
More recently, Google, along with Microsoft and Netflix, began efforts to get proper DRM (digital rights management) incorporated within HTML5. While DRM seems contrary to Google’s tendency toward openness, it is necessary if media companies are to use the Web standard.
As for Gumbo, at the time of its open sourcing it had been tested on 2.5 billion pages indexed by Google. That gives developers a lightweight and dependable HTML parsing library with no outside dependencies that can be called from most languages. Gumbo could be useful in webpage validators, static analyzers, templating languages and refactoring tools, to name a few.
While Google has described Gumbo as being “robust and resilient to bad input,” the company still doesn’t recommend maintaining pointers to parts of its internal data structures, as the API is still being worked on and will likely change in the future. Despite its relatively early state, though, the API is considered stable enough that it is only waiting on comments from users before being released as version 1.0, which is likely to happen soon.
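For developers wondering what that advice looks like in practice, the usual pattern is to copy out whatever is needed from a parse tree before freeing it, rather than holding on to pointers into the tree. The following is only a rough sketch of that pattern. The types and functions in it (DemoNode, DemoTree, demo_parse, demo_destroy, copy_node_text) are hypothetical stand-ins invented for illustration and are not part of Gumbo’s API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a parser-owned node. NOT Gumbo's real type. */
typedef struct DemoNode {
    char *text; /* memory owned by the parse tree */
} DemoNode;

/* Hypothetical stand-in for parser output that owns all of its nodes. */
typedef struct DemoTree {
    DemoNode *root;
} DemoTree;

/* Portable string duplicate; returns NULL if allocation fails. */
static char *demo_strdup(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) {
        memcpy(p, s, n);
    }
    return p;
}

/* Pretend "parse": wraps the input in a one-node tree that owns a copy. */
DemoTree *demo_parse(const char *input) {
    DemoTree *tree = malloc(sizeof *tree);
    if (tree == NULL) {
        return NULL;
    }
    tree->root = malloc(sizeof *tree->root);
    if (tree->root == NULL) {
        free(tree);
        return NULL;
    }
    tree->root->text = demo_strdup(input);
    if (tree->root->text == NULL) {
        free(tree->root);
        free(tree);
        return NULL;
    }
    return tree;
}

/* Frees the whole tree at once; every pointer into it becomes invalid. */
void demo_destroy(DemoTree *tree) {
    if (tree == NULL) {
        return;
    }
    free(tree->root->text);
    free(tree->root);
    free(tree);
}

/* Copies the node's text so the result outlives the tree.
   The caller owns the returned string and must free() it. */
char *copy_node_text(const DemoNode *node) {
    assert(node != NULL);
    return demo_strdup(node->text);
}
```

A caller would parse, call copy_node_text on the nodes it cares about, destroy the tree, and keep working with the copies. Since the copies do not point into the parser’s structures, they are unaffected if the layout of those structures changes in a later release.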
Future features to look forward to in Gumbo include support for recent HTML5 spec changes covering the template tag, support for fragment parsing, full-featured error reporting, and bindings in other languages.
Edited by
Alisen Downey