I recently completed some work on the GarinDriver to research a more flexible model for database schema change management. This work gave me a chance to dig a little deeper into the nuts and bolts of Java database interactions. Working to track schema changes and support a wide range of databases increased my appreciating for powerful open source database platforms available today. In particular, working with H2 Database gave me a chance to explore what I now think is one of the hidden treasures of the open source DB marketplace. Overall it was a fun bit of hobby coding and I believe it will be useful in future real world projects.
Working on the driver, one challenge that went beyond my expectation was parsing the various dialects of SQL. In a review of current open source projects, I did not find any public license parsers that can parse statements from ALL database systems. In a sense, we are supposed to have a standard but it is really well followed by the vendors and as such writing a universal parse is a bit difficult. I had some free time this weekend so I took on the challenge since it is an area where the open source community really does not currently have a good solution. For example the tools that ship with Eclipse can not parse the $tag$ style strings from PostgreSQL or the q’[O'Brian]‘ style supported by Oracle. In any case, I had some free time and interest so I decided to build my own parser. Without such a parser it would be impossible to provide full support for the statement execution model needed to support the GarinDriver desgin. Workarounds were possible but a proper parser felt like the best approach.
Building a parser is complex and often involves specialized tools (i.e. ANTLR, BISON, JAVACC, FLEX, LEX, YACC, etc). There are benefits and drawbacks to the traditional grammar definition and parser generator approach. My initial review it seemed that supporting the often contradictory approaches used by different systems would be quite difficult with these tools.
Since all that I needed for the GarinDriver/LiquiBase project was “shallow parsing” my approach was to use a small framework of Java classes. The use of my own framework allows me to share logic across the SQL dialects and provided a chance to explore idea for a more dynamic approach to “shallow parsing”. So far the approach seems to be working out nicely although I am considering building a deep parser at some point later on with the Eclipse XTEXT project but for now the flexiiblity of a home grown hand coded parser has proved to be a viable approach for “shallow parsing”.
Parser Fun:
- Standard SQL comments have EOL style (–) and the block style (/* */).
- Block style comments nest within each other so parser must count the nesting levels.
- MySQL supports pound sign comments in addition to the standard forms.
- HSQLDB supports // style comments in addition to the standard forms.
- Standard SQL uses double quotes for identifiers and single quotes for text strings.
- The parser must support both including quote symbol doubling for escapes (i.e. O”Brian or My “”big”" table.
- Oracle supports q’[O'Brian]‘ in addition to the standard style.
- MySQL uses C style string encoding (i.e. ‘O\’Brian’ ).
- PostgreSQL supports $tag$O’Brian$tag$ style in addition to the standard forms AND the mysql format.
My parser uses object oriented techniques to define a base token concept, extend the base concept for SqlStatements, and then extend the SqlStatements to define the dialect variants in what I hope will be an extendable framework. This approach provides more flexibility for future growth vs. more static models and parser generator tools. In any case the approach seems to work and the ability to use object orientation to extend the parser in new directions seems like a good thing.
As with other hobby projects, my work in this area is EPL and hosted on Google code. The documentation is sparse but if you are looking for SQL parser code you may find that this code can help you develop new and interesting tools for working with SQL systems.
http://code.google.com/p/garinparser/
If you decide to use the parser please let me know. Also, if you can define a legal SQL statement that does not parse correctly with my parser let me know since I like to eat tasty bugs.