This page last modified: Jun 09 2010
keywords:sqlite,sql,data,proprietary,science,scientific,biology,bioinformatics,molecular,genomics,proteomics, description:Why SQLite is the best format for data files and warehousing. title:SQLite as data file format Please feel free to use this information to argue in favor of rational data formats. Please contact me with questions. http://laudeman.com/tom_mail/wmail.pl Table of contents ----------------- Introduction Proprietary is bad SQL is good SQLite as format, warehouse, exchange, and tool Pros of SQLite Cons of SQLite Introduction ------------ In my (roughly) ten years as a bioinformatics software engineer I've seen a fair amount of data in proprietary formats, open formats, and in SQL databases. Proprietary data formats have wasted time and money that should have been spend curing disease and feeding the hungry. Some of the proprietary data I worked with early in my career is already unreadable on current computers. Not including Illumina Solexa data sets, the largest sets of data I've worked with are genomic sequences and BLAST results generated as part Mike Timko's research in plants. I've also had the privilege of working on tandem mass spec proteomics data with Brian Balgley. I have drawn two critical conclusions: 1) Proprietary data formats are inappropriate for scientific data and should be universally rejected by scientists, journals, and funding agencies. 2) SQLite is the ideal format for storing, transporting, and accessing data. It is open source, portable across all platforms, and performs well. Below I'll lay out the background, as well as some recommendations for SQLite in science and especially bioinformatics. Proprietary is bad ------------------ The concept that data can only be accessed by certain licensed software on certain operating systems is anathema to the wide sharing of science and discovery. Platform "lock-in" is grossly unethical. It also turns out that proprietary formats are also universally unnecessary. The information technology world does not need another new format. XML has happily overcome some of this since XML is fairly easy to both create and manipulate. However. XML is a poor database solution, and is too bloated for large data sets. SQL is good ----------- Over the past few decades we have seen an increase in portable, standard technologies allowing all programming languages easy access to SQL databases. Examples are Perl DBI/DBD, ODBC, and JDBC. Relational databases are a well defined way to store, access, and update data. In both theory and practice, relational databases are both robust and scalable. It is well understood how to make SQL perform well with very large amounts of data. However, SQL has never been a "file format" because the software architecture is client-server where the database server is a large, complex, and somewhat difficult-to-administer tool. The "database" is typically an entire directory tree of operating system specific binary files. Database dumps are trivial to create with the usual SQL engines, and the dumps are fairly portable. The dumps can even be converted from one database to another. However, database dumps are inconvenient, and could not be considered a "file format" in the normal sense. SQLite as format, warehouse, exchange, and tool ----------------------------------------------- The bulkiness and administrative overhead of a traditional Relational Database Management System (RDBMS) was overcome many years ago with the advent of SQLite. SQLite is a fully transactional RDBMS which is a single, tiny, executable with a single file database. Performance with small numbers of transactions per second is the same as PostgreSQL, MySQL or Oracle. I have dropped millions of records into a SQLite database running on commodity hardware, with the same speed as PostgreSQL. In some cases, SQLite was faster. A SQLite database is a single file. SQLite is capable of opening several databases (each in it's own file), but typically, all necessary tables are in one database and that one database is in one file. Amazingly, that file is portable across Linux, Windows, and OSX. The database is essentially a platform independent data file. The SQLite engine is required to read the data, but the engine is very small, and freely available. SQLite is only 45K (yes, 45,000 bytes). As I understand the history, SQLite was designed as an embedded database, and as such had to be very small. There are SQLite drivers for nearly every programming language. The drivers are free, which means that application developers can use SQLite as their "file format" for free, with no licensing. The result is a "file" which is trivial to parse, standards compliant, and as nearly future-proof as possible. SQLite has the ability to make an ASCII text database dump, as an added future-proofing and portability feature. SQLite will gladly store binary data. The SQLite executable is also the command line tool. It is only slightly less sophisticated than PostgreSQL's "psql" command line (which I argue is the gold standard in RDBMS command lines). The SQLite engine/command line means that anyone in possession of a SQLite file has ready access to the data. No other tools are required. (Granted, that getting real work done is likely to require programming languages and other packages.) Each SQLite database is self describing, and can even include comments from the creator. When sharing data for download, we have the problem of what format to choose. Ideally, we would make the data available in several formats. XML, tab separated text, and various open and closed formats are often used. SQLite is an ideal format for download since the database is a single file, it is portable to all computers, and once downloaded the data is trivial to access. No other format is so readily used as SQLite. Tab separated columns are trivial to parse. However, if there are any relations in the data, extra programming code is necessary, and the problem can quickly become unmanageable. XML is readily portable, but the tools are tricky, the XML file is not a database, and XML tends to be somewhat bloated. (Which is easily solved by compression, but that's another step.) Common text formats like "fasta" (used for gene and protein sequences) can be easy to work with, but they have serious limitations. The fasta header can only be one line, and therefore is very limited in terms of descriptive fields. Anything beyond a bare minimum description requires some sort of implicit header format and there are no standards for that format. I store sequence data in SQL. The description and other meta data is well formed. A single, very simple SQL query serves to export the sequence data as fasta. Over the years, I've downloaded many fasta files. Every one of them had a slightly different format. While not intractable, this means time is wasted parsing and fixing formats. Sequence data is a perfect example of data which should be in SQLite. The database could contain the sequences, meta data for each sequence where appropriate, and meta data about the entire collection of sequences. All in a beautiful, normalized, all in a form that is easy to export to fasta, or process with various tools. Pros of SQLite -------------- - open source, standards compliant - single file, portable across operating systems - tiny, single executable RDBMS with true transactions - standard, full featured SQL - easy import and export - zero configuration, serverless, self-contained (single executable binary, single file database) - drivers for nearly all programming languages and for many packages (I haven't used it, but an example is the R interface to SQLite.) - available for an extremely wide variety of operating systems and hardware platforms from the largest servers to tiny, embedded systems. - ability to store any scientific data - schema is inherent to the database, so the "file" is largely self-documenting - very good performance - can handle very large data sets - SQL is standard and well understood - relational data modeling is arguably the most robust data model - extensive API for embedded use - existing software is easily modified to use SQLite as an optional file format Cons of SQLite -------------- - single executable is somewhat unsuited to distributed systems; the SQLite engine has no ability to accept network connections, therefore networked situations require additional middle ware - lacks a suite of enterprise tools, although standard SQL tools and administrative practices apply. - multi-user access is handled by record and table locking, which limits the number simultaneous users - some of the more sophisticated SQL features are not available - no administrative features; relies on the operating system for everything