Defindit Docs and Howto Home

This page last modified: Jun 09 2010
keywords:sqlite,sql,data,proprietary,science,scientific,biology,bioinformatics,molecular,genomics,proteomics,
description:Why SQLite is the best format for data files and warehousing.
title:SQLite as data file format

Please feel free to use this information to argue in favor of rational
data formats. Please contact me with questions.

http://laudeman.com/tom_mail/wmail.pl



Table of contents
-----------------
Introduction
Proprietary is bad
SQL is good
SQLite as format, warehouse, exchange, and tool
Pros of SQLite
Cons of SQLite



Introduction
------------

In my (roughly) ten years as a bioinformatics software engineer I've
seen a fair amount of data in proprietary formats, open formats, and
in SQL databases. Proprietary data formats have wasted time and money
that should have been spend curing disease and feeding the
hungry. Some of the proprietary data I worked with early in my career
is already unreadable on current computers.

Not including Illumina Solexa data sets, the largest sets of data I've
worked with are genomic sequences and BLAST results generated as part
Mike Timko's research in plants. I've also had the privilege of
working on tandem mass spec proteomics data with Brian Balgley.

I have drawn two critical conclusions: 

1) Proprietary data formats are inappropriate for scientific data and
should be universally rejected by scientists, journals, and funding
agencies.

2) SQLite is the ideal format for storing, transporting, and accessing
data. It is open source, portable across all platforms, and performs
well.

Below I'll lay out the background, as well as some recommendations for SQLite in
science and especially bioinformatics.



Proprietary is bad
------------------

The concept that data can only be accessed by certain licensed
software on certain operating systems is anathema to the wide
sharing of science and discovery. Platform "lock-in" is grossly
unethical. 

It also turns out that proprietary formats are also universally
unnecessary. The information technology world does not need another
new format. XML has happily overcome some of this since XML is fairly
easy to both create and manipulate. However. XML is a poor database
solution, and is too bloated for large data sets.



SQL is good
-----------

Over the past few decades we have seen an increase in portable,
standard technologies allowing all programming languages easy access
to SQL databases. Examples are Perl DBI/DBD, ODBC, and
JDBC. Relational databases are a well defined way to store, access,
and update data. In both theory and practice, relational databases are
both robust and scalable. It is well understood how to make SQL
perform well with very large amounts of data. However, SQL has never
been a "file format" because the software architecture is
client-server where the database server is a large, complex, and
somewhat difficult-to-administer tool. The "database" is typically an
entire directory tree of operating system specific binary files.

Database dumps are trivial to create with the usual SQL engines, and
the dumps are fairly portable. The dumps can even be converted from
one database to another. However, database dumps are inconvenient, and
could not be considered a "file format" in the normal sense.



SQLite as format, warehouse, exchange, and tool
-----------------------------------------------

The bulkiness and administrative overhead of a traditional Relational
Database Management System (RDBMS) was overcome many years ago with
the advent of SQLite. SQLite is a fully transactional RDBMS which is a
single, tiny, executable with a single file database. Performance with
small numbers of transactions per second is the same as PostgreSQL,
MySQL or Oracle. I have dropped millions of records into a SQLite
database running on commodity hardware, with the same speed as
PostgreSQL. In some cases, SQLite was faster.

A SQLite database is a single file. SQLite is capable of opening
several databases (each in it's own file), but typically, all
necessary tables are in one database and that one database is in one
file.

Amazingly, that file is portable across Linux, Windows, and OSX. The
database is essentially a platform independent data file. The SQLite
engine is required to read the data, but the engine is very small, and
freely available. SQLite is only 45K (yes, 45,000 bytes). As I
understand the history, SQLite was designed as an embedded database,
and as such had to be very small.

There are SQLite drivers for nearly every programming language. The
drivers are free, which means that application developers can use
SQLite as their "file format" for free, with no licensing. The result
is a "file" which is trivial to parse, standards compliant, and as
nearly future-proof as possible. SQLite has the ability to make an
ASCII text database dump, as an added future-proofing and
portability feature. SQLite will gladly store binary data.

The SQLite executable is also the command line tool. It is only
slightly less sophisticated than PostgreSQL's "psql" command line (which
I argue is the gold standard in RDBMS command lines). The SQLite
engine/command line means that anyone in possession of a SQLite file
has ready access to the data. No other tools are required. (Granted,
that getting real work done is likely to require programming languages
and other packages.) Each SQLite database is self describing, and can
even include comments from the creator.

When sharing data for download, we have the problem of what format to
choose. Ideally, we would make the data available in several
formats. XML, tab separated text, and various open and closed formats
are often used. SQLite is an ideal format for download since the
database is a single file, it is portable to all computers, and once
downloaded the data is trivial to access. No other format is so
readily used as SQLite. Tab separated columns are trivial to parse.
However, if there are any relations in the data, extra programming
code is necessary, and the problem can quickly become
unmanageable. XML is readily portable, but the tools are tricky, the
XML file is not a database, and XML tends to be somewhat
bloated. (Which is easily solved by compression, but that's another
step.)

Common text formats like "fasta" (used for gene and protein sequences)
can be easy to work with, but they have serious limitations. The fasta
header can only be one line, and therefore is very limited in terms of
descriptive fields. Anything beyond a bare minimum description
requires some sort of implicit header format and there are no
standards for that format. I store sequence data in SQL. The
description and other meta data is well formed. A single, very simple
SQL query serves to export the sequence data as fasta. Over the years,
I've downloaded many fasta files. Every one of them had a slightly
different format. While not intractable, this means time is wasted
parsing and fixing formats. Sequence data is a perfect example of data
which should be in SQLite. The database could contain the sequences,
meta data for each sequence where appropriate, and meta data about the
entire collection of sequences. All in a beautiful, normalized, all in
a form that is easy to export to fasta, or process with various tools.




Pros of SQLite
--------------

- open source, standards compliant

- single file, portable across operating systems

- tiny, single executable RDBMS with true transactions

- standard, full featured SQL

- easy import and export

- zero configuration, serverless, self-contained (single executable
  binary, single file database)

- drivers for nearly all programming languages and for many packages
  (I haven't used it, but an example is the R interface to SQLite.)

- available for an extremely wide variety of operating systems and
  hardware platforms from the largest servers to tiny, embedded
  systems.

- ability to store any scientific data

- schema is inherent to the database, so the "file" is largely
  self-documenting

- very good performance

- can handle very large data sets

- SQL is standard and well understood

- relational data modeling is arguably the most robust data model 

- extensive API for embedded use

- existing software is easily modified to use SQLite as an optional
  file format



Cons of SQLite
--------------

- single executable is somewhat unsuited to distributed systems; the
  SQLite engine has no ability to accept network connections,
  therefore networked situations require additional middle ware

- lacks a suite of enterprise tools, although standard SQL tools and
  administrative practices apply. 

- multi-user access is handled by record and table locking, which
  limits the number simultaneous users

- some of the more sophisticated SQL features are not available

- no administrative features; relies on the operating system for everything