Defindit Docs and Howto Home

This page last modified: May 09 2008
keywords:blastall,blast,ncbi,search,gene,nucleotide,peptide,protein,fasta,perl,sql,bioinformatics
description:The NCBI formatdb silently changes accession numbers if the numbers don't conform to a certain convention.
title:Fasta files converted to NCBI BLAST search libraries with formatdb.


Table of contents
-----------------
Introduction
Using non-NCBI database identifier syntax
A less desirable solution
An even worse example



Introduction
------------


Documentation for formatdb is in the blastall directory tree on your
server. You can also find it online. There are also extra notes online.

ftp://ftp.ncbi.nih.gov/blast/documents/formatdb.html

http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/formatdb_fastacmd.html


Using non-NCBI database identifier syntax
-----------------------------------------

Can formatdb not change the database name lcl? Yes, but you must use
one of the "database names" known to NCBI blast. You must also stricly
adhere to the format of the identifier syntax. Below are some examples
of what happens when you don't follow the rules, and examples of what
to do instead. The URLs above explain the fundamentals, but don't
mention that formatdb will modify your identifiers unless you follow
certain rules.

I'll call a blast database a "search library" since the word database
is most often taken to mean a "SQL database". Also, the word
"database" has too many diverse meanings. Even worse, the query
sequences are generally also in a "database" thus further confusing
"search" and "query". 

When you do a BLAST search you usually need to an NCBI formatted
search library. (Newer versions of blast can use a simple fasta file
as the search library.) We generally do large number of searches via a
pipeline, and because we want the searches to be performed
efficiently, we use NCBI formatted binary search libraries.

The problem arises when you have a fasta file with a header like that
found in the medicago protein library. (The "header" being the first
line of the fasta file.) Here is a sample header entry before modification:

>IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103
mqdifgsvrrslvfraspennedhhhsigvggtttlvdkigycirnsrvfskpsppspsppipppirwrkgeligcgafghvyvgmnldsgellavkqvliaassaskekaqahvke
leeevkllkdlshpnivrylgtvreedtlnillefvpggsissllgkfgafpeavirtyteqillgleylhkngimhrdikganilvdnkgcikladfgaskqvvelatmsgaksmk
gtpywmapevilqtghsfsadiwsvgctviematgkppwsqqyqqevaalfhigttkshppipdhlssgakdfllkclqkepilrlsasellqgsfgasspsscapnvesllcstvn
pqnsgnnhlwgmnndddmcviddneefsecdikykslmpdniesfnpmsdpsddwgckfdaspelenrevsldtdegymsrvqlesnkeqkdlpipcvpslseedeeltevkirafl
dekalelkklqtplyeeffnslnascsptvidspsddisgrkylklppkskspsrvpnsspskavdnagspgsngrstvgnvdshgsqdvpaspinerkgmtvdsqqepfspslsfs
erqrkwkeeldqelerkremmrqanmagktsspkdralhrqrertrfasps*


We are interested in the first portion of the header:

>IMGA|AC135319_42.2 Protein kinase

It turns out that formatdb does not "like" this header because IMGA is
know a known NCBI database. If you run formatdb, to create a binary
NCBI search library, formatdb will create a new accession number. Even
worse, if you use -o T to create indexes, formatdb will delete your
original accession number. 

The workaround is to use gnl| or lcl| as your database name. NCBI then
preserves your identifier, and is capable of parsing out the accession
number. In the following terminal session, I created a fasta file with
a single sequence from the medicago proteins fasta file. I then ran
formatdb, and finally I use fastacmd to dump the sequences as they appear
inside the binary search library. Here we see that
>gnl|IMGA|AC135319_42.2 has been preserved. 

We recoment that you use gnl| or lcl| depending on the format of your
identifier. 

[twl8n@xi00 medicago_protein_20080108]$ cat tmp.fa
>gnl|IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103
mqdifgsvrrslvfraspennedhhhsigvggtttlvdkigycirnsrvfskpsppspsppipppirwrkgeligcgafghvyvgmnldsgellavkqvliaassaskekaqahvke
leeevkllkdlshpnivrylgtvreedtlnillefvpggsissllgkfgafpeavirtyteqillgleylhkngimhrdikganilvdnkgcikladfgaskqvvelatmsgaksmk
gtpywmapevilqtghsfsadiwsvgctviematgkppwsqqyqqevaalfhigttkshppipdhlssgakdfllkclqkepilrlsasellqgsfgasspsscapnvesllcstvn
pqnsgnnhlwgmnndddmcviddneefsecdikykslmpdniesfnpmsdpsddwgckfdaspelenrevsldtdegymsrvqlesnkeqkdlpipcvpslseedeeltevkirafl
dekalelkklqtplyeeffnslnascsptvidspsddisgrkylklppkskspsrvpnsspskavdnagspgsngrstvgnvdshgsqdvpaspinerkgmtvdsqqepfspslsfs
erqrkwkeeldqelerkremmrqanmagktsspkdralhrqrertrfasps*
[twl8n@xi00 medicago_protein_20080108]$ /xi.shared/blast-2.2.18/bin/formatdb -V -p T -o T -i tmp.fa
[twl8n@xi00 medicago_protein_20080108]$ cat formatdb.log

========================[ May 9, 2008 11:46 AM ]========================
Version 2.2.18 [Mar-02-2008]
Started database file "tmp.fa"
Formatted 1 sequences in volume 0
SUCCESS: formatted database tmp.fa
[twl8n@xi00 medicago_protein_20080108]$ /xi.shared/blast-2.2.18/bin/fastacmd -p T -d tmp.fa -D1
>gnl|IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103
MQDIFGSVRRSLVFRASPENNEDHHHSIGVGGTTTLVDKIGYCIRNSRVFSKPSPPSPSPPIPPPIRWRKGELIGCGAFG
HVYVGMNLDSGELLAVKQVLIAASSASKEKAQAHVKELEEEVKLLKDLSHPNIVRYLGTVREEDTLNILLEFVPGGSISS
LLGKFGAFPEAVIRTYTEQILLGLEYLHKNGIMHRDIKGANILVDNKGCIKLADFGASKQVVELATMSGAKSMKGTPYWM
APEVILQTGHSFSADIWSVGCTVIEMATGKPPWSQQYQQEVAALFHIGTTKSHPPIPDHLSSGAKDFLLKCLQKEPILRL
SASELLQGSFGASSPSSCAPNVESLLCSTVNPQNSGNNHLWGMNNDDDMCVIDDNEEFSECDIKYKSLMPDNIESFNPMS
DPSDDWGCKFDASPELENREVSLDTDEGYMSRVQLESNKEQKDLPIPCVPSLSEEDEELTEVKIRAFLDEKALELKKLQT
PLYEEFFNSLNASCSPTVIDSPSDDISGRKYLKLPPKSKSPSRVPNSSPSKAVDNAGSPGSNGRSTVGNVDSHGSQDVPA
SPINERKGMTVDSQQEPFSPSLSFSERQRKWKEELDQELERKREMMRQANMAGKTSSPKDRALHRQRERTRFASPS
[twl8n@xi00 medicago_protein_20080108]$




A less desirable solution
--------------------------

The solution below is somewhat workable, but we do not
recommend it, and it is presented here only in the interest of
completeness.

It is possible to create a new accession number where the pipe "|" is
replaced with an underscore "_". After being processed by formatdb,
this results in identifiers in the binary search library that is
prefixed with "lcl|" which seems to mean "local". 

There are several negative consequences of not following
standards. The most important is that the Hit_accession reported in
the blast results has the prefix IMGA_ which would have to be trimmed
off in order to get the real database name and accession number.

Here is a session transcript:

[twl8n@xi00 medicago_protein_20080108]$ cat tmp.fa
>IMGA_AC135319_42.2 IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103
mqdifgsvrrslvfraspennedhhhsigvggtttlvdkigycirnsrvfskpsppspsppipppirwrkgeligcgafghvyvgmnldsgellavkqvliaassaskekaqahvke
leeevkllkdlshpnivrylgtvreedtlnillefvpggsissllgkfgafpeavirtyteqillgleylhkngimhrdikganilvdnkgcikladfgaskqvvelatmsgaksmk
gtpywmapevilqtghsfsadiwsvgctviematgkppwsqqyqqevaalfhigttkshppipdhlssgakdfllkclqkepilrlsasellqgsfgasspsscapnvesllcstvn
pqnsgnnhlwgmnndddmcviddneefsecdikykslmpdniesfnpmsdpsddwgckfdaspelenrevsldtdegymsrvqlesnkeqkdlpipcvpslseedeeltevkirafl
dekalelkklqtplyeeffnslnascsptvidspsddisgrkylklppkskspsrvpnsspskavdnagspgsngrstvgnvdshgsqdvpaspinerkgmtvdsqqepfspslsfs
erqrkwkeeldqelerkremmrqanmagktsspkdralhrqrertrfasps*
[twl8n@xi00 medicago_protein_20080108]$ /xi.shared/blast-2.2.18/bin/formatdb -V -p T -o T -i tmp.fa
[twl8n@xi00 medicago_protein_20080108]$ cat formatdb.log

========================[ May 9, 2008 11:51 AM ]========================
Version 2.2.18 [Mar-02-2008]
Started database file "tmp.fa"
Formatted 1 sequences in volume 0
SUCCESS: formatted database tmp.fa
[twl8n@xi00 medicago_protein_20080108]$ /xi.shared/blast-2.2.18/bin/fastacmd -p T -d tmp.fa -D1
>lcl|IMGA_AC135319_42.2 IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103
MQDIFGSVRRSLVFRASPENNEDHHHSIGVGGTTTLVDKIGYCIRNSRVFSKPSPPSPSPPIPPPIRWRKGELIGCGAFG
HVYVGMNLDSGELLAVKQVLIAASSASKEKAQAHVKELEEEVKLLKDLSHPNIVRYLGTVREEDTLNILLEFVPGGSISS
LLGKFGAFPEAVIRTYTEQILLGLEYLHKNGIMHRDIKGANILVDNKGCIKLADFGASKQVVELATMSGAKSMKGTPYWM
APEVILQTGHSFSADIWSVGCTVIEMATGKPPWSQQYQQEVAALFHIGTTKSHPPIPDHLSSGAKDFLLKCLQKEPILRL
SASELLQGSFGASSPSSCAPNVESLLCSTVNPQNSGNNHLWGMNNDDDMCVIDDNEEFSECDIKYKSLMPDNIESFNPMS
DPSDDWGCKFDASPELENREVSLDTDEGYMSRVQLESNKEQKDLPIPCVPSLSEEDEELTEVKIRAFLDEKALELKKLQT
PLYEEFFNSLNASCSPTVIDSPSDDISGRKYLKLPPKSKSPSRVPNSSPSKAVDNAGSPGSNGRSTVGNVDSHGSQDVPA
SPINERKGMTVDSQQEPFSPSLSFSERQRKWKEELDQELERKREMMRQANMAGKTSSPKDRALHRQRERTRFASPS
[twl8n@xi00 medicago_protein_20080108]$


As it appears in blast results:
          <Hit_num>1</Hit_num>
          <Hit_id>lcl|IMGA_AC135319_42.2</Hit_id>
          <Hit_def>IMGA|AC135319_42.2 Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103</Hit_def>
          <Hit_accession>IMGA_AC135319_42.2</Hit_accession>
          <Hit_len>636</Hit_len>
 




An even worse example
---------------------

If your identifier includes a database name unknown to NCBI, such as
"IMGA", then formatdb will create a new identifier. This new
identifier is only useful with the binary search library. This is the
worst way to handle data.

As it appears in blast results:
          <Hit_num>1</Hit_num>
          <Hit_id>lcl|16480_20080103_imgag_prot.fa</Hit_id>
          <Hit_def>Protein kinase - null AC135319.28 87081-86445 E EGN_Mt071002 20080103</Hit_def>
          <Hit_accession>16480_20080103_imgag_prot.fa</Hit_accession>
          <Hit_len>636</Hit_len>