How to search NCBI in bulk for a list of accession numbers?

Question

I have a large list (>100) of accession numbers that I want to look up in NCBI (nucleotide database), mainly to get a tentative organism for each accession number.

ex:

KJ841938.1 would match to  Setoptus koraiensis
...
FJ911852.1 would match to  Uncultured eukaryote
...

I googled for tools, and I found this site. However, it is not what I really want, since it doesn't list the results in the same order as my list, which means I cannot match them back to my accession numbers.

I also attempted to write a script in biopython using Entrez E-tools, but was unsuccessful due to a lack of coding skill.

Does anyone know how I can go about this?

EDIT: Via this tutorial, I attempted to use this code sample:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
print(handle.readline().strip())
# expected output : LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
handle.close()

But modified to take any list instead of the variable id, as below:

import Bio
print (Bio.__version__)
from Bio import Entrez
import time


Entrez.email = "Your.Name.Here@example.org"
id_list = ["KJ841938.1", "FJ911852.1"] # real list is about 500 elements

x = 0
while  x < len(id_list):
    handle = Entrez.efetch(db="nucleotide", id=id_list[x], rettype= "uilist", retmode="text")
    #print(handle.readline().strip())
    print(handle.readline())
    handle.close()
    x = x + 1

Output is:

However, I do not believe I am using the right "rettype" parameter in the .efetch function, as I keep getting GI numbers, whereas I was hoping to get something like a species name directly. Or could I then search with these GI numbers in batch, with more code or a tool, to produce an ordered list?

Can you post the BioPython code that you have tried and that didn't work? What specific error did you get? Have you modified the sample code from the [tutorial](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)? — BioGeek, Nov 19 '17 at 13:27
Also, Accession number `KJ841938.1` matches [`Setoptus koraiensis`](https://www.ncbi.nlm.nih.gov/nuccore/KJ841938.1), not `Gaeolaelaps aculeifer`. So please update your question with correct sample input and expected results. — BioGeek, Nov 19 '17 at 13:34
@BioGeek I didn't know about that specific tutorial. I will try the example under "efetch" and see if it works. Thanks. — Ro Siv, Nov 19 '17 at 16:43
In your first attempt, when you use `handle.readline()`, you get only the first line from the genbank-formatted record. To get a list of all lines, you could use `handle.readlines()`. You might have gotten somewhere by this approach, but it appears that Biopython provides a way to parse the results more conveniently (see my answer). — bli, Nov 20 '17 at 13:00

score 1 · Answer 1 · answered Nov 20 '17 at 12:55

After some trial and error in an interactive python shell and some documentation checking, I found that the relevant information is present in the genbank-formatted output (rettype="gb"), and this can be parsed using Entrez.read provided it is returned in "xml" mode (retmode="xml").

The following code seems to work:

#!/usr/bin/env python3

from Bio import Entrez


Entrez.email = "Your.Name.Here@example.org"
id_list = ["KJ841938.1", "FJ911852.1"] # real list is about 500 elements

for accession in id_list:
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gb", retmode="xml")
    gb_record = Entrez.read(handle)
    handle.close()
    organism = gb_record[0]['GBSeq_organism']
    print("{}\t{}".format(accession, organism))

(You don't need a while loop here, by the way: Python has the more convenient for syntax to loop over the elements of a list.)

Output:

KJ841938.1  Setoptus koraiensis
FJ911852.1  Gaeolaelaps aculeifer
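
With a list of about 500 accessions, one request per accession may be slow. A possible variation, as a rough sketch only: send the accessions in comma-separated batches and use the 'GBSeq_accession-version' field of each returned record to match it back to the input, so the output follows the order of the list. The function names and the batch size of 100 are my own choices for illustration, and I am assuming every accession in the list includes its version suffix (e.g. ".1").

```python
def chunks(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def organisms_by_accession(records):
    """Map accession.version -> organism for parsed GBSeq records."""
    return {
        rec["GBSeq_accession-version"]: rec["GBSeq_organism"]
        for rec in records
    }


def fetch_organisms(id_list, email, batch_size=100):
    """Return (accession, organism) pairs in the same order as id_list."""
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    found = {}
    for batch in chunks(id_list, batch_size):
        # One request per batch instead of one request per accession.
        handle = Entrez.efetch(
            db="nucleotide", id=",".join(batch), rettype="gb", retmode="xml")
        found.update(organisms_by_accession(Entrez.read(handle)))
        handle.close()
    # Look up each input accession so the output follows the input order;
    # accessions that were not returned are reported as None.
    return [(acc, found.get(acc)) for acc in id_list]
```

Calling fetch_organisms(id_list, "Your.Name.Here@example.org") and printing each pair with "{}\t{}".format(accession, organism) should give the same kind of table as above. An accession that comes back as None is worth checking by hand.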

score 1 · Answer 2 · answered Dec 16 '18 at 17:06

If you don't want to bother with BioPython, you can use Entrez Direct for this as follows:

$ cat temp.txt
KJ841938.1
FJ911852.1
$ epost -db nuccore -input temp.txt \
    | esummary \
    | xtract -pattern DocumentSummary -element AccessionVersion,Organism
KJ841938.1  Setoptus koraiensis
FJ911852.1  Gaeolaelaps aculeifer
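
I would not rely on the rows coming back in the same order as temp.txt. If the order matters, the table can be reordered afterwards. Below is a minimal Python sketch of that step, assuming the xtract output was saved as tab-separated text; the function name reorder_rows and the "NOT_FOUND" placeholder are made up for this example.

```python
def reorder_rows(accessions, tsv_lines):
    """Return (accession, organism) pairs following the order of `accessions`.

    `tsv_lines` are lines of the form "accession<TAB>organism".
    """
    organism_for = {}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        accession, _, organism = line.partition("\t")
        organism_for[accession] = organism
    # Walk the original list so the result keeps its order; accessions
    # missing from the table get a visible placeholder.
    return [(acc, organism_for.get(acc, "NOT_FOUND")) for acc in accessions]
```

To use it, read the accessions from temp.txt (one per line, stripped of whitespace) and the saved table line by line, then print each returned pair joined by a tab.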
