
I have a large (>100) list of accession numbers that I want to look up in NCBI (nucleotide), mainly to get a tentative organism for each accession number.

For example:

KJ841938.1 would match to Setoptus koraiensis
...
FJ911852.1 would match to Uncultured eukaryote
...

I googled for tools and found this site. However, it is not really what I want: it doesn't list the results in the same order as my input list, so I cannot match them up.

I also attempted to write a script in Biopython using the Entrez E-utilities, but was unsuccessful due to my lack of coding skill.

Does anyone have any way I can go about this?

EDIT: Via this tutorial, I attempted to use this code sample:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch(db="nucleotide", id="AY851612", rettype="gb", retmode="text")
print(handle.readline().strip())
# expected output : LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
handle.close()

I modified it to loop over a list of IDs instead of passing a single id, as below:

import Bio
print (Bio.__version__)
from Bio import Entrez
import time


Entrez.email = "Your.Name.Here@example.org"
id_list = ["KJ841938.1", "FJ911852.1"] # real list is about 500 elements

x = 0
while  x < len(id_list):
    handle = Entrez.efetch(db="nucleotide", id=id_list[x], rettype= "uilist", retmode="text")
    #print(handle.readline().strip())
    print(handle.readline())
    handle.close()
    x = x + 1

Output is:

1.69
673539906

283462561

However, I do not believe I am using the right "rettype" parameter in the .efetch function, as I keep getting GI numbers, whereas I was hoping to get something like a species name directly. Alternatively, could I search with these GI numbers in batch, with more code or a tool, to produce an ordered list?

theforestecologist
Ro Siv
  • Can you post the BioPython code that you have tried and that didn't work? What specific error did you get? Have you modified the sample code from the [tutorial](http://biopython.org/DIST/docs/api/Bio.Entrez-module.html)? – BioGeek Nov 19 '17 at 13:27
  • Also, accession number `KJ841938.1` matches [`Setoptus koraiensis`](https://www.ncbi.nlm.nih.gov/nuccore/KJ841938.1), not `Gaeolaelaps aculeifer`. So please update your question with correct sample input and expected results. – BioGeek Nov 19 '17 at 13:34
  • @BioGeek I didn't know about that specific tutorial. I will try the example under "efetch" and see if it works. Thanks. – Ro Siv Nov 19 '17 at 16:43
  • In your first attempt, when you use `handle.readline()`, you get only the first line of the GenBank-formatted record. To get a list of all lines, you could use `handle.readlines()`. You might have gotten somewhere with this approach, but it appears that Biopython provides a more convenient way to parse the results (see my answer). – bli Nov 20 '17 at 13:00

2 Answers


After some trial and error in an interactive Python shell and some checking of the documentation, I found that the relevant information is present in the GenBank-formatted output (rettype="gb"), and that it can be parsed using Entrez.read provided it is returned in XML mode (retmode="xml").

The following code seems to work:

#!/usr/bin/env python3

from Bio import Entrez


Entrez.email = "Your.Name.Here@example.org"
id_list = ["KJ841938.1", "FJ911852.1"] # real list is about 500 elements

for accession in id_list:
    handle = Entrez.efetch(
        db="nucleotide", id=accession, rettype="gb", retmode="xml")
    gb_record = Entrez.read(handle)
    handle.close()
    organism = gb_record[0]['GBSeq_organism']
    print("{}\t{}".format(accession, organism))

(You don't need a while loop here, by the way: Python has the more convenient `for` syntax to loop over the elements of a list.)

Output:

KJ841938.1  Setoptus koraiensis
FJ911852.1  Gaeolaelaps aculeifer
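
For the real list of about 500 elements, one request per accession is slow, so a possible variation is to fetch several records per efetch call and then map the results back onto the input order. The following is only a sketch under some assumptions: the helper names (`chunked`, `fetch_records`, `organisms_in_input_order`, `lookup`) and the batch size of 100 are made up for illustration, the input accessions are assumed to carry their version suffix (e.g. `.1`), and the `GBSeq_accession-version` key is my reading of the GenBank XML fields, so it is worth checking against a real record first.

```python
def chunked(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_records(accessions):
    """Fetch the GenBank records for several accessions with one efetch call."""
    # Imported here so the pure helpers in this sketch work without Biopython.
    from Bio import Entrez

    Entrez.email = "Your.Name.Here@example.org"
    handle = Entrez.efetch(
        db="nucleotide", id=",".join(accessions), rettype="gb", retmode="xml")
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def organisms_in_input_order(id_list, records):
    """Pair each input accession with its organism, keeping the input order."""
    # Assumption: each parsed record exposes these two GBSeq keys.
    by_accession = {
        rec["GBSeq_accession-version"]: rec["GBSeq_organism"] for rec in records
    }
    return [(acc, by_accession.get(acc, "NOT FOUND")) for acc in id_list]


def lookup(id_list, batch_size=100, fetch=fetch_records):
    """Fetch records in batches and return (accession, organism) pairs."""
    records = []
    for batch in chunked(id_list, batch_size):
        records.extend(fetch(batch))
    return organisms_in_input_order(id_list, records)
```

Calling `lookup(id_list)` and printing each pair with `"{}\t{}".format(accession, organism)` should give the same two-column output as above. Looking organisms up by accession, rather than relying on the position of each record in the response, means the result follows the input list even if the server returns records in a different order.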
bli

If you don't want to bother with Biopython, you can use Entrez Direct (EDirect) for this, as follows:

$ cat temp.txt
KJ841938.1
FJ911852.1
$ epost -db nuccore -input temp.txt \
    | esummary \
    | xtract -pattern DocumentSummary -element AccessionVersion,Organism
KJ841938.1  Setoptus koraiensis
FJ911852.1  Gaeolaelaps aculeifer
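
If the rows come back in a different order from temp.txt for a longer list, they can be put back into input order afterwards. Here is a minimal Python sketch, assuming the pipeline output was redirected to a tab-separated file; the function name `reorder_tsv` and the file name `organisms.tsv` in the usage note are hypothetical.

```python
def reorder_tsv(accessions, tsv_lines):
    """Return (accession, organism) pairs in the order given by `accessions`.

    `tsv_lines` is any iterable of "accession<TAB>organism" lines, such as
    an open file. Accessions missing from the table get an empty organism.
    """
    found = {}
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            found[fields[0]] = fields[1]
    return [(acc, found.get(acc, "")) for acc in accessions]
```

For example, read the accessions with `[line.strip() for line in open("temp.txt") if line.strip()]`, pass them together with `open("organisms.tsv")` to `reorder_tsv`, and print each pair joined by a tab.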
vkkodali