I've struggled in the past with designing databases around gene identifiers that evolve over time. For example, UniProt and Ensembl change their identifiers for certain proteins/genes, and the IDs go stale after a while (years). Would HGNC be better for this? What I'd like is a single unique identifier for each human gene that will remain stable for many years. I know genes get reclassified occasionally, but you shouldn't end up with a new identifier for the same locus.

Suggestions?

Thanks

roadnottaken

1 Answer

Keeping IDs consistent is a common problem. I've never seen a reasonable solution for it and don't believe one exists. However, there are a few workarounds.

  1. You can create an extra table of aliases that maps later renamings back to the original IDs. Pros: it is flexible. Cons: maintaining it is exhausting, and you need to find a way to explain to users why they enter one ID and get back another.
  2. The other option is fairly obvious: if you'd like to have "a single unique identifier for each human gene that will remain stable for many years", you can take a snapshot of the current state of any database you like and leave it untouched for many years. You can then publish versioned releases of your own database and state which version of the reference database each one uses, e.g. "Glioblastoma_drug_perturbation_2016_(GRCh38_ENS87)".
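
To make the two options a bit more concrete, here is a rough sketch using Python's built-in sqlite3 module. Everything in it is an assumption made for illustration: the table names (gene, alias), the helper names (build_db, resolve, release_label) and the placeholder IDs (GENE0001, OLD_ID_A, OLD_ID_B) are hypothetical and do not come from any real database.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a gene table and an alias table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gene (
            stable_id TEXT PRIMARY KEY,  -- the ID you froze in your snapshot
            symbol    TEXT
        );
        CREATE TABLE alias (
            alias_id  TEXT PRIMARY KEY,  -- any other ID users may type in
            stable_id TEXT NOT NULL REFERENCES gene(stable_id)
        );
        """
    )
    # Placeholder rows; these IDs are made up for the example.
    conn.execute("INSERT INTO gene VALUES (?, ?)", ("GENE0001", "EXAMPLE1"))
    conn.executemany(
        "INSERT INTO alias VALUES (?, ?)",
        [("OLD_ID_A", "GENE0001"), ("OLD_ID_B", "GENE0001")],
    )
    return conn


def resolve(conn, identifier):
    """Return (stable_id, was_alias); stable_id is None if the ID is unknown."""
    row = conn.execute(
        "SELECT stable_id FROM gene WHERE stable_id = ?", (identifier,)
    ).fetchone()
    if row:
        return row[0], False
    row = conn.execute(
        "SELECT stable_id FROM alias WHERE alias_id = ?", (identifier,)
    ).fetchone()
    if row:
        return row[0], True
    return None, False


def release_label(name, year, assembly, reference_release):
    """Build a release name that records the reference versions used."""
    return f"{name}_{year}_({assembly}_{reference_release})"


if __name__ == "__main__":
    conn = build_db()
    print(resolve(conn, "OLD_ID_A"))
    print(release_label("Glioblastoma_drug_perturbation", 2016, "GRCh38", "ENS87"))
```

Returning the was_alias flag alongside the stable ID is one possible way to deal with the "why did I get a different ID back" problem from point 1: the interface can tell the user that their input was mapped to another identifier.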
Maxim Kuleshov