2016 bd2k bgood_wikidata

Transcript
  • 1. Wikidata for biomedical knowledge integration and curation Benjamin Good The Scripps Research Institute @bgood bgood@scripps.edu
  • 2. “knowledge” • A lot • Important • Text
  • 3. What are the functions of Fibronectin? 37186 articles What are the functions of the 238 ‘significant’ genes that came up in my high throughput screen??
  • 4. What are the functions of Fibronectin? 37186 articles …
      Gene | Property | Value
      Fibronectin | Biological Process | Angiogenesis
      Fibronectin | Cellular Localization | Extracellular matrix
      Fibronectin | Related Disease | Glomerulopathy
      “knowledge integration” “curation” “knowledge base” Answers
  • 5. Knowledge Bases: 1,500+ listed at http://www.oxfordjournals.org/nar/database/a/
  • 6. Applications of knowledge bases • Find information • Plan research • “Known unknowns?” • Interpret data • Gene Ontology Enrichment Analysis
  • 7. Interesting Gene List Gene Ontology, Pathway, Network interpretation
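The enrichment analysis named on slide 6 is commonly run as a one-sided hypergeometric test: given a gene list, how surprising is it that so many of its genes carry a particular annotation? The sketch below is one minimal way to compute that, not necessarily the method used in the talk. The function name and the example counts (other than the 238-gene list size from slide 3) are made up for illustration.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric tail: P(at least `hits` annotated genes in the sample).

    population: number of genes in the background set
    annotated:  number of background genes carrying the term
    sample:     size of the interesting gene list
    hits:       number of list genes carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 annotated with the term,
# a 238-gene list, 10 of which carry the term.
p = enrichment_p_value(population=20000, annotated=200, sample=238, hits=10)
print(f"p = {p:.3g}")
```

Because `hits` and `annotated` come straight from the knowledge base, an annotation that is missing there never enters the calculation, so the corresponding term cannot come out as enriched.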
  • 8. Knowledge bases are important tools and will only grow more important over time
  • 9. Great!
  • 10. BUT
  • 11. 1. Knowledge bases are not complete 2. Will get to later...
  • 12. Annotation missing from human GO annotation. Should be here! (‘5 HT Receptor’ means ‘Serotonin Receptor’) Circa 2010
  • 13. Added to GO Jan. 2016. First characterized 1996 (Kohen et al., J Neurochem)
  • 14. Interesting Gene List Gene Ontology, Pathway, Network interpretation
  • 15. We don’t know what we are missing. Interesting Gene List: inflammatory response, defense response, response to wounding, immune response, Serotonin receptor activity??
  • 16. “Gene Ontology, it’s great, right?” • “It sucks” • “I only use it out of desperation”
  • 17. WHY?!
  • 18. Process of building knowledge bases 1. Do science 2. Publish it 3. Manually extract the knowledge
      Gene | Property | Value
      Fibronectin | Biological Process | Angiogenesis
      Fibronectin | Cellular Localization | Extracellular matrix
      Fibronectin | Related Disease | Glomerulopathy
  • 19. why does he look so down?
  • 20. Many scientists, powerful tools, comparatively little reward for curating knowledge 100’s of thousands 100’s
  • 21. More than 2 articles published/minute
  • 22. Professional biocuration does not scale up to the rate of production 1. Do science 2. Publish it 3. Manually extract the knowledge
      Gene | Property | Value
      Fibronectin | Biological Process | Angiogenesis
      Fibronectin | Cellular Localization | Extracellular matrix
      Fibronectin | Related Disease | Glomerulopathy
  • 23. 1. Knowledge bases are not complete 2. Knowledge needs integration
  • 24. Knowledge is scattered, integration brings it together
  • 25. Merging knowledge bases: the language barrier. “Methadone”; Interacts with: “Moxifloxacin”; May treat: Opioid-Related Disorders; ID: N0000000174; ID: 4095; Molecular Weight: 309.44518 g/mol; …; ID: DB00333; Manufactured by: Roxane laboratories inc (= ? between the records)
  • 26. Good for business, bad for science Google Scholar search shows 469 papers about “identifier mapping” in bioinformatics
  • 27. What can we do?
  • 28. Global Knowledge Platform What would happen if everyone was literally working on the same database? 1. Split up work more effectively 2. Make integration the default behavior
  • 29. Wikidata is to data as Wikipedia is to text. “Giving more people more access to more knowledge.” A free and open repository of knowledge, managed by the Wikimedia Foundation, which operates Wikipedia
  • 30. It’s a knowledge base! • Anyone can edit • Anyone can use
  • 31. Item: Q84
  • 32. Item: Q414043 (RELN), https://www.wikidata.org/wiki/Q414043. One Statement: Claim (Property: Genomic start; Value (numeric): 103471784); Qualifiers (GenLoc assembly: GRCh38); References (Stated in: Ensembl Release 83; Retrieved: 19 January 2016)
  • 33. Item: Q414043 (RELN), https://www.wikidata.org/wiki/Q414043. One Statement: Claim (Property: Encodes; Value (item): Reelin (protein)); References (Stated in: NCBI homo sapiens annotation release 107; Retrieved: 19 January 2016)
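The anatomy shown on slides 32 and 33 (a claim made of a property and a value, plus qualifiers and references) can be sketched as plain data classes. This is a simplified illustration that follows the slide labels, not the actual Wikibase schema: the class and method names are hypothetical, and real Wikidata identifies properties by P-numbers and item values by Q-numbers rather than by label.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    prop: str      # property label, e.g. "encodes"
    value: object  # a number, a string, or a reference to another item


@dataclass
class Statement:
    claim: Claim
    qualifiers: dict = field(default_factory=dict)
    references: dict = field(default_factory=dict)


@dataclass
class Item:
    qid: str
    label: str
    statements: list = field(default_factory=list)

    def values(self, prop: str) -> list:
        """Return every value claimed for the given property."""
        return [s.claim.value for s in self.statements if s.claim.prop == prop]


# The two RELN statements from the slides.
reln = Item("Q414043", "RELN")
reln.statements.append(Statement(
    Claim("genomic start", 103471784),
    qualifiers={"GenLoc assembly": "GRCh38"},
    references={"stated in": "Ensembl Release 83", "retrieved": "19 January 2016"},
))
reln.statements.append(Statement(
    Claim("encodes", "Reelin (protein)"),
    references={"stated in": "NCBI homo sapiens annotation release 107",
                "retrieved": "19 January 2016"},
))

print(reln.values("encodes"))
```

Keeping the references on each statement, rather than on the item, is what lets every individual fact carry its own provenance.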
  • 34. A Giant Global Graph These statements link together into a queryable graph https://query.wikidata.org
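As a rough illustration of what "queryable" means here, the sketch below asks the public SPARQL endpoint behind https://query.wikidata.org which protein the RELN item (Q414043) encodes. The property ID `P688` is assumed to be Wikidata's "encodes" property and should be checked against the live site. The helper names and the User-Agent string are made up, and nothing is sent over the network unless `run_query` is called.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P688 is the "encodes" property; Q414043 is the RELN item from slide 32.
QUERY = """
SELECT ?gene ?geneLabel ?protein ?proteinLabel WHERE {
  VALUES ?gene { wd:Q414043 }
  ?gene wdt:P688 ?protein .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def build_url(query: str) -> str:
    """Encode a SPARQL query as a GET request URL that asks for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})


def run_query(query: str) -> list:
    """Send the query and return the result rows (requires network access)."""
    request = urllib.request.Request(
        build_url(query),
        headers={"User-Agent": "wikidata-slides-example/0.1"},  # placeholder value
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return data["results"]["bindings"]


if __name__ == "__main__":
    print(build_url(QUERY))
    # rows = run_query(QUERY)  # uncomment to hit the live endpoint
```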
  • 35. We are seeding it with biomedical data • All human, mouse genes and proteins • All Gene Ontology terms • All FDA approved drugs • 9,000+ human diseases Burgstaller et al (2016) Database (preprint in BioRxiv) Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
  • 36. Our seeds are largely concepts linked to many identifier systems N identifiers per item • Genes: 8 • Drugs: 18 • Diseases: 11 Burgstaller et al (2016) Database (preprint in BioRxiv) Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv) Facilitate integration with key external knowledge bases
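To see why several identifiers per item help integration, consider a minimal crosswalk: once one item holds the IDs used by different databases, translating between those systems becomes a lookup. The sketch reuses the three methadone identifiers from slide 25. The slide text does not say which database issued which ID, so the system names `source_a`, `source_b`, and `source_c` are placeholders, and the function names are hypothetical.

```python
from typing import Optional

# One item, many identifier systems (system names are placeholders).
ITEMS = {
    "methadone": {
        "source_a": "N0000000174",
        "source_b": "4095",
        "source_c": "DB00333",
    },
}


def build_index(items: dict) -> dict:
    """Map every (system, identifier) pair back to the item that holds it."""
    return {
        (system, identifier): name
        for name, ids in items.items()
        for system, identifier in ids.items()
    }


def translate(items: dict, index: dict, system: str, identifier: str,
              target: str) -> Optional[str]:
    """Translate an identifier from one system to another via the shared item."""
    name = index.get((system, identifier))
    if name is None:
        return None
    return items[name].get(target)


INDEX = build_index(ITEMS)
print(translate(ITEMS, INDEX, "source_c", "DB00333", "source_b"))
```

With a shared item as the hub, N identifier systems need N links per concept instead of a separate mapping table for every pair of systems.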
  • 37. Nurturing a multi-community garden of biomedical knowledge: Gene, Drug, Disease
  • 38. A Platform for knowledge integration and curation: Open data, Wikipedia(s), Your Apps Here!
  • 39. Application #1 (of many) Burgstaller et al (2016) Database (preprint in BioRxiv)
  • 40. Impact of Wikidata on Wikipedia
      Gene Wiki Version 1 (1 of these for every gene):
      {{GNF_Protein_box | Name = Reelin | image = | image_source = | PDB = {{PDB2|4AD9}}
       | HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83
       | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349
       | GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image =
       | Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}}
       | Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}}
       | Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}}
       | Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111
       | Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82
       | Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356
       | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3
       | path = PBB/51110}}
      Gene Wiki Version 2:
      {{Infobox gene}}
      • All data in Wikidata • 1 Lua script works for all genes
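The shift from Version 1 to Version 2 amounts to replacing a hand-filled template per gene with one rendering routine that reads structured data. The sketch below illustrates that idea in Python as a stand-in for the Lua script mentioned on the slide. It is not the actual Infobox gene module: the field names and the function are hypothetical, and the record is a hard-coded dictionary (with values copied from the Version 1 template) where the real module would read from Wikidata.

```python
# Display order and labels for the infobox rows (hypothetical field names).
FIELDS = [
    ("symbol", "Symbol"),
    ("entrez_gene", "Entrez Gene"),
    ("ensembl", "Ensembl"),
    ("uniprot", "UniProt"),
]


def render_infobox(record: dict) -> str:
    """Render one text infobox from a structured record, skipping missing fields."""
    rows = [f"{label}: {record[key]}" for key, label in FIELDS if key in record]
    return "\n".join(rows)


# Values copied from the Version 1 template on this slide.
lactb2 = {
    "symbol": "LACTB2",
    "entrez_gene": "51110",
    "ensembl": "ENSG00000147592",
    "uniprot": "Q53H82",
}

print(render_infobox(lactb2))
# The same function serves any other gene record, however sparse.
print(render_infobox({"symbol": "RELN"}))
```

The design point is that a correction made once in the structured data shows up everywhere the routine is used, instead of being repeated in every copy of the template.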
  • 41. Application #2: Web Apollo Genome Browser • Genome annotation data retrieved from Wikidata via SPARQL queries to https://query.wikidata.org • Prototype achieved at recent San Diego hackathon. Putman et al (2016) (under review) (preprint in BioRxiv)
  • 42. Microbial Genetic Data • Widely distributed • Difficult to query • Not structured in a meaningful way • A lot of interest from this community!
  • 43. Microbial Genetic Data
  • 44. Microbial genomes in Wikidata • Loading genes, proteins, annotations for 120 reference genomes. • Completed 21 genomes so far Putman et al (2016) (under review) (preprint in BioRxiv)
  • 45. Microbiome modeling in Wikidata Putman et al (2016) (under review) (preprint in BioRxiv)
  • 46. 1. Knowledge bases are not complete 2. Knowledge needs integration. Wikidata can help
  • 47. Centralizing content while distributing labor: Open data, Wikipedia(s), Your Apps Here!
  • 48. Thanks! Gene Wikidata Team: Andra Waagmeester (Micelio), *Sebastian Burgstaller (Scripps), *Tim Putman (Scripps), *Elvira Mitraka (U Maryland), Julia Turner (Scripps), Justin Leong (UBC), Lynn Schriml (U Maryland), Paul Pavlidis (UBC), Andrew Su (Scripps), Ginger Tsueng (Scripps). Contact: bgood@scripps.edu. * First author on manuscript cited in this presentation. Photo (Ben, Tim, Andra, Elvira, Sebastian): some Gene Wiki team members enjoying their best paper award at SWAT4LS, Dec. 2015