News

Cross-infrastructure collaboration with ENA improves processing, quality of DNA-derived occurrences

Published 9/14/2021

Newly automated workflow relies on web services from the European Nucleotide Archive to create stable pipeline for accessing up-to-date sequence-based records

<em>Oscillatoria redekei</em>, observed in Rambla del Puerto del Garruchal, Murcia, Spain. Photo 2021 Vicente Franch Meneu via <a href="https://www.gbif.org/occurrence/3058725179">iNaturalist Research-grade Observations</a>, licensed under <a href="http://creativecommons.org/licenses/by-nc/4.0/legalcode">CC BY-NC 4.0</a>.

A collaboration between the European Nucleotide Archive (ENA) and the GBIF Secretariat has established automated processes for publishing better organized, more up-to-date datasets on GBIF. These datasets reuse the globally comprehensive DNA sequence data that ENA and its partners NCBI and DDBJ maintain in the International Nucleotide Sequence Database Collaboration (INSDC).

EMBL's European Bioinformatics Institute (EMBL-EBI) maintains ENA, which supplied the first DNA-derived dataset shared through GBIF in 2014. As a result of the recent collaboration, these records have been segmented into three different datasets containing sequence-based records, records associated with host organisms and records associated with environment sample identifiers.

"Sequencing is one of the most important data feeds for global biodiversity observation," said Guy Cochrane, head of ENA. "I am delighted that the GBIF and EMBL-EBI ENA teams are working together to extend and enhance the availability of comprehensive INSDC data through GBIF. Our continued work together on improving granularity and filtering of these data will provide an increasingly accurate and reliable body of openly available observations for the scientific community."

The improved connection between the two infrastructures processes data from two separate APIs (application programming interfaces) that ENA maintains. The first step retrieves sequences to provide a pool of records for GBIF to transform into occurrence data, while the second brings in higher-level taxonomic information, ensuring proper placement of the scientific names associated with sequence-based records in the GBIF taxonomic backbone.
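
The two-step flow described above can be sketched roughly as follows. This is a minimal illustration that uses in-memory data instead of calls to the ENA services. The record fields (`accession`, `tax_id`, `scientific_name`), the shape of the lineage lookup and the function name `to_occurrences` are all assumptions made for the example, not names taken from ENA or GBIF code. The output keys borrow Darwin Core term names.

```python
from typing import Dict, List


def to_occurrences(sequences: List[dict], lineages: Dict[int, List[str]]) -> List[dict]:
    """Combine sequence records (step 1) with higher-level taxonomy (step 2)."""
    occurrences = []
    for seq in sequences:
        # A record whose taxon has no known lineage gets an empty classification.
        lineage = lineages.get(seq["tax_id"], [])
        occurrences.append({
            "occurrenceID": seq["accession"],
            "scientificName": seq["scientific_name"],
            "higherClassification": " | ".join(lineage),
        })
    return occurrences


# Hypothetical stand-ins for the responses of the two services.
# The accession and taxon ID are placeholders, not real identifiers.
sequences = [
    {"accession": "SEQ0001", "tax_id": 999001, "scientific_name": "Oscillatoria redekei"},
]
lineages = {999001: ["Bacteria", "Cyanobacteria"]}

occurrences = to_occurrences(sequences, lineages)
```

Keeping the taxonomy lookup separate from the sequence records mirrors the split between the two services: the pool of sequences can be refreshed without touching the higher-level classification, and vice versa.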

The number of records coming from ENA has increased by nearly 1.5 million, or 31 per cent. More important than the raw numbers, however, is that INSDC and other DNA-derived datasets provide significantly broader taxonomic coverage. They make meaningful contributions toward reducing both the delay between biodiversity detection and data discoverability and the taxonomic biases often described in research.

"Most of the additional DNA sequences coming from ENA are from vouchered specimens from scientific collections," said Joe Miller, executive secretary of the GBIF Secretariat. "Using the clustering algorithm that GBIF deployed last year, we are now linking DNA-derived records with their vouchers (see example), making both data occurrences more valuable. With the algorithm also starting to connect records' images and literature references, it's an exciting time in biodiversity data integration."

The update of the EMBL-EBI datasets is timely, given that colleagues from Naturalis, Royal Botanic Gardens, Kew, the Atlas of Living Australia and other institutions will meet shortly at a BioHackathon to explore how to improve the algorithm. (The hackathon is organized under the EU-funded BiCIKL project led by Pensoft Publishers, in which GBIF is a partner.)

Establishing a dynamic connection between ENA and GBIF will make it easier to keep up with the growth of relevant INSDC data while enabling the introduction of repeatable filtering steps to improve both the organization and the quality of the data. Duplicate records from single organisms have been reduced by matching sample accession numbers to identify non-contiguous sets of overlapping DNA segments (or "non-contigs"). Other omitted records include human DNA sequences; likely duplicates that are missing sample accession numbers; and records that have neither an associated specimen voucher nor both a location and a date.
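
As a rough illustration of how such repeatable filtering steps might look, the sketch below applies the omission rules listed above to a handful of made-up records. The field names (`sample_accession`, `specimen_voucher`, `location`, `collection_date`) and the function names are hypothetical, and keeping only the first record per sample accession is a simplification of the matching described above, not the actual GBIF or ENA implementation.

```python
from typing import Iterable, List


def keep_record(rec: dict) -> bool:
    """Apply the omission rules to a single record."""
    if rec.get("scientific_name") == "Homo sapiens":
        return False  # human DNA sequences are omitted
    if not rec.get("sample_accession"):
        return False  # treated as a likely duplicate that cannot be matched
    has_voucher = bool(rec.get("specimen_voucher"))
    has_place_and_date = bool(rec.get("location")) and bool(rec.get("collection_date"))
    # Keep a record only if it has a voucher, or both a location and a date.
    return has_voucher or has_place_and_date


def filter_records(records: Iterable[dict]) -> List[dict]:
    """Drop omitted records, then keep one record per sample accession."""
    seen = set()
    kept = []
    for rec in records:
        if not keep_record(rec):
            continue
        sample = rec["sample_accession"]
        if sample in seen:
            continue  # same sample already represented
        seen.add(sample)
        kept.append(rec)
    return kept


# Made-up records covering each rule; none of these are real identifiers.
records = [
    {"accession": "A", "scientific_name": "Oscillatoria redekei",
     "sample_accession": "S1", "specimen_voucher": "V1"},
    {"accession": "B", "scientific_name": "Oscillatoria redekei",
     "sample_accession": "S1", "specimen_voucher": "V1"},
    {"accession": "C", "scientific_name": "Homo sapiens",
     "sample_accession": "S2", "specimen_voucher": "V2"},
    {"accession": "D", "scientific_name": "Oscillatoria redekei",
     "specimen_voucher": "V3"},
    {"accession": "E", "scientific_name": "Oscillatoria redekei",
     "sample_accession": "S3", "location": "Murcia, Spain",
     "collection_date": "2021-01-01"},
    {"accession": "F", "scientific_name": "Oscillatoria redekei",
     "sample_accession": "S4", "location": "Murcia, Spain"},
]

kept = filter_records(records)
```

Expressing each rule as a small predicate keeps the steps repeatable: the same function can be rerun unchanged whenever a fresh pool of records is retrieved.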

These improvements complement EMBL-EBI and GBIF's earlier efforts to improve the connections between metagenomics and species occurrence data. With additional support and encouragement from ELIXIR, that work has established MGnify, an analysis platform and database of analysed microbiome sequencing projects hosted at EMBL-EBI, as GBIF's tenth-largest data publisher. Meanwhile, DOI-based data citations have connected the reuse of MGnify's 21.7 million occurrence records in 80 peer-reviewed articles just two years after the platform shared its first records.

Technical notes

This workflow relies on two open-source software projects: