News

New data-clustering feature aims to improve data quality and reveal cross-dataset connections

Published 7/28/2020

Initial release of algorithm identifies potentially related records, pointing toward wider integration of data

<a href="https://www.gbif.org/occurrence/2238236963"><i>Cortinarius vesterholtii</i></a> observed in Højbjerg, Denmark by Thomas Stjernegaard Jeppesen. Photo via the Danish Mycological Society (used with permission).

GBIF has released a first version of a data-clustering algorithm that identifies potentially related records by matching similar entries in individual fields across different datasets. This experimental feature can improve data quality by detecting potential duplicates, revealing related type specimens and exposing links between records from different sources like natural history collections, DNA-derived sequences and materials examined in taxonomic treatments.
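The idea of matching similar entries in individual fields across datasets can be illustrated with a toy sketch. This is not GBIF's algorithm: the field names in `MATCH_FIELDS`, the `min_shared` threshold and the record layout are all assumptions chosen for illustration, and the pairwise comparison would not scale to millions of records.

```python
# Hypothetical fields to compare; the rules of the real algorithm are not described here.
MATCH_FIELDS = ("species", "event_date", "recorded_by", "locality")


def normalize(value):
    """Lower-case and collapse whitespace so trivially different entries compare equal."""
    return " ".join(str(value).lower().split()) if value else None


def candidate_pairs(records, min_shared=3):
    """Return (id_a, id_b, shared_fields) for records from different datasets
    that agree on at least `min_shared` of the compared fields."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if a["dataset"] == b["dataset"]:
                continue  # only look for links across datasets
            shared = [
                field for field in MATCH_FIELDS
                if normalize(a.get(field)) is not None
                and normalize(a.get(field)) == normalize(b.get(field))
            ]
            if len(shared) >= min_shared:
                pairs.append((a["id"], b["id"], shared))
    return pairs
```

Each returned pair carries the list of fields that matched, so a reviewer can see why two records were proposed as related.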

Available wherever a ‘cluster’ tab appears on an individual occurrence record, the new feature is likely most effective when applied to preserved, fossil and living specimens. The initial implementation has identified 7.8 million clustered occurrences from a total of 190 million specimen records.

"Thanks to recent improvements to our infrastructure, we can now explore deeper relationships among individual occurrence records shared with GBIF," said Joe Miller, GBIF Executive Secretary. "By providing machine-generated assertions about records, the data clusters increase our understanding of data from collections and other sources and mark the first step towards a broader annotation system."

Although clustered records may at first seem simply duplicative, closer examination can uncover richer and more complex connections between records that illuminate:

The data-sharing patterns and behaviours in different networks
Specimens collected at the same time and place but housed in multiple natural history collections or herbaria
Differences in how data derived from the same event is processed and represented by different organizations
DNA-based occurrences derived from specimens
The recataloguing of sub-material from a specimen
Citation of specimens in taxonomic literature

The similarities revealed in the clusters have important implications both for finding errors and enriching data. Anomaly detection routines running across data clusters can expose possible issues and report them back to publishers. In this example, a specimen from the Natural History Museum, London, carries a collection date of 2019, while others in the cluster from the Smithsonian’s National Museum of Natural History, Muséum national d’Histoire naturelle and Royal Botanic Gardens, Kew, record a date of 1900.
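A date mismatch like this one could be caught with a simple majority check over a cluster. The sketch below is only an illustration of that idea, not a description of GBIF's anomaly detection routines: the `publisher` and `year` keys and the strict-majority rule are assumptions.

```python
from collections import Counter


def flag_date_outliers(cluster):
    """Return records whose year differs from the year shared by a strict
    majority of the dated records in the cluster (hypothetical rule)."""
    years = [r["year"] for r in cluster if r.get("year") is not None]
    if not years:
        return []
    common_year, count = Counter(years).most_common(1)[0]
    if count <= len(years) / 2:
        return []  # no clear majority, so nothing can be called an outlier
    return [
        r for r in cluster
        if r.get("year") is not None and r["year"] != common_year
    ]


# The cluster described in the text, reduced to publisher and year.
cluster = [
    {"publisher": "Natural History Museum, London", "year": 2019},
    {"publisher": "Smithsonian NMNH", "year": 1900},
    {"publisher": "MNHN", "year": 1900},
    {"publisher": "Royal Botanic Gardens, Kew", "year": 1900},
]
flagged = flag_date_outliers(cluster)
```

Here `flagged` contains only the London record, which is the one a publisher would want to review.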

Individual records in a cluster can also be used to enhance and improve related occurrences with information like dates, georeferenced locations, images and DNA sequences. If this approach sounds familiar, recall that research into such possibilities earned Kew’s Nicky Nicolson one of the two 2019 Young Researchers Awards.
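One cautious way to picture this kind of enrichment is to propose values for a record's empty fields only when every related record that has a value agrees on it. The function name, the field names and the agreement rule below are assumptions made for illustration, not part of the GBIF feature.

```python
def suggest_enrichments(record, related,
                        fields=("event_date", "decimal_latitude", "decimal_longitude")):
    """Suggest values for fields missing in `record`, taken from `related`
    records, but only where those records agree on a single value."""
    suggestions = {}
    for field in fields:
        if record.get(field) is not None:
            continue  # never overwrite what the publisher already supplied
        values = {r[field] for r in related if r.get(field) is not None}
        if len(values) == 1:
            suggestions[field] = values.pop()
    return suggestions
```

Returning suggestions rather than modifying the record keeps the original data intact and leaves the decision with the publisher.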

Other kinds of relationships revealed in these machine-identified data clusters include:

"These machine-generated assertions may also help resolve potential confusion when records from different sources use a variety of local identifier systems, as is often the case," said Thomas Stjernegaard Jeppesen, developer at the GBIF Secretariat. "For example, with this sequence-based record from the International Nucleotide Sequence Database Collaboration (INSDC), we can start to discover its relationships with a museum specimen record, a record from the Barcode of Life and a database replica of INSDC, despite the use of a variety of local identifiers."

Although this experimental feature remains subject to change and limited to the first 100 records, the GBIF API now provides access to this content (see example). Data publishers can start harvesting information on these relationships and expose it through their own data portals. Future improvements could enable the GBIF network to share such information with collections management systems or other digital data objects.
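For publishers who want to experiment with harvesting, a minimal sketch using only the Python standard library might look like the following. The endpoint path (`/occurrence/{key}/experimental/related`) and the response keys (`relatedOccurrences`, `reasons`, `occurrence`, `gbifId`, `key`) are assumptions based on the experimental API and may change, so check the current GBIF API documentation before relying on them.

```python
import json
import urllib.request

API_BASE = "https://api.gbif.org/v1"


def related_url(occurrence_key):
    """Build the URL of the (assumed) experimental related-records endpoint."""
    return f"{API_BASE}/occurrence/{occurrence_key}/experimental/related"


def fetch_related(occurrence_key, timeout=30):
    """Download and decode the JSON payload for one occurrence."""
    with urllib.request.urlopen(related_url(occurrence_key), timeout=timeout) as response:
        return json.load(response)


def summarize_related(payload):
    """Return (occurrence id, reasons) pairs from a payload of the assumed shape."""
    summary = []
    for item in payload.get("relatedOccurrences", []):
        occurrence = item.get("occurrence", {})
        # The identifier field name is an assumption; try both likely spellings.
        occurrence_id = occurrence.get("gbifId", occurrence.get("key"))
        summary.append((occurrence_id, list(item.get("reasons", []))))
    return summary
```

Calling `summarize_related(fetch_related(some_key))` for an occurrence key of interest would then list each related record together with the reasons given for the link.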

Next steps in the implementation involve refinements to the algorithm, the inclusion of relationship assertions as annotations in downloads and the exploration of how best to expose these relationships through search functionality.

"The first release of data clustering gives us a foundation to build on with our data users and data publishers," said Tim Robertson, the GBIF Secretariat's head of informatics. "We welcome community feedback to refine future development and to explore additional services that extend the usefulness of this feature by alerting publishers to potential errors and related specimens."