Biodiversity Information Science and Standards

Journal Information
EISSN : 2535-0897
Published by: Pensoft Publishers (10.3897)
Total articles ≅ 874
Current Coverage
LOCKSS
Archived in
SHERPA/ROMEO

Latest articles in this journal

Papy Nsevolo
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75908

Abstract:
Insects play a vital role for humans. Apart from well-known ecosystem services (e.g., pollination, biological control, decomposition), they also serve as food for humans. An increasing number of research reports (Mitsuhashi 2017, Jongema 2018) indicate that entomophagy (the practice of eating insects by humans) is a long-standing practice in many countries around the globe. In Africa notably, more than 524 insects have been reported to be consumed by different ethnic groups, serving as a cheap, eco-friendly and renewable source of nutrients on the continent. Given the global recession due to the COVID-19 pandemic and the threat it poses to food security and food production systems, edible insects are of special interest in African countries, particularly the Democratic Republic of the Congo (DRC), where they have been reported as vital to sustaining food security. Indeed, to date, the broadest lists of edible insects of the DRC reported at most 98 insects identified at species level (Monzambe 2002, Mitsuhashi 2017, Jongema 2018). But these lists are hampered by spelling mistakes and redundancy. An additional problem is raised by insects known only by their vernacular names (ethnospecies), as local languages (more than 240 living ones) do not necessarily give rigorous information, owing to polysemy. Based on the aforementioned challenges, entomophagy practices and edible insect species reported for the DRC (from the year of independence, 1960, to date) have been reviewed using four authoritative taxonomic databases: Catalogue of Life (CoL), the Integrated Taxonomic Information System, the Global Biodiversity Information Facility taxonomic backbone, and the Global Lepidoptera Names Index. Results confirm the top position of edible caterpillars (Lepidoptera, 50.8%), followed by Orthoptera (12.5%), Coleoptera and Hymenoptera (10.0% each). A total of 120 edible species (belonging to eighty genera, twenty-nine families and nine orders of insects) have been listed and mapped on a national scale. Likewise, host plants of edible insects have been inventoried and checked against CoL, Plant Resources of Tropical Africa, and the International Union for Conservation of Nature's Red List of Threatened Species. The host plant diversity is dominated by multi-use trees belonging to Fabaceae (34.4%), followed by Phyllanthaceae (10.6%) and Meliaceae (4.9%). However, the data include endangered (namely Millettia laurentii, Prioria balsamifera) and critically endangered (Autranella congolensis) host plant species, which calls for conservation strategies. To the best of our knowledge, these results are the first reports of such findings in Africa. Moreover, given the issues encountered during data compilation and the cross-checking of scientific names, a call was made for greater collaboration between local people and expert taxonomists (through citizen science), in order to unravel unidentified ethnospecies. Given the challenge of information technology infrastructure in Africa, such a target could be achieved with mobile apps. Likewise, a further call should be made for better synchronization of taxonomic databases, high-quality scientific photographs in taxonomic databases, and additional data (notably conservation status, and protein or DNA sequences), as edible insects need to be rigorously identified and sustainably managed. Indeed, these complementary data are crucial, given the limitations of conventional identification methods based on morphometric or dichotomous keys and the lack of voucher specimens in many African museums and collections. This could be achieved by QR (Quick Response) coding insect species and centralizing data about edible insects in a main authoritative taxonomic database, whose role is undebatable, as edible insects are today earmarked as a nutrient-rich source of proteins, fat, vitamins and fiber to mitigate food insecurity and poor diets, which are an aggravating factor for the impact of COVID-19.
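The name cross-checking described above can be illustrated against one of the four databases used, the GBIF backbone, via its public species-match web service. A minimal sketch follows; the two species names are real, widely reported edible insects chosen here for illustration, not taken from the paper's list.

```python
# Minimal sketch: cross-checking edible-insect scientific names against the
# GBIF backbone taxonomy, one of the four databases used in this study.
# Endpoint and fields follow the public GBIF species-match API.
import json
import urllib.parse
import urllib.request

def match_gbif(name: str) -> dict:
    """Query the GBIF backbone for a scientific name and return the match."""
    url = ("https://api.gbif.org/v1/species/match?"
           + urllib.parse.urlencode({"name": name}))
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# Illustrative names only (both are documented edible insects in Africa).
for name in ["Imbrasia epimethea", "Rhynchophorus phoenicis"]:
    hit = match_gbif(name)
    print(name, "->", hit.get("matchType"), hit.get("scientificName"))
```

Records whose matchType comes back as anything other than an exact match would be candidates for the manual review and expert curation the authors call for.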
Xiaojun Wang
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75856

Abstract:
Over 1 billion biodiversity collection specimens ranging from fungi to fish to fossils are housed in more than 1,600 natural history collections across the United States. The digitization of these specimens has risen significantly within the last few decades and is only likely to increase, as the use of digitized data gains more importance every day. Numerous experiments with automated image analysis have demonstrated the practicality and usefulness of digitized biodiversity images for computational techniques such as neural networks and image processing. However, most of the computational techniques for analyzing images of biodiversity collection specimens require well-curated data. One of the challenges in curating multimedia data of biodiversity collection specimens is the quality of the multimedia objects—in our case, two-dimensional images. To tackle the image quality problem, multimedia needs to be captured in a specific format and presented with appropriate descriptors. In this study we present an analysis of two image repositories, each consisting of 2D images of fish specimens from several institutions: the Integrated Digitized Biocollections (iDigBio) and the Great Lakes Invasives Network (GLIN). Approximately 70 thousand images from the GLIN repository and 450 thousand images from the iDigBio repository were processed, and their suitability assessed for use in neural network-based species identification and trait extraction applications. Our findings showed that images from the GLIN dataset were more suitable for image processing and machine learning purposes. Almost 40% of the species were represented by fewer than 10 images, while only 20% had more than 100 images per species. We identified and captured 20 metadata descriptors that define the quality and usability of an image. According to the captured metadata information, 70% of the GLIN dataset images were found to be useful for further analysis based on the overall image quality score. Quality issues with the remaining images included: curved specimens; non-fish objects such as tags, labels and rocks that obstructed the view of the specimen; color, focus and brightness issues; and folded, overlapping or missing parts. We used both the web interface and the API (Application Programming Interface) for downloading images from iDigBio. We searched for all fish genera, families and classes in three different searches with the images-only option selected, then combined all of the search results and removed duplicates. Our search of the iDigBio database for fish taxa returned approximately 450 thousand records with images. Aided by the multimedia metadata included with the downloaded search results, we narrowed this down to 90 thousand fish images, excluding non-fish images, fossil samples, X-ray and CT (computed tomography) scans and several others. Only 44% of these 90 thousand images were found to be suitable for further analysis. In this study, we discovered some of the limitations of biodiversity image datasets and built an infrastructure for assessing the quality of biodiversity images for neural network analysis. Our experience with the fish images gathered from two different image repositories has enabled us to describe image quality metadata features. With the help of these metadata descriptors, one can simply create a dataset of the desired image quality for the purpose of analysis. Likewise, the availability of the metadata descriptors will help advance our understanding of quality issues, while helping data technicians, curators and other digitization staff be more aware of multimedia quality.
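As a sketch of how such descriptors could be put to work, the snippet below filters an image-metadata table down to records that pass a quality threshold. The descriptor names (overall_quality, specimen_curved, has_occlusion) and the file name are hypothetical, since the abstract does not enumerate the 20 descriptors.

```python
# Illustrative sketch only: the 20 metadata descriptors in the study are not
# enumerated in the abstract, so the field names below are hypothetical.
import csv

def usable_images(metadata_csv: str, min_score: float = 0.7):
    """Yield image records whose quality metadata passes simple checks."""
    with open(metadata_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            if (float(row["overall_quality"]) >= min_score
                    and row["specimen_curved"] == "false"
                    and row["has_occlusion"] == "false"):
                yield row

# Build a training set from only the images deemed suitable for analysis.
for rec in usable_images("glin_image_metadata.csv"):
    print(rec["image_id"])
```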
Beata Bramorska
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75783

Abstract:
Poland is characterised by a relatively high variety of living organisms attributed to terrestrial and water environments. Close to 57,000 species of living organisms occurring in Poland have been described to date (Symonides 2008), including lowland and mountain species, those attributed to oceanic and continental areas, as well as species from forested and open habitats. Poland comprehensively represents the biodiversity of living organisms on a continental scale and is thus considered to have an important role in biodiversity maintenance. The Mammal Research Institute of the Polish Academy of Sciences (MRI PAS), located in Białowieża Forest, a UNESCO Heritage Site, has been collecting biodiversity data for 90 years. However, a great amount of the data gathered over the years, especially old data, is gradually being forgotten and is hard to access. Old catalogues and databases have never been digitized or publicly shared, and not many Polish scientists are aware of the existence of such resources, not to mention the rest of the scientific world. Recognizing the need for an online, interoperable platform following the FAIR data principles (findable, accessible, interoperable, reusable), where biodiversity and scientific data can be shared, MRI PAS took the lead in creating the Open Forest Data (OFD) repository. OpenForestData.pl (Fig. 1) is a newly created (2020) digital repository, designed to provide access to natural sciences data and to provide scientists with an infrastructure for storing, sharing and archiving their research outcomes. Creating such a platform is part of the ongoing development of the life sciences in Poland, aiming for an open, modern science where data are published with free access. OFD also allows for the consolidation of natural science data, enabling the use and processing of shared data, including via API (Application Programming Interface) tools. OFD is indexed by the Directory of Open Repositories (OpenDOAR) and the Registry of Research Data Repositories (re3data). The OFD platform is based entirely on reliable, globally recognized open source software: DATAVERSE, an interactive database application which supports sharing, storing, exploration, citation and analysis of scientific data; GEONODE, a geospatial content management system used for storing, publicly sharing and visualising vector and raster layers; GRAFANA, a system for storing and analysing metrics and large-scale measurement data, as well as visualising historical graphs over any time range and analysing trends; and external tools for database storage (Orthanc) and data visualisation (the Orthanc plugin Osimis Web Viewer and Online 3D Viewer (https://3dviewer.net/)), which were integrated with the Dataverse system mechanism. Furthermore, given the need for specimen description, the Darwin Core (Wieczorek et al. 2012) metadata schema was judged the most suitable for specimen and collection description and was mapped into an additional Dataverse metadata block. The use of Darwin Core rests on a common file format, the Darwin Core Archive (DwC-A), which allows data to be shared using common terminology and provides the possibility of easy evaluation and comparison of biodiversity datasets. Contributors to OFD can optionally choose Darwin Core for object descriptions, making it possible to share biodiversity datasets in a standardized way for users to download, analyse and compare.
Currently, OFD stores more than 10,000 datasets and objects from the collections of the Mammal Research Institute of the Polish Academy of Sciences and the Forest Science Institute of Białystok University of Technology. The objects from the natural collections were digitized, described, catalogued and made public with free access. OFD manages seven types of collection materials: 3D and 2D scans of specimens in the Herbarium, Fungarium, Insect and Mammal Collections; images from microscopes (including stereoscopic and scanning electron microscopes); morphometric measurements; computed tomography and microtomography scans in the Mammal Collection; mammal telemetry data; satellite imagery and geospatial climatic and environmental data; and georeferenced historical maps. In the OFD repository, researchers have the possibility to share data in a standardized way, which nowadays is often a requirement during the publishing process of a scientific article. Besides scientists, OFD is designed to be open and free for students and specialists in nature protection, but also for officials, foresters and nature enthusiasts. The creation of the OFD repository supports the development of citizen science in Poland, increases the visibility of and access to published data, and improves scientific collaboration and the exchange and reuse of data within and across borders.
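Since OFD can share data as Darwin Core Archives, a consumer-side sketch may be useful: a DwC-A is a zip file whose meta.xml describes a core data table, commonly a tab-separated occurrence.txt. The file and column names below follow that common convention and are assumptions, not OFD specifics.

```python
# Minimal sketch of reading the core table of a Darwin Core Archive (DwC-A).
# The core file name and tab delimiter are conventional defaults; a robust
# reader would parse meta.xml to discover them.
import csv
import io
import zipfile

def read_dwca_core(path: str, core_file: str = "occurrence.txt"):
    """Yield rows of the core table of a Darwin Core Archive as dicts."""
    with zipfile.ZipFile(path) as archive:
        with archive.open(core_file) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            yield from csv.DictReader(text, delimiter="\t")

# Hypothetical archive name; prints two standard Darwin Core terms per row.
for row in read_dwca_core("ofd_export.zip"):
    print(row.get("scientificName"), row.get("decimalLatitude"))
```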
Aurore Gourraud, Régine Vignes Lebbe, Marc Pignal
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75752

Abstract:
The joint use of two tools applied to plant description, XPER3 and Recolnat Annotate, made it possible to study the vegetative architectural patterns (Fig. 1) of Dendrobium (Orchidaceae) in New Caledonia defined by N. Hallé (1977). This approach is not directly related to taxonomy, but to the definition of sets of species grouped according to a growth pattern. In the course of this work, the characters stated by N. Hallé were analysed and, where necessary, amended to produce a data matrix and generate an identification key. Study materials: Dendrobium Sw. in New Caledonia. New Caledonia is an archipelago in the Pacific Ocean, a French overseas territory located east of Australia. It is one of the 36 biodiversity hotspots in the world. The genus Dendrobium Sw. sensu lato is one of the largest in the family Orchidaceae and contains over 1220 species. In New Caledonia, it includes 46 species. In his revision of the family, N. Hallé (1977) defined 14 architectural groups, into which he divided the 31 species known at that time. These models are based on those defined by F. Hallé and Oldeman (1970), but they are clearly intended to group species together for identification purposes. Architectural pattern: A pattern is a set of vegetative or reproductive characters that define the general shape of an individual. Produced by mechanisms linked to the dominance of the terminal buds, the architectural groups are differentiated by the arrangement of the leaves, the position of the inflorescences or the shape of the stem (Fig. 1). Plants obeying a given pattern do not necessarily have phylogenetic relationships. These models have a useful application in the field for identifying groups of plants. Monocotyledonous plants, and in particular the Orchidaceae, lend themselves well to this approach, which produces stable architectural patterns. Recolnat Annotate: Recolnat Annotate is a free tool for observing qualitative features and making physical measurements (angle, length, area) on images. It can be used offline and downloaded from https://www.recolnat.org/en/annotate. The software is based on setting up observation projects that group together a batch of herbarium images to be studied and associate them with a descriptive model. A file of measurements can be exported in comma-separated value (csv) format for further analysis (Fig. 2). XPER3: Usually used in the context of systematics, in which the items studied are taxa, XPER3 can also be used to distinguish architectural groups that are not phylogenetically related. Developed by the Laboratoire d'Informatique et Systématique (LIS) of the Institut de Systématique, Evolution, Biodiversité in Paris, XPER3 is an online collaborative platform that allows the editing of descriptive data (https://www.xper3.fr/?language=en). This tool allows the cross-referencing of items (in this case architectural groups) and descriptors (or characters). It allows the development of free-access identification keys (i.e., without a fixed sequence of identification steps), which can be used directly online. It can also produce single-access keys, with or without character weighting and dependencies between characters. Links between XPER3 and Recolnat Annotate: The descriptive model used by Recolnat Annotate can be developed within the framework of XPER3, which provides for characters and character states. Thus the observations made with the Recolnat Annotate measurement tool can be integrated into the XPER3 platform.
Specimens can then be compared, or several descriptions can be merged to express the description of a species (Fig. 3). Results: The joint use of XPER3 and Recolnat Annotate to manage both herbarium specimens and architectural patterns has proven to be relevant. Moreover, the measurements on the virtual specimens are fast and reliable. N. Hallé (1977) had produced a dichotomous single-access key that allowed the identification and attribution of a pattern to a plant observed in the field or in a herbarium. The project to build a polytomous and interactive key with XPER3 required completing the observations to give a status for each character of each vegetative architectural model. Recolnat Annotate was used to produce observations from the herbarium network in France. The use of XPER3 has allowed us to redefine these models in the light of new data from the herbaria and to publish the interactive key available at dendrobium-nc.identificationkey.org.
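The csv export mentioned above lends itself to simple downstream analysis. The sketch below aggregates length measurements per character; the column names and file name are hypothetical, as the abstract does not specify the export schema.

```python
# Sketch of downstream analysis of a Recolnat Annotate CSV export. The
# abstract only states that measurements (angle, length, area) are exported
# as CSV; the column names used below are hypothetical.
import csv
from collections import defaultdict
from statistics import mean

measurements = defaultdict(list)
with open("dendrobium_measurements.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        if row["measure_type"] == "length":
            measurements[row["character"]].append(float(row["value"]))

# Per-character summaries of the kind that feed an XPER3 data matrix.
for character, values in measurements.items():
    print(f"{character}: n={len(values)}, mean={mean(values):.2f}")
```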
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75690

Abstract:
Plazi's TreatmentBank is a research infrastructure and a partner in the recent European Union-funded Biodiversity Community Integrated Knowledge Library (BiCIKL) project, which aims to provide a single knowledge portal to open, interlinked and machine-readable, findable, accessible, interoperable and reusable (FAIR) data. Plazi is liberating published biodiversity data that is trapped in so-called flat formats, such as the portable document format (PDF), to increase its FAIRness. This can pose a variety of challenges for both data mining and curation of the extracted data. The automation of such a complex process requires internal organization and a well-established workflow of specific steps (e.g., decoding of the PDF, extraction of data) to handle the challenges that the immense variety of graphic layouts in the biodiversity publishing landscape can impose. These challenges may vary according to the origin of the document: scanned documents that were not initially digital need optical character recognition in order to be processed. Processing a document can either be an individual, one-time-only process, or a batch process, in which a template for a specific document type must be produced. Templates consist of a set of parameters that tell Plazi-dedicated software how to read a document and where to find key pieces of information for the extraction process, such as the related metadata. These parameters aim to improve the outcome of the data extraction process and lead to more consistent results than manual extraction. In order to produce such templates, a set of tests and accompanying statistics are evaluated, and these same statistics are constantly checked against ongoing processing tasks in order to assess template performance in a continuous manner. In addition to these steps that are intrinsically associated with the automated process, different granularity levels were defined to accommodate the specific needs of particular projects and user requirements (e.g., a low granularity level might consist of a treatment and its subsections, versus a high granularity level that includes material citations down to named entities such as collection codes, collector and collecting date). The higher the granularity level, the more thoroughly checked the resulting data is expected to be. Additionally, steps related to quality control (qc), namely “pre-qc”, “qc” and “extended qc”, were designed and implemented to ensure data quality and enhanced data accuracy. Data on all these different stages of the processing workflow are constantly being collected and assessed in order to improve these very same stages, aiming for a more reliable and efficient operation. This is also associated with a current Data Architecture plan to move this data assessment to a cloud provider to promote real-time assessment and constant analysis of template performance and the processing stages as a whole. In this talk, the steps of this entire process are explained in detail, highlighting how data are being used to improve these steps towards a more efficient, accurate, and less costly operation.
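The abstract does not disclose Plazi's actual template format, so the following is a purely hypothetical illustration of the idea: a small parameter set naming a document type, whether OCR is needed, the target granularity level, and where key metadata is expected to be found.

```python
# Hypothetical illustration of an extraction template as described above:
# parameters telling an extraction tool how to read a document type and
# where to find key information. Not Plazi's real format.
from dataclasses import dataclass, field

@dataclass
class ExtractionTemplate:
    document_type: str                 # the journal/layout the template targets
    needs_ocr: bool                    # scanned documents require OCR first
    granularity: str                   # e.g. "treatment" vs "material-citation"
    metadata_regions: dict = field(default_factory=dict)

template = ExtractionTemplate(
    document_type="Example Journal of Taxonomy",
    needs_ocr=False,
    granularity="material-citation",   # higher granularity, more QC expected
    metadata_regions={"title": "page1:top", "doi": "page1:footer"},
)
print(template)
```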
Aurélie Jambon, Camille Monchicourt, Olivier Rovellotti
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75704

Abstract:
Huge improvements have been made over the years in collecting and standardising biodiversity data (Bisby 2000, Osawa 2019, Hardisty and Roberts 2013) and in overhauling biodiversity data management to make its information more FAIR (Findable, Accessible, Interoperable, Reusable) (Simons 2021), but there is still room for improvement. Most professionals working in protected areas, conservation groups, and research organisations lack the required know-how to improve the reuse ratio of their data. GeoNature and GeoNature-Atlas (Monchicourt 2018, Corny et al. 2019) are a set of open-source software tools that facilitate data collection, management, validation, sharing (e.g., via the Darwin Core standard) and visualisation. The project is a powerful case study in collaborative work, involving teams from the private and public sectors, with at least fifteen national parks and forty other organisations currently using and contributing to the package in France and Belgium (view it on GitHub).
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75736

Abstract:
Specimens have long been viewed as critical to research in the natural sciences because each specimen captures the phenotype (and often the genotype) of a particular individual at a particular point in space and time. In recent years there has been considerable focus on digitizing the many physical specimens currently in the world’s natural history research collections. As a result, a growing number of specimens are each now represented by their own “digital specimen”, that is, a findable, accessible, interoperable and re-usable (FAIR) digital representation of the physical specimen, which contains data about it. At the same time, there has been growing recognition that each digital specimen can be extended, and made more valuable for research, by linking it to data/samples derived from the curated physical specimen itself (e.g., computed tomography (CT) scan imagery, DNA sequences or tissue samples), to directly related specimens or data about the organism's life (e.g., specimens of parasites collected from it, photos or recordings of the organism in life, the immediate surrounding ecological community), and to the wide range of associated specimen-independent data sets and model-based contextualisations (e.g., taxonomic information, conservation status, bioclimatological region, remote sensing images, environmental-climatological data, traditional knowledge, genome annotations). The resulting connected network of extended digital specimens will enable new research on a number of fronts, and indeed this has already begun. The new types of research enabled fall into four distinct but overlapping categories. First, because the digital specimen is a surrogate—acting on the Internet for a physical specimen in a natural science collection—it is amenable to analytical approaches that are simply not possible with physical specimens. For example, digital specimens can serve as training, validation and test sets for predictive process-based or machine learning algorithms, which are opening new doors of discovery and forecasting. Such sophisticated and powerful analytical approaches depend on FAIR, and on extended digital specimen data being as open as possible. These analytical approaches can deliver the biodiversity monitoring outputs that are critically needed by the biodiversity community because they are central to conservation efforts at all levels of analysis, from genetic to species to ecosystem diversity. Second, linking specimens to closely associated specimens (potentially across multiple disparate collections) allows for the coordinated co-analysis of those specimens. For example, linking specimens of parasites/pathogens to specimens of the hosts from which they were collected allows for a powerful new understanding of coevolution, including pathogen range expansion and shifts to new hosts. Similarly, linking specimens of pollinators, their food plants, and their predators can help untangle complex food webs and multi-trophic interactions. Third, linking derived data to their associated voucher specimens increases information richness, density, and robustness, thereby allowing for novel types of analyses and strengthening validation through linked independent data, thus improving confidence levels and risk assessment. For example, digital representations of specimens, which incorporate, for example, images, CT scans, or vocalizations, may capture important information that otherwise is lost during preservation, such as coloration or behavior.
In addition, permanently linking genetic and genomic data to the specimen of the individual from which they were derived—something that is currently done inconsistently—allows for detailed studies of the connections between genotype and phenotype. Furthermore, persistent links between physical specimens, additional information, and associated transactions are the building blocks for documenting and preserving chains of custody. These links will also facilitate the cleaning, updating and maintenance of digital specimens and their derived and associated datasets, with ever-expanding research questions and applied uses materializing over time. The resulting high-quality data resources are needed for fact-based decision-making and forecasting based on monitoring, forensics and prediction workflows in conservation, sustainable management and policy-making. Finally, linking specimens to diverse but associated datasets allows for detailed, often transdisciplinary, studies of topics ranging from local adaptation, through the forces driving range expansion and contraction (critically important to our understanding of the consequences of climate change), to social vectors in disease transmission. A network of extended digital specimens will enable new and critically important research and applications in all of these categories, as well as science and uses that we cannot yet envision.
Hong Cui, Bruce Ford, Julian Starr, Anton Reznicek, Noah Giebink, Dylan Longert, Étienne Léveillé-Bourret, Limin Zhang
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75741

Abstract:
It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40%, even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values obtained from real specimens or high-quality specimen images can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference between this system and other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the...
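To make the RDF output concrete, here is a minimal sketch of emitting one Entity-Quality style statement with the rdflib library. The namespace and term IRIs are illustrative placeholders, not the project's actual ontology.

```python
# Minimal sketch of an Entity-Quality style statement as RDF, in the spirit
# of the graphs Character Recorder produces. The namespace and terms below
# are illustrative, not the project's actual ontology.
from rdflib import BNode, Graph, Literal, Namespace, RDF

EX = Namespace("https://example.org/carex-ontology/")

g = Graph()
obs = BNode()  # one observed character of one specimen
g.add((obs, RDF.type, EX.CharacterObservation))
g.add((obs, EX.entity, EX.PerigyniumBeak))    # the anatomical entity
g.add((obs, EX.quality, EX.Length))           # the quality being measured
g.add((obs, EX.valueInMillimeters, Literal(1.2)))

print(g.serialize(format="turtle"))
```

Inter-curator variation shows up in graphs like this when two curators choose different entity or quality terms for the same sentence, which is exactly the conflict the 'Conflict Resolver' component is meant to arbitrate.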
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75686

Abstract:
GBIF (Global Biodiversity Information Facility) is the largest data aggregator of biological occurrences in the world. GBIF was officially established in 2001 and has since aggregated 1.8 billion occurrence records from almost 2000 publishers. GBIF relies heavily on Darwin Core (DwC) for organising the data it receives. GBIF Data Processing Pipelines: Every single occurrence record that gets published to GBIF goes through a series of three processing steps before it becomes available on GBIF.org: source downloading, parsing into verbatim occurrences, and interpreting verbatim values. Once all records are available in the standard verbatim form, they go through a set of interpretations. In 2018, GBIF processing underwent a significant rewrite in order to improve speed and maintainability. One of the main goals of this rewrite was to improve the consistency between GBIF's processing and that of the Living Atlases. In connection with this, GBIF's current data validator fell out of sync with GBIF pipelines processing. New GBIF Data Validator: The current GBIF data validator is a service that allows anyone with a GBIF-relevant dataset to receive a report on the syntactical correctness and the validity of the content contained within the dataset. By submitting a dataset to the validator, users can go through the validation and interpretation procedures usually associated with publishing in GBIF and quickly determine potential issues in their data, without having to publish it. GBIF is planning to rework the current validator because it does not exactly match current GBIF pipelines processing. Planned Changes: The new validator will match the processing of the GBIF pipelines project. Validations will be saved and will show up on user pages, similar to the way downloads and derived datasets appear now (no more bookmarking validations!), and a downloadable report of issues found will be produced. Suggested Changes/Ideas: One of the main guiding philosophies for the new validator user interface will be avoiding information overload. The current validator is often quite verbose in its feedback, highlighting data issues that may or may not be fixable or particularly important. The new validator will: generate a map of record geolocations; give users issues in order of importance; give "What", "Where", "When" flags priority; and give some possible solutions or suggested fixes for flagged records. We see the hosted portal environment as a way to quickly implement a pre-publication validation environment that is interactive and visual. Potential New Data Quality Flags: The GBIF team has been compiling a list of new data quality flags. Not all of the suggested flags are easy to implement, so GBIF cannot promise the flags will be implemented, even if they are a great idea. The advantage of the new processing pipelines is that almost any new data quality flag or processing step in pipelines will be available for the data validator. Easy new potential flags include: country centroid flag: country/province centroids are a known data quality problem.
any zero coordinate flag: sometimes publishers leave either the latitude or longitude field as zero when it should have been left blank or NULL. default coordinate uncertainty in meters flag: sometimes a default value or code is used for dwc:coordinateUncertaintyInMeters, which might indicate that it is incorrect; this is especially the case for the values 301, 3036, 999 and 9999. no higher taxonomy flag: often publishers will leave out the higher taxonomy of a record, which can cause problems for matching to the GBIF backbone taxonomy. null coordinate uncertainty in meters flag: there has been some discussion that GBIF should encourage publishers more to fill in dwc:coordinateUncertaintyInMeters, because every record, even one taken from a Global Positioning System (GPS) reading, has an associated dwc:coordinateUncertaintyInMeters. It is also nice when a data quality flag has an escape hatch, such that a data publisher can get rid of false positives or remove a flag by filling in a value. Batch-type validations that are doable for pipelines, but probably not in the validator, include: outlier: outliers are a known data quality problem; there are generally two types, environmental outliers and distance outliers, and currently GBIF flags neither. record is sensitive species: a sensitive species record is one where the species is considered vulnerable in some way, usually due to poaching threat or because the species is only found in one area. gridded...
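The record-level flags described above are simple enough to sketch. Below is a minimal, hypothetical implementation of four of them over a Darwin Core occurrence record represented as a dict; the flag names paraphrase the abstract and this is not GBIF's actual pipelines code.

```python
# Sketch of the "easy" record-level checks listed above, applied to one
# Darwin Core occurrence record. Not GBIF's actual pipelines code.
SUSPECT_DEFAULT_UNCERTAINTY = {301.0, 3036.0, 999.0, 9999.0}

def flag_record(record: dict) -> list:
    """Return the list of data quality flags raised by one record."""
    flags = []
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    unc = record.get("coordinateUncertaintyInMeters")
    if lat == 0 or lon == 0:
        flags.append("ANY_ZERO_COORDINATE")
    if unc is None:
        flags.append("NULL_COORDINATE_UNCERTAINTY")
    elif float(unc) in SUSPECT_DEFAULT_UNCERTAINTY:
        flags.append("DEFAULT_COORDINATE_UNCERTAINTY")
    if not record.get("family") and not record.get("order"):
        flags.append("NO_HIGHER_TAXONOMY")
    return flags

print(flag_record({"decimalLatitude": 0, "decimalLongitude": 13.4,
                   "coordinateUncertaintyInMeters": 9999}))
```

Note how the uncertainty check has the "escape hatch" property the abstract asks for: supplying a real dwc:coordinateUncertaintyInMeters value clears both uncertainty flags.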
Mathias Aloui, Gaëtan Duhamel, Manon Frédout, Olivier Rovellotti
Biodiversity Information Science and Standards, Volume 5; https://doi.org/10.3897/biss.5.75705

Abstract:
It is now well known that a healthy urban ecosystem is a crucial element for healthier citizens (Astell-Burt and Feng 2019), better air (Ning et al. 2016) and water quality (Livesley et al. 2016), and, overall, for a more resilient urban environment (Huff et al. 2020). With ecoTeka, an open-source platform for tree management, we leverage the power of OpenStreetMap (Mooney 2015), Mapillary, and open data to allow decision makers to improve their urban forestry practices. To have the most comprehensive data about these ecosystems, we plan to use all available sources, from satellite imagery to LIDAR (light detection and ranging), and process them with the DeepForest (Weinstein et al. 2020) deep learning algorithm. We also teamed up with the French government to build an open standard for tree data to improve the interoperability of the system. Finally, we calculate a Shannon-Wiener diversity index (used by ecologists to estimate species diversity from the relative abundance of species in a habitat) to inform decision-making about urban ecosystems.
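The Shannon-Wiener index mentioned above is H' = -Σ pᵢ ln pᵢ, where pᵢ is the proportion of individuals belonging to species i. A minimal sketch follows; the tree inventory is a made-up example, not ecoTeka data.

```python
# Minimal sketch of the Shannon-Wiener diversity index H' = -sum(p_i * ln p_i),
# computed from a hypothetical street-tree inventory of species counts.
from math import log

def shannon_wiener(counts: dict) -> float:
    """Shannon-Wiener index from a mapping of species -> number of trees."""
    total = sum(counts.values())
    return -sum((n / total) * log(n / total) for n in counts.values() if n > 0)

inventory = {"Platanus x hispanica": 120, "Tilia cordata": 80, "Acer campestre": 40}
print(f"H' = {shannon_wiener(inventory):.3f}")
```

A higher H' indicates a more diverse (and typically more resilient) urban tree stock, which is the signal the platform surfaces to decision makers.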