fdwc.github.io

Frictionless DarwinCore pages

View on GitHub

Date: 20 Sep 2019 Title: Goodtables

Today, we’ll see how FrictionlessDarwinCore and Goodtables can help GBIF users with a simple data flow: Download data flow

1. Download occurrences from GBIF portal

Login to GBIF.org Data portal. Create and refine your query with the appropriate filters. Download the resulting occurrences in DarwinCore format. Store this DwC archive on your HD, eg under 0000637-190918142434337.zip GBIF.org Download

2. Convert this DwCA into a DataPackage

$ fdwca 0000637-190918142434337.zip download_DP.zip
...

3. Validate downloaded occurrences with goodtables

$ goodtables download_DP.zip
DATASET
=======
{'error-count': 8560,
 'preset': 'nested',
 'table-count': 3,
 'time': 18.842,
 'valid': False}

TABLE [1]
=========
{'datapackage': 'download_DP.zip',
 'error-count': 8533,
 'format': 'inline',
 'headers': ['gbifID',
             'abstract',
             'accessRights',
             ...
             'lastParsed',
             'lastCrawled',
             'repatriated'],
 'row-count': 4416,
 'schema': 'table-schema',
 'source': '/var/folders/19/mrkkz1tx34l98n1rs3qr3xr00000gn/T/tmpb7ugs_ug-datapackage/occurrence.txt',
 'time': 18.468,
 'valid': False}
---------
[2,64] [enumerable-constraint] The value "PRESERVED_SPECIMEN" in row 2 and column 64 does not conform to the given enumeration: "['PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'MaterialSample', 'Event', 'HumanObservation', 'MachineObservation', 'Taxon', 'Occurrence']"
[25,64] [enumerable-constraint] The value "HUMAN_OBSERVATION" in row 25 and column 64 does not conform to the given enumeration: "['PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'MaterialSample', 'Event', 'HumanObservation', 'MachineObservation', 'Taxon', 'Occurrence']"
[25,135] [type-or-format-error] The value "3536.0" in row 25 and column 135 is not type "integer" and format "default"
...

TABLE [2]
=========
{'datapackage': 'download_DP.zip',
 'error-count': 27,
 'format': 'inline',
 'headers': ['gbifID',
             'abstract',
             'accessRights',
            ...
             'taxonomicStatus',
             'nomenclaturalStatus',
             'taxonRemarks'],
 'row-count': 4416,
 'schema': 'table-schema',
 'source': '/var/folders/19/mrkkz1tx34l98n1rs3qr3xr00000gn/T/tmpb7ugs_ug-datapackage/verbatim.txt',
 'time': 17.675,
 'valid': False}
---------
[103,123] [maximum-length-constraint] The value "BEL" in row 103 and column 123 does not conform to the maximum length constraint of "2"
[470,36] [maximum-length-constraint] The value "Various" in row 470 and column 36 does not conform to the maximum length constraint of "2"
[470,142] [type-or-format-error] The value "707.1" in row 470 and column 142 is not type "integer" and format "default"
[1440,64] [enumerable-constraint] The value "ObservedSpecimen" in row 1440 and column 64 does not conform to the given enumeration: "['PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'MaterialSample', 'Event', 'HumanObservation', 'MachineObservation', 'Taxon', 'Occurrence']"
[1441,64] [enumerable-constraint] The value "ObservedSpecimen" in row 1441 and column 64 does not conform to the given enumeration: "['PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'MaterialSample', 'Event', 'HumanObservation', 'MachineObservation', 'Taxon', 'Occurrence']"
[1618,123] [maximum-length-constraint] The value "BE-WNA" in row 1618 and column 123 does not conform to the maximum length constraint of "2"
[4416,64] [enumerable-constraint] The value "Museum specimen" in row 4416 and column 64 does not conform to the given enumeration: "['PreservedSpecimen', 'FossilSpecimen', 'LivingSpecimen', 'MaterialSample', 'Event', 'HumanObservation', 'MachineObservation', 'Taxon', 'Occurrence']"
...

TABLE [3]
=========
{'datapackage': 'download_DP.zip',
 'error-count': 0,
 'format': 'inline',
 'headers': ['gbifID',
             'type',
             'format',
             'identifier',
             'references',
             'title',
             'description',
             'created',
             'creator',
             'contributor',
             'publisher',
             'audience',
             'source',
             'license',
             'rightsHolder'],
 'row-count': 176,
 'schema': 'table-schema',
 'source': '/var/folders/19/mrkkz1tx34l98n1rs3qr3xr00000gn/T/tmpb7ugs_ug-datapackage/multimedia.txt',
 'time': 0.076,
 'valid': True}

4. If necessary, add user constraints

Open data_package.json (from download_DP.zip), and add your own user constraints such as: