Date: 30 Nov 2019 Title: Programming
Today we’ll write a simple Python program on top of FrictionlessDarwinCore and Goodtables libraries. The use case is the following: you want to validate a DarwinCore file using Goodtables tool. In a previous post, we’ve seen how to do that in two steps from a terminal:
$ fdwca 0000637-190918142434337.zip download_DP.zip
$ goodtables download_DP.zip
Now we’ll do the same but in a single Python CLI program(dwca_report.py).
1. Imports
Our program requires 3 libraries:
- FrictionlessDarwinCore converts DarwinCore archive into Data Package.
- goodtables analyzes the Data Package and produce the report.
- Click helps us to build a nice CLI.
Start a Python file (eg dwca_report.py) and write this:
import click
from FrictionlessDarwinCore import DwCArchive
from goodtables import validate
2. CLI Parameters
Our program will receive two parameters:
- dwca: the URL or local path to our DarwinCore archive
- oupath: the output path of the conveted data package.
The report itself being generated on default output. Thanks to Click library we can easily define these parameters:
@click.command()
@click.argument('dwca', type=str, required=True)
@click.argument('outpath', type=click.Path(writable=True),required=True)
def do_it(dwca, outpath):
...
if __name__ == '__main__':
do_it()
Our main program will simply invoke a do_it() function. Note that the dwca parameter is a string (URL or path) while outpath paramter is a writable path.
3. Do_it()
To produce the report, do_it() function will successively:
- create a DwCArchive object
- infer its structure from underlying data/metadata files
- save it as a Data Package
- validate this Data Package with goodtables
Each step may fail, so we will check that everything went well before processing the next step. At the core of our program, the do_it() function, will now look like this:
def do_it(dwca, outpath):
da = DwCArchive(dwca)
if not da.valid:
print('invalid dwca after init()')
return
da.infer()
if not da.valid:
print('invalid dwca after infer()')
return
da.save(outpath)
if not da.valid:
print('invalid dwca after save()')
return
report = validate(outpath)
print(report)
Alternatively, you might consider using try except.
4. From your terminal
Now, we can run our CLI program directly from a terminal, analyse the report and if necessary fix the data errors.
> Python dwca_report.py https://ipt.biodiversity.be/archive.do?r=rbins_saproxilyc_beetles DP1.zip
downloading https://ipt.biodiversity.be/archive.do?r=rbins_saproxilyc_beetles as /var/folders/19/mrkkz1tx34l98n1rs3qr3xr00000gn/T/tmpfao8xjbj
Adding additional_id_field
saving zipfile to DP.zip
{'time': 0.712, 'valid': True, 'error-count': 0, 'table-count': 1, 'tables': [{'datapackage': 'DP.zip', 'time': 0.536, 'valid': True, 'error-count': 0, ...}