programmatic access to GEOROC data
Project description
pygeoroc
Python library to access GEOROC data as archived at https://data.goettingen-research-online.de/dataverse/digis.
Cite GEOROC data as specified
for each pre-compiled dataset
and pygeoroc as
Robert Forkel. (2022). pofatu/pygeoroc: Programmatic access to GEOROC data. Zenodo. http://doi.org/10.5281/zenodo.3744586
Install
Install pygeoroc from the Python Package Index running
pip install pygeoroc
preferably in a separate virtual environment, to keep your system's Python installation unaffected.
Installing pygeoroc will also install a command line program georoc, which provides
functionality to curate a local copy of the GEOROC data. Run
georoc -h
for details on usage.
Overview
GEOROC provides its downloadable content in precompiled files organized in datasets.
pygeoroc provides functionality to
- download this data to a local directory, called repository,
- access the data in the repository programmatically from Python code,
- load the data from the repository into a SQLite database for scalable and performant analysis.
Downloading GEOROC data
Running
$ georoc --repos tmp download
will create or update a local repository, i.e. a directory with the following layout:
$ tree -L 1 .
.
├── csv
└── datasets.json
The repository contains a "table of contents" in datasets.json. The checksum recorded per file in this table is checked
when running georoc download again, making sure only new versions of files are fetched.
The local repository can be inspected running georoc ls, e.g.
$ georoc --repos tmp/ ls --samples --references --format pipe
will output the table included below
| file | dataset | size | last modified | # samples | # references | path |
|---|---|---|---|---|---|---|
| doi:10.25625/1KRR1P/QIINIE | GEOROC Compilation: Archaean Cratons | 211.7KB | 2022-06-20 | 242 | 8 | 2022-06-1KRR1P_ALDAN_SHIELD_ARCHEAN.csv |
| doi:10.25625/1KRR1P/9YG7VJ | GEOROC Compilation: Archaean Cratons | 494.3KB | 2022-06-20 | 564 | 23 | 2022-06-1KRR1P_AMAZONIAN_CRATON.csv |
| doi:10.25625/1KRR1P/E7NI47 | GEOROC Compilation: Archaean Cratons | 3.3MB | 2022-06-20 | 3776 | 66 | 2022-06-1KRR1P_BALTIC_SHIELD_ARCHEAN.csv |
| doi:10.25625/1KRR1P/FY7PXY | GEOROC Compilation: Archaean Cratons | 53.8KB | 2022-06-20 | 61 | 5 | 2022-06-1KRR1P_BASTAR_CRATON_ARCHEAN.csv |
...
Loading GEOROC data into SQLite
Since GEOROC contains data about hundreds of thousands of samples, querying it is made a lot easier by loading it into a SQLite database:
$ georoc --repos tmp/ createdb
100%|██████████████████████████████████████████████████| 5/5 [00:00<00:00, 25.32it/s]
INFO tmp/georoc.sqlite
The resulting database has 4 tables:
file: Info about a CSV file, basically the data fromindex.csv.sample: Info about individual samples.reference: Info about sources of the datacitation: The association table relating samples with references.
The schema can be inspected running:
$ sqlite3 tmp/georoc.sqlite
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .schema
and looks as follows:
CREATE TABLE file (id TEXT PRIMARY KEY, date TEXT, section TEXT);
CREATE TABLE reference (id INTEGER PRIMARY KEY, reference TEXT);
CREATE TABLE sample (
id TEXT PRIMARY KEY,
file_id TEXT,
`AGE` TEXT,
`ALTERATION` TEXT,
`DRILL_DEPTHAX` TEXT,
`DRILL_DEPTH_MIN` TEXT,
`ELEVATION_MAX` TEXT,
`ELEVATION_MIN` TEXT,
`EPSILON_ND` TEXT,
`ERUPTION_DAY` TEXT,
`ERUPTION_MONTH` TEXT,
`ERUPTION_YEAR` TEXT,
`GEOL.` TEXT,
`LAND_OR_SEA` TEXT,
`LATITUDE_MAX` TEXT,
`LATITUDE_MIN` TEXT,
`LOCATION` TEXT,
`LOCATION_COMMENT` TEXT,
`LONGITUDE_MAX` TEXT,
`LONGITUDE_MIN` TEXT,
`MATERIAL` TEXT,
`MINERAL` TEXT,
`ROCK_NAME` TEXT,
`ROCK_TEXTURE` TEXT,
`ROCK_TYPE` TEXT,
`TECTONIC_SETTING` TEXT,
`AR40_K40` TEXT,
`HE3_HE4` TEXT,
`HE4_HE3` TEXT,
`HF176_HF177` TEXT,
`K40_AR40` TEXT,
`ND143_ND144` TEXT,
`ND143_ND144_INI` TEXT,
`OS184_OS188` TEXT,
`OS186_OS188` TEXT,
`OS187_OS186` TEXT,
`OS187_OS188` TEXT,
`PB206_PB204` TEXT,
`PB206_PB204_INI` TEXT,
`PB207_PB204` TEXT,
`PB207_PB204_INI` TEXT,
`PB208_PB204` TEXT,
`PB208_PB204_INI` TEXT,
`RE187_OS186` TEXT,
`RE187_OS188` TEXT,
`SR87_SR86` TEXT,
`SR87_SR86_INI` TEXT,
`AG(PPM)` TEXT,
`AL(PPM)` TEXT,
`AS(PPM)` TEXT,
`AU(PPM)` TEXT,
`B(PPM)` TEXT,
`BA(PPM)` TEXT,
`BE(PPM)` TEXT,
`BI(PPM)` TEXT,
`BR(PPM)` TEXT,
`C(PPM)` TEXT,
`CA(PPM)` TEXT,
`CAO(WT%)` TEXT,
`CD(PPM)` TEXT,
`CE(PPM)` TEXT,
`CL(PPM)` TEXT,
`CL(WT%)` TEXT,
`CO(PPM)` TEXT,
`CR(PPM)` TEXT,
`CS(PPM)` TEXT,
`CU(PPM)` TEXT,
`DY(PPM)` TEXT,
`ER(PPM)` TEXT,
`EU(PPM)` TEXT,
`F(PPM)` TEXT,
`F(WT%)` TEXT,
`FE(PPM)` TEXT,
`FEO(WT%)` TEXT,
`FEOT(WT%)` TEXT,
`GA(PPM)` TEXT,
`GD(PPM)` TEXT,
`GE(PPM)` TEXT,
`HE(CCM/G)` TEXT,
`HE(CCMSTP/G)` TEXT,
`HE(NCC/G)` TEXT,
`HF(PPM)` TEXT,
`HG(PPM)` TEXT,
`HO(PPM)` TEXT,
`I(PPM)` TEXT,
`IN(PPM)` TEXT,
`IR(PPM)` TEXT,
`K(PPM)` TEXT,
`LA(PPM)` TEXT,
`LI(PPM)` TEXT,
`LOI(WT%)` TEXT,
`LU(PPM)` TEXT,
`MAX._AGE_(YRS.)` TEXT,
`MG(PPM)` TEXT,
`MGO(WT%)` TEXT,
`MIN._AGE_(YRS.)` TEXT,
`MN(PPM)` TEXT,
`MNO(WT%)` TEXT,
`MO(PPM)` TEXT,
`NA(PPM)` TEXT,
`NB(PPM)` TEXT,
`ND(PPM)` TEXT,
`NI(PPM)` TEXT,
`NIO(WT%)` TEXT,
`O(WT%)` TEXT,
`OH(WT%)` TEXT,
`OS(PPM)` TEXT,
`OTHERS(WT%)` TEXT,
`P(PPM)` TEXT,
`PB(PPM)` TEXT,
`PD(PPM)` TEXT,
`PR(PPM)` TEXT,
`PT(PPM)` TEXT,
`RB(PPM)` TEXT,
`RE(PPM)` TEXT,
`RH(PPM)` TEXT,
`RU(PPM)` TEXT,
`S(PPM)` TEXT,
`S(WT%)` TEXT,
`SB(PPM)` TEXT,
`SC(PPM)` TEXT,
`SE(PPM)` TEXT,
`SM(PPM)` TEXT,
`SN(PPM)` TEXT,
`SR(PPM)` TEXT,
`TA(PPM)` TEXT,
`TB(PPM)` TEXT,
`TE(PPM)` TEXT,
`TH(PPM)` TEXT,
`TI(PPM)` TEXT,
`TL(PPM)` TEXT,
`TM(PPM)` TEXT,
`U(PPM)` TEXT,
`V(PPM)` TEXT,
`VOLATILES(WT%)` TEXT,
`W(PPM)` TEXT,
`Y(PPM)` TEXT,
`YB(PPM)` TEXT,
`ZN(PPM)` TEXT,
`ZR(PPM)` TEXT,
`AL2O3(WT%)` TEXT,
`B2O3(WT%)` TEXT,
`CH4(WT%)` TEXT,
`CL2(WT%)` TEXT,
`CO1(WT%)` TEXT,
`CO2(PPM)` TEXT,
`CO2(WT%)` TEXT,
`CR2O3(WT%)` TEXT,
`FE2O3(WT%)` TEXT,
`H2O(WT%)` TEXT,
`H2OM(WT%)` TEXT,
`H2OP(WT%)` TEXT,
`H2OT(WT%)` TEXT,
`HE3(AT/G)` TEXT,
`HE3(CCMSTP/G)` TEXT,
`HE3_HE4(R/R(A))` TEXT,
`HE4(AT/G)` TEXT,
`HE4(CCM/G)` TEXT,
`HE4(CCMSTP/G)` TEXT,
`HE4(MOLE/G)` TEXT,
`HE4(NCC/G)` TEXT,
`HE4_HE3(R/R(A))` TEXT,
`K2O(WT%)` TEXT,
`NA2O(WT%)` TEXT,
`P2O5(WT%)` TEXT,
`SIO2(WT%)` TEXT,
`SO2(WT%)` TEXT,
`SO3(WT%)` TEXT,
`SO4(WT%)` TEXT,
`TIO2(WT%)` TEXT,
FOREIGN KEY (file_id) REFERENCES file(id)
);
CREATE TABLE citation (
sample_id TEXT,
reference_id INTEGER,
FOREIGN KEY (sample_id) REFERENCES sample(id),
FOREIGN KEY (reference_id) REFERENCES reference(id)
);
Thus, information similar to what is reported by georoc ls can be obtained by
running SQL queries.
E.g. to determine the paper which contributed the highest number of samples, we can run
SELECT
r.reference, COUNT(c.sample_id) AS c
FROM
reference as r, citation as c
WHERE
r.id = c.reference_id
GROUP BY r.reference
ORDER BY c DESC
LIMIT 1;
which yields
FONTIJN K., MCNAMARA K., TADESSE A. Z., PYLE D. M., DESSALEGN F., HUTCHISON W., MATHER T. A., YIRGU G.: CONTRASTING STYLES OF POST-CALDERA VOLCANISM ALONG THE MAIN ETHIOPIAN RIFT: IMPLICATIONS FOR CONTEMPORARY VOLCANIC HAZARDS J. VOLCANOL. GEOTHERM. RES. 356 [2018] 90-113 doi: 10.1016/j.jvolgeores.2018.02.001|3978
Accessing GEOROC data programmatically
pygeoroc provides a Python API to access local GEOROC data:
>>> from pygeoroc import GEOROC
>>> api = GEOROC('tmp/')
>>> for sample in api.iter_samples():
>>> for sample, file in api.iter_samples():
... print(sample)
... print(file)
... break
...
Sample(id='138180', name='s_27-261-33,CC,PC. 2 [2118]', citations=['2118'], data={'SR(PPM)': '', ... })
File(name='Ocean_Basin_Flood_Basalts_comp__ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN.csv', date=datetime.date(2020, 3, 9), section='Ocean Basin Flood Basalts')
Converters
For both access modes - SQLite and the python API - pygeoroc provides a
mechanism to specify "converters", i.e. python callables to convert the values
for specific columns in GEOROC CSV data. This mechanism can be used to fix
errata - e.g. missing negative signs in geographic coordinates - or cast data
to more suitable datatypes.
This mechanism is implemented in pygeoroc.errata
and works as follows: If the data repository contains a python module called
converters.py, this module is loaded and two python dicts are looked up
by name:
FIELDS:dictmapping CSV column names to converter functions.COORDINATES:dictmapping CSV filenames (as stored in the repository) todicts specifying converter functions for keyslatitudeandlongitude.
A "converter function" is a python callable with the following signature:
def conv(old, data, fname):
"""
@param old: old value for the respective field in the sample data
@param data: full `dict` of the sample data in one row in the GEOROC CSV
@param fname: name of the GEOROC CSV file containing the row
@return: the new value for "field" in "data"
"""
Some useful converter functions are available as attributes of
pygeoroc.errata.CONVERTERS.
So, to specify that values for the column LAND_OR_SEA must be all uppercase,
and latitudes in the file BISMARCK_ARC_NEW_BRITAIN_ARC.csv must be negative
(look up filenames in your repository's csv/ directory or by running SELECT id FROM file; in the SQLite db),
you would put the following python code in your repository's
converters.py:
from pygeoroc.errata import CONVERTERS
FIELDS = {
'LAND_OR_SEA': CONVERTERS.upper,
}
COORDINATES = {
"BISMARCK_ARC_NEW_BRITAIN_ARC.csv": {
'latitude': CONVERTERS.negative,
},
}
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pygeoroc-2.0.0.tar.gz.
File metadata
- Download URL: pygeoroc-2.0.0.tar.gz
- Upload date:
- Size: 21.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
0e7b2d2a6e32fc8963fc149aa6b3556613a0caa9ae8d0c92c3c4893286773d25
|
|
| MD5 |
c27d0865fb5b0f9de3ce77f9fffc69da
|
|
| BLAKE2b-256 |
a8b96a8fb7743c2de1b9680c92097249824c5bfe593722319b66a3b3ca966314
|
File details
Details for the file pygeoroc-2.0.0-py2.py3-none-any.whl.
File metadata
- Download URL: pygeoroc-2.0.0-py2.py3-none-any.whl
- Upload date:
- Size: 19.1 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.10
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
d0c9a94eb05ffb564274411f536899a26e9f7594551419768148f946f55027a1
|
|
| MD5 |
54fc430dc1d454112587427b1d06e87b
|
|
| BLAKE2b-256 |
bc874d41f758084a317bb6a30b1ab64eeba9f8d24a78a5efcdec15addb935544
|