Skip to main content

programmatic access to GEOROC data

Project description

pygeoroc

Build Status codecov PyPI

Python library to access data in the GEOROC database. Cite GEOROC as

Sarbas, B., U.Nohl: The GEOROC database as part of a growing geoinformatics network. In: Brady, S.R., Sinha, A.K., and Gundersen, L.C. (editors): Geoinformatics 2008—Data to Knowledge, Proceedings: U.S. Geological Survey Scientific Investigations Report 2008-5172 (2008), pp. 42/43.

and pygeoroc as

Robert Forkel. (2021). pofatu/pygeoroc: Programmatic access to GEOROC data. Zenodo. http://doi.org/10.5281/zenodo.3744586

Install

Install pygeoroc from the Python Package Index running

pip install pygeoroc

preferably in a separate virtual environment, to keep your system's Python installation unaffected.

Installing pygeoroc will also install a command line program georoc, which provides functionality to curate a local copy of the GEOROC data. Run

georoc -h

for details on usage.

Overview

GEOROC provides its downloadable content in precompiled files organized in sections which correspond mostly to locations.

pygeoroc provides functionality to

  • download this data to a local directory, called repository,
  • access the data in the repository programmatically from Python code,
  • load the data from the repository into a SQLite database for scalable and performant analysis.

Downloading GEOROC data

Downloading GEOROC data will create or update a local repository. Since the volume of data provided per section differs greatly, downloading can be limited to individual sections. E.g. running

georoc --repos tmp/ download --section "Ocean Basin Flood Basalts"

will result in the following repository

tree tmp/
tmp/
├── csv
│   ├── Ocean_Basin_Flood_Basalts_comp__ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN.csv
│   ├── Ocean_Basin_Flood_Basalts_comp__EAST_MARIANA_BASIN.csv
│   ├── Ocean_Basin_Flood_Basalts_comp__NAURU_BASIN.csv
│   ├── Ocean_Basin_Flood_Basalts_comp__PIGAFETTA_BASIN.csv
│   └── Ocean_Basin_Flood_Basalts_comp__WHARTON_BASIN.csv
└── index.csv

1 directory, 6 files

Note that the repository also contains a "table of contents" in index.csv. The timestamp recorded per file in this table is checked against the "Last Actualization" column on the web page to determine whether a file needs to be refreshed when georoc download is run again.

The local repository can be inspected running georoc ls, e.g.

$ georoc --repos tmp/ ls --samples --references --format pipe

will output the table included below

file section size last modified # samples # references path
ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN Ocean Basin Flood Basalts 34.5KB 2020-03-09 38 6 tmp/csv/Ocean_Basin_Flood_Basalts_comp__ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN.csv
EAST_MARIANA_BASIN Ocean Basin Flood Basalts 86.7KB 2020-03-09 93 11 tmp/csv/Ocean_Basin_Flood_Basalts_comp__EAST_MARIANA_BASIN.csv
NAURU_BASIN Ocean Basin Flood Basalts 273.9KB 2020-03-09 356 18 tmp/csv/Ocean_Basin_Flood_Basalts_comp__NAURU_BASIN.csv
PIGAFETTA_BASIN Ocean Basin Flood Basalts 349.0KB 2020-03-09 428 18 tmp/csv/Ocean_Basin_Flood_Basalts_comp__PIGAFETTA_BASIN.csv
WHARTON_BASIN Ocean Basin Flood Basalts 75.7KB 2020-03-09 94 8 tmp/csv/Ocean_Basin_Flood_Basalts_comp__WHARTON_BASIN.csv

Loading GEOROC data into SQLite

Since GEOROC contains data about hundreds of thousands of samples, querying it is made a lot easier by loading it into a SQLite database:

$ georoc --repos tmp/ createdb
100%|██████████████████████████████████████████████████| 5/5 [00:00<00:00, 25.32it/s]
INFO    tmp/georoc.sqlite

The resulting database has 4 tables:

  • file: Info about a CSV file, basically the data from index.csv.
  • sample: Info about individual samples.
  • reference: Info about sources of the data
  • citation: The association table relating samples with references.

The schema can be inspected running:

$ sqlite3 tmp/georoc.sqlite
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .schema

and looks as follows:

CREATE TABLE file (id TEXT PRIMARY KEY, date TEXT, section TEXT);
CREATE TABLE reference (id INTEGER PRIMARY KEY, reference TEXT);
CREATE TABLE sample (
    id TEXT PRIMARY KEY,
    file_id TEXT,
    `AGE` TEXT,
`ALTERATION` TEXT,
`DRILL_DEPTHAX` TEXT,
`DRILL_DEPTH_MIN` TEXT,
`ELEVATION_MAX` TEXT,
`ELEVATION_MIN` TEXT,
`EPSILON_ND` TEXT,
`ERUPTION_DAY` TEXT,
`ERUPTION_MONTH` TEXT,
`ERUPTION_YEAR` TEXT,
`GEOL.` TEXT,
`LAND_OR_SEA` TEXT,
`LATITUDE_MAX` TEXT,
`LATITUDE_MIN` TEXT,
`LOCATION` TEXT,
`LOCATION_COMMENT` TEXT,
`LONGITUDE_MAX` TEXT,
`LONGITUDE_MIN` TEXT,
`MATERIAL` TEXT,
`MINERAL` TEXT,
`ROCK_NAME` TEXT,
`ROCK_TEXTURE` TEXT,
`ROCK_TYPE` TEXT,
`TECTONIC_SETTING` TEXT,
`AR40_K40` TEXT,
`HE3_HE4` TEXT,
`HE4_HE3` TEXT,
`HF176_HF177` TEXT,
`K40_AR40` TEXT,
`ND143_ND144` TEXT,
`ND143_ND144_INI` TEXT,
`OS184_OS188` TEXT,
`OS186_OS188` TEXT,
`OS187_OS186` TEXT,
`OS187_OS188` TEXT,
`PB206_PB204` TEXT,
`PB206_PB204_INI` TEXT,
`PB207_PB204` TEXT,
`PB207_PB204_INI` TEXT,
`PB208_PB204` TEXT,
`PB208_PB204_INI` TEXT,
`RE187_OS186` TEXT,
`RE187_OS188` TEXT,
`SR87_SR86` TEXT,
`SR87_SR86_INI` TEXT,
`AG(PPM)` TEXT,
`AL(PPM)` TEXT,
`AS(PPM)` TEXT,
`AU(PPM)` TEXT,
`B(PPM)` TEXT,
`BA(PPM)` TEXT,
`BE(PPM)` TEXT,
`BI(PPM)` TEXT,
`BR(PPM)` TEXT,
`C(PPM)` TEXT,
`CA(PPM)` TEXT,
`CAO(WT%)` TEXT,
`CD(PPM)` TEXT,
`CE(PPM)` TEXT,
`CL(PPM)` TEXT,
`CL(WT%)` TEXT,
`CO(PPM)` TEXT,
`CR(PPM)` TEXT,
`CS(PPM)` TEXT,
`CU(PPM)` TEXT,
`DY(PPM)` TEXT,
`ER(PPM)` TEXT,
`EU(PPM)` TEXT,
`F(PPM)` TEXT,
`F(WT%)` TEXT,
`FE(PPM)` TEXT,
`FEO(WT%)` TEXT,
`FEOT(WT%)` TEXT,
`GA(PPM)` TEXT,
`GD(PPM)` TEXT,
`GE(PPM)` TEXT,
`HE(CCM/G)` TEXT,
`HE(CCMSTP/G)` TEXT,
`HE(NCC/G)` TEXT,
`HF(PPM)` TEXT,
`HG(PPM)` TEXT,
`HO(PPM)` TEXT,
`I(PPM)` TEXT,
`IN(PPM)` TEXT,
`IR(PPM)` TEXT,
`K(PPM)` TEXT,
`LA(PPM)` TEXT,
`LI(PPM)` TEXT,
`LOI(WT%)` TEXT,
`LU(PPM)` TEXT,
`MAX._AGE_(YRS.)` TEXT,
`MG(PPM)` TEXT,
`MGO(WT%)` TEXT,
`MIN._AGE_(YRS.)` TEXT,
`MN(PPM)` TEXT,
`MNO(WT%)` TEXT,
`MO(PPM)` TEXT,
`NA(PPM)` TEXT,
`NB(PPM)` TEXT,
`ND(PPM)` TEXT,
`NI(PPM)` TEXT,
`NIO(WT%)` TEXT,
`O(WT%)` TEXT,
`OH(WT%)` TEXT,
`OS(PPM)` TEXT,
`OTHERS(WT%)` TEXT,
`P(PPM)` TEXT,
`PB(PPM)` TEXT,
`PD(PPM)` TEXT,
`PR(PPM)` TEXT,
`PT(PPM)` TEXT,
`RB(PPM)` TEXT,
`RE(PPM)` TEXT,
`RH(PPM)` TEXT,
`RU(PPM)` TEXT,
`S(PPM)` TEXT,
`S(WT%)` TEXT,
`SB(PPM)` TEXT,
`SC(PPM)` TEXT,
`SE(PPM)` TEXT,
`SM(PPM)` TEXT,
`SN(PPM)` TEXT,
`SR(PPM)` TEXT,
`TA(PPM)` TEXT,
`TB(PPM)` TEXT,
`TE(PPM)` TEXT,
`TH(PPM)` TEXT,
`TI(PPM)` TEXT,
`TL(PPM)` TEXT,
`TM(PPM)` TEXT,
`U(PPM)` TEXT,
`V(PPM)` TEXT,
`VOLATILES(WT%)` TEXT,
`W(PPM)` TEXT,
`Y(PPM)` TEXT,
`YB(PPM)` TEXT,
`ZN(PPM)` TEXT,
`ZR(PPM)` TEXT,
`AL2O3(WT%)` TEXT,
`B2O3(WT%)` TEXT,
`CH4(WT%)` TEXT,
`CL2(WT%)` TEXT,
`CO1(WT%)` TEXT,
`CO2(PPM)` TEXT,
`CO2(WT%)` TEXT,
`CR2O3(WT%)` TEXT,
`FE2O3(WT%)` TEXT,
`H2O(WT%)` TEXT,
`H2OM(WT%)` TEXT,
`H2OP(WT%)` TEXT,
`H2OT(WT%)` TEXT,
`HE3(AT/G)` TEXT,
`HE3(CCMSTP/G)` TEXT,
`HE3_HE4(R/R(A))` TEXT,
`HE4(AT/G)` TEXT,
`HE4(CCM/G)` TEXT,
`HE4(CCMSTP/G)` TEXT,
`HE4(MOLE/G)` TEXT,
`HE4(NCC/G)` TEXT,
`HE4_HE3(R/R(A))` TEXT,
`K2O(WT%)` TEXT,
`NA2O(WT%)` TEXT,
`P2O5(WT%)` TEXT,
`SIO2(WT%)` TEXT,
`SO2(WT%)` TEXT,
`SO3(WT%)` TEXT,
`SO4(WT%)` TEXT,
`TIO2(WT%)` TEXT,
    FOREIGN KEY (file_id) REFERENCES file(id)
);
CREATE TABLE citation (
    sample_id TEXT,
    reference_id INTEGER,
    FOREIGN KEY (sample_id) REFERENCES sample(id),
    FOREIGN KEY (reference_id) REFERENCES reference(id)
);

Thus, information similar to what is reported by georoc ls can be obtained by running SQL queries.

E.g. to determine the paper which contributed the highest number of samples in the "Ocean Basin Flood Basalts" section, we can run

SELECT
    r.reference, COUNT(c.sample_id) AS c
FROM
    reference as r, citation as c
WHERE
    r.id = c.reference_id
GROUP BY r.reference
ORDER BY c DESC
LIMIT 1;

which yields

KELLEY KATHERINE A., PLANK T. A., LUDDEN J. N., STAUDIGEL H.:    COMPOSITION OF ALTERED OCEANIC CRUST AT ODP SITES 801 AND 1149  GEOCHEMISTRY GEOPHYSICS GEOSYSTEMS 4   [2003]    GeoReM-id: 305   doi: 10.1029/2002GC000435|123

Accessing GEOROC data programmatically

pygeoroc provides a Python API to access local GEOROC data:

>>> from pygeoroc import GEOROC
>>> api = GEOROC('tmp/')
>>> for sample in api.iter_samples():
>>> for sample, file in api.iter_samples():
...     print(sample)
...     print(file)
...     break
... 
Sample(id='138180', name='s_27-261-33,CC,PC. 2 [2118]', citations=['2118'], data={'SR(PPM)': '', ... })
File(name='Ocean_Basin_Flood_Basalts_comp__ARGO_ABYSSAL_PLAIN;INDIAN_OCEAN.csv', date=datetime.date(2020, 3, 9), section='Ocean Basin Flood Basalts')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pygeoroc-1.0.0.tar.gz (20.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

pygeoroc-1.0.0-py2.py3-none-any.whl (17.9 kB view details)

Uploaded Python 2Python 3

File details

Details for the file pygeoroc-1.0.0.tar.gz.

File metadata

  • Download URL: pygeoroc-1.0.0.tar.gz
  • Upload date:
  • Size: 20.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.5

File hashes

Hashes for pygeoroc-1.0.0.tar.gz
Algorithm Hash digest
SHA256 e1e3af7857d6be5b8b600aa24126aee9f7166afb938ac4f7ac33efe2edfef655
MD5 6e4abf3cd96b9040c1f386372a334679
BLAKE2b-256 294e610f73634baf30612d6388ed405a087e32b68f97d0a1bc86145afc18ed8b

See more details on using hashes here.

File details

Details for the file pygeoroc-1.0.0-py2.py3-none-any.whl.

File metadata

  • Download URL: pygeoroc-1.0.0-py2.py3-none-any.whl
  • Upload date:
  • Size: 17.9 kB
  • Tags: Python 2, Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/44.0.0 requests-toolbelt/0.9.1 tqdm/4.58.0 CPython/3.8.5

File hashes

Hashes for pygeoroc-1.0.0-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 0a51079f97b212852d5d741fe29457a5971b1c861ad91404586bfac08582f63d
MD5 5042c66ef9b986c2b525c58c1609aa2e
BLAKE2b-256 367d2a1aa844e01fe21521f03f757d33b6d6a326a06aacb354958169c91ed057

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page