thredds_crawler

A Python library for crawling THREDDS servers

These details have been verified by PyPI

Maintainers

Bobfrat dfoster kwilcox lcampbell

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
- Science/Research
License
- OSI Approved :: GNU General Public License v3 (GPLv3)
Operating System
- POSIX :: Linux
Programming Language
- Python
Topic
- Scientific/Engineering

Project description

thredds_crawler
===============

A simple crawler/parser for THREDDS catalogs

Usage
------

### Select

You can select datasets based on their THREDDS ID using the `select` parameter. Python regular expressions are supported.

```python
>>> from thredds_crawler.crawl import Crawl
>>> c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", select=[".*-Agg"])
>>> print(c.datasets)
[
<LeafDataset id: MODIS-Agg, name: MODIS-Complete Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-2009-Agg, name: MODIS-2009 Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-2010-Agg, name: MODIS-2010 Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-2011-Agg, name: MODIS-2011 Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-2012-Agg, name: MODIS-2012 Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-2013-Agg, name: MODIS-2013 Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-One-Agg, name: 1-Day-Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-Three-Agg, name: 3-Day-Aggregation, services: ['OPENDAP', 'ISO']>,
<LeafDataset id: MODIS-Seven-Agg, name: 7-Day-Aggregation, services: ['OPENDAP', 'ISO']>
]
```
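
To see what a `select` pattern would pick up before crawling a live server, you can try it against a few IDs offline. The sketch below assumes the crawler matches each pattern from the start of the dataset ID, as `re.match` does. That matching rule and the `matches_any` helper are assumptions for illustration, not part of the library, and the last ID in the list is made up.

```python
import re

def matches_any(patterns, text):
    """Return True if any regex in `patterns` matches `text` from its start."""
    return any(re.match(p, text) for p in patterns)

# Three IDs taken from the output above, plus one made-up ID that should not match.
ids = ["MODIS-Agg", "MODIS-2009-Agg", "MODIS-One-Agg", "MODIS-2009-0101"]

selected = [i for i in ids if matches_any([".*-Agg"], i)]
print(selected)
```

Only the IDs containing `-Agg` survive the filter, which is a quick way to sanity-check a pattern before pointing it at a real catalog.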

### Skip

You can skip datasets based on their `name` and catalogRefs based on their `xlink:title`. By default, the crawler
uses four regular expressions to skip lists of thousands upon thousands of individual files that are part of aggregations or FMRCs:

* .\*files/
* .\*Individual Files.\*
* .\*File_Access.\*
* .\*Forecast Model Run.\*

By setting the `skip` parameter to anything other than a superset of the default you run the risk of having some angry system admins after you.

```python
# Skipping everything!
from thredds_crawler.crawl import Crawl
c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", skip=[".*"])
assert len(c.datasets) == 0
```
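
A gentler approach is to extend the defaults rather than replace them. The sketch below copies the four default expressions listed above into a local list (`DEFAULT_SKIPS` is a name chosen here, not a library attribute), appends one hypothetical extra pattern, and checks offline which titles would be skipped. It assumes `re.match`-style matching, and the sample titles are made up.

```python
import re

# Local copy of the four defaults listed above (not imported from the library).
DEFAULT_SKIPS = [
    r".*files/",
    r".*Individual Files.*",
    r".*File_Access.*",
    r".*Forecast Model Run.*",
]

# Superset of the defaults; ".*MATLAB.*" is a hypothetical extra pattern.
skips = DEFAULT_SKIPS + [r".*MATLAB.*"]

def is_skipped(title, patterns=skips):
    """Return True if any skip pattern matches the title from its start."""
    return any(re.match(p, title) for p in patterns)

# Made-up titles for illustration.
for title in ["Individual Files", "MATLAB exports", "MODIS-Complete Aggregation"]:
    print(title, "->", "skip" if is_skipped(title) else "crawl")

# With the library installed, the list would be passed as:
# c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", skip=skips)
```

Because `skips` starts from the defaults, the per-file listings stay excluded while the extra pattern narrows the crawl further.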

## Dataset

You can get some basic information about a LeafDataset, including the services available.

```python
>>> from thredds_crawler.crawl import Crawl
>>> c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", select=[".*-Agg"])
>>> dataset = c.datasets[0]
>>> print(dataset.id)
MODIS-Agg
>>> print(dataset.name)
MODIS-Complete Aggregation
>>> print(dataset.services)
[
{
'url': 'http://tds.maracoos.org/thredds/dodsC/MODIS-Agg.nc',
'name': 'odap',
'service': 'OPENDAP'
},
{
'url': 'http://tds.maracoos.org/thredds/iso/MODIS-Agg.nc',
'name': 'iso',
'service': 'ISO'
}
]
```

If you have a list of datasets, you can easily return all endpoints of a certain type:

```python
>>> from thredds_crawler.crawl import Crawl
>>> c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", select=[".*-Agg"])
>>> urls = [s.get("url") for d in c.datasets for s in d.services if s.get("service").lower() == "opendap"]
>>> print(urls)
[
'http://tds.maracoos.org/thredds/dodsC/MODIS-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-2009-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-2010-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-2011-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-2012-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-2013-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-One-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-Three-Agg.nc',
'http://tds.maracoos.org/thredds/dodsC/MODIS-Seven-Agg.nc'
]
```
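
If you need more than one service type, it may be simpler to group every URL by service in a single pass. The sketch below runs without a network connection: `FakeDataset` is a stand-in for `LeafDataset` that carries only the attributes used here, and `urls_by_service` is a hypothetical helper, not a library function. The service dictionaries follow the shape shown above.

```python
from collections import defaultdict, namedtuple

# Stand-in for LeafDataset, carrying only the attributes used here.
FakeDataset = namedtuple("FakeDataset", ["id", "services"])

def urls_by_service(datasets):
    """Map each lower-cased service type to the list of URLs offering it."""
    grouped = defaultdict(list)
    for d in datasets:
        for s in d.services:
            grouped[s["service"].lower()].append(s["url"])
    return dict(grouped)

datasets = [
    FakeDataset("MODIS-Agg", [
        {"url": "http://tds.maracoos.org/thredds/dodsC/MODIS-Agg.nc",
         "name": "odap", "service": "OPENDAP"},
        {"url": "http://tds.maracoos.org/thredds/iso/MODIS-Agg.nc",
         "name": "iso", "service": "ISO"},
    ]),
]

grouped = urls_by_service(datasets)
print(grouped["opendap"])
```

With a real crawl, passing `c.datasets` in place of the stand-in list should work the same way, provided the objects expose `services` as shown above.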

## Metadata

The entire THREDDS catalog metadata record is saved along with the dataset object. It is an etree `Element` object, ready for you to pull information out of. See the [THREDDS metadata spec](http://www.unidata.ucar.edu/projects/THREDDS/tech/catalog/v1.0.2/InvCatalogSpec.html#metadata).

```python
>>> from thredds_crawler.crawl import Crawl
>>> c = Crawl("http://tds.maracoos.org/thredds/MODIS.xml", select=[".*-Agg"])
>>> dataset = c.datasets[0]
>>> print(dataset.metadata.find("{http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0}documentation").text)
Ocean Color data are provided as a service to the broader community, and can be
influenced by sensor degradation and or algorithm changes. We make efforts to keep
this dataset updated and calibrated. The products in these files are experimental.
Aggregations are simple means of available data over the specified time frame. Use at
your own discretion.
```
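
Spelling out the full namespace in every lookup gets tedious. The sketch below shows the prefix-map form of the same query using the standard library's `xml.etree.ElementTree`. The XML fragment is hand-written to stand in for `dataset.metadata`, and its `type` attribute values and text are illustrative, so treat the result as a shape to adapt rather than what a given server returns.

```python
import xml.etree.ElementTree as ET

# Prefix map so queries can use "t:" instead of the full namespace URI.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Hand-written fragment standing in for dataset.metadata (illustrative only).
fragment = """
<metadata xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0">
  <documentation type="summary">Example summary text.</documentation>
  <documentation type="rights">Example rights text.</documentation>
</metadata>
""".strip()

metadata = ET.fromstring(fragment)

# Collect every documentation element, keyed by its "type" attribute.
docs = {el.get("type"): el.text for el in metadata.findall("t:documentation", NS)}
print(docs["summary"])
```

`findall` returns every matching child, which helps when a record carries several `documentation` elements and `find` would return only the first.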

## Known Issues

* Will not handle catalogs that reference themselves
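
For context, the usual guard against self-referencing catalogs is to remember which URLs have already been visited. The sketch below illustrates that idea on an in-memory graph. It is not how the library is implemented, and the `crawl` function and catalog names are hypothetical.

```python
def crawl(start, links):
    """Visit catalogs reachable from `start`, skipping any already seen."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        url = stack.pop()
        if url in seen:
            continue  # already visited: a self-reference stops here
        seen.add(url)
        order.append(url)
        stack.extend(links.get(url, []))
    return order

# Hypothetical catalog graph: "a.xml" references itself and "b.xml" points back.
links = {"a.xml": ["a.xml", "b.xml"], "b.xml": ["a.xml"]}
print(crawl("a.xml", links))
```

Each catalog is visited once, so the loop terminates even though the graph contains cycles.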

Release history

* 1.5.4 (Jun 6, 2018)
* 1.5.3 (Dec 6, 2016)
* 1.5.2 (Oct 11, 2016)
* 1.5.1 (Sep 28, 2016)
* 1.5.0 (Jul 21, 2016)
* 1.4.0 (May 25, 2016)
* 1.3.0 (May 25, 2016)
* 1.2.0 (Feb 9, 2016)
* 1.1.0 (Oct 26, 2015)
* 1.0.0 (Mar 20, 2015)
* 0.9 (Jan 7, 2015)
* 0.8 (Nov 3, 2014)
* 0.7 (Oct 28, 2014)
* 0.6 (May 16, 2014)
* 0.5 (Aug 8, 2013)
* 0.4 (Aug 2, 2013), this version
* 0.3 (Jul 29, 2013)
* 0.2 (Jul 25, 2013)
* 0.1 (Jul 25, 2013)

Download files

Source Distribution

thredds_crawler-0.4.tar.gz (5.1 kB)

Uploaded Aug 2, 2013 Source

Hashes for thredds_crawler-0.4.tar.gz

| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `396d0bb73a8682f3ddc107b11fa6d9235b66a353185fc35b44c317370ad84818` |
| MD5 | `5f898895f6229cabdd0965761953ba2f` |
| BLAKE2b-256 | `75c78ff5ad88ba5e42f9c3c87978b28d2eefeb761a60196cc24b745e91ac78f1` |