
udata-hydra 🦀

udata-hydra is an async metadata crawler and datalake service for data.gouv.fr.

URLs are crawled via aiohttp; the catalog and crawled metadata are stored in a PostgreSQL database.

CLI

Create database structure

Install udata-hydra dependencies and CLI:

make deps

udata-hydra init-db

Load (UPSERT) latest catalog version from data.gouv.fr

udata-hydra load-catalog
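
Under the hood, loading is an UPSERT into the PostgreSQL catalog table: existing rows are updated in place rather than duplicated. A minimal sketch of the idea with asyncpg, assuming a hypothetical catalog table keyed by resource_id (table and column names are illustrative, not the actual schema):

# Sketch only: table and column names are hypothetical, not the real schema.
import asyncio
import asyncpg

async def upsert_resource(dsn: str, resource_id: str, dataset_id: str, url: str) -> None:
    conn = await asyncpg.connect(dsn)
    try:
        # ON CONFLICT turns the INSERT into an UPSERT: an existing row
        # for this resource_id is updated instead of raising an error.
        await conn.execute(
            """
            INSERT INTO catalog (resource_id, dataset_id, url, deleted)
            VALUES ($1, $2, $3, FALSE)
            ON CONFLICT (resource_id)
            DO UPDATE SET dataset_id = $2, url = $3, deleted = FALSE
            """,
            resource_id, dataset_id, url,
        )
    finally:
        await conn.close()

asyncio.run(upsert_resource(
    "postgres://localhost:5432/hydra",
    "b3678c59-5b35-43ad-9379-fce29e5b56fe",
    "5c34944606e3e73d4a551889",
    "http://example.com/data.csv",
))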

Crawler

udata-hydra-crawl

It will crawl the catalog (forever) according to the config set in config.py.

BATCH_SIZE URLs are queued at each loop run.

The crawler will start with URLs that have never been checked, then move on to URLs whose last check is older than the SINCE interval. It will then wait until something changes (catalog or time).

There's a per-domain backoff mechanism: the crawler will wait when, for a given domain in a given batch, BACKOFF_NB_REQ requests have been exceeded in a period of BACKOFF_PERIOD seconds. It will sleep and retry until the backoff is lifted.

If a URL matches one of the EXCLUDED_PATTERNS, it will never be checked.
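
A minimal sketch of what the backoff and exclusion logic amounts to, with illustrative data structures and config values (this mirrors the behaviour described above, not the actual implementation):

# Sketch only: mirrors the described behaviour, not the real code.
import re
import time
from collections import defaultdict

BACKOFF_NB_REQ = 180      # illustrative values; the real ones live in config.py
BACKOFF_PERIOD = 180
EXCLUDED_PATTERNS = [r"\.pdf$"]

request_times = defaultdict(list)  # domain -> timestamps of recent requests

def is_excluded(url: str) -> bool:
    # URLs matching any EXCLUDED_PATTERNS are never checked.
    return any(re.search(pattern, url) for pattern in EXCLUDED_PATTERNS)

def in_backoff(domain: str) -> bool:
    # True when the domain received BACKOFF_NB_REQ or more requests
    # within the last BACKOFF_PERIOD seconds.
    now = time.monotonic()
    recent = [t for t in request_times[domain] if now - t < BACKOFF_PERIOD]
    request_times[domain] = recent
    return len(recent) >= BACKOFF_NB_REQ

def record_request(domain: str) -> None:
    request_times[domain].append(time.monotonic())

When a domain is in backoff, the crawler sleeps and retries the URL later rather than dropping it.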

A curses interface is available via:

HYDRA_CURSES_ENABLED=True udata-hydra-crawl

API

Run

pip install -r requirements.txt
adev runserver udata-hydra/app.py

Get latest check

Works with either ?url={url} or ?resource_id={resource_id}.

$ curl -s "http://localhost:8000/api/checks/latest/?url=http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv" | json_pp
{
   "status" : 200,
   "catalog_id" : 64148,
   "deleted" : false,
   "error" : null,
   "created_at" : "2021-02-06T12:19:08.203055",
   "response_time" : 0.830198049545288,
   "url" : "http://opendata-sig.saintdenis.re/datasets/661e19974bcc48849bbff7c9637c5c28_1.csv",
   "domain" : "opendata-sig.saintdenis.re",
   "timeout" : false,
   "id" : 114750,
   "dataset_id" : "5c34944606e3e73d4a551889",
   "resource_id" : "b3678c59-5b35-43ad-9379-fce29e5b56fe",
   "headers" : {
      "content-disposition" : "attachment; filename=\"xn--Dlimitation_des_cantons-bcc.csv\"",
      "server" : "openresty",
      "x-amz-meta-cachetime" : "191",
      "last-modified" : "Wed, 29 Apr 2020 02:19:04 GMT",
      "content-encoding" : "gzip",
      "content-type" : "text/csv",
      "cache-control" : "must-revalidate",
      "etag" : "\"20415964703d9ccc4815d7126aa3a6d8\"",
      "content-length" : "207",
      "date" : "Sat, 06 Feb 2021 12:19:08 GMT",
      "x-amz-meta-contentlastmodified" : "2018-11-19T09:38:28.490Z",
      "connection" : "keep-alive",
      "vary" : "Accept-Encoding"
   }
}
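
The same endpoint can also be queried from Python; a minimal aiohttp client sketch (the resource_id below is the one from the response above):

# Sketch only: a plain HTTP client for the endpoint shown above.
import asyncio
import aiohttp

async def latest_check(resource_id: str) -> dict:
    async with aiohttp.ClientSession() as session:
        async with session.get(
            "http://localhost:8000/api/checks/latest/",
            params={"resource_id": resource_id},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()

check = asyncio.run(latest_check("b3678c59-5b35-43ad-9379-fce29e5b56fe"))
print(check["status"], check["headers"].get("last-modified"))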

Get all checks for a URL or resource

Works with either ?url={url} or ?resource_id={resource_id}.

$ curl -s "http://localhost:8000/api/checks/all/?url=http://www.drees.sante.gouv.fr/IMG/xls/er864.xls" | json_pp
[
   {
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 165107,
      "created_at" : "2021-02-06T14:32:47.675854",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null
   },
   {
      "timeout" : false,
      "deleted" : false,
      "response_time" : null,
      "error" : "Cannot connect to host www.drees.sante.gouv.fr:443 ssl:True [SSLCertVerificationError: (1, \"[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'www.drees.sante.gouv.fr'. (_ssl.c:1122)\")]",
      "domain" : "www.drees.sante.gouv.fr",
      "dataset_id" : "53d6eadba3a72954d9dd62f5",
      "created_at" : "2020-12-24T17:06:58.158125",
      "resource_id" : "93dfd449-9d26-4bb0-a6a9-ee49b1b8a4d7",
      "status" : null,
      "catalog_id" : 232112,
      "url" : "http://www.drees.sante.gouv.fr/IMG/xls/er864.xls",
      "headers" : {},
      "id" : 65092
   }
]

Get modification date on resources

This tries to find a modification date for a given resource, in order of priority (a sketch follows the examples below):

  1. last-modified header if any
  2. content-length comparison over multiple checks if any (precision depends on crawling frequency)

Works with either ?url={url} or ?resource_id={resource_id}.

$ curl -s "http://localhost:8000/api/changed/?resource_id=f2d3e1ad-4d7d-46fc-91f8-c26f02c1e487" | json_pp
{
   "changed_at" : "2014-09-15T14:51:52",
   "detection" : "last-modified"
}
$ curl -s "http://localhost:8000/api/changed/?resource_id=f2d3e1ad-4d7d-46fc-91f8-c26f02c1e487" | json_pp
{
   "changed_at" : "2020-09-15T14:51:52",
   "detection" : "content-length"
}
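
The priority order boils down to something like the following sketch, assuming check records shaped like the /api/checks/all/ output (this illustrates the detection logic, not the actual implementation):

# Sketch only: illustrates the detection priority, not the real code.
from typing import Optional

def detect_change(checks: list) -> Optional[dict]:
    # checks: newest first, each shaped like a /api/checks/all/ entry.
    if not checks:
        return None
    latest = checks[0]
    # 1. Trust the last-modified header when the server sends one.
    if "last-modified" in latest["headers"]:
        return {"changed_at": latest["headers"]["last-modified"],
                "detection": "last-modified"}
    # 2. Otherwise compare content-length across checks: the first pair of
    #    consecutive checks whose lengths differ brackets the change, so
    #    precision depends on how often the resource is crawled.
    for newer, older in zip(checks, checks[1:]):
        if newer["headers"].get("content-length") != older["headers"].get("content-length"):
            return {"changed_at": newer["created_at"], "detection": "content-length"}
    return None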

Get crawling status

$ curl -s "http://localhost:8000/api/status/" | json_pp
{
   "fresh_checks_percentage" : 0.4,
   "pending_checks" : 142153,
   "total" : 142687,
   "fresh_checks" : 534,
   "checks_percentage" : 0.4
}

Get crawling stats

$ curl -s "http://localhost:8000/api/stats/" | json_pp
{
   "status" : [
      {
         "count" : 525,
         "percentage" : 98.3,
         "label" : "ok"
      },
      {
         "label" : "error",
         "percentage" : 1.3,
         "count" : 7
      },
      {
         "label" : "timeout",
         "percentage" : 0.4,
         "count" : 2
      }
   ],
   "status_codes" : [
      {
         "code" : 200,
         "count" : 413,
         "percentage" : 78.7
      },
      {
         "code" : 501,
         "percentage" : 12.4,
         "count" : 65
      },
      {
         "percentage" : 6.1,
         "count" : 32,
         "code" : 404
      },
      {
         "code" : 500,
         "percentage" : 2.7,
         "count" : 14
      },
      {
         "code" : 502,
         "count" : 1,
         "percentage" : 0.2
      }
   ]
}

Using Kafka integration

Set the environment variables: rename .env.sample to .env and fill it with the right values.

REDIS_URL = redis://localhost:6380/0
REDIS_HOST = localhost
REDIS_PORT = 6380
KAFKA_HOST = localhost
KAFKA_PORT = 9092
KAFKA_API_VERSION = 2.5.0
MINIO_URL = https://object.local.dev/
MINIO_USER = sample_user
MINIO_BUCKET = benchmark-de
MINIO_PWD = sample_pwd
MINIO_FOLDER = data
MAX_FILESIZE_ALLOWED = 1e9
UDATA_INSTANCE_NAME = udata

The kafka_integration module retrieves messages sent by udata on the topics resource.created, resource.modified and resource.deleted. The Kafka instance URI, Hydra API URL and Data Gouv API URL can be defined in udata-hydra/config or overridden with environment variables. The integration is launched via the CLI: udata-hydra run_kafka_integration. It marks the corresponding resources as highest priority for the next crawling batch.
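
A minimal sketch of such a consumer using aiokafka (an illustration of the message flow only; the payload field and the reprioritisation step are hypothetical):

# Sketch only: illustrates consuming the udata topics, not the real module.
import asyncio
import json
from aiokafka import AIOKafkaConsumer

TOPICS = ("resource.created", "resource.modified", "resource.deleted")

async def run() -> None:
    consumer = AIOKafkaConsumer(
        *TOPICS,
        bootstrap_servers="localhost:9092",  # KAFKA_HOST:KAFKA_PORT
        api_version="2.5.0",                 # KAFKA_API_VERSION
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    await consumer.start()
    try:
        async for message in consumer:
            resource_id = message.value.get("resource_id")  # hypothetical payload field
            # In the real module this would mark the resource as highest
            # priority for the next crawling batch (via the PostgreSQL catalog).
            print(message.topic, resource_id)
    finally:
        await consumer.stop()

asyncio.run(run())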

TODO

  • non-curses interface :sad:
  • tests
  • expose summary/status as API
  • change detection API on url / resource
  • handle GET request when HEAD returns 501
  • handle GET requests for some domains
  • denormalize interesting headers (length, mimetype, last-modified...)
  • some sort of dashboard (dash?), or just plug postgrest and handle that elsewhere
  • custom config file for pandas_profiling
  • move API endpoints to /api endpoints
