Data collection manager

aswan

Collect and organize data into a T1 data depot. The package is named after the Aswan Dam.

Collect and compress data from the internet for later parsing

  • quick, parallel, customizable to collect
  • compressed to store
  • quick to sync with a remote store
    • sync to continue collecting
    • sync to parse
  • immutable collection

To Set Up a Remote

Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS as required by the zimmauth package, and set ASWAN_REMOTE to the name of the default remote.
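
A minimal sketch of checking that the three variables are present before attempting a sync. The helper `missing_remote_vars` is hypothetical and not part of aswan; the format of the values is defined by zimmauth and is not shown here.

```python
import os

# Variable names come from the text above; their value format is defined by zimmauth.
REQUIRED_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration, not an aswan function.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Report what is still missing in the current process environment.
missing = missing_remote_vars()
print("missing:", ", ".join(missing) if missing else "nothing")
```

Passing a plain dict instead of `os.environ` makes the check easy to try without touching the real environment.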

Concepts

  • objects
    • saved by collection events
  • events
    • collection
    • registration (v2: registration for parsing)
    • (v2) parsing
  • runs
    • manual run vs automated run
      • makes adding urls manually easy but revertible
    • has unique id
    • generates events
    • linked to a specific version of the code
      • ideally commit hash + pip freeze
  • statuses
    • determined by base status + runs integrated
    • contains
      • what urls need to be collected
      • (v2) what collected objects need to be parsed
    • sqlite file, constantly trimmed
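
The relation between runs, events and statuses can be sketched as a replay of events on top of a base status. This is an illustration only: the classes `Event` and `Run`, the function `pending_urls`, and the rule that a registration adds a url while a collection removes it are assumptions, not aswan's actual data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One record generated by a run (hypothetical shape)."""

    kind: str  # "registration" or "collection"
    url: str


@dataclass
class Run:
    """A run has a unique id and generates events."""

    run_id: str
    events: list = field(default_factory=list)


def pending_urls(base_pending, integrated_runs):
    """Derive the urls still to be collected from a base status plus integrated runs."""
    pending = set(base_pending)
    for run in integrated_runs:
        for event in run.events:
            if event.kind == "registration":
                pending.add(event.url)  # assumed: registering queues the url
            elif event.kind == "collection":
                pending.discard(event.url)  # assumed: collecting clears it
    return pending


manual = Run("manual-1", [
    Event("registration", "https://example.com/a"),
    Event("registration", "https://example.com/b"),
])
automated = Run("auto-1", [Event("collection", "https://example.com/a")])

with_manual = pending_urls(set(), [manual, automated])
without_manual = pending_urls(set(), [automated])
print(sorted(with_manual), sorted(without_manual))
```

Because the status is derived rather than edited in place, leaving the manual run out of the integrated list undoes its effect, which is one way to read "easy but revertible".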

Structure

  • objects

    • 00, 01, ...
  • runs

    • run-hash
      • context.yaml
        • commit-hash, pip-freeze, ...
      • events.zip
  • statuses

    • status-hash
      • context.yaml
        • parent-status, integrated
      • db.sqlite.zip
  • current-run

    • context.yaml
    • events
      • these to be compressed into ../runs
    • status.sqlite
  • there is a 'TEST' status

    • whatever is based on it cannot be integrated
    • a test run can be made on it...
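
The layout above can be expressed as a few path helpers. This is a sketch using only the directory and file names listed; `init_depot`, `run_events_path` and `status_db_path` are hypothetical names, not aswan functions.

```python
import tempfile
from pathlib import Path

# Top-level entries as listed in the structure above.
TOP_LEVEL = ("objects", "runs", "statuses", "current-run")


def init_depot(root):
    """Create the four top-level directories if they are missing."""
    root = Path(root)
    for name in TOP_LEVEL:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def run_events_path(root, run_hash):
    """Location of the compressed events of a finished run."""
    return Path(root) / "runs" / run_hash / "events.zip"


def status_db_path(root, status_hash):
    """Location of the compressed sqlite file of a status."""
    return Path(root) / "statuses" / status_hash / "db.sqlite.zip"


# Try it in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    depot = init_depot(tmp)
    created = sorted(p.name for p in depot.iterdir())
print(created)
```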

when starting a run:

  • check if current-run is empty
    • if not, fail with
  • find latest status
    • if it has not integrated all past runs, create a new status that has integrated them
  • start collection (+ registration)
  • whether the run stops properly or breaks, all events and objects are saved to disk
  • if it stops properly, move and compress the contents of current-run
    • based on one that was the starter, and current run id
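
The first and last steps above can be sketched with the standard library. This is a simplified illustration, not aswan's implementation: `start_run` and `finish_run` are hypothetical names, events are assumed to be flat files, and handling of context.yaml, status.sqlite and status creation is left out.

```python
import tempfile
import zipfile
from pathlib import Path


def start_run(root):
    """Refuse to start while current-run still holds leftovers of an earlier run."""
    current = Path(root) / "current-run"
    current.mkdir(parents=True, exist_ok=True)
    if any(current.iterdir()):
        raise RuntimeError("current-run is not empty")
    (current / "events").mkdir()
    return current


def finish_run(root, run_id):
    """After a proper stop, compress the events into runs/<run_id>/events.zip."""
    current = Path(root) / "current-run"
    target = Path(root) / "runs" / run_id
    target.mkdir(parents=True)
    archive = target / "events.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted((current / "events").iterdir()):
            zf.write(path, arcname=path.name)
            path.unlink()  # the event now lives only in the archive
    (current / "events").rmdir()
    return archive


# Try it in a throwaway directory with a single placeholder event.
with tempfile.TemporaryDirectory() as tmp:
    current = start_run(tmp)
    (current / "events" / "event-0001.json").write_text("{}")
    archive = finish_run(tmp, "run-0001")
    with zipfile.ZipFile(archive) as zf:
        archived_names = zf.namelist()
    leftover = [p.name for p in current.iterdir()]
print(archived_names, leftover)
```

If the run breaks instead of stopping properly, `finish_run` is never called, so the events stay on disk in current-run and the next `start_run` fails until they are dealt with.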

Pre v1.0 laundry list

  • parallelize push / pull

  • parsing/connection/broken session error docs

  • transferring / ignoring cookies

  • template projects

    • oddsportal
      • updating thingy, based on latest match in season
    • footy
    • rotten
    • boxoffice
