Data collection manager
aswan
Collect and organize data into a T1 data depot. Named after the Aswan Dam.
Collect and compress data from the internet for later parsing
- quick, parallel, customizable to collect
- compressed to store
- quick to sync with a remote store
- sync to continue collecting
- sync to parse
- immutable collection
To Set Up a Remote
Set the environment variables ASWAN_AUTH_HEX and ASWAN_AUTH_PASS according to the zimmauth package, and ASWAN_REMOTE to the name of the default remote.
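A minimal sketch of a pre-flight check for the three variables above. The variable names come from this README; the helper `missing_remote_vars` is hypothetical and not part of aswan, and the value formats are defined by zimmauth, so they are not validated here.

```python
import os

# Names taken from the README; the expected value formats are set by zimmauth.
REQUIRED_VARS = ("ASWAN_AUTH_HEX", "ASWAN_AUTH_PASS", "ASWAN_REMOTE")


def missing_remote_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration only; aswan may perform its
    own checks differently.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_remote_vars()` before a sync gives a readable list of what still needs to be exported, rather than a failure partway through a push or pull.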
Concepts
- objects
- saved by collection events
- events
- collection
- registration (v2: registration for parsing)
- (v2) parsing
- runs
- manual run vs automated run
- makes manual adding of urls easy but revertible
- has unique id
- generates events
- linked to a specific version of the code
- ideally commit hash + pip freeze
- statuses
- determined by base status + runs integrated
- contains
- what urls need to be collected
- (v2) what collected objects need to be parsed
- sqlite file, constantly trimmed
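To illustrate the idea of a run with a unique id that is linked to a specific version of the code, here is a rough sketch that gathers a commit hash and a pip-freeze-like package list. The function names and dictionary keys are assumptions made for this example, not aswan's actual API or file schema.

```python
import subprocess
import uuid
from importlib import metadata


def current_commit_hash():
    """Return the current git commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def installed_packages():
    """Return sorted 'name==version' strings, similar to `pip freeze` output."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            pins.add(f"{name}=={dist.version}")
    return sorted(pins)


def make_run_context(manual=False):
    """Build a context record for a run (key names are hypothetical)."""
    return {
        "run-id": uuid.uuid4().hex,  # unique id of the run
        "manual": manual,  # manual run vs automated run
        "commit-hash": current_commit_hash(),
        "pip-freeze": installed_packages(),
    }
```

A record like this could be written to a run's context.yaml, so that every event can be traced back to the code and environment that produced it.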
Structure
- objects
- 00, 01, ...
- runs
- run-hash
- context.yaml
- commit-hash, pip-freeze, ...
- events.zip
- statuses
- status-hash
- context.yaml
- parent-status, integrated
- db.sqlite.zip
- current-run
- context.yaml
- events
- these to be compressed into ../runs
- status.sqlite
- there is a 'TEST' status
- whatever is based on it cannot be integrated
- a test run can be made on it...
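The layout above says that the events of current-run are compressed into ../runs. One way this step might look, using only the standard library; the function name `archive_events` and its parameters are hypothetical, and the directory names follow the structure listed above.

```python
import zipfile
from pathlib import Path


def archive_events(current_run: Path, runs_root: Path, run_id: str) -> Path:
    """Zip current_run/events into runs_root/run_id/events.zip.

    Illustrative sketch only; aswan's own implementation may differ.
    Returns the path of the created archive.
    """
    events_dir = current_run / "events"
    target_dir = runs_root / run_id
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / "events.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store each event file under its path relative to the events directory.
        for path in sorted(events_dir.rglob("*")):
            if path.is_file():
                zf.write(path, arcname=path.relative_to(events_dir).as_posix())
    return archive
```

Keeping one archive per run means a sync only has to transfer a single compressed file for each finished run.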
when starting a run:
- check if current-run is empty
- if not, fail with
- find latest status
- if it has not integrated all past runs, create a new status that has
- start collection (+ registration)
- whether the run stops or breaks, all events and objects are saved to disk
- if it stops properly, move and compress stuff
- based on one that was the starter, and current run id
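The first steps of starting a run can be sketched in memory as follows. This is a simplified model under several assumptions: `Status` and `pick_base_status` are hypothetical names, the status list is assumed to be ordered oldest to newest, `integrated` is treated as the cumulative list of run ids, and the new status is assumed to be the latest one plus the runs it was missing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Status:
    status_id: str
    parent: Optional[str] = None
    # Assumption: cumulative ids of all runs integrated into this status.
    integrated: List[str] = field(default_factory=list)


def pick_base_status(statuses, past_runs, current_run_files, new_status_id):
    """Return the status a new run should start from.

    statuses: list of Status, oldest first (assumed ordering).
    past_runs: ids of all finished runs.
    current_run_files: names found in the current-run directory.
    """
    # Step 1: refuse to start if a previous run left files behind.
    if current_run_files:
        raise RuntimeError("current-run is not empty")
    if not statuses:
        raise ValueError("at least one status is required")
    # Step 2: find the latest status.
    latest = statuses[-1]
    # Step 3: if some past runs are not integrated, derive a new status.
    missing = [run for run in past_runs if run not in latest.integrated]
    if not missing:
        return latest
    new_status = Status(
        status_id=new_status_id,
        parent=latest.status_id,
        integrated=latest.integrated + missing,
    )
    statuses.append(new_status)
    return new_status
```

Collection would then start from the returned status, and a properly finished run would be archived under its own run id.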
Pre v1.0 laundry list
- parallelize push / pull
- parsing/connection/broken session error docs
- transferring / ignoring cookies
- template projects
- oddsportal
- updating thingy, based on latest match in season
- footy
- rotten
- boxoffice