Initial Dev Upload version
Trough
Big data, small databases.
Trough is designed to split a very large dataset into many small SQLite databases that can be queried quickly and easily. The system prioritizes simplicity and ease of setup and administration.
The Processes
Configuration/Ansible Deployment system
configures hosts
deploys 4 types of worker nodes: a read worker, a write worker, a consul worker and a synchronizer worker.
Processes
Reader
performs read requests on SQLite files
receives SQL via POST requests; throws an error on non-SELECT queries
throws an error if it is queried for a nonexistent database
one reader per httpd thread
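The Reader rules above can be sketched roughly as follows. This is only an illustration, not Trough's actual code: the `handle_read` function, the `ReadError` class, and the `<segment_id>.sqlite` file layout are all hypothetical, and the SELECT check is a simple prefix test.

```python
import json
import os
import sqlite3


class ReadError(Exception):
    """Raised when a read request cannot be served (hypothetical name)."""


def handle_read(db_dir, segment_id, sql):
    """Run one SELECT against a segment's SQLite file and return JSON text."""
    # Reject anything that does not start with SELECT (simple prefix check).
    if not sql.lstrip().upper().startswith("SELECT"):
        raise ReadError("only SELECT queries are allowed")
    # Assumed layout: one file per segment, named <segment_id>.sqlite.
    path = os.path.join(db_dir, f"{segment_id}.sqlite")
    # Fail explicitly rather than let sqlite3 create a new, empty database.
    if not os.path.isfile(path):
        raise ReadError(f"no such database: {segment_id}")
    conn = sqlite3.connect(path)
    try:
        conn.row_factory = sqlite3.Row
        rows = [dict(row) for row in conn.execute(sql)]
    finally:
        conn.close()
    return json.dumps(rows)
```

A real deployment would call something like this from the HTTP handler, passing the POST body as `sql`.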
Writer
performs write requests to SQLite files
receives SQL via POST requests; throws an error on non-INSERT/UPDATE queries
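As a minimal sketch of the Writer's restriction, the function below accepts only statements whose first keyword is INSERT or UPDATE. The name `handle_write` and its signature are assumptions made for illustration, not Trough's API.

```python
import sqlite3

ALLOWED_WRITE_VERBS = ("INSERT", "UPDATE")


def handle_write(db_path, sql):
    """Apply one INSERT or UPDATE statement; return the number of rows changed."""
    words = sql.split()
    verb = words[0].upper() if words else ""
    if verb not in ALLOWED_WRITE_VERBS:
        raise ValueError("only INSERT and UPDATE queries are allowed")
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(sql)
        return cursor.rowcount
    finally:
        conn.close()
```

Checking only the leading keyword is a simplification; a production writer would likely need a stricter parse.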
Consul Agent (Server mode)
Vanilla consul agent running in server mode
3 dedicated server consul nodes
Every Read Node and Write Node runs a Consul agent (a gossip agent)
- By way of DNS, HTTP POST SQL queries are sent directly to a Read Node
for example: POST http://128100.trough.service.archive-it.org/
request body: SELECT count(id) FROM crawled_urls WHERE host = 'www.example.com' GROUP BY status_code;
the response is returned as JSON
- Health Checks:
make http
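To make the DNS-routed query concrete, here is a small client-side sketch that builds (but does not send) such a POST request. The hostname pattern is copied from the example above; treating its first label as a segment ID, and the helper name `build_query_request`, are assumptions.

```python
import urllib.request

# Hostname pattern taken from the example above; reading the leading label
# as a segment ID is an assumption.
HOST_TEMPLATE = "{segment_id}.trough.service.archive-it.org"


def build_query_request(segment_id, sql):
    """Build an HTTP POST request that carries a SQL query as its body."""
    url = "http://" + HOST_TEMPLATE.format(segment_id=segment_id) + "/"
    return urllib.request.Request(url, data=sql.encode("utf-8"), method="POST")
```

Sending the request with `urllib.request.urlopen(request)` and decoding the body with `json.loads` would then yield the result rows, assuming the service is reachable.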
Consul Agent (Local mode)
Vanilla consul agent running in local mode
Synchronizer (Server mode)
A small process that runs on at least two nodes for failover redundancy (dedicated or shared with Read workers)
Synchronizers elect a leader “Lead Synchronizer” via Consul
Local Synchronizer pushes tags to Consul when segments are available
Local Synchronizer pushes a K/V entry for the current host recording how much it has stored versus its quota
Lead Synchronizer assigns segments to other nodes based on stored/quota ratio
Lead Synchronizer assigns one or more Read Nodes for each read-only segment (replication is configurable)
Lead Synchronizer assigns one Write Node for each writable segment
Lead Synchronizer discovers total segment pool from HDFS (file listing?)
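One way the stored/quota assignment could work is a greedy pass that always picks the least-full nodes. The sketch below is a guess at the idea rather than the Lead Synchronizer's real algorithm; the function name, the input shapes, and the use of byte sizes are all assumptions.

```python
def assign_segments(segments, nodes, replication=1):
    """Greedily assign each segment to the nodes with the lowest stored/quota ratio.

    segments: dict of segment ID -> size
    nodes: dict of host -> {"stored": amount, "quota": amount}
    Returns a dict of segment ID -> list of hosts.
    """
    if replication > len(nodes):
        raise ValueError("replication exceeds the number of nodes")
    # Work on a copy so the caller's node data is not modified.
    stored = {host: info["stored"] for host, info in nodes.items()}
    assignments = {}
    for segment_id, size in sorted(segments.items()):
        # Rank hosts by current fill ratio; break ties by host name.
        ranked = sorted(nodes, key=lambda h: (stored[h] / nodes[h]["quota"], h))
        chosen = ranked[:replication]
        for host in chosen:
            stored[host] += size
        assignments[segment_id] = chosen
    return assignments
```

This version does not refuse assignments that would push a node past its quota; whether the real system does so is not stated here.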
Synchronizer (Local mode)
the local synchronizer should always be run niced (at lowered scheduling priority) so that it interferes as little as possible with query times.
checks whether databases have been removed from the Consul manifest and deletes the local copies.
reads the Consul manifest, pulls down SQLite files from HDFS, and verifies them against their checksums
sets up files to respond on /health/[ID]; caches the comparison of local file checksums against HDFS; the server returns 200 or 500 depending on the result
periodically reports the databases that are actually available on the local node, in case of partial failure.
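The checksum-based health check could look something like the sketch below. The choice of SHA-256 is an assumption (the checksum algorithm is not named here), as are the function names, and this version recomputes the digest on every call rather than caching the comparison.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def health_status(path, expected_checksum):
    """Return 200 if the local file matches the expected checksum, else 500."""
    try:
        ok = file_checksum(path) == expected_checksum
    except OSError:
        # A missing or unreadable file counts as unhealthy.
        ok = False
    return 200 if ok else 500
```

A handler for /health/[ID] would look up the local path and the expected checksum for that ID and return the resulting status code.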
The Nodes
Read Node
Runs a Read process
Runs a Write process
Runs a Synchronizer process (Local mode)
Runs a Consul Agent (Local mode)
Consul Node
Runs a Consul Agent (Server mode)
Synchronizer Node
Runs a Consul Agent (Local mode)
Runs a Synchronizer process (Server mode)
Hashes for Trough-0.1.dev0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 414d7d15cc8ae89b81da189f648f5bacdd355eee75c6b4cdfd8ff835d00edaca
MD5 | 301b92d76386b22bb809ec38975a5521
BLAKE2b-256 | 5532f14853fceeea99c2c2bb767384144f19343cae6048378bffe52d00678799