The basic modules used to design a new tentacle for the Algernon Leech platform.

Project description

The Tentacle

mound-like tentacles groping from underground nuclei of polypous perversion...

overview

The Leech Platform works by sending out tentacles to extract (leech) data from a single identified source. These tentacles perform the extracting and processing to produce JSON objects representing the data, which they broadcast back to the Nucleus.

extraction flow

  1. Retrieve configuration regarding the IdSource and the DataSource from the tentacle's storage level
  2. Build the ExtractionConfiguration and the SourceConfiguration (the ExtractionConfig and SourceConfig described below), which jointly contain the details of how to run a specific type of extraction and process its results.
  3. Perform the extraction, the details of which are specific to each individual tentacle.
  4. Process the extracted data, transforming it to a list of standardized JSON objects.
  5. Broadcast the JSON objects to the Nucleus.
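
The five steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: `run_extraction_flow` and its four callable parameters are hypothetical names invented for this sketch, not part of the leech_tentacle package, and the in-memory stubs stand in for real storage, extraction, and Nucleus broadcasting.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical sketch: none of these names come from the leech_tentacle package.
def run_extraction_flow(
    load_config: Callable[[], Dict[str, Any]],
    extract: Callable[[Dict[str, Any], Dict[str, Any]], List[Dict[str, Any]]],
    process: Callable[[Dict[str, Any]], Dict[str, Any]],
    broadcast: Callable[[str], None],
) -> int:
    # 1. Retrieve stored configuration for the IdSource and the DataSource.
    stored = load_config()
    # 2. Build the extraction and source configurations (plain dicts here).
    extraction_config = stored["extraction"]
    source_config = stored["source"]
    # 3. Perform the extraction; the details are tentacle-specific.
    raw_records = extract(extraction_config, source_config)
    # 4. Process each raw record into a standardized JSON-ready object.
    assets = [process(record) for record in raw_records]
    # 5. Broadcast each object, serialized as JSON, to the Nucleus.
    for asset in assets:
        broadcast(json.dumps(asset))
    return len(assets)


# Usage with in-memory stubs; `sent` collects what would go to the Nucleus.
sent: List[str] = []
count = run_extraction_flow(
    load_config=lambda: {
        "extraction": {"asset_type": "note"},
        "source": {"username": "demo"},
    },
    extract=lambda extraction_config, source_config: [{"text": "a"}, {"text": "b"}],
    process=lambda raw: {"asset_type": "note", "asset_data": raw},
    broadcast=sent.append,
)
```

Passing each step in as a callable keeps the tentacle-specific part (step 3) separate from the steps that would presumably be shared across tentacles.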

leeching data

To understand why the tentacle operates as it does, one must first understand how the Leech conceptualizes data. All data belongs to someone or something, and all data must be stored somehow. Before the Leech can extract data, both of these parameters must be defined. We refer to them as an IdSource and a DataSource.

IdSource

An IdSource is the who of the data. Who owns the data, and therefore dictates what goes into it? The IdSource is the source of the individually identifiable data assets, each of which ultimately carries its identifier as id_source. An id_source could be a business, a health care organization, or a user on a mobile app.

DataSource

A DataSource is the how of the data. How is the data stored, and how will the tentacle extract it? A DataSource could be an API, a website that we scrape, a brand of IoT device, or a platform such as Google G Suite.

DataAsset

A DataAsset is a single identifiable entry from a DataSource, represented as a JSON object. It has a globally unique identifier, the asset_id, as well as a capture_timestamp, an asset_type, an id_value, id_source, and source_name (DataSource). In addition, it has asset_data, which contains all the extracted data for the asset.
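
As a rough illustration, the fields listed above could be modeled as follows. The field names come from the description; everything else is an assumption made for this sketch: the `DataAsset` class itself, the use of a UUID for asset_id, the ISO 8601 string for capture_timestamp, and the `to_json` helper are hypothetical and may not match the package.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class DataAsset:
    # Hypothetical model; only the field names are taken from the description.
    asset_type: str
    id_value: str
    id_source: str             # who owns the data (the IdSource)
    source_name: str           # how the data is stored (the DataSource)
    asset_data: Dict[str, Any]
    # Assumption: a random UUID serves as the globally unique asset_id.
    asset_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Assumption: the capture time is recorded as a UTC ISO 8601 string.
    capture_timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialize the asset as a JSON object."""
        return json.dumps(asdict(self), sort_keys=True)


asset = DataAsset(
    asset_type="patient_record",
    id_value="12345",
    id_source="example_clinic",
    source_name="example_api",
    asset_data={"name": "Jane Doe"},
)
payload = json.loads(asset.to_json())
```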

Extraction

A single extraction may retrieve one DataAsset or it may retrieve thousands. Multiple extractions can capture the same asset_type. An Extraction is executed according to an ExtractionConfig and a SourceConfig.

ExtractionConfig

An ExtractionConfig is the blueprint for how to execute a single type of extraction against a single DataSource. It includes all the parameters needed to run the extraction and process the resulting data.

SourceConfig

A SourceConfig contains the parameters for a given DataSource that are specific to an IdSource. For example, if multiple businesses all use one commercial database, the SourceConfig might contain a username and password for a single business.
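
The split between the two configurations can be illustrated with the shared-database example: one ExtractionConfig describes the extraction, while each business has its own SourceConfig. The class layouts, field names, and the `configs_for` helper below are assumptions made for this sketch and are not taken from the package.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class ExtractionConfig:
    # Hypothetical fields: how to run one type of extraction on one DataSource.
    source_name: str
    asset_type: str
    parameters: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SourceConfig:
    # Hypothetical fields: DataSource parameters specific to one IdSource.
    id_source: str
    source_name: str
    credentials: Dict[str, str]


# One extraction blueprint, shared by every business using the database.
extraction = ExtractionConfig(
    source_name="shared_db",
    asset_type="invoice",
    parameters={"table": "invoices"},
)

# One SourceConfig per (IdSource, DataSource) pair.
source_configs: Dict[Tuple[str, str], SourceConfig] = {
    ("business_a", "shared_db"): SourceConfig(
        "business_a", "shared_db", {"username": "a_user", "password": "a_secret"}
    ),
    ("business_b", "shared_db"): SourceConfig(
        "business_b", "shared_db", {"username": "b_user", "password": "b_secret"}
    ),
}


def configs_for(id_source: str) -> Tuple[ExtractionConfig, SourceConfig]:
    """Pair the shared blueprint with the credentials of one IdSource."""
    return extraction, source_configs[(id_source, extraction.source_name)]
```

Keeping credentials out of the ExtractionConfig means the same blueprint can be reused for every IdSource that stores its data in that DataSource.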


Source Distribution

leech_tentacle-0.0.1.tar.gz (18.0 kB)

Uploaded Source

File details

Details for the file leech_tentacle-0.0.1.tar.gz.

File metadata

  • Download URL: leech_tentacle-0.0.1.tar.gz
  • Upload date:
  • Size: 18.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.20.1 setuptools/28.8.0 requests-toolbelt/0.9.1 tqdm/4.32.1 CPython/3.6.0

File hashes

Hashes for leech_tentacle-0.0.1.tar.gz
Algorithm Hash digest
SHA256 251acf52484b5b2235ba48ff70eb660e45c73626c1404b7ae7c63a0b26489e20
MD5 f1232a9d3e8fd07ada8996d45af194dd
BLAKE2b-256 6693a78bba3541a2b0a94de3a6a1a826d004d01e183bbbc789da542670118d46
