An efficient data pipeline

Bitflow Data Pipeline

PeTaL's data pipeline is responsible for building the neo4j database used by the main PeTaL website. It contains web scrapers and machine learning tools, which are chained together by defining type signatures on neo4j node labels. For example, the species catalog module generates Taxon nodes, and the wikipedia article module receives Taxon nodes and creates WikipediaArticle nodes.

Extending Mining Capability

The pipeline can be extended by creating a module, for instance in the modules/mining/ directory.
A module is defined by a type signature (type in) -> (type out, label from, label to) and a process(node : type in) function which creates a list of Transaction() objects from (type in) nodes.
Here, "types" are neo4j labels.
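
As a rough illustration of how label-based type signatures chain modules together, the sketch below matches each module's output label against other modules' input labels. The Signature class and the downstream() helper are hypothetical stand-ins written for this example, not part of bitflow; only the module and label names come from this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signature:
    """Hypothetical stand-in for a module's type signature."""
    name: str
    in_label: Optional[str]  # None marks a module with no inputs
    out_label: str


def downstream(modules, label):
    """Return the names of modules that consume nodes carrying `label`."""
    return [m.name for m in modules if m.in_label == label]


modules = [
    Signature("OptimizedCatalog", None, "Taxon"),
    Signature("WikipediaModule", "Taxon", "WikipediaArticle"),
]

# Modules without an input label can start the chain.
independent = [m.name for m in modules if m.in_label is None]
# Modules whose input label matches 'Taxon' come next in the chain.
consumers = downstream(modules, "Taxon")
print(independent, consumers)
```

Because the link between modules is only the label, a new module can be added to the chain without editing the modules upstream of it.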

Independent

A basic skeleton of an "independent" module (with no inputs) looks like this:

from bitflow.utils.module import Module

class MyModule(Module):
    def __init__(self, in_label=None, out_label='Output', connect_labels=None, name='MyModule'):
        Module.__init__(self, in_label, out_label, connect_labels, name)

    def process(self):
        for json_data in ...:
            yield self.default_transaction(json_data) # Create new nodes of type 'Output'

A good example of this is modules/mining/OptimizedCatalog.py

Dependent

A basic skeleton of a "dependent" module looks like this:

from bitflow.utils.module import Module

class MyModule(Module):
    def __init__(self, in_label='Input', out_label='Output', connect_labels=('to', 'from'), name='MyModule'):
        Module.__init__(self, in_label, out_label, connect_labels, name)

    def process(self, previous):
        data = previous.data # Get the neo4j JSON of a node with label 'Input'
        # new_data = ...
        yield self.default_transaction(new_data)

A good example of this is modules/mining/WikipediaModule.py

Within a Module's process() function, self.default_transaction(data) is used to create a Transaction() object from the JSON that holds the new node's properties. For more advanced data miners, see self.custom_transaction() and self.query_transaction(), which are also defined in modules/mining/module.py.
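
To make the process() contract concrete without a running neo4j instance, here is a minimal sketch that imitates it with stand-in classes. FakeModule, FakeTransaction, FakeNode, and TitleModule are all hypothetical; the real Module and Transaction classes may carry different fields and behave differently.

```python
from dataclasses import dataclass


@dataclass
class FakeTransaction:
    """Stand-in for bitflow's Transaction; these fields are assumptions."""
    out_label: str
    data: dict


@dataclass
class FakeNode:
    """Stand-in for the `previous` node handed to process()."""
    data: dict


class FakeModule:
    """Minimal imitation of the Module base class, for illustration only."""

    def __init__(self, in_label, out_label):
        self.in_label = in_label
        self.out_label = out_label

    def default_transaction(self, data):
        # Wrap the JSON properties with the label of the node to create.
        return FakeTransaction(self.out_label, data)


class TitleModule(FakeModule):
    """Hypothetical dependent module: one output node per input node."""

    def __init__(self):
        super().__init__(in_label="Taxon", out_label="WikipediaArticle")

    def process(self, previous):
        name = previous.data["name"]
        yield self.default_transaction({"title": name.replace(" ", "_")})


# process() is a generator, so the caller collects what it yields.
transactions = list(TitleModule().process(FakeNode({"name": "Apis mellifera"})))
print(transactions)
```

Writing process() as a generator lets a module emit zero, one, or many transactions per input node.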

Machine Learning

Base classes relevant to machine learning live in bitflow.utils.
In particular, BatchLearner, BatchTorchLearner, OnlineLearner, and OnlineTorchLearner are worth looking at.

A basic skeleton of a neural-network based machine learning module in PeTaL looks like this:

from torch import nn, optim

from petal.bitflow.utils.BatchTorchLearner import BatchTorchLearner

class MyMLModule(BatchTorchLearner):
    def __init__(self, filename='data/models/my_ML_module.nn'):
        # Change these based on the underlying ML model, see BatchTorchLearner documentation.
        BatchTorchLearner.__init__(self, nn.CrossEntropyLoss, optim.SGD, dict(lr=0.001, momentum=0.9),
                                   in_label='Input', name='MyMLModule', filename=filename)

    def init_model(self):
        self.model = TorchModel(...)

    def transform(self, node):
        # Process node.data into inputs and outputs
        yield inputs, outputs

See modules/taxon_classifier/TaxonClassifier for an example of this.

A more advanced neural network example might look like this. Both examples use the same base class, but more fine-grained control is given by overloading more functions.

class MyMLModule(BatchTorchLearner):
    def __init__(self, filename='data/models/my_model.nn', name='MyMLModule'):
        BatchTorchLearner.__init__(self, filename=filename, epochs=2, train_fraction=0.8, test_fraction=0.2, validate_fraction=0.00, criterion=nn.MSELoss, optimizer=optim.Adadelta, optimizer_kwargs=dict(lr=1.0, rho=0.9, eps=1e-06, weight_decay=0), in_label='Input', name=name)

    def init_model(self):
        self.model = TorchModel(...)

    # def learn() inherited, uses transform()
    def transform(self, node):
        # Process node.data into inputs and outputs
        yield inputs, outputs

    def test(self, batch):
        # Process a test batch (given 20% of the time, based on the test_fraction parameter above)
        pass

    def val(self, batch):
        # Process a validation batch (given according to the validate_fraction parameter above, which is 0.00 here)
        pass
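
One way to picture the train_fraction, test_fraction, and validate_fraction parameters is as the shares of batches sent to learn(), test(), and val(). The sketch below assumes random per-batch routing. That is a guess based on the "given 20% of the time" comment, and route_batch() is a hypothetical helper, not the BatchTorchLearner implementation.

```python
import random


def route_batch(rng, train_fraction=0.8, test_fraction=0.2, validate_fraction=0.0):
    """Pick the handler for one batch, in proportion to the three fractions."""
    total = train_fraction + test_fraction + validate_fraction
    roll = rng.random() * total
    if roll < train_fraction:
        return "learn"
    if roll < train_fraction + test_fraction:
        return "test"
    return "val"


rng = random.Random(0)  # Seeded so that repeated runs route identically.
counts = {"learn": 0, "test": 0, "val": 0}
for _ in range(10_000):
    counts[route_batch(rng)] += 1
print(counts)
```

With the fractions from the constructor above, roughly four batches in five go to learn() and no batch reaches val(), because validate_fraction is zero.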

Scheduler, Driver, and Pipeline classes

Behind the scenes, this is how the pipeline works at a very high level. This code is (if I may say so) well documented, because I saw it as being the hardest to understand or fix. See the top-level of the pipeline directory.

Scheduler

Scheduler will load any modules importable from the modules subdirectories. It expects each file to contain a class of the same name. For example, modules/mymodules/MyModule.py with class MyModule: ... within the file is a valid setup. Also, each module should derive from a base Module class (or another class that derives from Module). As documented above, these are located in bitflow.utils.
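
The "file and class share a name" convention can be sketched with the standard importlib machinery. This is only an illustration of the idea: load_module_class() is a hypothetical helper, and Scheduler's actual loading code may work differently.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_class(path):
    """Import the file at `path` and return the class named after the file."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Raises AttributeError if the file does not define a matching class.
    return getattr(module, path.stem)


# Build a throwaway modules/mymodules/MyModule.py to load from.
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "modules" / "mymodules" / "MyModule.py"
    target.parent.mkdir(parents=True)
    target.write_text("class MyModule:\n    pass\n")
    loaded = load_module_class(target)
    loaded_name = loaded.__name__

print(loaded_name)
```

A file whose class name differs from its file name fails at the getattr() call, which is why the naming convention matters.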

Scheduler reads the type signatures of modules, and runs them based on this.
For instance, OptimizedCatalog is "independent", because it generates Taxon nodes without any input, so it is run first.
Then, once Species nodes are created, modules which rely on them are scheduled and eventually run, depending on the number of nodes available.
For instance, WikipediaModule, EOLModule, and JEBModule will all run after BackboneModule has generated 10 nodes.
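
The scheduling rule described here can be approximated as "a module becomes runnable once enough nodes with its input label exist". The sketch below assumes a single fixed threshold of 10, taken from the example above. The runnable() function and the SIGNATURES table are hypothetical, and the real Scheduler may weigh node counts differently.

```python
from collections import Counter

# (name, in_label, out_label); an in_label of None marks an independent module.
SIGNATURES = [
    ("OptimizedCatalog", None, "Taxon"),
    ("WikipediaModule", "Taxon", "WikipediaArticle"),
]


def runnable(signatures, node_counts, threshold=10):
    """Return the modules whose input label has at least `threshold` nodes."""
    ready = []
    for name, in_label, _out_label in signatures:
        if in_label is None or node_counts[in_label] >= threshold:
            ready.append(name)
    return ready


counts = Counter()  # Missing labels count as zero nodes.
first = runnable(SIGNATURES, counts)
counts["Taxon"] += 10  # Pretend the independent module produced ten nodes.
second = runnable(SIGNATURES, counts)
print(first, second)
```

Before any nodes exist only the independent module is runnable. After ten Taxon nodes appear, the dependent module becomes runnable as well.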

Driver

Driver is just a connection to a neo4j database. Essentially, it provides a useful abstraction over the neo4j API, allowing the developer to worry only about the JSON contained in nodes, along with their labels and connections. This is done by using the Transaction class, located in bitflow.utils. For further understanding, see the file-level documentation.
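
To give a feel for the kind of abstraction involved, the sketch below turns a label and a JSON dictionary into a parameterized Cypher statement. It is an assumption about what such a layer might generate, not Driver's actual code, and create_node_query() is a hypothetical helper. Cypher does not accept a label as a query parameter, so the label is validated and then interpolated into the statement text.

```python
import re

# Conservative pattern for labels that are safe to place in the query text.
_LABEL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_node_query(label, properties):
    """Build a parameterized Cypher CREATE statement for a single node."""
    if not _LABEL.fullmatch(label):
        raise ValueError(f"unsafe label: {label!r}")
    # Property values travel as a parameter map, never as query text.
    return f"CREATE (n:{label} $props)", {"props": dict(properties)}


query, params = create_node_query("Taxon", {"name": "Apis mellifera"})
print(query, params)
```

Keeping property values in the parameter map means the caller only supplies plain JSON and never assembles query strings by hand.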

Pipeline

Pipeline is an interface which allows the server to dynamically load modules and settings (like how a Django site supports changing files while the website is running). It's really that simple, but it's also documented at the file level in the pipeline folder.
