An actor architecture for research software

Project description

ream

A simple actor architecture for your research project. It helps address three problems so that you can focus on your main research:

Configuring the hyper-parameters of your method.
Speeding up the feedback cycle via easy and smart caching.
Running each step in your method independently.

It is more powerful when combined with osin.

Introduction

Let's say you are developing a method, an algorithm, or a pipeline to solve a problem. In many cases, it can be viewed as a computational graph. So why not structure your code as a computational graph, where each node is a component of your method or a step in your pipeline? This makes your code more modular and easier to release, cache, and evaluate. To see how we can apply this architecture, let's take a look at a record linkage project (linking entities in a table). A record linkage system typically has the following steps:

Generate candidate entities in a table.
Rank the candidate entities and select the best matches.

Naturally, this gives us two actors, one per step: CandidateGeneration and CandidateRanking:

import pandas as pd
from typing import Literal
from ream.prelude import BaseActor
from dataclasses import dataclass

@dataclass
class CanGenParams:
    # type of query that will be sent to ElasticSearch
    query_type: Literal["exact-match", "fuzzy-match"]

class CandidateGeneration(BaseActor[pd.DataFrame, CanGenParams]):
    VERSION = 100

    def run(self, table: pd.DataFrame):
        # generate candidate entities of the given table
        ...

@dataclass
class CanRankParams:
    # ranking method to use
    rank_method: Literal["pairwise", "columnwise"]

class CandidateRanking(BaseActor[pd.DataFrame, CanRankParams]):
    VERSION = 100

    def __init__(self, params: CanRankParams, cangen_actor: CandidateGeneration):
        super().__init__(params, [cangen_actor])

    def run(self, table: pd.DataFrame):
        # rank candidate entities of the given table
        ...

The two actors make the code more modular and closer to releasable quality. To define the linking pipeline, we can use ActorGraph:

from ream.prelude import ActorGraph, ActorNode, ActorEdge

g = ActorGraph()
cangen = g.add_node(ActorNode.new(CandidateGeneration))
canrank = g.add_node(ActorNode.new(CandidateRanking))
g.add_edge(ActorEdge(id=-1, source=cangen, target=canrank))

If you provide type hints for the constructor arguments of your actors, as in the examples above, the graph can be constructed automatically from the actor classes:

from ream.prelude import ActorGraph

g = ActorGraph.auto([CandidateGeneration, CandidateRanking])
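
To make the idea concrete, here is a minimal sketch of how dependency edges could be inferred from constructor type hints. It is not ream's implementation: the Actor base class and the infer_edges helper are hypothetical names introduced only for this illustration, and the params arguments are plain dicts to keep the example self-contained.

```python
from typing import get_type_hints


class Actor:
    """Hypothetical stand-in for an actor base class (not ream's BaseActor)."""


class CandidateGeneration(Actor):
    def __init__(self, params: dict):
        self.params = params


class CandidateRanking(Actor):
    def __init__(self, params: dict, cangen_actor: CandidateGeneration):
        self.params = params
        self.cangen_actor = cangen_actor


def infer_edges(actor_classes):
    """Return (dependency, dependent) name pairs found in constructor type hints."""
    known = set(actor_classes)
    edges = []
    for cls in actor_classes:
        # Resolve the annotations of __init__ into real type objects.
        hints = get_type_hints(cls.__init__)
        for arg_name, arg_type in hints.items():
            # An argument annotated with another known actor class is a dependency.
            if arg_name != "return" and arg_type in known:
                edges.append((arg_type.__name__, cls.__name__))
    return edges


edges = infer_edges([CandidateGeneration, CandidateRanking])
print(edges)
```

Because CandidateRanking declares a CandidateGeneration argument, the sketch finds a single edge from the generation step to the ranking step, which is the same edge added by hand in the manual example.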

Constructing the graph may seem boring and not offer much by itself, but once you have it, you can pick any actor and any of its methods to call without manually initializing the actors or parsing command line arguments. For example, suppose we want to trigger the evaluate method of an actor. The parameters of the actors are obtained automatically from the command line arguments, thanks to the yada parser.

if __name__ == "__main__":
    g.run(actor_class="CandidateGeneration", actor_method="evaluate")
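
The following sketch illustrates the general idea of filling a parameter dataclass from command line arguments. It uses the standard library's argparse rather than yada, so it should be read as an illustration of the technique under that assumption, not as a description of how yada or ream work internally. The parse_params helper is a hypothetical name.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Literal, get_args, get_origin, get_type_hints


@dataclass
class CanGenParams:
    # type of query that will be sent to ElasticSearch
    query_type: Literal["exact-match", "fuzzy-match"]


def parse_params(params_cls, argv):
    """Build an argparse parser from the dataclass fields and parse argv."""
    parser = argparse.ArgumentParser()
    hints = get_type_hints(params_cls)
    for field in fields(params_cls):
        field_type = hints[field.name]
        if get_origin(field_type) is Literal:
            # A Literal type becomes a restricted set of choices.
            parser.add_argument(
                f"--{field.name}", choices=get_args(field_type), required=True
            )
        else:
            # Other simple types (int, float, str) are used as converters.
            parser.add_argument(f"--{field.name}", type=field_type, required=True)
    return params_cls(**vars(parser.parse_args(argv)))


params = parse_params(CanGenParams, ["--query_type", "fuzzy-match"])
print(params)
```

In a real script, argv would come from sys.argv[1:]; passing the list explicitly keeps the sketch easy to test.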

The evaluate method for each actor can be very useful. On the candidate generation actor, it can tell us the upper bound accuracy of our method so we know whether we need to improve the candidate generation or candidate ranking. If a dataset actor is introduced to the computational graph as demonstrated below, its evaluate method can tell us statistics about the dataset.

from ream.prelude import NoParams, BaseActor, DatasetQuery

class DatasetActor(BaseActor[str, NoParams]):
    VERSION = 100

    def run(self, query: str):
        # use a query so we can dynamically select a subset of the dataset for quick testing
        # for example: mnist[:10] selects the first 10 examples
        dsquery = DatasetQuery.from_string(query)

        # load the real dataset
        examples = ...
        return dsquery.select(examples)

    def evaluate(self, query: str):
        dsdict = self.run(query)
        for split, examples in dsdict.items():
            print(f"Dataset: {dsdict.name} - split {split} has {len(examples)} examples")
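
To illustrate the query idea used above (for example mnist[:10]), here is a simplified sketch of parsing such a string into a dataset name and a slice. SimpleQuery and parse_query are hypothetical names for this illustration; ream's DatasetQuery may support a different or richer syntax, such as dataset splits.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleQuery:
    name: str
    start: Optional[int]
    stop: Optional[int]

    def select(self, examples):
        # A missing bound is None, which Python slicing treats as "open".
        return examples[self.start:self.stop]


def parse_query(query: str) -> SimpleQuery:
    """Parse strings such as 'mnist', 'mnist[:10]' or 'mnist[5:20]'."""
    match = re.fullmatch(r"(\w+)(?:\[(-?\d+)?:(-?\d+)?\])?", query)
    if match is None:
        raise ValueError(f"invalid dataset query: {query!r}")
    name, start, stop = match.groups()
    return SimpleQuery(
        name,
        int(start) if start is not None else None,
        int(stop) if stop is not None else None,
    )


examples = list(range(100))  # placeholder for real dataset examples
query = parse_query("mnist[:10]")
subset = query.select(examples)
print(query.name, len(subset))
```

Selecting a small subset this way keeps the feedback cycle short while you are still debugging a step.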

Let's talk about caching. Each running actor is uniquely identified by its name, version, and parameters (including the parameters of the actors it depends on). This is referred to as the actor state, which you can retrieve with the BaseActor.get_actor_state function. From this state, we can create a unique folder that you can use to store your cached data (the folder can be retrieved with the BaseActor.get_working_fs function). Whenever an actor's dependency is updated, you get a new folder, so you do not need to manage the cache yourself. To set it up, initialize the ream workspace in the file that defines the actor graph:

from ream.prelude import ReamWorkspace, ActorGraph

ReamWorkspace.init("<folder>/<to>/<store>/<cache>")
g = ActorGraph()
...
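
The sketch below shows one way a cache folder could be derived from an actor state, assuming the state is serialized to JSON and hashed. It is an illustration of the concept only: actor_state and cache_dir are hypothetical helpers, and ream's actual state format and folder layout may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class CanGenParams:
    query_type: str


@dataclass
class CanRankParams:
    rank_method: str


def actor_state(name, version, params, dependencies=()):
    """Describe an actor by name, version, parameters, and dependency states."""
    return {
        "name": name,
        "version": version,
        "params": asdict(params),
        "dependencies": list(dependencies),
    }


def cache_dir(root: Path, state: dict) -> Path:
    """Map a state to a folder path; no folder is created on disk here."""
    serialized = json.dumps(state, sort_keys=True).encode()
    digest = hashlib.sha256(serialized).hexdigest()[:12]
    return root / state["name"] / digest


root = Path("cache")
exact = actor_state("CandidateGeneration", 100, CanGenParams("exact-match"))
fuzzy = actor_state("CandidateGeneration", 100, CanGenParams("fuzzy-match"))
rank_exact = actor_state("CandidateRanking", 100, CanRankParams("pairwise"), [exact])
rank_fuzzy = actor_state("CandidateRanking", 100, CanRankParams("pairwise"), [fuzzy])

# Same ranking parameters, different upstream parameters: different folders.
print(cache_dir(root, rank_exact))
print(cache_dir(root, rank_fuzzy))
```

Because the dependency states are part of the hashed content, changing a parameter or version of an upstream actor changes the folder of every downstream actor, so stale cache entries are never reused.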

Installation

pip install ream2  # not ream

Examples

Will be added later.

Release history

Version	Release date
4.3.1 (this version)	Apr 21, 2024
4.2.3	Mar 1, 2024
4.2.2	Feb 1, 2024
4.2.1	Jan 31, 2024
4.2.0	Jan 16, 2024
4.1.0	Jan 4, 2024
4.0.0	Dec 31, 2023
3.8.2	Dec 22, 2023
3.8.1	Dec 17, 2023
3.8.0	Dec 4, 2023
3.7.1	Dec 1, 2023
3.7.0	Dec 1, 2023
3.6.1	Nov 17, 2023
3.6.0	Nov 16, 2023
3.5.0	Nov 5, 2023
3.4.1	Oct 8, 2023
3.4.0	Oct 3, 2023
3.3.0	Sep 25, 2023
3.2.3	Sep 24, 2023
3.2.2	Sep 24, 2023
3.2.1	Sep 23, 2023
3.2.0	Sep 11, 2023
3.1.0	Sep 7, 2023
3.0.1	Sep 5, 2023
3.0.0	Sep 4, 2023
2.12.1	Aug 16, 2023
2.12.0	Aug 8, 2023
2.11.1	Aug 4, 2023
2.11.0	Jul 27, 2023
2.10.2	Jun 27, 2023
2.10.1	Jun 24, 2023
2.10.0	Jun 24, 2023
2.9.1	Jun 24, 2023
2.9.0	Jun 20, 2023
2.8.1	Jun 19, 2023
2.8.0	Jun 13, 2023
2.7.0	Jun 11, 2023
2.6.2	Jun 10, 2023
2.6.1	Jun 7, 2023
2.6.0	May 19, 2023
2.5.2	Apr 19, 2023
2.3.0	Feb 28, 2023
2.2.0	Feb 26, 2023
2.1.3	Feb 22, 2023
1.8.2	Feb 13, 2023
1.8.1	Feb 7, 2023
1.7.0	Dec 31, 2022
1.6.7	Dec 27, 2022
1.6.6	Dec 27, 2022
1.6.5	Dec 26, 2022
1.6.4	Dec 26, 2022
1.6.2	Dec 24, 2022
1.6.1	Dec 24, 2022
1.6.0	Dec 20, 2022
1.5.7	Dec 15, 2022
1.5.6	Dec 7, 2022
1.5.5	Nov 30, 2022
1.5.4	Nov 8, 2022
1.5.3	Nov 8, 2022
1.5.2	Nov 7, 2022
1.4.0	Nov 1, 2022
1.3.0	Oct 27, 2022
1.2.0	Oct 18, 2022
1.1.0	Oct 16, 2022
1.0.1	Oct 15, 2022
1.0.0	Oct 15, 2022

Download files

Source Distribution

ream2-4.3.1.tar.gz (47.7 kB)

Uploaded Apr 21, 2024 Source

Built Distribution

ream2-4.3.1-py3-none-any.whl (55.8 kB)

Uploaded Apr 21, 2024 Python 3

Hashes for ream2-4.3.1.tar.gz
Algorithm	Hash digest
SHA256	`f03d2a78823f201df679700010515d33dbef60a0fad303f000ec3eb5c09d70d2`
MD5	`6f60484a7b7189a9c0fcdc3732bc8edf`
BLAKE2b-256	`3b0e4a1072eb4025d6d70427dc372a8bd7a6c4d43fe9184975730b675eedc10c`

Hashes for ream2-4.3.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a9fd98179ceaccf72b07d74c4067fbc7e1406fb7918bab3cb958ec8648264aa9`
MD5	`1a43bffc9e89f754be1a8838046bb8cf`
BLAKE2b-256	`9549ea8ea0c6eb4fff87bf31356d429e759fc7d80bb74e0dba63fabcf6d7f1cb`