
Planchet

Your large data processing personal assistant


About

Planchet (pronounced /plʌ̃ʃɛ/) is a data package manager suited for processing large arrays of data items. It natively supports reading and writing CSV and JSONL data files and serving their content over a FastAPI service to clients that process the data. It is a tool for scientists and hackers, not for production.
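
For reference, JSONL simply means one JSON object per line. The records below are only illustrative, but they use the text field that the example script further down relies on:

{"text": "An illustrative headline"}
{"text": "Another illustrative headline"}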

How it works

Planchet solves the controlled processing of large amounts of data in a simple and slightly naive way: it controls the reading and writing of the data rather than the processing itself. When you create a job, you tell the service where to read, where to write, and which reader and writer classes to use. Next, you (using the client or simple HTTP requests) ask the service for n data items, which your process works through locally. When your processing is done, it ships the items back to Planchet, which writes them to disk. All jobs, as well as the serving and receiving of items, are logged in a Redis instance with persistence. This ensures that if you stop processing, you only lose the work on items that were not yet sent back to Planchet. Planchet will automatically resume jobs and skip over processed items.
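
In client terms, that whole protocol is a short loop. Below is a minimal sketch using the same client methods as the full example later on this page; process_item is a hypothetical stand-in for your own processing code, and the job name and file paths are placeholders:

from planchet import PlanchetClient

def process_item(item):
    # hypothetical stand-in for your own processing
    return item

client = PlanchetClient('http://localhost:5005')
client.start_job('my-job',
                 {'input_file_path': 'in.jsonl', 'output_file_path': 'out.jsonl'},
                 'JsonlReader', writer_name='JsonlWriter')

items = client.get('my-job', 50)  # ask Planchet for a batch of items
while items:
    processed = [(id_, process_item(item)) for id_, item in items]
    client.send('my-job', processed)  # Planchet writes these to disk
    items = client.get('my-job', 50)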

Caveat: Planchet runs in a single thread to avoid the mess of multiple processes writing to the same file. Until this is fixed (maybe never), be careful with your batch sizes: keep them moderate, neither too big nor too small.


Read more about Planchet on the documentation page.

Installation

Planchet consists of two components: a service and a client. The service is the core that does all the work of managing the data, while the client is a light wrapper around requests that makes it easy to access the service API.

Service

You can clone this repo and start straight away like this:

git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
make install
make run-redis
make run

If you want to run Planchet on a different port, you can use the uvicorn command directly, but note that you MUST use only one worker.

uvicorn app:app --reload --host 0.0.0.0 --port 5005 --workers 1

You can also run docker-compose from the git repo:

git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
docker-compose up

Client

pip install planchet

Example

On the server

On the server, we need to install Planchet and download some news headlines data into an accessible directory. Then we multiply the data 1000 times, as there are only 200 lines originally. Don't forget to set your Redis password before you run make install-redis!

git clone https://github.com/savkov/planchet.git
cd planchet
mkdir data
wget https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl -O data/news_headlines.jsonl
python -c "news=open('data/news_headlines.jsonl').read();open('data/news_headlines.jsonl', 'w').write(''.join([news for _ in range(1000)]))"
export PLANCHET_REDIS_PWD=<your-redis-password>
make install
make install-redis
make run
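
The python -c one-liner above just multiplies the file contents in place; a more readable equivalent of the same step is:

# Read the original 200 headlines and write them out 1000 times over,
# so the job has a non-trivial amount of data to serve.
with open('data/news_headlines.jsonl') as fh:
    news = fh.read()
with open('data/news_headlines.jsonl', 'w') as fh:
    fh.write(''.join(news for _ in range(1000)))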

Note that Planchet will run on port 5005 on your host machine.

On the client

On the client side, we need to install the Planchet client, spaCy, and tqdm.

pip install planchet spacy tqdm
python -m spacy download en_core_web_sm
export PLANCHET_REDIS_PWD=<your-redis-password>

Then we write the following script in a file called spacy_ner.py, making sure to fill in the placeholders.

from planchet import PlanchetClient
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

PLANCHET_HOST = 'localhost'  # <--- CHANGE IF NEEDED
PLANCHET_PORT = 5005

url = f'http://{PLANCHET_HOST}:{PLANCHET_PORT}'
client = PlanchetClient(url)

job_name = 'spacy-ner-job'
metadata = {  # NOTE: this assumes Planchet has access to these paths
    'input_file_path': './data/news_headlines.jsonl',
    'output_file_path': './data/entities.jsonl'
}

# make sure you don't use the clean_start option here
client.start_job(job_name, metadata, 'JsonlReader', writer_name='JsonlWriter')

# make sure the number of items is large enough to avoid blocking the server
n_items = 100
headlines = client.get(job_name, n_items)

while headlines:
    ents = []
    print('Processing headlines batch...')
    for id_, item in tqdm(headlines):
        item['ents'] = [ent.text for ent in nlp(item['text']).ents]
        ents.append((id_, item))
    client.send(job_name, ents)
    headlines = client.get(job_name, n_items)
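
Once the job runs dry, the results are in data/entities.jsonl on the server. Assuming the JsonlWriter writes one JSON object per line (an assumption based on the format name, not on documented behaviour), you can inspect the output like this:

import json

# Print the first few processed headlines and their entities
with open('data/entities.jsonl') as fh:
    for i, line in enumerate(fh):
        record = json.loads(line)
        print(record.get('text'), '->', record.get('ents'))
        if i >= 4:
            break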

Finally, we want to do some parallel processing with 8 processes. We can start each process manually, or we can use the parallel tool to start them all:

seq -w 1 8 | parallel python spacy_ner.py {}

Contributors


Sasho Savkov

Dilyan G.

Kristian Boda

