Large Data Processing Assistant

Planchet

Your large data processing personal assistant


About

Planchet (pronounced /plʌ̃ʃɛ/) is a data package manager suited for processing large arrays of data items. It natively reads and writes CSV and JSONL data files and serves their content over a FastAPI service to clients that process the data. It is a tool for scientists and hackers, not for production.

How it works

Planchet solves the controlled processing of large amounts of data in a simple and slightly naive way: by controlling the reading and writing of the data, as opposed to the processing. When you create a job with Planchet, you tell the service where to read, where to write, and what classes to use for that. Next, you (using the client or simple HTTP requests) ask the service for n data items, which your process works through locally. When your processing is done, it ships the items back to Planchet, which writes them to disk. All jobs, and the serving and receiving of items, are logged in a Redis instance with persistence. This ensures that if you stop processing, you only lose the work on items that were not yet sent back to Planchet. Planchet will automatically resume jobs and skip over processed items.
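The serve/receive bookkeeping described above can be sketched as a toy model (a plain dict stands in for the Redis ledger; the class and attribute names below are illustrative, not Planchet's actual internals):

```python
# Toy model of Planchet-style serve/receive bookkeeping.
# A plain dict stands in for the Redis ledger; in Planchet itself the
# ledger is persisted, which is what lets interrupted jobs resume.

class ToyJob:
    def __init__(self, items):
        self.items = list(items)   # data read from the input file
        self.ledger = {}           # item id -> "served" | "done"
        self.output = []           # what would be written to disk

    def get(self, n):
        """Serve up to n items that have not been served or completed."""
        batch = []
        for id_, item in enumerate(self.items):
            if self.ledger.get(id_) is None:
                self.ledger[id_] = "served"
                batch.append((id_, item))
            if len(batch) == n:
                break
        return batch

    def send(self, processed):
        """Receive processed items, mark them done, record the output."""
        for id_, item in processed:
            self.ledger[id_] = "done"
            self.output.append(item)

job = ToyJob(["a", "b", "c", "d"])
batch = job.get(2)                                 # serves items 0 and 1
job.send([(i, s.upper()) for i, s in batch])       # items 0 and 1 are done
print(job.get(10))                                 # [(2, 'c'), (3, 'd')]
```

Because the ledger records which items were served and which came back, a restarted job only re-serves items that never completed, which is the skip-over behaviour described above.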

Caveat: Planchet runs in a single thread to avoid the mess of multiple processes writing to the same file. Until this is fixed (maybe never), you should be careful with your batch sizes: neither too big (an interrupted run loses more unsent work) nor too small (frequent requests hammer the single-threaded service).


Read more about Planchet on the documentation page.

Installation

Planchet consists of two components: a service and a client. The service is the core that does all the work of managing the data, while the client is a light wrapper around requests that makes it easy to access the service API.

Service

You can clone this repo and start straight away like this:

git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
make install
make run-redis
make run

If you want to run Planchet on a different port, you can invoke uvicorn directly, but note that you MUST use only one worker.

uvicorn app:app --reload --host 0.0.0.0 --port 5005 --workers 1

You can also run docker-compose from the git repo:

git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
docker-compose up

Client

pip install planchet

Example

On the server

On the server we need to install Planchet and download some news headlines data into an accessible directory. Then we multiply the data 200 times, as there are only 200 lines originally. Don't forget to set your Redis password before you run make install-redis!

git clone https://github.com/savkov/planchet.git
cd planchet
mkdir data
wget https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl -O data/news_headlines.jsonl
python -c "news=open('data/news_headlines.jsonl').read();open('data/news_headlines.jsonl', 'w').write(''.join([news for _ in range(200)]))"
export PLANCHET_REDIS_PWD=<your-redis-password>
make install
make install-redis
make run
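The Python one-liner above simply reads the ~200-line sample file once and writes 200 concatenated copies back out. Expanded for readability (a sketch; the `multiply_file` helper name is illustrative):

```python
def multiply_file(path, times=200):
    """Read a file once and overwrite it with `times` concatenated copies."""
    with open(path) as f:
        data = f.read()
    with open(path, "w") as f:
        f.write("".join(data for _ in range(times)))

# Equivalent to the one-liner in the setup commands:
# multiply_file("data/news_headlines.jsonl")
```

This gives us roughly 40,000 headlines to process instead of 200, which makes the parallel-processing example below worthwhile.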

Note that Planchet will run at port 5005 on your host machine.

On the client

On the client side we need to install the Planchet client and spaCy.

pip install planchet spacy tqdm
python -m spacy download en_core_web_sm
export PLANCHET_REDIS_PWD=<your-redis-password>

Then we write the following script to a file called spacy_ner.py, making sure to fill in the placeholders.

from planchet import PlanchetClient
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

PLANCHET_HOST = 'localhost'  # <--- CHANGE IF NEEDED
PLANCHET_PORT = 5005

url = f'http://{PLANCHET_HOST}:{PLANCHET_PORT}'
client = PlanchetClient(url)

job_name = 'spacy-ner-job'
metadata = { # NOTE: this assumes planchet has access to this path
    'input_file_path': './data/news_headlines.jsonl',
    'output_file_path': './data/entities.jsonl'
}

# make sure you don't use the clean_start option here
client.start_job(job_name, metadata, 'JsonlReader', writer_name='JsonlWriter')

# make sure the number of items is large enough to avoid blocking the server
n_items = 100
headlines = client.get(job_name, n_items)

while headlines:
    ents = []
    print('Processing headlines batch...')
    for id_, item in tqdm(headlines):
        item['ents'] = [ent.text for ent in nlp(item['text']).ents]
        ents.append((id_, item))
    client.send(job_name, ents)
    headlines = client.get(job_name, n_items)

Finally, we want to do some parallel processing with 8 processes. We can start each process manually, or we can use the parallel tool to start them all.

seq -w 1 8 | parallel python spacy_ner.py {}

Contributors


Sasho Savkov

Dilyan G.

Kristian Boda

