Large Data Processing Assistant
Project description
Planchet
Your large data processing personal assistant
About
Planchet (pronounced /plʌ̃ʃɛ/) is a data package manager suited for processing large arrays of data items. It natively supports reading from and writing to CSV and JSONL data files and serving their content over a FastAPI service to clients that process the data. It is a tool for scientists and hackers, not for production.
How it works
Planchet solves the controlled processing of large amounts of data in a simple and slightly naive way: it controls the reading and writing of the data rather than the processing itself. When you create a job with Planchet, you tell the service where to read, where to write, and which classes to use for each. Next, you (using the client or simple HTTP requests) ask the service for n data items, which your process works through locally. When your processing is done, it ships the items back to Planchet, which writes them to disk. All jobs, as well as the serving and receiving of items, are logged in a Redis instance with persistence. This ensures that if you stop processing, you only lose the work on items that were not yet sent back to Planchet. Planchet will automatically resume jobs and skip over processed items.
Caveat: Planchet runs in a single thread to avoid the mess of multiple processes writing to the same file. Until this is fixed (maybe never) you should be careful with your batch sizes -- keep them not too big and not too small.
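In practice the whole workflow is a get/process/send loop. The sketch below is a minimal, illustrative version of that loop using the client calls from the example further down (start_job, get, send); the job name, file paths, and process_item function are placeholders, not part of the Planchet API.
from planchet import PlanchetClient

client = PlanchetClient('http://localhost:5005')


def process_item(item):
    # Placeholder for your own processing logic.
    item['processed'] = True
    return item


# Register a job: where to read, where to write, and which I/O classes to use.
client.start_job('my-job',
                 {'input_file_path': 'in.jsonl', 'output_file_path': 'out.jsonl'},
                 'JsonlReader', writer_name='JsonlWriter')

batch = client.get('my-job', 50)          # ask Planchet for a batch of 50 items
while batch:
    done = [(id_, process_item(item)) for id_, item in batch]
    client.send('my-job', done)           # Planchet writes the batch to disk
    batch = client.get('my-job', 50)      # an empty batch means the job is done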
Read more about Planchet on the documentation page.
Installation
Planchet has two components: a service and a client. The service is the core that does all the work managing the data, while the client is a light wrapper around requests that makes it easy to access the service API.
Service
You can use this repo and start straight away like this:
git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
make install
make run-redis
make run
If you want to run Planchet on a different port, you can use the uvicorn command directly, but note that you MUST use only one worker.
uvicorn app:app --reload --host 0.0.0.0 --port 5005 --workers 1
You can also run docker-compose from the git repo:
git clone git@github.com:savkov/planchet.git
export PLANCHET_REDIS_PWD=<some-password>
docker-compose up
Client
pip install planchet
Example
On the server
On the server we need to install Planchet and download some news headlines data into an accessible directory. Then we multiply the data 200 times, since the original file has only around 200 lines. Don't forget to set your Redis password before you run make install-redis!
git clone https://github.com/savkov/planchet.git
cd planchet
mkdir data
wget https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl -O data/news_headlines.jsonl
python -c "news=open('data/news_headlines.jsonl').read();open('data/news_headlines.jsonl', 'w').write(''.join([news for _ in range(200)]))"
export PLANCHET_REDIS_PWD=<your-redis-password>
make install
make install-redis
make run
Note that Planchet will run at port 5005 on your host machine.
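Since the service is a FastAPI app, a quick way to confirm it is up, assuming the default auto-generated docs route has not been disabled, is to request the /docs page:
# Minimal smoke test; assumes the default FastAPI /docs route is enabled.
import requests

resp = requests.get('http://localhost:5005/docs', timeout=5)
print(resp.status_code)  # expect 200 when the service is running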
On the client
On the client side we need to install the Planchet client and spaCy.
pip install planchet spacy tqdm
python -m spacy download en_core_web_sm
export PLANCHET_REDIS_PWD=<your-redis-password>
Then we write the following script to a file called spacy_ner.py, making sure you fill in the placeholders.
from planchet import PlanchetClient
import spacy
from tqdm import tqdm
nlp = spacy.load("en_core_web_sm")
PLANCHET_HOST = 'localhost' # <--- CHANGE IF NEEDED
PLANCHET_PORT = 5005
url = f'http://{PLANCHET_HOST}:{PLANCHET_PORT}'
client = PlanchetClient(url)
job_name = 'spacy-ner-job'
metadata = {  # NOTE: this assumes Planchet has access to these paths
    'input_file_path': './data/news_headlines.jsonl',
    'output_file_path': './data/entities.jsonl'
}
# make sure you don't use the clean_start option here
client.start_job(job_name, metadata, 'JsonlReader', writer_name='JsonlWriter')
# make sure the number of items is large enough to avoid blocking the server
n_items = 100
headlines = client.get(job_name, n_items)
while headlines:
    ents = []
    print('Processing headlines batch...')
    for id_, item in tqdm(headlines):
        item['ents'] = [ent.text for ent in nlp(item['text']).ents]
        ents.append((id_, item))
    client.send(job_name, ents)
    headlines = client.get(job_name, n_items)
Finally, we want to do some parallel processing with 8 processes. We can start each process manually, or we can use the parallel tool to start them all.
seq -w 1 8 | parallel python spacy_ner.py {}
Contributors
Sasho Savkov
Dilyan G.
Kristian Boda
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
planchet-0.4.0.tar.gz (14.4 kB)
Built Distribution
planchet-0.4.0-py3-none-any.whl (7.9 kB)
File details
Details for the file planchet-0.4.0.tar.gz.
File metadata
- Download URL: planchet-0.4.0.tar.gz
- Upload date:
- Size: 14.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.8.0 tqdm/4.46.0 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | b8c346a84d4b41045f5cdedaddd84f6de4d25ea59be98e4a312798b13e4dd3ef
MD5 | 40acaa9c2057b3f3802ec9ca2766c5f6
BLAKE2b-256 | b4ea8b2f27ea23fe29f3d77015c231fb0a07f51fa71e968217b4dd4892fa43a5
File details
Details for the file planchet-0.4.0-py3-none-any.whl.
File metadata
- Download URL: planchet-0.4.0-py3-none-any.whl
- Upload date:
- Size: 7.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.0.1 requests-toolbelt/0.8.0 tqdm/4.46.0 CPython/3.6.8
File hashes
Algorithm | Hash digest
---|---
SHA256 | 3c1b19e8ba62d9a8e83171f0e0596daacf191b4801ec58c83c431914ff98d7df
MD5 | c6e97f9a1902653c423b62b68a7e3507
BLAKE2b-256 | 2d22377e78bfe5caa2da1304a53f918eebc61885a9bab721cd14abe13a25f684