Project description
Tweet Parser and Matcher
Requirements:
Python version >= 3.7.9
Steps to run program:
- Install Dependencies:
pip install -r requirements.txt
- In the project directory, install the package with wheel:
pip install .
- Example runs with different commands:
# Parsing directly from gz using given sample
python -m tweet_matcher.matcher -nd ./sample/nodes_2/ -td ./sample/terms_2_3/ -tw ./sample/tweets/tweets.jsonl.gz -od ./output/
# Concurrent partitioning by 400 tweets, using the 10x sample file with 10% random user ids added
python -m tweet_matcher.matcher -nd ./sample/nodes_2/ -td ./sample/terms_2_3/ -tw ./sample/tweets/tweets_x_10_r_10.jsonl.gz -od ./output/ -cc 1 -p 400
- Run command below for more options:
python -m tweet_matcher.matcher -h
- See the logs folder for details from tests and runs.
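To illustrate what partitioning a jsonl.gz by a fixed number of tweets can look like, here is a minimal sketch. It is not the package's actual implementation: the function name `iter_partitions` is hypothetical, and the only assumption about the input is that it is a gzipped JSON Lines file with one tweet per line.

```python
import gzip
import json
from itertools import islice
from typing import Iterator, List


def iter_partitions(path: str, size: int) -> Iterator[List[dict]]:
    """Yield lists of up to `size` parsed tweets from a gzipped JSON Lines file.

    Hypothetical helper for illustration; not part of tweet_matcher.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        while True:
            # islice pulls at most `size` lines, so the file is never fully loaded.
            lines = list(islice(handle, size))
            if not lines:
                break
            # Skip blank lines and parse each remaining line as one JSON object.
            yield [json.loads(line) for line in lines if line.strip()]
```

Each yielded partition could then be handed to a separate worker, which is roughly the idea behind a partition size such as 400.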
Steps to run nightly:
- Verify with upstream developers what time the files are expected to be SFTPed to the endpoint. Say 9PM.
- Set up a cron job with the required parameters:
0 21 * * * python -m tweet_matcher.matcher -tw <tweet_file_dir> -nd <nodes_dir> -td <terms_directory> -od <output_dir>
TODO:
- Explore streaming for real-time processing of tweet data, reusing part of the concurrent implementation to send async requests to another API.
- Implement an API for node and term updates and for matches.
- Add a preprocessing step to filter tweets by node_ids.
- Improve the benchmark to randomize groups of words in generated tweet text. Add generation of node_ids and terms.
- Benchmark more systems to find the best choice for batch work at the target nightly volume, say 10% of daily tweets (~50M).
- To get more than just exact matches, apply TF-IDF for more heuristic approaches over the data being processed.
- Add more support for emojis.
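As a rough illustration of the TF-IDF item above, the following sketch computes term weights over already-tokenized texts. It is an assumption-laden example rather than planned project code: the function name `tf_idf` is hypothetical, and it uses the plain unsmoothed variant (term frequency times `log(N / document frequency)`).

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {token: weight} mapping per tokenized document.

    Hypothetical helper; unsmoothed TF-IDF, so a token present in every
    document gets weight 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights
```

Tokens that appear in every tweet score zero, while tokens concentrated in a few tweets score higher, which is what makes the weights usable for ranking near matches.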
Further considerations
- Storage Design:
a1) For most files, I suggest keeping it simple and processing them in batch, writing either to a SQLite database (in this implementation, files under 10MB are handled by a single thread) or to a text file.
a2) For large files, where the gz is larger than 10MB, I suggest a versatile database server, e.g. PostgreSQL, so that both SQL and NoSQL data can be stored easily. Since my implementation processes each new partition of the jsonl.gz concurrently, such a server would not lock and would allow multiple threads to access the db concurrently. A lock would need to be implemented in the db if we store the number of occurrences of a term by node_id or, in the NoSQL case, append message_ids to a list. If we only record the occurrence, racing threads end up with the same result.
b) Create an index on the date of occurrence (Y-m-d) and another index on node_id, and query using a composite of both.
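The storage notes above can be sketched with SQLite from the standard library. Everything named here is an assumption made for illustration: the table `term_match`, its columns, the index names, and the helper functions are hypothetical and do not come from the package.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) match table and its indexes."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS term_match (
            occurred_on TEXT NOT NULL,     -- date stored as Y-m-d
            node_id     INTEGER NOT NULL,
            term        TEXT NOT NULL,
            message_id  TEXT NOT NULL,
            UNIQUE (node_id, term, message_id)
        )
        """
    )
    # Composite index for lookups by date and node, plus a node-only index.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_match_date_node "
        "ON term_match (occurred_on, node_id)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_match_node ON term_match (node_id)"
    )
    return conn


def record_match(conn: sqlite3.Connection, occurred_on: str, node_id: int,
                 term: str, message_id: str) -> None:
    # INSERT OR IGNORE is idempotent: recording the same occurrence twice
    # leaves a single row, so duplicate writers reach the same result.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO term_match VALUES (?, ?, ?, ?)",
            (occurred_on, node_id, term, message_id),
        )


def matches_for(conn: sqlite3.Connection, occurred_on: str,
                node_id: int) -> List[Tuple[str, str]]:
    rows = conn.execute(
        "SELECT term, message_id FROM term_match "
        "WHERE occurred_on = ? AND node_id = ? ORDER BY term, message_id",
        (occurred_on, node_id),
    )
    return rows.fetchall()
```

Storing a count per term instead of one row per occurrence would lose this idempotence, which is where the lock mentioned in a2) would come in.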
- In order to take immediate action, I would implement send-email functionality that is triggered when an exception is raised, with details about the error and instructions on how to fix it. The email would initially go to a support address and include the developer's email in case support cannot quickly figure it out. If a logging server (such as Splunk) is available, I would alert support by email based on the output in the log files. Please see my implementation of the logger.
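A minimal sketch of the alert email described above, using only the standard library. The function name `build_alert_email`, its parameters, and the message wording are hypothetical; actually sending the message (for example with `smtplib.SMTP(host).send_message(msg)`) depends on a mail server that is assumed here and not shown.

```python
import traceback
from email.message import EmailMessage


def build_alert_email(exc: BaseException, sender: str, support_addr: str,
                      developer_addr: str, hint: str) -> EmailMessage:
    """Build (but do not send) an alert email for a raised exception.

    Hypothetical helper for illustration; addresses and hint text are
    supplied by the caller.
    """
    msg = EmailMessage()
    msg["Subject"] = f"tweet_matcher failure: {type(exc).__name__}"
    msg["From"] = sender
    msg["To"] = support_addr  # support is the first point of contact
    body = "\n".join([
        f"Error: {exc}",
        f"Suggested fix: {hint}",
        f"If this does not resolve it, contact the developer: {developer_addr}",
        "",
        # Full traceback so support can forward it unchanged.
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    ])
    msg.set_content(body)
    return msg
```

The message would typically be built inside the top-level `except` block of the nightly run, next to the logging call.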
- If the files are simply being SFTPed to a given data directory, there is no need for an API. If we are streaming this data instead, I would implement a simple Sanic server that allows PUT and DELETE. A simple SQL database would suffice, but the requests should be async.
Endpoints: /node and /term
Slugs: and <int node_id/hashed(term)>
Tables: term and user
Index: categories for both, hashed term for the term table and node_id for the user table
- I would implement a stream using a pub/sub design pattern to handle tweets in real time if we are dealing with more than 1% of the daily Twitter feed. If batch processing is required instead, I would use a dedicated server with at least 64GB RAM and 1TB. These numbers are based on the jsonl.gz generated by my benchmark from the sample tweets.jsonl.gz. Please see ./benchmark for implementation details. A concurrent approach did not show many gains, given local storage and Python limitations. To test this, if your machine allows multithreading, generate large tweet files with ~8M tweets and 10% of node_ids randomly assigned from the node ids we have in the pool. This may take a few minutes for n > 50:
python benchmarks/create_random_tweets.py -n <multiplication_factor> -r <random_from_user_pool>
Generated tweet files are placed in the ./sample/tweets/ folder.
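The pub/sub idea mentioned above can be sketched in-process with asyncio queues. This is only an illustration of the pattern, not a proposal for the production design: the `Broker` class, the `match_worker` coroutine, the topic name, and the tweet fields `id` and `text` are all hypothetical, and a real deployment would use an external message broker.

```python
import asyncio
from collections import defaultdict
from typing import DefaultDict, List, Optional, Set, Tuple


class Broker:
    """Tiny in-process publish/subscribe broker keyed by topic name."""

    def __init__(self) -> None:
        self._queues: DefaultDict[str, List[asyncio.Queue]] = defaultdict(list)

    def subscribe(self, topic: str) -> asyncio.Queue:
        # Each subscriber gets its own queue, so every subscriber sees every message.
        queue: asyncio.Queue = asyncio.Queue()
        self._queues[topic].append(queue)
        return queue

    async def publish(self, topic: str, message: Optional[dict]) -> None:
        for queue in self._queues[topic]:
            await queue.put(message)


async def match_worker(queue: asyncio.Queue, terms: Set[str],
                       matches: List[Tuple[int, List[str]]]) -> None:
    """Consume tweets until a None sentinel arrives, recording exact term hits."""
    while True:
        tweet = await queue.get()
        if tweet is None:
            break
        hits = set(tweet["text"].lower().split()) & terms
        if hits:
            matches.append((tweet["id"], sorted(hits)))


async def demo() -> List[Tuple[int, List[str]]]:
    broker = Broker()
    queue = broker.subscribe("tweets")
    matches: List[Tuple[int, List[str]]] = []
    worker = asyncio.ensure_future(match_worker(queue, {"coffee"}, matches))
    await broker.publish("tweets", {"id": 1, "text": "Coffee time"})
    await broker.publish("tweets", {"id": 2, "text": "nothing here"})
    await broker.publish("tweets", None)  # sentinel: tell the worker to stop
    await worker
    return matches
```

Running `asyncio.run(demo())` publishes two sample tweets and returns the matches collected by the single subscriber.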
Hashes for tweet_matcher-1.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | ee7c22b7c2dd4008ee7f72662f73442aa0d89722112fbe7b25e215865630f478
MD5 | 79ac1a1a236b33b7a8f0d54a105aa6eb
BLAKE2b-256 | 4ffb2f5687a30375d3608b2717093fea8ebc0dfc414938cc5bc74b0d7194edaf