Project description

Tweet Parser and Matcher

Requirements:

Python version >= 3.7.9

Steps to run program:

  1. Install dependencies:
pip install -r requirements.txt
  2. In the project directory, install the package (built with wheel):
pip install .
  3. Try runs with different commands:
# Parsing directly from gz using given sample
python -m tweet_matcher.matcher -nd ./sample/nodes_2/ -td ./sample/terms_2_3/ -tw ./sample/tweets/tweets.jsonl.gz -od ./output/
# Concurrent partitioning into 400-tweet files on the 10x sample, with 10% random user ids added
python -m tweet_matcher.matcher -nd ./sample/nodes_2/ -td ./sample/terms_2_3/ -tw ./sample/tweets/tweets_x_10_r_10.jsonl.gz  -od ./output/ -cc 1 -p 400

  4. Run the command below for more options:
python -m tweet_matcher.matcher -h
  5. Check the logs folder for details from tests and runs

Steps to run nightly:

  1. Verify with the upstream developers what time the files are expected to be SFTPed to the endpoint. Say 9 PM.
  2. Set up a cron job with the required parameters (a hedged wrapper sketch follows below):
0 21 * * * python -m tweet_matcher.matcher -tw <tweet_file_dir> -nd <nodes_dir> -td <terms_directory> -od <output_dir>
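
As a rough sketch, the cron entry could instead call a small wrapper that first checks whether the nightly file has actually arrived. The paths, directories and file naming below are assumptions, not part of the package:

# nightly_run.py -- hypothetical wrapper; paths and naming are assumptions
import subprocess
import sys
from pathlib import Path

TWEETS_DIR = Path("/data/tweets")    # assumed SFTP landing directory
OUTPUT_DIR = Path("/data/output")

def main():
    # Pick the most recently delivered tweets file, if any
    latest = max(TWEETS_DIR.glob("*.jsonl.gz"), default=None,
                 key=lambda p: p.stat().st_mtime)
    if latest is None:
        print("No tweets file found; the upstream SFTP may be late", file=sys.stderr)
        sys.exit(1)
    subprocess.run([
        sys.executable, "-m", "tweet_matcher.matcher",
        "-tw", str(latest),
        "-nd", "/data/nodes/", "-td", "/data/terms/",
        "-od", str(OUTPUT_DIR),
    ], check=True)

if __name__ == "__main__":
    main()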

TODO:

  1. Explore streaming for real-time processing of tweet data and reuse part of the concurrent implementation to send async requests to another API.
  2. Implement an API for updating nodes and terms and for retrieving matches.
  3. Add a preprocessing step to filter tweets by node_id.
  4. Improve the benchmark to randomize groups of words in the generated tweet text. Add generation of node_ids and terms.
  5. Benchmark more systems and settle on the best choice for the nightly batch target, say 10% of daily tweets (~50M).
  6. To get more than just exact matches, apply TF-IDF for more heuristic matching over the data being processed (see the sketch after this list).
  7. Add more support for emojis.
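
For item 6, a minimal sketch of the TF-IDF idea, assuming scikit-learn is available; the terms, tweet texts and the 0.5 threshold are illustrative only and not part of the package:

# Hypothetical sketch of TF-IDF based fuzzy matching (not the package's matcher)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

terms = ["climate change", "election results"]               # illustrative terms
tweets = ["big climate changes ahead", "dinner was great"]   # illustrative tweet texts

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
term_matrix = vectorizer.fit_transform(terms)
tweet_matrix = vectorizer.transform(tweets)

# Rows are tweets, columns are terms; scores above a threshold count as heuristic matches
scores = cosine_similarity(tweet_matrix, term_matrix)
for tweet, row in zip(tweets, scores):
    for term, score in zip(terms, row):
        if score > 0.5:
            print(f"match: {tweet!r} ~ {term!r} ({score:.2f})")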

Further considerations

  1. Storage Design:

a1) For most files, I suggest keeping it simple and processing in batch, feeding the results either into an SQLite database (in the implementation this is done by a single thread when the file is under 10MB) or into a plain text file.

a2) For large files, where the gz is greater than 10MB, I suggest a versatile database server, e.g. PostgreSQL, so both SQL and NoSQL-style data can be stored easily. Since my implementation processes each new partition of the jsonl.gz concurrently, the database would not lock, allowing multiple threads to access it at the same time. A lock would only be needed in the database if we stored the number of occurrences of a term per node_id, or, in a NoSQL setup, appended message_ids to a list. Since we only record the occurrence itself, racing threads end up with the same result.

b) Create an index on the date of occurrence using Y-m-d and another index on node_id, and access the data using a composite of both indexes. A minimal schema sketch follows below.
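
As a sketch of the small-file case, using the standard-library sqlite3 module; the table and column names here are assumptions, not part of the package:

# Hypothetical occurrence schema; table and column names are assumptions
import sqlite3

conn = sqlite3.connect("matches.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS occurrence (
    node_id     INTEGER NOT NULL,
    term        TEXT    NOT NULL,
    message_id  TEXT    NOT NULL,
    occurred_on TEXT    NOT NULL          -- Y-m-d
);
-- Index on date and on node_id, queried together as described in (b)
CREATE INDEX IF NOT EXISTS idx_occurrence_date ON occurrence (occurred_on);
CREATE INDEX IF NOT EXISTS idx_occurrence_node ON occurrence (node_id);
""")
conn.commit()

# Example lookup combining both indexes
rows = conn.execute(
    "SELECT term, message_id FROM occurrence WHERE occurred_on = ? AND node_id = ?",
    ("2020-10-01", 42),
).fetchall()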

  2. In order to take immediate action, I would implement send-email functionality triggered when an exception is raised, with instructions about the error and how to fix it. It would initially go to a support email, with the developer's email included in case support cannot quickly figure it out. If a logging server (such as Splunk) is available, I would alert support by email based on the output in the log files. Please see my implementation of the logger; a minimal alerting sketch follows below.
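
A minimal sketch of the alerting idea using the standard library's SMTPHandler; the SMTP host and addresses are placeholders, and this is not the package's actual logger:

# Hypothetical error-alerting handler; host and addresses are placeholders
import logging
from logging.handlers import SMTPHandler

smtp_handler = SMTPHandler(
    mailhost=("smtp.example.com", 25),
    fromaddr="tweet-matcher@example.com",
    toaddrs=["support@example.com"],
    subject="tweet_matcher: unhandled exception",
)
smtp_handler.setLevel(logging.ERROR)

logger = logging.getLogger("tweet_matcher")
logger.addHandler(smtp_handler)

try:
    raise RuntimeError("example failure")
except RuntimeError:
    # exc_info attaches the traceback so support sees the error and context
    logger.error("Nightly run failed; see traceback and runbook", exc_info=True)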

  3. If the files are simply being SFTPed to a given data directory, there is no need for an API. But if we are streaming this data, I would implement a simple Sanic server allowing PUT and DELETE; a simple SQL database would suffice, with the requests made async. Endpoints: /node and /term. Slugs: <int:node_id> and <hashed(term)>. Tables: term and user. Indexes: hashed term for the term table and node_id for the user table. A rough sketch follows below.
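
A rough sketch of such a server, assuming Sanic is installed; the route names, handlers and in-memory storage below are illustrative stand-ins, not the package's API:

# Hypothetical Sanic server for node/term updates; sets stand in for the SQL tables
from sanic import Sanic
from sanic.response import json

app = Sanic("tweet_matcher_api")
nodes, terms = set(), set()

@app.put("/node/<node_id:int>")
async def put_node(request, node_id):
    nodes.add(node_id)
    return json({"node_id": node_id, "status": "created"})

@app.delete("/node/<node_id:int>")
async def delete_node(request, node_id):
    nodes.discard(node_id)
    return json({"node_id": node_id, "status": "deleted"})

@app.put("/term/<hashed_term>")
async def put_term(request, hashed_term):
    terms.add(hashed_term)
    return json({"term": hashed_term, "status": "created"})

@app.delete("/term/<hashed_term>")
async def delete_term(request, hashed_term):
    terms.discard(hashed_term)
    return json({"term": hashed_term, "status": "deleted"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)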

  4. I would implement a stream using a pub/sub design pattern to deal with tweets in real time if we are handling more than 1% of Twitter's daily feed. If batch processing is required instead, I would use a dedicated server with at least 64GB of RAM and 1TB of storage. These numbers are based on the jsonl.gz files generated by my benchmark from the sample tweets.jsonl.gz; please see ./benchmark for implementation details. A concurrent approach did not show many gains with local storage, due to Python limitations. To test this, if your machine allows multithreading, generate large tweet files with ~8M tweets and 10% of node_ids randomly assigned to node ids that we have in the pool (this may take a few minutes for n > 50):

python benchmarks/create_random_tweets.py -n <multiplication_factor> -r <random_from_user_pool>

Generated tweet files are placed in the ./sample/tweets/ folder. A minimal pub/sub sketch follows below.
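
A minimal in-process sketch of the pub/sub idea using asyncio queues; in production this would sit on a message broker such as Redis or Kafka, and the sample payloads are illustrative:

# Hypothetical pub/sub sketch; a real deployment would use a message broker
import asyncio
import json

async def publisher(queue):
    # Stand-in for the live tweet stream
    for i in range(3):
        await queue.put(json.dumps({"id": i, "text": f"sample tweet {i}"}))
    await queue.put(None)   # sentinel: stream finished

async def subscriber(queue):
    while True:
        raw = await queue.get()
        if raw is None:
            break
        tweet = json.loads(raw)
        # Here the matcher would check the tweet text against nodes/terms
        print("received", tweet["id"], tweet["text"])

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(publisher(queue), subscriber(queue))

asyncio.run(main())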
