Package to load the resources required by the document_tracking package.

document_tracking_resources

This package is part of the news_tracking project. It provides the resources needed to run the algorithms and to prepare their data.

Data format

Document tracking algorithms use two classes, NewsCorpus and TwitterCorpus, for news articles (with title and content) and tweets respectively. A corpus can be viewed as a table whose columns fall into at least three categories:

  • Document characteristics: the columns that describe the document itself, such as date of publication, title, text and source. See the DOCUMENT_COLUMNS property.
  • Document features: the computed features (TF-IDF weightings or dense representations, for instance) that represent the different dimensions of the documents (title, text or both) in the original corpus. See the FEATURE_COLUMNS property.
  • Document cluster: the ground truth cluster id of each document, used for training and testing on the corpus. See the GROUND_TRUTH_COLUMNS property.
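The three categories can be pictured with a plain pandas table. This is only a sketch: the column names below (date, f_title, cluster, and so on) are hypothetical stand-ins chosen for illustration, and the real names are whatever the DOCUMENT_COLUMNS, FEATURE_COLUMNS and GROUND_TRUTH_COLUMNS properties define. The package's own Corpus classes are not used here.

```python
import pandas as pd

# Hypothetical column groups, for illustration only. The actual names come
# from the package's DOCUMENT_COLUMNS, FEATURE_COLUMNS and
# GROUND_TRUTH_COLUMNS properties.
document_columns = ["date", "title", "text", "source"]
feature_columns = ["f_title", "f_text", "f_all"]
ground_truth_columns = ["cluster"]

# Two made-up documents. Each feature cell holds a sparse mapping from a
# token to its weight; the values are invented.
rows = [
    {
        "date": "2022-03-28",
        "title": "Storm hits the coast",
        "text": "A storm reached the coast on Monday.",
        "source": "agency_a",
        "f_title": {"storm": 0.7, "coast": 0.3},
        "f_text": {"storm": 0.5, "coast": 0.3, "monday": 0.2},
        "f_all": {"storm": 0.6, "coast": 0.3, "monday": 0.1},
        "cluster": 0,
    },
    {
        "date": "2022-03-29",
        "title": "Election results announced",
        "text": "The results were announced on Tuesday.",
        "source": "agency_b",
        "f_title": {"election": 0.6, "results": 0.4},
        "f_text": {"results": 0.5, "tuesday": 0.5},
        "f_all": {"election": 0.3, "results": 0.5, "tuesday": 0.2},
        "cluster": 1,
    },
]

corpus_table = pd.DataFrame(rows)
corpus_table["date"] = pd.to_datetime(corpus_table["date"])

# Each category is simply a subset of the columns of the same table.
documents = corpus_table[document_columns]
features = corpus_table[feature_columns]
ground_truth = corpus_table[ground_truth_columns]
```

Keeping all three categories in one table means a document, its features and its ground truth label always share the same row index.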

This data format relies on the pandas library, especially its DataFrame data structure. A corpus is generally saved in .pickle format, which makes the structure easy to save and load. The Corpus API provides functions to load a DataFrame stored in .pickle format as a Corpus.
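As a rough illustration of the .pickle round trip, the sketch below saves and reloads a small DataFrame using only pandas. The Corpus API's own loading functions are not shown because their names are not given in this description; the table contents and the file name corpus.pickle are made up.

```python
import os
import tempfile

import pandas as pd

# A tiny made-up table with one sparse feature column (dictionary cells).
table = pd.DataFrame(
    {
        "title": ["Storm hits the coast", "Election results announced"],
        "f_title": [{"storm": 0.7, "coast": 0.3}, {"election": 0.6, "results": 0.4}],
        "cluster": [0, 1],
    }
)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "corpus.pickle")  # hypothetical file name
    table.to_pickle(path)            # serialise the whole DataFrame
    restored = pd.read_pickle(path)  # load it back, dictionary cells included
```

Pickle keeps Python objects such as dictionaries intact inside the cells, which a flat text format like CSV would not. Only load .pickle files from sources you trust, since unpickling can execute arbitrary code.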

Features format

The features of the Corpus are columns named after the FEATURE_COLUMNS list. This package can handle two kinds of features:

  • Sparse: generally produced by a TF-IDF weighting model. For each dimension (title, text or both), the sparse representation is saved as a mapping between each weighted feature (a token, for instance) and its weight. Each dimension is thus a dictionary of terms and their weightings.
  • Dense: vectors of equal size that represent each dimension of the original document (title, text or both) as a vector of numbers.
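To make the difference between the two formats concrete, here is a minimal sketch of how each might be consumed. Cosine similarity is used purely as an illustration; this description does not say which measure the tracking algorithms use. The helper names sparse_cosine and dense_cosine are hypothetical and not part of the package.

```python
import math


def sparse_cosine(a, b):
    """Cosine similarity between two sparse term-to-weight dictionaries."""
    # Only terms present in both dictionaries contribute to the dot product.
    dot = sum(a[term] * b[term] for term in set(a) & set(b))
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def dense_cosine(a, b):
    """Cosine similarity between two dense vectors of equal size."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same size")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Sparse: one dictionary per dimension, keyed by term (weights are made up).
title_a = {"storm": 0.7, "coast": 0.3}
title_b = {"storm": 0.5, "rain": 0.5}

# Dense: fixed-size vectors of numbers (values are made up).
vector_a = [0.1, 0.4, 0.5]
vector_b = [0.2, 0.3, 0.5]

sparse_score = sparse_cosine(title_a, title_b)
dense_score = dense_cosine(vector_a, vector_b)
```

The sparse version has to match terms by key because two documents rarely share the same vocabulary, while the dense version can rely on positions since all vectors have the same size.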

