Package to load the resources required by the document_tracking package.

document_tracking_resources

This package is part of the news_tracking project. It provides the resources needed to prepare data and to run the document tracking algorithms.

Installation

pip install document_tracking_resources

Data format

Document tracking algorithms use two classes, NewsCorpus and TwitterCorpus, for news articles (with title and content) and tweets respectively. A corpus is best seen as a table whose columns fall into at least three categories:

  • Document characteristics: the columns that describe the document itself: date of publication, title, text, source, etc. See the DOCUMENT_COLUMNS property.
  • Document features: the computed features (TF-IDF weightings or dense representations, for instance) that represent each dimension of the documents (title, text or both) in the original corpus. See the FEATURE_COLUMNS property.
  • Document cluster: the ground truth cluster id of each document, used to train and test algorithms on the corpus. See the GROUND_TRUTH_COLUMNS property.
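To make the table layout concrete, here is a minimal sketch in plain pandas. It does not use the package's own classes. The column names (date, title, text, source, f_title, f_text, cluster) and the three lists are hypothetical stand-ins for whatever DOCUMENT_COLUMNS, FEATURE_COLUMNS and GROUND_TRUTH_COLUMNS actually contain.

```python
import pandas as pd

# Hypothetical column groups, chosen for illustration only; the real lists
# are exposed by the corpus classes and may differ.
document_columns = ["date", "title", "text", "source"]
feature_columns = ["f_title", "f_text"]
ground_truth_columns = ["cluster"]

# One row per document, one column per characteristic, feature or label.
table = pd.DataFrame(
    {
        "date": pd.to_datetime(["2022-08-30", "2022-08-31"]),
        "title": ["First headline", "Second headline"],
        "text": ["Body of the first article.", "Body of the second article."],
        "source": ["agency-a", "agency-b"],
        "f_title": [[0.1, 0.2], [0.3, 0.4]],
        "f_text": [[0.5, 0.6], [0.7, 0.8]],
        "cluster": [0, 1],
    }
)

# Each category is a plain column selection on the same table.
documents = table[document_columns]
features = table[feature_columns]
labels = table[ground_truth_columns]
```

Keeping the three groups in one table means a document, its features and its ground truth label always share the same row index.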

This data format relies on the pandas library, and in particular on its DataFrame data structure. A corpus is generally saved in .pickle format, which makes the structure easy to save and reload. The Corpus API provides functions to load a DataFrame stored in .pickle format as a Corpus.
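The pickle round trip itself can be sketched with pandas alone. This example only calls pandas.DataFrame.to_pickle and pandas.read_pickle. It does not show the Corpus loading functions, and the file name and columns are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in table; the columns are illustrative, not the package's schema.
table = pd.DataFrame(
    {"title": ["First headline", "Second headline"], "cluster": [0, 1]}
)

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "corpus.pickle")  # hypothetical file name
    table.to_pickle(path)            # serialise the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back as a DataFrame

# The restored table has the same values, dtypes and index as the original.
same = restored.equals(table)
```

Pickle files can execute arbitrary code when loaded, so only load files from sources you trust.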

Features format

The features of a Corpus are stored in columns named after the FEATURE_COLUMNS list. This package supports two kinds of features:

  • Sparse: generally produced by a TF-IDF weighting model. For each dimension (title, text or both), the sparse representation is saved as a mapping between each weighted feature (a token, for instance) and its weight. Each dimension is therefore a dictionary of all its terms and their weightings.
  • Dense: vectors of equal size that represent each dimension of the original document (title, text or both) as a vector of numbers.
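The difference between the two kinds of features can be sketched with plain Python values. The tokens, weights and vector sizes below are invented for illustration. The two helper functions are hypothetical and are not part of this package; they only show that sparse features are compared over shared keys, while dense features are compared position by position.

```python
# Sparse: one dictionary per dimension, mapping each token to its weight.
sparse_title = {"election": 0.62, "results": 0.41}
sparse_text = {"election": 0.35, "vote": 0.28, "results": 0.17}

# Dense: one fixed-size vector of numbers per dimension.
dense_title = [0.12, -0.40, 0.88]
dense_text = [0.05, 0.31, -0.27]


def sparse_dot(left, right):
    """Dot product over the tokens that both sparse mappings share."""
    return sum(
        weight * right[token] for token, weight in left.items() if token in right
    )


def dense_dot(left, right):
    """Dot product of two dense vectors, which must have the same length."""
    if len(left) != len(right):
        raise ValueError("dense vectors must have the same size")
    return sum(a * b for a, b in zip(left, right))


title_text_sparse = sparse_dot(sparse_title, sparse_text)
title_text_dense = dense_dot(dense_title, dense_text)
```

A token that is missing from a sparse mapping counts as a zero weight, which is why only the shared tokens contribute to the result.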
