Package to load the resources required by the document_tracking package.
document_tracking_resources
This package is part of the news_tracking project. It provides the resources needed to prepare data and to run the document tracking algorithms.
Installation
pip install document_tracking_resources
Data format
Document tracking algorithms use two classes, `NewsCorpus` and `TwitterCorpus`, for news articles (with title and content) and tweets respectively. A corpus can be seen as a table whose columns fall into at least three categories:
- Document characteristics: the columns that describe the document itself: date of publication, title, text, source, etc. See the `DOCUMENT_COLUMNS` property.
- Document features: the computed features (TF-IDF weightings or dense representations, for instance) that represent the different dimensions of each document (title, text or both). See the `FEATURE_COLUMNS` property.
- Document cluster: the ground truth cluster id of each document, used for training and testing on the corpus. See the `GROUND_TRUTH_COLUMNS` property.
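As a rough illustration of the three column categories, the sketch below builds a small pandas table by hand. The column names (`date`, `f_title`, `cluster`, and so on) are assumptions made for this example; the actual names are the ones listed in the `DOCUMENT_COLUMNS`, `FEATURE_COLUMNS` and `GROUND_TRUTH_COLUMNS` properties of the package.

```python
import pandas as pd

# Hypothetical column groups, chosen for this example only. The real lists
# are exposed by the package's *_COLUMNS properties.
document_columns = ["date", "title", "text", "source"]
feature_columns = ["f_title", "f_text", "f_all"]
ground_truth_columns = ["cluster"]

rows = [
    {
        "date": "2022-08-30",
        "title": "Storm hits the coast",
        "text": "A storm made landfall on Tuesday.",
        "source": "example.org",
        "f_title": [0.1, 0.3],
        "f_text": [0.2, 0.4],
        "f_all": [0.15, 0.35],
        "cluster": 0,
    },
    {
        "date": "2022-08-31",
        "title": "Coastal towns assess storm damage",
        "text": "Residents returned home on Wednesday.",
        "source": "example.com",
        "f_title": [0.1, 0.2],
        "f_text": [0.3, 0.4],
        "f_all": [0.2, 0.3],
        "cluster": 0,
    },
]

# One row per document, one column per characteristic, feature or label.
corpus_table = pd.DataFrame(
    rows, columns=document_columns + feature_columns + ground_truth_columns
)

# Each category can then be selected as a sub-table.
documents = corpus_table[document_columns]
features = corpus_table[feature_columns]
labels = corpus_table[ground_truth_columns]
```

Both rows share the same `cluster` value here, meaning the two documents are labelled as belonging to the same ground truth cluster.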
This data format relies on the `pandas` library, especially the `DataFrame` data structure. A corpus is generally saved in `.pickle` format, which makes the structure easy to save and load. The `Corpus` API provides functions to load a `DataFrame` saved in `.pickle` format as a `Corpus`.
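The pickle round trip can be sketched with plain pandas. This example deliberately does not call the `Corpus` API, whose loading functions are not shown in this README; it only uses `DataFrame.to_pickle` and `pandas.read_pickle`. The file name and column names are assumptions made for the example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# A tiny table standing in for a corpus (hypothetical columns).
table = pd.DataFrame(
    {
        "title": ["Storm hits the coast", "Markets rally"],
        "text": ["A storm made landfall.", "Stocks rose on Monday."],
        "cluster": [0, 1],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "corpus.pickle"
    table.to_pickle(path)            # save the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back as a DataFrame

# The restored table has the same content as the one that was saved.
print(restored.equals(table))
```

Pickle files can execute arbitrary code when loaded, so only load files from sources you trust.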
Features format
The features of a `Corpus` are the columns named in the `FEATURES_COLUMNS` list. This package handles two kinds of features:
- Sparse: generally produced by a TF-IDF weighting model. For each dimension (title, text or both), the sparse representation is saved as a mapping between each weighted feature (a token, for instance) and its weight. Each dimension is thus a dictionary of all the terms and their weightings.
- Dense: vectors of equal size that represent each dimension of the original document (title, text or both) as a vector of numbers.
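To make the difference between the two representations concrete, here is a minimal sketch in plain Python. The weights, tokens and vector values are made up, and the dot-product helpers are hypothetical illustrations of how each shape can be compared, not functions provided by this package.

```python
# Sparse: one dictionary per dimension, mapping each token to its weight.
sparse_title_a = {"storm": 0.71, "coast": 0.52}
sparse_title_b = {"storm": 0.40, "flood": 0.83}

# Dense: one fixed-size vector of numbers per dimension.
dense_title_a = [0.12, -0.40, 0.88]
dense_title_b = [0.10, -0.35, 0.90]


def sparse_dot(a: dict, b: dict) -> float:
    """Dot product over the tokens present in both dictionaries."""
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def dense_dot(a: list, b: list) -> float:
    """Dot product of two vectors, which must have the same size."""
    if len(a) != len(b):
        raise ValueError("dense vectors must have the same size")
    return sum(x * y for x, y in zip(a, b))


sparse_score = sparse_dot(sparse_title_a, sparse_title_b)
dense_score = dense_dot(dense_title_a, dense_title_b)
```

Sparse dictionaries may have different keys from one document to the next, so only shared tokens contribute to the score, whereas dense vectors must all have the same length.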