Utility to compute sparse TF-IDF vector representations for datasets in the document_tracking_resources format, based on a features file.

Project description

Compute TF-IDF weights for texts

Compute TF-IDF weights for the tokens, lemmas, entities, etc. of your datasets, from a file that contains features.

Pre-requisites

Dependencies

This project relies on two other packages: document_tracking_resources and document_processing. This code needs access to both packages.
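
A minimal installation sketch (assuming the dependency packages are published somewhere pip can reach, which this page does not guarantee):

pip install compute_tf_idf_vectors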

Data

spaCy resources

As this package uses the document_processing package, spaCy models are required to process the documents of the corpus. You therefore need to install the spaCy models for your languages.

The following mapping is used; download the corresponding models for the languages you want to process.

model_names = {
    "deu": "de_core_news_md",
    "spa": "es_core_news_md",
    "eng": "en_core_web_md",
}
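
For instance, the models above can be fetched and loaded as follows. This is a minimal sketch using spaCy's standard download and load calls; the package does not download models for you:

import spacy

model_names = {
    "deu": "de_core_news_md",
    "spa": "es_core_news_md",
    "eng": "en_core_web_md",
}

# Download each model once (equivalent to `python -m spacy download <name>`)...
for model in model_names.values():
    spacy.cli.download(model)

# ...after which loading succeeds.
nlp = spacy.load(model_names["eng"])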

The feature files

This program computes TF-IDF vectors according to an external database of document features. You can produce such feature files with the two projects below:

  • twitter_tf_idf_dataset: used to create a features document to compute vectors, using a database of thousands of Tweets from many press agencies and online newspapers.
  • news_tf_idf_dataset: used to create a features document to compute vectors, using thousands of scraped Deutsche Welle articles from which the content is extracted.

Those two projects will help you produce the document required by the --features-file option of the compute_tf_idf_weights_of_corpus.py program.

Please note that all the languages of the original corpus must be present in the same file, with a lang column indicating which features belong to which language.

Each document's text content (title, text, or both) should have at least three features: tokens, entities, and lemmas. If the corpus contains only a text field, the features file will contain, for instance, tokens_text, lemmas_text and entities_text.

Below is the header of the features file:

,tokens_title,lemmas_title,entities_title,tokens_text,lemmas_text,entities_text,lang
[…]
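
For illustration only (this is not this package's internal implementation): a features file with the header above could feed one TF-IDF model per language, for example with scikit-learn. The file name, the scikit-learn usage, and the assumption that each cell holds whitespace-separated tokens are illustrative:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

features = pd.read_csv("features.csv", index_col=0)  # header as shown above

# Vocabularies differ across languages, so fit one model per language.
vectorizers = {}
for lang, subset in features.groupby("lang"):
    # analyzer=str.split treats each cell as pre-tokenised, space-separated text.
    vectorizer = TfidfVectorizer(analyzer=str.split)
    vectorizer.fit(subset["tokens_text"].dropna())
    vectorizers[lang] = vectorizer

# A new document's tokens then map to a sparse TF-IDF vector.
vector = vectorizers["eng"].transform(["novel coronavirus outbreak"])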

The corpus to process

The script can process two different types of Corpus from document_tracking_resources: one for news (NewsCorpusWithSparseFeatures) and one for Tweets (TwitterCorpusWithSparseFeatures). The data files must be loadable by document_tracking_resources for this project to work.

For instance, below is an example of a TwitterCorpusWithSparseFeatures:

                                         date lang                                text               source  cluster
1218234203361480704 2020-01-17 18:10:42+00:00  eng  Q: What is a novel #coronavirus...      Twitter Web App   100141
1218234642186297346 2020-01-17 18:12:27+00:00  eng  Q: What is a novel #coronavirus...                IFTTT   100141
1219635764536889344 2020-01-21 15:00:00+00:00  eng  A new type of #coronavirus     ...            TweetDeck   100141
...                                       ...  ...                                 ...                  ...      ...
1298960028897079297 2020-08-27 12:26:19+00:00  eng  So you come in here WITHOUT A M...   Twitter for iPhone   100338
1310823421014573056 2020-09-29 06:07:12+00:00  eng  Vitamin and mineral supplements...            TweetDeck   100338
1310862653749952512 2020-09-29 08:43:05+00:00  eng  FACT: Vitamin and mineral suppl...  Twitter for Android   100338

And an example of a NewsCorpusWithSparseFeatures:

                              date lang                     title               text                     source  cluster
24290965 2014-11-02 20:09:00+00:00  spa  Ponta gana la prim   ...  Las encuestas...                    Publico     1433
24289622 2014-11-02 20:24:00+00:00  spa  La cantante Katie Mel...  La cantante b...          La Voz de Galicia      962
24290606 2014-11-02 20:42:00+00:00  spa  Los sondeos dan ganad...  El Tribunal  ...                    RTVE.es     1433
...                            ...  ...                       ...               ...                        ...      ...
47374787 2015-08-27 12:32:00+00:00  deu  Microsoft-Betriebssys...  San Francisco...               Handelsblatt      170
47375011 2015-08-27 12:44:00+00:00  deu  Microsoft-Betriebssy ...  San Francisco...               WiWo Gründer      170
47394969 2015-08-27 20:35:00+00:00  deu  Windows 10: Mehr als ...  In zwei Tagn ...                  gamona.de      170
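
The prints above look like pandas DataFrames, so a quick sanity check of a corpus pickle can be done with pandas. This assumes the pickle deserialises to a DataFrame-like object, which is not guaranteed by document_tracking_resources:

import pandas as pd

corpus = pd.read_pickle("corpus.pickle")  # illustrative file name
print(corpus[["date", "lang", "cluster"]].head())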

Command line arguments

Once installed, the compute_tf_idf_vectors command can be used directly, as it is registered in your PATH.

usage: compute_tf_idf_vectors [-h] --corpus CORPUS --dataset-type {twitter,news} --features-file FEATURES_FILE --output-corpus OUTPUT_CORPUS

Take a document corpus (in pickle format) and perform TF-IDF lookup in order to extract the feature weights.

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to the pickle file containing the corpus to process.
  --dataset-type {twitter,news}
                        The kind of dataset to process: 'twitter' will use the 'TwitterCorpusWithSparseFeatures' class, 'news' the 'NewsCorpusWithSparseFeatures' class.
  --features-file FEATURES_FILE
                        Path to the CSV file that contains the learning document features in all languages.
  --output-corpus OUTPUT_CORPUS
                        Path where to export the new corpus with computed TF-IDF vectors.
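
A typical invocation therefore looks like this (all file names are illustrative):

compute_tf_idf_vectors \
    --corpus tweets_corpus.pickle \
    --dataset-type twitter \
    --features-file features.csv \
    --output-corpus tweets_corpus_with_vectors.pickle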
