
Utility to compute dense vector representations for datasets in the document_tracking_resources format, based on dense transformer models.

Project description

Compute dense representations of texts

Compute dense vector representations for the tokens, lemmas, entities, etc. of your datasets, from a file that contains these features.

The idea of computing dense representations of documents is inspired by previous work:

Reimers, Nils, and Iryna Gurevych. 2019. "Sentence-BERT: Sentence Embeddings
Using Siamese BERT-Networks". In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing and the 9th International
Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-92.
Hong Kong, China: Association for Computational Linguistics.
https://doi.org/10.18653/v1/D19-1410.
Linger, Mathis, and Mhamed Hajaiej. 2020. "Batch Clustering for Multilingual
News Streaming". In Proceedings of Text2Story - Third Workshop on Narrative
Extraction From Texts Co-Located with the 42nd European Conference on
Information Retrieval, 2593:55-61. CEUR Workshop Proceedings. Lisbon,
Portugal. http://ceur-ws.org/Vol-2593/paper7.pdf.
Staykovski, Todor, Alberto Barron-Cedeno, Giovanni da San Martino, and Preslav
Nakov. 2019. "Dense vs. Sparse Representations for News Stream Clustering".
In Proceedings of Text2Story - 2nd Workshop on Narrative Extraction From
Texts, Co-Located with the 41st European Conference on Information
Retrieval, 2342:47-52. Cologne, Germany: CEUR-WS.org.
https://ceur-ws.org/Vol-2342/paper6.pdf.

Prerequisites

Dependencies

This project relies on two other packages: document_tracking_resources, which provides the corpus classes this code needs access to, and sentence-transformers, which is used to compute the dense representations of documents.

Transformers Models

To compute dense representations of documents, the sentence-transformers package is used with two multilingual models: paraphrase-multilingual-mpnet-base-v2 and distiluse-base-multilingual-cased-v1. According to the documentation, and at the time of writing, these are the two models that give the best results on multilingual semantic similarity.
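
As an illustration only (this snippet is not part of the package), the underlying sentence-transformers call with one of the models named above looks like this:

    from sentence_transformers import SentenceTransformer

    # Load one of the multilingual models mentioned above.
    model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

    # encode() returns one dense vector (a NumPy array) per input text,
    # regardless of the language of the text.
    embeddings = model.encode([
        "A new type of #coronavirus has been identified.",
        "Un nouveau type de coronavirus a été identifié.",
    ])
    print(embeddings.shape)  # e.g. (2, 768) for this model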

The corpus to process

The script can process two different types of Corpus from document_tracking_resources: one for news (NewsCorpusWithSparseFeatures) and one for tweets (TwitterCorpusWithSparseFeatures). The data files must be loadable by document_tracking_resources for this project to work.

For instance, below is an example of a TwitterCorpusWithSparseFeatures:

                                         date lang                                text               source  cluster
1218234203361480704 2020-01-17 18:10:42+00:00  eng  Q: What is a novel #coronavirus...      Twitter Web App   100141
1218234642186297346 2020-01-17 18:12:27+00:00  eng  Q: What is a novel #coronavirus...                IFTTT   100141
1219635764536889344 2020-01-21 15:00:00+00:00  eng  A new type of #coronavirus     ...            TweetDeck   100141
...                                       ...  ...                                 ...                  ...      ...
1298960028897079297 2020-08-27 12:26:19+00:00  eng  So you come in here WITHOUT A M...   Twitter for iPhone   100338
1310823421014573056 2020-09-29 06:07:12+00:00  eng  Vitamin and mineral supplements...            TweetDeck   100338
1310862653749952512 2020-09-29 08:43:05+00:00  eng  FACT: Vitamin and mineral suppl...  Twitter for Android   100338

And an example of a NewsCorpusWithSparseFeatures:

                              date lang                     title               text                     source  cluster
24290965 2014-11-02 20:09:00+00:00  spa  Ponta gana la prim   ...  Las encuestas...                    Publico     1433
24289622 2014-11-02 20:24:00+00:00  spa  La cantante Katie Mel...  La cantante b...          La Voz de Galicia      962
24290606 2014-11-02 20:42:00+00:00  spa  Los sondeos dan ganad...  El Tribunal  ...                    RTVE.es     1433
...                            ...  ...                       ...               ...                        ...      ...
47374787 2015-08-27 12:32:00+00:00  deu  Microsoft-Betriebssys...  San Francisco...               Handelsblatt      170
47375011 2015-08-27 12:44:00+00:00  deu  Microsoft-Betriebssy ...  San Francisco...               WiWo Gründer      170
47394969 2015-08-27 20:35:00+00:00  deu  Windows 10: Mehr als ...  In zwei Tagn ...                  gamona.de      170
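
The loading API itself is provided by document_tracking_resources and is not reproduced here. The sketch below only illustrates the general idea of attaching a dense representation to one textual feature of such a tabular corpus; the plain pandas DataFrame and the "text_dense" column name are stand-ins chosen for this example, not the actual classes or column names used by the package.

    import pandas as pd
    from sentence_transformers import SentenceTransformer

    # Stand-in for a TwitterCorpusWithSparseFeatures: same columns as the
    # example above, but a plain DataFrame instead of the
    # document_tracking_resources class.
    corpus = pd.DataFrame(
        {
            "date": ["2020-01-17 18:10:42+00:00", "2020-01-21 15:00:00+00:00"],
            "lang": ["eng", "eng"],
            "text": ["Q: What is a novel #coronavirus?", "A new type of #coronavirus..."],
            "source": ["Twitter Web App", "TweetDeck"],
            "cluster": [100141, 100141],
        }
    )

    model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

    # One dense vector per document, stored alongside the original columns.
    corpus["text_dense"] = list(model.encode(corpus["text"].tolist()))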

Command line arguments

Once the package is installed, the compute_dense_vectors command can be used directly, as it is registered in your PATH.

usage: compute_dense_vectors [-h] --corpus CORPUS --dataset-type {twitter,news} [--model-name MODEL_NAME] --output-corpus OUTPUT_CORPUS

Take a document corpus (in pickle format) and compute dense vectors for every feature

optional arguments:
  -h, --help            show this help message and exit
  --corpus CORPUS       Path to the pickle file containing the corpus to process.
  --dataset-type {twitter,news}
                        The kind of dataset to process. 'twitter' will use the 'TwitterCorpus' class, the 'Corpus' class otherwise
  --model-name MODEL_NAME
                        The name of the model that can be used to encode sentences using the S-BERT library
  --output-corpus OUTPUT_CORPUS
                        Path where to export the new corpus with computed dense vectors.
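
For example, a corpus exported as a pickle file could be processed as follows (the file names below are only placeholders):

    compute_dense_vectors \
        --corpus ./corpus_news.pickle \
        --dataset-type news \
        --model-name paraphrase-multilingual-mpnet-base-v2 \
        --output-corpus ./corpus_news_with_dense_vectors.pickle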

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

compute_dense_vectors-1.0.2.tar.gz (17.9 kB)

Uploaded Source

Built Distribution

compute_dense_vectors-1.0.2-py3-none-any.whl (18.5 kB)

Uploaded Python 3

File details

Details for the file compute_dense_vectors-1.0.2.tar.gz.

File metadata

  • Download URL: compute_dense_vectors-1.0.2.tar.gz
  • Upload date:
  • Size: 17.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for compute_dense_vectors-1.0.2.tar.gz
Algorithm Hash digest
SHA256 5758f4589f41bb21b33f3db5d97ed9d325992cb3d20a006a09dac73890bae84c
MD5 419a0d423fd74d606bf96532687ee00e
BLAKE2b-256 d718e93775492cfc233f061341d53823bd0efdfa8fc73d5b815878c1aa054f15

See more details on using hashes here.

File details

Details for the file compute_dense_vectors-1.0.2-py3-none-any.whl.

File metadata

  • Download URL: compute_dense_vectors-1.0.2-py3-none-any.whl
  • Upload date:
  • Size: 18.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.26.0 requests-toolbelt/0.9.1 urllib3/1.26.7 tqdm/4.62.3 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.9.10

File hashes

Hashes for compute_dense_vectors-1.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 85593718986ed5028811da88ee6408b9583fdb783f773a4bb8c463a69b38f13f
MD5 ea85942086b1f96b8db8639ed96a0f2c
BLAKE2b-256 36881dadc43ff5689a02a58896cff66e9e30b5cc30fb79c695180c02df423c8d

See more details on using hashes here.
