Python implementation of graph community based word embeddings (SINr)

SINr

Build word embeddings based on community detection in graphs.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Background

Features

SINr is composed of two main modules:

Cooccurrence: a Cython-based module to efficiently compute a cooccurrence matrix from a given corpus
SINr: a module to compute sparse word embeddings based on a cooccurrence network

Installation

Launch a job on a Slurm node -> srun -p gpu --gres "gpu:1" --time 1-0 --mem 5G --pty bash
Install conda
Clone the repository -> git clone --branch nfm_sparse https://git-lium.univ-lemans.fr/tprouteau/sinr.git && cd sinr
Build the conda environment -> conda env create -f environment.yml
Activate the environment -> conda activate sinr_release
Install SINr in development mode and the spaCy transformer model for English -> cd src && python setup.py cythonize && pip install -e . && python -m spacy download en_core_web_trf
Use SINr!

Launch a Jupyter notebook in JupyterLab

Activate your conda environment -> conda activate sinr_release
(On first launch) Install the environment kernel in IPython -> ipython kernel install --name sinr_release --user
Launch a notebook on a node -> srun -p gpu --gres "gpu:1" --mem 80G -c15 -w "gpu15" jlaunch jupyter-lab  # Use the -w option to choose the node; avoid nodes with a K20/K40 GPU, which CuPy no longer supports.
Ctrl+click the link displayed in the terminal and select the appropriate kernel (sinr_release)

Usage

For additional examples, see the notebooks.

Cooccurrence

from sinr.cooccurrence import Cooccurrence
from sinr.pmi import pmi_filter

# Load your corpus as list of lists of tokens
sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]
# Build cooccurrence matrix
c = Cooccurrence()
c.fit(sentences, window=2)

# Normalise the cooccurrence matrix using PPMI
c.matrix = pmi_filter(c.matrix)
c.save("/path_to_output/matrix.pk")
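
To give an intuition for the two steps above (windowed cooccurrence counting, then PPMI weighting), here is a minimal pure-Python sketch. It is not the package's Cython implementation: the function names `cooccurrence_counts` and `ppmi` are hypothetical, and the exact window semantics and handling of repeated tokens in SINr may differ.

```python
import math
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric cooccurrences of tokens at most `window` positions apart."""
    counts = Counter()
    for sentence in sentences:
        for i, left in enumerate(sentence):
            # Look only to the right, then record the pair in both directions.
            for right in sentence[i + 1 : i + 1 + window]:
                if left != right:  # assumption: a token does not cooccur with itself
                    counts[(left, right)] += 1
                    counts[(right, left)] += 1
    return counts


def ppmi(counts):
    """Keep pairs whose PMI, log2(p(a, b) / (p(a) * p(b))), is strictly positive."""
    total = sum(counts.values())
    marginals = Counter()
    for (a, _), n in counts.items():
        marginals[a] += n
    weights = {}
    for (a, b), n in counts.items():
        pmi = math.log2(n * total / (marginals[a] * marginals[b]))
        if pmi > 0:
            weights[(a, b)] = pmi
    return weights


sentences = [["sinr", "is", "fun"], ["sinr", "is", "a", "python", "package"]]
counts = cooccurrence_counts(sentences, window=2)
weights = ppmi(counts)
```

Pairs that cooccur no more often than chance would predict are dropped, which is what keeps the resulting graph sparse.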

SINr

Extracting the embeddings is currently memory-intensive. When working with large corpora, request a large amount of RAM (more than 100 GB). A fix is in progress.

from sinr.graph_embeddings import SINr

model = SINr.sinr("/path_to_output/matrix.pk", output_path="path_to_output", n_jobs=4)
# If an output_path is supplied, the model will be saved. Embeddings are returned
# as a Model object comprising a dictionary for the vocabulary and a scipy.sparse.csr_matrix for the vectors.
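
This README does not spell out how vectors are derived from communities. As a rough intuition only, the sketch below gives each word one dimension per community and sets the value to the share of the word's edge weight that goes into that community. The function `community_vectors`, the toy graph, and the hand-assigned partition are all hypothetical and are not taken from the package.

```python
from collections import defaultdict


def community_vectors(edges, community_of):
    """Map each node to {community: share of its edge weight going into that community}."""
    toward = defaultdict(lambda: defaultdict(float))
    degree = defaultdict(float)
    for (u, v), w in edges.items():
        # Treat every entry as an undirected weighted edge.
        for node, neighbour in ((u, v), (v, u)):
            toward[node][community_of[neighbour]] += w
            degree[node] += w
    return {
        node: {c: w / degree[node] for c, w in by_comm.items()}
        for node, by_comm in toward.items()
    }


# Toy weighted cooccurrence graph; the weights are made up for illustration.
edges = {
    ("cat", "dog"): 3.0,
    ("cat", "pet"): 1.0,
    ("dog", "pet"): 2.0,
    ("pet", "python"): 1.0,
    ("python", "code"): 4.0,
}
# Hand-assigned partition standing in for the output of a community detection step.
community_of = {"cat": 0, "dog": 0, "pet": 0, "python": 1, "code": 1}

vectors = community_vectors(edges, community_of)
```

Each word only has non-zero values for the communities it is actually connected to, so the vectors stay sparse and every dimension can be read as "affinity with this group of words".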

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss the changes to be made.

Compile/Install from source

To compile and install SINr from source, follow the procedure below:

git clone --branch nfm_sparse https://git-lium.univ-lemans.fr/tprouteau/sinr.git
cd sinr
conda env create -f environment.yml
conda activate sinr_release
python setup.py cythonize
pip install -e .

Evaluate Word Embeddings

To evaluate the word embeddings on the similarity task, you may use the Word Embedding Benchmarks library developed by Stanislaw Jastrzebski: https://github.com/kudkudak/word-embeddings-benchmarks

⚠️ The embeddings returned by the model are of type scipy.sparse.csr_matrix. You will need to convert them to a dense matrix using the todense() method:

matrix = my_sparse_csr_matrix.todense()

Refer to the documentation and examples of the benchmarking library to find out which input format it expects.
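
As an illustration of the conversion, the sketch below builds a toy vocabulary and sparse matrix, densifies them, and compares words by cosine similarity. The data is made up, the word-to-row-index layout of the vocabulary is an assumption, and the word-to-vector dictionary is only one format a benchmark might accept. Note that `toarray()` returns a plain `numpy.ndarray`, whereas `todense()` returns a `numpy.matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-ins for a model's outputs (hypothetical vocabulary and vectors).
vocab = {"cat": 0, "dog": 1, "code": 2}
sparse_vectors = csr_matrix(np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]))

# Densify; each row is the vector of the word whose index it matches.
dense = sparse_vectors.toarray()


def cosine(u, v):
    """Cosine similarity between two 1-D vectors; 0.0 if either is all zeros."""
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / norm) if norm else 0.0


# Word -> dense vector mapping (assumed input shape for benchmark tooling).
embeddings = {word: dense[idx] for word, idx in vocab.items()}
sim_cat_dog = cosine(embeddings["cat"], embeddings["dog"])
sim_cat_code = cosine(embeddings["cat"], embeddings["code"])
```

Densifying multiplies memory use by roughly the inverse of the matrix density, so for a large vocabulary it may be worth converting only the rows needed by the benchmark.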

License

Project based on the

Release history

1.2.0: Jul 24, 2023
1.2.0a0 (pre-release): Jul 24, 2023
1.1.1: Feb 20, 2023
1.1.0: Feb 18, 2023
1.0.9: Feb 6, 2023
1.0.8: Feb 6, 2023
1.0.7: Feb 6, 2023
1.0.6: Feb 6, 2023
1.0.5: Feb 6, 2023
1.0.4: Feb 6, 2023
1.0.2: Feb 6, 2023
0.1.7: Mar 24, 2022
0.1.6: Mar 24, 2022
0.1.5: Mar 24, 2022
0.1.4 (this version): Mar 24, 2022
0.1.3: Mar 24, 2022

Download files

Source Distribution

sinr-0.1.4.tar.gz (214.8 kB)

Uploaded Mar 24, 2022 Source

Built Distribution

sinr-0.1.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (905.3 kB)

Uploaded Mar 24, 2022 CPython 3.9 manylinux: glibc 2.17+ x86-64

Hashes for sinr-0.1.4.tar.gz

Algorithm	Hash digest
SHA256	`97707ab1a3c8da5178e87af5af2eddae3478a538d5f5a6796c568eb7d47be735`
MD5	`7007e2154d3acab24e765dd15341e65a`
BLAKE2b-256	`6f96841fa13b3a86a92f6b6ffb8df0a694731329a4a53d7c2b60f6df58c50b46`

Hashes for sinr-0.1.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm	Hash digest
SHA256	`3990743f498861ec1b37a3675b847a64b2b8a6e5f34b84958764c75fc957c13c`
MD5	`bb6d9195f359505f038d25c69fa053cd`
BLAKE2b-256	`8eec82f5e38c50cb586773dde1b88aa6b5d20060710e62dac11e6785fee6ae4f`