Build word and graph embeddings based on community detection in graphs.
Project description
SINr is an open-source tool to efficiently compute graph and word embeddings. Its aim is to provide sparse interpretable vectors from a graph structure. The dimensions of the vector produced are related to the community structure detected in the graph. By leveraging the relative connection of vertices to communities, SINr builds an interpretable space. SINr is focused on providing tools to build and interpret the embeddings produced.
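To make the idea of "relative connection of vertices to communities" concrete, here is a minimal pure-Python sketch. It assumes a toy undirected graph and a hand-written community assignment, both hypothetical, and scores each vertex by the share of its edges that land in each community. It only illustrates the intuition and is not SINr's implementation; the function name `community_shares` is made up for this example.

```python
from collections import defaultdict

def community_shares(edges, community_of):
    """Return, for each vertex, the fraction of its edges going to each community.

    edges: iterable of (u, v) pairs of an undirected graph.
    community_of: dict mapping each vertex to a community label.
    """
    # counts[u][c] = number of edges from u to vertices in community c
    counts = defaultdict(lambda: defaultdict(int))
    degree = defaultdict(int)
    for u, v in edges:
        counts[u][community_of[v]] += 1
        counts[v][community_of[u]] += 1
        degree[u] += 1
        degree[v] += 1
    # Normalize by degree; communities a vertex never touches are absent (sparse).
    return {
        u: {c: n / degree[u] for c, n in per_comm.items()}
        for u, per_comm in counts.items()
    }

# Toy graph (hypothetical): two triangles joined by the edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

vectors = community_shares(edges, community_of)
print(vectors["a"])  # vertex connected to a single community
print(vectors["c"])  # vertex bridging both communities
```

Each dimension of such a vector corresponds to one community, which is what makes the resulting space readable: a non-zero value says "this vertex is attached to that community".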
SINr is a Python module relying on Networkit for the graph structure and community detection. SINr also provides efficient implementations to extract word co-occurrence graphs from large text corpora. One of the strengths of SINr is its ability to work with text and produce interpretable word embeddings that are competitive with similar approaches. For more details on the performance of SINr on downstream evaluation tasks, please refer to the Publications section.
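As an illustration of what a word co-occurrence graph is, the sketch below counts word pairs that appear within a small sliding window and treats the counts as weighted edges. This is a simplified stand-in written with the standard library only, not SINr's own extraction code; the function name `cooccurrence_edges`, the `window` parameter, and the sample sentences are assumptions made for the example.

```python
from collections import Counter

def cooccurrence_edges(sentences, window=2):
    """Count co-occurrences of word pairs within a sliding window.

    sentences: iterable of token lists.
    window: maximum distance (in tokens) between two co-occurring words.
    Returns a Counter mapping sorted (word_a, word_b) pairs to counts,
    which can be read as the weighted edges of an undirected graph.
    """
    weights = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Pair the current word with the next `window` tokens.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:  # skip self-loops
                    weights[tuple(sorted((word, other)))] += 1
    return weights

# Hypothetical, already tokenized corpus.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]
edges = cooccurrence_edges(sentences, window=2)
for (a, b), w in edges.most_common(3):
    print(a, b, w)
```

On a real corpus the edge list would typically be filtered (for example by frequency) before being loaded into a graph library for community detection.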
Requirements
As SINr relies on libraries implemented using C/C++, a modern C++ compiler is required.
OpenMP (required for Networkit and for compiling SINr’s Cython code)
Python 3.9
Pip
Cython
Conda (recommended)
Install
SINr can be installed through pip or from source using poetry directives.
pip
conda activate sinr # activate conda environment
pip install sinr
from source
conda activate sinr # activate conda environment
git clone git@github.com:SINr-Embeddings/sinr.git
cd sinr
pip install poetry # poetry solves dependencies and installs SINr
poetry install # installs SINr based on the pyproject.toml file
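After installing, a quick sanity check can confirm that the interpreter and the main dependencies are visible from the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical, and it assumes that the Python 3.9 listed in the requirements is a minimum version.

```python
import importlib.util
import sys

def check_environment(modules=("networkit", "sinr", "Cython"), min_python=(3, 9)):
    """Report whether the Python version and the given modules are available."""
    report = {"python_ok": sys.version_info >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        report[name] = importlib.util.find_spec(name) is not None
    return report

for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```

A "missing" entry usually means the wrong environment is active or the installation step did not complete.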
Usage example
To get started using SINr to build graph and word embeddings, have a look at the notebook directory.
Here is a minimal working example of SINr:
import urllib.request # the request submodule must be imported explicitly
import io
import gzip
import networkit as nk
import sinr.graph_embeddings as ge
url = "https://snap.stanford.edu/data/wiki-Vote.txt.gz"
graph_file = "wikipedia-votes.txt"
# Read a graph from SNAP
sock = urllib.request.urlopen(url) # open URL
s = io.BytesIO(sock.read()) # read into BytesIO "file"
sock.close()
with gzip.open(s, "rt") as f_in:
    with open(graph_file, "wt") as f_out:
        f_out.writelines(f_in.readlines())
# Initialize a networkit.Graph object from SNAP graph
G = nk.readGraph(graph_file, nk.Format.SNAP)
# Build a SINr model and extract embeddings
model = ge.SINr.load_from_graph(G)
model.run(algo=nk.community.PLM(G))
embeddings = model.get_nr()
print(embeddings)
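Once vectors are available, a common next step is to compare vertices. The sketch below computes cosine similarity between sparse vectors stored as plain `{dimension: weight}` dictionaries. It makes no assumption about the type returned by `get_nr()`: the toy `vectors` dictionary and the helper names `cosine` and `most_similar` are hypothetical, and real embeddings would first need to be converted to this form.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {dimension: weight}."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(target, vectors, k=3):
    """Return the k entries closest to `target`, as (name, similarity) pairs."""
    scores = [
        (name, cosine(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]

# Toy sparse vectors (hypothetical): one dimension per community.
vectors = {
    "a": {0: 1.0},
    "b": {0: 0.8, 1: 0.2},
    "c": {1: 1.0},
}
print(most_similar("a", vectors, k=2))
```

Because each dimension maps to a community, the shared non-zero dimensions also explain why two vertices are considered similar.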
Documentation
The documentation for SINr is available online.
Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss the changes to be made.
License
Released under CeCILL 2.1, see LICENSE for more details.
Publications
SINr is currently maintained at the University of Le Mans. If you find SINr useful for your own research, please cite the appropriate papers from the list below. Publications can also be found on the publications page in the documentation.
Initial SINr paper, 2021
Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021, Apr 2021, Porto, Portugal. pp.325-337, ⟨10.1007/978-3-030-74251-5_26⟩. ⟨hal-03197434⟩
Interpretability of SINr embedding
Thibault Prouteau, Nicolas Dugué, Nathalie Camelin, Sylvain Meignier. Are Embedding Spaces Interpretable? Results of an Intrusion Detection Evaluation on a Large French Corpus. LREC 2022, Jun 2022, Marseille, France. ⟨hal-03770444⟩
Hashes for sinr-1.2.0-cp310-cp310-manylinux_2_35_x86_64.whl
Algorithm | Hash digest
---|---
SHA256 | c382e034c16d29e4ccd4b89d7a6ec851d236501acd190f866b3a76c5ae2fbadc
MD5 | 5ad7d17928ba89f8e0629c62d9d353dc
BLAKE2b-256 | 17e7f049c65b49f0ab7c37594cfa5b429103f8d34ad1a12fd9208f870874e940