

gowpy

A very simple library for exploiting graph-of-words in NLP. Currently at version 0.2.0.

The graph-of-words model has been shown to be an interesting alternative to the widespread bag-of-words model by challenging its term independence assumption [1,2]. In particular, the graph-of-words model makes it possible to capture term dependence and term order. If you are looking for an efficient alternative to TF-IDF, give graph-of-words a try on your dataset!

gowpy leverages the graph-of-words representation in order to do:

  • unsupervised keyword extraction from a document, and
  • document classification, either with the graph-based TW-IDF term weighting scheme or with frequent subgraph features.

Detailed explanations, evaluations and discussions can be found in the reference section.

Quick Start

Requirements and Installation

This project is based on Python 3.6+, scikit-learn and NetworkX.

Installation from PyPI

pip install gowpy

Installation from the GitHub Source

First, clone the project:

git clone https://github.com/GuillaumeDD/gowpy.git

Then, cd to the project folder and run the install command:

cd gowpy/
python setup.py install

Example Usage

Example Notebooks

See the examples/ directory for example notebooks.

Building a Graph-of-Words from a Document

from gowpy.gow.builder import GoWBuilder

# Creation of a graph-of-words builder
# Here:
# - the graph-of-words will be directed, and
# - an edge will link every pair of tokens co-occurring within a sliding window of size 4
#
builder = GoWBuilder(directed=True, window_size=4)

text = """gowpy is a simple library for exploiting graph-of-words in nlp gowpy 
leverages graph-of-words representation for document classification and for keyword extraction 
from a document"""

# Here, a preprocessing step suited to the needs of the project should be carried out

# Creation of the graph-of-words
gow = builder.compute_gow_from_document(text)

Then, it is possible to visualize the document as a graph-of-words:

import matplotlib.pyplot as plt
import networkx as nx

g = gow.to_labeled_graph()

options = {
    "font_weight" : 'normal',
    "font_color" : 'darkblue',
    #
    "edge_color" : 'lightgray',
    #
    "node_size" : 200,
    "node_color": 'white',
    "with_labels": True,
}
nx.draw(g, **options)

A graph-of-words example

Unsupervised Keywords Extraction

Graph-of-words can be leveraged to extract an automatically adaptive number of cohesive keywords from a text document in an unsupervised fashion [2,3].

gowpy implements the graph-of-words methods presented in [3]:

  • the three batch keyword extraction methods based on k-core, and
  • the word-level keyword extraction method CoreRank.

See this example notebook that reproduces the running example of the paper and demonstrates all the methods.
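
As background, the k-core-based methods keep the terms that lie in the densest cores of the graph-of-words. The following is a minimal, purely illustrative sketch of that idea using NetworkX directly on a toy co-occurrence graph (the graph is made up for the illustration; this is not gowpy's API):

import networkx as nx

# Toy undirected co-occurrence graph: an edge links two terms
# co-occurring within a sliding window
g = nx.Graph()
g.add_edges_from([
    ("graph", "words"), ("graph", "keyword"), ("words", "keyword"),
    ("keyword", "extraction"), ("extraction", "graph"),
    ("document", "extraction"),
])

# Core number of a term: the largest k such that the term belongs to the k-core
core_numbers = nx.core_number(g)

# Retaining the main (highest) core yields the keywords
k_max = max(core_numbers.values())
print([term for term, k in core_numbers.items() if k == k_max])
# ['graph', 'words', 'keyword', 'extraction']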

For short and medium-size documents, both the "density" and the "inflexion" methods have shown strong performance. Here is how to use the "density" method:

from gowpy.summarization.unsupervised import KcoreKeywordExtractor

extractor_kw = KcoreKeywordExtractor(directed=False, weighted=True, window_size=4,
                                     selection_method='density')

# 
# Note that preprocessing is particularly important for keyword extraction
# in order to keep and normalize important terms such as adjectives and nouns.
#
# An already preprocessed text from which to extract keywords

preprocessed_text = """gowpy simple library exploiting graph-of-words nlp gowpy 
leverages graph-of-words representation document classification keyword extraction 
document"""

extractor_kw.extract(preprocessed_text)

Returns:

[('gowpy', 4),
 ('simple', 4),
 ('library', 4),
 ('exploiting', 4),
 ('graph-of-words', 4),
 ('nlp', 4),
 ('leverages', 4),
 ('representation', 4),
 ('document', 4),
 ('classification', 4),
 ('keyword', 4),
 ('extraction', 4)]

For longer documents, the CoreRank method has been shown to be more appropriate. This method takes an optional parameter specifying the number of keywords to extract:

from gowpy.summarization.unsupervised import CoreRankKeywordExtractor

extractor_kw_cr = CoreRankKeywordExtractor(directed=False, weighted=True, window_size=4)

preprocessed_text = """gowpy simple library exploiting graph-of-words nlp gowpy 
leverages graph-of-words representation document classification keyword extraction 
document"""

extractor_kw_cr.extract(preprocessed_text, n=5)

Returns:

[('graph-of-words', 36),
 ('gowpy', 28),
 ('representation', 24),
 ('document', 24),
 ('library', 20)]

Classification with TW-IDF: a graph-based term weighting score

TW-IDF [0] challenges the term independence assumption behind the bag-of-words model by (i) exploiting a graph-of-words representation of a document (here an unweighted directed graph of terms), and by (ii) leveraging this new representation to replace the term frequency (TF) by graph-based term weights (TW).
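
To make the term-weighting idea concrete, here is a minimal sketch of the TW part only: the weight of a term is its in-degree in the directed graph-of-words of the document, instead of its raw frequency. The function below is an illustration with assumed conventions (forward edges within the window, in-degree as weight); gowpy's TwidfVectorizer, shown next, performs the graph construction, the IDF weighting and the document-length normalization internally, and its exact conventions may differ:

import networkx as nx

def degree_term_weights(tokens, window_size=4):
    # Directed graph-of-words: an edge goes from a term to each term
    # following it within the sliding window
    g = nx.DiGraph()
    for i, source in enumerate(tokens):
        for target in tokens[i + 1:i + window_size]:
            if source != target:
                g.add_edge(source, target)
    # Graph-based term weight (TW): in-degree instead of term frequency (TF)
    return {term: g.in_degree(term) for term in g.nodes()}

print(degree_term_weights("to be or not to be".split()))
# {'to': 3, 'be': 3, 'or': 2, 'not': 3}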

See this example notebook for a usage example of TW-IDF for a multi-class classification task.

TW-IDF is accessible via a dedicated vectorizer:

from gowpy.feature_extraction.gow import TwidfVectorizer

corpus = [
    'hello world !',
    'foo bar'
]

vectorizer_gow = TwidfVectorizer(                 
    # Graph-of-words specificities
    directed=True,
    window_size=4,
    # Token frequency filtering
    min_df=0.0,
    max_df=1.0,
    # Graph-based term weighting approach
    term_weighting='degree'
)

X = vectorizer_gow.fit_transform(corpus)
X

Returns:

<2x5 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

The TW-IDF vectorizer fits seamlessly into a grid search:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('gow', TwidfVectorizer()),
    ('svm', SVC()),
])

parameters = {
    'gow__directed' : [True, False],
    'gow__window_size' : [2, 4, 8, 16],
    'gow__b' : [0.0, 0.003],
    'gow__term_weighting' : ['degree', 'pagerank'],
    'gow__min_df' : [0, 5, 10],
    'gow__max_df' : [0.8, 0.9, 1.0],
#
    'svm__C' : [0.1, 1, 10],
    'svm__kernel' : ['linear']
}

# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, 
                           parameters, 
                           cv=10,
                           n_jobs=-1)
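
Assuming a list of training documents and their labels (the names documents and labels below are placeholders, not provided by gowpy), the search is then run as with any scikit-learn pipeline:

# documents: list of (preprocessed) texts; labels: their classes
grid_search.fit(documents, labels)

print(grid_search.best_params_)
print(grid_search.best_score_)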

Going further: classification based on frequent subgraphs

Frequent subgraphs corresponding to long-range n-grams can be mined and subsequently used for document classification [1].

Classification with frequent subgraphs happens in a 3-step process:

  1. Conversion of the corpus of already preprocessed documents into a collection of graph-of-words
  2. Mining the frequent subgraphs
  3. Loading the frequent subgraphs and exploiting them for classification

See this example notebook for a usage example of frequent subgraphs for a multi-class classification task.

Conversion of the corpus into a collection of graph-of-words

The first step consists in turning the corpus into a collection of graph-of-words and then exporting that collection into a file format suited for frequent subgraph mining.

from gowpy.gow.miner import GoWMiner
import gowpy.gow.io

corpus = [
    'hello world !',
    'foo bar',
    # and many more...
]

# Conversion of the corpus into a collection of graph-of-words
gow_miner = GoWMiner(directed=False, window_size=4)
corpus_gows = gow_miner.compute_gow_from_corpus(corpus)

# Exportation of the collection of graph-of-words into a file for
# interoperability with other languages such as C++
with open("corpus_gows.data", "w") as f_output:
    data = gowpy.gow.io.gow_to_data(corpus_gows)
    f_output.write(data)

Mining the frequent subgraphs

Frequent subgraph mining can be carried out with the gSpan algorithm. This step is not included in this project and has to be performed by an external program.

This project supports the gBolt reimplementation available on GitHub. Currently, this implementation is limited to undirected graphs. To mine frequent subgraphs (after having installed gBolt on your machine):

OMP_NUM_THREADS=1 ./gbolt --input corpus_gows.data --output gbolt-mining-corpus_gow --dfs --nodes --support 0.01

Notice the support parameter, which defines the minimum relative frequency for a subgraph to be considered frequent. Here it is set to 1% (0.01), i.e. a subgraph must appear in at least 1% of the documents. This parameter is corpus-specific and should be carefully tuned (see [1]).

Mining produces two files:

  • gbolt-mining-corpus_gow.t0: the frequent subgraphs with more than one node
  • gbolt-mining-corpus_gow.nodes: the frequent single nodes

These two files can be loaded by the same gow_miner used for exportation:

gow_miner.load_graphs('gbolt-mining-corpus_gow.t0', 
                      'gbolt-mining-corpus_gow.nodes')
gow_miner

Returns:

Graph-of-word miner:
        - is_directed: False
        - window_size: 4
        - edge_labeling: True

        - Number of tokens: 5
        - Number of links between tokens: 4

        - Number of loaded subgraph: 13

Classification with frequent subgraphs

Classification with frequent subgraphs is accessible via a dedicated vectorizer:

from gowpy.feature_extraction.gow import GoWVectorizer

vectorizer_gow = GoWVectorizer(gow_miner)
X = vectorizer_gow.fit_transform(corpus)
# X is a sparse matrix

Before tuning min_df (the minimum being the support chosen during mining) and max_df, it is possible to have a look at the normalized frequency distribution of the mined subgraphs:

import pandas as pd
s_freq_per_pattern = pd.Series(gow_miner.stat_relative_freq_per_pattern())
s_freq_per_pattern.describe()

For instance, it can return the following distribution:

count    10369.000000
mean         0.026639
std          0.046551
min          0.008333
25%          0.010000
50%          0.013333
75%          0.022778
max          0.865000
dtype: float64

The GoW vectorizer fits nicely into a grid search:

from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('gow', GoWVectorizer(gow_miner)),
    ('tfidf', TfidfTransformer()),
    ('svm', SVC()),
])

parameters = {
    'gow__subgraph_matching' : ['partial', 'induced'],
    'gow__min_df' : [0.00833, 0.01, 0.013333],
    'gow__max_df' : [0.022778, 0.5, 0.865],
#
    'svm__C' : [0.1, 1, 10],
    'svm__kernel' : ['linear']
}

# find the best parameters for both the feature extraction and the
# classifier
grid_search = GridSearchCV(pipeline, 
                           parameters, 
                           cv=10,
                           n_jobs=-1)

References

Detailed explanations, evaluations and discussions can be found in these papers:

  • [0] Rousseau, F. and Vazirgiannis, M. (2013). Graph-of-word and TW-IDF: New Approach to Ad Hoc IR. CIKM 2013.
  • [1] Rousseau, F., Kiagias, E. and Vazirgiannis, M. (2015). Text Categorization as a Graph Classification Problem. ACL 2015.
  • [2] Rousseau, F. and Vazirgiannis, M. (2015). Main Core Retention on Graph-of-Words for Single-Document Keyword Extraction. ECIR 2015.
  • [3] Tixier, A., Malliaros, F. and Vazirgiannis, M. (2016). A Graph Degeneracy-based Approach to Keyword Extraction. EMNLP 2016.

This library relies on the following algorithms:

  • k-core decomposition and CoreRank for unsupervised keyword extraction,
  • degree-based and PageRank-based term weighting for TW-IDF, and
  • the gSpan algorithm (via the external gBolt implementation) for frequent subgraph mining.

License

Released under the 3-Clause BSD license (see the LICENSE file).
