
compling


compling is a Python module that provides Natural Language Processing and Computational Linguistics functionality for working with human language data. It incorporates Data and Text Mining features from other well-known libraries (e.g. spacy, NLTK, sklearn, ...) to build a pipeline for the analysis of corpora of JSON documents.

Documentation

See documentation: http://pycompling.altervista.org/.

Installation

pip install compling

You also need to download a spacy model for your corpus language. See the available models here: https://spacy.io/models. By default, compling expects you to download the sm models. You can still choose to download larger models, but remember to edit the config.ini file so that compling can work properly.

For example, if the language of your documents is English, you could run:

$ python -m spacy download en_core_web_sm

config.ini

The functionalities offered by compling may require a large variety of parameters. To facilitate their use, default values are provided for some parameters:

  • some can be changed directly in the function invocation: many functions provide optional parameters;
  • others are stored in the config.ini file, a configuration file that contains the values of some special parameters characterizing the processing of your corpora (e.g. the language of the documents in your corpus).

Here is a preview:

[Corpus]
;The language of documents in your corpus.
language = english

;The standard iso639 of 'language'.
;See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes .
iso639 = en

;Documents in your corpus store their text in this key.
text_key = text

;Documents in your corpus store their date values as string in this format.
;For a complete list of formatting directives, see: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior.
date_format = %d/%m/%Y

[Document_record]
;Document records metadata:

;If lower==1, a lowercase version will be stored for each document.
lower = 0

;If lemma==1, a version with tokens replaced by their lemma will be stored for each document.
lemma = 0

;If stem==1, a version with tokens replaced by their stem will be stored for each document.
stem = 0

;If negations==1, a version where negated tokens are preceded by the 'NOT_' prefix will be stored for each document.
negations = 1

;If named_entities==1, the occurring named entities will be stored in a list for each document.
named_entities = 1
; ...
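
For instance, the default date_format above corresponds to the following strptime pattern; this is plain Python, independent of compling:

from datetime import datetime

# parse a date string stored in the default '%d/%m/%Y' format
date = datetime.strptime('25/12/2020', '%d/%m/%Y')
print(date.year, date.month, date.day)  # 2020 12 25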

compling provides the ConfigManager class to help you handle it.

The available methods are shown below.

class ConfigManager:
    def __init__(self) -> None:
        """Constructor: creates a ConfigManager object."""

    def load(self) -> None:
        """Loads content of config.ini file."""

    def cat(self) -> None:
        """Shows the content of the config.ini file as plain-text."""

    def updates(self, config:dict) -> None:
        """Updates some values of some sections."""

    def update(self, section, k, v) -> None:
        """Update a k field with a v value in the s section."""

    def reset(self) -> None:
        """Reset the config.ini file to default conditions."""

    def whereisconfig(self) -> str:
        """Shows the config.ini file location."""

Example of usage

from compling.config import ConfigManager
cm = ConfigManager()

# documents of my corpus are Italian
cm.updates({'Corpus': {'language': 'italian', 'iso639': 'it'}})

# I want to keep a lowercase version of each document
cm.update('Document_record', 'lower', '1')

# restore default conditions
cm.reset()

Tree structure

The compling tree structure uses different fonts: bold for packages; italic for files; Capitalized names for available classes.

Example of usage

As an example, let's use the Vatican Publications corpus.

import pkg_resources
corpus_path = pkg_resources.resource_filename('compling', 'example-corpus/vatican-publications')

def doc_iterator(path:str):
    """Yields json documents."""
    import os, json

    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith('.json'):
                with open(os.path.join(root, file), mode='r', encoding='utf-8') as f_json:
                    data = json.load(f_json)
                    yield data
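
You can peek at the first document to check its structure; the sketch below assumes each JSON document exposes at least a 'text' field, matching the text_key value in config.ini.

# inspect the first document of the corpus
first_doc = next(doc_iterator(corpus_path))
print(first_doc.keys())
print(first_doc['text'][:200])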

See the documentation for more details.

Tokenization

Tokenization converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, etc.

compling provides a Tokenizer class that tokenizes a stream of JSON documents.

A Tokenizer object converts the corpus documents into a stream of:

  • tokens: tokens occurring in those documents. Each token is characterized by:
    • token_id: unique token identifier;
    • sent_id: unique sentence identifier. The id of the sentence the token occurs in;
    • para_id: unique paragraph identifier. The id of the paragraph the token occurs in;
    • doc_id: unique document identifier. The id of the document the token occurs in;
    • text: the text of the token;
    • a large variety of optional meta-information (e.g. PoS tag, dep tag, lemma, stem, ...);
  • sentences : sentences occurring in those documents. Each sentence is characterized by:
    • sent_id: unique sentence identifier;
    • para_id: unique paragraph identifier. The id of the paragraph the sentence occurs in;
    • doc_id: unique document identifier. The id of the document the sentence occurs in;
    • text: the text of the sentence;
    • a large variety of optional meta-information (e.g. lemma, stem, ...);
  • paragraphs: paragraphs occurring in those documents. Each paragraph is characterized by:
    • para_id: unique paragraph identifier;
    • doc_id: unique document identifier. The id of the document the paragraph occurs in;
    • text: the text of the paragraph;
    • a large variety of optional meta-information (e.g. lemma, stem, ...);
  • documents: Each document is characterized by:
    • doc_id: unique document identifier;
    • text: the text of the document;
    • a large variety of optional meta-information (e.g. lemma, stem, ...);
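
For illustration, a single token record could look like the following sketch; the identifier and text fields follow the list above, while the optional field names ('lemma', 'pos', ...) are assumptions and depend on your config.ini settings.

# sketch of a token record; optional field names are assumptions
token_record = {
    'token_id': 42,
    'sent_id': 3,
    'para_id': 1,
    'doc_id': 0,
    'text': 'churches',
    # optional meta-information (depends on config.ini)
    'lemma': 'church',
}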

A Tokenizer object is also able to retrieve frequent n-grams to be considered as unique tokens.

Example of usage

See the earlier example for the doc_iterator function.

from compling.analysis.lexical.tokenization import Tokenizer

# new Tokenizer
json_docs_stream_input = doc_iterator(corpus_path)
json_docs_stream_output = doc_iterator(corpus_path)
t = Tokenizer()

# let's consider frequent bigrams as unique tokens
json_docs_stream_output = t.ngrams2tokens(json_docs_stream_input, json_docs_stream_output, n=2)

# run tokenization
tokenization_records = t.run(json_docs_stream_output)

token_records = list()
sentence_records = list()
paragraph_records = list()
document_records = list()

# you could store the records: if your corpus is large, tokenization could take a long time.
for doc in tokenization_records:
    token_records.extend(doc['tokens'])
    sentence_records.extend(doc['sentences'])
    paragraph_records.extend(doc['paragraphs'])
    document_records.extend(doc['documents'])
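
Since tokenization can take a long time on a large corpus, you may want to persist the records once they have been computed. A minimal sketch using plain JSON files (the file names are arbitrary):

import json

# save the token records so tokenization does not have to be repeated
with open('token_records.json', mode='w', encoding='utf-8') as f:
    json.dump(token_records, f)

# reload them in a later session
with open('token_records.json', mode='r', encoding='utf-8') as f:
    token_records = json.load(f)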

Vectorization

The process of converting text into vectors is called vectorization. The set of vectorized corpus documents makes up the Vector Space Model, which can have a sparse or dense representation.

compling provides a Vectorizer class that, given the corpus token records, vectorizes the corpus documents.

A Vectorizer object allows you to create vectors by grouping tokens on an arbitrary field. E.g., grouping tokens by:

  • 'doc_id': you're creating document vectors;
  • 'sent_id': you're creating sentence vectors;
  • 'author': you're creating author vectors (each token must have an 'author' field);
  • ...

You can also choose the text field on which the tokens will be grouped, e.g.:
  • lemma
  • text
  • stem
  • ...

It offers several weighting functions to set the vector component values, such as:

  • One-hot encoding
  • Tf
  • TfIdf
  • Mutual Information

You can specify the vectorization representation format: Term x Document matrix or Postings list.

You can also inspect the Vector Space Model.

compling also provides a Vector Space Model (VSM) class, which allows you to analyze the distances between vectors.

Example of usage

from compling.analysis.lexical.vectorization import Vectorizer

# new Vectorizer
v = Vectorizer(token_field='lemma', group_by_field='author')

# stream of author vectors
vector_stream = v.run('tfidf', token_records)

from compling.analysis.lexical.vectorization import VSM

# stream to list
vector_list = list(vector_stream)

# new VSM object
v = VSM(vectors=vector_list, id_field='author')

# compute the distance matrix between vectors
v.distance(metric='euclidean')

# plot the distance matrix as a heatmap
v.plot()

# top n values for each vector id
v.topn(n=10)
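
The same pipeline works for any grouping field listed above. For example, the following sketch builds document vectors over stems instead of author vectors over lemmas, reusing the classes exactly as shown; it assumes stems were stored during tokenization (stem = 1 in config.ini).

from compling.analysis.lexical.vectorization import Vectorizer, VSM

# document vectors: group tokens by 'doc_id', using the 'stem' text field
v_doc = Vectorizer(token_field='stem', group_by_field='doc_id')
doc_vectors = list(v_doc.run('tfidf', token_records))

# inspect the document space
vsm_doc = VSM(vectors=doc_vectors, id_field='doc_id')
vsm_doc.distance(metric='euclidean')
vsm_doc.topn(n=10)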

Unsupervised Learning

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.

compling provides these classes:

  • KMeans
  • Linkage
  • PCA
  • TruncatedSVD

Example of usage

from compling.analysis.lexical.unsupervised_learning.clustering import KMeans

# new Kmeans object
kmeans = KMeans(vectors=vector_list, id_field='author')

# run kmeans: 4 clusters
clusters = kmeans.run(k=4)

from compling.analysis.lexical.unsupervised_learning.clustering import Linkage

# new Linkage object
linkage = Linkage(vectors=vector_list, id_field='author')

# run hierarchical clustering
linkage.run(method='complete')

# plot the dendrogram showing the set of all possible clusters
linkage.plot()

from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import PCA

# new PCA object
pca = PCA(vectors=vector_list, id_field='author')

# run PCA: reduction to 2 components.
pca.run(n=2)

# plot 2D vectors
pca.plot()

from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import TruncatedSVD

# new TruncatedSVD object
truncateSVD = TruncatedSVD(vectors=vector_list, id_field='author')

# run TruncatedSVD: reduction to 2 components.
truncateSVD.run(n=2)

# plot 2D vectors
truncateSVD.plot()

Sentiment Analysis

compling implements a SentimentAnalyzer class that allows you to perform sentiment analysis through a lexicon-based approach.

SentimentAnalyzer uses a summation strategy: the polarity level of a document is calculated as the sum of the polarities of all the words in the document.

The analysis detects negation patterns and reverses the polarity of negated tokens.

By providing a regex, you can filter the sentences/paragraphs/documents to analyze.

By providing a pos list and/or a dep list, you can filter the words whose polarities will be summed.
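
To make the summation strategy concrete, here is a toy sketch of lexicon-based polarity with negation handling. The lexicon values are illustrative only, and the NOT_ prefix convention follows the negations option in config.ini; this is not the library's internal code.

# toy polarity lexicon (illustrative values only)
toy_lexicon = {'good': 1.0, 'love': 2.0, 'bad': -1.0}

def document_polarity(tokens):
    """Sums token polarities; a 'NOT_' prefix (negated token) reverses the sign."""
    polarity = 0.0
    for token in tokens:
        negated = token.startswith('NOT_')
        word = token[len('NOT_'):] if negated else token
        score = toy_lexicon.get(word, 0.0)
        polarity += -score if negated else score
    return polarity

print(document_polarity(['i', 'NOT_love', 'this', 'bad', 'day']))  # -3.0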

At the moment, only the analysis of English documents is available.

Example of usage

from compling.analysis.sentiment import SentimentAnalyzer
from compling.analysis.sentiment.lexicon import Vader

# new SentimentAnalyzer. 
# polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities.
s = SentimentAnalyzer(token_records, text_field='lemma', group_by_field='author',
                      id_index_field='para_id', # you can filter some paragraphs
                      pos=('VERB', 'NOUN', 'PROPN', 'ADJ')) 

# polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities occurring in paragraphs filtered by regex_pattern.
s.filter(paragraph_records, regex_pattern="^.*(work).*$")

# new Lexicon    
lexicon = Vader()

# run sentiment analysis 
polarities, words = s.run(lexicon=lexicon)

Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words or documents.

compling incorporates some gensim-based classes, such as Word2vec, Fasttext and Doc2vec.

Example of usage

from compling.embeddings.words import Word2vec

# new Word2vec
w = Word2vec(index=sentence_records, text_field='text')

# build Word2vec model
w.run()

love_sim = w.most_similar('love')

from compling.embeddings.documents import Doc2vec

# new Doc2vec
w = Doc2vec(index=sentence_records, id_field='author', text_field='text')

# build Doc2vec model
w.run()

paulvi_sim = w.most_similar("Paul VI")
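
Fasttext, mentioned above, can presumably be used in the same way as Word2vec; the import path and interface in this sketch are assumptions modeled on the Word2vec example.

# assumption: Fasttext mirrors the Word2vec interface shown above
from compling.embeddings.words import Fasttext

f = Fasttext(index=sentence_records, text_field='text')

# build the Fasttext model
f.run()

love_sim = f.most_similar('love')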
