
compling

Computational Linguistics with Python

compling is a Python module that provides Natural Language Processing and Computational Linguistics functionality for working with human language data. It combines Data and Text Mining features from well-known libraries (e.g. spacy, nltk, sklearn, ...) into a pipeline for analyzing corpora of JSON documents.

Documentation

See documentation here.

Installation

You can install compling with:

$ pip install compling

compling requires:

  • Python (>= 3.6)
  • numpy
  • spacy
  • nltk
  • gensim
  • tqdm
  • unicodedata2
  • unidecode
  • configparser
  • vaderSentiment
  • wordcloud

You also need to download:

  • a spacy language model
    See the available models here. Choose one based on the language of your corpus documents. By default, compling expects you to download the sm models. You can still choose to download larger models, but remember to edit the config.ini file accordingly so that everything works properly.

    Example
    Let's assume the language of your documents is English. You could download the spacy small English model:

    python -m spacy download en_core_web_sm
    
  • some nltk functionalities (a Python alternative is shown after this list):

    • stopwords
      $ python -m nltk.downloader stopwords
      
    • punkt
      $ python -m nltk.downloader punkt
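
If you prefer, the same NLTK resources can also be downloaded from a Python session:

# equivalent to the shell commands above
import nltk
nltk.download('stopwords')
nltk.download('punkt')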
      

config.ini

The functionalities offered by compling may require a large variety of parameters. To make them easier to use, default values are provided for many of them:

  • some can be overridden in the function invocation, since many functions expose optional parameters;
  • others are stored in the config.ini file. This file configures the processing of your corpora and holds the values of some special parameters (e.g. the language of the documents in your corpus).

You can see a preview below:

[Corpus]
;The language of documents in your corpus.
language = english

;The standard iso639 of 'language'.
;See: https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes .
iso639 = en

;Documents in your corpus store their text in this key.
text_key = text

;Documents in your corpus store their date values as string in this format.
;For a complete list of formatting directives, see: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior.
date_format = %d/%m/%Y

[Document_record]
;Document records metadata:

;If lower==1, a lowercase version will be stored for each document.
lower = 0

;If lemma==1, a version with tokens replaced by their lemma will be stored for each document.
lemma = 0

;If stem==1, a version with tokens replaced by their stem will be stored for each document.
stem = 0

;If negations==1, a version where negated tokens are preceded by the 'NOT_' prefix will be stored for each document.
negations = 1

;If named_entities==1, the occurring named entities will be stored in a list for each document.
named_entities = 1
; ...
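
For reference, with the default settings above a corpus document only has to store its text under the key named by text_key and, if it carries a date, store it as a string matching date_format; every other field is free metadata. A purely illustrative document follows (the 'date' and 'author' key names are hypothetical, not prescribed by config.ini):

# a hypothetical corpus document, as it might be yielded by a JSON document stream
example_document = {
    "text": "We have gathered to reflect on work and human dignity ...",  # the key named by text_key
    "date": "25/12/1965",  # hypothetical key name; only the format (%d/%m/%Y) is fixed by config.ini
    "author": "Paul VI"    # arbitrary extra metadata, usable later (e.g. to group tokens by author)
}
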
ConfigManager

compling provides the ConfigManager class to make it easier for you to edit the config.ini file and to help you manage the processing of your corpora.

You can see the available methods below:

class ConfigManager:
    def __init__(self) -> None:
        """Constructor: creates a ConfigManager object."""

    def load(self) -> None:
        """Loads content of config.ini file."""

    def cat(self) -> None:
        """Shows the content of the config.ini file as plain-text."""

    def updates(self, config:dict) -> None:
        """Updates some values of some sections."""

    def update(self, section, k, v) -> None:
        """Update a k field with a v value in the s section."""

    def reset(self) -> None:
        """Reset the config.ini file to default conditions."""

    def whereisconfig(self) -> str:
        """Shows the config.ini file location."""

Example of usage

from compling.config import ConfigManager
cm = ConfigManager()

# documents of my corpus are Italian
cm.updates({'Corpus': {'language': 'italian', 'iso639': 'it'}})

# I want to keep a lowercase version of each document
cm.update('Document_record', 'lower', '1')

# I changed my mind: reset to default conditions
cm.reset()
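
The other methods listed above work the same way. For instance, to inspect the current configuration and locate the file on disk:

# show the content of config.ini as plain text
cm.cat()

# print the config.ini file location
print(cm.whereisconfig())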

Tree structure

The compling tree structure is shown below. Different fonts are used:

  • bold, for packages;
  • italic, for files;
  • Capitalized, for available classes.

Example of usage (compling)

You can see a short example of usage below. See the documentation for more details.

As an example, let's use the Vatican Publications corpus provided by compling.

import pkg_resources
corpus_path = pkg_resources.resource_filename('compling', 'example-corpus/vatican-publications')

def doc_iterator(path:str):
    """Yields json documents."""
    import os, json

    for root, dirs, files in os.walk(path):
        for file in files:
            if file.endswith('.json'):
                with open(os.path.join(root, file), mode='r', encoding='utf-8') as f_json:
                    data = json.load(f_json)
                    yield data
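
As a quick sanity check, you can peek at the first document yielded by the iterator; with the default configuration its text is stored under the 'text' key (text_key in config.ini), while the remaining keys depend on the corpus:

# peek at the first document of the example corpus
first_doc = next(doc_iterator(corpus_path))
print(sorted(first_doc.keys()))   # available metadata fields (corpus-dependent)
print(first_doc['text'][:200])    # the document text, stored under text_key
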
Tokenization

Tokenization converts input text into a stream of tokens, where each token is a separate word, punctuation mark, number/amount, date, etc.

compling provides a Tokenizer class that tokenizes a stream of json documents.

A Tokenizer object converts the corpus documents into a stream of:

  • tokens: tokens occurring in those documents (an illustrative record is sketched after this list). Each token is characterized by:
    • token_id: unique token identifier;
    • sent_id: unique sentence identifier. The id of the sentence the token occurs in;
    • para_id: unique paragraph identifier. The id of the paragraph the token occurs in;
    • doc_id: unique document identifier. The id of the document the token occurs in;
    • text: the text of the token;
    • a large variety of optional meta-information (e.g. PoS tag, dep tag, lemma, stem, ...);
  • sentences: sentences occurring in those documents. Each sentence is characterized by:
    • sent_id: unique sentence identifier;
    • para_id: unique paragraph identifier. The id of the paragraph the sentence occurs in;
    • doc_id: unique document identifier. The id of the document the sentence occurs in;
    • text: the text of the sentence;
    • a large variety of optional meta-information (e.g. lemma, stem, ...);
  • paragraphs: paragraphs occurring in those documents. Each paragraph is characterized by:
    • para_id: unique paragraph identifier;
    • doc_id: unique document identifier. The id of the document the paragraph occurs in;
    • text: the text of the paragraph;
    • a large variety of optional meta-information (e.g. lemma, stem, ...);
  • documents: each document is characterized by:
    • doc_id: unique document identifier;
    • text: the text of the document;
    • a large variety of optional meta-information (e.g. lemma, stem, ...).
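
Concretely, each record is a plain dict holding the fields listed above. An illustrative token record (the values are made up, and the exact key names of the optional meta-information may differ; see the documentation):

# illustrative token record: the ids tie the token to its sentence, paragraph and document
example_token_record = {
    "token_id": 42,
    "sent_id": 3,
    "para_id": 1,
    "doc_id": 0,
    "text": "church",
    # optional meta-information, depending on your config.ini settings:
    "lemma": "church",
    "stem": "church",
    "pos": "NOUN"
}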

A Tokenizer object can also detect frequent n-grams in the documents of your corpus and treat them as single tokens during tokenization.

from compling.analysis.lexical.tokenization import Tokenizer

# new Tokenizer
t = Tokenizer()

# json doc stream input: from these docs we will calculate frequent corpus n-grams.
docs_in = doc_iterator(corpus_path)

# json doc stream output: we will consider frequent corpus n-grams as unique tokens.
docs_out = doc_iterator(corpus_path)

# let's consider frequent bigrams as unique tokens
docs_out = t.ngrams2tokens(n=2, docs_in=docs_in, docs_out=docs_out)

# run tokenization on each doc
tokenization_records = t.run(docs_out)

token_records = list()
sentence_records = list()
paragraph_records = list()
document_records = list()

# we could store the records: if your corpus is large, tokenization could take a long time.
for doc in tokenization_records:
    token_records.extend(doc['tokens'])
    sentence_records.extend(doc['sentences'])
    paragraph_records.extend(doc['paragraphs'])
    document_records.extend(doc['documents'])
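
As the comment above suggests, it can be worth persisting the records so that a long tokenization run is not repeated. A minimal sketch using only the standard library (compling does not prescribe a storage backend; the file name below is arbitrary):

import json

# save the token records to disk; the other record lists can be stored the same way
with open('token_records.json', mode='w', encoding='utf-8') as f_out:
    json.dump(token_records, f_out, ensure_ascii=False)
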
Vectorization

The process of converting text into vectors is called vectorization. The set of vectorized corpus documents makes up the Vector Space Model, which can have a sparse or dense representation.

compling provides a Vectorizer class that, given corpus tokens records, vectorizes the corpus documents.

A Vectorizer object allows you to create vectors by grouping tokens on an arbitrary field. E.g. grouping tokens by:

  • 'doc_id': you're creating document vectors;
  • 'sent_id': you're creating sentence vectors;
  • 'author': you're creating author vectors (each token must have an 'author' field);
  • ...

You can also choose the token text field whose values make up the vector components:

  • lemma
  • text
  • stem
  • ...

It offers several functions to set the vector component values, such as:

  • One-hot encoding
  • Tf
  • TfIdf
  • Mutual Information

You can specify the vectorization representation format: a Term x Document matrix or a Postings list.

You can also inspect the Vector Space Model.

compling also provides a VSM (Vector Space Model) class, which allows you to analyze the distances between vectors.

from compling.analysis.lexical.vectorization import Vectorizer

# new Vectorizer
v = Vectorizer(token_field='lemma', group_by_field='author')

# stream of author vectors
vector_stream = v.run('tfidf', token_records)
from compling.analysis.lexical.vectorization import VSM

# stream to list
vector_list = list(vector_stream)

# new VSM object
v = VSM(vectors=vector_list, id_field='author')

# calculate the distance matrix between vectors
v.distance(metric='euclidean')

# plot the distance matrix as a heatmap
v.plot()

# top n values for each vector id
v.topn(n=10)

Unsupervised Learning

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.

compling provides these classes:

  • KMeans
  • Linkage
  • PCA
  • TruncatedSVD
from compling.analysis.lexical.unsupervised_learning.clustering import KMeans

# new KMeans object
kmeans = KMeans(vectors=vector_list, id_field='author')

# run kmeans: 4 clusters
clusters = kmeans.run(k=4)
from compling.analysis.lexical.unsupervised_learning.clustering import Linkage

# new Linkage object
linkage = Linkage(vectors=vector_list, id_field='author')

# run hierarchical clustering
linkage.run(method='complete')

# plot the dendrogram showing the set of all possible clusters
linkage.plot()
from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import PCA

# new PCA object
pca = PCA(vectors=vector_list, id_field='author')

# run PCA: reduction to 2 components.
pca.run(n=2)

# plot 2D vectors
pca.plot()
from compling.analysis.lexical.unsupervised_learning.dimensionality_reduction import TruncatedSVD

# new TruncatedSVD object
truncateSVD = TruncatedSVD(vectors=vector_list, id_field='author')

# run TruncatedSVD: reduction to 2 components.
truncateSVD.run(n=2)

# plot 2D vectors
truncateSVD.plot()

Sentiment Analysis

compling implements a SentimentAnalyzer class that allows you to perform sentiment analysis through a lexicon-based approach.

SentimentAnalyzer uses a summation strategy: the polarity level of a document is calculated as the sum of the polarities of all the words in the document.

The analysis detects negation patterns and reverses the polarity of negated tokens.

By providing a regex, you can filter the sentences/paragraphs/documents to analyze.

By providing a pos list and/or a dep list, you can filter the words whose polarities will be summed.

At the moment, only the analysis for English documents is available.

from compling.analysis.sentiment import SentimentAnalyzer
from compling.analysis.sentiment.lexicon import Vader

# new SentimentAnalyzer. 
# polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities.
s = SentimentAnalyzer(token_records, text_field='lemma', group_by_field='author',
                      id_index_field='para_id', # you can filter some paragraphs
                      pos=('VERB', 'NOUN', 'PROPN', 'ADJ')) 

# polarity of documents as sum of VERB, NOUN, PROPN, ADJ token polarities occurring in paragraphs filtered by regex_pattern.
s.filter(paragraph_records, regex_pattern="^.*(work).*$")

# new Lexicon    
lexicon = Vader()

# run sentiment analysis 
polarities, words = s.run(lexicon=lexicon)
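
The exact structure of the returned objects is described in the documentation. Assuming polarities behaves like a mapping from each group value (here, each author) to its polarity scores, a quick inspection might look like this:

# assumption: 'polarities' maps each author to its aggregated polarity scores
for author, polarity in polarities.items():
    print(author, polarity)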

Embeddings

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words or documents.

compling wraps some gensim classes, such as Word2vec, Fasttext and Doc2vec.

from compling.embeddings.words import Word2vec

# new Word2vec
w = Word2vec(index=sentence_records, text_field='text')

# build Word2vec model
w.run()

love_sim = w.most_similar('love')
from compling.embeddings.documents import Doc2vec

# new Doc2vec
w = Doc2vec(index=sentence_records, id_field='author', text_field='text')

# build Doc2vec model
w.run()

paulvi_sim = w.most_similar("Paul VI")
