NLP library for linguists built on top of spaCy, Scikit-Learn, and vadersentiment.

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

vidito

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

VidiNLP Library

VidiNLP is a simple Natural Language Processing (NLP) library that combines various text analysis capabilities including sentiment analysis, topic modeling, readability assessment, and more. Built on top of spaCy, scikit-learn, and vaderSentiment, VidiNLP provides an easy-to-use interface for advanced text analysis.

For emotion analysis (not sentiment analysis), VidiNLP makes use of the NRC emotion lexicon, created by Dr Saif M. Mohammad at the National Research Council Canada."

Installation
Dependencies
Basic Usage
Core Features

Installation

# First install the required dependencies
pip install vidinlp


# Download the spaCy model
python -m spacy download en_core_web_sm

Dependencies

spacy
scikit-learn
vaderSentiment
numpy
pandas

Basic Usage

from vidinlp import VidiNLP

# Initialize the analyzer
nlp = VidiNLP()

Core Features

Text Preprocessing

Tokenization

# Tokenize text into individual words
tokens = nlp.tokenize("Hello world, how are you?")
print(tokens)
# Output: ['Hello', 'world', ',', 'how', 'are', 'you', '?']

Lemmatization

# Get base forms of words
lemmas = nlp.lemmatize("I am running and jumping")
print(lemmas)
# Output: ['I', 'be', 'run', 'and', 'jump']

POS Tagging

# Get part-of-speech tags
pos_tags = nlp.pos_tag("The quick brown fox jumps")
print(pos_tags)
# Output: [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'ADJ'), ('fox', 'NOUN'), ('jumps', 'VERB')]

Text Cleaning

# Clean text with various filters. By default they are all False
cleaned = nlp.clean_text(
    "Hello! This is a test 123... <p> with HTML </p>",
    remove_stop_words=True,      # Remove stop words
    remove_none_alpha=True,     # Remove non-alphabetical words
    remove_punctuations=True,       # Remove punctuations
    remove_numbers=True,       # Remove numbers
    remove_html=True,       # Remove HTML
    remove_urls=True,       # Remove URLs
    remove_emojis=True       # Remove EMOJIs
)
print(cleaned)
# Output: "hello test HTML"

Sentiment and Emotion Analysis

Basic Sentiment Analysis

# Get sentiment scores
sentiment = nlp.analyze_sentiment("This movie was absolutely fantastic!")
print(sentiment)
# Output: {'neg': 0.0, 'neu': 0.227, 'pos': 0.773, 'compound': 0.6369}

Emotion Analysis

# Get emotion scores
emotions = nlp.analyze_emotions("I am so happy and excited about this!")
print(emotions)
# Output: {'joy': 2, 'anticipation': 1, 'trust': 1, 'surprise': 0, 'fear': 0, 'sadness': 0, 'anger': 0, 'disgust': 0}

Aspect-Based Sentiment Analysis

# Analyze sentiment for different aspects
absa = nlp.aspect_based_sentiment_analysis(
    "The phone's battery life is excellent but the camera quality is poor."
)
print(nlp.summarize_absa_results(absa))
# Output:
# The aspect 'battery life' has a positive sentiment with a confidence of 0.85.
# The aspect 'camera' has a negative sentiment with a confidence of 0.72.

Keyword Extraction

N-gram Analysis

# Get top n-grams. Uses Pyhton collections Counter class for fast processing
ngrams = nlp.get_ngrams("The quick brown fox jumps over the lazy dog", n=2, top_n=3, lowercase=True, ignore_punct=True)
print(ngrams)
# Output: [('quick brown', 1), ('brown fox', 1), ('fox jumps', 1)]
# If you want to get the TFIDF (important n-grams) in a corpus:
corpus = [
        "Machine learning is revolutionizing artificial intelligence",
        "Deep learning models improve computer vision tasks",
        "Natural language processing enables advanced text analysis"
        ]
tfidf_ngrams = nlp.get_tfidf_ngrams_corpus(corpus, n=2, top_n=10, filter_stop=False)
# give it a list o texts as corpus

TF-IDF Keywords

# Extract keywords using TF-IDF
keywords = nlp.extract_keywords("Machine learning is a subset of artificial intelligence", top_n=3)
print(keywords)
# Output: [('machine learning', 0.42), ('artificial intelligence', 0.38), ('subset', 0.20)]

Topic Modeling

Perform Latent Dirichlet Allocation (LDA) topic modeling on a corpus of texts. Extracts underlying topics by identifying co-occurring word groups across documents.

# topic model
documents = [
        "Machine learning is revolutionizing artificial intelligence",
        "Deep learning models improve computer vision tasks",
        "Natural language processing enables advanced text analysis"
        ]
topics =  nlp.topic_modelling( documents, num_topics = 5, min_df = 2, max_df = 0.95, min_word_length = 3)

for topic in topics:
        print(topic)
# Args:
#         texts (List[str]): Collection of text documents to analyze
#         num_topics (int, optional): Number of topics to extract. Defaults to 5.
#         min_df (int, optional): Minimum document frequency for terms. Defaults to 2.
#         max_df (float, optional): Maximum document frequency for terms. Defaults to 0.95.
#         min_word_length (int, optional): Minimum word length to consider. Defaults to 3.

#     Returns:
#         List of dictionaries containing top keywords for each extracted topic

Document Similarity

# Compare two documents
similarity = nlp.compute_document_similarity(
    "Machine learning is fascinating",
    "AI is amazing"
)
print(similarity)
# Output: 0.0

# Find similar documents
docs = ["AI is great", "Machine learning is cool", "Python programming"]
similar = nlp.find_similar_documents("AI and ML", docs, top_n=2)
print(similar)
# Output: [(0, 0.82), (1, 0.65)]

Readability Analysis

# Get readability metrics
readability = nlp.analyze_readability(
    "The quick brown fox jumps over the lazy dog. It was a simple sentence."
)
print(readability)
# Output: {
#     'flesch_reading_ease': 97.0,
#     'gunning_fog_index': 2.8,
#     'dale_chall_score': 5.1,
#     'avg_words_per_sentence': 7.0,
#     'avg_syllables_per_word': 1.2,
#     'complex_word_ratio': 0.0,
#     'lexical_density': 0.571,
#     'type_token_ratio': 0.929,
#     'avg_word_length': 3.93,
#     'named_entity_ratio': 0.0,
#     'verb_noun_ratio': 0.33,
#     'avg_sentence_length_syllables': 8.5
# }

Text Structure Analysis

# Analyze text structure
structure = nlp.analyze_text_structure(text)
print(structure)
# Output: {
#     'num_sentences': 33,
#     'avg_sentence_length': 10.33,
#     'sentence_length_variability': {'variance': 730.9917355371902, 'iqr': 37.5},
#     'num_paragraphs': 2,
#     'avg_paragraph_length': 5.5,
#     'paragraph_length_variability': {'variance': 1600.0, 'iqr': 40.0},
#     'discourse_markers': {'Although'},
#     'pronoun_reference_ratio': 0.027777777777777776,
#     'lexical_diversity': 0.76,
#     'pos_distribution': {'ADP': 63, 'DET': 45, 'ADJ': 46,},
#     'noun_verb_ratio': 4.0,
#     'noun_adj_ratio': 2.0,
#     'sentence_type_distribution': {'simple': 1, 'compound': 9, 'complex': 0, 'compound_complex': 1},
#     'complex_sentence_ratio': 0.0
# }
#  Higher variance indicates more variation in sentence lengths
# IQR focuses on the typical variation, ignoring extreme outliers. Measures the range where the middle 50% of the sentence lengths fall

# Analyze text patterns
structure = nlp.detect_linguistic_patterns(
    "The words have been spoken. If they answer, I will talk."
)
print(structure)
# Output: {'passive_voice': ['The words have been spoken],
# 'conditionals': ['If they answer, I will talk.']
# }

Named Entity Recognition

# Identify named entities
ner = nlp.get_named_entities('Norway is a big country!')
print(ner)
# Output: [('Norway', 'GPE')]

Text Classification

# Users can train a Naive Bayes classifier on their own dataset and use it for predictions.
# The texts should be provided in a CSV file (encoding='utf-8') with a text column and a label column.
# Training the Model:
from vidinlp import VidiNLP

# Initialize VidiNLP
nlp = VidiNLP()

# Train the model
nlp = VidiNLP()
nlp.train_text_classifier(
    csv_path="spam_ham.csv",
    text_column="text",
    label_column="label",
    split_ratio=0.8
)

# Predicting Text Class
predicted_label = nlp.predict_text("Hey what's up?")
print(predicted_label)

Export Functionality

# This functioonality needs improvement
# Export analysis in different formats
# JSON format
analysis_json = nlp.export_analysis(text, format='json')

# Pandas DataFrame
analysis_df = nlp.export_analysis(text, format='dataframe')
analysis_df.to_csv('analysis.csv', index=False)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Project details

These details have been verified by PyPI

Project links

Homepage

GitHub Statistics

Maintainers

vidito

These details have not been verified by PyPI

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Release history Release notifications | RSS feed

1.1.5.2

Jan 19, 2026

This version

1.1.5.1

Apr 17, 2025

1.1.0

Jan 14, 2025

0.0.8

Dec 29, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vidinlp-1.1.5.1.tar.gz (456.6 kB view details)

Uploaded Apr 17, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

vidinlp-1.1.5.1-py3-none-any.whl (451.4 kB view details)

Uploaded Apr 17, 2025 Python 3

File details

Details for the file vidinlp-1.1.5.1.tar.gz.

File metadata

Download URL: vidinlp-1.1.5.1.tar.gz
Upload date: Apr 17, 2025
Size: 456.6 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vidinlp-1.1.5.1.tar.gz
Algorithm	Hash digest
SHA256	`5263c2552a2862da7b6cf2b59cb8fcec6b54ed3d4f5c984a87edece3a248560d`
MD5	`c1e20b2a6063e0dae7644544fda92344`
BLAKE2b-256	`bafb97cd7f37e9c42035ae3ddc2ae95b324ffb678cf7a4ebd37f5d03a8c00bf0`

See more details on using hashes here.

Provenance

The following attestation bundles were made for vidinlp-1.1.5.1.tar.gz:

Publisher: python-publish.yml on Vidito/vidinlp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vidinlp-1.1.5.1.tar.gz
- Subject digest: 5263c2552a2862da7b6cf2b59cb8fcec6b54ed3d4f5c984a87edece3a248560d
- Sigstore transparency entry: 198581862
- Sigstore integration time: Apr 17, 2025
Source repository:
- Permalink: Vidito/vidinlp@2c10b34c0e80d6adbe5b6867e96ad309aeb7c54a
- Branch / Tag: refs/tags/1.1.5.1
- Owner: https://github.com/Vidito
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@2c10b34c0e80d6adbe5b6867e96ad309aeb7c54a
- Trigger Event: release

File details

Details for the file vidinlp-1.1.5.1-py3-none-any.whl.

File metadata

Download URL: vidinlp-1.1.5.1-py3-none-any.whl
Upload date: Apr 17, 2025
Size: 451.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for vidinlp-1.1.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1fc2c964c89c3030dd30677fface290e0eb754c89381da34dda9384cda3fffa5`
MD5	`fb0d6c51db22d919a17094c67e66efbb`
BLAKE2b-256	`0c583052292c9416f7f6f3f2be5f5bd990a995853533d0b1351fe235fb9e49bc`

See more details on using hashes here.

Provenance

The following attestation bundles were made for vidinlp-1.1.5.1-py3-none-any.whl:

Publisher: python-publish.yml on Vidito/vidinlp

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vidinlp-1.1.5.1-py3-none-any.whl
- Subject digest: 1fc2c964c89c3030dd30677fface290e0eb754c89381da34dda9384cda3fffa5
- Sigstore transparency entry: 198581864
- Sigstore integration time: Apr 17, 2025
Source repository:
- Permalink: Vidito/vidinlp@2c10b34c0e80d6adbe5b6867e96ad309aeb7c54a
- Branch / Tag: refs/tags/1.1.5.1
- Owner: https://github.com/Vidito
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: python-publish.yml@2c10b34c0e80d6adbe5b6867e96ad309aeb7c54a
- Trigger Event: release

vidinlp 1.1.5.1

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Project description

VidiNLP Library

Table of Contents

Installation

Dependencies

Basic Usage

Core Features

Text Preprocessing

Tokenization

Lemmatization

POS Tagging

Text Cleaning

Sentiment and Emotion Analysis

Basic Sentiment Analysis

Emotion Analysis

Aspect-Based Sentiment Analysis

Keyword Extraction

N-gram Analysis

TF-IDF Keywords

Topic Modeling

Document Similarity

Readability Analysis

Text Structure Analysis

Named Entity Recognition

Text Classification

Export Functionality

Contributing

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Meta

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance