A simple, accessible Python toolkit for digital humanities scholars, covering basic and advanced text analysis, exploration, processing, and parsing.
Project description
sanalyse
sanalyse is an open-source, unified Python toolkit for Digital Humanities. It provides a simple, consistent API to perform core text‐analysis tasks across English, Hindi, and Urdu. Designed to evolve incrementally, the library currently offers a suite of basic functionalities, with a rich roadmap of advanced techniques slated for upcoming releases.
🚀 Features (v0.x)
Core Text‐Processing
- Normalization: Unicode normalization, lowercasing, diacritic removal.
- Tokenization: Language‐aware tokenizers for English, Hindi, and Urdu.
- Stopword Removal: Built-in stopword lists for all three languages.
- Stemming & Lemmatization:
  - English: Porter & Snowball stemmers.
  - Hindi/Urdu: Rule‐based light stemmer.
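To make the normalization step concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is not sanalyse's API: the function name `normalize_text` and its `strip_diacritics` parameter are hypothetical, chosen for illustration only.

```python
import unicodedata


def normalize_text(text: str, strip_diacritics: bool = False) -> str:
    """Hypothetical helper (not part of sanalyse): NFC-normalize and lowercase.

    Diacritic removal is off by default because dropping combining marks
    is only safe for Latin-script text; in Devanagari, several vowel signs
    and the virama are nonspacing marks, so stripping them damages words.
    """
    # NFC composes base letters and combining marks into single code points.
    text = unicodedata.normalize("NFC", text).lower()
    if strip_diacritics:
        # NFD splits accented letters into base letter + combining mark,
        # then every nonspacing mark (category "Mn") is dropped.
        decomposed = unicodedata.normalize("NFD", text)
        kept = (ch for ch in decomposed if unicodedata.category(ch) != "Mn")
        text = unicodedata.normalize("NFC", "".join(kept))
    return text


print(normalize_text("Café", strip_diacritics=True))
```

The default keeps diacritics so the same call can be applied to English, Hindi, and Urdu text without loss; stripping is an opt-in step for Latin-script input.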
Exploratory Analysis
- Frequency Analysis: Compute word and n-gram frequencies.
- Concordance: KWIC (Key Word in Context) display.
- Collocations: Identify bigrams and trigrams with PMI scoring.
- Basic Readability: Flesch–Kincaid for English; placeholder metrics for Hindi/Urdu.
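The frequency, concordance, and collocation features above can be sketched with the standard library alone. This is an illustration of the techniques, not sanalyse's API: `ngrams`, `kwic`, and `bigram_pmi` are hypothetical names, the input is assumed to be already tokenized, and the PMI estimate uses plain relative frequencies with no smoothing or minimum-count filter.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def kwic(tokens, keyword, window=3):
    """Key Word in Context: show `window` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def bigram_pmi(tokens):
    """Score each bigram with PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(ngrams(tokens, 2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = "the cat sat on the mat and the cat slept".split()
print(Counter(tokens).most_common(2))       # word frequencies
print(Counter(ngrams(tokens, 2))[("the", "cat")])  # bigram frequency
print(kwic(tokens, "cat", window=2))        # concordance lines
```

On small corpora, PMI favours rare pairs, so practical pipelines usually drop bigrams below a minimum count before ranking.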
Utilities
- Language Detection: Fast heuristic language tagger.
- Text I/O: Read and write UTF-8 encoded plain text; support for CSV/TSV corpora.
- Batch Processing: Apply any analyzer over a directory of text files.
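As a rough sketch of how a heuristic language tagger and directory-level batch processing might fit together, the example below counts characters by Unicode block and applies a function to every text file in a folder. Both `guess_language` and `apply_to_directory` are hypothetical names, not sanalyse functions, and the script-based heuristic is an assumption: it cannot tell Urdu from other Arabic-script languages, or English from other Latin-script ones.

```python
from collections import Counter
from pathlib import Path


def guess_language(text: str) -> str:
    """Hypothetical heuristic: pick the language whose script dominates."""
    counts = Counter()
    for ch in text:
        if "\u0900" <= ch <= "\u097f":      # Devanagari block
            counts["hi"] += 1
        elif "\u0600" <= ch <= "\u06ff":    # Arabic block (used for Urdu)
            counts["ur"] += 1
        elif ch.isascii() and ch.isalpha():  # basic Latin letters
            counts["en"] += 1
    return counts.most_common(1)[0][0] if counts else "unknown"


def apply_to_directory(directory, analyzer, pattern="*.txt"):
    """Run `analyzer` on each matching UTF-8 file; map file name to result."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = analyzer(path.read_text(encoding="utf-8"))
    return results


print(guess_language("This is a sentence"))
```

Any callable that takes a string can be passed as `analyzer`, for example `apply_to_directory("corpus", guess_language)` for a folder named `corpus`.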
🔭 Upcoming Roadmap
Advanced features are under active development and will land incrementally in upcoming 1.x and 2.x releases.
- Named Entity Recognition: Pretrained models for people, places, organizations.
- Network & Graph Analysis: Build and analyze co‐occurrence and social networks.
- Topic Modeling: LDA, NMF, hLDA with cross‐lingual support.
- Stylometry & Authorship Attribution: Feature extraction + modeling tools.
- Sentiment & Emotion Analysis: Transformer‐based sentiment classifiers for all supported languages.
- OCR & Image‐to‐Text: Integrate Tesseract pipelines.
- Geospatial Analysis: Map place‐name occurrences; generate time‐space visualizations.
- Deep Learning & Embeddings: Multilingual BERT embeddings, topic‐aware embeddings.
- Translation & Transliteration: Bidirectional transliteration between Devanagari, Perso‐Arabic, and Roman scripts.
- Web-Based Interface: A Streamlit-based, plug-and-play interface for accessing the tools.
📦 Installation
pip install sanalyse
Download files
Source Distribution
Built Distribution
File details
Details for the file sanalyse-0.1.0.tar.gz.
File metadata
- Download URL: sanalyse-0.1.0.tar.gz
- Upload date:
- Size: 48.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5a3fceb263f697ed14c55d46560335297fe6f0317bbdbee35596c7b9693f5544 |
| MD5 | 2ec92920b47c3fb7e531b5bb5eef138b |
| BLAKE2b-256 | 580d17845e926b73e3e5ae5ec596bc2fc5184b9c30168c2ff1a3b1fac7e3fbac |
File details
Details for the file sanalyse-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sanalyse-0.1.0-py3-none-any.whl
- Upload date:
- Size: 34.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.9
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6bb864fa88c01b70e56b98f30e455773d510096763670f5874328e7fb09f109 |
| MD5 | aa1fcb5f1f7d476352697a4d4a4798b0 |
| BLAKE2b-256 | 0ad32e45feaa01e0daadb612ba95ebc2dc37850293195b129e705e5b3455895c |