A simple, accessible Python toolkit for digital humanities scholars, covering basic and advanced text analysis, exploration, processing, and parsing.

Project description

sanalyse

sanalyse is an open-source, unified Python toolkit for Digital Humanities. It provides a simple, consistent API to perform core text‐analysis tasks across English, Hindi, and Urdu. Designed to evolve incrementally, the library currently offers a suite of basic functionalities, with a rich roadmap of advanced techniques slated for upcoming releases.


🚀 Features (v0.x)

Core Text‐Processing

  • Normalization: Unicode normalization, lowercasing, diacritic removal.
  • Tokenization: Language‐aware tokenizers for English, Hindi, and Urdu.
  • Stopword Removal: Built-in stopword lists for all three languages.
  • Stemming & Lemmatization:
    • English: Porter & Snowball stemmers.
    • Hindi/Urdu: Rule‐based light stemmer.
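The normalization steps listed above (Unicode normalization, lowercasing, diacritic removal) can be sketched with Python's standard library alone. This is a generic illustration of the technique, not sanalyse's actual implementation; the diacritic stripping shown targets Latin-script accents (for Devanagari or Perso-Arabic text, combining marks such as matras carry meaning and should be kept):

```python
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip Latin diacritics, and return NFC-normalized text."""
    # NFD decomposition splits base characters from combining marks
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (accents); keep everything else
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose into the canonical NFC form
    return unicodedata.normalize("NFC", stripped)

print(normalize("Café NAÏVE"))  # cafe naive
```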

Exploratory Analysis

  • Frequency Analysis: Compute word and n-gram frequencies.
  • Concordance: KWIC (Key Word in Context) display.
  • Collocations: Identify bigrams and trigrams with PMI scoring.
  • Basic Readability: Flesch–Kincaid for English; placeholder metrics for Hindi/Urdu.
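To illustrate the concordance feature, a minimal KWIC display can be written in a few lines. The function name and parameters below are hypothetical and do not reflect sanalyse's API:

```python
import re

def kwic(text: str, keyword: str, width: int = 30) -> list[str]:
    """Return each occurrence of `keyword` with `width` characters of context."""
    lines = []
    # Whole-word, case-insensitive matches of the keyword
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so keywords line up in a column
        lines.append(f"{left:>{width}} | {m.group()} | {right}")
    return lines

sample = "The river flows. A river bends where the river meets the sea."
for line in kwic(sample, "river", width=15):
    print(line)
```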

Utilities

  • Language Detection: Fast heuristic language tagger.
  • Text I/O: Read/write plain text, UTF-8 encoded; support for CSV/TSV corpora.
  • Batch Processing: Apply any analyzer over a directory of text files.
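The batch-processing utility can be pictured as a small loop over a corpus directory. The helper below is a hypothetical sketch of the idea (sanalyse's own interface may differ); `analyzer` is any callable that takes a file's text and returns a result:

```python
import tempfile
from pathlib import Path

def batch_apply(analyzer, corpus_dir, pattern="*.txt"):
    """Run `analyzer` on every file in `corpus_dir` matching `pattern`."""
    results = {}
    # Sort for deterministic ordering across platforms
    for path in sorted(Path(corpus_dir).glob(pattern)):
        results[path.name] = analyzer(path.read_text(encoding="utf-8"))
    return results

# Demo on a throwaway two-file corpus
corpus = Path(tempfile.mkdtemp())
(corpus / "doc1.txt").write_text("the quick brown fox", encoding="utf-8")
(corpus / "doc2.txt").write_text("jumps over", encoding="utf-8")
print(batch_apply(lambda text: len(text.split()), corpus))
# {'doc1.txt': 4, 'doc2.txt': 2}
```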

🔭 Upcoming Roadmap

Advanced features are under active development and will land incrementally in upcoming 1.x and 2.x releases.

  • Named Entity Recognition: Pretrained models for people, places, organizations.

  • Network & Graph Analysis: Build and analyze co‐occurrence and social networks.

  • Topic Modeling: LDA, NMF, hLDA with cross‐lingual support.

  • Stylometry & Authorship Attribution: Feature extraction + modeling tools.

  • Sentiment & Emotion Analysis: Transformer‐based sentiment classifiers for all supported languages.

  • OCR & Image‐to‐Text: Integrate Tesseract pipelines.

  • Geospatial Analysis: Map place‐name occurrences; generate time‐space visualizations.

  • Deep Learning & Embeddings: Multilingual BERT embeddings, topic‐aware embeddings.

  • Translation & Transliteration: Bidirectional transliteration between Devanagari, Perso‐Arabic scripts and Roman.

  • Web-Based Interface: Streamlit-based tool providing a plug-and-play interface to the toolkit.


📦 Installation

pip install sanalyse

Project details


Download files

Download the file for your platform.

Source Distribution

sanalyse-0.1.0.tar.gz (48.3 kB)

Built Distribution

sanalyse-0.1.0-py3-none-any.whl (34.7 kB)

File details

Details for the file sanalyse-0.1.0.tar.gz.

File metadata

  • Download URL: sanalyse-0.1.0.tar.gz
  • Size: 48.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse-0.1.0.tar.gz

  • SHA256: 5a3fceb263f697ed14c55d46560335297fe6f0317bbdbee35596c7b9693f5544
  • MD5: 2ec92920b47c3fb7e531b5bb5eef138b
  • BLAKE2b-256: 580d17845e926b73e3e5ae5ec596bc2fc5184b9c30168c2ff1a3b1fac7e3fbac

See more details on using hashes here.

File details

Details for the file sanalyse-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sanalyse-0.1.0-py3-none-any.whl
  • Size: 34.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse-0.1.0-py3-none-any.whl

  • SHA256: b6bb864fa88c01b70e56b98f30e455773d510096763670f5874328e7fb09f109
  • MD5: aa1fcb5f1f7d476352697a4d4a4798b0
  • BLAKE2b-256: 0ad32e45feaa01e0daadb612ba95ebc2dc37850293195b129e705e5b3455895c
