A simple, accessible Python toolkit for digital humanities scholars, covering basic and advanced text analysis, exploration, processing, and parsing.

Project description

sanalyse

sanalyse is an open-source, unified Python toolkit for Digital Humanities. It provides a simple, consistent API to perform core text‐analysis tasks across English, Hindi, and Urdu. Designed to evolve incrementally, the library currently offers a suite of basic functionalities, with a rich roadmap of advanced techniques slated for upcoming releases.


🚀 Features (v0.x)

Core Text‐Processing

  • Normalization: Unicode normalization, lowercasing, diacritic removal.
  • Tokenization: Language‐aware tokenizers for English, Hindi, and Urdu.
  • Stopword Removal: Built-in stopword lists for all three languages.
  • Stemming & Lemmatization:
    • English: Porter & Snowball stemmers.
    • Hindi/Urdu: Rule‐based light stemmer.
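
The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not sanalyse's actual API: the function names `normalize`, `tokenize`, and `remove_stopwords` and the tiny stopword set are hypothetical. One assumption worth noting is that diacritic removal here only strips Latin combining marks (U+0300 to U+036F), because stripping every combining mark would also delete Devanagari vowel signs and break Hindi words.

```python
import unicodedata

# Hypothetical, deliberately tiny stopword set for illustration only.
EXAMPLE_STOPWORDS = {"the", "a", "is", "है", "ہے"}


def normalize(text: str, strip_diacritics: bool = True) -> str:
    """Apply NFC normalization, lowercase, and optionally strip Latin diacritics."""
    text = unicodedata.normalize("NFC", text).lower()
    if strip_diacritics:
        # Decompose so accents become separate combining characters,
        # drop only the Latin combining marks, then recompose.
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if not "\u0300" <= ch <= "\u036f")
        text = unicodedata.normalize("NFC", kept)
    return text


def tokenize(text: str) -> list:
    """Split text into runs of letters, combining marks, and digits.

    Marks (category M*) are kept inside tokens so that Devanagari
    vowel signs stay attached to their consonants.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN":
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def remove_stopwords(tokens, stopwords=EXAMPLE_STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]
```

For example, `remove_stopwords(tokenize(normalize("The Café is open.")))` keeps only the content words. A character-category tokenizer is used instead of the regular expression `\w+` because Python's built-in `re` module does not treat combining marks as word characters.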

Exploratory Analysis

  • Frequency Analysis: Compute word and n-gram frequencies.
  • Concordance: KWIC (Key Word in Context) display.
  • Collocations: Identify bigrams and trigrams with PMI scoring.
  • Basic Readability: Flesch–Kincaid for English; placeholder metrics for Hindi/Urdu.
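
To make the exploratory techniques above concrete, here is a small standard-library sketch of n-gram counting, a KWIC concordance, and PMI scoring for bigrams. It does not reproduce sanalyse's own functions: the names `ngrams`, `kwic`, and `bigram_pmi` are hypothetical, and the PMI estimate uses plain relative frequencies with no smoothing, which is one common choice among several.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def bigram_pmi(tokens, min_count=1):
    """Score each bigram with PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    n_unigrams = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = count / n_bigrams
        p_x = unigram_counts[w1] / n_unigrams
        p_y = unigram_counts[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Word frequencies follow directly from `Counter(tokens)`, and n-gram frequencies from `Counter(ngrams(tokens, n))`. On real corpora a `min_count` above 1 is usually advisable, since PMI tends to overrate pairs that occur only once.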

Utilities

  • Language Detection: Fast heuristic language tagger.
  • Text I/O: Read/write plain text, UTF-8 encoded; support for CSV/TSV corpora.
  • Batch Processing: Apply any analyzer over a directory of text files.
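
The utilities above could look roughly like the following sketch. The source does not say how the heuristic language tagger works, so the script-counting approach here is an assumption, and `guess_language` and `batch_apply` are hypothetical names rather than sanalyse functions. Counting characters per Unicode block is crude: it cannot tell Urdu from Arabic or Persian, or English from other Latin-script languages.

```python
from pathlib import Path


def guess_language(text: str) -> str:
    """Guess 'hi', 'ur', or 'en' from the dominant script; 'unknown' otherwise."""
    counts = {"hi": 0, "ur": 0, "en": 0}
    for ch in text:
        if "\u0900" <= ch <= "\u097f":      # Devanagari block
            counts["hi"] += 1
        elif "\u0600" <= ch <= "\u06ff":    # Arabic block (used for Urdu)
            counts["ur"] += 1
        elif ch.isascii() and ch.isalpha():  # basic Latin letters
            counts["en"] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


def batch_apply(directory, analyzer, pattern="*.txt"):
    """Run analyzer(text) on every matching UTF-8 file in a directory.

    Returns a dict mapping file name to the analyzer's result.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = analyzer(path.read_text(encoding="utf-8"))
    return results
```

Calling `batch_apply("corpus/", guess_language)` would tag every `.txt` file in a (hypothetical) `corpus/` directory. Any function that takes a string can be passed as the analyzer.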

🔭 Upcoming Roadmap

Advanced features are under active development and will land incrementally in upcoming 1.x and 2.x releases.

  • Named Entity Recognition: Pretrained models for people, places, organizations.

  • Network & Graph Analysis: Build and analyze co‐occurrence and social networks.

  • Topic Modeling: LDA, NMF, hLDA with cross‐lingual support.

  • Stylometry & Authorship Attribution: Feature extraction + modeling tools.

  • Sentiment & Emotion Analysis: Transformer‐based sentiment classifiers for all supported languages.

  • OCR & Image‐to‐Text: Integrate Tesseract pipelines.

  • Geospatial Analysis: Map place‐name occurrences; generate time‐space visualizations.

  • Deep Learning & Embeddings: Multilingual BERT embeddings, topic‐aware embeddings.

  • Translation & Transliteration: Bidirectional transliteration between Devanagari, Perso‐Arabic scripts and Roman.

  • Web-Based Interface: Streamlit-based, plug-and-play interface for accessing the tools.


📦 Installation

pip install sanalyse

Download files

Source Distribution

sanalyse_dhpy-1.0.0.tar.gz (48.4 kB view details)

Uploaded Source

Built Distribution

sanalyse_dhpy-1.0.0-py3-none-any.whl (34.7 kB view details)

Uploaded Python 3

File details

Details for the file sanalyse_dhpy-1.0.0.tar.gz.

File metadata

  • Download URL: sanalyse_dhpy-1.0.0.tar.gz
  • Size: 48.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse_dhpy-1.0.0.tar.gz
Algorithm Hash digest
SHA256 cd0c3b227b0cc3ab18c38b1363389d6f79aa88de17f584ec61647e5de080eb40
MD5 b2bce632d8512f6f20cbd3f15f88954f
BLAKE2b-256 8d434a84507dcd445b8b054a7832046e0b87bfc6df4e01bcec34ece34faa25b9

File details

Details for the file sanalyse_dhpy-1.0.0-py3-none-any.whl.

File metadata

  • Download URL: sanalyse_dhpy-1.0.0-py3-none-any.whl
  • Size: 34.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse_dhpy-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 d1a8726d143ba5e577b9531829af9d818cc78948bb1dbe38c26b0fea4454a07c
MD5 23f1c4fc61a2e47ca9a1e5dd80101779
BLAKE2b-256 5f5cba9af376f5d97bd3a611cc9e7c03fe159691dffb3ef24855b014eb4f50cf
