A simple, accessible Python toolkit for digital humanities scholars, covering basic and advanced text analysis, exploration, processing, and parsing.

Project description

sanalyse

sanalyse is an open-source, unified Python toolkit for Digital Humanities. It provides a simple, consistent API for core text-analysis tasks across English, Hindi, and Urdu. The library is designed to grow incrementally: the current 0.x releases cover basic functionality, and more advanced techniques are planned for later releases.


🚀 Features (v0.x)

Core Text‐Processing

  • Normalization: Unicode normalization, lowercasing, diacritic removal.
  • Tokenization: Language‐aware tokenizers for English, Hindi, and Urdu.
  • Stopword Removal: Built-in stopword lists for all three languages.
  • Stemming & Lemmatization:
    • English: Porter & Snowball stemmers.
    • Hindi/Urdu: Rule‐based light stemmer.
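
The processing steps above can be illustrated with the standard library alone. This page does not document sanalyse's own API, so the sketch below does not use it: `normalize`, `tokenize`, `remove_stopwords`, and `DEMO_STOPWORDS` are hypothetical names chosen for illustration, and the stopword set is a toy example, not the library's built-in list.

```python
import unicodedata

# Toy stopword set for the demo only (an assumption, not sanalyse's list).
DEMO_STOPWORDS = {"the", "on", "है"}


def normalize(text, strip_diacritics=False):
    """Apply NFC normalization and lowercasing; optionally drop combining marks."""
    text = unicodedata.normalize("NFC", text).lower()
    if strip_diacritics:
        # Decompose, drop characters with a non-zero combining class, recompose.
        # This suits Latin-script text; in Devanagari it would also remove
        # the virama and nukta, so it is off by default.
        decomposed = unicodedata.normalize("NFD", text)
        kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        text = unicodedata.normalize("NFC", kept)
    return text


def tokenize(text):
    """Split on whitespace, then trim punctuation and symbols from token edges."""
    tokens = []
    for chunk in text.split():
        start, end = 0, len(chunk)
        while start < end and unicodedata.category(chunk[start])[0] in "PS":
            start += 1
        while end > start and unicodedata.category(chunk[end - 1])[0] in "PS":
            end -= 1
        if start < end:
            tokens.append(chunk[start:end])
    return tokens


def remove_stopwords(tokens, stopwords):
    """Keep only tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stopwords]


if __name__ == "__main__":
    english = tokenize(normalize("The cat sat on the mat."))
    print(remove_stopwords(english, DEMO_STOPWORDS))
    print(tokenize("यह किताब अच्छी है।"))
```

The tokenizer trims by Unicode category rather than matching `\w+`, because Python's `re` module does not treat Devanagari vowel signs as word characters and would split Hindi words in the middle.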

Exploratory Analysis

  • Frequency Analysis: Compute word and n-gram frequencies.
  • Concordance: KWIC (Key Word in Context) display.
  • Collocations: Identify bigrams and trigrams with PMI scoring.
  • Basic Readability: Flesch–Kincaid for English; placeholder metrics for Hindi/Urdu.
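
As a rough illustration of the analyses listed above, the following standard-library sketch computes n-gram frequencies, KWIC lines, bigram PMI, and the Flesch–Kincaid grade level. All function names here are hypothetical and are not taken from sanalyse. The PMI estimate uses simple relative frequencies with no smoothing, and the readability function expects word, sentence, and syllable counts to be supplied by the caller.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(tokens, n=1):
    """Count how often each n-gram occurs."""
    return Counter(ngrams(tokens, n))


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def bigram_pmi(tokens, min_count=1):
    """Score each bigram with PMI = log2(p(x, y) / (p(x) * p(y)))."""
    unigrams = Counter(tokens)
    bigrams = Counter(ngrams(tokens, 2))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts (English only)."""
    return 0.39 * n_words / n_sentences + 11.8 * n_syllables / n_words - 15.59


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    print(ngram_frequencies(tokens).most_common(2))
    print(kwic(tokens, "sat", window=2))
    print(sorted(bigram_pmi(tokens).items(), key=lambda kv: -kv[1])[:3])
```

On small corpora PMI tends to favor rare pairs, so raising `min_count` is a common way to filter out one-off bigrams.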

Utilities

  • Language Detection: Fast heuristic language tagger.
  • Text I/O: Read and write UTF-8 plain text; support for CSV/TSV corpora.
  • Batch Processing: Apply any analyzer over a directory of text files.
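
One way such utilities could work, sketched under stated assumptions: the page does not say how sanalyse's language tagger is implemented, so `detect_language` below is a hypothetical stand-in that counts characters per Unicode block. Strictly speaking it detects script rather than language, since Arabic-script text is labeled "ur" even if it is Persian or Arabic. `apply_to_directory` is likewise an illustrative name for running an analyzer over a folder of UTF-8 text files.

```python
from collections import Counter
from pathlib import Path


def detect_language(text):
    """Guess 'en', 'hi', or 'ur' from the dominant script; 'unknown' otherwise."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:  # Devanagari block
            counts["hi"] += 1
        elif 0x0600 <= cp <= 0x06FF or 0x0750 <= cp <= 0x077F:  # Arabic blocks
            counts["ur"] += 1
        elif ch.isascii() and ch.isalpha():  # basic Latin letters
            counts["en"] += 1
    if not counts:
        return "unknown"
    return counts.most_common(1)[0][0]


def apply_to_directory(directory, analyzer, pattern="*.txt"):
    """Run `analyzer` on each matching UTF-8 file; return {filename: result}."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        results[path.name] = analyzer(path.read_text(encoding="utf-8"))
    return results


if __name__ == "__main__":
    print(detect_language("This is a test"))
    print(detect_language("यह एक परीक्षा है"))
    print(detect_language("یہ ایک امتحان ہے"))
    print(apply_to_directory(".", detect_language))
```

Any callable that accepts a string can be passed as `analyzer`, for example a word counter such as `lambda text: len(text.split())`.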

🔭 Upcoming Roadmap

Advanced features are under active development and will land incrementally in upcoming 1.x and 2.x releases.

  • Named Entity Recognition: Pretrained models for people, places, organizations.

  • Network & Graph Analysis: Build and analyze co‐occurrence and social networks.

  • Topic Modeling: LDA, NMF, hLDA with cross‐lingual support.

  • Stylometry & Authorship Attribution: Feature extraction + modeling tools.

  • Sentiment & Emotion Analysis: Transformer‐based sentiment classifiers for all supported languages.

  • OCR & Image‐to‐Text: Integrate Tesseract pipelines.

  • Geospatial Analysis: Map place‐name occurrences; generate time‐space visualizations.

  • Deep Learning & Embeddings: Multilingual BERT embeddings, topic‐aware embeddings.

  • Translation & Transliteration: Bidirectional transliteration between Devanagari, Perso‐Arabic scripts and Roman.

  • Web-Based Interface: Streamlit-based, plug-and-play front end for accessing the tools.


📦 Installation

pip install sanalyse

Download files

Source Distribution

sanalyse_dhpy-0.1.0.tar.gz (48.4 kB)

Uploaded Source

Built Distribution

sanalyse_dhpy-0.1.0-py3-none-any.whl (34.7 kB)

Uploaded Python 3

File details

Details for the file sanalyse_dhpy-0.1.0.tar.gz.

File metadata

  • Download URL: sanalyse_dhpy-0.1.0.tar.gz
  • Upload date:
  • Size: 48.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse_dhpy-0.1.0.tar.gz
Algorithm Hash digest
SHA256 1f2ab1cad862d3bd52702ff49db1362878324547c754cf3d561d2a77690f0d5c
MD5 ece366bc2c9636e7688a74fb43c955c5
BLAKE2b-256 405af7e6078046ef3fca8d99f68e400d09be27cd14223422d1ce99546b941b90

File details

Details for the file sanalyse_dhpy-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: sanalyse_dhpy-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 34.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for sanalyse_dhpy-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 061a30a37b291206b0378a4f1e8a5202f976dd6d033aa5879d9a6da46e85997a
MD5 796ce4598d6d82d9306b86375dd8716f
BLAKE2b-256 3338a96cf934f2d25db90fe6589b559968b4a2741c549b99b6f3169bdf3c4977
