Lightweight Natural Language Processing for the Indonesian language.

pySastra

Design Plan

| Status | Pipeline | Description |
| --- | --- | --- |
| 🟠 | Language | A text-processing pipeline. |
| 🟡 | Tokenizer | Segment text, and create Doc objects with the discovered segment boundaries. |
| 🟠 | Lemmatizer | Determine the base forms of words. |
| 🟡 | Morphology | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. |
| 🟠 | Tagger | Annotate part-of-speech tags on Doc objects. |
| 🔄 | DependencyParser | Annotate syntactic dependencies on Doc objects. |
| 🔄 | EntityRecognizer | Annotate named entities, e.g. persons or products, on Doc objects. |
| 🔄 | TextCategorizer | Assign categories or labels to Doc objects. |
| 🔄 | Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| 🔄 | PhraseMatcher | Match sequences of tokens based on phrases. |
| 🔄 | EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| 🔄 | Sentencizer | Implement custom sentence boundary detection logic that doesn’t require the dependency parse. |

Status legend: 🟢 Completed With Tests, 🟡 Completed, 🟠 In Progress, 🔄 Planned

Reference: spaCy language pipeline
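To make the planned design more concrete, here is a minimal sketch of how a Language object could chain a Tokenizer and a Sentencizer over a shared Doc. It is not the pySastra API. Every name in it (`Token`, `Doc`, `tokenize`, `sentencizer`, `Language`) is hypothetical and chosen only to mirror the component names in the table, and the tokenization rule (keeping hyphenated reduplications such as "buku-buku" as one token) is an assumption made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Token:
    text: str
    start: int                   # character offset in the original text
    is_sent_start: bool = False  # filled in by the sentencizer


@dataclass
class Doc:
    text: str
    tokens: List[Token] = field(default_factory=list)


# Assumed rule: a token is a word (optionally hyphenated, e.g. "anak-anak")
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text: str) -> Doc:
    """Segment text and create a Doc holding the discovered tokens."""
    tokens = [Token(m.group(), m.start()) for m in TOKEN_RE.finditer(text)]
    return Doc(text, tokens)


def sentencizer(doc: Doc) -> Doc:
    """Mark sentence starts after '.', '!' or '?' without a dependency parse."""
    next_is_start = True
    for token in doc.tokens:
        token.is_sent_start = next_is_start
        next_is_start = token.text in {".", "!", "?"}
    return doc


class Language:
    """Hypothetical pipeline: a tokenizer followed by Doc -> Doc components."""

    def __init__(self, tokenizer: Callable[[str], Doc],
                 components: Optional[List[Callable[[Doc], Doc]]] = None):
        self.tokenizer = tokenizer
        self.components = list(components or [])

    def __call__(self, text: str) -> Doc:
        doc = self.tokenizer(text)
        for component in self.components:
            doc = component(doc)
        return doc


nlp = Language(tokenize, [sentencizer])
doc = nlp("Saya suka membaca buku-buku. Anak-anak bermain!")
print([t.text for t in doc.tokens])
print([t.text for t in doc.tokens if t.is_sent_start])
```

Each component takes a Doc and returns it, so later stages such as a Tagger or Lemmatizer could be appended to `components` without changing the tokenizer. This follows the general shape of the spaCy pipeline named in the reference, but it is only one possible design.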

Download files

Source Distribution: pysastra-0.1.0.tar.gz (3.0 kB)

Built Distribution: pysastra-0.1.0-py3-none-any.whl (4.2 kB)
