Intelligent Text Compression & Fact Extraction Engine using NLP

🚀 VagaCore - Intelligent Text Compression & Fact Extraction Engine

A production-ready NLP system combining Named Entity Recognition (NER), dependency parsing, and context-aware processing for intelligent fact extraction from unstructured text.

🎯 Features

Core Capabilities

🧠 Hybrid Extraction: Combines ML-based NER with rule-based syntax parsing
📝 Multi-Sentence Processing: Handles complex documents with multiple facts
🔄 Context Memory: Maintains temporal awareness across sentences
🏢 Named Entity Recognition: Identifies PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
🎯 Semantic Understanding: Extracts Subject-Verb-Object patterns with noise removal
📊 Structured Output: Returns clean JSON facts

Key Innovations

✅ Context-Aware Extraction

Sentences without explicit dates inherit from previous context
Prevents temporal information loss in multi-sentence documents
Critical for RAG and knowledge base indexing

✅ Noise Resistance

Removes adjectives and adverbs before processing
Preserves semantic relationships
Filters subjective language
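
As a rough illustration of the noise-removal idea, the sketch below drops tokens tagged as adjectives or adverbs and keeps everything else, including prepositions. It works on plain (text, POS) pairs so it runs without a model. The function name `strip_modifiers` and the hand-written tags are assumptions made for illustration; this is not VagaCore's actual `remove_noise()` implementation.

```python
# Hypothetical sketch: filter out modifier tokens by part-of-speech tag.
# With spaCy, the pairs could be built as [(t.text, t.pos_) for t in doc].
NOISE_POS = {"ADJ", "ADV"}  # Universal POS tags for adjectives and adverbs


def strip_modifiers(tagged_tokens):
    """Return the token texts whose POS tag is not an adjective or adverb."""
    return [text for text, pos in tagged_tokens if pos not in NOISE_POS]


# Hand-tagged sample sentence, written for illustration only.
tagged = [
    ("The", "DET"), ("impressive", "ADJ"), ("profit", "NOUN"),
    ("increased", "VERB"), ("sharply", "ADV"), ("by", "ADP"),
    ("15", "NUM"), ("%", "NOUN"),
]
print(" ".join(strip_modifiers(tagged)))
```

The preposition "by" survives the filter, which is what lets a later step connect the verb to the value.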

✅ Domain Intelligence

Recognizes financial keywords (revenue, profit, earnings, sales)
Prioritizes domain entities over generic organizations
Smart quantity filtering (million → context)
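
The keyword-priority idea can be sketched roughly as follows. The keyword set comes from the list above. The function `pick_entity` and its fallback order are assumptions for illustration, and the real selection rules (for example, how prepositional patterns are used) may differ.

```python
# Hypothetical sketch: prefer a financial keyword over a generic ORG entity.
FINANCIAL_KEYWORDS = {"revenue", "profit", "earnings", "sales"}


def pick_entity(tokens, org_entities):
    """Return the first financial keyword found, else the first ORG, else None."""
    for token in tokens:
        if token.lower() in FINANCIAL_KEYWORDS:
            return token.lower()
    return org_entities[0] if org_entities else None


tokens = ["Apple", "reported", "$500", "million", "in", "revenue"]
print(pick_entity(tokens, org_entities=["Apple"]))  # revenue
```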

📊 Example

Input

Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.
The profit increased by 15% in the same period.

Output

[
  {
    "subject": "Apple",
    "action": "report",
    "object": "revenue",
    "entity": "revenue",
    "value": "$500 million",
    "time": "Q3 2024"
  },
  {
    "subject": "profit",
    "action": "increase",
    "object": null,
    "entity": null,
    "value": "15%",
    "time": "Q3 2024"
  }
]

🏗️ Architecture

Input Text
    ↓
┌─────────────────────────────────────┐
│   Parser (spaCy)                    │
│   - Tokenization                    │
│   - POS Tagging                     │
│   - Named Entity Recognition        │
│   - Dependency Parsing              │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Noise Removal (Utils)             │
│   - Remove adjectives/adverbs       │
│   - Keep semantic prepositions      │
│   - Filter stop words               │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Hybrid Extraction (Extractor)     │
│                                     │
│   ┌─ ML Path (NER)                  │
│   │  - PERCENT, MONEY               │
│   │  - DATE, TIME                   │
│   │  - ORG, PERSON, LOC             │
│   │                                 │
│   └─ Rule Path (Syntax)             │
│      - Domain keywords              │
│      - Prepositional patterns       │
│      - Subject-Verb-Object          │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Context Memory (Compressor)       │
│   - Propagate temporal context      │
│   - Maintain state across sentences │
│   - Prevent information loss        │
└─────────────────────────────────────┘
    ↓
Output: Structured JSON Facts

📋 Module Overview

`parser.py`

Loads spaCy NLP model
Handles text tokenization and parsing

`extractor.py`

extract_svo(): Subject-Verb-Object extraction with intelligent object selection
extract_entities(): Named Entity Recognition with type labels
extract_entities_by_type(): Organized entity access by category
extract_details(): Hybrid NER + rule-based value/time/entity extraction

`utils.py`

remove_noise(): Removes adjectives/adverbs while preserving semantic relationships

`compressor.py`

compress(): Main pipeline with context memory
Orchestrates all components
Implements temporal propagation

🚀 Quick Start

Installation

cd vagacore
python -m venv venv
.\venv\Scripts\activate  # Windows
source venv/bin/activate # Linux/Mac

pip install spacy
python -m spacy download en_core_web_sm

Basic Usage

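import json

from compressor import compress

# Extract structured facts from a single sentence
text = "Apple reported $500 million in revenue during Q3 2024."
result = compress(text)

# Pretty-print the resulting list of fact dictionaries
print(json.dumps(result, indent=2))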

Run Demos

# Simple demo
python examples/demo.py

# Advanced demonstrations
python examples/advanced_demo.py

🔍 Use Cases

1. Retrieval-Augmented Generation (RAG)

Extract structured facts for LLM context:

facts = compress(document_text)
# Feed to LLM for better grounding
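
One way to hand the extracted facts to an LLM is to flatten them into short text lines. The sketch below assumes the fact dictionaries have the shape shown in the Example section. `facts_to_context` is a hypothetical helper, not part of VagaCore.

```python
# Hypothetical helper: turn fact dictionaries into compact context lines.
def facts_to_context(facts):
    """Render each fact as one 'key=value' line, skipping fields that are None."""
    lines = []
    for fact in facts:
        parts = [f"{key}={value}" for key, value in fact.items() if value is not None]
        lines.append("; ".join(parts))
    return "\n".join(lines)


# Sample fact copied from the Example section of this README.
sample_facts = [
    {"subject": "profit", "action": "increase", "object": None,
     "entity": None, "value": "15%", "time": "Q3 2024"},
]
print(facts_to_context(sample_facts))
```

The resulting string can be prepended to a prompt as grounding context.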

2. Financial Data Extraction

Parse earnings reports and investor documents:

earnings_report = """
Q3 2024 Revenue: $50 million
Operating margin improved by 5%
"""
facts = compress(earnings_report)

3. Knowledge Base Indexing

Create temporally-aware fact databases:

for document in documents:
    facts = compress(document)
    # Index with time-based grouping
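
A minimal sketch of time-based grouping, assuming each fact is a dictionary with a `time` key as in the Example section. The helper `group_facts_by_time` and the sample facts (including the undated one) are made up for illustration.

```python
from collections import defaultdict


def group_facts_by_time(facts):
    """Bucket fact dictionaries by their 'time' field; undated facts go under None."""
    index = defaultdict(list)
    for fact in facts:
        index[fact.get("time")].append(fact)
    return dict(index)


# Illustrative facts; the third one is invented to show the undated bucket.
facts = [
    {"subject": "Apple", "action": "report", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "value": "15%", "time": "Q3 2024"},
    {"subject": "sales", "action": "fall", "value": "2%", "time": None},
]
index = group_facts_by_time(facts)
print({period: len(items) for period, items in index.items()})
```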

4. News Analysis

Extract named entities and facts from articles:

article = get_news_article()  # placeholder: supply your own article loader
facts = compress(article)

📈 Performance

What It Handles Well ✅

Multi-sentence documents
Temporal references and quarters
Financial terminology
Organization names and locations
Percentage and monetary values
Contextual pronouns via memory

Current Limitations ⚠️

Single main action per sentence
Simple clause structures work best
Passive voice sometimes reduces accuracy
Requires English text

🔬 Technical Details

Extraction Methods

NER (Named Entity Recognition)

Uses spaCy's trained model
Entity types: PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
Confidence-based extraction
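
To illustrate organizing recognized entities by category, here is a small sketch that groups (text, label) pairs by label. The function name `group_entities_by_type` is an assumption, and the sample pairs are hand-written rather than real model output.

```python
from collections import defaultdict


def group_entities_by_type(entities):
    """Map each entity label to the list of entity texts carrying that label."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# With spaCy, these pairs could be built as [(ent.text, ent.label_) for ent in doc.ents].
# The labels below are hand-written for illustration, not actual model output.
entities = [
    ("Apple", "ORG"), ("$500 million", "MONEY"),
    ("Q3 2024", "DATE"), ("15%", "PERCENT"),
]
print(group_entities_by_type(entities))
```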

Dependency Parsing

Identifies grammatical relationships
Key patterns:
- nsubj: Nominal subject
- ROOT: Root verb
- dobj: Direct object
- pobj: Object of preposition
- attr: Predicate attribute
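
The dependency labels above can be combined into a Subject-Verb-Object triple roughly as sketched here. The `Token` class is a minimal stand-in for a parsed token, and `extract_svo_sketch` with its object-selection order is an assumption for illustration. It is not the logic of VagaCore's `extract_svo()`.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Minimal stand-in for a parsed token (hypothetical, not spaCy's Token class)."""
    text: str
    lemma: str
    dep: str   # dependency label, e.g. "nsubj", "ROOT", "dobj"
    head: int  # index of the syntactic head within the sentence


def extract_svo_sketch(tokens):
    """Return (subject, verb lemma, object) from nsubj / ROOT / dobj-attr-pobj labels."""
    root = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root is None:
        return None, None, None
    subject = next((t.text for t in tokens if t.dep == "nsubj" and t.head == root), None)
    # Prefer a direct object or attribute of the root verb,
    # then fall back to any prepositional object in the sentence.
    obj = next((t.text for t in tokens if t.dep in {"dobj", "attr"} and t.head == root), None)
    if obj is None:
        obj = next((t.text for t in tokens if t.dep == "pobj"), None)
    return subject, tokens[root].lemma, obj


# Hand-annotated parse of "Apple reported revenue", written for illustration only.
sentence = [
    Token("Apple", "Apple", "nsubj", 1),
    Token("reported", "report", "ROOT", 1),
    Token("revenue", "revenue", "dobj", 1),
]
print(extract_svo_sketch(sentence))  # ('Apple', 'report', 'revenue')
```

A sentence with no object, such as "profit increased", yields `None` in the object slot, which matches the `null` object in the Example output.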

Context Memory Algorithm

for each sentence:
    extract time from sentence
    if time is None or vague:
        use previous_time
    else:
        update previous_time
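
The pseudocode above can be written in Python roughly as follows. This is a sketch under stated assumptions: facts are dictionaries with a `time` key, and a time counts as "vague" when it appears in the hypothetical `VAGUE_TIMES` set. How VagaCore itself detects vague times is not documented here.

```python
# Hypothetical sketch of temporal propagation over per-sentence facts.
VAGUE_TIMES = {"the same period", "this period", "that period"}  # assumed examples


def propagate_time(facts):
    """Fill missing or vague 'time' fields with the most recent explicit time."""
    previous_time = None
    resolved = []
    for fact in facts:
        time = fact.get("time")
        if time is None or time.lower() in VAGUE_TIMES:
            time = previous_time  # inherit from earlier context (may still be None)
        else:
            previous_time = time  # remember the latest explicit time
        resolved.append({**fact, "time": time})
    return resolved


facts = [
    {"subject": "Apple", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "value": "15%", "time": "the same period"},
]
for fact in propagate_time(facts):
    print(fact["subject"], "->", fact["time"])
```

Both facts end up with `Q3 2024`, mirroring the Example section where the second sentence inherits its time from the first.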

🎓 Learning Resources

NLP Basics: The extraction uses fundamental NLP concepts
spaCy: Learn at https://spacy.io
Dependency Parsing: https://en.wikipedia.org/wiki/Dependency_grammar
Context in LLMs: Essential for RAG systems

🤝 Contributing

This is a demonstration project. Potential improvements include:

Multi-action sentence support
Improved passive voice handling
Custom entity type definitions
Confidence scoring for facts
Multi-language support

📄 License

Open source - use freely for learning and development

🙌 Acknowledgments

Built with:

spaCy: Industrial-strength NLP
Python: Core language
NER Technology: Modern entity recognition

VagaCore v0.5 | Hybrid NER + Rule-Based Extraction | Context-Aware Processing

Release history

1.0.1: Mar 29, 2026
0.6.0: Mar 29, 2026
0.5.0: Mar 29, 2026 (this version)

Download files

Source Distribution: vagacore-0.5.0.tar.gz (9.5 kB), uploaded Mar 29, 2026, Source
Built Distribution: vagacore-0.5.0-py3-none-any.whl (8.1 kB), uploaded Mar 29, 2026, Python 3

File details

Details for the file vagacore-0.5.0.tar.gz.

File metadata

Download URL: vagacore-0.5.0.tar.gz
Upload date: Mar 29, 2026
Size: 9.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for vagacore-0.5.0.tar.gz
Algorithm	Hash digest
SHA256	`762ab8ba77b4fc44cf6cf16267fa66a9800bb27eb3ad313fbf6830858c7be848`
MD5	`c60534e2999cdf7006c86c52aec23542`
BLAKE2b-256	`cb6473c46e8e65889d058c9069e3c5e5bbcc82434e43d6c97e5c935c121a52af`

File details

Details for the file vagacore-0.5.0-py3-none-any.whl.

File metadata

Download URL: vagacore-0.5.0-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for vagacore-0.5.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`845bbb8465c364621a5056dcc420346313cec869fa904d2468b0f0930876f7d3`
MD5	`928ddf68a7f2d02cda5894b48899c66b`
BLAKE2b-256	`7831a4b6827953e47e88ae6dc95b189a286379052dd5166e62aa091a6bbdfe2e`
