Intelligent Text Compression & Fact Extraction Engine using NLP
VagaCore - Intelligent Text Compression & Fact Extraction Engine
A production-ready NLP system combining Named Entity Recognition (NER), dependency parsing, and context-aware processing for intelligent fact extraction from unstructured text.
Features
Core Capabilities
- Hybrid Extraction: Combines ML-based NER with rule-based syntax parsing
- Multi-Sentence Processing: Handles complex documents with multiple facts
- Context Memory: Maintains temporal awareness across sentences
- Named Entity Recognition: Identifies PERCENT, MONEY, DATE, ORG, PERSON, LOC
- Semantic Understanding: Extracts Subject-Verb-Object patterns with noise removal
- Structured Output: Returns clean JSON facts
Key Innovations
Context-Aware Extraction
- Sentences without explicit dates inherit from previous context
- Prevents temporal information loss in multi-sentence documents
- Critical for RAG and knowledge base indexing
Noise Resistance
- Removes adjectives and adverbs before processing
- Preserves semantic relationships
- Filters subjective language
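As a rough illustration of the idea, the sketch below drops adjectives and adverbs from a sentence that has already been POS-tagged. The `(text, pos)` pairs are hand-written stand-ins for what a tagger such as spaCy would produce, and `strip_modifiers` and `NOISE_POS` are hypothetical names, not the package's `remove_noise()` implementation.

```python
# Universal POS tags treated as noise in this sketch
# (an assumption, not the package's actual list).
NOISE_POS = {"ADJ", "ADV"}

def strip_modifiers(tagged_tokens):
    """Return the token texts whose POS tag is not in NOISE_POS."""
    return [text for text, pos in tagged_tokens if pos not in NOISE_POS]

# Hand-tagged example; a real pipeline would take these tags from the parser.
tagged = [
    ("The", "DET"), ("impressive", "ADJ"), ("profit", "NOUN"),
    ("increased", "VERB"), ("sharply", "ADV"), ("by", "ADP"),
    ("15", "NUM"), ("%", "NOUN"),
]
cleaned = strip_modifiers(tagged)
print(" ".join(cleaned))
```

Filtering by tag rather than by word list keeps prepositions such as "by" in place, so the link between the verb and its value survives the cleanup.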
Domain Intelligence
- Recognizes financial keywords (revenue, profit, earnings, sales)
- Prioritizes domain entities over generic organizations
- Smart quantity filtering (million → context)
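One way to picture the prioritization rule is a lookup that prefers a financial keyword and only falls back to an organization name when no keyword is present. This is a minimal sketch under that assumption; `pick_entity` and `FINANCIAL_KEYWORDS` are hypothetical names and the real selection logic may differ.

```python
# Keywords taken from the list above; the set name is illustrative.
FINANCIAL_KEYWORDS = {"revenue", "profit", "earnings", "sales"}

def pick_entity(tokens, org_entities):
    """Prefer a financial keyword; fall back to the first organization, else None."""
    for token in tokens:
        word = token.lower().strip(".,;:")
        if word in FINANCIAL_KEYWORDS:
            return word
    return org_entities[0] if org_entities else None

tokens = "Apple reported $500 million in revenue during Q3 2024.".split()
entity = pick_entity(tokens, org_entities=["Apple"])
print(entity)
```

With this ordering, the sentence yields "revenue" as the entity even though "Apple" was also recognized as an organization.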
Example
Input
Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.
The profit increased by 15% in the same period.
Output
[
  {
    "subject": "Apple",
    "action": "report",
    "object": "revenue",
    "entity": "revenue",
    "value": "$500 million",
    "time": "Q3 2024"
  },
  {
    "subject": "profit",
    "action": "increase",
    "object": null,
    "entity": null,
    "value": "15%",
    "time": "Q3 2024"
  }
]
Architecture
Input Text
    |
    v
+-------------------------------------+
| Parser (spaCy)                      |
| - Tokenization                      |
| - POS Tagging                       |
| - Named Entity Recognition          |
| - Dependency Parsing                |
+-------------------------------------+
    |
    v
+-------------------------------------+
| Noise Removal (Utils)               |
| - Remove adjectives/adverbs         |
| - Keep semantic prepositions        |
| - Filter stop words                 |
+-------------------------------------+
    |
    v
+-------------------------------------+
| Hybrid Extraction (Extractor)       |
|                                     |
| ML Path (NER)                       |
|   - PERCENT, MONEY                  |
|   - DATE, TIME                      |
|   - ORG, PERSON, LOC                |
|                                     |
| Rule Path (Syntax)                  |
|   - Domain keywords                 |
|   - Prepositional patterns          |
|   - Subject-Verb-Object             |
+-------------------------------------+
    |
    v
+-------------------------------------+
| Context Memory (Compressor)         |
| - Propagate temporal context        |
| - Maintain state across sentences   |
| - Prevent information loss          |
+-------------------------------------+
    |
    v
Output: Structured JSON Facts
Module Overview
parser.py
- Loads spaCy NLP model
- Handles text tokenization and parsing
extractor.py
- extract_svo(): Subject-Verb-Object extraction with intelligent object selection
- extract_entities(): Named Entity Recognition with type labels
- extract_entities_by_type(): Organized entity access by category
- extract_details(): Hybrid NER + rule-based value/time/entity extraction
utils.py
- remove_noise(): Removes adjectives/adverbs while preserving semantic relationships
compressor.py
- compress(): Main pipeline with context memory
- Orchestrates all components
- Implements temporal propagation
Quick Start
Installation
cd vagacore
python -m venv venv
.\venv\Scripts\activate # Windows
source venv/bin/activate # Linux/Mac
pip install spacy
python -m spacy download en_core_web_sm
Basic Usage
import json

from compressor import compress

text = "Apple reported $500 million in revenue during Q3 2024."
result = compress(text)

# Pretty-print the extracted facts
print(json.dumps(result, indent=2))
Run Demos
# Simple demo
python examples/demo.py
# Advanced demonstrations
python examples/advanced_demo.py
Use Cases
1. Retrieval-Augmented Generation (RAG)
Extract structured facts for LLM context:
facts = compress(document_text)
# Feed to LLM for better grounding
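To make the grounding step concrete, the sketch below turns fact dictionaries into a compact text block that could be placed in a prompt. The facts are copied from the Example section so the snippet runs without the package installed; `facts_to_context` and the prompt wording are hypothetical, not part of VagaCore.

```python
def facts_to_context(facts):
    """Render fact dicts as one line each, skipping fields that are None."""
    lines = []
    for fact in facts:
        parts = [f"{key}={value}" for key, value in fact.items() if value is not None]
        lines.append("- " + ", ".join(parts))
    return "\n".join(lines)

# In practice this list would come from compress(document_text).
facts = [
    {"subject": "Apple", "action": "report", "object": "revenue",
     "entity": "revenue", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "object": None,
     "entity": None, "value": "15%", "time": "Q3 2024"},
]
context = facts_to_context(facts)
prompt = f"Answer using only these facts:\n{context}"
print(prompt)
```

Dropping the null fields keeps the prompt short, which matters when many facts are packed into a limited context window.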
2. Financial Data Extraction
Parse earnings reports and investor documents:
earnings_report = """
Q3 2024 Revenue: $50 million
Operating margin improved by 5%
"""
facts = compress(earnings_report)
3. Knowledge Base Indexing
Create temporally-aware fact databases:
for document in documents:
    facts = compress(document)
    # Index with time-based grouping
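Time-based grouping could look something like the following sketch, which buckets facts by their "time" field. The sample facts are hand-written for illustration (the third one is invented to show the no-time case), and `index_by_time` is a hypothetical helper rather than a function shipped with the package.

```python
from collections import defaultdict

def index_by_time(facts):
    """Group fact dicts under their "time" value; facts with no time go under "unknown"."""
    index = defaultdict(list)
    for fact in facts:
        index[fact.get("time") or "unknown"].append(fact)
    return dict(index)

# Illustrative facts; real ones would come from compress(document).
facts = [
    {"subject": "Apple", "action": "report", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "value": "15%", "time": "Q3 2024"},
    {"subject": "margin", "action": "improve", "value": "5%", "time": None},
]
index = index_by_time(facts)
print({period: len(items) for period, items in index.items()})
```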
4. News Analysis
Extract named entities and facts from articles:
article = get_news_article()
entities = compress(article)
Performance
What It Handles Well
- Multi-sentence documents
- Temporal references and quarters
- Financial terminology
- Organization names and locations
- Percentage and monetary values
- Contextual pronouns via memory
Current Limitations
- Single main action per sentence
- Simple clause structures work best
- Passive voice sometimes reduces accuracy
- Requires English text
Technical Details
Extraction Methods
NER (Named Entity Recognition)
- Uses spaCy's trained model
- Entity types: PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
- Confidence-based extraction
Dependency Parsing
- Identifies grammatical relationships
- Key patterns:
  - nsubj: Nominal subject
  - ROOT: Root verb
  - dobj: Direct object
  - pobj: Object of preposition
  - attr: Predicate attribute
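The way these labels combine into a Subject-Verb-Object triple can be sketched without loading a model. The example below uses a hand-built parse in place of real spaCy output: `Token` is a stand-in tuple type, the ROOT token is its own head (as in spaCy), and `extract_svo_sketch` is a hypothetical simplification of what `extract_svo()` might do.

```python
from collections import namedtuple

# Stand-in for a parsed token: dep is the dependency label,
# head is the index of the governing token in the sentence.
Token = namedtuple("Token", ["text", "lemma", "dep", "head"])

def extract_svo_sketch(tokens):
    """Return (subject, verb lemma, object) for the ROOT verb, with None for missing parts."""
    root = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root is None:
        return None, None, None
    subject = obj = None
    for token in tokens:
        if token.head != root:
            continue  # only look at direct children of the root verb
        if token.dep == "nsubj" and subject is None:
            subject = token.text
        elif token.dep in ("dobj", "attr") and obj is None:
            obj = token.text
    return subject, tokens[root].lemma, obj

# Hand-built parse of "Apple reported revenue"; index 1 is the ROOT.
sentence = [
    Token("Apple", "Apple", "nsubj", 1),
    Token("reported", "report", "ROOT", 1),
    Token("revenue", "revenue", "dobj", 1),
]
svo = extract_svo_sketch(sentence)
print(svo)
```

Restricting the search to direct children of the root is what limits this approach to a single main action per sentence.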
Context Memory Algorithm
for each sentence:
    extract time from sentence
    if time is None or vague:
        use previous_time
    else:
        update previous_time
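The pseudocode above can be sketched in Python as follows. This version assumes that an explicit time is a quarter reference such as "Q3 2024" found with a regular expression, and it skips the "vague" check entirely; `propagate_time` and `QUARTER` are hypothetical names, and the package itself takes times from the extractor rather than from a regex.

```python
import re

# Matches quarter references such as "Q3 2024"
# (an assumption about what counts as an explicit time).
QUARTER = re.compile(r"\bQ[1-4]\s+\d{4}\b")

def propagate_time(sentences):
    """Pair each sentence with its own time, or with the most recent earlier one."""
    previous_time = None
    results = []
    for sentence in sentences:
        match = QUARTER.search(sentence)
        if match:
            previous_time = match.group(0)  # update the memory
        # When nothing matched, the sentence inherits previous_time.
        results.append((sentence, previous_time))
    return results

sentences = [
    "Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.",
    "The profit increased by 15% in the same period.",
]
times = [time for _, time in propagate_time(sentences)]
print(times)
```

Both sentences end up tagged with Q3 2024, matching the Example section, where the second fact carries a time it never states explicitly.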
Learning Resources
- NLP Basics: The extraction uses fundamental NLP concepts
- spaCy: Learn at https://spacy.io
- Dependency Parsing: https://en.wikipedia.org/wiki/Dependency_grammar
- Context in LLMs: Essential for RAG systems
Contributing
This is a demonstration project, but there is room for improvement in several areas:
- Multi-action sentence support
- Improved passive voice handling
- Custom entity type definitions
- Confidence scoring for facts
- Multi-language support
License
Open source - use freely for learning and development
Acknowledgments
Built with:
- spaCy: Industrial-strength NLP
- Python: Core language
- NER Technology: Modern entity recognition
VagaCore v0.5 | Hybrid NER + Rule-Based Extraction | Context-Aware Processing
File details
Details for the file vagacore-1.0.1.tar.gz.
File metadata
- Download URL: vagacore-1.0.1.tar.gz
- Upload date:
- Size: 16.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b89fa0a32aab2ce06e18c8c2485abb5add0dbd3804bdcc31c744b36232152f5 |
| MD5 | bccd25c50b1b0511610eb1bb4c6a95da |
| BLAKE2b-256 | 158d8f88c0dae1c73179176bde981762aeba8597d8d353cedcfab3c5372c0761 |
File details
Details for the file vagacore-1.0.1-py3-none-any.whl.
File metadata
- Download URL: vagacore-1.0.1-py3-none-any.whl
- Upload date:
- Size: 15.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 44fc7e4550c08dde014e9d76c4edc85e21f0a4deedbe42a7b1cfee4d61488efc |
| MD5 | 8509cca05a9654602ba2ac3ac1b427a6 |
| BLAKE2b-256 | 44ffdc256b838ab1916a7df33d75d64ed7fb0ce1ce38f8c9b9aaa98c5fe07e12 |