
Intelligent Text Compression & Fact Extraction Engine using NLP

🚀 VagaCore - Intelligent Text Compression & Fact Extraction Engine

A production-ready NLP system combining Named Entity Recognition (NER), dependency parsing, and context-aware processing for intelligent fact extraction from unstructured text.

🎯 Features

Core Capabilities

  • 🧠 Hybrid Extraction: Combines ML-based NER with rule-based syntax parsing
  • 📝 Multi-Sentence Processing: Handles complex documents with multiple facts
  • 🔄 Context Memory: Maintains temporal awareness across sentences
  • 🏢 Named Entity Recognition: Identifies PERCENT, MONEY, DATE, ORG, PERSON, LOC
  • 🎯 Semantic Understanding: Extracts Subject-Verb-Object patterns with noise removal
  • 📊 Structured Output: Returns clean JSON facts

Key Innovations

✅ Context-Aware Extraction

  • Sentences without explicit dates inherit from previous context
  • Prevents temporal information loss in multi-sentence documents
  • Critical for RAG and knowledge base indexing

✅ Noise Resistance

  • Removes adjectives and adverbs before processing
  • Preserves semantic relationships
  • Filters subjective language
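
The idea behind noise removal can be sketched without spaCy, assuming the tokens have already been tagged with Universal POS labels (the kind spaCy exposes as `token.pos_`). The function name `drop_modifiers` and the `(text, pos)` tuple format are illustrative assumptions, not VagaCore's actual `remove_noise()` implementation.

```python
# Hypothetical sketch: drop adjectives and adverbs from POS-tagged tokens.
# The (text, pos) input format is an assumption made for illustration.
NOISE_POS = {"ADJ", "ADV"}  # Universal POS tags for adjectives and adverbs


def drop_modifiers(tagged_tokens):
    """Return the token texts whose POS tag is not in NOISE_POS."""
    return [text for text, pos in tagged_tokens if pos not in NOISE_POS]


# Hand-tagged tokens for "The impressive profit increased sharply by 15 %".
tagged = [
    ("The", "DET"), ("impressive", "ADJ"), ("profit", "NOUN"),
    ("increased", "VERB"), ("sharply", "ADV"), ("by", "ADP"),
    ("15", "NUM"), ("%", "NOUN"),
]
print(" ".join(drop_modifiers(tagged)))
```

Filtering by tag rather than by word list keeps the subject, verb, and value in place, which is what the later extraction steps depend on.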

✅ Domain Intelligence

  • Recognizes financial keywords (revenue, profit, earnings, sales)
  • Prioritizes domain entities over generic organizations
  • Smart quantity filtering (million → context)
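
One way to read "prioritizes domain entities over generic organizations" is shown in the sketch below. The function `pick_entity`, its inputs, and the fallback order are assumptions for illustration; the real selection logic in `extractor.py` may differ.

```python
# Hypothetical sketch: prefer a financial keyword over a generic ORG entity.
FINANCIAL_KEYWORDS = {"revenue", "profit", "earnings", "sales"}


def pick_entity(tokens, org_entities):
    """Return the first financial keyword found, else the first ORG, else None."""
    for token in tokens:
        if token.lower() in FINANCIAL_KEYWORDS:
            return token.lower()
    return org_entities[0] if org_entities else None


tokens = ["Apple", "reported", "$500", "million", "in", "revenue"]
print(pick_entity(tokens, ["Apple"]))
```

Here the keyword wins over the organization name, so the sentence is indexed under the financial concept rather than the company.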

📊 Example

Input

Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.
The profit increased by 15% in the same period.

Output

[
  {
    "subject": "Apple",
    "action": "report",
    "object": "revenue",
    "entity": "revenue",
    "value": "$500 million",
    "time": "Q3 2024"
  },
  {
    "subject": "profit",
    "action": "increase",
    "object": null,
    "entity": null,
    "value": "15%",
    "time": "Q3 2024"
  }
]

๐Ÿ—๏ธ Architecture

Input Text
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Parser (spaCy)                    โ”‚
โ”‚   - Tokenization                    โ”‚
โ”‚   - POS Tagging                     โ”‚
โ”‚   - Named Entity Recognition        โ”‚
โ”‚   - Dependency Parsing              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Noise Removal (Utils)             โ”‚
โ”‚   - Remove adjectives/adverbs       โ”‚
โ”‚   - Keep semantic prepositions      โ”‚
โ”‚   - Filter stop words               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Hybrid Extraction (Extractor)     โ”‚
โ”‚                                     โ”‚
โ”‚   โ”Œโ”€ ML Path (NER)                 โ”‚
โ”‚   โ”‚  - PERCENT, MONEY              โ”‚
โ”‚   โ”‚  - DATE, TIME                  โ”‚
โ”‚   โ”‚  - ORG, PERSON, LOC            โ”‚
โ”‚   โ”‚                                 โ”‚
โ”‚   โ””โ”€ Rule Path (Syntax)             โ”‚
โ”‚      - Domain keywords              โ”‚
โ”‚      - Prepositional patterns       โ”‚
โ”‚      - Subject-Verb-Object          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Context Memory (Compressor)       โ”‚
โ”‚   - Propagate temporal context      โ”‚
โ”‚   - Maintain state across sentences โ”‚
โ”‚   - Prevent information loss        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
Output: Structured JSON Facts

📋 Module Overview

parser.py

  • Loads spaCy NLP model
  • Handles text tokenization and parsing

extractor.py

  • extract_svo(): Subject-Verb-Object extraction with intelligent object selection
  • extract_entities(): Named Entity Recognition with type labels
  • extract_entities_by_type(): Organized entity access by category
  • extract_details(): Hybrid NER + rule-based value/time/entity extraction

utils.py

  • remove_noise(): Removes adjectives/adverbs while preserving semantic relationships

compressor.py

  • compress(): Main pipeline with context memory
  • Orchestrates all components
  • Implements temporal propagation

🚀 Quick Start

Installation

cd vagacore
python -m venv venv
.\venv\Scripts\activate  # Windows
source venv/bin/activate # Linux/Mac

pip install spacy
python -m spacy download en_core_web_sm

Basic Usage

import json

from compressor import compress

text = "Apple reported $500 million in revenue during Q3 2024."
result = compress(text)  # list of fact dictionaries

# Pretty-print the extracted facts as JSON
print(json.dumps(result, indent=2))

Run Demos

# Simple demo
python examples/demo.py

# Advanced demonstrations
python examples/advanced_demo.py

🔍 Use Cases

1. Retrieval-Augmented Generation (RAG)

Extract structured facts for LLM context:

facts = compress(document_text)
# Feed to LLM for better grounding
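
How the facts reach the LLM is left open above. The sketch below shows one possible approach: rendering each fact as a compact line of text for a prompt. The helper `facts_to_context` is a hypothetical name, and the sample facts are copied from the Example section rather than produced by calling `compress()`.

```python
def facts_to_context(facts):
    """Render fact dicts as one bullet line each, skipping empty fields."""
    lines = []
    for fact in facts:
        parts = [f"{key}={value}" for key, value in fact.items() if value is not None]
        lines.append("- " + ", ".join(parts))
    return "\n".join(lines)


# Sample facts taken from the Example output shown earlier.
facts = [
    {"subject": "Apple", "action": "report", "object": "revenue",
     "entity": "revenue", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "object": None,
     "entity": None, "value": "15%", "time": "Q3 2024"},
]
print(facts_to_context(facts))
```

Dropping the null fields keeps the prompt short, which matters when many facts are packed into a limited context window.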

2. Financial Data Extraction

Parse earnings reports and investor documents:

earnings_report = """
Q3 2024 Revenue: $50 million
Operating margin improved by 5%
"""
facts = compress(earnings_report)
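
The `value` field comes back as a string such as "$500 million" or "15%". For downstream calculations it may help to normalize it. The sketch below is a hypothetical post-processing helper, not part of VagaCore; `parse_value`, the multiplier table, and treating "$" as USD are all assumptions.

```python
import re

# Assumed scale words; extend as needed for your documents.
MULTIPLIERS = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def parse_value(value):
    """Convert strings like "$500 million" or "15%" to (number, unit), or None."""
    match = re.search(r"(\d+(?:\.\d+)?)", value.replace(",", ""))
    if not match:
        return None
    number = float(match.group(1))
    lowered = value.lower()
    for word, factor in MULTIPLIERS.items():
        if word in lowered:
            number *= factor
    if "%" in value:
        return number, "percent"
    if "$" in value:
        return number, "USD"  # assumption: a bare "$" means US dollars
    return number, None


print(parse_value("$500 million"))
print(parse_value("15%"))
```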

3. Knowledge Base Indexing

Create temporally-aware fact databases:

for document in documents:
    facts = compress(document)
    # Index with time-based grouping
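
The time-based grouping mentioned in the comment could look like the sketch below. `group_by_time` is a hypothetical helper, and the fact list is hand-written sample data (the third entry is invented to show the undated case), not real output from `compress()`.

```python
from collections import defaultdict


def group_by_time(facts):
    """Map each time label to its facts; undated facts go under "unknown"."""
    grouped = defaultdict(list)
    for fact in facts:
        grouped[fact.get("time") or "unknown"].append(fact)
    return dict(grouped)


facts = [
    {"subject": "Apple", "action": "report", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "value": "15%", "time": "Q3 2024"},
    {"subject": "sales", "action": "fall", "value": "3%", "time": None},  # invented sample
]
index = group_by_time(facts)
print({time: len(items) for time, items in index.items()})
```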

4. News Analysis

Extract named entities and facts from articles:

article = get_news_article()  # placeholder: supply your own article text
facts = compress(article)     # list of extracted facts

📈 Performance

What It Handles Well ✅

  • Multi-sentence documents
  • Temporal references and quarters
  • Financial terminology
  • Organization names and locations
  • Percentage and monetary values
  • Contextual pronouns via memory

Current Limitations ⚠️

  • Single main action per sentence
  • Simple clause structures work best
  • Passive voice sometimes reduces accuracy
  • Requires English text

🔬 Technical Details

Extraction Methods

NER (Named Entity Recognition)

  • Uses spaCy's trained model
  • Entity types: PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
  • Confidence-based extraction

Dependency Parsing

  • Identifies grammatical relationships
  • Key patterns:
    • nsubj: Nominal subject
    • ROOT: Root verb
    • dobj: Direct object
    • pobj: Object of preposition
    • attr: Predicate attribute
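
The dependency labels above are enough to sketch a basic Subject-Verb-Object lookup. To stay runnable without a spaCy model, the example uses a small `Tok` dataclass as a hypothetical stand-in for parsed tokens, with hand-annotated labels. `extract_svo_sketch` is an illustration of the pattern, not the project's `extract_svo()`.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (spaCy exposes similar fields)."""
    text: str
    lemma: str
    dep: str   # dependency label, e.g. "nsubj", "ROOT", "dobj"
    head: int  # index of the syntactic head within the sentence


def extract_svo_sketch(tokens):
    """Return (subject, verb lemma, object) for the ROOT verb; missing parts are None."""
    root = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root is None:
        return None, None, None

    def child(labels):
        # First token attached directly to the root with one of the given labels.
        return next((t.text for t in tokens if t.head == root and t.dep in labels), None)

    return child({"nsubj"}), tokens[root].lemma, child({"dobj", "attr"})


# Hand-annotated parse of "Apple reported revenue" (labels chosen for illustration).
sentence = [
    Tok("Apple", "Apple", "nsubj", 1),
    Tok("reported", "report", "ROOT", 1),
    Tok("revenue", "revenue", "dobj", 1),
]
print(extract_svo_sketch(sentence))
```

Only direct children of the root are considered, which matches the stated limitation of one main action per sentence.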

Context Memory Algorithm

for each sentence:
    extract time from sentence
    if time is None or vague:
        use previous_time
    else:
        update previous_time
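
The pseudocode above can be made concrete as follows. This is a minimal sketch under stated assumptions: a "time" is detected with a simple quarter-plus-year regex, and the set of vague phrases is invented for illustration. The names `find_time`, `propagate_time`, and `VAGUE_TIMES` are hypothetical; VagaCore itself relies on spaCy's NER for dates.

```python
import re

# Assumption: an explicit time is a quarter-plus-year mention such as "Q3 2024".
QUARTER_RE = re.compile(r"\bQ[1-4]\s+\d{4}\b")
VAGUE_TIMES = {"the same period", "that period"}  # hypothetical vague phrases


def find_time(sentence):
    """Return an explicit quarter, a vague phrase, or None if no time is found."""
    match = QUARTER_RE.search(sentence)
    if match:
        return match.group(0)
    lowered = sentence.lower()
    return next((phrase for phrase in VAGUE_TIMES if phrase in lowered), None)


def propagate_time(sentences):
    """Resolve one time per sentence, inheriting from earlier context when needed."""
    previous_time = None
    resolved = []
    for sentence in sentences:
        time = find_time(sentence)
        if time is None or time in VAGUE_TIMES:
            time = previous_time   # inherit from the previous context
        else:
            previous_time = time   # remember the explicit time
        resolved.append(time)
    return resolved


sentences = [
    "Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.",
    "The profit increased by 15% in the same period.",
]
print(propagate_time(sentences))
```

With the two sentences from the Example section, the second sentence inherits "Q3 2024" from the first, matching the `time` field in the sample output.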

🎓 Learning Resources

🤝 Contributing

This is a demonstration project. Potential improvements include:

  • Multi-action sentence support
  • Improved passive voice handling
  • Custom entity type definitions
  • Confidence scoring for facts
  • Multi-language support

📄 License

Open source - use freely for learning and development

🙌 Acknowledgments

Built with:

  • spaCy: Industrial-strength NLP
  • Python: Core language
  • NER Technology: Modern entity recognition

VagaCore v0.5 | Hybrid NER + Rule-Based Extraction | Context-Aware Processing
