Intelligent Text Compression & Fact Extraction Engine using NLP

🚀 VagaCore - Intelligent Text Compression & Fact Extraction Engine

A production-ready NLP system combining Named Entity Recognition (NER), dependency parsing, and context-aware processing for intelligent fact extraction from unstructured text.

🎯 Features

Core Capabilities

  • 🧠 Hybrid Extraction: Combines ML-based NER with rule-based syntax parsing
  • 📝 Multi-Sentence Processing: Handles complex documents with multiple facts
  • 🔄 Context Memory: Maintains temporal awareness across sentences
  • 🏢 Named Entity Recognition: Identifies PERCENT, MONEY, DATE, ORG, PERSON, LOC
  • 🎯 Semantic Understanding: Extracts Subject-Verb-Object patterns with noise removal
  • 📊 Structured Output: Returns clean JSON facts

Key Innovations

✅ Context-Aware Extraction

  • Sentences without explicit dates inherit from previous context
  • Prevents temporal information loss in multi-sentence documents
  • Critical for RAG and knowledge base indexing

✅ Noise Resistance

  • Removes adjectives and adverbs before processing
  • Preserves semantic relationships
  • Filters subjective language

✅ Domain Intelligence

  • Recognizes financial keywords (revenue, profit, earnings, sales)
  • Prioritizes domain entities over generic organizations
  • Smart quantity filtering (million → context)

📊 Example

Input

Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.
The profit increased by 15% in the same period.

Output

[
  {
    "subject": "Apple",
    "action": "report",
    "object": "revenue",
    "entity": "revenue",
    "value": "$500 million",
    "time": "Q3 2024"
  },
  {
    "subject": "profit",
    "action": "increase",
    "object": null,
    "entity": null,
    "value": "15%",
    "time": "Q3 2024"
  }
]

๐Ÿ—๏ธ Architecture

Input Text
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Parser (spaCy)                    โ”‚
โ”‚   - Tokenization                    โ”‚
โ”‚   - POS Tagging                     โ”‚
โ”‚   - Named Entity Recognition        โ”‚
โ”‚   - Dependency Parsing              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Noise Removal (Utils)             โ”‚
โ”‚   - Remove adjectives/adverbs       โ”‚
โ”‚   - Keep semantic prepositions      โ”‚
โ”‚   - Filter stop words               โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Hybrid Extraction (Extractor)     โ”‚
โ”‚                                     โ”‚
โ”‚   โ”Œโ”€ ML Path (NER)                 โ”‚
โ”‚   โ”‚  - PERCENT, MONEY              โ”‚
โ”‚   โ”‚  - DATE, TIME                  โ”‚
โ”‚   โ”‚  - ORG, PERSON, LOC            โ”‚
โ”‚   โ”‚                                 โ”‚
โ”‚   โ””โ”€ Rule Path (Syntax)             โ”‚
โ”‚      - Domain keywords              โ”‚
โ”‚      - Prepositional patterns       โ”‚
โ”‚      - Subject-Verb-Object          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Context Memory (Compressor)       โ”‚
โ”‚   - Propagate temporal context      โ”‚
โ”‚   - Maintain state across sentences โ”‚
โ”‚   - Prevent information loss        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    โ†“
Output: Structured JSON Facts

📋 Module Overview

parser.py

  • Loads spaCy NLP model
  • Handles text tokenization and parsing

extractor.py

  • extract_svo(): Subject-Verb-Object extraction with intelligent object selection
  • extract_entities(): Named Entity Recognition with type labels
  • extract_entities_by_type(): Organized entity access by category
  • extract_details(): Hybrid NER + rule-based value/time/entity extraction

utils.py

  • remove_noise(): Removes adjectives/adverbs while preserving semantic relationships
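
As a rough illustration of the idea behind remove_noise(), the sketch below drops adjectives and adverbs from a list of (text, POS) pairs. The tuple input and the name strip_modifiers are assumptions made for this example; the real function presumably works on spaCy tokens, where the same Universal POS tags are available as token.pos_.

```python
# Hypothetical sketch, not VagaCore's actual remove_noise() implementation.
# Input is a list of (text, pos) pairs using Universal POS tags, the same
# tag set spaCy exposes as token.pos_.
NOISE_POS = {"ADJ", "ADV"}


def strip_modifiers(tagged_tokens):
    """Return the token texts whose POS tag is not an adjective or adverb."""
    return [text for text, pos in tagged_tokens if pos not in NOISE_POS]


tagged = [
    ("The", "DET"), ("impressive", "ADJ"), ("profit", "NOUN"),
    ("increased", "VERB"), ("significantly", "ADV"),
    ("by", "ADP"), ("15", "NUM"), ("%", "NOUN"),
]
cleaned = strip_modifiers(tagged)
print(" ".join(cleaned))
```

Prepositions such as "by" carry the ADP tag, so they pass through this filter, which is in line with the "keep semantic prepositions" step in the architecture diagram.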

compressor.py

  • compress(): Main pipeline with context memory
  • Orchestrates all components
  • Implements temporal propagation

🚀 Quick Start

Installation

cd vagacore
python -m venv venv
.\venv\Scripts\activate  # Windows
source venv/bin/activate # Linux/Mac

pip install spacy
python -m spacy download en_core_web_sm

Basic Usage

import json

from compressor import compress

text = "Apple reported $500 million in revenue during Q3 2024."
result = compress(text)

print(json.dumps(result, indent=2))

Run Demos

# Simple demo
python examples/demo.py

# Advanced demonstrations
python examples/advanced_demo.py

๐Ÿ” Use Cases

1. Retrieval-Augmented Generation (RAG)

Extract structured facts for LLM context:

facts = compress(document_text)
# Feed to LLM for better grounding
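
One possible way to hand the facts to an LLM is to flatten each one into a short line of prompt context. The helper name facts_to_context and the line format are assumptions for illustration only; the input dictionaries follow the shape shown in the Example section.

```python
# Hypothetical helper: turns fact dicts (shaped like the Output example)
# into short lines that can be pasted into an LLM prompt.
def facts_to_context(facts):
    lines = []
    for fact in facts:
        # Join the subject/action/object fields that are actually present.
        parts = [fact.get("subject"), fact.get("action"), fact.get("object")]
        sentence = " ".join(p for p in parts if p)
        if fact.get("value"):
            sentence += f" ({fact['value']})"
        if fact.get("time"):
            sentence += f" [{fact['time']}]"
        lines.append(f"- {sentence}")
    return "\n".join(lines)


facts = [
    {"subject": "Apple", "action": "report", "object": "revenue",
     "entity": "revenue", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "object": None,
     "entity": None, "value": "15%", "time": "Q3 2024"},
]
print(facts_to_context(facts))
```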

2. Financial Data Extraction

Parse earnings reports and investor documents:

earnings_report = """
Q3 2024 Revenue: $50 million
Operating margin improved by 5%
"""
facts = compress(earnings_report)

3. Knowledge Base Indexing

Create temporally-aware fact databases:

for document in documents:
    facts = compress(document)
    # Index with time-based grouping
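
The time-based grouping mentioned in the comment could look roughly like the sketch below. The function group_by_time and the "unknown" bucket for facts without a time are assumptions, not part of the package.

```python
from collections import defaultdict


# Hypothetical sketch: bucket fact dicts by their "time" field so that
# facts from the same period can be indexed or retrieved together.
def group_by_time(facts):
    index = defaultdict(list)
    for fact in facts:
        # Facts with no resolved time go into a shared "unknown" bucket.
        index[fact.get("time") or "unknown"].append(fact)
    return dict(index)


facts = [
    {"subject": "Apple", "action": "report", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "value": "15%", "time": "Q3 2024"},
    {"subject": "margin", "action": "improve", "value": "5%", "time": None},
]
for period, items in group_by_time(facts).items():
    print(period, [f["subject"] for f in items])
```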

4. News Analysis

Extract named entities and facts from articles:

article = get_news_article()  # placeholder for your own article loader
facts = compress(article)

📈 Performance

What It Handles Well ✅

  • Multi-sentence documents
  • Temporal references and quarters
  • Financial terminology
  • Organization names and locations
  • Percentage and monetary values
  • Contextual pronouns via memory

Current Limitations ⚠️

  • Single main action per sentence
  • Simple clause structures work best
  • Passive voice sometimes reduces accuracy
  • Requires English text

🔬 Technical Details

Extraction Methods

NER (Named Entity Recognition)

  • Uses spaCy's trained model
  • Entity types: PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
  • Confidence-based extraction

Dependency Parsing

  • Identifies grammatical relationships
  • Key patterns:
    • nsubj: Nominal subject
    • ROOT: Root verb
    • dobj: Direct object
    • pobj: Object of preposition
    • attr: Predicate attribute
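
To show how these labels combine into a Subject-Verb-Object triple, here is a minimal sketch over a hand-written parse. The Token dataclass and the simple_svo function are assumptions for illustration; in the real pipeline the same information would come from spaCy tokens (token.dep_, token.lemma_, token.head), and extract_svo() applies further object selection that is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    lemma: str
    dep: str   # dependency label, e.g. "nsubj", "ROOT", "dobj"
    head: int  # index of the governing token within the sentence


# Hypothetical sketch, not VagaCore's extract_svo(): pick the ROOT verb,
# then the first nsubj and the first dobj/attr attached directly to it.
def simple_svo(tokens):
    root_idx = next((i for i, t in enumerate(tokens) if t.dep == "ROOT"), None)
    if root_idx is None:
        return None
    subject = obj = None
    for t in tokens:
        if t.head != root_idx:
            continue
        if t.dep == "nsubj" and subject is None:
            subject = t.text
        elif t.dep in ("dobj", "attr") and obj is None:
            obj = t.text
    return {"subject": subject, "action": tokens[root_idx].lemma, "object": obj}


# Hand-written parse of "Apple reported revenue" (the ROOT is its own head).
sentence = [
    Token("Apple", "Apple", "nsubj", 1),
    Token("reported", "report", "ROOT", 1),
    Token("revenue", "revenue", "dobj", 1),
]
print(simple_svo(sentence))
```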

Context Memory Algorithm

for each sentence:
    extract time from sentence
    if time is None or vague:
        use previous_time
    else:
        update previous_time

🎓 Learning Resources

🤝 Contributing

This is a demonstration project. Potential improvements include:

  • Multi-action sentence support
  • Improved passive voice handling
  • Custom entity type definitions
  • Confidence scoring for facts
  • Multi-language support

📄 License

Open source - use freely for learning and development

🙌 Acknowledgments

Built with:

  • spaCy: Industrial-strength NLP
  • Python: Core language
  • NER Technology: Modern entity recognition

VagaCore v0.5 | Hybrid NER + Rule-Based Extraction | Context-Aware Processing
