Intelligent Text Compression & Fact Extraction Engine using NLP

🚀 VagaCore - Intelligent Text Compression & Fact Extraction Engine

A production-ready NLP system combining Named Entity Recognition (NER), dependency parsing, and context-aware processing for intelligent fact extraction from unstructured text.

🎯 Features

Core Capabilities

  • 🧠 Hybrid Extraction: Combines ML-based NER with rule-based syntax parsing
  • 📝 Multi-Sentence Processing: Handles complex documents with multiple facts
  • 🔄 Context Memory: Maintains temporal awareness across sentences
  • 🏢 Named Entity Recognition: Identifies PERCENT, MONEY, DATE, ORG, PERSON, LOC
  • 🎯 Semantic Understanding: Extracts Subject-Verb-Object patterns with noise removal
  • 📊 Structured Output: Returns clean JSON facts

Key Innovations

✅ Context-Aware Extraction

  • Sentences without explicit dates inherit from previous context
  • Prevents temporal information loss in multi-sentence documents
  • Critical for RAG and knowledge base indexing

✅ Noise Resistance

  • Removes adjectives and adverbs before processing
  • Preserves semantic relationships
  • Filters subjective language
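
The idea can be illustrated with a minimal sketch. It assumes the tokens have already been POS-tagged (for example by spaCy, whose coarse tags include ADJ and ADV). The name `strip_modifiers` and the hand-written tags are illustrative only; this is not VagaCore's actual `remove_noise()`.

```python
# Hypothetical sketch: drop adjectives and adverbs from pre-tagged tokens.
NOISE_POS = {"ADJ", "ADV"}  # assumption: coarse POS tags treated as noise


def strip_modifiers(tagged_tokens):
    """Return the text of every token whose POS tag is not in NOISE_POS."""
    return [text for text, pos in tagged_tokens if pos not in NOISE_POS]


# Hand-tagged example sentence; a real pipeline would get tags from the parser.
tagged = [
    ("The", "DET"), ("impressive", "ADJ"), ("profit", "NOUN"),
    ("increased", "VERB"), ("sharply", "ADV"), ("by", "ADP"),
    ("15", "NUM"), ("%", "NOUN"),
]
print(" ".join(strip_modifiers(tagged)))
```

The subject, verb, and value survive, while "impressive" and "sharply" are dropped.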

✅ Domain Intelligence

  • Recognizes financial keywords (revenue, profit, earnings, sales)
  • Prioritizes domain entities over generic organizations
  • Smart quantity filtering (million → context)
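
One way the keyword prioritization could work is sketched below. The keyword set comes from the list above; the function `pick_entity` and its fallback order are assumptions made for illustration, not the library's actual logic.

```python
# Hypothetical sketch: prefer a financial keyword over a generic ORG entity.
FINANCIAL_KEYWORDS = {"revenue", "profit", "earnings", "sales"}


def pick_entity(tokens, org_entities):
    """Return the first financial keyword found, else the first ORG, else None."""
    for token in tokens:
        if token.lower() in FINANCIAL_KEYWORDS:
            return token.lower()
    return org_entities[0] if org_entities else None


tokens = ["Apple", "reported", "$500", "million", "in", "revenue"]
print(pick_entity(tokens, org_entities=["Apple"]))                 # revenue
print(pick_entity(["Apple", "expanded"], org_entities=["Apple"]))  # Apple
```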

📊 Example

Input

Apple reported $500 million in revenue during Q3 2024 in the Asia-Pacific region.
The profit increased by 15% in the same period.

Output

[
  {
    "subject": "Apple",
    "action": "report",
    "object": "revenue",
    "entity": "revenue",
    "value": "$500 million",
    "time": "Q3 2024"
  },
  {
    "subject": "profit",
    "action": "increase",
    "object": null,
    "entity": null,
    "value": "15%",
    "time": "Q3 2024"
  }
]
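
Downstream code may want to check the shape of this output before using it. The sketch below assumes every fact carries the six keys shown in the example; `FACT_KEYS` and `load_facts` are illustrative names, not part of VagaCore.

```python
import json

# Assumption: each fact has exactly the keys seen in the example output.
FACT_KEYS = {"subject", "action", "object", "entity", "value", "time"}


def load_facts(payload):
    """Parse a JSON array of facts and fail early if a key is missing."""
    facts = json.loads(payload)
    for fact in facts:
        missing = FACT_KEYS - fact.keys()
        if missing:
            raise ValueError(f"fact is missing keys: {sorted(missing)}")
    return facts


payload = """[{"subject": "profit", "action": "increase", "object": null,
               "entity": null, "value": "15%", "time": "Q3 2024"}]"""
facts = load_facts(payload)
print(facts[0]["value"], facts[0]["time"])
```

JSON `null` values arrive as Python `None`, so consumers should treat `object` and `entity` as optional.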

🏗️ Architecture

Input Text
    ↓
┌─────────────────────────────────────┐
│   Parser (spaCy)                    │
│   - Tokenization                    │
│   - POS Tagging                     │
│   - Named Entity Recognition        │
│   - Dependency Parsing              │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Noise Removal (Utils)             │
│   - Remove adjectives/adverbs       │
│   - Keep semantic prepositions      │
│   - Filter stop words               │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Hybrid Extraction (Extractor)     │
│                                     │
│   ┌─ ML Path (NER)                  │
│   │  - PERCENT, MONEY               │
│   │  - DATE, TIME                   │
│   │  - ORG, PERSON, LOC             │
│   │                                 │
│   └─ Rule Path (Syntax)             │
│      - Domain keywords              │
│      - Prepositional patterns       │
│      - Subject-Verb-Object          │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│   Context Memory (Compressor)       │
│   - Propagate temporal context      │
│   - Maintain state across sentences │
│   - Prevent information loss        │
└─────────────────────────────────────┘
    ↓
Output: Structured JSON Facts

📋 Module Overview

parser.py

  • Loads spaCy NLP model
  • Handles text tokenization and parsing

extractor.py

  • extract_svo(): Subject-Verb-Object extraction with intelligent object selection
  • extract_entities(): Named Entity Recognition with type labels
  • extract_entities_by_type(): Organized entity access by category
  • extract_details(): Hybrid NER + rule-based value/time/entity extraction

utils.py

  • remove_noise(): Removes adjectives/adverbs while preserving semantic relationships

compressor.py

  • compress(): Main pipeline with context memory
  • Orchestrates all components
  • Implements temporal propagation

🚀 Quick Start

Installation

cd vagacore
python -m venv venv

# Activate the virtual environment (run only the line for your platform)
.\venv\Scripts\activate    # Windows
source venv/bin/activate   # Linux/macOS

pip install spacy
python -m spacy download en_core_web_sm

Basic Usage

import json

from compressor import compress

text = "Apple reported $500 million in revenue during Q3 2024."
result = compress(text)

print(json.dumps(result, indent=2))

Run Demos

# Simple demo
python examples/demo.py

# Advanced demonstrations
python examples/advanced_demo.py

🔍 Use Cases

1. Retrieval-Augmented Generation (RAG)

Extract structured facts for LLM context:

facts = compress(document_text)
# Feed to LLM for better grounding
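
As a rough illustration of the "feed to LLM" step, the facts could be rendered as short bullet lines for a prompt. This sketch assumes facts shaped like the example output above; `facts_to_context` is a hypothetical helper, not a VagaCore function.

```python
def facts_to_context(facts):
    """Render each fact as one bullet line, skipping fields that are None."""
    lines = []
    for fact in facts:
        parts = [fact.get("subject"), fact.get("action"),
                 fact.get("object"), fact.get("value")]
        sentence = " ".join(part for part in parts if part)
        if fact.get("time"):
            sentence += f" ({fact['time']})"
        lines.append(f"- {sentence}")
    return "\n".join(lines)


# Facts copied from the example output earlier in this README.
facts = [
    {"subject": "Apple", "action": "report", "object": "revenue",
     "entity": "revenue", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "action": "increase", "object": None,
     "entity": None, "value": "15%", "time": "Q3 2024"},
]
print(facts_to_context(facts))
```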

2. Financial Data Extraction

Parse earnings reports and investor documents:

earnings_report = """
Q3 2024 Revenue: $50 million
Operating margin improved by 5%
"""
facts = compress(earnings_report)

3. Knowledge Base Indexing

Create temporally-aware fact databases:

for document in documents:
    facts = compress(document)
    # Index with time-based grouping
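
A minimal sketch of time-based grouping, assuming each fact is a dict with a `time` key as in the example output. The helper `index_by_time` and the `"undated"` bucket are illustrative choices, not part of the library.

```python
from collections import defaultdict


def index_by_time(facts):
    """Group facts by their 'time' value; facts without one go under 'undated'."""
    index = defaultdict(list)
    for fact in facts:
        index[fact.get("time") or "undated"].append(fact)
    return dict(index)


facts = [
    {"subject": "Apple", "value": "$500 million", "time": "Q3 2024"},
    {"subject": "profit", "value": "15%", "time": "Q3 2024"},
    {"subject": "margin", "value": "5%", "time": None},
]
index = index_by_time(facts)
print({period: len(items) for period, items in index.items()})
```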

4. News Analysis

Extract named entities and facts from articles:

article = get_news_article()  # placeholder for your own article loader
facts = compress(article)

📈 Performance

What It Handles Well ✅

  • Multi-sentence documents
  • Temporal references and quarters
  • Financial terminology
  • Organization names and locations
  • Percentage and monetary values
  • Contextual pronouns via memory

Current Limitations ⚠️

  • Single main action per sentence
  • Simple clause structures work best
  • Passive voice sometimes reduces accuracy
  • Requires English text

🔬 Technical Details

Extraction Methods

NER (Named Entity Recognition)

  • Uses spaCy's trained model
  • Entity types: PERCENT, MONEY, DATE, ORG, PERSON, GPE, LOC
  • Confidence-based extraction
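
Organizing entities by type can be sketched as follows. The input is a list of `(text, label)` pairs; with spaCy these would typically be built from `doc.ents` using `ent.text` and `ent.label_`. The pairs below are hand-written, and `group_entities` is an illustrative name rather than the actual `extract_entities_by_type()`.

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into a dict of label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Hand-written pairs standing in for real NER output.
pairs = [
    ("Apple", "ORG"),
    ("$500 million", "MONEY"),
    ("Q3 2024", "DATE"),
    ("15%", "PERCENT"),
]
print(group_entities(pairs))
```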

Dependency Parsing

  • Identifies grammatical relationships
  • Key patterns:
    • nsubj: Nominal subject
    • ROOT: Root verb
    • dobj: Direct object
    • pobj: Object of preposition
    • attr: Predicate attribute
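
The sketch below shows how these labels could map to a Subject-Verb-Object triple. It works on hand-written `(lemma, dep)` pairs so it runs without a parser; in practice the pairs would come from the dependency parse. The function `simple_svo` and its "first match wins" rule are assumptions, and the real `extract_svo()` may select objects differently.

```python
def simple_svo(tokens):
    """Build a subject/action/object dict from (lemma, dep) pairs of one sentence."""
    svo = {"subject": None, "action": None, "object": None}
    for lemma, dep in tokens:
        if dep == "nsubj" and svo["subject"] is None:
            svo["subject"] = lemma
        elif dep == "ROOT":
            svo["action"] = lemma
        elif dep in ("dobj", "pobj", "attr") and svo["object"] is None:
            svo["object"] = lemma
    return svo


# Hand-labelled tokens for "Apple reported revenue".
tokens = [("Apple", "nsubj"), ("report", "ROOT"), ("revenue", "dobj")]
print(simple_svo(tokens))
```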

Context Memory Algorithm

for each sentence:
    extract time from sentence
    if time is None or vague:
        use previous_time
    else:
        update previous_time
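
A runnable sketch of the same loop is given here, operating on already-extracted facts. The set `VAGUE_TIMES` and the function `propagate_time` are assumptions made for illustration; how VagaCore decides that a time expression is vague is not documented here.

```python
# Assumption: phrases treated as "vague" time references.
VAGUE_TIMES = {"the same period", "same period"}


def propagate_time(facts):
    """Fill missing or vague 'time' values with the last explicit time seen."""
    previous_time = None
    result = []
    for fact in facts:
        fact = dict(fact)  # copy so the caller's data is left untouched
        time = fact.get("time")
        if time is None or time.lower() in VAGUE_TIMES:
            fact["time"] = previous_time
        else:
            previous_time = time
        result.append(fact)
    return result


facts = [
    {"subject": "Apple", "time": "Q3 2024"},
    {"subject": "profit", "time": "the same period"},
    {"subject": "margin", "time": None},
    {"subject": "sales", "time": "Q4 2024"},
]
print([fact["time"] for fact in propagate_time(facts)])
```

A fact that appears before any explicit time keeps `None`, since there is nothing to inherit yet.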

🎓 Learning Resources

🤝 Contributing

This is a demonstration project. Potential improvements include:

  • Multi-action sentence support
  • Improved passive voice handling
  • Custom entity type definitions
  • Confidence scoring for facts
  • Multi-language support

📄 License

Open source - use freely for learning and development

🙌 Acknowledgments

Built with:

  • spaCy: Industrial-strength NLP
  • Python: Core language
  • NER Technology: Modern entity recognition

VagaCore v0.5 | Hybrid NER + Rule-Based Extraction | Context-Aware Processing
