Skip to main content

A Python package to extract data from unstructured into structured format

Project description

ValidEx

ValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.

GitHub Contributors GitHub Last Commit GitHub Issues GitHub Pull Requests Github License

🏷 Features

  • Structured Data Extraction: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.
  • Heuristic data cleaning text normalization (case, whitespace, special characters), deduplication
  • Concurrency Support: Efficiently process multiple data sources simultaneously.
  • Retry Mechanism: Implement automatic retries for failed extraction attempts.
  • Hallucination check: Implement strategies to detect and reduce LLM hallucinations in extracted data.
  • Fine-tuning Dataset Export: Generate datasets in JSONL format for OpenAI chat fine-tuning.
  • Local Model Creation: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.

📦 Installation

To get started with ValidEx, simply install the package using pip:

pip install validex

⛓️ Quick Start

import validex
from pydantic import BaseModel


class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]


def main():
    app = validex.App()

    app.add("https://www.britannica.com/topic/list-of-superheroes-2024795")
    app.add("*.txt")
    app.add("*.pdf")
    app.add("*.md")

    superheroes = app.extract(Superhero)
    print(f"Extracted superheroes: {list(superheroes)}")

    first_hero = app.extract_first(Superhero)
    print(f"First extracted hero: {first_hero}")

    print(f"Total cost: ${app.cost()}")
    print(f"Total usage: {app.usage}")


if __name__ == "__main__":
    main()
[
    (
        Superhero(
            name="Batman",
            age=81,
            power="Brilliant detective skills, martial arts",
            enemies=["Joker", "Penguin"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Wonder Woman",
            age=80,
            power="Superhuman strength, speed, agility",
            enemies=["Ares", "Cheetah"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Spider-Man",
            age=59,
            power="Wall-crawling, spider sense",
            enemies=["Green Goblin", "Venom"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Captain America",
            age=101,
            power="Super soldier serum, shield",
            enemies=["Red Skull", "Hydra"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Superman", age=35, power="Flight", enemies=["Lex Luthor", "Doomsday"]
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
    (
        Superhero(
            name="Wonder Woman",
            age=30,
            power="Super Strength",
            enemies=["Ares", "Cheetah"],
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
    (
        Superhero(
            name="Spider-Man",
            age=25,
            power="Wall-crawling",
            enemies=["Green Goblin", "Venom"],
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
]

Hallucinations and autofix

class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]

    def fix(self):
        # Logic to auto fix and normalize the generated data
        if self.age < 0:
            self.age = 0

    def check_hallucinations(self):
        # Check name
        if not re.match(r"^[A-Za-z\s-]+$", self.name):
            raise ValueError(f"Name '{self.name}' contains unusual characters")

        # Check age
        if self.age < 0 or self.age > 1000:
            raise ValueError(f"Age {self.age} seems unrealistic")

        # Check power
        if len(self.power) > 50:
            raise ValueError("Power description is unusually long")

        # Check enemies
        if len(self.enemies) > 10:
            raise ValueError("Unusually high number of enemies")

        for enemy in self.enemies:
            if not re.match(r"^[A-Za-z\s-]+$", enemy):
                raise ValueError(f"Enemy name '{enemy}' contains unusual characters")

Experimental: Export and fine tunning

# Use the OpenAI chat fine-tuning format to save data
app.export_jsonl("fine_tune.jsonl")

# Local model training
app.fit()
app.save("state.validex")


app.infer_extract("booob")

Multi-model Extraction

ValidEx supports extracting multiple models at once

class Superhero2(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]


multi_results = app.multi_extract(Superhero, Superhero2)
print(f"Multi-extraction results: {multi_results}")

Limitations

TBD

🛠️ Roadmap

👋 Contributing

Contributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:

  • Fork the repository on GitHub
  • Create a new branch for your changes
  • Commit your changes to the new branch
  • Push your changes to the forked repository
  • Open a pull request to the main ValidEx repository

Before contributing, please read the contributing guidelines.

License

ValidEx is released under the MIT License.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

validex-0.0.2.tar.gz (13.4 kB view details)

Uploaded Source

File details

Details for the file validex-0.0.2.tar.gz.

File metadata

  • Download URL: validex-0.0.2.tar.gz
  • Upload date:
  • Size: 13.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: poetry/1.4.0 CPython/3.10.12 Linux/6.5.0-1025-azure

File hashes

Hashes for validex-0.0.2.tar.gz
Algorithm Hash digest
SHA256 ed060b154db126b575a971e7256ebd77860a4add2b9cfa9eefe91fc151804200
MD5 d148efeae0a0e1f90f3ff8eabea491bf
BLAKE2b-256 ea7529264fc0aef59ff45d2ae9ddfde44d39e7a41fd2de1e9c579010f864e791

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page