Low-Cost Cross-Domain Web Structured Information Extraction using specialized LoRA adapters.

Maintainers

abdo-Mansour

AXEtract

Low-Cost Cross-Domain Web Structured Information Extraction

![Documentation](https://img.shields.io/badge/docs-latest-teal) ![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg) ![GitHub](https://img.shields.io/github/stars/abdo-Mansour/axetract?style=social)

AXEtract is a high-performance, low-cost framework for extracting structured data from web pages. Based on the paper "AXE: Low-Cost Cross-Domain Web Structured Information Extraction", it optimizes the extraction pipeline by using specialized LoRA adapters for pruning and query-specific extraction, enabling state-of-the-art results with small models (e.g., Qwen3-0.6B).

🚀 Key Features

🎯 Specialized LoRA Adapters: Uses task-specific adapters for DOM pruning and structured extraction, achieving high accuracy with minimal token overhead.
✂️ Smart DOM Pruning: Classifies and prunes irrelevant HTML nodes before passing them to the extractor, significantly reducing context window usage and costs.
📍 Grounded XPath Resolution (GXR): Automatically maps extracted JSON fields back to their original source XPaths in the DOM for verification and grounding.
⚡ High-Throughput Pipeline: Built-in support for multiple LLM engines, including vLLM for production-grade serving and HuggingFace for local research.
🌐 Cross-Domain Versatility: Designed to generalize across various web domains (e-commerce, real estate, listings) without needing domain-specific rules.
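To make the Grounded XPath Resolution idea concrete, here is a minimal sketch. It is not AXEtract's implementation: it assumes the markup is well-formed enough for Python's `xml.etree.ElementTree`, and it grounds a value only when it exactly equals an element's text. The helper names `build_xpath_index` and `resolve_xpaths` are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_xpath_index(root):
    """Map each element's stripped text to the positional XPaths where it occurs."""
    index = {}

    def walk(node, path):
        text = (node.text or "").strip()
        if text:
            index.setdefault(text, []).append(path)
        # Count siblings per tag so every child gets a 1-based positional step.
        seen = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}")
    return index


def resolve_xpaths(markup, prediction):
    """For every extracted field, return the XPaths whose text equals the value."""
    index = build_xpath_index(ET.fromstring(markup))
    return {
        field: index.get(str(value).strip(), [])
        for field, value in prediction.items()
    }


# Hypothetical page and extraction result, for illustration only.
page = "<html><body><div><h1>Desk Lamp</h1><span>19.99</span></div></body></html>"
xpaths = resolve_xpaths(page, {"name": "Desk Lamp", "price": "19.99"})
print(xpaths)
```

An empty list for a field signals that the extracted value could not be found verbatim in the page, which is the kind of check grounding is meant to enable. Real-world HTML is rarely valid XML, so a tolerant HTML parser would be needed in practice.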

🛠️ Architecture

AXEtract follows a three-part decoupled pipeline for maximum efficiency:

1. Preprocessor: Fetches raw HTML and chunks it into manageable, token-aware fragments.
2. AI Extractor: Divided into two stages:
   - Pruner: A lightweight LLM (LoRA-powered) filters out noise and selects only relevant HTML chunks.
   - Extractor: A task-specific LLM maps the pruned HTML content directly to a structured JSON schema or natural language answer.
3. Postprocessor: Validates the output and resolves source XPaths via Grounded XPath Resolution (GXR).
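The chunk-then-prune flow described above can be sketched in plain Python. This is an illustrative toy rather than the library's code: `chunk_html`, `prune_chunks`, and `run_pipeline` are hypothetical names, whitespace-separated words stand in for model tokens, and a keyword match stands in for the LoRA-based pruner. The extractor and postprocessor stages are left out.

```python
def chunk_html(html, max_tokens):
    """Split text into chunks of at most max_tokens whitespace-separated tokens."""
    tokens = html.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def prune_chunks(chunks, keywords):
    """Keep only chunks that mention a keyword (stand-in for the pruner model)."""
    lowered = [k.lower() for k in keywords]
    return [c for c in chunks if any(k in c.lower() for k in lowered)]


def run_pipeline(html, keywords, max_tokens=8):
    chunks = chunk_html(html, max_tokens)   # preprocessor stage
    kept = prune_chunks(chunks, keywords)   # pruner stage
    return {"total_chunks": len(chunks), "kept_chunks": kept}


# Hypothetical input: only the middle fragment is relevant to a price query.
html = (
    "<nav> Home About Contact </nav> "
    "<div> Price: 19.99 USD </div> "
    "<footer> Copyright 2026 </footer>"
)
summary = run_pipeline(html, ["price"], max_tokens=5)
print(summary)
```

Here only one of three chunks reaches the extractor, which is the source of the context savings the pruning step aims for. Splitting on whitespace would break tags that carry attributes, so a real preprocessor would chunk along DOM boundaries with a proper tokenizer.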

📦 Installation

# Install from PyPI
uv pip install axetract

# Or install from source
git clone https://github.com/abdo-Mansour/axetract.git
cd axetract
uv sync

🚥 Quick Start

from pydantic import BaseModel
from axetract.pipeline import AXEPipeline

# 1. Initialize the pipeline with default LoRA adapters
# (Automatically downloads adapters from HuggingFace Hub)
pipeline = AXEPipeline.from_config(use_vllm=False)

# 2. Define your desired extraction schema
class Product(BaseModel):
    name: str
    price: str
    rating: float

# 3. Extract from a URL or raw HTML
url = "https://example.com/item/12345"
result = pipeline.extract(url, schema=Product)

# 4. Access your structured data
print(f"Status: {result.status}")
print(f"Prediction: {result.prediction}")
print(f"Source XPaths: {result.xpaths}")

🌐 API Server

AXEtract includes a built-in FastAPI server for high-throughput serving. After installing the package, start it with the installed CLI entry point:

axe-server

Or via python -m for development installs:

python -m axetract.server

Configuration is done via environment variables:

Variable	Default	Description
`AXE_USE_VLLM`	`false`	Set to `true` to use vLLM backend
`AXE_PORT`	`8000`	Port to listen on
`AXE_HOST`	`0.0.0.0`	Host to bind to
`AXE_LOG_FILE`	(stderr)	Optional path to a log file
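As a rough illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. It reflects an assumption about usage, not the server's actual configuration code: `ServerSettings` and `load_settings` are hypothetical names, and treating only the string `true` (case-insensitive) as enabling vLLM is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    use_vllm: bool
    port: int
    host: str
    log_file: Optional[str]  # None means log to stderr


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented AXE_* variables, falling back to the documented defaults."""
    return ServerSettings(
        use_vllm=env.get("AXE_USE_VLLM", "false").strip().lower() == "true",
        port=int(env.get("AXE_PORT", "8000")),
        host=env.get("AXE_HOST", "0.0.0.0"),
        log_file=env.get("AXE_LOG_FILE") or None,
    )


# Passing an explicit mapping keeps the example independent of the real environment.
defaults = load_settings({})
custom = load_settings({"AXE_USE_VLLM": "true", "AXE_PORT": "9000"})
print(defaults)
print(custom)
```

Accepting the environment as a parameter makes the parsing easy to test without mutating `os.environ`.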

See axe_server/client_example.py for examples of interacting with the API via requests.

📝 Citation

If you use AXEtract in your research, please cite our paper:

@misc{mansour2026axe,
      title={AXE: Low-Cost Cross-Domain Web Structured Information Extraction}, 
      author={Abdelrahman Mansour and Khaled W. Alshaer and Moataz Elsaban},
      year={2026},
      eprint={2602.01838},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.01838}, 
}

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

Release history

- 0.1.3 (Mar 30, 2026)
- 0.1.1 (Mar 29, 2026)
- 0.1.0 (Mar 29, 2026, this version)

Download files

Source Distribution

axetract-0.1.0.tar.gz (40.4 kB)

Uploaded Mar 29, 2026 Source

Built Distribution

axetract-0.1.0-py3-none-any.whl (52.8 kB)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file axetract-0.1.0.tar.gz.

File metadata

Download URL: axetract-0.1.0.tar.gz
Upload date: Mar 29, 2026
Size: 40.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for axetract-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`856ecfe8328ed26bde963b8012c71d6a4f372cb4ee4a7787195ee9f1cf3542c1`
MD5	`78ad690f2186fbedda810757170ce251`
BLAKE2b-256	`d4b987549307828653c1a82625762710baf23f46314d76ed584f3b3893368387`
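A downloaded archive can be checked against the SHA256 digest listed above with Python's standard `hashlib`. The sketch below is a generic illustration; the helper names `sha256_of` and `matches_expected` are hypothetical, and the local file path is up to the reader.

```python
import hashlib

# SHA256 digest listed above for axetract-0.1.0.tar.gz.
EXPECTED_SHA256 = "856ecfe8328ed26bde963b8012c71d6a4f372cb4ee4a7787195ee9f1cf3542c1"


def sha256_of(path, chunk_size=65536):
    """Stream the file in blocks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

After downloading the source distribution, `matches_expected("axetract-0.1.0.tar.gz")` should return `True` if the file is intact.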

Provenance

The following attestation bundles were made for axetract-0.1.0.tar.gz:

Publisher: publish.yml on abdo-Mansour/axetract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: axetract-0.1.0.tar.gz
- Subject digest: 856ecfe8328ed26bde963b8012c71d6a4f372cb4ee4a7787195ee9f1cf3542c1
- Sigstore transparency entry: 1191919546
- Sigstore integration time: Mar 29, 2026
Source repository:
- Permalink: abdo-Mansour/axetract@3709092f7034d3b2a69d4f49836091f29cba1d68
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abdo-Mansour
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3709092f7034d3b2a69d4f49836091f29cba1d68
- Trigger Event: release

File details

Details for the file axetract-0.1.0-py3-none-any.whl.

File metadata

Download URL: axetract-0.1.0-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 52.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for axetract-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`08be06245dc1a904a2ec84e8db53809dfda563fca6b014153159c8c54e7d9a61`
MD5	`d9c8707a615723b232dee9c0b5efe487`
BLAKE2b-256	`ef34bf5ace7d98a64fd443a8f60db5f50ab2cb7fb8f0afcd9ba11182c8cebbb6`

Provenance

The following attestation bundles were made for axetract-0.1.0-py3-none-any.whl:

Publisher: publish.yml on abdo-Mansour/axetract

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: axetract-0.1.0-py3-none-any.whl
- Subject digest: 08be06245dc1a904a2ec84e8db53809dfda563fca6b014153159c8c54e7d9a61
- Sigstore transparency entry: 1191919552
- Sigstore integration time: Mar 29, 2026
Source repository:
- Permalink: abdo-Mansour/axetract@3709092f7034d3b2a69d4f49836091f29cba1d68
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/abdo-Mansour
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@3709092f7034d3b2a69d4f49836091f29cba1d68
- Trigger Event: release
