Low-Cost Cross-Domain Web Structured Information Extraction using specialized LoRA adapters.
AXEtract is a high-performance, low-cost framework for extracting structured data from web pages. Based on the paper "AXE: Low-Cost Cross-Domain Web Structured Information Extraction", it optimizes the extraction pipeline by using specialized LoRA adapters for pruning and query-specific extraction, enabling state-of-the-art results with small models (e.g., Qwen3-0.6B).
🚀 Key Features
- 🎯 Specialized LoRA Adapters: Uses task-specific adapters for DOM pruning and structured extraction, achieving high accuracy with minimal token overhead.
- ✂️ Smart DOM Pruning: Classifies and prunes irrelevant HTML nodes before passing them to the extractor, significantly reducing context window usage and costs.
- 📍 Grounded XPath Resolution (GXR): Automatically maps extracted JSON fields back to their original source XPaths in the DOM for verification and grounding.
- ⚡ High-Throughput Pipeline: Built-in support for multiple LLM engines, including vLLM for production-grade serving and HuggingFace for local research.
- 🌐 Cross-Domain Versatility: Designed to generalize across various web domains (e-commerce, real estate, listings) without needing domain-specific rules.
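To make the idea behind Grounded XPath Resolution more concrete, here is a minimal sketch using only the Python standard library. It is not AXEtract's implementation: the function name `resolve_xpaths` and its exact-text matching rule are assumptions chosen for illustration, and because it relies on `xml.etree.ElementTree` it only handles well-formed markup.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def resolve_xpaths(markup: str, fields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each extracted field to the XPath of the first element whose
    stripped text equals the field value (hypothetical matching rule)."""
    root = ET.fromstring(markup.strip())
    found: Dict[str, Optional[str]] = {name: None for name in fields}

    def walk(node: ET.Element, path: str) -> None:
        text = (node.text or "").strip()
        for name, value in fields.items():
            if found[name] is None and text == value:
                found[name] = path
        # Count same-tag siblings so every child gets a positional index.
        seen: Dict[str, int] = {}
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}")
    return found


page = """
<html><body>
  <div><h1>Acme Kettle</h1></div>
  <div><span>4.5</span><span>$29.99</span></div>
</body></html>
"""
prediction = {"name": "Acme Kettle", "price": "$29.99"}
xpaths = resolve_xpaths(page, prediction)
print(xpaths)
```

A field that cannot be located keeps the value `None`, which gives a simple signal that the extracted value is not grounded in the page.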
🛠️ Architecture
AXEtract follows a three-part decoupled pipeline for maximum efficiency:
- Preprocessor: Fetches raw HTML and chunks it into manageable, token-aware fragments.
- AI Extractor: Divided into two stages:
  - Pruner: A lightweight LLM (LoRA-powered) filters out noise and selects only relevant HTML chunks.
  - Extractor: A task-specific LLM maps the pruned HTML content directly to a structured JSON schema or natural language answer.
- Postprocessor: Validates the output and resolves source XPaths via Grounded XPath Resolution (GXR).
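The flow of data through the stages listed above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the library's code: `chunk_html`, `run_pipeline`, the whitespace-based token count, and the keyword and regex stand-ins for the two LLM stages are all hypothetical.

```python
import re
from typing import Callable, Dict, List


def chunk_html(html: str, max_tokens: int = 8) -> List[str]:
    """Preprocessor stand-in: split at tag boundaries, then pack pieces into
    chunks of at most `max_tokens` whitespace-separated tokens (a single
    oversized piece still becomes its own chunk)."""
    pieces = [p for p in re.split(r"(?=<)", html) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for piece in pieces:
        n = len(piece.split())
        if current and used + n > max_tokens:
            chunks.append("".join(current))
            current, used = [], 0
        current.append(piece)
        used += n
    if current:
        chunks.append("".join(current))
    return chunks


def run_pipeline(
    html: str,
    is_relevant: Callable[[str], bool],
    extract: Callable[[str], Dict[str, str]],
    max_tokens: int = 8,
) -> Dict[str, object]:
    chunks = chunk_html(html, max_tokens)         # 1. preprocess
    kept = [c for c in chunks if is_relevant(c)]  # 2a. prune (an LLM in AXEtract)
    prediction = extract("".join(kept))           # 2b. extract (an LLM in AXEtract)
    # 3. A real postprocessor would validate the output and resolve XPaths here.
    return {"chunks": len(chunks), "kept": len(kept), "prediction": prediction}


def toy_extract(text: str) -> Dict[str, str]:
    """Regex stand-in for the extractor model."""
    match = re.search(r"\$\d+(?:\.\d{2})?", text)
    return {"price": match.group(0)} if match else {}


html = (
    "<nav>Home About Contact Careers Blog</nav>"
    '<div class="price">Price: $29.99</div>'
    "<footer>Copyright Example Corp all rights reserved</footer>"
)
result = run_pipeline(html, lambda chunk: "price" in chunk, toy_extract)
print(result)
```

The point of the pruning step is visible even in this toy version: the extractor only sees the chunks that pass the relevance check, so navigation and footer markup never consume its context.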
📦 Installation
```shell
# Install from PyPI
uv pip install axetract

# Or install from source
git clone https://github.com/abdo-Mansour/axetract.git
cd axetract
uv sync
```
🚥 Quick Start
```python
from pydantic import BaseModel
from axetract.pipeline import AXEPipeline

# 1. Initialize the pipeline with default LoRA adapters
# (Automatically downloads adapters from HuggingFace Hub)
pipeline = AXEPipeline.from_config(use_vllm=False)

# 2. Define your desired extraction schema
class Product(BaseModel):
    name: str
    price: str
    rating: float

# 3. Extract from a URL or raw HTML
url = "https://example.com/item/12345"
result = pipeline.extract(url, schema=Product)

# 4. Access your structured data
print(f"Status: {result.status}")
print(f"Prediction: {result.prediction}")
print(f"Source XPaths: {result.xpaths}")
```
🌐 API Server
AXEtract includes a built-in FastAPI server for high-throughput serving. After installing the package, start it with the installed CLI entry point:
```shell
axe-server
```
Or via `python -m` for development installs:
```shell
python -m axetract.server
```
Configuration is done via environment variables:
| Variable | Default | Description |
|---|---|---|
| `AXE_USE_VLLM` | `false` | Set to `true` to use vLLM backend |
| `AXE_PORT` | `8000` | Port to listen on |
| `AXE_HOST` | `0.0.0.0` | Host to bind to |
| `AXE_LOG_FILE` | (stderr) | Optional path to a log file |
See axe_server/client_example.py for examples of interacting with the API via requests.
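As a rough illustration of how the documented environment variables and their defaults fit together, the sketch below reads them into a small settings object. It is an assumption about how such parsing could look, not the server's actual code: `ServerSettings` and `load_settings` are hypothetical names, and the case-insensitive handling of `true` is a guess.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServerSettings:
    use_vllm: bool
    port: int
    host: str
    log_file: Optional[str]  # None means "log to stderr"


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented AXE_* variables, falling back to the documented defaults."""
    return ServerSettings(
        use_vllm=env.get("AXE_USE_VLLM", "false").strip().lower() == "true",
        port=int(env.get("AXE_PORT", "8000")),
        host=env.get("AXE_HOST", "0.0.0.0"),
        log_file=env.get("AXE_LOG_FILE") or None,
    )


# Passing a plain dict instead of os.environ keeps the example reproducible.
settings = load_settings({"AXE_USE_VLLM": "true", "AXE_PORT": "9000"})
print(settings)
```

In a shell, the same variables would be set before launching the server, for example by prefixing the `axe-server` command with `AXE_PORT=9000`.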
📝 Citation
If you use AXEtract in your research, please cite our paper:
```bibtex
@misc{mansour2026axe,
  title={AXE: Low-Cost Cross-Domain Web Structured Information Extraction},
  author={Abdelrahman Mansour and Khaled W. Alshaer and Moataz Elsaban},
  year={2026},
  eprint={2602.01838},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.01838},
}
```
📜 License
This project is licensed under the MIT License - see the LICENSE file for details.