Acon is the intelligence layer for any web scraper. Pair it with Scrapling, Playwright, or httpx to crawl smarter.

Acon — The Intelligent Brain for Any Scraper

Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.


Why Acon?

Most crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. Before you can extract anything useful, you need to understand what you're dealing with.

Acon is a site intelligence engine. It maps the structural "skeleton" of a website automatically — before any data extraction happens — so your scraper always knows where to look.


🏗️ The Core Thesis

Most modern web scrapers suffer from "URL Exhaustion": they can spend upwards of 90% of their bandwidth fetching structurally identical product or blog pages. Acon introduces a Topology Orchestrator that maps, classifies, and samples site structures to find the "Skeleton" of a site before you spend a cent on proxies.
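
One way to picture the "map, classify, sample" idea is to collapse URLs into coarse path templates and fetch only a few pages per template. The sketch below illustrates that idea under simple assumptions; it is not Acon's actual algorithm, and `url_template` and `sample_by_template` are hypothetical helpers that are not part of the package.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_template(url: str) -> str:
    """Collapse a URL path into a coarse template (hypothetical heuristic)."""
    segments = []
    for segment in urlparse(url).path.strip("/").split("/"):
        if not segment:
            continue
        if re.fullmatch(r"\d+", segment):
            # Pure numeric segment, e.g. /p/123
            segments.append("{id}")
        elif re.search(r"[-_]\d+(\.\w+)?$", segment):
            # Slug ending in a number, e.g. a-light-in-the-attic_1000
            segments.append("{slug}")
        else:
            segments.append(segment)
    return "/" + "/".join(segments)


def sample_by_template(urls, per_template=2):
    """Group URLs by template and keep only the first few of each group."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_template(url)].append(url)
    return {template: members[:per_template] for template, members in groups.items()}


if __name__ == "__main__":
    discovered = [
        "https://example.com/p/1",
        "https://example.com/p/2",
        "https://example.com/p/3",
        "https://example.com/about",
    ]
    for template, sample in sample_by_template(discovered).items():
        print(template, sample)
```

With a per-template budget of two, the three `/p/...` URLs collapse into one template and only two of them are fetched, which is the kind of saving the thesis describes.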

💰 Acon vs. Scrapling (The 1:1 Battle)

| Metric | Scrapling Alone (Blind) | Acon + Scrapling (Brain) |
| --- | --- | --- |
| Pages Crawled | 1,000 | 40 |
| Time Taken | 870s (14.5 min) | 111s (1.8 min) |
| Bandwidth Used | 20.72 MB | 1.39 MB |
| Est. Proxy Cost | $1.000 | $0.040 |
| Structural DNA | 4/4 Found | 4/4 Found |

96% fewer pages crawled (25x fewer requests), and structural discovery finished about 7.8x faster (870s vs. 111s) with about 15x less bandwidth. Measured on books.toscrape.com. Run it yourself: python benchmarks/acon_vs_scrapling.py
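
The headline figures can be rechecked directly from the table. The snippet below is a small arithmetic sketch using only the numbers reported above; the dictionary keys and helper names are illustrative.

```python
# Figures copied from the benchmark table above.
blind = {"pages": 1000, "seconds": 870, "megabytes": 20.72, "proxy_usd": 1.000}
guided = {"pages": 40, "seconds": 111, "megabytes": 1.39, "proxy_usd": 0.040}


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved relative to the blind crawl."""
    return round(100 * (before - after) / before, 1)


def ratio(before: float, after: float) -> float:
    """How many times smaller the guided figure is."""
    return round(before / after, 1)


if __name__ == "__main__":
    for metric in blind:
        print(
            f"{metric}: {reduction_pct(blind[metric], guided[metric])}% less, "
            f"{ratio(blind[metric], guided[metric])}x"
        )
```

Pages and estimated proxy cost both come out at a 25x reduction, while wall-clock time improves by roughly 7.8x and bandwidth by roughly 14.9x.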


🚀 Use Cases

Price Monitoring & E-Commerce Intelligence
Acon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.

Content Archival & Research
Feed Acon a publication's root URL. It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.

Site Auditing & SEO Analysis
Get an instant structural report — template count, link depth, topology classification (SPA vs static vs paginated) — in a single run.


⚡ What Makes Acon Different

| Capability | Typical Crawler | Acon |
| --- | --- | --- |
| JS-rendered sites | Manual Playwright setup | Autonomous escalation |
| Site structure | Unknown until scraped | Detected before extraction |
| Large site performance | Degrades at scale | O(log N) priority queue |
| Failed crawls | Lost progress | SQLite resumption (WAL) |
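
To illustrate what an O(log N) priority queue means for a crawl frontier, here is a minimal sketch built on Python's `heapq`, where push and pop each cost O(log N). The `Frontier` class and its scoring convention (lower number is fetched first) are assumptions for illustration, not Acon's internal data structure.

```python
import heapq
import itertools


class Frontier:
    """Binary-heap crawl frontier (hypothetical sketch)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._seen = set()

    def push(self, url: str, priority: int) -> bool:
        """Queue a URL once; lower priority values are popped first."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self) -> str:
        """Return the most urgent URL in O(log N)."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    frontier.push("https://example.com/list?page=2", priority=1)
    frontier.push("https://example.com/p/123", priority=2)
    frontier.push("https://example.com/", priority=0)
    while frontier:
        print(frontier.pop())
```

A topology-aware crawler could lower the priority number for URLs whose template has not been sampled yet, so unseen structures surface before repeats.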

🛠️ Installation

Requirement: Python >= 3.10

pip install acon-intel
# To enable JS-rendering features
playwright install chromium

⚡ Quick Start

import asyncio
import trafilatura
from acon import SiteCrawlOrchestrator, CrawlConfig

async def main():
    # Acon discovers the 'skeleton', Trafilatura extracts the 'flesh'
    config = CrawlConfig(
        max_pages=10,
        # trafilatura.extract returns None when it finds no main content
        post_process=lambda html: trafilatura.extract(html, output_format="markdown")
    )

    brain = SiteCrawlOrchestrator()
    result = await brain.crawl_site("https://example.com", config)

    for page in result["page_summaries"]:
        print(f"URL: {page['url']}")
        content = page["result"] or ""  # guard against a None extraction
        print(f"Content: {content[:200]}...")  # Markdown from Trafilatura

if __name__ == "__main__":
    asyncio.run(main())

📦 The Output Shape

Acon returns a structured SiteCrawlResult containing everything needed for downstream extraction:

{
  "topology": "paginated",
  "pages_crawled": 42,
  "page_summaries": [
    {
      "url": "https://example.com/p/123",
      "page_type": "standard",
      "js_required": false,
      "parent_url": "https://example.com/list"
    }
  ],
  "crawl_meta": {
    "reflection": {
      "intelligence_score": 0.85,
      "advice": "Continue current strategy."
    }
  }
}
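
As a rough sketch of downstream use, the snippet below loads the sample shown above and derives two simple views from it: pages grouped by their parent URL, and pages flagged as needing JS rendering. It relies only on the keys visible in the sample; `group_by_parent` and `needs_browser` are hypothetical helpers, not part of Acon.

```python
import json
from collections import defaultdict

# The sample result from above, embedded so the sketch runs on its own.
RAW = """
{
  "topology": "paginated",
  "pages_crawled": 42,
  "page_summaries": [
    {
      "url": "https://example.com/p/123",
      "page_type": "standard",
      "js_required": false,
      "parent_url": "https://example.com/list"
    }
  ],
  "crawl_meta": {
    "reflection": {
      "intelligence_score": 0.85,
      "advice": "Continue current strategy."
    }
  }
}
"""


def group_by_parent(result: dict) -> dict:
    """Map each parent URL to the child pages discovered from it."""
    children = defaultdict(list)
    for page in result["page_summaries"]:
        children[page.get("parent_url")].append(page["url"])
    return dict(children)


def needs_browser(result: dict) -> list:
    """URLs that should be routed to a JS-capable fetcher."""
    return [p["url"] for p in result["page_summaries"] if p["js_required"]]


if __name__ == "__main__":
    result = json.loads(RAW)
    print(result["topology"], result["pages_crawled"])
    print(group_by_parent(result))
    print(needs_browser(result))
```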

🚀 Hardened Features

  • 💾 Enterprise Persistence: SQLite/WAL state management. Resumable sessions.
  • 🧠 Autonomous Fidelity Escalation: Automatic switch to JS rendering if static fetch returns no signals.
  • 🗼 Topology-Aware Prioritization: O(log N) priority queue that adapts to site structure on the fly.
  • 📊 Operational Reflection: Real-time "Intelligence Score" and diagnostic advice.
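
To make the persistence bullet concrete, the sketch below shows one way a resumable frontier could be kept in SQLite with write-ahead logging enabled, using only the standard `sqlite3` module. The table layout and function names are assumptions for illustration; Acon's actual schema is not documented here.

```python
import sqlite3


def open_state(path: str) -> sqlite3.Connection:
    """Open (or create) a crawl-state database in WAL mode."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is appending.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS frontier ("
        "url TEXT PRIMARY KEY, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()
    return conn


def enqueue(conn: sqlite3.Connection, url: str) -> None:
    """Record a URL once; repeats are ignored."""
    conn.execute("INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
    conn.commit()


def mark_done(conn: sqlite3.Connection, url: str) -> None:
    """Mark a URL as fetched so it is skipped after a restart."""
    conn.execute("UPDATE frontier SET status = 'done' WHERE url = ?", (url,))
    conn.commit()


def pending(conn: sqlite3.Connection) -> list:
    """URLs still to fetch, in insertion order."""
    rows = conn.execute(
        "SELECT url FROM frontier WHERE status = 'pending' ORDER BY rowid"
    )
    return [row[0] for row in rows]
```

After a crash, reopening the same file with `open_state` and calling `pending` yields only the URLs that were never marked done, which is the essence of a resumable session.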

🛣️ Roadmap

  • Stealth Integration: Native support for Camoufox (Fingerprint bypass).
  • LLM-Ready Pipeline: Native Trafilatura integration for high-fidelity Markdown output.
  • Discovery API: Expose Acon as a standalone Discovery microservice for non-Python stacks.

Acon is a standalone module designed for high-efficiency site intelligence.
