Skip to main content

Extract clean, readable content from any website

Project description

Contextractor

Extract clean, readable content from any website using Trafilatura.

Available as: PyPI | npm | Docker | Apify actor

Try the Playground to configure extraction settings and preview commands before running.

Install

pip install contextractor

or

npm install -g contextractor

Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.

Usage

contextractor https://example.com

Works with zero config. Pass URLs directly, or use a config file for complex setups:

contextractor https://example.com --precision --save json -o ./results
contextractor --config config.json --max-pages 10

CLI Options

contextractor [OPTIONS] [URLS...]

Crawl Settings:
  --config, -c          Path to JSON config file
  --output-dir, -o      Output directory
  --max-pages           Max pages to crawl (0 = unlimited)
  --crawl-depth         Max link depth from start URLs (0 = start only)
  --headless/--no-headless  Browser headless mode (default: headless)
  --max-concurrency     Max parallel requests (default: 50)
  --max-retries         Max request retries (default: 3)
  --max-results         Max results per crawl (0 = unlimited)

Proxy:
  --proxy-urls          Comma-separated proxy URLs (http://user:pass@host:port)
  --proxy-rotation      Rotation: recommended, perRequest, untilFailure

Browser:
  --launcher            Browser engine: chromium, firefox (default: chromium)
  --wait-until          Page load event: load, networkidle, domcontentloaded (default: load)
  --page-load-timeout   Timeout in seconds (default: 60)
  --ignore-cors         Disable CORS/CSP restrictions
  --close-cookie-modals Auto-dismiss cookie banners
  --max-scroll-height   Max scroll height in pixels (default: 5000)
  --ignore-ssl-errors   Skip SSL certificate verification
  --user-agent          Custom User-Agent string

Crawl Filtering:
  --globs               Comma-separated glob patterns to include
  --excludes            Comma-separated glob patterns to exclude
  --link-selector       CSS selector for links to follow
  --keep-url-fragments  Preserve URL fragments
  --respect-robots-txt  Honor robots.txt

Cookies & Headers:
  --cookies             JSON array of cookie objects
  --headers             JSON object of custom HTTP headers

Output Format:
  --save                Output formats, comma-separated (default: markdown)
                        Valid: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction:
  --precision           High precision mode (less noise)
  --recall              High recall mode (more content)
  --fast                Fast extraction mode (less thorough)
  --no-links            Exclude links from output
  --no-comments         Exclude comments from output
  --include-tables/--no-tables  Include tables (default: include)
  --include-images      Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate         Deduplicate extracted content
  --target-language     Filter by language (e.g. "en")
  --with-metadata/--no-metadata  Extract metadata (default: with)
  --prune-xpath         XPath patterns to remove from content

Diagnostics:
  --verbose, -v         Enable verbose logging

CLI flags override config file settings. Merge order: defaults → config file → CLI args

Config File (optional)

Use a JSON config file to set options:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "save": ["markdown"],
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}

Crawl Settings

Field Type Default Description
urls array [] URLs to extract content from
maxPages int 0 Max pages to crawl (0 = unlimited)
outputDir string "./output" Directory for extracted content
crawlDepth int 0 How deep to follow links (0 = start URLs only)
headless bool true Browser headless mode
maxConcurrency int 50 Max parallel browser pages
maxRetries int 3 Max retries for failed requests
maxResults int 0 Max results per crawl (0 = unlimited)

Proxy Configuration

Field Type Default Description
proxy.urls array [] Proxy URLs (http://user:pass@host:port or socks5://host:port)
proxy.rotation string "recommended" recommended, perRequest, untilFailure
proxy.tiered array [] Tiered proxy escalation (config-file only)

Browser Settings

Field Type Default Description
launcher string "chromium" Browser engine: chromium, firefox
waitUntil string "load" Page load event: load, networkidle, domcontentloaded
pageLoadTimeout int 60 Page load timeout in seconds
ignoreCors bool false Disable CORS/CSP restrictions
closeCookieModals bool true Auto-dismiss cookie consent banners
maxScrollHeight int 5000 Max scroll height in pixels (0 = disable)
ignoreSslErrors bool false Skip SSL certificate verification
userAgent string "" Custom User-Agent string

Crawl Filtering

Field Type Default Description
globs array [] Glob patterns for URLs to include
excludes array [] Glob patterns for URLs to exclude
linkSelector string "" CSS selector for links to follow
keepUrlFragments bool false Treat URLs with different fragments as different pages
respectRobotsTxt bool false Honor robots.txt

Cookies & Headers

Field Type Default Description
cookies array [] Initial cookies ([{"name": "...", "value": "...", "domain": "..."}])
headers object {} Custom HTTP headers ({"Authorization": "Bearer token"})

Output Format

Field Type Default Description
save array ["markdown"] Output formats: markdown, html, text, json, jsonl, xml, xml-tei, all

Content Extraction

All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:

Field Type Default Description
favorPrecision bool false High precision, less noise
favorRecall bool false High recall, more content
includeComments bool true Include comments
includeTables bool true Include tables
includeImages bool false Include images
includeFormatting bool true Preserve formatting
includeLinks bool true Include links
deduplicate bool false Deduplicate content
withMetadata bool true Extract metadata (title, author, date)
targetLanguage string null Filter by language (e.g. "en")
fast bool false Fast mode (less thorough)
pruneXpath array null XPath patterns to remove from content

Node.js API

Use contextractor as a library in your Node.js code:

const { extract } = require("contextractor");

// Extract a single URL
await extract("https://example.com", {
  save: "markdown",
  outputDir: "./output",
});

// Multiple URLs with extraction options
await extract(["https://a.com", "https://b.com"], {
  precision: true,
  noLinks: true,
  includeTables: true,
  save: ["markdown", "json"],
  outputDir: "./results",
});

// Using a config file
await extract("https://example.com", { config: "./config.json" });

ESM import:

import { extract } from "contextractor";

extract(urls, options) returns Promise<void> — output goes to outputDir or stdout. Options use the same camelCase names as listed in CLI Options and Config File.

Python API

Install the extraction engine:

pip install contextractor-engine

Use ContentExtractor to extract content from HTML:

from contextractor_engine import ContentExtractor, TrafilaturaConfig

# Basic extraction
extractor = ContentExtractor()
result = extractor.extract(html, url="https://example.com", output_format="markdown")
print(result.content)

# High precision with custom config
config = TrafilaturaConfig(favor_precision=True, include_tables=True, deduplicate=True)
extractor = ContentExtractor(config=config)
result = extractor.extract(html, output_format="json")

Extract metadata:

meta = extractor.extract_metadata(html, url="https://example.com")
print(meta.title, meta.author, meta.date)

Available output formats: txt, markdown, json, xml, xmltei

See the contextractor-engine README for full API reference.

Docker

docker run ghcr.io/contextractor/contextractor https://example.com

Save output to your local machine:

docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Use a config file:

docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json

All CLI flags work the same inside Docker.

Docker from Code

Call Docker extraction programmatically:

Node.js:

const { execSync } = require("child_process");
const result = execSync(
  "docker run ghcr.io/contextractor/contextractor https://example.com",
  { encoding: "utf-8" }
);
console.log(result);

Python:

import subprocess
result = subprocess.run(
    ["docker", "run", "ghcr.io/contextractor/contextractor", "https://example.com"],
    capture_output=True, text=True
)
print(result.stdout)

Volume mount for output:

docker run -v $(pwd)/output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Output

One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.

Platforms

  • npm: macOS arm64, Linux (x64, arm64), Windows x64
  • Docker: linux/amd64, linux/arm64

License

Apache-2.0

Docs version

2026-04-16T12:41:28Z

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

contextractor-0.3.12-py3-none-any.whl (18.1 kB view details)

Uploaded Python 3

File details

Details for the file contextractor-0.3.12-py3-none-any.whl.

File metadata

  • Download URL: contextractor-0.3.12-py3-none-any.whl
  • Upload date:
  • Size: 18.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.11.7 {"installer":{"name":"uv","version":"0.11.7","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for contextractor-0.3.12-py3-none-any.whl
Algorithm Hash digest
SHA256 fd52a7d31c5a6fec282deb71c1b5b0ea15199e858c926c7868883a376e505ff4
MD5 c463c5a5bdac42b7bbccbac8dfb3b0c3
BLAKE2b-256 17bbff2fa0464d38e6e92bba1130d8719982370e743f9a1ce2a8b903b509d34b

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page