Extract clean, readable content from any website

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

contextractor

These details have not been verified by PyPI

Project links

Homepage

Project description

Contextractor

Extract clean, readable content from any website using Trafilatura.

Available as: pip | npm | Docker | Apify actor

Try the Playground to configure extraction settings and preview commands before running.

Install

pip install contextractor

npm install -g contextractor

Requires Python 3.12+ (pip) or Node.js 18+ (npm). Playwright Chromium is installed automatically.

Usage

contextractor https://example.com

Works with zero config. Pass URLs directly, or use a config file for complex setups:

contextractor https://example.com --precision --save-json -o ./results
contextractor --config config.json --max-pages 10

CLI Options

contextractor [OPTIONS] [URLS...]

Crawl Settings:
  --config, -c          Path to JSON config file
  --output-dir, -o      Output directory
  --max-pages           Max pages to crawl (0 = unlimited)
  --crawl-depth         Max link depth from start URLs (0 = start only)
  --headless/--no-headless  Browser headless mode (default: headless)
  --max-concurrency     Max parallel requests (default: 50)
  --max-retries         Max request retries (default: 3)
  --max-results         Max results per crawl (0 = unlimited)

Proxy:
  --proxy-urls          Comma-separated proxy URLs (http://user:pass@host:port)
  --proxy-rotation      Rotation: recommended, per_request, until_failure

Browser:
  --launcher            Browser engine: chromium, firefox (default: chromium)
  --wait-until          Page load event: load, networkidle, domcontentloaded (default: load)
  --page-load-timeout   Timeout in seconds (default: 60)
  --ignore-cors         Disable CORS/CSP restrictions
  --close-cookie-modals Auto-dismiss cookie banners
  --max-scroll-height   Max scroll height in pixels (default: 5000)
  --ignore-ssl-errors   Skip SSL certificate verification
  --user-agent          Custom User-Agent string

Crawl Filtering:
  --globs               Comma-separated glob patterns to include
  --excludes            Comma-separated glob patterns to exclude
  --link-selector       CSS selector for links to follow
  --keep-url-fragments  Preserve URL fragments
  --respect-robots-txt  Honor robots.txt

Cookies & Headers:
  --cookies             JSON array of cookie objects
  --headers             JSON object of custom HTTP headers

Output Toggles:
  --save-markdown/--no-save-markdown  Save extracted markdown (default: true)
  --save-raw-html       Save raw HTML to output
  --save-text           Save extracted text
  --save-json           Save extracted JSON
  --save-jsonl          Save all pages as JSONL (single file)
  --save-xml            Save extracted XML
  --save-xml-tei        Save extracted XML-TEI

Content Extraction:
  --precision           High precision mode (less noise)
  --recall              High recall mode (more content)
  --fast                Fast extraction mode (less thorough)
  --no-links            Exclude links from output
  --no-comments         Exclude comments from output
  --include-tables/--no-tables  Include tables (default: include)
  --include-images      Include image descriptions
  --include-formatting/--no-formatting  Preserve formatting (default: preserve)
  --deduplicate         Deduplicate extracted content
  --target-language     Filter by language (e.g. "en")
  --with-metadata/--no-metadata  Extract metadata (default: with)
  --prune-xpath         XPath patterns to remove from content

Diagnostics:
  --verbose, -v         Enable verbose logging

CLI flags override config file settings. Merge order: defaults → config file → CLI args

Config File (optional)

Use a JSON config file to set options:

{
  "urls": ["https://example.com", "https://docs.example.com"],
  "saveMarkdown": true,
  "outputDir": "./output",
  "crawlDepth": 1,
  "proxy": {
    "urls": ["http://user:pass@host:port"],
    "rotation": "recommended"
  },
  "trafilaturaConfig": {
    "favorPrecision": true,
    "includeLinks": true,
    "includeTables": true,
    "deduplicate": true
  }
}

Crawl Settings

Field	Type	Default	Description
`urls`	array	`[]`	URLs to extract content from
`maxPages`	int	0	Max pages to crawl (0 = unlimited)
`outputDir`	string	`"./output"`	Directory for extracted content
`crawlDepth`	int	0	How deep to follow links (0 = start URLs only)
`headless`	bool	true	Browser headless mode
`maxConcurrency`	int	50	Max parallel browser pages
`maxRetries`	int	3	Max retries for failed requests
`maxResults`	int	0	Max results per crawl (0 = unlimited)

Proxy Configuration

Field	Type	Default	Description
`proxy.urls`	array	`[]`	Proxy URLs (`http://user:pass@host:port` or `socks5://host:port`)
`proxy.rotation`	string	`"recommended"`	`recommended`, `per_request`, `until_failure`
`proxy.tiered`	array	`[]`	Tiered proxy escalation (config-file only)

Browser Settings

Field	Type	Default	Description
`launcher`	string	`"chromium"`	Browser engine: `chromium`, `firefox`
`waitUntil`	string	`"load"`	Page load event: `load`, `networkidle`, `domcontentloaded`
`pageLoadTimeout`	int	60	Page load timeout in seconds
`ignoreCors`	bool	false	Disable CORS/CSP restrictions
`closeCookieModals`	bool	true	Auto-dismiss cookie consent banners
`maxScrollHeight`	int	5000	Max scroll height in pixels (0 = disable)
`ignoreSslErrors`	bool	false	Skip SSL certificate verification
`userAgent`	string	`""`	Custom User-Agent string

Crawl Filtering

Field	Type	Default	Description
`globs`	array	`[]`	Glob patterns for URLs to include
`excludes`	array	`[]`	Glob patterns for URLs to exclude
`linkSelector`	string	`""`	CSS selector for links to follow
`keepUrlFragments`	bool	false	Treat URLs with different fragments as different pages
`respectRobotsTxt`	bool	false	Honor robots.txt

Cookies & Headers

Field	Type	Default	Description
`cookies`	array	`[]`	Initial cookies (`[{"name": "...", "value": "...", "domain": "..."}]`)
`headers`	object	`{}`	Custom HTTP headers (`{"Authorization": "Bearer token"}`)

Output Toggles

Each toggle saves its format independently. Multiple can be enabled at once:

Field	Type	Default	Description
`saveMarkdown`	bool	true	Save extracted markdown
`saveRawHtml`	bool	false	Save raw HTML
`saveText`	bool	false	Save extracted plain text
`saveJson`	bool	false	Save extracted JSON
`saveJsonl`	bool	false	Save all pages as JSONL (single file)
`saveXml`	bool	false	Save extracted XML
`saveXmlTei`	bool	false	Save extracted XML-TEI

Content Extraction

All options go under the trafilaturaConfig key in config files, or use the equivalent CLI flags:

Field	Type	Default	Description
`favorPrecision`	bool	false	High precision, less noise
`favorRecall`	bool	false	High recall, more content
`includeComments`	bool	true	Include comments
`includeTables`	bool	true	Include tables
`includeImages`	bool	false	Include images
`includeFormatting`	bool	true	Preserve formatting
`includeLinks`	bool	true	Include links
`deduplicate`	bool	false	Deduplicate content
`withMetadata`	bool	true	Extract metadata (title, author, date)
`targetLanguage`	string	null	Filter by language (e.g. `"en"`)
`fast`	bool	false	Fast mode (less thorough)
`pruneXpath`	array	null	XPath patterns to remove from content

Docker

docker run ghcr.io/contextractor/contextractor https://example.com

Save output to your local machine:

docker run -v ./output:/output ghcr.io/contextractor/contextractor https://example.com -o /output

Use a config file:

docker run -v ./config.json:/config.json ghcr.io/contextractor/contextractor --config /config.json

All CLI flags work the same inside Docker.

Output

One file per crawled page, named from the URL slug (e.g. example-com-page.md). Metadata (title, author, date) is included in the output header when available.

Platforms

npm: macOS arm64, Linux (x64, arm64), Windows x64
Docker: linux/amd64, linux/arm64

License

Apache-2.0

Project details

These details have been verified by PyPI

Project links

GitHub Statistics

Maintainers

contextractor

These details have not been verified by PyPI

Project links

Homepage

Release history Release notifications | RSS feed

0.3.12

Apr 16, 2026

0.3.11

Apr 14, 2026

This version

0.3.10

Apr 12, 2026

0.3.9

Apr 12, 2026

0.3.8

Apr 12, 2026

0.3.7

Apr 12, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

contextractor-0.3.10-py3-none-any.whl (17.5 kB view details)

Uploaded Apr 12, 2026 Python 3

File details

Details for the file contextractor-0.3.10-py3-none-any.whl.

File metadata

Download URL: contextractor-0.3.10-py3-none-any.whl
Upload date: Apr 12, 2026
Size: 17.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.6 {"installer":{"name":"uv","version":"0.11.6","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for contextractor-0.3.10-py3-none-any.whl
Algorithm	Hash digest
SHA256	`0a2f4f2f73d1bc9a1e9c18660f13ec33127c93357564fdbe9736999d7b4d9eee`
MD5	`1d2fc1a00f1072b0b994c12ff623b5f8`
BLAKE2b-256	`f88f9e98e569c68846e9d6e002833192ac445b14ab733be31032d4bd3ea297dc`

See more details on using hashes here.

contextractor 0.3.10

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Project description

Contextractor

Install

Usage

CLI Options

Config File (optional)

Crawl Settings

Proxy Configuration

Browser Settings

Crawl Filtering

Cookies & Headers

Output Toggles

Content Extraction

Docker

Output

Platforms

License

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distributions

Built Distribution

File details

File metadata

File hashes