Python async re-implementation of CeWL (Custom Word List Generator)

pycewl

A modern, high-performance Python implementation of CeWL with:

Async spider using asyncio + httpx
HTML parsing with beautifulsoup4
Google Search discovery for seed URLs
Smart Word Relevance Scoring using Google Natural Language API
Production-grade packaging for PyPI

Installation

pip install pycewl

For development:

pip install "pycewl[dev]"

For PDF metadata extraction:

pip install "pycewl[pdf]"

Quick Start

Basic Usage

Spider a website and generate a wordlist:

pycewl https://example.com -w words.txt

With Options

# Set spider depth and show word counts
pycewl https://example.com -d 3 -c -w words.txt

# Extract emails too
pycewl https://example.com -e --email-file emails.txt -w words.txt

# Lowercase words, minimum 5 characters
pycewl https://example.com --lowercase -m 5 -w words.txt
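To make the effect of `--lowercase`, `-m` and `-c` concrete, here is a minimal sketch of that kind of filtering and counting on plain text. It is not pycewl's actual implementation: the `count_words` helper, the letter-only regular expression and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical helper, not part of pycewl.
# Matches runs of letters only (Unicode-aware, no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(text, min_length=3, lowercase=False):
    """Count words of at least min_length characters in plain text."""
    counts = Counter()
    for word in WORD_RE.findall(text):
        if len(word) < min_length:
            continue  # too short, comparable to the -m cutoff
        counts[word.lower() if lowercase else word] += 1
    return counts


sample = "Welcome to the Enterprise. The enterprise crew welcomes Spock."
counts = count_words(sample, min_length=5, lowercase=True)

# Print in "word, count" form, most frequent first.
for word, n in counts.most_common():
    print(f"{word}, {n}")
```

With lowercasing on, "Enterprise" and "enterprise" collapse into a single entry, and "crew" is dropped because it is shorter than five characters.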

Google Search Integration

Find seed URLs using Google Custom Search:

# Set environment variables
export GOOGLE_API_KEY="your-api-key"
export GOOGLE_SEARCH_ENGINE_ID="your-search-engine-id"

# Search and spider
pycewl --google-keyword "star trek fan site" -w words.txt
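For readers curious what seed discovery could look like under the hood, the sketch below builds a request for Google's Custom Search JSON API and extracts result links from a response. How pycewl actually does this is not documented here; the function names are hypothetical, and the standard library is used instead of httpx so the example has no dependencies.

```python
import json
import os
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, max_results=10):
    """Build a Custom Search request URL (the API accepts at most 10 results per call)."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": min(max_results, 10)}
    return f"{API_ENDPOINT}?{urllib.parse.urlencode(params)}"


def extract_seed_urls(response):
    """Pull the result links out of a decoded search response."""
    return [item["link"] for item in response.get("items", []) if "link" in item]


def search_seed_urls(query, max_results=10):
    """Run the search; needs network access and the two environment variables."""
    url = build_search_url(
        query,
        os.environ["GOOGLE_API_KEY"],
        os.environ["GOOGLE_SEARCH_ENGINE_ID"],
        max_results,
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_seed_urls(json.load(resp))


# Offline demonstration with a hand-written response of the expected shape.
fake_response = {"items": [{"link": "https://example.com/a"}, {"link": "https://example.com/b"}]}
print(extract_seed_urls(fake_response))
```

The network call is kept in its own function so the URL building and response parsing can be exercised without credentials.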

Smart Relevance Scoring

Group words by relevance to your search query using Google NLP:

# Set Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Spider with relevance scoring
pycewl --google-keyword "star trek" --relevance-scoring \
    --related-file related.txt \
    --unrelated-file general.txt

Output structure:

=== Words Related to "star trek" ===
enterprise, 42
spock, 38
federation, 25
...

=== General Words (Not Query-Specific) ===
welcome, 15
contact, 12
page, 8
...
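The two groups above can be thought of as a threshold split over per-word relevance scores. The following sketch assumes each word already has a score between 0 and 1 and that a word counts as related when its score is at or above the threshold; both the helper names and the scores are made up, and pycewl's real scoring logic may differ.

```python
def split_by_relevance(counts, scores, threshold=0.5):
    """Partition word counts into (related, general) using per-word scores."""
    related, general = {}, {}
    for word, count in counts.items():
        # Words without a score are treated as not query-specific (an assumption).
        target = related if scores.get(word, 0.0) >= threshold else general
        target[word] = count
    return related, general


def format_group(title, group):
    """Render a header line followed by 'word, count' lines, most frequent first."""
    lines = [f"=== {title} ==="]
    lines += [f"{w}, {c}" for w, c in sorted(group.items(), key=lambda kv: -kv[1])]
    return "\n".join(lines)


counts = {"enterprise": 42, "spock": 38, "welcome": 15, "contact": 12}
scores = {"enterprise": 0.9, "spock": 0.8, "welcome": 0.1}  # made-up values

related, general = split_by_relevance(counts, scores)
print(format_group('Words Related to "star trek"', related))
print()
print(format_group("General Words (Not Query-Specific)", general))
```

Raising the threshold moves borderline words into the general group, which is the trade-off `--relevance-threshold` appears to control.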

CLI Reference

Option	Description
`-d, --depth INT`	Spider depth (default: 2)
`-m, --min-word-length INT`	Minimum word length (default: 3)
`-x, --max-word-length INT`	Maximum word length
`-w, --write PATH`	Output file for words
`-n, --no-words`	Don't output wordlist
`-c, --count`	Show word counts
`-g, --groups INT`	Group words by count ranges
`-o, --offsite`	Allow spidering offsite URLs
`-e, --email`	Extract email addresses
`--email-file PATH`	Output file for emails
`-a, --meta`	Extract metadata from documents
`--meta-file PATH`	Output file for metadata
`-k, --keep`	Keep downloaded files
`--lowercase`	Convert words to lowercase
`--with-numbers`	Include words containing numbers
`--convert-umlauts`	Convert German umlauts to ASCII
`-u, --user-agent TEXT`	Custom user agent
`--concurrency INT`	Concurrent requests (default: 10)
`--auth-type TEXT`	Authentication type (basic/digest/bearer)
`--auth-user TEXT`	Authentication username
`--auth-pass TEXT`	Authentication password
`--auth-token TEXT`	Bearer/JWT token for authentication
`--proxy-host TEXT`	Proxy hostname
`--proxy-port INT`	Proxy port
`-H, --header TEXT`	HTTP header (Name: Value)
`-v, --verbose`	Verbose output
`--google-keyword TEXT`	Search Google for seed URLs
`--google-max-results INT`	Max Google results (default: 10)
`--relevance-scoring`	Enable word relevance scoring
`--relevance-threshold FLOAT`	Relevance threshold (default: 0.5)
`--related-file PATH`	Output for query-related words
`--unrelated-file PATH`	Output for general words
`--version`	Show version
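As an illustration of what `--convert-umlauts` might do, the sketch below applies the conventional German transliteration (ä to ae, ö to oe, ü to ue, ß to ss). The exact mapping pycewl uses is not stated in this document, so treat the table and the `convert_umlauts` name as assumptions.

```python
# Conventional German transliteration table (assumed, not taken from pycewl).
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def convert_umlauts(word):
    """Replace German umlauts and eszett with ASCII equivalents."""
    return word.translate(UMLAUT_MAP)


for word in ["Größe", "Über", "Mädchen"]:
    print(word, "->", convert_umlauts(word))
```

A conversion like this is useful when the target system only accepts ASCII passwords but the source site is written in German.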

Bearer Token Authentication

Authenticate with a bearer or JWT token:

# Using --auth-token (auto-detects bearer type)
pycewl https://api.example.com --auth-token "your-access-token" -w words.txt

# Explicit auth type
pycewl https://api.example.com --auth-type bearer --auth-token "eyJhbG..." -w words.txt
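On the wire, bearer authentication is a single `Authorization: Bearer <token>` request header. The sketch below shows one way a token and any extra `Name: Value` headers (the form accepted by `-H`) could be merged into a header dictionary; `build_auth_headers` is a hypothetical helper, not a pycewl function.

```python
def build_auth_headers(token, extra_headers=()):
    """Return request headers that include a bearer Authorization entry.

    extra_headers holds strings in "Name: Value" form.
    """
    headers = {}
    for raw in extra_headers:
        name, sep, value = raw.partition(":")
        if not sep:
            raise ValueError(f"expected 'Name: Value', got {raw!r}")
        headers[name.strip()] = value.strip()
    # Set last so the token always wins over a user-supplied Authorization header.
    headers["Authorization"] = f"Bearer {token}"
    return headers


print(build_auth_headers("your-access-token", ["Accept: application/json"]))
```

Letting the token override a conflicting custom header is a design choice of this sketch; a tool could equally reject the conflict.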

Python API

import asyncio
from pycewl import Crawler, CeWLConfig, SpiderConfig, WordConfig, WordExtractor

async def main():
    # Spider two levels deep with a concurrency of five;
    # keep lowercased words of at least four characters.
    config = CeWLConfig(
        url="https://example.com",
        spider=SpiderConfig(depth=2, concurrency=5),
        word=WordConfig(min_length=4, lowercase=True),
    )

    crawler = Crawler(config)
    extractor = WordExtractor(config.word)

    # Feed the HTML of each crawled page into the extractor.
    async for result in crawler.crawl(["https://example.com"]):
        if result.html:
            extractor.process_html(result.html)

    # Print the first 20 entries of the sorted word list.
    for word, count in extractor.get_sorted_words()[:20]:
        print(f"{word}: {count}")

asyncio.run(main())
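The `concurrency` setting can be read as a cap on the number of requests in flight at once. A common asyncio pattern for such a cap is a semaphore, sketched below with a stand-in `fetch` coroutine that sleeps instead of touching the network. Whether pycewl uses this exact pattern internally is an assumption, and all names in the block are hypothetical.

```python
import asyncio


async def fetch(url):
    """Stand-in for a real HTTP request (hypothetical; no network access)."""
    await asyncio.sleep(0.01)
    return f"<html><body>content of {url}</body></html>"


async def crawl_all(urls, concurrency=5):
    """Fetch every URL while keeping at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def bounded_fetch(url):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    pages = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return pages, peak


urls = [f"https://example.com/page{i}" for i in range(20)]
pages, peak = asyncio.run(crawl_all(urls, concurrency=5))
print(len(pages), "pages fetched, peak concurrency", peak)
```

The `peak` counter exists only to show that the cap holds; a real crawler would drop it.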

Google Cloud Setup

For Google Search

1. Go to Google Cloud Console
2. Create a project and enable the Custom Search API
3. Create an API key
4. Set up a Programmable Search Engine

Set environment variables:

export GOOGLE_API_KEY="your-api-key"
export GOOGLE_SEARCH_ENGINE_ID="your-cx-id"

For Relevance Scoring (NLP)

1. Enable the Natural Language API in Google Cloud Console
2. Create a service account with Natural Language API access
3. Download the JSON key file

Set environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
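Before a long crawl it can help to confirm that the variables from the two setup sections are actually set. The following preflight check is a small sketch: the variable names come from this document, while the `REQUIRED_ENV` table and the `missing_env` helper are hypothetical.

```python
import os

# Environment variables named in the setup steps above, grouped by feature.
REQUIRED_ENV = {
    "search": ["GOOGLE_API_KEY", "GOOGLE_SEARCH_ENGINE_ID"],
    "relevance": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_env(feature, environ=None):
    """Return the names of unset or empty variables for a feature."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV[feature] if not environ.get(name)]


# Demonstration against an explicit mapping instead of the real environment.
print(missing_env("search", {"GOOGLE_API_KEY": "your-api-key"}))
```

Calling `missing_env("relevance")` with no second argument checks the real process environment.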

Development

# Clone repository
git clone https://github.com/digininja/CeWL.git
cd CeWL/pycewl

# Install dev dependencies
make install-dev

# Run tests
make test

# Run linting
make lint

# Format code
make format

# Build package
make build

License

MIT License - see LICENSE for details.

Credits

Original CeWL by Robin Wood (digininja)
The Python implementation maintains feature parity with the Ruby original

Project details

Version 0.1.0, released Jan 26, 2026.

Download files

Source Distribution

pycewl-0.1.0.tar.gz (35.9 kB)

Uploaded Jan 26, 2026 Source

Built Distribution

pycewl-0.1.0-py3-none-any.whl (29.3 kB)

Uploaded Jan 26, 2026 Python 3

File details

Details for the file pycewl-0.1.0.tar.gz.

File metadata

Download URL: pycewl-0.1.0.tar.gz
Upload date: Jan 26, 2026
Size: 35.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pycewl-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b691510ca3fe2c1470b4c94ae99ef80e529873231cd7d4701d58474b5c1de25e`
MD5	`c8717b5fd0591b4562e1e63609cd3a7f`
BLAKE2b-256	`4b617a0a96a16da70f89f0fa82cc1f094afe00b5dd3dfb84ba1733fa1b2162d9`

File details

Details for the file pycewl-0.1.0-py3-none-any.whl.

File metadata

Download URL: pycewl-0.1.0-py3-none-any.whl
Upload date: Jan 26, 2026
Size: 29.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for pycewl-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6f815c5217020d76068430fd398e36efa44520c184cfd3147311d371bc04d423`
MD5	`c5f21d45681fa8129a83694d773ad3f7`
BLAKE2b-256	`a64a7bf127b5899bc26cf598ef9d8c1d81448aedfa0e0ca0325f6d3d796fd4e2`
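The digests listed above can be checked against a downloaded file with Python's `hashlib`. The sketch below streams a file through SHA-256 and compares the result with an expected hex digest; the helper names are hypothetical, and the demonstration uses a temporary file so that it runs without the actual package.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Return True when the file's SHA-256 digest matches expected_hex."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demonstration on a temporary file containing b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
    demo_path = tmp.name

ABC_SHA256 = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
demo_ok = verify(demo_path, ABC_SHA256)
print("digest matches:", demo_ok)
os.remove(demo_path)
```

For the real artifacts, pass the path of the downloaded file and the SHA256 value from the corresponding table, for example `verify("pycewl-0.1.0.tar.gz", "b691510c...")` with the full digest in place of the shortened one.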
