pycewl

Python async re-implementation of CeWL (Custom Word List Generator)

Python 3.11+. License: MIT.

A modern, high-performance Python implementation of CeWL with:

  • Async spider using asyncio + httpx
  • HTML parsing with beautifulsoup4
  • Google Search discovery for seed URLs
  • Smart Word Relevance Scoring using Google Natural Language API
  • Production-grade packaging for PyPI
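An async spider of this kind can be pictured as a breadth-first crawl whose in-flight requests are capped by a semaphore. The sketch below is an illustration under assumptions, not pycewl's actual code: `crawl` and `fetch_links` are hypothetical names, and the page fetcher is injected so the example runs without network access.

```python
import asyncio
from collections.abc import Awaitable, Callable

# Hypothetical fetcher type: given a URL, return the links found on that page.
FetchLinks = Callable[[str], Awaitable[list[str]]]


async def crawl(seed: str, fetch_links: FetchLinks, depth: int = 2,
                concurrency: int = 10) -> set[str]:
    """Visit each URL at most once, following links up to `depth` levels."""
    limiter = asyncio.Semaphore(concurrency)  # caps concurrent fetches
    visited: set[str] = set()
    queued = {seed}
    frontier = [seed]

    async def visit(url: str) -> list[str]:
        async with limiter:
            return await fetch_links(url)

    for _ in range(depth + 1):  # level 0 is the seed itself
        if not frontier:
            break
        # Fetch the whole level concurrently, bounded by the semaphore.
        pages = await asyncio.gather(*(visit(url) for url in frontier))
        visited.update(frontier)
        next_frontier = []
        for links in pages:
            for link in links:
                if link not in queued:
                    queued.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited
```

In a real spider, `fetch_links` would typically wrap an `httpx.AsyncClient` request and an HTML link parser. The defaults of 2 and 10 mirror the `--depth` and `--concurrency` defaults in the CLI reference.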

Installation

pip install pycewl

For development:

pip install "pycewl[dev]"

For PDF metadata extraction:

pip install "pycewl[pdf]"

Quick Start

Basic Usage

Spider a website and generate a wordlist:

pycewl https://example.com -w words.txt

With Options

# Set spider depth and show word counts
pycewl https://example.com -d 3 -c -w words.txt

# Extract emails too
pycewl https://example.com -e --email-file emails.txt -w words.txt

# Lowercase words, minimum 5 characters
pycewl https://example.com --lowercase -m 5 -w words.txt
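To make the effect of these options concrete, here is a minimal sketch of counting words with a minimum length and optional lowercasing, plus a simple email match. It is not pycewl's implementation: the function names and both regular expressions are assumptions chosen for illustration, and the email pattern is deliberately loose.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of letters only (no digits or underscores).
WORD_RE = re.compile(r"[^\W\d_]+")
# Assumed, intentionally simple email pattern; real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def count_words(text: str, min_length: int = 3, lowercase: bool = False) -> Counter[str]:
    """Count words of at least `min_length` characters."""
    if lowercase:
        text = text.lower()
    return Counter(w for w in WORD_RE.findall(text) if len(w) >= min_length)


def find_emails(text: str) -> set[str]:
    """Return the distinct email-like strings found in `text`."""
    return set(EMAIL_RE.findall(text))


if __name__ == "__main__":
    sample = "Welcome to the Enterprise. The enterprise crew: contact spock@example.com"
    for word, count in count_words(sample, min_length=5, lowercase=True).most_common():
        print(f"{word}, {count}")
    print(find_emails(sample))
```

Note that this naive tokenizer also counts the parts of an email address as words; a real tool may treat them differently.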

Google Search Integration

Find seed URLs using Google Custom Search:

# Set environment variables
export GOOGLE_API_KEY="your-api-key"
export GOOGLE_SEARCH_ENGINE_ID="your-search-engine-id"

# Search and spider
pycewl --google-keyword "star trek fan site" -w words.txt
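For orientation, the two environment variables map onto the `key` and `cx` parameters of Google's Custom Search JSON API. The sketch below builds such a request URL and pulls result links out of a response; `build_search_url` and `seed_urls` are hypothetical helpers, and whether pycewl issues the request in exactly this way is an assumption.

```python
import os
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(keyword: str, max_results: int = 10) -> str:
    """Build a Custom Search JSON API request URL from the environment."""
    params = {
        "key": os.environ["GOOGLE_API_KEY"],
        "cx": os.environ["GOOGLE_SEARCH_ENGINE_ID"],
        "q": keyword,
        # The API returns at most 10 results per request.
        "num": min(max_results, 10),
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def seed_urls(response_json: dict) -> list[str]:
    """Extract result links from a decoded API response."""
    return [item["link"] for item in response_json.get("items", [])]
```

Fetching more than 10 results would need several requests with the API's `start` parameter, which this sketch leaves out.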

Smart Relevance Scoring

Group words by relevance to your search query using the Google Natural Language API:

# Set Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

# Spider with relevance scoring
pycewl --google-keyword "star trek" --relevance-scoring \
    --related-file related.txt \
    --unrelated-file general.txt

Output structure:

=== Words Related to "star trek" ===
enterprise, 42
spock, 38
federation, 25
...

=== General Words (Not Query-Specific) ===
welcome, 15
contact, 12
page, 8
...
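The two-section output can be thought of as a partition of the word counts by a relevance score. The sketch below assumes each word already has a score between 0 and 1 and that words at or above the threshold count as related; the function names, the `>=` comparison, and the example scores are all illustrative assumptions, not values produced by the Natural Language API.

```python
def split_by_relevance(counts: dict[str, int], scores: dict[str, float],
                       threshold: float = 0.5) -> tuple[dict[str, int], dict[str, int]]:
    """Partition word counts into (related, general) by score threshold."""
    related: dict[str, int] = {}
    general: dict[str, int] = {}
    for word, count in counts.items():
        # Words without a score are treated as general (assumption).
        target = related if scores.get(word, 0.0) >= threshold else general
        target[word] = count
    return related, general


def format_section(title: str, words: dict[str, int]) -> str:
    """Render one section, most frequent words first."""
    lines = [f"=== {title} ==="]
    for word, count in sorted(words.items(), key=lambda kv: (-kv[1], kv[0])):
        lines.append(f"{word}, {count}")
    return "\n".join(lines)


if __name__ == "__main__":
    counts = {"enterprise": 42, "spock": 38, "welcome": 15}
    scores = {"enterprise": 0.9, "spock": 0.8, "welcome": 0.1}  # made-up scores
    related, general = split_by_relevance(counts, scores)
    print(format_section('Words Related to "star trek"', related))
    print()
    print(format_section("General Words (Not Query-Specific)", general))
```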

CLI Reference

Option                        Description
-d, --depth INT               Spider depth (default: 2)
-m, --min-word-length INT     Minimum word length (default: 3)
-x, --max-word-length INT     Maximum word length
-w, --write PATH              Output file for words
-n, --no-words                Don't output wordlist
-c, --count                   Show word counts
-g, --groups INT              Group words by count ranges
-o, --offsite                 Allow spidering offsite URLs
-e, --email                   Extract email addresses
--email-file PATH             Output file for emails
-a, --meta                    Extract metadata from documents
--meta-file PATH              Output file for metadata
-k, --keep                    Keep downloaded files
--lowercase                   Convert words to lowercase
--with-numbers                Include words containing numbers
--convert-umlauts             Convert German umlauts to ASCII
-u, --user-agent TEXT         Custom user agent
--concurrency INT             Concurrent requests (default: 10)
--auth-type TEXT              Authentication type (basic/digest/bearer)
--auth-user TEXT              Authentication username
--auth-pass TEXT              Authentication password
--auth-token TEXT             Bearer/JWT token for authentication
--proxy-host TEXT             Proxy hostname
--proxy-port INT              Proxy port
-H, --header TEXT             HTTP header (Name: Value)
-v, --verbose                 Verbose output
--google-keyword TEXT         Search Google for seed URLs
--google-max-results INT      Max Google results (default: 10)
--relevance-scoring           Enable word relevance scoring
--relevance-threshold FLOAT   Relevance threshold (default: 0.5)
--related-file PATH           Output for query-related words
--unrelated-file PATH         Output for general words
--version                     Show version
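As an illustration of what an option like --convert-umlauts typically does, the sketch below applies the conventional German transliteration (ä to ae, ö to oe, ü to ue, ß to ss). The exact mapping pycewl uses is not stated here, so treat both the table and the `convert_umlauts` name as assumptions.

```python
# Conventional German transliteration table (assumed, not taken from pycewl).
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def convert_umlauts(word: str) -> str:
    """Replace German umlauts and sharp s with ASCII equivalents."""
    return word.translate(UMLAUT_MAP)


if __name__ == "__main__":
    for word in ["Größe", "Über", "müller"]:
        print(word, "->", convert_umlauts(word))
```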

Bearer Token Authentication

Authenticate with a bearer or JWT token:

# Using --auth-token (auto-detects bearer type)
pycewl https://api.example.com --auth-token "your-access-token" -w words.txt

# Explicit auth type
pycewl https://api.example.com --auth-type bearer --auth-token "eyJhbG..." -w words.txt

Python API

import asyncio
from pycewl import Crawler, CeWLConfig, SpiderConfig, WordConfig, WordExtractor

async def main():
    # Top-level configuration: target URL plus spider and word-filter settings
    config = CeWLConfig(
        url="https://example.com",
        spider=SpiderConfig(depth=2, concurrency=5),
        word=WordConfig(min_length=4, lowercase=True),
    )

    crawler = Crawler(config)
    extractor = WordExtractor(config.word)

    # Crawl from the seed URL and feed each fetched HTML page to the extractor
    async for result in crawler.crawl(["https://example.com"]):
        if result.html:
            extractor.process_html(result.html)

    # Print the first 20 entries of the sorted word list
    for word, count in extractor.get_sorted_words()[:20]:
        print(f"{word}: {count}")

asyncio.run(main())
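Before words can be counted, markup has to be reduced to visible text. The feature list names beautifulsoup4 for that job; the sketch below swaps in the standard library's `html.parser` so it runs without extra dependencies. `TextExtractor` and `visible_text` are hypothetical names, and this is not a description of what `process_html` does internally.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = "<html><body><h1>Star Trek</h1><script>var x = 1;</script></body></html>"
    print(visible_text(page))
```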

Google Cloud Setup

For Google Search

  1. Go to Google Cloud Console
  2. Create a project and enable the Custom Search API
  3. Create an API key
  4. Set up a Programmable Search Engine
  5. Set environment variables:
    export GOOGLE_API_KEY="your-api-key"
    export GOOGLE_SEARCH_ENGINE_ID="your-cx-id"
    

For Relevance Scoring (NLP)

  1. Enable the Natural Language API in Google Cloud Console
  2. Create a service account with Natural Language API access
  3. Download the JSON key file
  4. Set environment variable:
    export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
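Since each feature depends on different environment variables, a quick check before a long crawl can save a failed run. The sketch below reports which variables are unset for a given feature; the `REQUIRED` table and `missing_settings` helper are hypothetical and only restate the variable names from the setup steps.

```python
import os
from collections.abc import Mapping

# Variable names taken from the setup steps; the grouping keys are assumptions.
REQUIRED = {
    "search": ["GOOGLE_API_KEY", "GOOGLE_SEARCH_ENGINE_ID"],
    "relevance": ["GOOGLE_APPLICATION_CREDENTIALS"],
}


def missing_settings(feature: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for `feature`."""
    return [name for name in REQUIRED[feature] if not env.get(name)]


if __name__ == "__main__":
    for feature in REQUIRED:
        missing = missing_settings(feature)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{feature}: {status}")
```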
    

Development

# Clone repository
git clone https://github.com/digininja/CeWL.git
cd CeWL/pycewl

# Install dev dependencies
make install-dev

# Run tests
make test

# Run linting
make lint

# Format code
make format

# Build package
make build

License

MIT License - see LICENSE for details.

Credits
