Python async re-implementation of CeWL (Custom Word List Generator)
pycewl
A modern, high-performance Python implementation of CeWL with:
- Async spider using `asyncio` + `httpx`
- HTML parsing with `beautifulsoup4`
- Google Search discovery for seed URLs
- Smart Word Relevance Scoring using Google Natural Language API
- Production-grade packaging for PyPI
Installation
pip install pycewl
For development:
pip install pycewl[dev]
For PDF metadata extraction:
pip install pycewl[pdf]
Quick Start
Basic Usage
Spider a website and generate a wordlist:
pycewl https://example.com -w words.txt
With Options
# Set spider depth and show word counts
pycewl https://example.com -d 3 -c -w words.txt
# Extract emails too
pycewl https://example.com -e --email-file emails.txt -w words.txt
# Lowercase words, minimum 5 characters
pycewl https://example.com --lowercase -m 5 -w words.txt
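To make the effect of `--lowercase` and `-m 5` concrete, here is a minimal standard-library sketch of that kind of filtering. It is an illustration only, not pycewl's own extraction code; the function name `count_words` and the letters-only regular expression are assumptions made for this example.

```python
import re
from collections import Counter


def count_words(text, min_length=3, lowercase=False):
    """Count alphabetic words that are at least min_length characters long."""
    if lowercase:
        text = text.lower()
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text)
    return Counter(w for w in words if len(w) >= min_length)


sample = "Welcome to the Enterprise. The enterprise crew welcomes you."
counts = count_words(sample, min_length=5, lowercase=True)

# Print in the "word, count" style used by the wordlist output.
for word, count in counts.most_common():
    print(f"{word}, {count}")
```

With lowercasing enabled, "Enterprise" and "enterprise" are counted as the same word, and short words such as "the" and "crew" fall below the five-character minimum.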
Google Search Integration
Find seed URLs using Google Custom Search:
# Set environment variables
export GOOGLE_API_KEY="your-api-key"
export GOOGLE_SEARCH_ENGINE_ID="your-search-engine-id"
# Search and spider
pycewl --google-keyword "star trek fan site" -w words.txt
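For orientation, the two environment variables map onto the `key` and `cx` parameters of Google's Custom Search JSON API. The sketch below only builds the request URL and does not send it; `build_search_url` is a hypothetical helper for this example, and how pycewl assembles its requests internally may differ.

```python
import os
from urllib.parse import urlencode

# Public endpoint of the Custom Search JSON API.
CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, max_results=10, env=None):
    """Build a Custom Search request URL from the two environment variables."""
    env = os.environ if env is None else env
    params = {
        "key": env["GOOGLE_API_KEY"],          # API key
        "cx": env["GOOGLE_SEARCH_ENGINE_ID"],  # Programmable Search Engine ID
        "q": query,
        "num": max_results,                    # the API returns at most 10 per request
    }
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"


# Placeholder values are passed explicitly so the example runs without real credentials.
url = build_search_url(
    "star trek fan site",
    env={
        "GOOGLE_API_KEY": "your-api-key",
        "GOOGLE_SEARCH_ENGINE_ID": "your-search-engine-id",
    },
)
print(url)
```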
Smart Relevance Scoring
Group words by relevance to your search query using Google NLP:
# Set Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# Spider with relevance scoring
pycewl --google-keyword "star trek" --relevance-scoring \
    --related-file related.txt \
    --unrelated-file general.txt
Output structure:
=== Words Related to "star trek" ===
enterprise, 42
spock, 38
federation, 25
...
=== General Words (Not Query-Specific) ===
welcome, 15
contact, 12
page, 8
...
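The two-section output can be thought of as a threshold split over per-word relevance scores. The sketch below assumes each word already has a score between 0 and 1 and that a score at or above the threshold counts as related; the scores shown are made up for illustration, and the helper names `split_by_relevance` and `format_section` are hypothetical rather than part of pycewl.

```python
def split_by_relevance(counts, scores, threshold=0.5):
    """Partition word counts into (related, general) by relevance score."""
    related, general = {}, {}
    for word, count in counts.items():
        # Assumption: words without a score are treated as not query-specific.
        target = related if scores.get(word, 0.0) >= threshold else general
        target[word] = count
    return related, general


def format_section(title, words):
    """Render one section as a header followed by 'word, count' lines."""
    lines = [f"=== {title} ==="]
    for word, count in sorted(words.items(), key=lambda item: -item[1]):
        lines.append(f"{word}, {count}")
    return "\n".join(lines)


counts = {"enterprise": 42, "spock": 38, "welcome": 15, "contact": 12}
scores = {"enterprise": 0.9, "spock": 0.8, "welcome": 0.1}  # illustrative scores only

related, general = split_by_relevance(counts, scores, threshold=0.5)
print(format_section('Words Related to "star trek"', related))
print(format_section("General Words (Not Query-Specific)", general))
```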
CLI Reference
| Option | Description |
|---|---|
| `-d, --depth INT` | Spider depth (default: 2) |
| `-m, --min-word-length INT` | Minimum word length (default: 3) |
| `-x, --max-word-length INT` | Maximum word length |
| `-w, --write PATH` | Output file for words |
| `-n, --no-words` | Don't output wordlist |
| `-c, --count` | Show word counts |
| `-g, --groups INT` | Group words by count ranges |
| `-o, --offsite` | Allow spidering offsite URLs |
| `-e, --email` | Extract email addresses |
| `--email-file PATH` | Output file for emails |
| `-a, --meta` | Extract metadata from documents |
| `--meta-file PATH` | Output file for metadata |
| `-k, --keep` | Keep downloaded files |
| `--lowercase` | Convert words to lowercase |
| `--with-numbers` | Include words containing numbers |
| `--convert-umlauts` | Convert German umlauts to ASCII |
| `-u, --user-agent TEXT` | Custom user agent |
| `--concurrency INT` | Concurrent requests (default: 10) |
| `--auth-type TEXT` | Authentication type (basic/digest/bearer) |
| `--auth-user TEXT` | Authentication username |
| `--auth-pass TEXT` | Authentication password |
| `--auth-token TEXT` | Bearer/JWT token for authentication |
| `--proxy-host TEXT` | Proxy hostname |
| `--proxy-port INT` | Proxy port |
| `-H, --header TEXT` | HTTP header (Name: Value) |
| `-v, --verbose` | Verbose output |
| `--google-keyword TEXT` | Search Google for seed URLs |
| `--google-max-results INT` | Max Google results (default: 10) |
| `--relevance-scoring` | Enable word relevance scoring |
| `--relevance-threshold FLOAT` | Relevance threshold (default: 0.5) |
| `--related-file PATH` | Output for query-related words |
| `--unrelated-file PATH` | Output for general words |
| `--version` | Show version |
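As an illustration of what an option like `--convert-umlauts` typically does, the sketch below transliterates German umlauts and ß to ASCII. The mapping used here (ä to ae, ö to oe, ü to ue, ß to ss) is the conventional German transliteration and is an assumption for this example; pycewl's exact mapping is not documented above.

```python
# Translation table from single characters to their ASCII transliterations.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def convert_umlauts(word):
    """Replace German umlauts and ß with ASCII equivalents."""
    return word.translate(UMLAUT_MAP)


print(convert_umlauts("Größe"))
print(convert_umlauts("Übung"))
```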
Bearer Token Authentication
Authenticate with a bearer or JWT token:
# Using --auth-token (auto-detects bearer type)
pycewl https://api.example.com --auth-token "your-access-token" -w words.txt
# Explicit auth type
pycewl https://api.example.com --auth-type bearer --auth-token "eyJhbG..." -w words.txt
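On the wire, bearer authentication is an `Authorization: Bearer <token>` request header, and `-H` values in `Name: Value` form are ordinary extra headers. The sketch below shows one way such a header set could be assembled; `build_headers` is a hypothetical helper for this example and not a pycewl function.

```python
def build_headers(auth_token=None, extra_headers=()):
    """Combine 'Name: Value' header strings with an optional bearer token."""
    headers = {}
    for raw in extra_headers:
        # Split on the first colon only, so values may contain colons themselves.
        name, sep, value = raw.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"Expected 'Name: Value', got {raw!r}")
        headers[name.strip()] = value.strip()
    if auth_token:
        headers["Authorization"] = f"Bearer {auth_token}"
    return headers


headers = build_headers("your-access-token", ["Accept: application/json"])
print(headers)
```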
Python API
import asyncio

from pycewl import Crawler, CeWLConfig, SpiderConfig, WordConfig, WordExtractor


async def main():
    config = CeWLConfig(
        url="https://example.com",
        spider=SpiderConfig(depth=2, concurrency=5),
        word=WordConfig(min_length=4, lowercase=True),
    )
    crawler = Crawler(config)
    extractor = WordExtractor(config.word)

    # Feed every crawled page that returned HTML into the word extractor.
    async for result in crawler.crawl(["https://example.com"]):
        if result.html:
            extractor.process_html(result.html)

    # Print the first 20 entries of the sorted word list.
    for word, count in extractor.get_sorted_words()[:20]:
        print(f"{word}: {count}")


asyncio.run(main())
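The `concurrency` setting caps how many requests are in flight at once. A common way to get that behaviour with `asyncio` is a semaphore, sketched below using only the standard library. The `fetch` coroutine is a stand-in that sleeps instead of making an HTTP request, and nothing here is taken from pycewl's internals.

```python
import asyncio


async def fetch(url):
    """Stand-in for a real HTTP request; sleeps briefly and returns fake HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, concurrency=5):
    """Fetch all URLs with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)
    state = {"active": 0, "peak": 0}  # bookkeeping to observe the limit

    async def worker(url):
        async with semaphore:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
            try:
                return await fetch(url)
            finally:
                state["active"] -= 1

    pages = await asyncio.gather(*(worker(u) for u in urls))
    return pages, state["peak"]


urls = [f"https://example.com/page/{i}" for i in range(20)]
pages, peak = asyncio.run(crawl(urls, concurrency=5))
print(len(pages), "pages fetched; peak concurrency:", peak)
```

The semaphore keeps the spider from opening more connections than the configured limit, however many URLs are queued.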
Google Cloud Setup
For Google Search
- Go to Google Cloud Console
- Create a project and enable the Custom Search API
- Create an API key
- Set up a Programmable Search Engine
- Set environment variables:
export GOOGLE_API_KEY="your-api-key"
export GOOGLE_SEARCH_ENGINE_ID="your-cx-id"
For Relevance Scoring (NLP)
- Enable the Natural Language API in Google Cloud Console
- Create a service account with Natural Language API access
- Download the JSON key file
- Set environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/key.json"
Development
# Clone repository
git clone https://github.com/digininja/CeWL.git
cd CeWL/pycewl
# Install dev dependencies
make install-dev
# Run tests
make test
# Run linting
make lint
# Format code
make format
# Build package
make build
License
MIT License - see LICENSE for details.
Credits
- Original CeWL by Robin Wood (digininja)
- The Python implementation maintains feature parity with the Ruby original