Parser made to convert lists of emails and urls into JSON formatted, CSV formatted, or plain text strings or files
Project description
Pyrolysate
Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses. It breaks down URLs and emails into their component parts, validates against IANA's official TLD list, and outputs structured data in JSON, CSV, or text format.
The library offers both a programmer-friendly API and a command-line interface, making it suitable for both development integration and quick data processing tasks. It handles single entries or large datasets efficiently using Python's generator functionality, and provides flexible input/output options including file processing with custom delimiters.
Features
URL Parsing
- Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
- Support for complex URL patterns including ports, queries, and fragments
- Support for IP addresses in URLs
- Support for both direct input and file processing via CLI or API
- Output as JSON, CSV, or text format through CLI or API
Email Parsing
- Extract local, mail server, and domain components
- Support for plus addressing (e.g., user+tag@domain.com)
- Support for both direct input and file processing via CLI or API
- Output as JSON, CSV, or text format through CLI or API
Top Level Domain Validation
- Automatic updates from IANA's official TLD list
- Local TLD file caching for offline use
- Fallback to common TLDs if both online and local sources fail
Flexible Input/Output
- Process single or multiple entries
- Support for government domain emails (.gov.tld)
- Custom delimiters for file input
- Multiple output formats with .txt format as default (JSON, CSV, text)
- Pretty-printed or minified JSON output
- Console output or file saving options
- Memory-efficient processing of large datasets using Python generators
- Support for compressed input files:
- ZIP archives (processes all text files within .zip)
- GZIP (.gz)
- BZIP2 (.bz2)
- LZMA (.xz, .lzma)
Developer Friendly
- Type hints for better IDE support
- Comprehensive docstrings
- Modular design for easy integration
- Command-line interface for quick testing
API Reference
Email Class
| Method | Parameters | Description |
|---|---|---|
parse_email(email_str) |
email_str: str |
Parses single email address |
parse_email_array(emails) |
emails: list[str] |
Parses list of email addresses |
to_json(emails, prettify=True) |
emails: str|list[str], prettify: bool |
Converts to JSON format |
to_json_file(file_name, emails, prettify=True) |
file_name: str, emails: list[str], prettify: bool |
Converts and saves JSON to file |
to_csv(emails) |
emails: str|list[str] |
Converts to CSV format |
to_csv_file(file_name, emails) |
file_name: str, emails: list[str] |
Converts and saves CSV to file |
URL Class
| Method | Parameters | Description |
|---|---|---|
parse_url(url_str, tlds=[]) |
url_str: str, tlds: list[str] |
Parses single URL |
parse_url_array(urls, tlds=[]) |
urls: list[str], tlds: list[str] |
Parses list of URLs |
to_json(urls, prettify=True) |
urls: str|list[str], prettify: bool |
Converts to JSON format |
to_json_file(file_name, urls, prettify=True) |
file_name: str, urls: list[str], prettify: bool |
Converts and saves JSON to file |
to_csv(urls) |
urls: str|list[str] |
Converts to CSV format |
to_csv_file(file_name, urls) |
file_name: str, urls: list[str] |
Converts and saves CSV to file |
Miscellaneous
| Method | Parameters | Description |
|---|---|---|
file_to_list(input_file_name, delimiter='\n') |
input_file_name: str, delimiter: str |
Parses input file into python list by delimiter |
get_tlds_from_iana |
Fetches latest top level domains from IANA | |
get_tlds_from_local |
path_to_tlds_file: str |
Fetches tlds from local file. Defaults to project's local file if path not specified |
CLI Reference
| Argument | Type | Value when argument is omitted | Description |
|---|---|---|---|
target |
str |
None |
Email or URL string(s) to process |
-u, --url |
flag |
False |
Specify URL input |
-e, --email |
flag |
False |
Specify Email input |
-i, --input_file |
str |
None |
Input file name with extension |
-o, --output_file |
str |
None |
Output file name without extension |
-c, --csv |
flag |
False |
Save output as CSV format |
-j, --json |
flag |
False |
Save output as JSON format |
-np, --no_prettify |
flag |
False |
Turn off prettified JSON output |
-d, --delimiter |
str |
'\n' |
Delimiter for input file parsing |
Input File Support
| Format | Extension | Description |
|---|---|---|
| Text | .txt | Plain text files |
| Log | .log | Plain text log files |
| CSV | .csv | Comma-separated values |
| ZIP | .zip | Archives containing text files |
| GZIP | .gz | GZIP compressed files |
| BZIP2 | .bz2 | BZIP2 compressed files |
| LZMA | .xz, .lzma | LZMA compressed files |
Output Types
Email Parse Output
| Field | Description | Example |
|---|---|---|
| input | Full email | user+tag@gmail.com |
| local | Part before + or @ symbol | user |
| plus_address | Optional part between + and @ | tag |
| mail_server | Domain before TLD | gmail |
| domain | Top-level domain | com |
Example output:
{"user+tag@gmail.com":
{
"local": "user",
"plus_address": "tag",
"mail_server": "gmail",
"domain": "com"
}
}
email,local,plus_address,mail_server,domain
user+tag@gmail.com,user,tag,gmail,com
URL Parse Output
| Field | Description | Example |
|---|---|---|
| scheme | Protocol | https |
| subdomain | Domain prefix | www |
| second_level_domain | Main domain | example |
| top_level_domain | Domain suffix | com |
| port | Port number | 443 |
| path | URL path | blog/post |
| query | Query parameters | q=test |
| fragment | URL fragment | section1 |
Example output:
{"https://www.example.com:443/blog/post?q=test#section1":
{
"scheme": "https",
"subdomain": "www",
"second_level_domain": "example",
"top_level_domain": "com",
"port": "443",
"path": "blog/post",
"query": "q=test",
"fragment": "section1"
}
}
url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1
🚀 Installation
From PyPI
pip install pyrolysate
For Development
- Clone the repository
git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate
- Create and activate a virtual environment
# Using hatch (recommended)
hatch env create
# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate
- Install in development mode
# Using hatch
hatch run dev
# Or using pip
pip install -e .
Verify Installation
# Using hatch (recommended)
hatch run pyro -u example.com
# Or using the CLI directly
pyro -u example.com
The CLI command pyro will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.
Usage
Input File Parsing
from pyrolysate import file_to_list
Parse file with default newline delimiter
urls = file_to_list("urls.txt")
Parse file with custom delimiter
emails = file_to_list("emails.csv", delimiter=",")
Supported Outputs
- JSON (prettified or minified)
- CSV
- Text (default)
- File output with custom naming
- Console output
Email Parsing
from pyrolysate import email
Parse single email
result = email.parse_email("user@example.com")
Parse plus addressed email
result = email.parse_email("user+tag@example.com")
Parse multiple emails
emails = ["user1@example.com", "user2@agency.gov.uk"]
result = email.parse_email_array(emails)
Convert to JSON
json_output = email.to_json("user@example.com")
json_output = email.to_json(["user1@example.com", "user2@example.com"])
Save to JSON file
email.to_json_file("output", "user@example.com")
email.to_json_file("output", ["user1@example.com", "user2@test.org"])
Convert to CSV
csv_output = email.to_csv("user@example.com")
csv_output = email.to_csv(["user1@example.com", "user2@test.org"])
Save to CSV file
email.to_csv_file("output", "user@example.com")
email.to_csv_file("output", ["user1@example.com", "user2@test.org"])
URL Parsing
from pyrolysate import url
Parse single URL
result = url.parse_url("https://www.example.com/path?q=test#fragment")
Parse multiple URLs
urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)
Convert to JSON
json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])
Save to JSON file
url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])
Convert to CSV
csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])
Save to CSV file
url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])
Command Line Interface
CLI help
pyro -h
Parse single URL
pyro -u example.com
Parse multiple URLs
pyro -u example1.com example2.com
Parse URLs from file (one per line by default)
pyro -u -i urls.txt
Parse URLs from CSV file with comma delimiter
pyro -u -i urls.csv -d ","
Parse email with plus addressing
pyro -e user+newsletter@example.com
Parse multiple emails and save as JSON
pyro -e user1@example.com user2@example.com -j -o output
Parse URLs from file and save as CSV
pyro -u -i urls.txt -c -o parsed_urls
Parse emails from file with comma delimiter
pyro -e -i emails.txt -d "," -o output
Parse emails with non-prettified JSON output
pyro -e user@example.com -j -np
Parse different file types
# Parse log file
pyro -u -i server.log
# Parse compressed log file
pyro -u -i server.log.gz
# Parse BZIP2 compressed file
pyro -e -i emails.txt.bz2
# Parse ZIP archive containing logs and text files
pyro -u -i archive.zip
Supported Formats
Email Formats
- Standard:
example@mail.com - Plus Addresses:
example+tag@mail.com - Government:
example@agency.gov.uk
URL Formats
- Basic:
example.com - With subdomain:
www.example.com - With scheme:
https://example.org - With path:
example.com/path/to/file.txt - With port:
example.com:8080 - With query:
example.com/search?q=test - With fragment:
example.com#section1 - IP addresses:
192.168.1.1:8080 - Government domains:
agency.gov.uk - Full complex URLs:
https://www.example.gov.uk:8080/path?q=test#section1
Input File Support
- Plain text files (.txt)
- Plain text log files (.log)
- Comma-separated values (.csv)
- ZIP archives containing text files (.zip)
- GZIP compressed files (.gz)
- BZIP2 compressed files (.bz2)
- LZMA compressed files (.xz, .lzma)
ZIP Archive Support
- Processes all text files within the archive (.txt, .csv, .log)
- Handles nested directories
- Continues processing if some files are corrupted
- UTF-8 encoding expected for text files
Outputs
- Text file (default)
- JSON file (prettified or minified)
- CSV file
- Console output
[!IMPORTANT] This library handles email address comments by removing them from the final output
[!CAUTION]
- This library does not specially handle emails containing double quotes. Double quotes are valid in the local part of an email, but many modern email systems either block or mark emails with quotes as spam.
- Make sure that
requestsis installed before runningget_tlds_from_iana.
[!WARNING] This library is designed and tested to handle http and https urls. Other forms of url may return undefined results.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file pyrolysate-1.0.1.tar.gz.
File metadata
- Download URL: pyrolysate-1.0.1.tar.gz
- Upload date:
- Size: 21.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cdc2732a55c3870aea84645a0cb34776bdab97d29d01b99065a4cf8456e0a510
|
|
| MD5 |
1348ea653caedfdb694e839b92ddb0f7
|
|
| BLAKE2b-256 |
c534da00b3d05819a70ccdda3307543ddf2f0cbb930d1aee58b5d11fd0c49177
|
File details
Details for the file pyrolysate-1.0.1-py3-none-any.whl.
File metadata
- Download URL: pyrolysate-1.0.1-py3-none-any.whl
- Upload date:
- Size: 24.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.9.25
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
76934b3e4e995c9076ae1e6d01d46ae69a2fd518de0be21b1d45a691e15ded2e
|
|
| MD5 |
fca88f0e8ab1beab3cc379c2fb200893
|
|
| BLAKE2b-256 |
2ebb5866a9863b54e0345c330724e3f8662d0ee1101a23e11ad386fa8f6cd359
|