A parser that converts lists of email addresses and URLs into JSON, CSV, or plain-text strings or files.

Pyrolysate

Pyrolysate is a Python library and CLI tool for parsing and validating URLs and email addresses. It breaks URLs and email addresses down into their component parts, validates top-level domains against IANA's official TLD list, and outputs structured data in JSON, CSV, or text format.

The library offers both a programmer-friendly API and a command-line interface, making it suitable for both development integration and quick data processing tasks. It handles single entries or large datasets efficiently using Python's generator functionality, and provides flexible input/output options including file processing with custom delimiters.

Features

URL Parsing

  • Extract scheme, subdomain, domain, TLD, port, path, query, and fragment components
  • Support for complex URL patterns including ports, queries, and fragments
  • Support for IP addresses in URLs
  • Support for both direct input and file processing via CLI or API
  • Output as JSON, CSV, or text format through CLI or API

Email Parsing

  • Extract local, mail server, and domain components
  • Support for plus addressing (e.g., user+tag@domain.com)
  • Support for both direct input and file processing via CLI or API
  • Output as JSON, CSV, or text format through CLI or API

Top Level Domain Validation

  • Automatic updates from IANA's official TLD list
  • Local TLD file caching for offline use
  • Fallback to common TLDs if both online and local sources fail
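
The lookup order described above (online list, then local cache, then a built-in fallback) can be pictured as a chain of sources tried in turn. The following is only a sketch under stated assumptions: `load_tlds`, `COMMON_TLDS`, and the idea of passing loader callables are hypothetical and are not pyrolysate's API, and the library's actual fallback list is not shown in this document.

```python
from typing import Callable, Iterable, Sequence

# Hypothetical fallback list, for illustration only.
COMMON_TLDS = ("com", "org", "net", "edu", "gov")


def load_tlds(
    sources: Sequence[Callable[[], Iterable[str]]],
    fallback: Sequence[str] = COMMON_TLDS,
) -> list[str]:
    """Return TLDs from the first source that succeeds, else the fallback."""
    for source in sources:
        try:
            # Normalise entries and drop blanks so an empty result counts as a failure.
            tlds = [t.strip().lower() for t in source() if t.strip()]
        except Exception:  # e.g. network error or missing cache file
            continue
        if tlds:
            return tlds
    return list(fallback)
```

In practice the callables might wrap get_tlds_from_iana and get_tlds_from_local, in that order; whether those functions raise or return an empty value on failure is not documented here, which is why the sketch treats both cases as "try the next source".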

Flexible Input/Output

  • Process single or multiple entries
  • Support for government domain emails (.gov.tld)
  • Custom delimiters for file input
  • Multiple output formats (JSON, CSV, text), with text (.txt) as the default
  • Pretty-printed or minified JSON output
  • Console output or file saving options
  • Memory-efficient processing of large datasets using Python generators
  • Support for compressed input files:
    • ZIP archives (processes all text files within .zip)
    • GZIP (.gz)
    • BZIP2 (.bz2)
    • LZMA (.xz, .lzma)
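
For the single-stream compressed formats listed above, the Python standard library already provides text-mode openers, so choosing one by file extension is enough to read them like ordinary text files. This is a rough sketch of that idea, not the library's implementation; `open_text` and `_OPENERS` are hypothetical names.

```python
import bz2
import gzip
import lzma
from pathlib import Path
from typing import IO

# Suffix -> stdlib opener. Each accepts mode="rt" and an encoding.
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
    ".lzma": lzma.open,
}


def open_text(path: str) -> IO[str]:
    """Open a plain or single-stream compressed file as UTF-8 text."""
    # Unknown suffixes (.txt, .log, .csv, ...) fall back to the built-in open().
    opener = _OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode="rt", encoding="utf-8")
```

Iterating over the returned handle line by line keeps memory use flat, because the decompressors stream data rather than loading the whole file. ZIP archives are multi-file containers and need separate handling.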

Developer Friendly

  • Type hints for better IDE support
  • Comprehensive docstrings
  • Modular design for easy integration
  • Command-line interface for quick testing

API Reference

Email Class

| Method | Parameters | Description |
|---|---|---|
| parse_email(email_str) | email_str: str | Parses single email address |
| parse_email_array(emails) | emails: list[str] | Parses list of email addresses |
| to_json(emails, prettify=True) | emails: str\|list[str], prettify: bool | Converts to JSON format |
| to_json_file(file_name, emails, prettify=True) | file_name: str, emails: list[str], prettify: bool | Converts and saves JSON to file |
| to_csv(emails) | emails: str\|list[str] | Converts to CSV format |
| to_csv_file(file_name, emails) | file_name: str, emails: list[str] | Converts and saves CSV to file |

URL Class

| Method | Parameters | Description |
|---|---|---|
| parse_url(url_str, tlds=[]) | url_str: str, tlds: list[str] | Parses single URL |
| parse_url_array(urls, tlds=[]) | urls: list[str], tlds: list[str] | Parses list of URLs |
| to_json(urls, prettify=True) | urls: str\|list[str], prettify: bool | Converts to JSON format |
| to_json_file(file_name, urls, prettify=True) | file_name: str, urls: list[str], prettify: bool | Converts and saves JSON to file |
| to_csv(urls) | urls: str\|list[str] | Converts to CSV format |
| to_csv_file(file_name, urls) | file_name: str, urls: list[str] | Converts and saves CSV to file |

Miscellaneous

| Method | Parameters | Description |
|---|---|---|
| file_to_list(input_file_name, delimiter='\n') | input_file_name: str, delimiter: str | Parses input file into a Python list by delimiter |
| get_tlds_from_iana | | Fetches latest top-level domains from IANA |
| get_tlds_from_local | path_to_tlds_file: str | Fetches TLDs from a local file. Defaults to the project's local file if no path is specified |

CLI Reference

| Argument | Type | Value when argument is omitted | Description |
|---|---|---|---|
| target | str | None | Email or URL string(s) to process |
| -u, --url | flag | False | Specify URL input |
| -e, --email | flag | False | Specify Email input |
| -i, --input_file | str | None | Input file name with extension |
| -o, --output_file | str | None | Output file name without extension |
| -c, --csv | flag | False | Save output as CSV format |
| -j, --json | flag | False | Save output as JSON format |
| -np, --no_prettify | flag | False | Turn off prettified JSON output |
| -d, --delimiter | str | '\n' | Delimiter for input file parsing |
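
To make the argument table concrete, here is how an interface with these flags could be declared with argparse. It is a sketch that mirrors the table only; the real pyro entry point may be built differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the arguments listed in the CLI reference table."""
    parser = argparse.ArgumentParser(prog="pyro")
    # Zero or more email/URL strings given directly on the command line.
    parser.add_argument("target", nargs="*", help="Email or URL string(s) to process")
    parser.add_argument("-u", "--url", action="store_true", help="Specify URL input")
    parser.add_argument("-e", "--email", action="store_true", help="Specify Email input")
    parser.add_argument("-i", "--input_file", help="Input file name with extension")
    parser.add_argument("-o", "--output_file", help="Output file name without extension")
    parser.add_argument("-c", "--csv", action="store_true", help="Save output as CSV")
    parser.add_argument("-j", "--json", action="store_true", help="Save output as JSON")
    parser.add_argument("-np", "--no_prettify", action="store_true",
                        help="Turn off prettified JSON output")
    parser.add_argument("-d", "--delimiter", default="\n",
                        help="Delimiter for input file parsing")
    return parser
```

For example, `build_parser().parse_args(["-e", "user@example.com", "-j", "-np"])` yields a namespace with email, json, and no_prettify set to True and target set to a one-element list.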

Input File Support

| Format | Extension | Description |
|---|---|---|
| Text | .txt | Plain text files |
| Log | .log | Plain text log files |
| CSV | .csv | Comma-separated values |
| ZIP | .zip | Archives containing text files |
| GZIP | .gz | GZIP compressed files |
| BZIP2 | .bz2 | BZIP2 compressed files |
| LZMA | .xz, .lzma | LZMA compressed files |

Output Types

Email Parse Output

| Field | Description | Example |
|---|---|---|
| input | Full email | user+tag@gmail.com |
| local | Part before + or @ symbol | user |
| plus_address | Optional part between + and @ | tag |
| mail_server | Domain before TLD | gmail |
| domain | Top-level domain | com |

Example output (JSON):

{"user+tag@gmail.com":
    {
    "local": "user",
    "plus_address": "tag",
    "mail_server": "gmail",
    "domain": "com"
    }
}

Example output (CSV):

email,local,plus_address,mail_server,domain
user+tag@gmail.com,user,tag,gmail,com
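
The email field breakdown above can be reproduced with plain string partitioning. This is a simplified illustration of the field definitions, not pyrolysate's parser: `split_email` is a hypothetical helper, it does no TLD validation, and how the library reports multi-label domains such as gov.uk is not shown in this document.

```python
def split_email(address: str) -> dict[str, str]:
    """Split an address into local, plus_address, mail_server, and domain."""
    # Everything after the last "@" is the host; everything before is the local part.
    local_part, _, host = address.rpartition("@")
    if not local_part or "." not in host:
        raise ValueError(f"not a parsable email address: {address!r}")
    # Optional plus addressing: user+tag -> ("user", "tag").
    local, _, plus_address = local_part.partition("+")
    # Assumption: the first label is the mail server, the rest is the domain.
    mail_server, _, domain = host.partition(".")
    return {
        "local": local,
        "plus_address": plus_address,
        "mail_server": mail_server,
        "domain": domain,
    }
```

Applied to user+tag@gmail.com, this returns the same four values as the example output above.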

URL Parse Output

| Field | Description | Example |
|---|---|---|
| scheme | Protocol | https |
| subdomain | Domain prefix | www |
| second_level_domain | Main domain | example |
| top_level_domain | Domain suffix | com |
| port | Port number | 443 |
| path | URL path | blog/post |
| query | Query parameters | q=test |
| fragment | URL fragment | section1 |

Example output (JSON):

{"https://www.example.com:443/blog/post?q=test#section1":
    {
    "scheme": "https",
    "subdomain": "www",
    "second_level_domain": "example",
    "top_level_domain": "com",
    "port": "443",
    "path": "blog/post",
    "query": "q=test",
    "fragment": "section1"
    }
}

Example output (CSV):

url,scheme,subdomain,second_level_domain,top_level_domain,port,path,query,fragment
https://www.example.com:443/blog/post?q=test#section1,https,www,example,com,443,blog/post,q=test,section1
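
The same URL field breakdown can be approximated with the standard library's urllib.parse.urlsplit. This sketch only illustrates the field definitions; `split_url` is a hypothetical helper, it treats the last host label as the TLD without consulting IANA's list, and it would mis-split IP address hosts, all of which the library handles in its own way.

```python
from urllib.parse import urlsplit


def split_url(text: str) -> dict[str, str]:
    """Split a URL into the fields listed in the table above (simplified)."""
    # urlsplit only recognises a host after "//", so add it for bare inputs.
    parts = urlsplit(text if "://" in text else "//" + text)
    labels = (parts.hostname or "").split(".")
    has_tld = len(labels) > 1
    return {
        "scheme": parts.scheme,
        "subdomain": ".".join(labels[:-2]),
        "second_level_domain": labels[-2] if has_tld else labels[0],
        "top_level_domain": labels[-1] if has_tld else "",
        "port": str(parts.port) if parts.port is not None else "",
        "path": parts.path.lstrip("/"),
        "query": parts.query,
        "fragment": parts.fragment,
    }
```

For https://www.example.com:443/blog/post?q=test#section1 this produces the same eight values as the example output above.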

🚀 Installation

From PyPI

pip install pyrolysate

For Development

  1. Clone the repository
git clone https://github.com/dawnandrew100/pyrolysate.git
cd pyrolysate
  2. Create and activate a virtual environment
# Using hatch (recommended)
hatch env create

# Or using venv
python -m venv .venv
# Windows
.venv\Scripts\activate
# Unix/MacOS
source .venv/bin/activate
  3. Install in development mode
# Using hatch
hatch run dev

# Or using pip
pip install -e .

Verify Installation

# Using hatch (recommended)
hatch run pyro -u example.com

# Or using the CLI directly
pyro -u example.com

The CLI command pyro will be available after installation. If the command isn't found, ensure Python's Scripts directory is in your PATH.
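
If it is unclear whether the command is reachable, a short Python check can report where pyro resolves and which scripts directory the current interpreter installs into. This is a generic diagnostic sketch using only the standard library; `locate_pyro` is a hypothetical helper and not part of the package.

```python
import shutil
import sysconfig


def locate_pyro() -> str:
    """Report where the pyro command resolves, or hint at the scripts directory."""
    found = shutil.which("pyro")
    if found:
        return f"pyro found at {found}"
    # sysconfig reports where this interpreter installs console scripts.
    scripts_dir = sysconfig.get_path("scripts")
    return f"pyro not on PATH; check that {scripts_dir} is listed in PATH"


print(locate_pyro())
```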

Usage

Input File Parsing

from pyrolysate import file_to_list

Parse file with default newline delimiter

urls = file_to_list("urls.txt")

Parse file with custom delimiter

emails = file_to_list("emails.csv", delimiter=",")
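
To illustrate what delimiter-based splitting involves, here is a lazy variant that yields entries one at a time instead of building a full list. It is a sketch under assumptions, not the implementation of file_to_list: `iter_entries` and its `chunk_size` parameter are hypothetical, and whether the library strips whitespace or skips empty entries is not stated in this document.

```python
from typing import Iterator


def iter_entries(path: str, delimiter: str = "\n", chunk_size: int = 65536) -> Iterator[str]:
    """Yield non-empty, stripped entries from a UTF-8 text file."""
    buffer = ""
    with open(path, encoding="utf-8") as handle:
        while chunk := handle.read(chunk_size):
            buffer += chunk
            # Everything before the last delimiter is complete; keep the tail.
            *complete, buffer = buffer.split(delimiter)
            for entry in complete:
                if entry.strip():
                    yield entry.strip()
    # Whatever remains after the final delimiter is the last entry.
    if buffer.strip():
        yield buffer.strip()
```

Because only one chunk plus an unfinished tail is held in memory, this approach scales to files much larger than available RAM; `list(iter_entries("emails.csv", delimiter=","))` gives the eager equivalent.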

Supported Outputs

  • JSON (prettified or minified)
  • CSV
  • Text (default)
  • File output with custom naming
  • Console output

Email Parsing

from pyrolysate import email

Parse single email

result = email.parse_email("user@example.com")

Parse plus addressed email

result = email.parse_email("user+tag@example.com")

Parse multiple emails

emails = ["user1@example.com", "user2@agency.gov.uk"]
result = email.parse_email_array(emails)

Convert to JSON

json_output = email.to_json("user@example.com")
json_output = email.to_json(["user1@example.com", "user2@example.com"])
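
The difference between prettified and minified JSON can be seen with the standard json module. The record below is copied from the Email Parse Output example; the indent width and separators are assumptions for illustration, since the library's exact formatting settings are not documented here.

```python
import json

record = {
    "user+tag@gmail.com": {
        "local": "user",
        "plus_address": "tag",
        "mail_server": "gmail",
        "domain": "com",
    }
}

# Prettified: one key per line, indented for readability.
pretty = json.dumps(record, indent=4)

# Minified: no newlines and no spaces after separators.
minified = json.dumps(record, separators=(",", ":"))

print(pretty)
print(minified)
```

Both strings decode to the same data; the minified form is simply smaller, which matters mostly for large batches.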

Save to JSON file

email.to_json_file("output", "user@example.com")
email.to_json_file("output", ["user1@example.com", "user2@test.org"])

Convert to CSV

csv_output = email.to_csv("user@example.com")
csv_output = email.to_csv(["user1@example.com", "user2@test.org"])
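
For reference, the CSV layout shown in the Email Parse Output section can be produced with the standard csv module. This is a sketch of the output shape only; `rows_to_csv` and `FIELDS` are hypothetical names, and the library's own writer may differ in line endings or quoting.

```python
import csv
import io

# Column order taken from the CSV example in the Email Parse Output section.
FIELDS = ["email", "local", "plus_address", "mail_server", "domain"]


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialise parsed email records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Using csv.DictWriter rather than joining strings by hand means values that contain commas or quotes are escaped correctly.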

Save to CSV file

email.to_csv_file("output", "user@example.com")
email.to_csv_file("output", ["user1@example.com", "user2@test.org"])

URL Parsing

from pyrolysate import url

Parse single URL

result = url.parse_url("https://www.example.com/path?q=test#fragment")

Parse multiple URLs

urls = ["example.com", "https://www.test.org"]
result = url.parse_url_array(urls)

Convert to JSON

json_output = url.to_json("example.com")
json_output = url.to_json(["example.com", "test.org"])

Save to JSON file

url.to_json_file("output", "example.com")
url.to_json_file("output", ["example.com", "test.org"])

Convert to CSV

csv_output = url.to_csv("example.com")
csv_output = url.to_csv(["example.com", "test.org"])

Save to CSV file

url.to_csv_file("output", "example.com")
url.to_csv_file("output", ["example.com", "test.org"])

Command Line Interface

CLI help

pyro -h

Parse single URL

pyro -u example.com

Parse multiple URLs

pyro -u example1.com example2.com

Parse URLs from file (one per line by default)

pyro -u -i urls.txt

Parse URLs from CSV file with comma delimiter

pyro -u -i urls.csv -d ","

Parse email with plus addressing

pyro -e user+newsletter@example.com

Parse multiple emails and save as JSON

pyro -e user1@example.com user2@example.com -j -o output

Parse URLs from file and save as CSV

pyro -u -i urls.txt -c -o parsed_urls

Parse emails from file with comma delimiter

pyro -e -i emails.txt -d "," -o output

Parse emails with non-prettified JSON output

pyro -e user@example.com -j -np

Parse different file types

# Parse log file
pyro -u -i server.log

# Parse compressed log file
pyro -u -i server.log.gz

# Parse BZIP2 compressed file
pyro -e -i emails.txt.bz2

# Parse ZIP archive containing logs and text files
pyro -u -i archive.zip

Supported Formats

Email Formats

  • Standard: example@mail.com
  • Plus Addresses: example+tag@mail.com
  • Government: example@agency.gov.uk

URL Formats

  • Basic: example.com
  • With subdomain: www.example.com
  • With scheme: https://example.org
  • With path: example.com/path/to/file.txt
  • With port: example.com:8080
  • With query: example.com/search?q=test
  • With fragment: example.com#section1
  • IP addresses: 192.168.1.1:8080
  • Government domains: agency.gov.uk
  • Full complex URLs: https://www.example.gov.uk:8080/path?q=test#section1
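
IP address hosts need different treatment from domain names, because splitting them on dots into subdomain, domain, and TLD would be meaningless. One way to tell the two apart, sketched with the standard library (`host_is_ip` is a hypothetical helper, and this is not necessarily how pyrolysate detects IP addresses):

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(text: str) -> bool:
    """Return True if the host portion of the input is an IPv4 or IPv6 address."""
    # Prefix "//" so urlsplit treats scheme-less input as host[:port][/path].
    host = urlsplit(text if "://" in text else "//" + text).hostname or ""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True
```

With the formats listed above, `host_is_ip("192.168.1.1:8080")` is True, while `host_is_ip("agency.gov.uk")` is False.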

Input File Support

  • Plain text files (.txt)
  • Plain text log files (.log)
  • Comma-separated values (.csv)
  • ZIP archives containing text files (.zip)
  • GZIP compressed files (.gz)
  • BZIP2 compressed files (.bz2)
  • LZMA compressed files (.xz, .lzma)

ZIP Archive Support

  • Processes all text files within the archive (.txt, .csv, .log)
  • Handles nested directories
  • Continues processing if some files are corrupted
  • UTF-8 encoding expected for text files
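
The archive behaviour listed above (text files only, nested directories, skipping unreadable members, UTF-8 decoding) maps naturally onto the standard zipfile module. The following is a sketch of that behaviour, not the library's code; `iter_zip_text` and `TEXT_SUFFIXES` are hypothetical names.

```python
import zipfile
import zlib
from typing import Iterator

TEXT_SUFFIXES = (".txt", ".csv", ".log")


def iter_zip_text(archive) -> Iterator[tuple[str, str]]:
    """Yield (member name, decoded text) for each readable text file."""
    with zipfile.ZipFile(archive) as zf:
        # namelist() includes members in nested directories, e.g. "logs/a.log".
        for name in zf.namelist():
            if not name.lower().endswith(TEXT_SUFFIXES):
                continue
            try:
                text = zf.read(name).decode("utf-8")
            except (zipfile.BadZipFile, zlib.error, UnicodeDecodeError, OSError):
                continue  # skip corrupted or non-UTF-8 members and keep going
            yield name, text
```

`archive` can be a path or a binary file-like object, as accepted by zipfile.ZipFile.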

Outputs

  • Text file (default)
  • JSON file (prettified or minified)
  • CSV file
  • Console output

Important: This library handles email address comments by removing them from the final output.
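
Email address comments are the parenthesised segments that the address syntax allows, as in john.smith(comment)@example.com. A rough sketch of stripping them with a regular expression follows; `strip_comments` is a hypothetical helper, the library's actual approach is not shown here, and this version does not account for parentheses inside quoted local parts.

```python
import re

# Matches an innermost parenthesised group (no nested parentheses inside).
_COMMENT = re.compile(r"\([^()]*\)")


def strip_comments(address: str) -> str:
    """Remove parenthesised comments, including nested ones."""
    previous = None
    # Repeat until nothing changes so nested comments are peeled layer by layer.
    while previous != address:
        previous, address = address, _COMMENT.sub("", address)
    return address
```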

Caution:

  • This library does not specially handle emails containing double quotes. Double quotes are valid in the local part of an email address, but many modern email systems either block such emails or mark them as spam.
  • Make sure that requests is installed before running get_tlds_from_iana.

Warning: This library is designed and tested to handle HTTP and HTTPS URLs. Other URL schemes may produce undefined results.
