Deckard 🕵️♂️
Extract structured data from unstructured text — no AI, just regular expressions. 🔍
Deckard is a library of regular-expression patterns for extracting structured data (addresses, phone numbers, email addresses, etc.) and a small set of helper utilities that make using those patterns easier.
Status: very early-stage project. Right now the repository contains mostly patterns for Poland. I am looking for contributors from around the world 🌍 — address formats, phone-number formats and other data representations differ by country, so the goal is to gather country-specific patterns for many regions.
Key features ✨
- 🗂️ A collection of ready-to-use regex patterns organized by country (for example deckard/patterns/pl.py).
- 📦 Universal patterns (e.g. email) live in deckard/patterns/standard.py.
- 🛠️ A small helper function, deckard.search, that combines multiple patterns and returns named-group matches (deckard/main.py).
Installation ⚙️
From PyPI:
pip install deckard
Editable / local development install:
pip install -e .
For contributors — install dependencies with Poetry 🧑💻
This project uses Poetry to manage its runtime and development dependencies.
- Install Poetry (see https://python-poetry.org for instructions).
- From the project root run:
poetry install
This will create a virtual environment and install runtime and development dependencies (including pytest).
To run tests using Poetry:
poetry run pytest
Or start a shell in the created virtualenv and run tests directly:
poetry shell
pytest
Quick usage 🧭
Example using the current public API:
from deckard import search
from deckard.patterns import standard, pl

text = (
    "Hello, my email is spaceshaman@tuta.io and my phone number is "
    "+48 792 321 321 and my address is ul. Tesotowa 12/6A, 66-700 Bielsko-Biała."
)

result = search([standard.EMAIL, pl.MOBILE_PHONE, pl.ADDRESS], text)

# result.groupdict() will return a dict of named groups, for example:
# {
#     'email': 'spaceshaman@tuta.io',
#     'mobile_phone': '792 321 321',
#     'street': 'ul. Tesotowa',
#     'building': '12',
#     'apartment': '6A',
#     'zip_code': '66-700',
#     'city': 'Bielsko-Biała'
# }
The search helper composes the provided patterns into a single regex (using lookaheads) and returns the first match as a regex.Match object (or None if nothing matched).
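To make the lookahead idea concrete, here is a minimal sketch using only the standard library re module. It is not Deckard's actual implementation: the function name search_all and both patterns are hypothetical stand-ins, and the sketch assumes every supplied pattern must be present for a match.

```python
import re

# Hypothetical stand-in patterns for illustration; NOT Deckard's real patterns.
EMAIL = r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
ZIP_CODE = r"(?P<zip_code>\d{2}-\d{3})"


def search_all(patterns, text):
    """Combine patterns so each one may appear anywhere in the text, in any order."""
    # Each pattern is wrapped in a lookahead that scans forward from the same
    # position. Lookaheads consume no input, so the patterns are located
    # independently of one another, yet their named groups are still captured.
    combined = "".join(f"(?=.*?{pattern})" for pattern in patterns)
    return re.search(combined, text, flags=re.DOTALL)


match = search_all([EMAIL, ZIP_CODE], "Mail me at jan@example.com, 66-700 Bielsko-Biała.")
print(match.groupdict() if match else None)
```

Because every lookahead is mandatory in this sketch, a text that lacks any one of the patterns yields None. Whether Deckard treats missing patterns the same way is not stated here, so check deckard/main.py before relying on that behavior.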
Repository layout
- deckard/: library code
  - deckard/main.py: helper search function
  - deckard/patterns/standard.py: universal patterns (e.g. EMAIL)
  - deckard/patterns/pl.py: Poland-specific patterns (address, postal code, phone, etc.)
- tests/: unit tests
Examples of existing tests:
- tests/test_standard_patterns.py: test for standard.EMAIL
- tests/test_search_with_multiple_patterns.py: integration tests combining standard.EMAIL with patterns from pl.py
- tests/pl/test_search_address_pl.py: tests for Polish address patterns
Every new pattern must come with tests. Pull requests without tests will not be accepted.
Contributing — how to add new patterns
- Create a new file under deckard/patterns/ named by the country code, e.g. us.py, de.py, fr.py.
- Define constants (UPPERCASE) for each pattern, for example MOBILE_PHONE, ADDRESS, ZIP_CODE.
- Add tests under tests/. Use the existing Polish tests (e.g. tests/test_search_with_multiple_patterns.py) as a template. Provide normal and edge-case examples.
- In the PR description explain local rules (phone number format, postal code format, common street abbreviations, etc.).
- PRs without tests will not be accepted.
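As a rough illustration of the steps above, the sketch below shows what a new country module and its tests might look like. Everything in it is hypothetical: Deckard does not necessarily ship a us.py, and the ZIP_CODE pattern is a simplified example. The tests call re.search directly so the snippet runs on its own; real tests would more likely go through deckard.search, as the existing Polish tests do.

```python
import re

# Hypothetical contents of deckard/patterns/us.py (illustration only).
# US ZIP code: five digits, optionally followed by a hyphen and four digits (ZIP+4).
ZIP_CODE = r"(?P<zip_code>\b\d{5}(?:-\d{4})?\b)"


# Hypothetical contents of a test file under tests/ (pytest style).
def test_zip_code_plain():
    match = re.search(ZIP_CODE, "Send it to Springfield, IL 62704.")
    assert match is not None
    assert match.groupdict() == {"zip_code": "62704"}


def test_zip_code_plus_four():
    match = re.search(ZIP_CODE, "ZIP: 62704-1234")
    assert match is not None
    assert match.group("zip_code") == "62704-1234"


def test_zip_code_rejects_longer_digit_runs():
    # Edge case: a six-digit number must not be read as a ZIP code.
    assert re.search(ZIP_CODE, "Order 627041 shipped") is None
```

Covering at least one edge case per pattern, like the six-digit run here, helps reviewers who are unfamiliar with the local format see where the pattern's boundaries are.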
Tips 💡:
- 🧾 Use clear, consistent named groups in regexes (?P<name>) so groupdict() returns a predictable structure.
- 📝 Document complex patterns with comments and example inputs if necessary.
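One way to follow both tips is to write the pattern in verbose mode, where whitespace is ignored and inline comments are allowed. The pattern below is a hypothetical example, not one taken from Deckard, and it assumes the pattern is compiled with re.VERBOSE; whether Deckard's search helper passes that flag is not stated here.

```python
import re

# Hypothetical pattern for illustration; not taken from Deckard.
# Example inputs: "66-700", "00-001"
ZIP_CODE = r"""
    (?P<zip_code>   # named group, becomes a key in groupdict()
        \b\d{2}     # two leading digits
        -           # literal hyphen
        \d{3}\b     # three trailing digits
    )
"""

match = re.search(ZIP_CODE, "66-700 Bielsko-Biała", flags=re.VERBOSE)
print(match.groupdict())  # {'zip_code': '66-700'}
```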
Discussion and roadmap 🚧
The project is not yet final — everything is open for discussion. Areas for contributors and discussion include:
- 📋 Defining a minimal set of patterns every country should provide (email, phone, address, postal code, national ID where applicable).
- 🔠 Standardizing group names (street, building, apartment, zip_code, city, country, mobile_phone, etc.).
- ⚖️ Tools for validation and normalization of extracted values.
- 🤖 Automating tests with sample documents in various languages.
If you want to help, open an issue or a PR — a short description of the local data format and one or two patterns with tests is a great place to start.
License 📄
This project is licensed under the MIT License. See the LICENSE file for the full text.
Thanks for your interest — please join the effort. Together we can build an international library of patterns to extract structured data from arbitrary text using robust regular expressions. 🚀