Extract structured data from FEK PDFs.

fek-extractor

Extract structured data from Greek Government Gazette (ΦΕΚ) PDFs.

It turns messy, two‑column government PDFs into machine‑readable JSON/CSV with FEK metadata and a clean map of Άρθρα (articles). Built on pdfminer.six, with careful two‑column handling, header/footer filtering, Greek‑aware de‑hyphenation, and article detection.

About this project

Greek Government Gazette (ΦΕΚ) documents look uniform at a glance, but their typesetting and structure are anything but. Even “clean” digital PDFs hide quirks that trip up generic parsers and off‑the‑shelf AI:

- Multi‑column reading order with full‑width “tails”, footers, and boilerplate that disrupt token flow.
- Title vs. body separation where headings, subtitles, and continuations interleave across pages.
- Dense legal cross‑references and amendments, with nested exceptions and renumbered clauses.
- Inconsistent numbering and metadata, plus occasional encoding artifacts and discretionary hyphens.

This project addresses those realities with a layout‑aware, domain‑specific pipeline that prioritizes determinism and inspectability:

- Layout‑aware text reconstruction: two‑column segmentation (k‑means + gutter valley), “tail” detection, header/footer filtering, and stable reading order.
- Article‑structure recovery: detects Άρθρο N, associates titles and bodies across page boundaries, and synthesizes a hierarchical TOC when possible.
- Greek‑aware normalization: de‑hyphenates safely (soft/discretionary hyphens, wrapped words) while preserving accents/case.
- Domain heuristics + light NLP hooks: FEK masthead parsing (series/issue/date), decision numbers, and simple patterns for subject/Θέμα; extension points for NER and reference extraction.
- Transparent debugging: page‑focused debug mode and optional metrics so you can see why a page parsed a certain way.
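
To make the "k‑means + gutter valley" idea above more concrete, here is a minimal, dependency‑free sketch. It is not the project's implementation: the function names, the `(x0, x1, y_top, text)` line tuples, and the thresholds (`bins`, `max_fill`, the 20% separation check) are all assumptions chosen for illustration. It clusters the left edges of text lines into two groups, looks for the emptiest vertical strip between the two cluster centres, and then sorts lines by column, top to bottom, and x.

```python
from statistics import mean


def kmeans_1d_two(xs, iters=20):
    """Plain 1-D k-means with k=2; returns the (left, right) cluster centres."""
    c_left, c_right = min(xs), max(xs)
    for _ in range(iters):
        left = [x for x in xs if abs(x - c_left) <= abs(x - c_right)]
        right = [x for x in xs if abs(x - c_left) > abs(x - c_right)]
        if not left or not right:
            break  # all points fell into one cluster
        new_left, new_right = mean(left), mean(right)
        if (new_left, new_right) == (c_left, c_right):
            break  # converged
        c_left, c_right = new_left, new_right
    return c_left, c_right


def find_gutter(boxes, page_width, bins=100, max_fill=0.1):
    """Return the x position of the column gutter, or None for single-column pages.

    boxes: list of (x0, x1) horizontal extents of text lines (assumed input shape).
    """
    if len(boxes) < 2:
        return None
    c_left, c_right = kmeans_1d_two([x0 for x0, _ in boxes])
    if c_right - c_left < 0.2 * page_width:
        return None  # left edges do not form two distinct groups
    bin_w = page_width / bins
    coverage = [0] * bins  # how many lines cover each vertical strip
    for x0, x1 in boxes:
        first = max(0, int(x0 // bin_w))
        last = min(bins - 1, int(x1 // bin_w))
        for b in range(first, last + 1):
            coverage[b] += 1
    start, stop = int(c_left // bin_w), int(c_right // bin_w)
    valley = min(range(start, stop + 1), key=coverage.__getitem__)
    if coverage[valley] > max_fill * len(boxes):
        return None  # no sufficiently empty strip between the clusters
    return (valley + 0.5) * bin_w


def reading_order(lines, page_width):
    """Sort (x0, x1, y_top, text) lines by column, then top to bottom, then x."""
    gutter = find_gutter([(x0, x1) for x0, x1, _, _ in lines], page_width)

    def key(line):
        x0, _, y_top, _ = line
        column = 0 if gutter is None or x0 < gutter else 1
        return (column, -y_top, x0)  # PDF y grows upward, so negate for top-first

    return [text for *_, text in sorted(lines, key=key)]
```

The `max_fill` tolerance lets a few full‑width lines (the "tails") cross the gutter without hiding it; a real pipeline would classify and demote those lines separately.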

Who it’s for: legal‑tech teams, data engineers, and researchers who need reproducible, explainable FEK extraction that won’t crumble on edge cases. Outcome: structured, searchable, dependable data for automation, analysis, and integration.

If your team needs tailored FEK pipelines or additional NLP components, AspectSoft can help.

Features

FEK-aware text extraction
- Two-column segmentation via k-means over x-coordinates with a gutter-valley heuristic.
- Per-page region classification & demotion — header, footer, column body, full-width tail, noise.
- “Tail” detection for full-width content (signatures, appendices, tables) below the columns.
- Header/footer cleanup tuned to FEK mastheads and page furniture.
- Deterministic reading order (by column → y → x); graceful single-column fallback.
Greek de-hyphenation
- Removes soft/discretionary hyphens (U+00AD) and stitches wrapped words safely.
- Preserves accents/case; conservative rules to avoid over-merging.
- Handles common typography patterns (e.g., hyphen + space breaks).
Header parsing
- Extracts FEK series (Α/Β/… including word→letter normalization), issue number, and date in both DD.MM.YYYY and ISO YYYY-MM-DD.
- Best-effort detection of decision numbers (e.g., “Αριθ.”).
- Tolerant to spacing/diacritic/punctuation variants.
Article detection
- Recognizes Άρθρο N (including letter suffixes like 14Α) and captures title + body.
- Stitches articles across page boundaries; keeps original and normalized numbering.
- Produces a structured articles map for direct programmatic use.
TOC synthesis (optional)
- Builds a hierarchical TOC where present: ΜΕΡΟΣ → ΤΙΤΛΟΣ → ΚΕΦΑΛΑΙΟ → ΤΜΗΜΑ → Άρθρα.
- Emits clean JSON for navigation, QA, or UI rendering.
Metrics (opt-in via --include-metrics)
- Lengths & counts (characters, words, lines) and median line length.
- Top words, character histogram, and pluggable regex matches (e.g., FEK citations, “Θέμα:”).
CLI & Python API
- CLI: single file or directory recursion, JSON/CSV output, --jobs N for parallel processing, focused logging via --debug [PAGE].
- API: extract_pdf_info(path, include_metrics=True|False, ...) returns a ready-to-use record.
Typed codebase & tests
- Typed package (PEP 561), lint/format (ruff, black), type checks (mypy), and tests (pytest).
- Clear module boundaries (io/, parsing/, metrics.py, cli.py, core.py).

With this mix, FEK PDFs become consistent, navigable JSON/CSV with reliable metadata and article structure—ready for indexing, analytics, and automation.
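
The Greek de‑hyphenation rules listed under Features can be sketched roughly as follows. This is an illustrative approximation, not the library's code: the function name `dehyphenate` and the exact "safe to merge" rule (letters on both sides of the break, lowercase continuation) are assumptions. The text is normalized to NFC first so that accented letters are single code points; accents and case are otherwise left untouched.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"
# Soft hyphen at a line break: always a discretionary break, so stitch the word.
_SOFT_WRAP = re.compile("\u00ad[ \t]*\n[ \t]*")
# Ordinary hyphen at a line break (optionally followed by spaces) between word characters.
_HARD_WRAP = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")


def _join_if_safe(match):
    """Drop the hyphen and line break only when both neighbours look like one word."""
    before = match.string[match.start() - 1]
    after = match.string[match.end()]
    if before.isalpha() and after.isalpha() and after.islower():
        return ""  # wrapped word: merge the two halves
    return match.group(0)  # e.g. "Α-\nΒ" or "2020-\n2021": keep as written


def dehyphenate(text):
    text = unicodedata.normalize("NFC", text)
    text = _SOFT_WRAP.sub("", text)
    text = text.replace(SOFT_HYPHEN, "")  # remaining in-line soft hyphens
    return _HARD_WRAP.sub(_join_if_safe, text)
```

Being conservative here means some genuinely wrapped words stay split (for example when the continuation starts with a capital), which is usually preferable to merging compounds or number ranges by mistake.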

✨ Sample PDF for testing ships in data/samples/gr-act-2020-4706-4706_2020.pdf.

Table of contents

- Demo & screenshots
- Requirements
- Install
- Quickstart
- CLI usage
- Python API
- Output schema
- Technical deep dive
- Architecture
- Examples
- Debug helpers
- Performance tips
- Project layout
- Development
- Contributing
- License
- Contact

Demo & screenshots

Screenshot: FEK extractor, debug view (4514/2018, page 12)

Requirements

Python 3.10+
OS: Linux, macOS, or Windows
Runtime dependency: pdfminer.six

Install

From PyPI

pip install fek-extractor

From source (editable)

git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv
source .venv/bin/activate # Windows: .\.venv\Scripts\Activate.ps1
pip install -U pip
pip install -e . # library + CLI
# or full dev setup
pip install -e ".[dev]"
pre-commit install

With pipx (isolated CLI)

pipx install fek-extractor

Docker (no local Python needed)

docker run --rm -v "$PWD:/work" -w /work python:3.11-slim bash -lc "pip install fek-extractor && fek-extractor -i data/samples -o out.json"

Quickstart

# JSON (default)
fek-extractor -i data/samples -o out.json -f json

# CSV
fek-extractor -i data/samples -o out.csv -f csv

# As a module (equivalent to the CLI)
python -m fek_extractor -i data/samples -o out.json

CLI usage

usage: fek-extractor [-h] --input INPUT [--out OUT] [--format {json,csv}]
 [--no-recursive] [--debug [PAGE]] [--jobs JOBS]
 [--include-metrics] [--articles-only] [--toc-only]

Extract structured info from FEK/Greek-law PDFs.

Options

-i, --input PATH (required) — PDF file or directory.
-o, --out PATH (default: out.json) — Output path.
-f, --format {json,csv} (default: json) — Output format.
--no-recursive — When --input is a directory, do not recurse.
--debug [PAGE] — Enable debug logging; optionally pass a page number (e.g. --debug 39) to focus per‑page debug.
--jobs JOBS — Parallel workers when input is a folder (default 1).
--include-metrics — Add metrics into each record (see below).
--articles-only — Emit only the articles map as JSON (ignores -f csv).
--toc-only — Emit only the synthesized Table of Contents as JSON.
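
For readers who want to see how the options above fit together, the documented usage can be mirrored with `argparse` as in the sketch below. This is a reconstruction from the usage text only, not the project's actual `cli.py`; in particular, representing `--debug` as `False` / `True` / page number is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented fek-extractor options."""
    p = argparse.ArgumentParser(
        prog="fek-extractor",
        description="Extract structured info from FEK/Greek-law PDFs.",
    )
    p.add_argument("-i", "--input", required=True, help="PDF file or directory")
    p.add_argument("-o", "--out", default="out.json", help="output path")
    p.add_argument("-f", "--format", choices=["json", "csv"], default="json")
    p.add_argument("--no-recursive", action="store_true")
    # --debug        -> True (general debug logging)
    # --debug 39     -> 39   (focus on one page)
    # flag absent    -> False
    p.add_argument("--debug", nargs="?", type=int, const=True, default=False,
                   metavar="PAGE")
    p.add_argument("--jobs", type=int, default=1)
    p.add_argument("--include-metrics", action="store_true")
    p.add_argument("--articles-only", action="store_true")
    p.add_argument("--toc-only", action="store_true")
    return p
```

With `nargs="?"`, the optional `PAGE` value is consumed only when the next token is not another option, which is what allows both `--debug` and `--debug 39`.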

Python API

from fek_extractor import extract_pdf_info

# Single PDF → record (dict)
record = extract_pdf_info("data/samples/gr-act-2020-4706-4706_2020.pdf", include_metrics=True)
print(record["filename"], record["pages"], record["articles_count"])

# Optional kwargs (subject to change):
# debug=True
# debug_pages=[39] # focus page(s) for diagnostics
# dehyphenate=True # on by default

Return type: dict[str, Any] with the fields shown in Output schema.

Output schema

Each record (per PDF) typically contains:

| Field | Type | Notes |
| --- | --- | --- |
| `path` | string | Absolute or relative input path |
| `filename` | string | File name only |
| `pages` | int | Page count |
| `fek_series` | string? | Single Greek letter (e.g. `Α`) if detected |
| `fek_number` | string? | Issue number if detected |
| `fek_date` | string? | Dotted date `DD.MM.YYYY` |
| `fek_date_iso` | string? | ISO date `YYYY-MM-DD` |
| `decision_number` | string? | From “Αριθ.” if found |
| `subject` | string? | Document subject/Θέμα (best‑effort) |
| `articles` | object | Map of article number → article object |
| `articles_count` | int | Convenience total |
| `first_5_lines` | array | First few text lines (debugging aid) |
| Metrics (only when `--include-metrics`) | | |
| `length` | int | Characters in raw text |
| `num_lines` | int | Number of lines |
| `median_line_length` | int | Median non‑empty line length |
| `char_counts` | object | Char → count |
| `word_counts_top` | object | Top words |
| `chars`, `words` | int | Totals |
| `matches` | object | Regex matches (from `data/patterns/patterns.txt`) |
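
As a rough guide to what the metric fields mean, the sketch below computes a subset of them from raw text. It is an approximation written for this README, not the package's `metrics.py`: the function name `text_metrics`, the word tokenization (`\w+`, lowercased), the truncation of the median to `int`, and the `patterns` argument (a plain name → regex dict standing in for `data/patterns/patterns.txt`, whose file format is not described here) are all assumptions.

```python
import re
from collections import Counter
from statistics import median


def text_metrics(text, patterns=None, top_n=5):
    """Compute a few of the documented metric fields for one document's text."""
    lines = text.splitlines()
    non_empty = [len(ln) for ln in lines if ln.strip()]
    words = re.findall(r"\w+", text.lower())
    metrics = {
        "length": len(text),  # characters in raw text
        "num_lines": len(lines),
        "median_line_length": int(median(non_empty)) if non_empty else 0,
        "char_counts": dict(Counter(text)),
        "word_counts_top": dict(Counter(words).most_common(top_n)),
        "words": len(words),
    }
    if patterns:
        # Hypothetical pluggable patterns: name -> regex source.
        metrics["matches"] = {
            name: re.findall(rx, text) for name, rx in patterns.items()
        }
    return metrics
```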

Article object

{
 "number": "13", // normalized article id (e.g., "13", "14Α")
 "title": "Οργανωτικές ρυθμίσεις",
 "body": "…full text…",
 // optional structural context when present:
 "part_letter": "Α", "part_title": "…", // ΜΕΡΟΣ
 "title_letter": "I", "title_title": "…", // ΤΙΤΛΟΣ
 "chapter_letter": "1", "chapter_title": "…", // ΚΕΦΑΛΑΙΟ
 "section_letter": "Α", "section_title": "…" // ΤΜΗΜΑ
}
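
The optional structural fields on each article object are enough to rebuild a nested outline on the consumer side. The sketch below groups an `articles` map by ΜΕΡΟΣ → ΤΙΤΛΟΣ → ΚΕΦΑΛΑΙΟ → ΤΜΗΜΑ using only the field names shown above; the node shape (`title`, `children`, `articles`) and the function name `build_toc` are assumptions and do not necessarily match what `--toc-only` emits.

```python
LEVELS = ("part", "title", "chapter", "section")  # ΜΕΡΟΣ, ΤΙΤΛΟΣ, ΚΕΦΑΛΑΙΟ, ΤΜΗΜΑ


def _new_node(title=None):
    return {"title": title, "children": {}, "articles": []}


def build_toc(articles):
    """Nest article numbers under whichever structural levels each article carries."""
    root = _new_node()
    for number, art in articles.items():
        node = root
        for level in LEVELS:
            letter = art.get(f"{level}_letter")
            if letter is None:
                continue  # this level is absent for the article
            key = f"{level} {letter}"
            if key not in node["children"]:
                node["children"][key] = _new_node(art.get(f"{level}_title"))
            node = node["children"][key]
        node["articles"].append(number)
    return root
```

Articles without any structural context end up directly under the root node, so documents with a flat list of articles still produce a usable outline.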

Technical deep dive

- Reading order reconstruction: Rebuilds logical lines from low‑level glyphs, sorts by column then by y/x to maintain human reading order.
- Two‑column segmentation: Uses k‑means clustering over x‑coords and gap valley search to find the column gutter; detects and demotes “tail” (full‑width) content below columns.
- Greek‑aware normalization: Removes soft hyphens, stitches wrapped words, preserves Greek capitalization/accents conservatively.
- Header & masthead parsing: Regex/heuristics for FEK line (series/issue/date), dotted and ISO date, and decision numbers (Αριθ.).
- Article detection & stitching: Recognizes Άρθρο N headings, associates titles/bodies across page boundaries, and builds a robust map.
- TOC synthesis: Extracts hierarchical headers (ΜΕΡΟΣ/ΤΙΤΛΟΣ/ΚΕΦΑΛΑΙΟ/ΤΜΗΜΑ) when present.
- Metrics: Character/word counts and frequency stats to help diagnose messy PDFs.
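
The article detection and stitching step can be illustrated with a small sketch. It assumes the text has already been put into reading order as a list of pages, each a list of lines, and it treats the first non‑empty line after an `Άρθρο N` heading as the title; both the input shape and that title heuristic are simplifying assumptions, as is the function name `split_articles`. Because the body simply continues until the next heading, page boundaries need no special handling.

```python
import re

# "Άρθρο 13" or "Άρθρο 14Α" alone on a line (optional Greek capital suffix).
ARTICLE_RE = re.compile(r"^\s*Άρθρο\s+(\d+)\s*([Α-Ω]?)\s*$")


def split_articles(pages):
    """Return {article number: {"number", "title", "body"}} from ordered page lines."""
    articles = {}
    current = None
    for page_lines in pages:
        for line in page_lines:
            m = ARTICLE_RE.match(line)
            if m:
                number = m.group(1) + m.group(2)
                current = {"number": number, "title": None, "body": []}
                articles[number] = current
            elif current is None or (current["title"] is None and not line.strip()):
                continue  # preamble before the first article, or blank line before a title
            elif current["title"] is None:
                current["title"] = line.strip()
            else:
                current["body"].append(line)
    for art in articles.values():
        art["body"] = "\n".join(art["body"]).strip()
    return articles
```

A known limitation of this simplified version: amending laws often quote headings such as "Άρθρο 5" inside an article body, and those would be mistaken for new articles here.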

Architecture

PDF → glyphs → lines → columns → normalized text
 → header parser → articles parser → {record}
 → (optional) metrics / TOC
 → JSON/CSV writer

Key modules (under src/fek_extractor/):

io/pdf.py – low‑level extraction, column/tail logic
parsing/normalize.py – de‑hyphenation & cleanup
parsing/headers.py – FEK header parsing
parsing/articles.py – article detection + body stitching
metrics.py – optional stats
cli.py – batch processing, JSON/CSV output
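
As an illustration of the kind of work a header parser such as `parsing/headers.py` performs, the sketch below pulls series, issue number, and date out of a masthead‑like string. The masthead wording it expects ("ΤΕΥΧΟΣ …", "Αρ. Φύλλου …", a dotted date), the `SERIES_WORDS` table, and the function name `parse_masthead` are assumptions for this example; real mastheads vary and the library's own patterns may differ.

```python
import re
from datetime import date

# Word -> letter normalization for the series (illustrative subset).
SERIES_WORDS = {"ΠΡΩΤΟ": "Α", "ΔΕΥΤΕΡΟ": "Β", "ΤΡΙΤΟ": "Γ", "ΤΕΤΑΡΤΟ": "Δ"}
SERIES_RE = re.compile(r"ΤΕΥΧΟΣ\s+([Α-Ω]+)")
ISSUE_RE = re.compile(r"Αρ\.\s*Φύλλου\s+(\d+)")
DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def parse_masthead(text):
    """Best-effort extraction; fields stay None when nothing is detected."""
    out = {"fek_series": None, "fek_number": None,
           "fek_date": None, "fek_date_iso": None}
    if m := SERIES_RE.search(text):
        word = m.group(1)
        # Accept a spelled-out ordinal or an already single-letter series.
        out["fek_series"] = SERIES_WORDS.get(word, word if len(word) == 1 else None)
    if m := ISSUE_RE.search(text):
        out["fek_number"] = m.group(1)
    if m := DATE_RE.search(text):
        day, month, year = (int(g) for g in m.groups())
        try:
            parsed = date(year, month, day)
        except ValueError:
            parsed = None  # e.g. 31.02.2021: looks like a date but is not one
        if parsed is not None:
            out["fek_date"] = parsed.strftime("%d.%m.%Y")
            out["fek_date_iso"] = parsed.isoformat()
    return out
```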

Examples

# 1) All PDFs under a folder → JSON
fek-extractor -i ./data/samples -o out.json

# 2) Single PDF → CSV
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf -o out.csv -f csv

# 3) Articles only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --articles-only -o articles.json

# 4) Table of Contents only (for a file)
fek-extractor -i ./data/samples/gr-act-2020-4706-4706_2020.pdf --toc-only -o toc.json

# 5) Process a directory in parallel with 4 workers, include metrics
fek-extractor -i ./data/samples --jobs 4 --include-metrics -o out.json

Debug helpers

There is a small debug entrypoint to inspect column extraction and page layout:

python -m fek_extractor.debug --pdf data/samples/gr-act-2020-4706-4706_2020.pdf --page 39 --check-order

Performance tips

- Prefer running with --jobs N on directories to parallelize across files.
- For very large gazettes, keep output as JSON first (CSV is slower with many nested keys).
- If the source is scanned images, pre‑process the PDFs first (deskew, then OCR) so they contain a text layer to extract.
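
When driving the library from your own code rather than the CLI, per‑file parallelism along the lines of `--jobs N` can be sketched as below. This is an assumption about one reasonable approach, not a description of how the CLI is implemented: `find_pdfs` and `run_batch` are hypothetical helpers, and `worker` stands for any picklable function that takes a path, such as `extract_pdf_info`.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def find_pdfs(root, recursive=True):
    """Return a sorted list of PDF paths for a file or directory input."""
    root = Path(root)
    if root.is_file():
        return [root]
    pattern = "**/*.pdf" if recursive else "*.pdf"  # case-sensitive on most systems
    return sorted(root.glob(pattern))


def run_batch(paths, worker, jobs=1):
    """Apply worker to every path, optionally across several processes."""
    paths = [str(p) for p in paths]
    if jobs <= 1:
        return [worker(p) for p in paths]  # serial path, easiest to debug
    with ProcessPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(worker, paths))  # map preserves input order
```

On platforms that start worker processes by spawning (Windows, and macOS by default), call `run_batch` with `jobs > 1` from under an `if __name__ == "__main__":` guard.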

Project layout

src/fek_extractor/
 __main__.py # supports `python -m fek_extractor`
 cli.py # CLI entrypoint
 core.py # Orchestration
 io/ # PDF I/O and exporters
 parsing/ # Normalization & parsing rules (articles, headers, dates, HTML)
 metrics.py # Basic text metrics
 models.py # Typed record/contexts
 utils/ # Logging, HTML cleanup helpers

data/
 patterns/patterns.txt # Regexes for extra matches
 samples/ # Sample FEK PDF (optional)

tests/ # Unit/CLI/integration tests
docs/ # MkDocs starter (optional)

Development

# clone and set up
git clone https://github.com/dmsfiris/fek-extractor.git
cd fek-extractor
python -m venv .venv && source .venv/bin/activate
pip install -U pip
pip install -e ".[dev]"
pre-commit install

# run checks
ruff check .
black --check .
mypy src
pytest -q

Contributing

Contributions are welcome! Please open an issue to discuss substantial changes first. By contributing you agree to license your work under the project’s Apache‑2.0 license.

License

This project is licensed under the Apache License 2.0; see LICENSE. If you prefer a copyleft model (keeping derivatives open), consider re‑licensing as GPLv3/AGPLv3 or offering dual licensing (AGPL for the community plus a commercial license via AspectSoft). See below for guidance.

Picking a license (quick guide)

- Max adoption, simple → MIT or Apache‑2.0 (Apache adds a patent grant and NOTICE).
- Keep derivatives open → GPLv3 (apps), AGPLv3 (network services).
- File‑level copyleft with easier mixing → MPL‑2.0.
- Source‑available (not OSI) → Business Source License (BUSL‑1.1), SSPL, Polyform (non‑commercial).

For a permissive license that still offers some protection (an explicit patent grant), Apache‑2.0 is a good default. If you want stronger reciprocity, choose AGPLv3 or dual‑license.

How to apply

- Add a LICENSE file (done).
- Add a NOTICE file (done) and keep third‑party attributions.
- Optionally add license headers to source files, e.g.:

# Copyright (c) 2025 Your Name
# SPDX-License-Identifier: Apache-2.0

Contact

Author: Dimitrios S. Sfyris (AspectSoft)
Email: info@aspectsoft.gr
LinkedIn: https://www.linkedin.com/in/dimitrios-s-sfyris/
Get in touch: If you need bespoke FEK parsing or similar layout‑aware NLP pipelines, reach out.

Acknowledgements

Built on top of pdfminer.six.
Includes heuristics tuned for FEK / Εφημερίδα της Κυβερνήσεως layouts.

This version: 0.1.0 (Sep 10, 2025)
