Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaCy-based normalization; PDF export.

intelli3text

intelli3text is the text-processing backbone of the broader intelli3 project (a classification-oriented research/engineering effort).
It ingests texts from Web/PDF/DOCX/TXT, performs cleaning and multilingual normalization (PT/EN/ES), applies paragraph-level language identification (LID), and produces an auditable PDF report (raw → cleaned → normalized), ready for downstream classification tasks.

This work is part of my Master’s research, advised by Sidgley Camargo de Andrade (advisor) and Clodis Boscarioli (co-advisor).

What this module does (in the intelli3 ecosystem)

  • Acquire: extract main content from the web (and read local PDF/DOCX/TXT).
  • Clean: remove boilerplate, linebreak artifacts, and markup noise.
  • Detect language (per paragraph): fastText LID (lid.176.ftz) for robust PT/EN/ES routing.
  • Normalize: spaCy-based normalization pipeline for stable, comparable text.
  • Export: generate an auditable PDF and structured outputs for classification pipelines in intelli3.

How it works (design choices)

  • Frictionless install. pip install intelli3text declares and enforces fasttext>=0.9.2.
    On first run, models are auto-downloaded (fastText LID and spaCy) and then cached/embedded for offline operation.
  • Reproducible by default. Pinned binaries and install-time model bootstrap minimize OS/WSL/environment drift.
  • Paragraph granularity. LID and normalization operate per-paragraph, improving quality on mixed-language sources.
  • Auditable outputs. PDF report includes raw → cleaned → normalized views to support inspection and research traceability.

Why this project?

In research and production, common needs include:

  1. Ingest text from heterogeneous sources (web, PDFs, DOCX, TXT);
  2. Clean and normalize the content;
  3. Lemmatize and remove stopwords;
  4. Detect language accurately, including bilingual documents;
  5. Export results with traceability (PDF that shows normalized, cleaned, and raw text).

intelli3text is built to be plug-and-play: pip install and go — no native toolchains, no manual compiles, no painful environment setup.


Key features

  • Ingestion: URL (HTML), PDF (pdfminer.six), DOCX (python-docx), TXT.
  • Cleaning: Unicode fixes (ftfy), noise removal (clean-text), PDF-specific line-break & hyphenation heuristics.
  • Paragraph-level LID: fastText LID (176 languages) with tolerant fallback.
  • spaCy normalization: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
  • PDF export: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
  • Auto-download on first run:
    • lid.176.bin (fastText LID);
    • spaCy models for PT/EN/ES (lg→md→sm) with offline fallback.
  • CLI & Python API: use from shell or embed in code.

Requirements

  • Python 3.9+
  • Internet only on first run (to download models). After that, it works offline.
  • To avoid binary mismatches, the package pins compatible versions of numpy, thinc, and spacy.

Installation

pip install intelli3text
# or from a local repo:
# pip install .

No extra scripts. On first execution, required models are fetched to a local cache automatically.


Quick start (CLI)

intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf

Output:

  • JSON to stdout with language_global, cleaned, normalized, and a list of paragraphs.
  • A PDF report at output.pdf.
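
A minimal sketch of consuming that JSON, assuming the CLI output has the same shape as the result used in the Python API section (a language_global string plus a paragraphs list whose items carry language and normalized). The helper name summarize and the sample payload are made up for illustration.

```python
import json
from collections import Counter

def summarize(raw_json: str) -> dict:
    """Return the global language and a per-language paragraph count."""
    result = json.loads(raw_json)
    counts = Counter(p["language"] for p in result["paragraphs"])
    return {
        "language_global": result["language_global"],
        "by_language": dict(counts),
    }

# Made-up payload standing in for the JSON the CLI prints to stdout.
sample = json.dumps({
    "language_global": "pt",
    "cleaned": "texto limpo",
    "normalized": "texto limpo",
    "paragraphs": [
        {"language": "pt", "normalized": "teoria inteligência"},
        {"language": "en", "normalized": "multiple intelligence"},
        {"language": "pt", "normalized": "escola educação"},
    ],
})

print(summarize(sample))
```

The same function works on a file written with --json-out once its contents are read into a string.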

CLI examples

  • Local PDF:

    intelli3text "./my_paper.pdf" --export-pdf report.pdf
    
  • Choose spaCy model size:

    intelli3text "URL" --nlp-size md
    # options: lg (default) | md | sm
    
  • Select cleaners:

    intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
    
  • Save JSON to file:

    intelli3text "URL" --json-out result.json
    
  • Use CLD3 as primary (if installed as extra):

    pip install "intelli3text[cld3]"   # quotes keep shells such as zsh from expanding the brackets
    intelli3text "URL" --lid-primary cld3 --lid-fallback none
    

Full CLI reference: see Docs → CLI on the website: https://jeffersonspeck.github.io/intelli3text/


Python usage (API)

from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",         # or "cld3" if you installed the extra
    lid_fallback=None,              # or "cld3"
    nlp_model_pref="lg",            # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")

print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])

More samples (including safe-to-import examples): Docs → Examples.


Language identification (LID)

  • Primary: fastText LID (lid.176.bin) auto-downloaded on first use.

  • Tolerant: if fasttext is unavailable, the pipeline won’t crash — it returns "pt" with confidence 0.0 as a safe fallback.

  • Accuracy: detection runs per paragraph; language_global is the most frequent paragraph language.

  • Optional: pycld3 via extra:

    pip install "intelli3text[cld3]"
    # CLI: --lid-primary cld3 --lid-fallback none
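
The per-paragraph detection, the ("pt", 0.0) fallback, and the majority vote for language_global can be sketched as follows. This is an illustration and not the package's code: predict is a hypothetical callable that returns one (label, confidence) pair, and the only fastText-specific detail relied on is that its labels carry a __label__ prefix.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

FALLBACK = ("pt", 0.0)  # safe default described above

Prediction = Tuple[str, float]

def detect(paragraph: str,
           predict: Optional[Callable[[str], Prediction]] = None) -> Prediction:
    """Classify one paragraph, or return the fallback when no detector exists."""
    if predict is None:
        return FALLBACK
    # fastText-style predictors work on a single line, so flatten newlines.
    label, confidence = predict(paragraph.replace("\n", " "))
    return label.replace("__label__", ""), float(confidence)

def global_language(predictions: List[Prediction]) -> str:
    """Pick the language that labels the most paragraphs."""
    counts = Counter(lang for lang, _ in predictions)
    return counts.most_common(1)[0][0]

# Hypothetical detector used only to exercise the functions.
def fake_predict(text: str) -> Prediction:
    return ("__label__en", 0.9) if "the" in text else ("__label__pt", 0.8)

paragraphs = ["the cat", "o gato", "the dog"]
predictions = [detect(p, fake_predict) for p in paragraphs]
print(predictions, global_language(predictions))
```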
    

spaCy models & normalization

  • Size preference: lg→md→sm.

  • If the model is missing, the library tries to download it.

  • Offline: falls back to spacy.blank(<lang>) with a sentencizer (no crash).

  • Normalization includes:

    • tokenization;
    • dropping stopwords/punctuation/whitespace;
    • lemmatization (when the model has a lexicon);
    • joining lemmas.
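
The steps above can be sketched against spaCy's token attributes (is_stop, is_punct, is_space, lemma_, text). The function below is an illustration, not the package's SpacyNormalizer; lowercasing and the fallback to the surface form when lemma_ is empty are assumptions. Stand-in tokens are used so the example runs without a spaCy model, but iterating a real spaCy Doc should work the same way.

```python
from types import SimpleNamespace

def normalize_tokens(tokens) -> str:
    """Join lemmas of tokens that are not stopwords, punctuation, or whitespace."""
    kept = []
    for tok in tokens:
        if tok.is_stop or tok.is_punct or tok.is_space:
            continue
        # Blank pipelines leave lemma_ empty, so fall back to the surface form.
        kept.append((tok.lemma_ or tok.text).lower())
    return " ".join(kept)

def fake_token(text, lemma="", stop=False, punct=False, space=False):
    """Hypothetical stand-in that mimics the spaCy Token attributes used above."""
    return SimpleNamespace(text=text, lemma_=lemma, is_stop=stop,
                           is_punct=punct, is_space=space)

doc = [
    fake_token("The", "the", stop=True),
    fake_token("cats", "cat"),
    fake_token("were", "be", stop=True),
    fake_token("running", "run"),
    fake_token(".", ".", punct=True),
    fake_token("Fast"),  # no lemma, as with spacy.blank(<lang>)
]

print(normalize_tokens(doc))
```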

Cleaning pipeline

Default order (--cleaners ftfy,clean_text,pdf_breaks):

  1. FTFY: fixes Unicode glitches.
  2. clean-text: removes URLs/emails/phones; keeps numbers/punctuation by default.
  3. pdf_breaks: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).

You can customize the list/order via CLI or API.
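
To make the chaining and the pdf_breaks heuristics concrete, here is a small standard-library sketch. The function names and regular expressions are illustrative guesses at the kind of rules involved, not the package's implementation.

```python
import re
from typing import Callable, Iterable

Cleaner = Callable[[str], str]

def fix_pdf_breaks(text: str) -> str:
    """Undo common PDF line-break artifacts."""
    # Join words hyphenated across a line break: "hyphen-\nated" -> "hyphenated".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    # Merge a lone newline inside a paragraph into a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text.strip()

def squeeze_spaces(text: str) -> str:
    """Collapse runs of spaces and tabs into one space."""
    return re.sub(r"[ \t]+", " ", text)

def run_chain(text: str, cleaners: Iterable[Cleaner]) -> str:
    """Apply each cleaner in order, feeding the output of one into the next."""
    for cleaner in cleaners:
        text = cleaner(text)
    return text

raw = "A hyphen-\nated   word\nsplit here.\n\n\n\nNext para."
print(repr(run_chain(raw, [fix_pdf_breaks, squeeze_spaces])))
```

Order matters in a chain like this: de-hyphenation has to run before single newlines are merged, otherwise the hyphen would be left in the middle of the rejoined line.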


PDF export

The report includes:

  • Summary (global language, total paragraphs),

  • Global Normalized Text (optional),

  • Per-paragraph table (language, confidence, normalized preview),

  • Per-paragraph sections showing:

    • normalized,
    • cleaned,
    • raw.

Library: ReportLab.


Cache, auto-downloads & offline mode

  • Default cache directory: ~/.cache/intelli3text/
    Override via env var: INTELLI3TEXT_CACHE_DIR=/your/custom/path

  • Auto-download on first use:

    • lid.176.bin (fastText LID),
    • spaCy models PT/EN/ES in order lg→md→sm.
  • Offline behavior:

    • LID returns fallback "pt", 0.0 if fastText is unavailable;
    • spaCy uses blank() (functional, but without full lexical features).
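
A sketch of how the cache location could be resolved from the default path and the INTELLI3TEXT_CACHE_DIR override described above. The helper name resolve_cache_dir is hypothetical, and the package's own helper in utils.py may behave differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "INTELLI3TEXT_CACHE_DIR"

def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the override from the environment, else ~/.cache/intelli3text."""
    env = os.environ if env is None else env
    override = env.get(ENV_VAR)
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "intelli3text"

print(resolve_cache_dir({}))
print(resolve_cache_dir({ENV_VAR: "/your/custom/path"}))
```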

Architecture & Design Patterns

Applied patterns:

  • Builder: PipelineBuilder composes extractors, cleaners, LID, normalizer, and exporters from declarative config.

  • Strategy:

    • Extractors (Web/PDF/DOCX/TXT) implement IExtractor.
    • Cleaners implement ICleaner, chained via CleanerChain.
    • Language Detectors implement a simple interface (FastTextLID, CLD3LID).
    • Normalizer implements INormalizer (SpacyNormalizer here).
    • Exporters implement IExporter (PDFExporter here).
  • Factory/Registry: lazy loading of spaCy models by lang/size with fallbacks.

  • Facade: CLI and Pipeline.process() offer a simple entry point.
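
As an illustration of the Factory/Registry idea, the sketch below caches models per language and walks the lg→md→sm order before falling back to a blank pipeline. The class and parameter names are hypothetical; the loader and blank callables stand in for spacy.load and spacy.blank, and OSError is caught because that is what spacy.load raises when a model is not installed.

```python
from typing import Callable, Dict, Tuple

SIZE_ORDER = ("lg", "md", "sm")

class ModelRegistry:
    """Lazily load one model per (language, preferred size) and cache it."""

    def __init__(self, loader: Callable[[str, str], object],
                 blank: Callable[[str], object]) -> None:
        self._loader = loader
        self._blank = blank
        self._cache: Dict[Tuple[str, str], object] = {}

    def get(self, lang: str, preferred: str = "lg") -> object:
        key = (lang, preferred)
        if key in self._cache:
            return self._cache[key]
        model = None
        # Try the preferred size first, then each smaller one.
        for size in SIZE_ORDER[SIZE_ORDER.index(preferred):]:
            try:
                model = self._loader(lang, size)
                break
            except OSError:
                continue
        if model is None:
            model = self._blank(lang)  # functional, but without lexical features
        self._cache[key] = model
        return model

# Hypothetical loader: only a small Portuguese model is "installed".
AVAILABLE = {("pt", "sm"): "pt-sm-model"}

def fake_loader(lang: str, size: str) -> object:
    if (lang, size) not in AVAILABLE:
        raise OSError(f"model for {lang}/{size} not found")
    return AVAILABLE[(lang, size)]

registry = ModelRegistry(fake_loader, lambda lang: f"blank-{lang}")
print(registry.get("pt"), registry.get("es"))
```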

Package layout (summary)

src/intelli3text/
  __init__.py
  __main__.py            # CLI
  config.py              # Intelli3Config (parameters)
  utils.py               # cache/download helpers
  builder.py             # PipelineBuilder (Builder)
  pipeline.py            # Pipeline (Facade)

  extractors/            # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py

  cleaners/              # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py

  lid/                   # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py

  nlp/
    base.py
    registry.py          # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py  # Strategy

  export/
    base.py
    pdf_reportlab.py     # Strategy

Design Science Research (DSR)

Artifact. A production-oriented NLP pipeline for ingestion, cleaning, paragraph-level language identification (LID), normalization, and PDF export, designed for reproducibility (binary pins, install-time model bootstrap) and trivial installation. This aligns with DSR’s emphasis on building useful artifacts that extend human and organizational capabilities.

Problem. Heterogeneous sources (Web/PDF/DOCX/TXT), bilingual/multilingual content, and environment friction (native deps, wheels, OS/WSL divergences) often break reproducibility and degrade text quality via boilerplate/noise. Prior work highlights the importance of robust boilerplate removal and main-content extraction for downstream NLP quality.

Design.

  • Acquisition & cleaning: Web extraction via Trafilatura (main text, comments, metadata) plus jusText-style boilerplate filtering; both are well-studied choices for reliable textual corpora.
  • Language ID: fastText LID model (recognizes 176 languages) with install-time download/embedding to remove runtime network dependency.
  • Normalization: spaCy pipeline (industrial-strength NLP; v2+ with Bloom embeddings/CNNs) with pinned versions for deterministic behavior across environments.
  • Reproducibility: strict dependency pinning and build hooks; artifact packaged with the LID model to guarantee availability at install time, consistent with DSR guidance on rigor and verifiability.

Demonstration. Command-line interface and Python API across Web/PDF/DOCX/TXT; LID for PT/EN/ES using fastText; auditable PDF report that shows raw, cleaned, and normalized views.

Evaluation.

  • Technical robustness: empirical tests across user-site installs, WSL, and Windows; deterministic packaging validated by install-time model embedding. (Engineering claim; methodology aligned with DSR evaluation guidance.)
  • Quality: LID confidence/coverage supported by the fastText 176-language models; cleaning quality supported by established extractors (Trafilatura/jusText).

Contributions.

  • Engineering: Builder/Strategy/Factory patterns to decouple extractors, cleaners, LID, and normalizers for reuse. (Standard software-engineering patterns applied to the artifact.)
  • DSR grounding: Follows Hevner et al.’s design-science guidelines (relevance, rigor, design evaluation) and Peffers et al.’s DSRM (problem identification → artifact design → evaluation → communication).

Notes on verification:

  • DSR foundations are confirmed via MISQ (Hevner et al., 2004) and the DSRM (Peffers et al., 2007).
  • Trafilatura demo paper (ACL 2021) and docs confirm main-content extraction with comments/metadata.
  • jusText origins and efficacy for boilerplate removal are documented in Pomikálek’s thesis.
  • fastText LID page confirms 176-language models (lid.176.*).
  • spaCy v2 architecture (Bloom embeddings/CNNs) is documented in Honnibal & Montani.

Binary compatibility (NumPy/Thinc/spaCy)

To avoid the classic numpy.dtype size changed error:

  • We pin compatible versions in pyproject.toml.

  • If you already had other global packages and hit this error:

    1. pip uninstall -y spacy thinc numpy
    2. pip cache purge
    3. pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"
    4. pip install --user --no-cache-dir intelli3text (or -e . from the local repo)

Tip: always use the same Python that runs intelli3text (check head -1 ~/.local/bin/intelli3text).
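
Before reinstalling, it can help to check what the active interpreter actually has installed. The sketch below uses importlib.metadata from the standard library and compares against the versions from the recovery commands above; the helper names are made up, and the pins in pyproject.toml may differ from these.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Versions taken from the recovery commands above.
PINS = {"numpy": "1.26.4", "thinc": "8.2.4", "spacy": "3.7.4"}

def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def pin_mismatches(
    pins: Dict[str, str] = PINS,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map each package whose version differs from its pin to (found, wanted)."""
    found = {name: lookup(name) for name in pins}
    return {name: (found[name], wanted)
            for name, wanted in pins.items() if found[name] != wanted}

print(pin_mismatches())
```

Run it with the same interpreter that runs intelli3text; an empty dict means all three packages match.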


Performance tips

  • Paragraph length: controlled by paragraph_min_chars (default 30) and lid_min_chars (default 60).
  • LID sample cap: very long texts are truncated to about 2,000 characters before detection, which speeds up LID with little loss of accuracy.
  • spaCy model size: sm is lighter; lg gives better quality (default).
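
One plausible reading of how these thresholds interact is sketched below. The parameter names and defaults come from the list above; everything else is an assumption, including splitting on blank lines, dropping short paragraphs, skipping LID below lid_min_chars, and the exact cap of 2000.

```python
from typing import List, Optional

def split_paragraphs(text: str, paragraph_min_chars: int = 30) -> List[str]:
    """Split on blank lines and drop fragments shorter than the minimum."""
    parts = [part.strip() for part in text.split("\n\n")]
    return [part for part in parts if len(part) >= paragraph_min_chars]

def lid_sample(paragraph: str, lid_min_chars: int = 60,
               max_chars: int = 2000) -> Optional[str]:
    """Return the text to send to LID, or None when the paragraph is too short."""
    if len(paragraph) < lid_min_chars:
        return None
    return paragraph[:max_chars]

text = "short\n\n" + "a" * 40 + "\n\n" + "b" * 3000
paragraphs = split_paragraphs(text)
samples = [lid_sample(p) for p in paragraphs]
print(len(paragraphs), [None if s is None else len(s) for s in samples])
```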

Extensibility

  • New sources: implement IExtractor and register in PipelineBuilder.
  • New cleaners: implement ICleaner and map it in NAME2CLEANER.
  • New LIDs: implement the interface under lid/base.py.
  • Exporters: implement IExporter (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.
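
As an example of the exporter extension point, here is a sketch of a JSONL exporter. The Exporter protocol is a hypothetical stand-in for IExporter, whose real method names and signature live in export/base.py and may differ; the result shape follows the Python API example (a paragraphs list of dicts).

```python
import io
import json
from typing import Protocol, TextIO

class Exporter(Protocol):
    """Hypothetical stand-in for IExporter; check export/base.py for the real one."""
    def export(self, result: dict, out: TextIO) -> None: ...

class JSONLExporter:
    """Write one JSON object per paragraph, one per line."""

    def export(self, result: dict, out: TextIO) -> None:
        for index, paragraph in enumerate(result["paragraphs"]):
            row = {"index": index, **paragraph}
            # ensure_ascii=False keeps accented PT/ES characters readable.
            out.write(json.dumps(row, ensure_ascii=False) + "\n")

# Made-up result used only to exercise the exporter.
result = {"paragraphs": [
    {"language": "pt", "normalized": "olá mundo"},
    {"language": "en", "normalized": "hello world"},
]}

buffer = io.StringIO()
exporter: Exporter = JSONLExporter()
exporter.export(result, buffer)
print(buffer.getvalue(), end="")
```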

Troubleshooting

  • Trafilatura ‘unidecode’ warning: already handled — we depend on Unidecode.

  • No Internet on first run:

    • LID: fallback "pt", 0.0.
    • spaCy: spacy.blank(<lang>).
    • Later, with Internet, run again to fetch full models.
  • ModuleNotFoundError: fasttext:

    • We depend on fasttext-wheel (prebuilt wheels).
    • Reinstall: pip install fasttext-wheel.

More tips and parameter-by-parameter guidance: https://jeffersonspeck.github.io/intelli3text/


Roadmap

  • Exporters: HTML/Markdown with paragraph navigation.
  • Quality metrics (lexical density, diversity, etc.).
  • More languages via custom spaCy models.
  • Optional normalization using Stanza.

License

MIT — you’re free to use, modify and distribute.

Note: the original upstream licenses of third-party models and libraries still apply.


How to cite

Speck, J. (2025). intelli3text: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: https://github.com/jeffersonspeck/intelli3text
