Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaCy-based normalization; PDF export.

intelli3text

Ingestion of texts (Web/PDF/DOCX/TXT), cleaning, and multilingual normalization (PT/EN/ES) with paragraph-level language detection and PDF export.
The focus is a frictionless install (pip install): on first run the package auto-downloads the required models (fastText LID and spaCy). After that it works offline, with sensible fallbacks when a model is unavailable.

Docs website: https://jeffersonspeck.github.io/intelli3text/
PyPI: https://pypi.org/project/intelli3text/
Repository: https://github.com/jeffersonspeck/intelli3text

Usage Manual
Why this project?
Key features
Requirements
Installation
Quick start (CLI)
CLI examples
Python usage (API)
Language identification (LID)
spaCy models & normalization
Cleaning pipeline
PDF export
Cache, auto-downloads & offline mode
Architecture & Design Patterns
Design Science Research (DSR)
Binary compatibility (NumPy/Thinc/spaCy)
Performance tips
Extensibility
Troubleshooting
Publishing to PyPI
Roadmap
License
How to cite

Why this project?

In research and production, common needs include:

Ingest text from heterogeneous sources (web, PDFs, DOCX, TXT);
Clean and normalize the content;
Lemmatize and remove stopwords;
Detect language accurately, including bilingual documents;
Export results with traceability (PDF that shows normalized, cleaned, and raw text).

intelli3text is built to be plug-and-play: pip install and go — no native toolchains, no manual compiles, no painful environment setup.

Key features

Ingestion: URL (HTML), PDF (pdfminer.six), DOCX (python-docx), TXT.
Cleaning: Unicode fixes (ftfy), noise removal (clean-text), PDF-specific line-break & hyphenation heuristics.
Paragraph-level LID: fastText LID (176 languages) with tolerant fallback.
spaCy normalization: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
PDF export: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
Auto-download on first run:
- lid.176.bin (fastText LID);
- spaCy models for PT/EN/ES (lg→md→sm) with offline fallback.
CLI & Python API: use from shell or embed in code.

Requirements

Python 3.9+
Internet only on first run (to download models). After that, it works offline.
To avoid binary mismatches, the package pins compatible versions of numpy, thinc, and spacy.

Installation

pip install intelli3text
# or from a local repo:
# pip install .

No extra scripts. On first execution, required models are fetched to a local cache automatically.

Quick start (CLI)

intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf

Output:

JSON to stdout with language_global, cleaned, normalized, and a list of paragraphs.
A PDF report at output.pdf.
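
As an illustration of consuming that JSON, here is a minimal sketch that summarizes a result dictionary. The sample payload is hypothetical: it mirrors the documented top-level keys (language_global, cleaned, normalized, paragraphs), while the per-paragraph "confidence" key is an assumption based on the PDF report columns.

```python
import json

# Hypothetical payload shaped like the documented output. The top-level keys
# come from this README; the "confidence" key name is an assumption.
sample = json.loads("""
{
  "language_global": "pt",
  "cleaned": "Howard Gardner e um psicologo. He proposed multiple intelligences.",
  "normalized": "howard gardner psicologo propose multiple intelligence",
  "paragraphs": [
    {"language": "pt", "confidence": 0.98, "normalized": "howard gardner psicologo"},
    {"language": "en", "confidence": 0.95, "normalized": "propose multiple intelligence"}
  ]
}
""")


def summarize(result: dict) -> str:
    """Return a one-line summary of a pipeline result dictionary."""
    langs = [p["language"] for p in result["paragraphs"]]
    return f"{result['language_global']}: {len(langs)} paragraphs ({', '.join(langs)})"


print(summarize(sample))
```

In practice the dictionary would come from the CLI's stdout or from a file written with --json-out, loaded with json.load.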

CLI examples

Local PDF:

intelli3text "./my_paper.pdf" --export-pdf report.pdf

Choose spaCy model size:

intelli3text "URL" --nlp-size md
# options: lg (default) | md | sm

Select cleaners:

intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks

Save JSON to file:

intelli3text "URL" --json-out result.json

Use CLD3 as primary (if installed as extra):

pip install "intelli3text[cld3]"   # quotes keep shells such as zsh from expanding the brackets
intelli3text "URL" --lid-primary cld3 --lid-fallback none

Full CLI reference: see Docs → CLI on the website: https://jeffersonspeck.github.io/intelli3text/

Python usage (API)

from intelli3text import PipelineBuilder, Intelli3Config

cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",         # or "cld3" if you installed the extra
    lid_fallback=None,              # or "cld3"
    nlp_model_pref="lg",            # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)

pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")

print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])

More samples (including safe-to-import examples): Docs → Examples.

Language identification (LID)

Primary: fastText LID (lid.176.bin) auto-downloaded on first use.
Tolerant: if fasttext is unavailable, the pipeline won’t crash — it returns "pt" with confidence 0.0 as a safe fallback.
Accuracy: detection runs per paragraph; language_global is the language detected most often across paragraphs.
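
The behavior described above can be sketched as follows. This is not the library's code: detect_or_fallback, global_language, and parse_fasttext_label are hypothetical helper names, the detector is passed in as a plain callable, and returning "pt" for an empty paragraph list is an assumption.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Detection = Tuple[str, float]          # (language code, confidence)
FALLBACK: Detection = ("pt", 0.0)      # documented safe fallback


def parse_fasttext_label(label: str) -> str:
    """fastText labels look like '__label__pt'; strip the prefix."""
    return label.replace("__label__", "", 1)


def detect_or_fallback(
    text: str, detector: Optional[Callable[[str], Detection]] = None
) -> Detection:
    """Run the detector if there is one; never raise, return the fallback instead."""
    if detector is None:               # e.g. fasttext could not be imported
        return FALLBACK
    try:
        return detector(text)
    except Exception:
        return FALLBACK


def global_language(paragraph_langs: List[str]) -> str:
    """Pick the most frequent paragraph language (first seen wins on ties)."""
    if not paragraph_langs:
        return FALLBACK[0]             # assumption: empty input maps to the fallback
    return Counter(paragraph_langs).most_common(1)[0][0]
```

For a bilingual document with paragraph languages ["pt", "en", "pt"], global_language returns "pt", while each paragraph keeps its own label.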

Optional: pycld3 via extra:

pip install "intelli3text[cld3]"
# CLI: --lid-primary cld3 --lid-fallback none

spaCy models & normalization

Size preference: lg → md → sm.
If the model is missing, the library tries to download it.
Offline: falls back to spacy.blank(<lang>) with a sentencizer (no crash).
Normalization includes:
- tokenization;
- dropping stopwords/punctuation/whitespace;
- lemmatization (when the model has a lexicon);
- joining lemmas.
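
The normalization steps listed above can be sketched as a small function. It reads only the spaCy Token attributes is_stop, is_punct, is_space, lemma_, and text, so the Tok dataclass below is a stand-in that lets the sketch run without a model. Lowercasing and the fall-back to the surface form when no lemma is set are assumptions, not confirmed library behavior.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Tok:
    """Stand-in exposing the spaCy Token attributes that the sketch reads."""
    text: str
    lemma_: str = ""
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


def normalize(tokens: Iterable) -> str:
    """Drop stopwords/punctuation/whitespace, lemmatize, and join the lemmas."""
    lemmas = []
    for tok in tokens:
        if tok.is_stop or tok.is_punct or tok.is_space:
            continue
        # A blank pipeline has no lemmatizer, so lemma_ may be empty:
        # fall back to the surface form in that case (assumption).
        lemmas.append((tok.lemma_ or tok.text).lower())
    return " ".join(lemmas)


doc = [
    Tok("The", "the", is_stop=True),
    Tok("cats", "cat"),
    Tok("ran", "run"),
    Tok(".", ".", is_punct=True),
]
print(normalize(doc))
```

With spaCy installed, the same function accepts a real Doc, for example normalize(nlp("The cats ran.")).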

Cleaning pipeline

Default order (--cleaners ftfy,clean_text,pdf_breaks):

FTFY: fixes Unicode glitches.
clean-text: removes URLs/emails/phones; keeps numbers/punctuation by default.
pdf_breaks: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).

You can customize the list/order via CLI or API.
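
To make the idea of an ordered cleaner chain concrete, here is a minimal sketch. The regexes are a simplified guess at the pdf_breaks heuristics (de-hyphenation, merging single line breaks, collapsing blank lines), and CLEANERS and run_chain are hypothetical names. The ftfy and clean-text steps are left out so the example needs only the standard library.

```python
import re
from typing import Callable, Dict, List


def pdf_breaks(text: str) -> str:
    """Simplified PDF line-break heuristics (assumed, not the library's rules)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)   # join words hyphenated across lines
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)   # a single newline is an artificial break
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse runs of blank lines
    return text


def strip_edges(text: str) -> str:
    """Trim leading and trailing whitespace."""
    return text.strip()


# Hypothetical name-to-cleaner registry; the library's own mapping may differ.
CLEANERS: Dict[str, Callable[[str], str]] = {
    "pdf_breaks": pdf_breaks,
    "strip": strip_edges,
}


def run_chain(text: str, names: List[str]) -> str:
    """Apply the named cleaners in the given order."""
    for name in names:
        text = CLEANERS[name](text)
    return text


raw = "  implemen-\ntation of the\npipeline\n\n\n\nNext paragraph  "
print(run_chain(raw, ["pdf_breaks", "strip"]))
```

Order matters in a chain like this. One known limitation of the naive de-hyphenation rule is that genuine compounds split at a line end, such as "well-" followed by "known", are merged as well.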

PDF export

The report includes:

Summary (global language, total paragraphs),
Global Normalized Text (optional),
Per-paragraph table (language, confidence, normalized preview),
Per-paragraph sections showing:
- normalized,
- cleaned,
- raw.

Library: ReportLab.

Cache, auto-downloads & offline mode

Default cache directory: ~/.cache/intelli3text/
Override via env var: INTELLI3TEXT_CACHE_DIR=/your/custom/path
Auto-download on first use:
- lid.176.bin (fastText LID),
- spaCy models PT/EN/ES in order lg→md→sm.
Offline behavior:
- LID returns fallback "pt", 0.0 if fastText is unavailable;
- spaCy uses blank() (functional, but without full lexical features).
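
A minimal sketch of how the cache location could be resolved from the environment variable and the default path above. resolve_cache_dir is a hypothetical helper, not a function exported by the package.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the override from INTELLI3TEXT_CACHE_DIR, else ~/.cache/intelli3text."""
    env = os.environ if environ is None else environ
    override = env.get("INTELLI3TEXT_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "intelli3text"


print(resolve_cache_dir())
```

Passing the environment as a parameter keeps the function easy to test without mutating os.environ.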

Architecture & Design Patterns

Applied patterns:

Builder: PipelineBuilder composes extractors, cleaners, LID, normalizer, and exporters from declarative config.
Strategy:
- Extractors (Web/PDF/DOCX/TXT) implement IExtractor.
- Cleaners implement ICleaner, chained via CleanerChain.
- Language Detectors implement a simple interface (FastTextLID, CLD3LID).
- Normalizer implements INormalizer (SpacyNormalizer here).
- Exporters implement IExporter (PDFExporter here).
Factory/Registry: lazy loading of spaCy models by lang/size with fallbacks.
Facade: CLI and Pipeline.process() offer a simple entry point.
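
The Factory/Registry fallback can be sketched as below. candidate_models and load_with_fallback are hypothetical names. The model names follow spaCy's published naming for PT/EN/ES pipelines. Starting the search at the preferred size rather than always at lg is an assumption, and the auto-download step is not modeled. The loader and blank factory are injected so the sketch runs without spaCy.

```python
from typing import Any, Callable, List

SIZES = ["lg", "md", "sm"]  # documented preference order
# Name prefixes of the official spaCy pipelines for the supported languages.
PREFIX = {"pt": "pt_core_news", "en": "en_core_web", "es": "es_core_news"}


def candidate_models(lang: str, pref: str = "lg") -> List[str]:
    """List model names to try, from the preferred size down to sm."""
    start = SIZES.index(pref)
    return [f"{PREFIX[lang]}_{size}" for size in SIZES[start:]]


def load_with_fallback(
    lang: str,
    pref: str,
    load: Callable[[str], Any],
    blank: Callable[[str], Any],
) -> Any:
    """Try each candidate; fall back to a blank pipeline if none is installed."""
    for name in candidate_models(lang, pref):
        try:
            return load(name)
        except OSError:  # spacy.load raises OSError for a missing package
            continue
    return blank(lang)


print(candidate_models("pt", "md"))
```

With spaCy installed, a call would look like load_with_fallback("pt", "lg", spacy.load, spacy.blank). A real registry would also cache the loaded pipeline per (lang, size) so each model loads only once.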

Package layout (summary)

src/intelli3text/
  __init__.py
  __main__.py            # CLI
  config.py              # Intelli3Config (parameters)
  utils.py               # cache/download helpers
  builder.py             # PipelineBuilder (Builder)
  pipeline.py            # Pipeline (Facade)

  extractors/            # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py

  cleaners/              # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py

  lid/                   # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py

  nlp/
    base.py
    registry.py          # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py  # Strategy

  export/
    base.py
    pdf_reportlab.py     # Strategy

Design Science Research (DSR)

Artifact: robust ingestion/cleaning/LID/normalization/export pipeline prioritizing reproducibility and trivial install.
Problem: heterogeneous sources, bilingual content, and environment friction (native deps, binary mismatches).
Design: auto-downloads, fallbacks, and stable binary pins; per-paragraph LID; auditable PDF report.
Demonstration: clean CLI & Python API; Web/PDF/DOCX/TXT; PT/EN/ES.
Evaluation: empirical stability across environments (user site, WSL, Windows), LID quality (fastText), normalization quality (spaCy).
Contributions: engineering best practices (Builder/Strategy/Factory) to minimize friction and maximize reuse in research/production.

Binary compatibility (NumPy/Thinc/spaCy)

To avoid the classic numpy.dtype size changed error:

We pin compatible versions in pyproject.toml.
If you already had other global packages and hit this error:
1. pip uninstall -y spacy thinc numpy
2. pip cache purge
3. pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"
4. pip install --user --no-cache-dir intelli3text (or -e . from the local repo)

Tip: always install with the same Python interpreter that runs intelli3text. The shebang on the first line of the launcher script shows which one that is (check head -1 ~/.local/bin/intelli3text).
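
Before reinstalling, it can help to check which versions are actually installed. The sketch below compares installed versions against the versions used in the recovery command above. Whether pyproject.toml pins exactly these versions is not confirmed here, and find_mismatches is a hypothetical helper.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

# Versions taken from the recovery command above.
PINS = {"numpy": "1.26.4", "thinc": "8.2.4", "spacy": "3.7.4"}


def find_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs to what was found (None if absent)."""
    mismatches: Dict[str, Optional[str]] = {}
    for pkg, wanted in pins.items():
        try:
            found: Optional[str] = get_version(pkg)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[pkg] = found
    return mismatches


print(find_mismatches(PINS))
```

An empty dictionary means the environment matches. Run the script with the same interpreter that launches the CLI, otherwise it inspects a different site-packages directory.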

Performance tips

Paragraph length: controlled by paragraph_min_chars (default 30) and lid_min_chars (default 60).
LID sample cap: very long texts are truncated (~2k chars) before detection, which speeds up LID with little loss of accuracy.
spaCy model size: sm is lighter; lg gives better quality (default).
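
The sketch below shows one plausible reading of these parameters: it is an assumption that paragraphs shorter than paragraph_min_chars are dropped and that paragraphs shorter than lid_min_chars skip detection. The 2000-character cap approximates the "~2k chars" figure, and both function names are hypothetical.

```python
import re
from typing import List, Optional


def split_paragraphs(text: str, paragraph_min_chars: int = 30) -> List[str]:
    """Split on blank lines and keep paragraphs that are long enough."""
    parts = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in parts if len(p) >= paragraph_min_chars]


def lid_sample(paragraph: str, lid_min_chars: int = 60, cap: int = 2000) -> Optional[str]:
    """Return the text to feed the detector, or None if the paragraph is too short."""
    if len(paragraph) < lid_min_chars:
        return None
    # fastText's predict() handles one line at a time, so newlines are replaced.
    return paragraph[:cap].replace("\n", " ")


text = "Too short.\n\n" + "x" * 40 + "\n\n" + "y" * 5000
paragraphs = split_paragraphs(text)
print([len(p) for p in paragraphs])
print([None if lid_sample(p) is None else len(lid_sample(p)) for p in paragraphs])
```

Under this reading, a paragraph that is kept but too short for LID would need a language from elsewhere, for example the documented fallback or the document's global language.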

Extensibility

New sources: implement IExtractor and register in PipelineBuilder.
New cleaners: implement ICleaner and map it in NAME2CLEANER.
New LIDs: implement the interface under lid/base.py.
Exporters: implement IExporter (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.

Troubleshooting

Trafilatura ‘unidecode’ warning: already handled — we depend on Unidecode.
No Internet on first run:
- LID: fallback "pt", 0.0.
- spaCy: spacy.blank(<lang>).
- Later, with Internet, run again to fetch full models.
ModuleNotFoundError: fasttext:
- We depend on fasttext-wheel (prebuilt wheels).
- Reinstall: pip install fasttext-wheel.

More tips and parameter-by-parameter guidance: https://jeffersonspeck.github.io/intelli3text/

Roadmap

Exporters: HTML/Markdown with paragraph navigation.
Quality metrics (lexical density, diversity, etc.).
More languages via custom spaCy models.
Optional normalization using Stanza.

License

MIT — you’re free to use, modify and distribute.

Note: the original upstream licenses of third-party models and libraries still apply.

How to cite

Speck, J. (2025). intelli3text: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: https://github.com/jeffersonspeck/intelli3text

Release history

0.2.7: Oct 13, 2025
0.2.6: Oct 13, 2025
0.2.5: Oct 13, 2025
0.2.4: Oct 13, 2025 (this version)
0.2.3: Oct 12, 2025
0.2.2: Oct 12, 2025
0.2.1: Oct 12, 2025

Download files

Source Distribution: intelli3text-0.2.4.tar.gz (30.2 kB)

File metadata:
- Upload date: Oct 13, 2025
- Size: 30.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes for intelli3text-0.2.4.tar.gz:
- SHA256: e2619f9920c3cfdd8d4dee1df0b27007e0d3e70505b14e01d2852a6eb9b25cbd
- MD5: cc1cc84937cfd1c4c65d6bdf0385f071
- BLAKE2b-256: 9b51083d00ab894b3893d777262868980054374da8ad39d61199246928872235
