Ingestion (web/PDF/DOCX/TXT), cleaning, paragraph-level LID (PT/EN/ES), and spaCy-based normalization; PDF export.
intelli3text
Ingestion of texts (Web/PDF/DOCX/TXT), cleaning, and multilingual normalization (PT/EN/ES) with paragraph-level language detection and PDF export.
Focus on frictionless install (pip install): on first run it auto-downloads the required models (fastText LID and spaCy) and works offline with sensible fallbacks.
Docs website: https://jeffersonspeck.github.io/intelli3text/
PyPI: https://pypi.org/project/intelli3text/
Repository: https://github.com/jeffersonspeck/intelli3text
Table of Contents
- Usage Manual
- Why this project?
- Key features
- Requirements
- Installation
- Quick start (CLI)
- CLI examples
- Python usage (API)
- Language identification (LID)
- spaCy models & normalization
- Cleaning pipeline
- PDF export
- Cache, auto-downloads & offline mode
- Architecture & Design Patterns
- Design Science Research (DSR)
- Binary compatibility (NumPy/Thinc/spaCy)
- Performance tips
- Extensibility
- Troubleshooting
- Publishing to PyPI
- Roadmap
- License
- How to cite
Why this project?
In research and production, common needs include:
- Ingest text from heterogeneous sources (web, PDFs, DOCX, TXT);
- Clean and normalize the content;
- Lemmatize and remove stopwords;
- Detect language accurately, including bilingual documents;
- Export results with traceability (PDF that shows normalized, cleaned, and raw text).
intelli3text is built to be plug-and-play: pip install and go — no native toolchains, no manual compiles, no painful environment setup.
Key features
- Ingestion: URL (HTML), PDF (pdfminer.six), DOCX (python-docx), TXT.
- Cleaning: Unicode fixes (ftfy), noise removal (clean-text), PDF-specific line-break & hyphenation heuristics.
- Paragraph-level LID: fastText LID (176 languages) with tolerant fallback.
- spaCy normalization: lemmatized tokens without stopwords/punctuation; PT/EN/ES.
- PDF export: summary, global normalized text, per-paragraph table and sections for cleaned/normalized/raw text.
- Auto-download on first run:
  - lid.176.bin (fastText LID);
  - spaCy models for PT/EN/ES (lg→md→sm) with offline fallback.
- CLI & Python API: use from shell or embed in code.
Requirements
- Python 3.9+
- Internet only on first run (to download models). After that, it works offline.
- To avoid binary mismatches, the package pins compatible versions of numpy, thinc, and spacy.
Installation
pip install intelli3text
# or from a local repo:
# pip install .
No extra scripts. On first execution, required models are fetched to a local cache automatically.
Quick start (CLI)
intelli3text "https://pt.wikipedia.org/wiki/Howard_Gardner" --export-pdf output.pdf
Output:
- JSON to stdout with language_global, cleaned, normalized, and a list of paragraphs.
- A PDF report at output.pdf.
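As a rough illustration of consuming that JSON, the sketch below summarizes a result document. It assumes only the top-level fields listed above plus a per-paragraph "language" key (as used in the Python API example); summarize_result and the sample data are hypothetical and not part of the package.

```python
import json
from collections import Counter


def summarize_result(raw_json: str) -> dict:
    """Summarize a result document shaped like the CLI's JSON output."""
    result = json.loads(raw_json)
    paragraphs = result.get("paragraphs", [])
    # Count how many paragraphs were assigned to each language.
    per_language = Counter(p.get("language", "unknown") for p in paragraphs)
    return {
        "language_global": result.get("language_global"),
        "paragraph_count": len(paragraphs),
        "paragraphs_per_language": dict(per_language),
    }


# Hypothetical sample standing in for real CLI output.
sample = json.dumps({
    "language_global": "pt",
    "cleaned": "...",
    "normalized": "...",
    "paragraphs": [
        {"language": "pt", "normalized": "teoria inteligência"},
        {"language": "en", "normalized": "theory intelligence"},
        {"language": "pt", "normalized": "psicólogo"},
    ],
})
summary = summarize_result(sample)
```

In practice the input would come from a file written with --json-out or from the CLI's stdout.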
CLI examples
- Local PDF:
  intelli3text "./my_paper.pdf" --export-pdf report.pdf
- Choose spaCy model size:
  intelli3text "URL" --nlp-size md # options: lg (default) | md | sm
- Select cleaners:
  intelli3text "URL" --cleaners ftfy,clean_text,pdf_breaks
- Save JSON to file:
  intelli3text "URL" --json-out result.json
- Use CLD3 as primary (if installed as extra):
  pip install intelli3text[cld3]
  intelli3text "URL" --lid-primary cld3 --lid-fallback none
Full CLI reference: see Docs → CLI on the website: https://jeffersonspeck.github.io/intelli3text/
Python usage (API)
from intelli3text import PipelineBuilder, Intelli3Config
cfg = Intelli3Config(
    cleaners=["ftfy", "clean_text", "pdf_breaks"],
    lid_primary="fasttext",  # or "cld3" if you installed the extra
    lid_fallback=None,       # or "cld3"
    nlp_model_pref="lg",     # "lg" | "md" | "sm"
    export={"pdf": {"path": "output.pdf", "include_global_normalized": True}},
)
pipeline = PipelineBuilder(cfg).build()
res = pipeline.process("https://pt.wikipedia.org/wiki/Howard_Gardner")
print(res["language_global"], len(res["paragraphs"]))
print(res["paragraphs"][0]["language"], res["paragraphs"][0]["normalized"][:200])
More samples (including safe-to-import examples): Docs → Examples.
Language identification (LID)
- Primary: fastText LID (lid.176.bin), auto-downloaded on first use.
- Tolerant: if fasttext is unavailable, the pipeline won't crash; it returns "pt" with confidence 0.0 as a safe fallback.
- Accuracy: detection is per paragraph; language_global is the most frequent paragraph language.
- Optional: pycld3 via extra:
  pip install intelli3text[cld3]
  # CLI: --lid-primary cld3 --lid-fallback none
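The behavior described above (primary detector, optional fallback, a safe ("pt", 0.0) default, and a global language chosen by majority) can be sketched roughly as follows. This is an illustration under assumptions, not the package's code: detect_paragraph, global_language, and the detector callables are hypothetical stand-ins for the real fastText/CLD3 wrappers.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A detector maps text to (language code, confidence). Hypothetical signature.
Detector = Callable[[str], Tuple[str, float]]

SAFE_FALLBACK: Tuple[str, float] = ("pt", 0.0)


def detect_paragraph(
    text: str,
    primary: Optional[Detector],
    fallback: Optional[Detector] = None,
) -> Tuple[str, float]:
    """Try the primary detector, then the fallback; never raise."""
    for detector in (primary, fallback):
        if detector is None:
            continue
        try:
            return detector(text)
        except Exception:
            # An unavailable or failing detector is skipped, not fatal.
            continue
    return SAFE_FALLBACK


def global_language(paragraph_languages: List[str]) -> str:
    """Return the most frequent paragraph language."""
    if not paragraph_languages:
        return SAFE_FALLBACK[0]
    return Counter(paragraph_languages).most_common(1)[0][0]
```

A bilingual document therefore keeps its per-paragraph labels, while language_global only reports the majority.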
spaCy models & normalization
- Size preference: lg→md→sm.
- If the model is missing, the library tries to download it.
- Offline: falls back to spacy.blank(<lang>) with a sentencizer (no crash).
- Normalization includes:
  - tokenization;
  - dropping stopwords/punctuation/whitespace;
  - lemmatization (when the model has a lexicon);
  - joining lemmas.
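The size-preference chain and blank-pipeline fallback can be sketched as below. The model names follow spaCy's published naming for PT/EN/ES pipelines; candidate_models and load_with_fallback are hypothetical helpers, and starting the chain at the preferred size is an assumption about how the preference is applied. The load and blank callables are injected so the sketch runs without spaCy installed; in real use they would be spacy.load and spacy.blank.

```python
from typing import Any, Callable, List

# spaCy package-name prefixes for the three supported languages.
MODEL_PREFIX = {"pt": "pt_core_news", "en": "en_core_web", "es": "es_core_news"}
SIZES = ["lg", "md", "sm"]


def candidate_models(lang: str, preferred_size: str = "lg") -> List[str]:
    """List model names to try, from the preferred size down to sm."""
    start = SIZES.index(preferred_size)
    return [f"{MODEL_PREFIX[lang]}_{size}" for size in SIZES[start:]]


def load_with_fallback(
    lang: str,
    preferred_size: str,
    load: Callable[[str], Any],
    blank: Callable[[str], Any],
) -> Any:
    """Return the first model that loads, else a blank pipeline with a sentencizer."""
    for name in candidate_models(lang, preferred_size):
        try:
            return load(name)
        except OSError:
            # spacy.load raises OSError when the model package is not installed.
            continue
    nlp = blank(lang)
    nlp.add_pipe("sentencizer")
    return nlp
```

Usage would look like load_with_fallback("pt", "lg", spacy.load, spacy.blank); the download attempt mentioned above is left out of this sketch.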
Cleaning pipeline
Default order (--cleaners ftfy,clean_text,pdf_breaks):
- FTFY: fixes Unicode glitches.
- clean-text: removes URLs/emails/phones; keeps numbers/punctuation by default.
- pdf_breaks: PDF heuristics (de-hyphenation; merge artificial breaks; collapse multiple newlines).
You can customize the list/order via CLI or API.
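To make the pdf_breaks step concrete, here is a minimal sketch of the three heuristics it names (de-hyphenation, merging artificial breaks, collapsing multiple newlines). The regular expressions are illustrative assumptions, not the package's actual rules, and the function name simply mirrors the cleaner's name.

```python
import re


def pdf_breaks(text: str) -> str:
    """Approximate the PDF line-break heuristics with three regex passes."""
    # 1) De-hyphenation: join a word split across a line break ("exam-\nple").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # 2) Collapse runs of three or more newlines into one paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # 3) Merge single newlines (artificial breaks) into spaces,
    #    leaving blank-line paragraph breaks untouched.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

A pass this simple also joins genuinely hyphenated compounds that happen to fall at a line end, which is one reason the order and selection of cleaners is configurable.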
PDF export
The report includes:
- Summary (global language, total paragraphs),
- Global Normalized Text (optional),
- Per-paragraph table (language, confidence, normalized preview),
- Per-paragraph sections showing:
  - normalized,
  - cleaned,
  - raw.
Library: ReportLab.
Cache, auto-downloads & offline mode
- Default cache directory: ~/.cache/intelli3text/
  Override via env var: INTELLI3TEXT_CACHE_DIR=/your/custom/path
- Auto-download on first use:
  - lid.176.bin (fastText LID),
  - spaCy models PT/EN/ES in order lg→md→sm.
- Offline behavior:
  - LID returns fallback "pt", 0.0 if fastText is unavailable;
  - spaCy uses blank() (functional, but without full lexical features).
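The cache lookup described above can be approximated with a few lines. The environment variable and default path come from this section; resolve_cache_dir and is_cached are hypothetical helpers and may not match what utils.py does.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache directory, honoring INTELLI3TEXT_CACHE_DIR if set."""
    env = os.environ if env is None else env
    override = env.get("INTELLI3TEXT_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "intelli3text"


def is_cached(filename: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Check whether a model file (e.g. lid.176.bin) is already in the cache."""
    return (resolve_cache_dir(env) / filename).is_file()
```

Checking is_cached("lid.176.bin") before going offline is one way to confirm that the first-run download actually happened.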
Architecture & Design Patterns
Applied patterns:
- Builder: PipelineBuilder composes extractors, cleaners, LID, normalizer, and exporters from declarative config.
- Strategy:
  - Extractors (Web/PDF/DOCX/TXT) implement IExtractor.
  - Cleaners implement ICleaner, chained via CleanerChain.
  - Language Detectors implement a simple interface (FastTextLID, CLD3LID).
  - Normalizer implements INormalizer (SpacyNormalizer here).
  - Exporters implement IExporter (PDFExporter here).
- Factory/Registry: lazy loading of spaCy models by lang/size with fallbacks.
- Facade: CLI and Pipeline.process() offer a simple entry point.
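As a small illustration of the Strategy idea for extractors, the sketch below picks an extraction strategy from the shape of the source string. The dispatch rule (http/https means web, otherwise the file suffix decides, with plain text as the default) is an assumption, and source_kind and extract are hypothetical names rather than the package's IExtractor API.

```python
from pathlib import Path
from typing import Callable, Dict
from urllib.parse import urlparse

# An extractor turns a source (URL or path) into raw text. Hypothetical signature.
Extractor = Callable[[str], str]


def source_kind(source: str) -> str:
    """Classify a source as 'web', 'pdf', 'docx', or 'txt'."""
    if urlparse(source).scheme.lower() in ("http", "https"):
        return "web"
    suffix = Path(source).suffix.lower()
    # Unknown suffixes are treated as plain text in this sketch.
    return {".pdf": "pdf", ".docx": "docx"}.get(suffix, "txt")


def extract(source: str, extractors: Dict[str, Extractor]) -> str:
    """Delegate to whichever strategy matches the source."""
    return extractors[source_kind(source)](source)
```

Because callers only see extract(), swapping one strategy for another does not touch the rest of the pipeline.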
Package layout (summary)
src/intelli3text/
  __init__.py
  __main__.py            # CLI
  config.py              # Intelli3Config (parameters)
  utils.py               # cache/download helpers
  builder.py             # PipelineBuilder (Builder)
  pipeline.py            # Pipeline (Facade)
  extractors/            # Strategy
    base.py
    web_trafilatura.py
    file_pdfminer.py
    file_docx.py
    file_text.py
  cleaners/              # Strategy + Chain of Responsibility
    base.py
    chain.py
    unicode_ftfy.py
    clean_text.py
    pdf_linebreaks.py
  lid/                   # Strategy
    base.py
    fasttext_lid.py
    # (optional) cld3_lid.py
  nlp/
    base.py
    registry.py          # Factory/Registry (spaCy models + fallback)
    spacy_normalizer.py  # Strategy
  export/
    base.py
    pdf_reportlab.py     # Strategy
Design Science Research (DSR)
- Artifact: robust ingestion/cleaning/LID/normalization/export pipeline prioritizing reproducibility and trivial install.
- Problem: heterogeneous sources, bilingual content, and environment friction (native deps, binary mismatches).
- Design: auto-downloads, fallbacks, and stable binary pins; per-paragraph LID; auditable PDF report.
- Demonstration: clean CLI & Python API; Web/PDF/DOCX/TXT; PT/EN/ES.
- Evaluation: empirical stability across environments (user site, WSL, Windows), LID quality (fastText), normalization quality (spaCy).
- Contributions: engineering best practices (Builder/Strategy/Factory) to minimize friction and maximize reuse in research/production.
Binary compatibility (NumPy/Thinc/spaCy)
To avoid the classic numpy.dtype size changed error:
- We pin compatible versions in pyproject.toml.
- If you already had other global packages and hit this error:
  pip uninstall -y spacy thinc numpy
  pip cache purge
  pip install --user --no-cache-dir "numpy==1.26.4" "thinc==8.2.4" "spacy==3.7.4"
  pip install --user --no-cache-dir intelli3text   # or: pip install -e . from the local repo
Tip: always use the same Python that runs intelli3text (check head -1 ~/.local/bin/intelli3text).
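Before reinstalling, it can help to see which versions the current interpreter actually resolves. The sketch below uses the standard library's importlib.metadata; the expected versions are the ones from the reinstall command above, and installed_versions and mismatches are hypothetical helper names.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Versions taken from the reinstall command in this section.
EXPECTED = {"numpy": "1.26.4", "thinc": "8.2.4", "spacy": "3.7.4"}


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(
    installed: Dict[str, Optional[str]],
    expected: Dict[str, str] = EXPECTED,
) -> Dict[str, Tuple[Optional[str], str]]:
    """Return {package: (installed, expected)} for every differing version."""
    return {
        name: (installed.get(name), want)
        for name, want in expected.items()
        if installed.get(name) != want
    }
```

Running mismatches(installed_versions(EXPECTED)) with the same Python that launches intelli3text shows at a glance whether a stray global install is shadowing the pinned versions.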
Performance tips
- Paragraph length: controlled by paragraph_min_chars (default 30) and lid_min_chars (default 60).
- LID sample cap: very long texts are truncated (~2k chars) to speed up without hurting accuracy much.
- spaCy model size: sm is lighter; lg gives better quality (default).
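One plausible reading of these parameters is sketched below: paragraphs shorter than paragraph_min_chars are dropped, paragraphs shorter than lid_min_chars are not sent to the detector, and longer ones are truncated before detection. The splitting rule, the handling of short paragraphs, and the exact 2000-character cap are assumptions; split_paragraphs and lid_sample are hypothetical names.

```python
import re
from typing import List, Optional


def split_paragraphs(text: str, paragraph_min_chars: int = 30) -> List[str]:
    """Split on blank lines and keep paragraphs of at least the minimum length."""
    parts = [part.strip() for part in re.split(r"\n\s*\n", text)]
    return [part for part in parts if len(part) >= paragraph_min_chars]


def lid_sample(
    paragraph: str,
    lid_min_chars: int = 60,
    max_chars: int = 2000,
) -> Optional[str]:
    """Return the text to classify, or None if the paragraph is too short."""
    if len(paragraph) < lid_min_chars:
        return None
    # Cap the sample so very long paragraphs do not slow detection down.
    return paragraph[:max_chars]
```

Raising lid_min_chars trades coverage for reliability, since very short snippets are the ones language identifiers most often mislabel.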
Extensibility
- New sources: implement IExtractor and register in PipelineBuilder.
- New cleaners: implement ICleaner and map it in NAME2CLEANER.
- New LIDs: implement the interface under lid/base.py.
- Exporters: implement IExporter (e.g., JSONL/CSV/HTML), expose option in CLI/Builder.
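The cleaner extension point might look roughly like this. The ICleaner protocol and NAME2CLEANER mapping defined here are local stand-ins that reuse the names above; the package's real interface (method name, signature, where the mapping lives) may differ, and CollapseSpaces is a hypothetical cleaner.

```python
import re
from typing import Dict, List, Protocol


class ICleaner(Protocol):
    """Local stand-in for the package's cleaner interface (assumed shape)."""

    def clean(self, text: str) -> str: ...


class CollapseSpaces:
    """Hypothetical cleaner: squeeze runs of spaces/tabs and trim the ends."""

    def clean(self, text: str) -> str:
        return re.sub(r"[ \t]+", " ", text).strip()


# Local stand-in registry mapping cleaner names to instances.
NAME2CLEANER: Dict[str, ICleaner] = {"collapse_spaces": CollapseSpaces()}


def apply_cleaners(text: str, names: List[str]) -> str:
    """Run the named cleaners in order, rejecting unknown names early."""
    unknown = [name for name in names if name not in NAME2CLEANER]
    if unknown:
        raise ValueError(f"unknown cleaners: {unknown}")
    for name in names:
        text = NAME2CLEANER[name].clean(text)
    return text
```

Keeping the registry keyed by name is what lets a comma-separated option such as --cleaners select and order the steps.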
Troubleshooting
- Trafilatura 'unidecode' warning: already handled; we depend on Unidecode.
- No Internet on first run:
  - LID: fallback "pt", 0.0.
  - spaCy: spacy.blank(<lang>).
  - Later, with Internet, run again to fetch full models.
- ModuleNotFoundError: fasttext:
  - We depend on fasttext-wheel (prebuilt wheels).
  - Reinstall: pip install fasttext-wheel.
More tips and parameter-by-parameter guidance: https://jeffersonspeck.github.io/intelli3text/
Roadmap
- Exporters: HTML/Markdown with paragraph navigation.
- Quality metrics (lexical density, diversity, etc.).
- More languages via custom spaCy models.
- Optional normalization using Stanza.
License
MIT — you’re free to use, modify and distribute.
Note: the original upstream licenses of third-party models and libraries still apply.
How to cite
Speck, J. (2025). intelli3text: ingestion, cleaning, paragraph-level LID and spaCy normalization with PDF export. GitHub: https://github.com/jeffersonspeck/intelli3text