Extractor Wrapper
English
extractor_wrapper is a lightweight Python package that provides a unified way to extract text (and basic metadata) from common document formats. Instead of learning multiple libraries, you can use one interface to work with:
- PDF files (.pdf)
- Word documents (.doc and .docx)
- Excel spreadsheets (.xlsx)
- PowerPoint presentations (.pptx)
- Plain text files (.txt)
- Outlook message files (.msg)
Principle
- Each format has its own extractor class under extractor_wrapper.ext, for example:
  - PDFExtractor
  - DOCExtractor
  - DOCXExtractor
  - XLSXExtractor
  - PPTXExtractor
  - TXTExtractor
  - MSGExtractor
- If a required third-party library for an extractor is missing at runtime, that extractor is marked as unavailable rather than causing an import error.
- You can either:
  - Use an extractor class directly, e.g. PDFExtractor("file.pdf").
  - Use the factory method ExtractorFactory.auto_extract(path) to automatically pick the right extractor based on file extension.
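The "marked as unavailable" behaviour can be pictured with a minimal sketch. This is not the package's actual code: the class names, the required_module attribute, and the is_available() method are all hypothetical, chosen only to show how a missing dependency can be detected without failing at import time.

```python
import importlib.util


class ExtractorUnavailableError(RuntimeError):
    """Raised when an extractor's dependency is not installed (hypothetical name)."""


class OptionalDependencyExtractor:
    """Hypothetical base class: the dependency is probed lazily, never at import time."""

    # Importable top-level module the extractor depends on; subclasses override it.
    required_module = None

    @classmethod
    def is_available(cls):
        # find_spec only looks the module up; it does not import it.
        if cls.required_module is None:
            return True
        return importlib.util.find_spec(cls.required_module) is not None

    def __init__(self):
        # Importing this class always works; only instantiation can fail.
        if not self.is_available():
            raise ExtractorUnavailableError(
                f"{type(self).__name__} requires the '{self.required_module}' module"
            )


class JsonBackedExtractor(OptionalDependencyExtractor):
    required_module = "json"  # standard library, so always present


class MissingBackendExtractor(OptionalDependencyExtractor):
    required_module = "surely_not_installed_module_xyz"  # deliberately absent
```

With this layout, importing the module that defines MissingBackendExtractor never raises; the error only appears when the class is instantiated, which matches the behaviour described above.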
Simple Usage Examples
1. Using the factory (ExtractorFactory.auto_extract)
from extractor_wrapper import ExtractorFactory

path = "documents/report.pdf"

try:
    # The factory picks the extractor from the file extension.
    content = ExtractorFactory.auto_extract(path)
    print("Extracted content (first 500 chars):")
    print(content[:500])
except Exception as e:
    # Covers a missing dependency as well as unreadable or unsupported files.
    print(f"Extraction failed: {e}")
What happens here:
- auto_extract(path) inspects the file extension (.pdf) and instantiates PDFExtractor behind the scenes.
- It returns the full text content (as a single string).
- If the required library (pdfminer.six or PyPDF2) is not installed, it raises an error.
2. Instantiating an extractor class directly
from extractor_wrapper.ext.docx import DOCXExtractor

path = "notes/project.docx"

# No explicit availability check is needed: if a required library is missing,
# initialization raises, so the extractor is created inside the try block.
try:
    docx_ext = DOCXExtractor()
    content = docx_ext.extract(path)
    print("DOCX content (first 300 chars):")
    print(content[:300])
except Exception as e:
    print(f"Cannot extract DOCX: {e}")
What happens here:
- You import DOCXExtractor, create an instance, and pass the .docx path to extract().
- The extract() method returns a single string containing all paragraphs from the document.
- If python-docx is missing, instantiation or extract() will fail with a clear message.
3. Looping over a folder of mixed files
import os

from extractor_wrapper import ExtractorFactory
from loggerplusplus import Logger

logger = Logger(identifier="BatchTest")

INPUT_DIR = "test_files"
OUTPUT_DIR = "extracted_texts"
os.makedirs(OUTPUT_DIR, exist_ok=True)

for filename in os.listdir(INPUT_DIR):
    full_path = os.path.join(INPUT_DIR, filename)
    # Skip subdirectories and other non-file entries.
    if not os.path.isfile(full_path):
        continue

    logger.info(f"Testing {filename}")
    try:
        content = ExtractorFactory.auto_extract(full_path)
        # Keep the original name and append .txt (report.pdf -> report.pdf.txt).
        out_file = os.path.join(OUTPUT_DIR, filename + ".txt")
        with open(out_file, "w", encoding="utf-8") as f:
            f.write(content)
        logger.info(f"Saved extracted text to {out_file}")
    except Exception as e:
        # One bad file should not stop the batch.
        logger.error(f"Failed to extract {filename}: {e}")
What happens here:
- We iterate over each file in test_files/ (top level only; subdirectories are skipped).
- For each file, call ExtractorFactory.auto_extract, then write the returned string to a .txt file under extracted_texts/.
- If an extractor is unavailable (missing dependency) or any error occurs, it’s caught and logged.
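The batch loop writes the result of auto_extract straight to a text file, which assumes a string. Since the spreadsheet and presentation extractors are described as possibly producing a dictionary or a list, a small normalizing step can make the loop more robust. The helper below is only a sketch: to_text is a hypothetical name, and the shapes it handles (sheet name → rows, list of slide strings) are assumptions taken from that description, not a documented contract.

```python
def to_text(content):
    """Flatten an extraction result into one string (illustrative helper, not part of the package)."""
    if isinstance(content, str):
        return content
    if isinstance(content, dict):
        # Assumed shape: {sheet_name: [row, row, ...]}, each row a sequence of cells.
        parts = []
        for sheet_name, rows in content.items():
            parts.append(f"# {sheet_name}")
            for row in rows:
                # Empty cells become empty strings; cells are tab-separated.
                parts.append("\t".join("" if cell is None else str(cell) for cell in row))
        return "\n".join(parts)
    if isinstance(content, (list, tuple)):
        # Assumed shape: one string per slide, separated by a blank line.
        return "\n\n".join(str(item) for item in content)
    # Fallback for any other type.
    return str(content)
```

In the loop, f.write(to_text(content)) would then accept any of these result types.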
Project Structure (simplified)
extractor_wrapper/                 ← repository root
├── extractor_wrapper/             ← Python package
│   ├── __init__.py                ← may define `ExtractorFactory.auto_extract`
│   └── ext/
│       ├── base.py                ← BaseExtractor (shared interface)
│       ├── pdf.py                 ← PDFExtractor implementation
│       ├── doc.py                 ← DOCExtractor implementation
│       ├── docx.py                ← DOCXExtractor implementation
│       ├── xlsx.py                ← XLSXExtractor implementation
│       ├── pptx.py                ← PPTXExtractor implementation
│       ├── txt.py                 ← TXTExtractor implementation
│       └── msg.py                 ← MSGExtractor implementation
├── README.md                      ← this file
├── LICENSE                        ← GNU GPL v3 license text
└── setup.py                       ← package installation script
How It Works
- Extractor classes (extractor_wrapper/ext/*.py) each implement an extract() method:
  - For text-based formats (PDF, DOC, DOCX, TXT, MSG), extract() returns a single string of the full text.
  - For spreadsheets (.xlsx), the extractor may return a dictionary of sheet names → rows; the factory returns the raw content, either as a string or as structured data.
  - For presentations (.pptx), extract() returns a list of slide strings (the factory might join them internally before returning).
- Availability:
  - If the underlying required library is missing, instantiating the extractor or calling extract() raises an informative exception instead of silently failing.
- Factory (ExtractorFactory.auto_extract):
  - Checks the file extension, picks the right extractor class, calls extract(), and returns its result.
  - Simplifies your code so you don’t need to write if-elif blocks on extensions.
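The dispatch step described for the factory can be sketched as a lookup table from extension to extractor class. This is an illustration under assumptions, not the package's implementation: SketchFactory, PlainTextExtractor, and the register() method are hypothetical names, and only a plain-text stand-in extractor is included so the example runs without third-party libraries.

```python
from pathlib import Path


class PlainTextExtractor:
    """Stand-in extractor for the sketch; reads a file as UTF-8 text."""

    def extract(self, path):
        return Path(path).read_text(encoding="utf-8")


class SketchFactory:
    """Hypothetical factory: maps lowercase extensions to extractor classes."""

    _registry = {".txt": PlainTextExtractor}

    @classmethod
    def register(cls, extension, extractor_cls):
        # Lets additional formats be added without touching auto_extract.
        cls._registry[extension.lower()] = extractor_cls

    @classmethod
    def auto_extract(cls, path):
        # Compare extensions case-insensitively (.TXT and .txt are the same format).
        extension = Path(path).suffix.lower()
        try:
            extractor_cls = cls._registry[extension]
        except KeyError:
            raise ValueError(f"No extractor registered for {extension!r}") from None
        return extractor_cls().extract(path)
```

A dictionary lookup replaces the if-elif chain on extensions, and an unknown extension produces an explicit error instead of falling through silently.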
Author
Project created and maintained by Florian BARRE. For questions or contributions, feel free to contact me. My Website | LinkedIn | GitHub