This module converts PDF, PPTX, and DOCX files into plain text strings so that their contents are easier to work with in Python.

Documentation for Document Converter Code

This Python code provides a set of classes that extract text from three document formats: PDF, PPTX (PowerPoint), and DOCX (Word). It uses the PyPDF2, python-pptx, and python-docx libraries. Each class and its methods are described below. To install the package:

pip install doc-converter

Classes

1. `pdf`

The pdf class is responsible for extracting text from PDF files.

Methods

__init__(self, path: str) -> None
- Parameters:
  - path: The directory path where the PDF files are located.
- Description: Initializes the class with the specified directory path.

pdf_to_text(self) -> List[str]
- Returns: A list of strings, each containing the text extracted from a PDF file.
- Description:
  - Changes the working directory to the specified path.
  - Scans for all PDF files in the directory.
  - Reads each PDF file and extracts text from all pages.
  - Returns a list containing the extracted text from each PDF file.
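
The steps above can be sketched roughly as follows. This is not the package's actual source, only a minimal illustration assuming PyPDF2's `PdfReader`, its `pages` sequence, and each page's `extract_text()` method. The function names `pdf_file_to_text` and `pdf_dir_to_text` are hypothetical, and unlike the described class the sketch globs the directory directly instead of changing the working directory.

```python
from pathlib import Path
from typing import List


def pdf_file_to_text(pdf_path: Path) -> str:
    """Return the text of every page in one PDF, joined by newlines."""
    # Imported lazily so this sketch can be loaded without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    # Fall back to "" in case a page yields no extractable text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pdf_dir_to_text(directory: str) -> List[str]:
    """Return one string per *.pdf file found directly in `directory`."""
    pdf_paths = sorted(Path(directory).glob("*.pdf"))
    return [pdf_file_to_text(path) for path in pdf_paths]
```

Sorting the paths is a choice made here so that the order of the returned list is predictable; the source does not say in what order the package returns its results.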

2. `pptx_files`

The pptx_files class is designed to extract text from PPTX (PowerPoint) files.

Methods

__init__(self, path: str) -> None
- Parameters:
  - path: The directory path where the PPTX files are located.
- Description: Initializes the class with the specified directory path.

pptx_to_text(self) -> List[str]
- Returns: A list of strings, each containing the text extracted from a PPTX file.
- Description:
  - Changes the working directory to the specified path.
  - Scans for all PPTX files in the directory.
  - Reads each PPTX file and extracts text from all slides and shapes.
  - Returns a list containing the extracted text from each PPTX file.
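
As an illustration of reading "all slides and shapes", the sketch below walks a presentation with python-pptx. It is an assumption about how such extraction could look, not the package's code; `pptx_file_to_text` and `pptx_dir_to_text` are hypothetical names. It only collects text from shapes that carry a text frame, so text inside tables or grouped shapes is not covered.

```python
from pathlib import Path
from typing import List


def pptx_file_to_text(pptx_path: Path) -> str:
    """Return the text of every text-bearing shape on every slide."""
    # Imported lazily so this sketch can be loaded without python-pptx installed.
    from pptx import Presentation

    presentation = Presentation(str(pptx_path))
    chunks = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            # Pictures, connectors and similar shapes have no text frame.
            if shape.has_text_frame:
                chunks.append(shape.text_frame.text)
    return "\n".join(chunks)


def pptx_dir_to_text(directory: str) -> List[str]:
    """Return one string per *.pptx file found directly in `directory`."""
    pptx_paths = sorted(Path(directory).glob("*.pptx"))
    return [pptx_file_to_text(path) for path in pptx_paths]
```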

3. `DocxtoText`

The DocxtoText class is responsible for extracting text from DOCX (Word) files.

Methods

__init__(self, path: str) -> None
- Parameters:
  - path: The directory path where the DOCX files are located.
- Description: Initializes the class with the specified directory path.

docx_to_text(self) -> List[str]
- Returns: A list of strings, each containing the text extracted from a DOCX file.
- Description:
  - Changes the working directory to the specified path.
  - Scans for all DOCX files in the directory.
  - Reads each DOCX file and extracts text from all paragraphs.
  - Returns a list containing the extracted text from each DOCX file.
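
Paragraph-level extraction with python-docx might look like the sketch below. It is a hedged illustration rather than the package's implementation, and the names `docx_file_to_text` and `docx_dir_to_text` are hypothetical. Note that `Document.paragraphs` covers body paragraphs only, so text inside tables would not be included.

```python
from pathlib import Path
from typing import List


def docx_file_to_text(docx_path: Path) -> str:
    """Return the text of every body paragraph, one paragraph per line."""
    # Imported lazily so this sketch can be loaded without python-docx installed.
    from docx import Document

    document = Document(str(docx_path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def docx_dir_to_text(directory: str) -> List[str]:
    """Return one string per *.docx file found directly in `directory`."""
    docx_paths = sorted(Path(directory).glob("*.docx"))
    return [docx_file_to_text(path) for path in docx_paths]
```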

Example Usage

To use these classes, instantiate the desired class with the appropriate path and call the corresponding text extraction method:

# Example for extracting text from DOCX files
docx_converter = DocxtoText(path=r"C:\Users\Ibrahim.intern\Desktop\Converter")
docx_texts = docx_converter.docx_to_text()
print(len(docx_texts))  # Prints the number of DOCX files processed

# Example for extracting text from PDF files
pdf_converter = pdf(path=r"C:\Users\Ibrahim.intern\Desktop\Converter")
pdf_texts = pdf_converter.pdf_to_text()
print(len(pdf_texts))  # Prints the number of PDF files processed

# Example for extracting text from PPTX files
pptx_converter = pptx_files(path=r"C:\Users\Ibrahim.intern\Desktop\Converter")
pptx_texts = pptx_converter.pptx_to_text()
print(len(pptx_texts))  # Prints the number of PPTX files processed
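
Each extraction method is described as changing the working directory, so relative paths used later in the same script may resolve against a different folder. The source does not say whether the classes restore the original directory afterwards. Assuming they do not, a small context manager such as the hypothetical `preserve_cwd` below can put it back:

```python
import os
from contextlib import contextmanager


@contextmanager
def preserve_cwd():
    """Restore the current working directory when the block exits."""
    original = os.getcwd()
    try:
        yield
    finally:
        # Runs even if the wrapped code raises an exception.
        os.chdir(original)


# Intended use (class and method names taken from the examples above):
# with preserve_cwd():
#     docx_texts = DocxtoText(path=some_directory).docx_to_text()
```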

Requirements

To run this code, ensure that the following packages are installed:

python-docx
python-pptx
PyPDF2

You can install these packages using pip:

pip install python-docx python-pptx PyPDF2

Notes

The code assumes that the specified directory contains files of the supported formats. If there are no files of a given format, an empty list will be returned for that format.
Make sure to handle any potential exceptions that may arise when reading files, such as file access issues or file format errors.
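
One possible way to follow this advice is to extract files one at a time and record failures instead of stopping at the first unreadable file. The sketch below is an assumption about how a caller might do that; `extract_safely` and its `extractor` parameter are hypothetical and are not part of the package.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def extract_safely(
    paths: Iterable[Path],
    extractor: Callable[[Path], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extractor` on each path; return (texts, errors) keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            texts[str(path)] = extractor(path)
        except Exception as exc:
            # Deliberately broad: access errors and parser errors differ by library.
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

Any per-file function that takes a path and returns a string can be passed as `extractor`. Files that fail end up in `errors` with the exception type and message, so they can be reported or retried later.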

Release history

- 0.0.3 (this version): Oct 4, 2024
- 0.0.2: Oct 1, 2024
- 0.0.1: Oct 1, 2024

