A library to process PDF files for Pinecone
Project description
PineconePDFExtractor
PineconePDFExtractor is a Python library that extracts text from PDF files for use with Pinecone.
Installation
Use the package manager pip to install PineconePDFExtractor.
pip install PineconePDFExtractor
Check the latest version here:
https://pypi.org/project/PineconePDFExtractor/
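To compare the installed version against the one listed on PyPI, a minimal sketch using only the standard library might look like the following. The helper name `installed_version` is hypothetical and is not part of PineconePDFExtractor; it assumes Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    Hypothetical helper for illustration; not part of PineconePDFExtractor.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if the package is not installed.
print(installed_version("PineconePDFExtractor"))
```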
Usage
from pdf.PineconePDFExtractor import PdfProcessor
# Create a PdfProcessor instance with a batch size of 200
extractor = PdfProcessor(200)
# Process a list of PDF files
result = extractor.process_files(['file1.pdf', 'file2.pdf'])
# The result is a dictionary with the batch size and a list of documents.
# Each document is a dictionary with these keys:
#   id: file name without extension
#   metadata: number of pages
#   source: file path
#   text: extracted text
# Example result:
# {
#     'batch_size': 200,
#     'documents': [
#         {
#             'id': 'file1',
#             'metadata': {
#                 'pages': 1
#             },
#             'source': 'file1.pdf',
#             'text': 'This is the extracted text from file1.pdf'
#         },
#         {
#             'id': 'file2',
#             'metadata': {
#                 'pages': 2
#             },
#             'source': 'file2.pdf',
#             'text': 'This is the extracted text from file2.pdf'
#         }
#     ]
# }
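The README does not say whether `process_files` groups the documents itself or only reports the batch size. As a sketch under the assumption that `batch_size` is meant as the chunk size for a later upload step, the documents in a result of the shape shown above could be split like this. `iter_batches` and `sample` are hypothetical names used for illustration and are not part of the library.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(result: Dict[str, Any]) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most result['batch_size'] documents.

    Hypothetical helper; assumes the result layout shown in the example above.
    """
    size = result["batch_size"]
    docs = result["documents"]
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


# Hand-built stand-in with the same shape as the example result (not real output).
sample = {
    "batch_size": 2,
    "documents": [
        {
            "id": f"file{i}",
            "metadata": {"pages": 1},
            "source": f"file{i}.pdf",
            "text": "placeholder text",
        }
        for i in range(1, 6)
    ],
}

batches = list(iter_batches(sample))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each yielded list can then be passed to whatever upload step follows. Slicing keeps the documents in their original order.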
Download files
Hashes for PineconePDFExtractor-0.1.8.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1406ef6566590c2da60a2186d0dbd47f58ae55a3e3bfab889503939daf3324bb
MD5 | f7660018ba8d9c2304f24a253c0d2328
BLAKE2b-256 | 8643cbc3117412ad9b70a4724ae7bb60f49309b431bbddfb99ddd9bd6dbb24db
Hashes for PineconePDFExtractor-0.1.8-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | bea71d4505f5361bd297de8fbccc8dd4d74f30b299d52e4913c9d6a8ef533257
MD5 | 6b6300d3ed78369fd5be29924a7a45a0
BLAKE2b-256 | 5c21dea542a1ed7ad2cee8151d7ba44511c7a815e8c5ddf4f75113d81e051358
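A downloaded file can be checked against the SHA256 digests listed above with Python's standard `hashlib` module. The following is a minimal sketch; the helper names `sha256_of_file` and `matches_expected` are hypothetical and not part of PineconePDFExtractor.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("PineconePDFExtractor-0.1.8-py3-none-any.whl", expected)` should return `True` when `expected` is the SHA256 value listed for the wheel and the download is intact.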