# engineX

engineX is a Python package for semantic search in PDF documents and web pages. It allows users to extract relevant information from PDFs and websites based on natural language queries.

## Installation

You can install engineX using pip:

```bash
pip install engineX
```

## Usage

engineX provides two main functions: `process_pdf` for analyzing PDF documents and `crawl_and_query` for searching web pages.

### Processing a PDF

To search within a PDF document:

```python
from engineX import process_pdf

pdf_path = "path/to/your/document.pdf"
query = "What is the main topic of this document?"

# Returns (chunk, similarity) tuples for the passages most relevant to the query
results = process_pdf(pdf_path, query)

for chunk, similarity in results:
    print(f"Similarity: {similarity:.4f}")
    print(f"Chunk: {chunk[:200]}...")
    print("-" * 50)
```

### Crawling and Querying a Web Page

To search the content of a web page:

```python
from engineX import crawl_and_query

url = "https://example.com"
query = "What services does this website offer?"

results = crawl_and_query(url, query)

for chunk, similarity in results:
    print(f"Similarity: {similarity:.4f}")
    print(f"Chunk: {chunk[:200]}...")
    print("-" * 50)
```

Both functions return a list of tuples, where each tuple contains a relevant text chunk and its similarity score to the query.
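
Because the results are plain `(chunk, similarity)` tuples, they can be post-processed with ordinary Python. The sketch below is illustrative only: `filter_results` is a hypothetical helper, not part of engineX, and the sample data is made up to mirror the return shape described above.

```python
from typing import List, Tuple

Result = Tuple[str, float]  # (text chunk, similarity score)


def filter_results(results: List[Result], min_similarity: float = 0.5) -> List[Result]:
    """Keep results at or above min_similarity, highest score first.

    Hypothetical helper for illustration; not part of engineX.
    """
    kept = [(chunk, score) for chunk, score in results if score >= min_similarity]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up sample in the same shape that process_pdf / crawl_and_query return.
sample_results = [
    ("Introduction to the topic.", 0.42),
    ("The main finding is reported here.", 0.87),
    ("Related background material.", 0.61),
]

for chunk, score in filter_results(sample_results, min_similarity=0.6):
    print(f"{score:.2f}  {chunk}")
```

A useful threshold depends on the embedding model and the content, so treat the default of 0.5 as a placeholder rather than a recommendation.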

## Configuration

engineX can be customized by modifying the config.py file in the package directory. Here are the available configuration options:

### Embedding Model

  • Variable: EMBEDDING_MODEL
  • Default: "sentence-transformers/all-MiniLM-L6-v2"
  • Description: The name of the Hugging Face model used for text embeddings.

### Text Chunking

  • Variables: CHUNK_SIZE and CHUNK_OVERLAP
  • Defaults: CHUNK_SIZE = 1024, CHUNK_OVERLAP = 80
  • Description: Control how the text is split into chunks for processing.
    • CHUNK_SIZE: Maximum number of characters in each chunk.
    • CHUNK_OVERLAP: Number of characters that overlap between consecutive chunks.
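
To make the interaction between the two settings concrete, here is a minimal sketch of character-based chunking with overlap. It is an assumption about what such a splitter typically looks like, not engineX's actual implementation; `chunk_text` is a hypothetical name.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 80) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    Illustrative only; engineX may split text differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

In this sketch a larger overlap means more chunks for the same text, because the window advances by `chunk_size - chunk_overlap` characters each step.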

### Search Results

  • Variable: TOP_K_RESULTS
  • Default: 5
  • Description: The number of top results to return from the semantic search.
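
The following sketch shows what a top-k selection of this kind usually involves. It assumes cosine similarity as the scoring function, which the description above does not confirm, and uses toy two-dimensional vectors in place of real embeddings; `cosine_similarity` and `top_k_chunks` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = [
        (chunk, cosine_similarity(query_vec, vec))
        for chunk, vec in zip(chunks, chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D vectors standing in for real embeddings.
chunks = ["pricing details", "company history", "pricing and history"]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vec = [1.0, 0.0]

for chunk, score in top_k_chunks(query_vec, chunk_vecs, chunks, k=2):
    print(f"{score:.4f}  {chunk}")
```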

### How to Modify Configuration

To change these settings, locate the config.py file in your engineX installation directory and edit the values. For example:

```python
# config.py

EMBEDDING_MODEL = "sentence-transformers/all-mpnet-base-v2"
CHUNK_SIZE = 2048
CHUNK_OVERLAP = 100
TOP_K_RESULTS = 10
```

After modifying the config file, restart your Python environment or reload the engineX package for the changes to take effect.
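
If you are unsure where the installation directory is, the standard library can locate it without importing the package. This is a minimal sketch: `find_package_file` is a hypothetical helper, and it assumes `config.py` sits directly in the package directory as described above.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_file(package: str, filename: str) -> Optional[Path]:
    """Return the path to filename inside an installed package, or None.

    Hypothetical helper; uses only importlib from the standard library.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None  # package not installed, or a single-file module
    for location in spec.submodule_search_locations:
        candidate = Path(location) / filename
        if candidate.is_file():
            return candidate
    return None


# Prints the full path to config.py, or None if engineX is not installed.
print(find_package_file("engineX", "config.py"))
```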

### Viewing Current Configuration

You can view the current configuration settings in your Python script or interactive session:

```python
import engineX

engineX.print_config()
```

This will display the current values of all configuration options.

## Examples

### Analyzing a Research Paper

```python
from engineX import process_pdf

pdf_path = "research_paper.pdf"
query = "What are the key findings of this research?"

results = process_pdf(pdf_path, query)

print("Key findings from the research paper:")
for chunk, similarity in results:
    print(f"Relevance: {similarity:.2f}")
    print(chunk)
    print("-" * 50)
```

### Extracting Information from a Company Website

```python
from engineX import crawl_and_query

url = "https://www.company.com/about"
query = "What is the company's mission statement?"

results = crawl_and_query(url, query)

print("Company mission statement:")
for chunk, similarity in results:
    if similarity > 0.8:  # Only print highly relevant results
        print(chunk)
        break
```

## Troubleshooting

If you encounter any issues:

  • Ensure you have the latest version of engineX installed.
  • Check that all dependencies are correctly installed.
  • Verify that the PDF file or URL you're trying to access is valid and accessible.
  • If you've modified the configuration, try reverting to default settings.
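
The input checks in the list above can be partly automated before calling engineX. The sketch below uses only the standard library; `looks_like_pdf` and `looks_like_http_url` are hypothetical helpers, and they check only the file signature and the URL shape, not whether the content is readable or the server is reachable.

```python
from pathlib import Path
from urllib.parse import urlparse


def looks_like_pdf(path: str) -> bool:
    """True if path is an existing file that starts with the PDF signature."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"


def looks_like_http_url(url: str) -> bool:
    """True if url has an http or https scheme and a host part."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(looks_like_pdf("path/to/your/document.pdf"))
print(looks_like_http_url("https://example.com"))
```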

For persistent issues, please open an issue on our GitHub repository.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

Download files

Source Distribution

engineX-0.5.6.tar.gz (4.6 kB)

Uploaded Source

Built Distribution

engineX-0.5.6-py3-none-any.whl (9.7 kB)

Uploaded Python 3

File details

Details for the file engineX-0.5.6.tar.gz.

File metadata

  • Download URL: engineX-0.5.6.tar.gz
  • Upload date:
  • Size: 4.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.3

File hashes

Hashes for engineX-0.5.6.tar.gz
Algorithm Hash digest
SHA256 003a2e710e81f5367645cb50fb597ee649ab54774ac129464b5c695d2199cd68
MD5 2b4c7bc91b597275a3576137dd4bde8b
BLAKE2b-256 7548c6da049d45a4ea5fed823198558b6193d8820cdb908f110e192e7c3bf4bd
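
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the local file path is whatever location you saved the download to.

```python
import hashlib


def sha256_of_file(path: str, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in blocks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, call `matches_published_hash` with the path to your copy of `engineX-0.5.6.tar.gz` and the SHA256 value from the list above; a `False` result means the file differs from the one that was published.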

File details

Details for the file engineX-0.5.6-py3-none-any.whl.

File metadata

  • Download URL: engineX-0.5.6-py3-none-any.whl
  • Upload date:
  • Size: 9.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.11.3

File hashes

Hashes for engineX-0.5.6-py3-none-any.whl
Algorithm Hash digest
SHA256 ec03c93c6ff8cfc1ae59fb8df72dc16f0c3e04f8b8676e0c130dd3ef9d7bf3c3
MD5 454d50203b1a995205e0c59ff748419c
BLAKE2b-256 7634ad8636b2dcb08cfca1a6b50f9ad20d9ba8ff4fdaeb10b767b8019de0e301
