
MLXPipeline

A Local-First Data Extraction Pipeline powered by Apple MLX

Overview

MLXPipeline is a Python package for extracting, chunking, and analyzing data from various sources, optimized specifically for Apple Silicon. Unlike cloud-based solutions, MLXPipeline performs all processing, including machine learning inference tasks, locally on your machine.

Built on Apple's MLX framework, it efficiently leverages Apple's M-series chips to provide:

  • 🔒 Enhanced Privacy: All data stays on your device
  • 💰 Cost Efficiency: No API fees or usage limits
  • ⚡ Speed: Optimized for Apple Silicon (M1/M2/M3 chips)
  • 🔄 Versatility: Handles text, PDFs, webpages, images, audio, video, and more

Features

Content Extraction

  • Document Parsing: Extract text from PDFs, DOCX, PPTX, HTML, Markdown, JSON, CSV
  • Web Scraping: Extract the main content from websites, stripping navigation, ads, and other page boilerplate
  • OCR: Extract text from images using Optical Character Recognition
  • Audio Transcription: Convert audio files to text using MLX-powered Whisper
  • Video Transcription: Extract and transcribe audio from video files

Content Processing

  • Text Chunking: Split large documents into manageable chunks based on size
  • Semantic Chunking: Create chunks based on semantic similarity using embeddings
  • Structured Information Extraction: Extract specific information using local LLMs

Local-First Machine Learning

  • MLX-Powered: Uses Apple's MLX framework for ML tasks
  • Embeddings: Generate text embeddings for semantic processing
  • Transcription: Local audio transcription using MLX Whisper models
  • LLM Integration: Use local Large Language Models for content extraction and analysis

Requirements

  • macOS on Apple Silicon (M1/M2/M3 or newer)
  • Python 3.9+
  • MLX-compatible models (not included in the package)

Installation

pip install mlxpipeline

Model Setup

MLXPipeline requires pre-downloaded MLX-compatible models to function. You will need to download:

  1. MLX LLM Model: For extraction and analysis (e.g., Llama-3-8B-MLX)
  2. MLX Whisper Model: For audio transcription
  3. MLX Embedding Model: For semantic chunking (e.g., all-MiniLM-L6-v2)

These models should be placed in the default model directory ~/.mlxpipeline/models, or their locations passed explicitly via the corresponding arguments (e.g., --llm-model-path, --whisper-model-path).
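
For example, a populated model directory might look like the layout below. The llm/ and whisper/ subdirectories match the paths used in the examples later in this README; the embeddings/ subdirectory and the exact model folder names are assumptions that depend on which models you download.

~/.mlxpipeline/models/
├── llm/
│   └── llama-3-8b-mlx/
├── whisper/
│   └── whisper-small-mlx/
└── embeddings/
    └── all-MiniLM-L6-v2/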

You can find MLX-compatible models in the mlx-community organization on Hugging Face (https://huggingface.co/mlx-community).

Usage

MLXPipeline can be used both as a command-line tool and as a Python library.

Command Line Usage

Scraping a webpage:

mlxpipeline --source https://example.com --output result.txt

Chunking a document:

mlxpipeline --source document.pdf --chunk-type text --chunk-size 1000 --output chunks.txt

Extracting structured information:

mlxpipeline --source article.txt --extract --schema schema.json --output result.json
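
Here, schema.json is a JSON object mapping each field you want extracted to its expected type. A minimal example, mirroring the schema used in the Python example below (assuming the CLI accepts the same schema format as the Python API):

{
    "title": "string",
    "author": "string",
    "key_points": "array"
}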

Audio transcription:

mlxpipeline --source recording.mp3 --whisper-model-path ~/.mlxpipeline/models/whisper-small-mlx --output transcription.txt

Using a custom LLM for extraction:

mlxpipeline --source document.pdf --extract --schema schema.json --llm-model-path ~/models/llama-3-8b-mlx --output result.json

Python Library Usage

Basic document processing:

from mlxpipeline.scraper import scrape_pdf
from mlxpipeline.chunker import chunk_text
from mlxpipeline.extract import extract_from_chunk

# Extract text from PDF
text = scrape_pdf("document.pdf")

# Split into chunks
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=100)

# Extract structured information
schema = {
    "title": "string",
    "author": "string",
    "key_points": "array"
}

for i, chunk in enumerate(chunks):
    result = extract_from_chunk(
        chunk, 
        schema=schema,
        llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
    )
    print(f"Chunk {i+1} extraction: {result}")

Webpage scraping and analysis:

from mlxpipeline.scraper import scrape_webpage, ai_extract_webpage_content

# Simple webpage scraping
content = scrape_webpage("https://example.com/article")

# AI-powered extraction (focuses on main content)
main_content = ai_extract_webpage_content(
    "https://example.com/article",
    llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
)
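
OCR on images (a sketch: scrape_image is a hypothetical name inferred from the scrape_* convention above, so check mlxpipeline.scraper for the actual function):

from mlxpipeline.scraper import scrape_image  # hypothetical; named by analogy with scrape_pdf/scrape_audio

# Run OCR on an image and return the recognized text
text = scrape_image("scanned_page.png")
print(text)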

Audio transcription:

from mlxpipeline.scraper import scrape_audio

# Transcribe audio
transcription = scrape_audio(
    "interview.mp3",
    whisper_model_path="~/.mlxpipeline/models/whisper/whisper-small-mlx"
)
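
Semantic chunking (a sketch: chunk_semantic and its embedding_model_path argument are assumptions based on the semantic chunking feature described above; verify the actual signature in mlxpipeline.chunker):

from mlxpipeline.scraper import scrape_pdf
from mlxpipeline.chunker import chunk_semantic  # hypothetical; named by analogy with chunk_text

text = scrape_pdf("document.pdf")

# Split on semantic boundaries using a local embedding model
chunks = chunk_semantic(
    text,
    embedding_model_path="~/.mlxpipeline/models/embeddings/all-MiniLM-L6-v2"  # assumed path
)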

Differences from thepipe_api

MLXPipeline is a fork of thepipe_api that has been completely redesigned to be local-first and optimized for Apple Silicon. Key differences include:

  1. Local-Only Processing: All ML tasks run locally with no cloud API dependencies
  2. MLX Integration: Uses Apple's MLX framework for optimized performance on Apple Silicon
  3. Model Management: Users must download and provide MLX-compatible models
  4. Simplified Architecture: Removed all API interactions and authentication requirements
  5. Performance Focus: Optimized for M-series chips with unified memory architecture

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - See LICENSE file for details
