Data processing pipeline using MLX (scraper, chunker, extractor).
MLXPipeline
A Local-First Data Extraction Pipeline powered by Apple MLX
Overview
MLXPipeline is a Python package for extracting, chunking, and analyzing data from various sources, optimized specifically for Apple Silicon. Unlike cloud-based solutions, MLXPipeline performs all processing, including machine learning inference tasks, locally on your machine.
Built on Apple's MLX framework, it takes full advantage of Apple's M-series chips to provide:
- 🔒 Enhanced Privacy: All data stays on your device
- 💰 Cost Efficiency: No API fees or usage limits
- ⚡ Speed: Optimized for Apple Silicon (M1/M2/M3 chips)
- 🔄 Versatility: Handles text, PDFs, webpages, images, audio, video, and more
Features
Content Extraction
- Document Parsing: Extract text from PDFs, DOCX, PPTX, HTML, Markdown, JSON, CSV
- Web Scraping: Extract content from websites with clean-up of navigation, ads, etc.
- OCR: Extract text from images using Optical Character Recognition
- Audio Transcription: Convert audio files to text using MLX-powered Whisper
- Video Transcription: Extract and transcribe audio from video files
Content Processing
- Text Chunking: Split large documents into manageable chunks based on size
- Semantic Chunking: Create chunks based on semantic similarity using embeddings
- Structured Information Extraction: Extract specific information using local LLMs
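Size-based chunking with overlap, as listed above, can be sketched in a few lines of plain Python. This is only an illustration of the idea behind the package's `chunk_text` (the function name `chunk_by_size` here is hypothetical, not part of the library):

```python
def chunk_by_size(text, chunk_size=1000, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters,
    with consecutive chunks sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each chunk advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The overlap preserves context at chunk boundaries, which matters when each chunk is later fed to an LLM independently.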
Local-First Machine Learning
- MLX-Powered: Uses Apple's MLX framework for ML tasks
- Embeddings: Generate text embeddings for semantic processing
- Transcription: Local audio transcription using MLX Whisper models
- LLM Integration: Use local Large Language Models for content extraction and analysis
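Semantic chunking, mentioned above, groups adjacent spans whose embeddings are similar. A toy sketch with hand-rolled cosine similarity follows; in the real package the vectors would come from an MLX embedding model such as all-MiniLM-L6-v2, and the `semantic_chunks` helper shown here is illustrative, not the library's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.8):
    """Merge consecutive sentences while their embeddings stay similar."""
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, cur) >= threshold:
            current.append(sent)       # same topic: extend current chunk
        else:
            chunks.append(" ".join(current))
            current = [sent]           # topic shift: start a new chunk
    chunks.append(" ".join(current))
    return chunks
```

A topic shift shows up as a drop in similarity between neighboring embeddings, which is where the chunk boundary is placed.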
Requirements
- macOS on Apple Silicon (M1/M2/M3 or newer)
- Python 3.9+
- MLX compatible models (not included in the package)
Installation
pip install mlxpipeline
Model Setup
MLXPipeline requires pre-downloaded MLX-compatible models to function. You should download:
- MLX LLM Model: For extraction and analysis (e.g., Llama-3-8B-MLX)
- MLX Whisper Model: For audio transcription
- MLX Embedding Model: For semantic chunking (e.g., all-MiniLM-L6-v2)
Place these models in the default model directory ~/.mlxpipeline/models, or point to them explicitly with the corresponding arguments (e.g. --llm-model-path, --whisper-model-path).
You can find MLX-compatible models at:
- MLX Community Models
- Hugging Face (models with MLX support)
Usage
MLXPipeline can be used both as a command-line tool and as a Python library.
Command Line Usage
Scraping a webpage:
mlxpipeline --source https://example.com --output result.txt
Chunking a document:
mlxpipeline --source document.pdf --chunk-type text --chunk-size 1000 --output chunks.txt
Extracting structured information:
mlxpipeline --source article.txt --extract --schema schema.json --output result.json
Audio transcription:
mlxpipeline --source recording.mp3 --whisper-model-path ~/.mlxpipeline/models/whisper-small-mlx --output transcription.txt
Using a custom LLM for extraction:
mlxpipeline --source document.pdf --extract --schema schema.json --llm-model-path ~/models/llama-3-8b-mlx --output result.json
Python Library Usage
Basic document processing:
from mlxpipeline.scraper import scrape_pdf
from mlxpipeline.chunker import chunk_text
from mlxpipeline.extract import extract_from_chunk
# Extract text from PDF
text = scrape_pdf("document.pdf")
# Split into chunks
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=100)
# Extract structured information
schema = {
    "title": "string",
    "author": "string",
    "key_points": "array"
}
for i, chunk in enumerate(chunks):
    result = extract_from_chunk(
        chunk,
        schema=schema,
        llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
    )
    print(f"Chunk {i+1} extraction: {result}")
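Under the hood, schema-guided extraction typically means prompting the local LLM with the schema and parsing JSON out of its reply. A rough sketch with the model call left out; both helper names are hypothetical, not the signatures `extract_from_chunk` actually uses:

```python
import json

def build_extraction_prompt(chunk, schema):
    """Ask the model to fill the schema from the text, replying with JSON only."""
    return (
        "Extract the following fields from the text and reply with JSON only.\n"
        f"Schema: {json.dumps(schema)}\n"
        f"Text: {chunk}"
    )

def parse_extraction(reply):
    """Pull the first JSON object out of a (possibly chatty) model reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model reply")
    return json.loads(reply[start:end + 1])
```

Local models often wrap their JSON in prose, so tolerant parsing of the reply is usually necessary.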
Webpage scraping and analysis:
from mlxpipeline.scraper import scrape_webpage, ai_extract_webpage_content
# Simple webpage scraping
content = scrape_webpage("https://example.com/article")
# AI-powered extraction (focuses on main content)
main_content = ai_extract_webpage_content(
    "https://example.com/article",
    llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
)
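The clean-up that webpage scraping performs (dropping navigation, scripts, and similar boilerplate) can be sketched with the standard-library HTML parser. This only illustrates the idea; it is not the extractor `scrape_webpage` actually uses:

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    """Collect text while skipping boilerplate tags such as <nav> and <script>."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside skipped tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

Tracking nesting depth (rather than a boolean) keeps text suppressed even when skipped tags are nested inside each other.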
Audio transcription:
from mlxpipeline.scraper import scrape_audio
# Transcribe audio
transcription = scrape_audio(
    "interview.mp3",
    whisper_model_path="~/.mlxpipeline/models/whisper/whisper-small-mlx"
)
Differences from thepipe_api
MLXPipeline is a fork of thepipe_api that has been completely redesigned to be local-first and Apple Silicon optimized. Key differences include:
- Local-Only Processing: All ML tasks run locally with no cloud API dependencies
- MLX Integration: Uses Apple's MLX framework for optimized performance on Apple Silicon
- Model Management: Users must download and provide MLX-compatible models
- Simplified Architecture: Removed all API interactions and authentication requirements
- Performance Focus: Optimized for M-series chips with unified memory architecture
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT License - See LICENSE file for details
Project details
Download files
File details
Details for the file llama_mlx_pipeline_llamasearch-0.1.0.tar.gz.
File metadata
- Download URL: llama_mlx_pipeline_llamasearch-0.1.0.tar.gz
- Upload date:
- Size: 25.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2448191922af727ae005396268be2295a0ffd6814793f12e88d83e38549ca1cf |
| MD5 | 8047fbd0575e1fd4ff12d20958fc1731 |
| BLAKE2b-256 | e3cf6fda7fde34d5d037dfaf2b965e9bfa6aa4a65c4241c56af1d2d0bac96f61 |
File details
Details for the file llama_mlx_pipeline_llamasearch-0.1.0-py3-none-any.whl.
File metadata
- Download URL: llama_mlx_pipeline_llamasearch-0.1.0-py3-none-any.whl
- Upload date:
- Size: 18.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1ab5ece01f6ebe27a0cd9af090b448a7248173bb112be74c942e3aca73d614bf |
| MD5 | 7e8a62cb1750f5ebd1d520d3e152763a |
| BLAKE2b-256 | f7de703aa5f7fd168b808fac86888d27e18e6cfe9997c491d09bad397c9560ae |