
MLXPipeline

A Local-First Data Extraction Pipeline powered by Apple MLX

Overview

MLXPipeline is a Python package for extracting, chunking, and analyzing data from various sources, optimized specifically for Apple Silicon. Unlike cloud-based solutions, MLXPipeline performs all processing, including machine learning inference tasks, locally on your machine.

Built on Apple's MLX framework, it efficiently leverages Apple's M-series chips to provide:

  • 🔒 Enhanced Privacy: All data stays on your device
  • 💰 Cost Efficiency: No API fees or usage limits
  • ⚡ Speed: Optimized for Apple Silicon (M1/M2/M3 chips)
  • 🔄 Versatility: Handles text, PDFs, webpages, images, audio, video, and more

Features

Content Extraction

  • Document Parsing: Extract text from PDFs, DOCX, PPTX, HTML, Markdown, JSON, CSV
  • Web Scraping: Extract the main content from websites, stripping navigation, ads, and other page boilerplate
  • OCR: Extract text from images using Optical Character Recognition
  • Audio Transcription: Convert audio files to text using MLX-powered Whisper
  • Video Transcription: Extract and transcribe audio from video files

Content Processing

  • Text Chunking: Split large documents into manageable chunks based on size
  • Semantic Chunking: Create chunks based on semantic similarity using embeddings
  • Structured Information Extraction: Extract specific information using local LLMs

Local-First Machine Learning

  • MLX-Powered: Uses Apple's MLX framework for ML tasks
  • Embeddings: Generate text embeddings for semantic processing
  • Transcription: Local audio transcription using MLX Whisper models
  • LLM Integration: Use local Large Language Models for content extraction and analysis

Requirements

  • macOS on Apple Silicon (M1/M2/M3 or newer)
  • Python 3.9+
  • MLX-compatible models (not included in the package)

Installation

pip install mlxpipeline

Model Setup

MLXPipeline requires pre-downloaded MLX-compatible models to function. You will need to download:

  1. MLX LLM Model: For extraction and analysis (e.g., Llama-3-8B-MLX)
  2. MLX Whisper Model: For audio transcription
  3. MLX Embedding Model: For semantic chunking (e.g., all-MiniLM-L6-v2)

These models should be placed in the default model directory ~/.mlxpipeline/models, or their locations passed explicitly via the corresponding arguments (e.g., --llm-model-path, --whisper-model-path).
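
For example, a populated model directory might look like the layout below. The llm/ and whisper/ subdirectories match the paths used in the examples later in this README; the embeddings/ subdirectory and the exact model folder names are assumptions that depend on which models you download.

~/.mlxpipeline/models/
├── llm/
│   └── llama-3-8b-mlx/
├── whisper/
│   └── whisper-small-mlx/
└── embeddings/
    └── all-MiniLM-L6-v2/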

You can find MLX-compatible models in the mlx-community organization on Hugging Face (https://huggingface.co/mlx-community).

Usage

MLXPipeline can be used both as a command-line tool and as a Python library.

Command Line Usage

Scraping a webpage:

mlxpipeline --source https://example.com --output result.txt

Chunking a document:

mlxpipeline --source document.pdf --chunk-type text --chunk-size 1000 --output chunks.txt

Extracting structured information:

mlxpipeline --source article.txt --extract --schema schema.json --output result.json
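
Here, schema.json is a JSON object mapping each field you want extracted to its expected type. A minimal example, mirroring the schema used in the Python example below (assuming the CLI accepts the same schema format as the Python API):

{
    "title": "string",
    "author": "string",
    "key_points": "array"
}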

Audio transcription:

mlxpipeline --source recording.mp3 --whisper-model-path ~/.mlxpipeline/models/whisper-small-mlx --output transcription.txt

Using a custom LLM for extraction:

mlxpipeline --source document.pdf --extract --schema schema.json --llm-model-path ~/models/llama-3-8b-mlx --output result.json

Python Library Usage

Basic document processing:

from mlxpipeline.scraper import scrape_pdf
from mlxpipeline.chunker import chunk_text
from mlxpipeline.extract import extract_from_chunk

# Extract text from PDF
text = scrape_pdf("document.pdf")

# Split into chunks
chunks = chunk_text(text, chunk_size=1000, chunk_overlap=100)

# Extract structured information
schema = {
    "title": "string",
    "author": "string",
    "key_points": "array"
}

for i, chunk in enumerate(chunks):
    result = extract_from_chunk(
        chunk, 
        schema=schema,
        llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
    )
    print(f"Chunk {i+1} extraction: {result}")

Webpage scraping and analysis:

from mlxpipeline.scraper import scrape_webpage, ai_extract_webpage_content

# Simple webpage scraping
content = scrape_webpage("https://example.com/article")

# AI-powered extraction (focuses on main content)
main_content = ai_extract_webpage_content(
    "https://example.com/article",
    llm_model_path="~/.mlxpipeline/models/llm/llama-3-8b-mlx"
)
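
OCR on images (a sketch: scrape_image is a hypothetical name inferred from the scrape_* convention above, so check mlxpipeline.scraper for the actual function):

from mlxpipeline.scraper import scrape_image  # hypothetical; named by analogy with scrape_pdf/scrape_audio

# Run OCR on an image and return the recognized text
text = scrape_image("scanned_page.png")
print(text)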

Audio transcription:

from mlxpipeline.scraper import scrape_audio

# Transcribe audio
transcription = scrape_audio(
    "interview.mp3",
    whisper_model_path="~/.mlxpipeline/models/whisper/whisper-small-mlx"
)
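
Semantic chunking (a sketch: chunk_semantic and its embedding_model_path argument are assumptions based on the semantic chunking feature described above; verify the actual signature in mlxpipeline.chunker):

from mlxpipeline.scraper import scrape_pdf
from mlxpipeline.chunker import chunk_semantic  # hypothetical; named by analogy with chunk_text

text = scrape_pdf("document.pdf")

# Split on semantic boundaries using a local embedding model
chunks = chunk_semantic(
    text,
    embedding_model_path="~/.mlxpipeline/models/embeddings/all-MiniLM-L6-v2"  # assumed path
)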

Differences from thepipe_api

MLXPipeline is a fork of thepipe_api that has been completely redesigned to be local-first and optimized for Apple Silicon. Key differences include:

  1. Local-Only Processing: All ML tasks run locally with no cloud API dependencies
  2. MLX Integration: Uses Apple's MLX framework for optimized performance on Apple Silicon
  3. Model Management: Users must download and provide MLX-compatible models
  4. Simplified Architecture: Removed all API interactions and authentication requirements
  5. Performance Focus: Optimized for M-series chips with unified memory architecture

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

MIT License - See LICENSE file for details
