A Python library for downloading sample files in various formats for AI experimentation

Aifiles

Python 3.8+ | License: MIT

A Python library for instantly downloading up to 10 sample files per call, real where available and synthetic otherwise, in any of 40+ supported formats. It is aimed at generative AI, agentic AI, RAG pipelines, multimodal model testing, document parsing, prompt engineering, and general AI/ML experimentation.

✨ Features

  • 40+ file formats supported across documents, data, images, audio, video, code, and more
  • Smart sourcing: Downloads real files from public repositories and falls back to synthetic generation
  • AI-optimized: Formats supported by LangChain, LlamaIndex, OpenAI Vision, and other AI tools
  • Developer-friendly: Simple API, rich CLI, comprehensive error handling
  • Secure: MIME validation, path sanitization, no hardcoded credentials
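
The path sanitization mentioned above is not described further here. As an illustration only, and not the library's actual implementation, a minimal sketch of one common approach is shown below. The helper name `safe_join` is hypothetical: it keeps only the final component of a file name and checks that the resolved path stays inside the output directory.

```python
from pathlib import Path


def safe_join(output_dir: str, filename: str) -> Path:
    """Join filename onto output_dir, rejecting paths that escape it.

    Hypothetical helper for illustration; not part of the aifiles API.
    """
    base = Path(output_dir).resolve()
    # Keep only the final path component, so "../../etc/passwd" becomes "passwd".
    name = Path(filename).name
    if name in ("", ".", ".."):
        raise ValueError(f"unsafe file name: {filename!r}")
    candidate = (base / name).resolve()
    # Defence in depth: the resolved path must still live under base
    # (this also catches symlinks that point outside the directory).
    if base not in candidate.parents:
        raise ValueError(f"path escapes output directory: {filename!r}")
    return candidate
```

With this sketch, `safe_join("samples", "../../etc/passwd")` resolves to a `passwd` file inside `samples` rather than a path outside it.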

🚀 Installation

pip install aifiles

# For audio generation (optional); quotes keep shells such as zsh from expanding the brackets
pip install "aifiles[audio]"

# For all optional dependencies
pip install "aifiles[all]"

📖 Quick Start

Python API

from aifiles import get_files, list_formats, info, preview

# Get 5 PDF samples for RAG pipeline testing
files = get_files("pdf", count=5, output_dir="./rag_data")
print(files)
# Absolute paths, e.g. ['<cwd>/rag_data/sample_1.pdf', '<cwd>/rag_data/sample_2.pdf', ...]

# Get 3 CSV files with sales data variant
files = get_files("csv", count=3, variant="sales")
# e.g. ['<cwd>/samples/sales_1.csv', '<cwd>/samples/sales_2.csv', ...]

# Get 10 WAV files for speech model testing
files = get_files("wav", count=10, output_dir="./audio_test")

# List all supported formats
formats = list_formats()
print(formats["documents"])  # ['pdf', 'docx', 'txt', 'md', ...]

# Get format information
meta = info("json")
print(meta)
# {
#   "mime_type": "application/json",
#   "category": "structured",
#   "use_cases": ["agent logs", "API responses", "chat history"],
#   "supported_by": ["OpenAI", "LangChain", "LlamaIndex", ...]
# }

# Preview a downloaded file
preview("./samples/sample_1.csv")

Command Line

# Get 3 PDF sample files
aifiles get pdf --count 3

# Get 5 CSV files using the "sales data" variant hint, saved to ./my_samples
aifiles get csv --count 5 --variant "sales data" --output ./my_samples

# List all supported formats
aifiles list-formats

# Show format info
aifiles info json

# Preview a file
aifiles preview ./samples/sample_1.csv

📋 Supported Formats

📄 Documents

  • PDF - Multi-page documents, invoices, research papers
  • DOCX - Word documents, resumes, letters
  • TXT - Plain text, prompts, logs, poetry
  • MD - Markdown docs, README files, notes
  • RTF - Rich text with formatting
  • ODT - OpenDocument text files

📊 Structured / Data

  • CSV - Tabular datasets, sales data, sensor readings
  • TSV - Tab-separated data
  • JSON - API responses, agent logs, chat histories
  • YAML - Configurations, workflows, agent definitions
  • XML - Structured markup, RSS feeds, SOAP data
  • XLSX - Excel spreadsheets with charts
  • PARQUET - Columnar data for ML pipelines
  • SQLITE - Embedded database for agent memory

🖼️ Images

  • PNG - Charts, diagrams, screenshots
  • JPG - Photographs, real-world scenes
  • WEBP - Modern compressed images
  • TIFF - High-quality scanning/OCR
  • GIF - Animated images for UI testing
  • SVG - Vector graphics, logos, icons

🎵 Audio

  • WAV - Raw speech audio for STT models
  • MP3 - Compressed speech or music
  • FLAC - Lossless audio for high-fidelity testing
  • OGG - Open-format compressed audio

🎥 Video

  • MP4 - General-purpose video for multimodal models
  • MOV - Apple QuickTime video
  • AVI - Legacy video format
  • MKV - High-quality video container
  • WEBM - Web-friendly open video format

💻 Code & Notebooks

  • PY - Python scripts
  • JS - JavaScript/Node.js scripts
  • TS - TypeScript files
  • IPYNB - Jupyter Notebooks with AI/ML examples
  • HTML - Web pages for scraping/parsing
  • CSS - Stylesheets
  • SQL - Database queries, DDL scripts

📧 Email & Communication

  • EML - Email messages with attachments
  • MSG - Outlook email format
  • ICS - Calendar events (iCalendar)
  • VCF - Contact cards

🗜️ Archives

  • ZIP - Compressed archive with mixed files
  • TAR - Unix archive
  • GZ - Gzipped content

🔬 Scientific / ML

  • HDF5 - Hierarchical scientific data
  • ARROW - Apache Arrow columnar data
  • FEATHER - Fast columnar data storage
  • NPY - NumPy array binary format
  • PKL - Python pickle (serialized objects)

🏗️ 3D / Spatial

  • OBJ - 3D object mesh
  • STL - 3D printing / mesh format
  • GLTF - 3D scene for multimodal/spatial AI

🔐 Config / Infra

  • ENV - Environment config files
  • TOML - Project config (like pyproject.toml)
  • INI - Legacy configuration files
  • DOCKERFILE - Docker build files

🛠️ API Reference

get_files(format, count=1, output_dir="./samples", variant=None)

Download sample files in the specified format.

Parameters:

  • format (str): File format/extension (e.g., "pdf", "csv", "png")
  • count (int): Number of files to fetch (1-10)
  • output_dir (str): Directory to save files
  • variant (str, optional): Content variant hint

Returns: List of absolute file paths

Raises:

  • InvalidCountError: count is outside the range 1-10
  • FormatNotSupportedError: Format not supported
  • FormatNotAvailableError: Cannot fetch or generate files
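
Because count is capped at 10 per call, fetching a larger set means calling get_files more than once. The sketch below shows one way to do that, under assumptions: `fetch_in_batches` and `MAX_PER_CALL` are hypothetical names, not part of the library. The function receives the fetcher as an argument and uses only the keyword parameters documented above. It writes each batch to its own subdirectory, on the unverified assumption that file names may repeat between calls.

```python
from typing import Callable, List, Optional

MAX_PER_CALL = 10  # documented upper bound for `count`


def fetch_in_batches(
    fetch: Callable[..., List[str]],
    fmt: str,
    total: int,
    output_dir: str = "./samples",
    variant: Optional[str] = None,
) -> List[str]:
    """Call `fetch` repeatedly so no single call asks for more than 10 files."""
    if total < 1:
        raise ValueError("total must be at least 1")
    paths: List[str] = []
    batch = 0
    remaining = total
    while remaining > 0:
        size = min(remaining, MAX_PER_CALL)
        # Separate subdirectory per batch, in case file names repeat between calls.
        batch_dir = f"{output_dir}/batch_{batch}"
        paths.extend(fetch(fmt, count=size, output_dir=batch_dir, variant=variant))
        remaining -= size
        batch += 1
    return paths
```

Called as `fetch_in_batches(get_files, "pdf", 25)`, this would issue three calls with counts of 10, 10, and 5.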

list_formats()

Returns: Dictionary of categorized formats
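
Assuming the returned dictionary maps each category name to a list of extension strings, as the Quick Start output suggests, a small lookup can find which category a given extension belongs to. `find_category` is a hypothetical helper, and the `example` dictionary is an illustrative excerpt rather than real library output.

```python
from typing import Dict, List, Optional


def find_category(formats: Dict[str, List[str]], extension: str) -> Optional[str]:
    """Return the category whose list contains `extension`, or None."""
    wanted = extension.lower().lstrip(".")
    for category, extensions in formats.items():
        if wanted in (ext.lower() for ext in extensions):
            return category
    return None


# Hypothetical excerpt shaped like the list_formats() output shown in Quick Start.
example = {"documents": ["pdf", "docx", "txt", "md"], "structured": ["csv", "json"]}
print(find_category(example, ".PDF"))  # documents
print(find_category(example, "exe"))   # None
```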

info(format)

Returns: Dictionary with format metadata or None
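
Since info may return None, callers should check the result before indexing into it. The sketch below assumes the keys shown in the Quick Start example ("mime_type", "use_cases") and treats them as optional. `describe` is a hypothetical helper, not part of the library.

```python
from typing import Any, Dict, Optional


def describe(fmt: str, meta: Optional[Dict[str, Any]]) -> str:
    """Build a one-line summary from info() output, tolerating None and missing keys."""
    if meta is None:
        return f"{fmt}: no metadata available"
    mime = meta.get("mime_type", "unknown MIME type")
    uses = ", ".join(meta.get("use_cases", [])) or "no listed use cases"
    return f"{fmt}: {mime} ({uses})"
```

A call such as `describe("json", info("json"))` then produces a printable summary whether or not metadata was found.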

preview(filepath)

Prints file preview or metadata to console.

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Public sample file repositories for real file sources
  • Open source libraries: requests, faker, Pillow, fpdf2, etc.
  • AI community for inspiration and use cases

Made with ❤️ for the AI developer community
