A Python library for downloading sample files in various formats for AI experimentation
Aifiles
A Python library for instantly downloading up to 10 real, meaningful sample files per call in any of 40+ supported formats. It is intended for generative AI, agentic AI, RAG pipelines, multimodal model testing, document parsing, prompt engineering, and general AI/ML experimentation.
✨ Features
- 40+ file formats supported across documents, data, images, audio, video, code, and more
- Smart sourcing: Downloads real files from public repositories and falls back to synthetic generation when no real file can be fetched
- AI-optimized: Formats supported by LangChain, LlamaIndex, OpenAI Vision, and other AI tools
- Developer-friendly: Simple API, rich CLI, comprehensive error handling
- Secure: MIME validation, path sanitization, no hardcoded credentials
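The "path sanitization" and "MIME validation" items above can be illustrated with a minimal standard-library sketch. This is not the library's implementation: `safe_output_path` and `extension_matches_mime` are hypothetical helper names, and the MIME check shown here only inspects the file extension, not the file contents.

```python
import mimetypes
from pathlib import Path


def safe_output_path(output_dir: str, filename: str) -> Path:
    """Resolve filename inside output_dir, rejecting path traversal.

    Hypothetical helper, not part of aifiles.
    """
    base = Path(output_dir).resolve()
    # Keep only the final path component, so "../../etc/passwd" becomes "passwd".
    candidate = (base / Path(filename).name).resolve()
    # Reject names such as ".." or "" that do not land strictly inside base.
    if base not in candidate.parents:
        raise ValueError(f"unsafe filename: {filename!r}")
    return candidate


def extension_matches_mime(path: Path, expected_mime: str) -> bool:
    """Check that the MIME type guessed from the extension equals expected_mime."""
    guessed, _encoding = mimetypes.guess_type(path.name)
    return guessed == expected_mime


print(safe_output_path("./samples", "../../etc/passwd").name)
print(extension_matches_mime(Path("report.pdf"), "application/pdf"))
```

A stricter check would also inspect the first bytes of the file, since an extension alone says nothing about the actual content.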
🚀 Installation
pip install aifiles
# For audio generation (optional)
pip install aifiles[audio]
# For all optional dependencies
pip install aifiles[all]
📖 Quick Start
Python API
from aifiles import get_files, list_formats, info, preview
# Get 5 PDF samples for RAG pipeline testing
files = get_files("pdf", count=5, output_dir="./rag_data")
print(files)
# ['/rag_data/sample_1.pdf', '/rag_data/sample_2.pdf', ...]
# Get 3 CSV files with sales data variant
files = get_files("csv", count=3, variant="sales")
# ['/samples/sales_1.csv', '/samples/sales_2.csv', ...]
# Get 10 WAV files for speech model testing
files = get_files("wav", count=10, output_dir="./audio_test")
# List all supported formats
formats = list_formats()
print(formats["documents"]) # ['pdf', 'docx', 'txt', 'md', ...]
# Get format information
meta = info("json")
print(meta)
# {
# "mime_type": "application/json",
# "category": "structured",
# "use_cases": ["agent logs", "API responses", "chat history"],
# "supported_by": ["OpenAI", "LangChain", "LlamaIndex", ...]
# }
# Preview a downloaded file
preview("./samples/sample1.csv")
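Since `get_files` returns a list of paths, the result can be handed directly to ordinary file-handling code. The sketch below counts data rows in a list of CSV files as a quick sanity check before feeding them to a pipeline. `count_csv_rows` is a hypothetical helper, and it assumes each file is UTF-8 encoded with a header row, which the package does not guarantee.

```python
import csv
from pathlib import Path


def count_csv_rows(paths):
    """Return {file name: number of data rows}, excluding the header row.

    Hypothetical helper; assumes UTF-8 files whose first row is a header.
    """
    counts = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            rows = list(csv.reader(handle))
        # An empty file has no header, so clamp the count at zero.
        counts[Path(path).name] = max(len(rows) - 1, 0)
    return counts
```

With the package installed, a call such as `count_csv_rows(get_files("csv", count=3, variant="sales"))` would report one entry per downloaded file.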
Command Line
# Get 3 PDF sample files
aifiles get pdf --count 3
# Get 5 CSV files with sales variant
aifiles get csv --count 5 --variant "sales data" --output ./my_samples
# List all supported formats
aifiles list-formats
# Show format info
aifiles info json
# Preview a file
aifiles preview ./samples/sample1.csv
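When the CLI is driven from a script, building the argument list in Python avoids quoting mistakes for values that contain spaces, such as the "sales data" variant above. This sketch uses only the subcommand and flags shown in the examples (`get`, `--count`, `--variant`, `--output`); `build_get_command` is a hypothetical helper, not something the package provides.

```python
import shlex


def build_get_command(fmt, count=1, variant=None, output=None):
    """Build the argv list for `aifiles get` from the flags shown above."""
    argv = ["aifiles", "get", fmt, "--count", str(count)]
    if variant is not None:
        argv += ["--variant", variant]
    if output is not None:
        argv += ["--output", output]
    return argv


argv = build_get_command("csv", count=5, variant="sales data", output="./my_samples")
# shlex.join renders the list as a shell-safe string for logging.
print(shlex.join(argv))
# aifiles get csv --count 5 --variant 'sales data' --output ./my_samples
```

The list can be passed as-is to `subprocess.run(argv, check=True)`, which bypasses the shell and therefore needs no quoting at all.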
📋 Supported Formats
📄 Documents
- PDF - Multi-page documents, invoices, research papers
- DOCX - Word documents, resumes, letters
- TXT - Plain text, prompts, logs, poetry
- MD - Markdown docs, README files, notes
- RTF - Rich text with formatting
- ODT - OpenDocument text files
📊 Structured / Data
- CSV - Tabular datasets, sales data, sensor readings
- TSV - Tab-separated data
- JSON - API responses, agent logs, chat histories
- YAML - Configurations, workflows, agent definitions
- XML - Structured markup, RSS feeds, SOAP data
- XLSX - Excel spreadsheets with charts
- PARQUET - Columnar data for ML pipelines
- SQLITE - Embedded database for agent memory
🖼️ Images
- PNG - Charts, diagrams, screenshots
- JPG - Photographs, real-world scenes
- WEBP - Modern compressed images
- TIFF - High-quality scanning/OCR
- GIF - Animated images for UI testing
- SVG - Vector graphics, logos, icons
🎵 Audio
- WAV - Raw speech audio for STT models
- MP3 - Compressed speech or music
- FLAC - Lossless audio for high-fidelity testing
- OGG - Open-source compressed audio
🎥 Video
- MP4 - General-purpose video for multimodal models
- MOV - Apple QuickTime video
- AVI - Legacy video format
- MKV - High-quality video container
- WEBM - Web-friendly open video format
💻 Code & Notebooks
- PY - Python scripts
- JS - JavaScript/Node.js scripts
- TS - TypeScript files
- IPYNB - Jupyter Notebooks with AI/ML examples
- HTML - Web pages for scraping/parsing
- CSS - Stylesheets
- SQL - Database queries, DDL scripts
📧 Email & Communication
- EML - Email messages with attachments
- MSG - Outlook email format
- ICS - Calendar events (iCalendar)
- VCF - Contact cards
🗜️ Archives
- ZIP - Compressed archive with mixed files
- TAR - Unix archive
- GZ - Gzipped content
🔬 Scientific / ML
- HDF5 - Hierarchical scientific data
- ARROW - Apache Arrow columnar data
- FEATHER - Fast columnar data storage
- NPY - NumPy array binary format
- PKL - Python pickle (serialized objects)
🏗️ 3D / Spatial
- OBJ - 3D object mesh
- STL - 3D printing / mesh format
- GLTF - 3D scene for multimodal/spatial AI
🔐 Config / Infra
- ENV - Environment config files
- TOML - Project config (like pyproject.toml)
- INI - Legacy configuration files
- DOCKERFILE - Docker build files
🛠️ API Reference
get_files(format, count=1, output_dir="./samples", variant=None)
Download sample files in the specified format.
Parameters:
- format (str): File format/extension (e.g., "pdf", "csv", "png")
- count (int): Number of files to fetch (1-10)
- output_dir (str): Directory to save files
- variant (str, optional): Content variant hint
Returns: List of absolute file paths
Raises:
- InvalidCountError: Count not between 1-10
- FormatNotSupportedError: Format not supported
- FormatNotAvailableError: Cannot fetch or generate files
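The documented behaviour (validate the count, try a real download, fall back to synthetic generation, and raise if both fail) can be sketched as follows. This is an illustration of the pattern, not the library's code: the exception classes are local stand-ins with assumed base classes, and `download` and `generate` are hypothetical callables that return a list of paths.

```python
class InvalidCountError(ValueError):
    """Local stand-in for the library's exception of the same name."""


class FormatNotAvailableError(RuntimeError):
    """Local stand-in for the library's exception of the same name."""


def fetch_with_fallback(fmt, count, download, generate):
    """Try download(fmt, count); fall back to generate(fmt, count) on failure."""
    if not 1 <= count <= 10:
        raise InvalidCountError(f"count must be between 1 and 10, got {count}")
    try:
        return download(fmt, count)
    except OSError:
        # Network or disk failure: try synthetic generation instead.
        pass
    try:
        return generate(fmt, count)
    except Exception as exc:
        raise FormatNotAvailableError(f"cannot fetch or generate {fmt!r}") from exc
```

Passing the two strategies in as arguments keeps the sketch testable without network access; the real package may structure this differently.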
list_formats()
Returns: Dictionary of categorized formats
info(format)
Returns: Dictionary with format metadata or None
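Because `list_formats()` returns a dictionary keyed by category and `info()` may return None, calling code usually needs a lookup that tolerates unknown formats. The sketch below works on a small sample dictionary shaped like the Quick Start output. The "documents" key comes from that example, while the "images" key and `find_category` itself are assumptions made for illustration.

```python
def find_category(formats_by_category, fmt):
    """Return the category that lists fmt, or None if it is not listed."""
    wanted = fmt.lower().lstrip(".")
    for category, extensions in formats_by_category.items():
        if wanted in extensions:
            return category
    return None


# Sample data shaped like the documented return value; the real dictionary is larger.
sample = {"documents": ["pdf", "docx", "txt", "md"], "images": ["png", "jpg"]}
print(find_category(sample, ".PDF"))  # documents
print(find_category(sample, "xyz"))   # None
```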
preview(filepath)
Prints file preview or metadata to console.
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Public sample file repositories for real file sources
- Open source libraries: requests, faker, Pillow, fpdf2, etc.
- AI community for inspiration and use cases
👤 Author
- Prajyot Birajdar
- Email: work.prajyotbirajadar@gmail.com
- GitHub: https://github.com/itsbilyatt
- LinkedIn: https://www.linkedin.com/in/prajyot-birajdar-1b09a1173
- Portfolio: https://prajyotb.netlify.app/
Made with ❤️ for the AI developer community
File details
Details for the file aifiles-0.1.0.tar.gz.
File metadata
- Download URL: aifiles-0.1.0.tar.gz
- Upload date:
- Size: 25.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 491ac54049534854d843156fb19b83b5d8360feaa94c1b726be367650717f28e |
| MD5 | 9c3435135731c389e051bddee0a2aa62 |
| BLAKE2b-256 | 240aa9ab1c261a25ed93cc113f12862ae469a89cca640d90150b7934bd7b2bff |
File details
Details for the file aifiles-0.1.0-py3-none-any.whl.
File metadata
- Download URL: aifiles-0.1.0-py3-none-any.whl
- Upload date:
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.1
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8424c6c6e0cc819f12117e3b849a68833589712e6ee47cd0f0dcaf936aa3c66b |
| MD5 | f3600245d701216c180ac99f6bfd1795 |
| BLAKE2b-256 | 0aa561f658b14845d40333ad041aa5e92e36531e16c4f5fa077e6413f4cd2bd7 |
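A manually downloaded archive can be checked against the published SHA256 digests with the standard library. The helper names below are hypothetical, and the sketch assumes the file is already on disk.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash("aifiles-0.1.0.tar.gz", "491ac54049534854d843156fb19b83b5d8360feaa94c1b726be367650717f28e")` should return True for an unmodified copy of the source distribution.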