A CSV importer for MongoDB

PyImport - A Powerful CSV Importer for MongoDB

PyImport is a Python command-line tool for importing CSV data into MongoDB with automatic type detection, parallel processing, and graceful handling of "dirty" data.

Unlike MongoDB's native mongoimport, PyImport focuses on handling real-world messy data, automatic type inference, and high-performance parallel imports.

Version: 1.9.0
Author: Joe Drumgoole (joe@joedrumgoole.com | @jdrumgoole)
License: Apache 2.0
Source: github.com/jdrumgoole/pyimport
Documentation: pyimport.readthedocs.io

Key Features

Automatic Type Detection - Generate field files with inferred types using --genfieldfile
Graceful Error Handling - Falls back to strings on type conversion errors instead of failing
Multiple Import Strategies - Sync, async, multi-process, and threaded imports
Parallel Processing - Split large files and import in parallel for maximum throughput
Restart Capability - Resume failed imports from where they left off
Flexible Date Parsing - Multiple date formats, with fast ISO date parsing (100x faster than generic date parsing)
Performance Optimized - Imports run 20-35% faster as of version 1.9.0
URL Support - Import directly from URLs or local files

Performance

Sync: ~24,000-32,000 docs/sec
Async: ~30,000-40,000 docs/sec
Multi-process: ~50,000+ docs/sec

Requirements

Python: 3.11 or higher
MongoDB: 4.0 or higher

Installation

From PyPI (Recommended)

pip install pyimport

From Source

git clone https://github.com/jdrumgoole/pyimport.git
cd pyimport
poetry install

Verify Installation

pyimport --version
# Output: pyimport 1.9.0

Quick Start

Step 1: Create a Simple CSV File

# Create a test CSV file
echo "name,age,city" > test.csv
echo "Alice,30,NYC" >> test.csv
echo "Bob,25,LA" >> test.csv

Step 2: Generate Field File (Type Definitions)

pyimport --genfieldfile test.csv
# Output: Created field filename 'test.tff' from 'test.csv'

This creates a test.tff file that defines the type of each column (string, int, date, etc.).

Step 3: Import to MongoDB

pyimport --database mydb --collection people test.csv
# Imports data using the auto-generated test.tff field file

Step 4: Verify Import

mongosh mydb --eval "db.people.find().pretty()"

Advanced Usage

Fast Parallel Import for Large Files

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --database mydb --collection mycol largefile.csv

This splits the file into 8 chunks and processes them with 4 parallel workers.
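
The split-then-pool pattern can be sketched in Python. This is only an illustration of the general idea, not PyImport's implementation: `split_rows` and `import_chunk` are hypothetical names, and a thread pool stands in here for the worker processes that `--multi` would use.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, chunk_count):
    """Split rows into chunk_count contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), chunk_count)
    chunks, start = [], 0
    for i in range(chunk_count):
        # The first `extra` chunks take one additional row.
        end = start + base + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def import_chunk(chunk):
    """Hypothetical stand-in for importing one chunk; returns rows handled."""
    return len(chunk)


rows = [f"row-{n}" for n in range(1000)]
chunks = split_rows(rows, 8)

# Four workers drain the eight chunks, mirroring --autosplit 8 --poolsize 4.
with ThreadPoolExecutor(max_workers=4) as pool:
    imported = sum(pool.map(import_chunk, chunks))

print(len(chunks), imported)  # 8 1000
```

With more chunks than workers, a worker that finishes a chunk early simply picks up the next one, so uneven chunks do not leave workers idle.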

Async Import (High Performance)

pyimport --asyncpro --database mydb --collection mycol data.csv

Import from URL

pyimport --database mydb --collection taxi \
         https://jdrumgoole.s3.eu-west-1.amazonaws.com/2018_Yellow_Taxi_Trip_Data_1000.csv

Resume Failed Imports

# First import with audit enabled
pyimport --audit --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

# Resume from where it left off
pyimport --restart --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv
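
One way a restart mechanism of this kind can work is batch-level checkpointing. The sketch below is an assumption about the general technique, not a description of PyImport's audit records: a plain dict stands in for the audit store, and `import_with_checkpoint` and `fail_at` are hypothetical names.

```python
def import_with_checkpoint(rows, batch_size, audit, fail_at=None):
    """Insert rows in batches, recording progress in `audit` after each batch.

    `audit` is a plain dict standing in for an audit record; `fail_at`
    simulates a crash before the batch starting at that row index.
    """
    inserted = []
    start = audit.get("rows_done", 0)  # resume point; 0 on a first run
    for offset in range(start, len(rows), batch_size):
        if fail_at is not None and offset >= fail_at:
            raise RuntimeError(f"simulated failure at row {offset}")
        batch = rows[offset:offset + batch_size]
        inserted.extend(batch)  # a real importer would write to MongoDB here
        audit["rows_done"] = offset + len(batch)
    return inserted


rows = list(range(10))
audit = {}
try:
    import_with_checkpoint(rows, batch_size=4, audit=audit, fail_at=8)
except RuntimeError:
    pass  # the first run dies after two batches, leaving rows_done at 8

resumed = import_with_checkpoint(rows, batch_size=4, audit=audit)
print(audit["rows_done"], resumed)  # 10 [8, 9]
```

The second call skips the eight rows already recorded and handles only the remainder.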

Why PyImport?

MongoDB's native mongoimport is excellent, but PyImport offers several additional capabilities:

PyImport Advantages

Feature	PyImport	mongoimport
Type inference	Automatic with `--genfieldfile`	Manual with `--columnsHaveTypes`
Dirty data handling	Graceful fallback to string	Strict, may fail
Date formats	Multiple formats, automatic detection	Limited
Parallel processing	Built-in `--multi`, `--asyncpro`, `--threads`	Requires external scripting
Restart capability	Built-in `--restart` and `--audit`	Not built-in
URL imports	Direct URL support	Requires pre-download
File splitting	Automatic with `--splitfile`	Manual
Performance optimization	Pre-compiled converters, fast ISO dates	Standard

mongoimport Advantages

Richer security options (Kerberos, LDAP, x.509)
MongoDB Enterprise Advanced features
JSON file imports (in addition to CSV)
Official MongoDB support

When to Use PyImport

Choose PyImport when you need to:

Handle messy, inconsistent, or "dirty" CSV data
Automatically infer types from CSV columns
Import large files quickly with parallel processing
Resume failed imports without starting over
Import data directly from URLs
Add metadata (timestamps, filenames, line numbers) to documents

Field Files (`.tff`)

Field files are TOML-formatted files that define column types and formats for CSV imports. They enable automatic type conversion during import.

Automatic Generation

The easiest way to create a field file is to generate it automatically:

pyimport --genfieldfile data.csv
# Creates data.tff with inferred types

Supported Types

str - String (text)
int - Integer
float - Floating point number
date - Date without time
datetime - Date with time
isodate - ISO format date (YYYY-MM-DD) - fastest parsing
bool - Boolean (true/false)
timestamp - Unix timestamp

Field File Naming

PyImport automatically looks for field files with the .tff extension:

For data.csv, it looks for data.tff
You can specify a custom field file with --fieldfile

Example Field File

For a CSV file with inventory data:

Inventory Item	Amount	Last Order
Screws	300	1-Jan-2016
Bolts	150	3-Feb-2017
Nails	25	31-Dec-2017

Running pyimport --genfieldfile inventory.csv generates:

# Created 'inventory.tff'
# at UTC: 2025-10-12 by pyimport.fieldfile

["Inventory Item"]
type = "str"
name = "Inventory Item"

["Amount"]
type = "int"
name = "Amount"

["Last Order"]
type = "date"
name = "Last Order"
format = "%d-%b-%Y"  # Date format string

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Type Inference

PyImport analyzes the first data row after the header to infer types:

Tries to parse as int
If that fails, tries float
If that fails, tries date
Falls back to str

You can manually edit .tff files to correct types if inference is incorrect.
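
The inference order above can be sketched as a chain of attempted conversions. This is a minimal illustration, not PyImport's code: `infer_type` is a hypothetical name, and the list of candidate date formats is an assumption, since the formats PyImport actually tries are not listed here.

```python
from datetime import datetime

# Assumed candidate formats; the source does not list the ones PyImport tries.
DATE_FORMATS = ("%Y-%m-%d", "%d-%b-%Y")


def infer_type(value):
    """Return "int", "float", "date", or "str" for one CSV cell."""
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return "date"
        except ValueError:
            continue
    return "str"


first_row = {"Inventory Item": "Screws", "Amount": "300", "Last Order": "1-Jan-2016"}
print({column: infer_type(cell) for column, cell in first_row.items()})
```

Because only the first data row is examined, a column whose first value happens to look like an integer is typed `int` even if later rows hold text, which is the main reason to review a generated `.tff` file.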

Graceful Error Handling

If type conversion fails during import, PyImport falls back to storing the value as a string instead of failing the entire import (unless --onerror fail is specified).
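
The fallback behaviour can be illustrated with a small wrapper around a converter. This is a sketch under assumptions: `convert` and its `on_error` parameter are hypothetical analogues of the `--onerror` flag, and the warning message is invented for the example.

```python
import logging


def convert(value, converter, on_error="warn"):
    """Apply converter to value, falling back to the raw string on failure."""
    try:
        return converter(value)
    except ValueError:
        if on_error == "fail":
            raise  # mirror "--onerror fail": stop at the first bad value
        logging.warning("could not convert %r; keeping it as a string", value)
        return value


amounts = ["300", "150", "n/a"]
print([convert(cell, int) for cell in amounts])  # [300, 150, 'n/a']
```

The resulting column holds mixed types, which MongoDB accepts, so one malformed cell does not abort the whole import.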

Date Format Strings

Date and datetime fields support strptime format strings:

["order_date"]
type = "date"
format = "%Y-%m-%d"  # 2024-12-31

Common format codes:

%Y - 4-digit year (2024)
%m - Month (01-12)
%d - Day (01-31)
%H - Hour (00-23)
%M - Minute (00-59)
%S - Second (00-59)
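
These codes are the ones used by Python's `datetime.strptime`, so a format string can be tried out in the interpreter before it goes into a field file. The sample values below are illustrative; `%b` (abbreviated month name) is the code used in the inventory example.

```python
from datetime import datetime

# Format from the inventory example: day, abbreviated month name, 4-digit year.
last_order = datetime.strptime("31-Dec-2017", "%d-%b-%Y")

# Date plus time, built from the codes listed above.
stamp = datetime.strptime("2024-12-31 23:59:58", "%Y-%m-%d %H:%M:%S")

print(last_order.date().isoformat())  # 2017-12-31
print(stamp.isoformat())              # 2024-12-31T23:59:58
```

A value that does not match the format raises `ValueError`, which makes mismatches easy to spot before running a full import.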

Date Parsing Performance

For best performance, choose the right date type:

isodate (fastest) - Use for ISO format dates (YYYY-MM-DD); 100x faster than generic date parsing

```
["created_date"]
type = "isodate"
```

date/datetime with format (fast) - Use when all dates have the same format

```
["order_date"]
type = "datetime"
format = "%Y-%m-%d %H:%M:%S"
```

date/datetime without format (slow) - Use only for inconsistent date formats

["flexible_date"]
type = "date"  # No format - uses slow dateutil.parser
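
The gap between a fixed-layout ISO parse and a format-driven parse can be measured with the standard library. The sketch below assumes that `date.fromisoformat` is a fair stand-in for the `isodate` type, which is not confirmed here; timings vary by machine, so no figures are quoted.

```python
import timeit
from datetime import date, datetime

SAMPLE = "2024-12-31"


def parse_iso(text):
    """Fixed-layout ISO parse, with no format string to interpret."""
    return date.fromisoformat(text)


def parse_with_format(text):
    """General strptime parse driven by an explicit format string."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Both routes must agree before their timings are worth comparing.
assert parse_iso(SAMPLE) == parse_with_format(SAMPLE)

iso_seconds = timeit.timeit(lambda: parse_iso(SAMPLE), number=20_000)
fmt_seconds = timeit.timeit(lambda: parse_with_format(SAMPLE), number=20_000)
print(f"fromisoformat: {iso_seconds:.4f}s  strptime: {fmt_seconds:.4f}s")
```

The format-free `dateutil.parser` route mentioned above is a third-party dependency and is left out to keep the sketch standard-library only.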

Complete Documentation

For comprehensive documentation including all CLI options, advanced features, and examples, visit:

📖 Full documentation: pyimport.readthedocs.io

Documentation includes:

Installation Guide - Setup and configuration
Quick Start - Step-by-step tutorials
CLI Reference - All 45+ command-line options
Field Files Guide - Complete .tff format reference
Advanced Usage - Parallel processing, optimization, production tips

Common Options

Basic Options

-h, --help              Show help message
--version               Show version number
--database NAME         Database name [default: PYIM]
--collection NAME       Collection name [default: imported]
--mdburi URI            MongoDB connection URI [default: mongodb://localhost:27017]

Field File Options

--genfieldfile          Generate field file from CSV
--fieldfile FILE        Specify custom field file path
--delimiter CHAR        Field delimiter [default: ,]
--hasheader             CSV has header line

Performance Options

--multi                 Multi-process parallel import
--asyncpro              Async parallel import (high performance)
--threads               Thread-based parallel import
--poolsize N            Number of parallel workers [default: 4]
--batchsize N           Batch size for bulk inserts [default: 1000]

File Splitting Options

--splitfile            Split file for parallel processing
--autosplit N          Split into N chunks
--keepsplits           Don't delete split files after import

Restart Options

--audit                Enable audit tracking
--restart              Resume from last successful import
--audithost URI        MongoDB URI for audit records

Data Enrichment Options

--addfilename          Add filename to each document
--addtimestamp now     Add current timestamp
--addtimestamp gen     Add generated ObjectId timestamp
--locator              Add filename and line number
--addfield key=value   Add custom field to all documents

Error Handling Options

--onerror fail         Stop on first error
--onerror warn         Log errors and continue [default]
--onerror ignore       Silently skip errors

Example Workflows

Simple Import

pyimport --genfieldfile data.csv
pyimport --database mydb --collection mycol data.csv

High-Performance Import

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --batchsize 5000 --database mydb --collection mycol \
         largefile.csv
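
The effect of `--batchsize` can be pictured as grouping rows before each bulk write. This is a generic sketch of batching, not PyImport's code; `batched` is a hypothetical helper defined here.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


sizes = [len(batch) for batch in batched(range(12_000), 5000)]
print(sizes)  # [5000, 5000, 2000]
```

Each batch would then go to the database in a single round trip, for example through pymongo's `insert_many`. Larger batches mean fewer round trips at the cost of more memory per batch.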

Import with Metadata

pyimport --addfilename --addtimestamp now --locator \
         --database mydb --collection mycol data.csv
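
Conceptually, these options attach provenance fields to every document before it is written. The sketch below is illustrative only: `enrich` and the field names `filename`, `line_number`, and `timestamp` are hypothetical, as is the assumption that the header occupies line 1. The names PyImport actually writes may differ.

```python
from datetime import datetime, timezone


def enrich(doc, filename, line_number, timestamp):
    """Return a copy of doc with provenance fields added."""
    enriched = dict(doc)
    enriched["filename"] = filename          # hypothetical field name
    enriched["line_number"] = line_number    # hypothetical field name
    enriched["timestamp"] = timestamp        # hypothetical field name
    return enriched


now = datetime.now(timezone.utc)
rows = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Assumes line 1 is the header, so data rows start at line 2.
docs = [enrich(row, "test.csv", n, now) for n, row in enumerate(rows, start=2)]
print(docs[0]["filename"], docs[0]["line_number"], docs[1]["line_number"])  # test.csv 2 3
```

Fields like these make it possible to trace a stored document back to the exact source line when a value looks wrong.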

Resume Failed Import

pyimport --audit --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

# If it fails, resume with:
pyimport --restart --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

git clone https://github.com/jdrumgoole/pyimport.git
cd pyimport
poetry install --with dev

# Run tests
poetry run pytest

# Run all tests with coverage
invoke test-all

Testing

PyImport's test suite covers at least 72% of the code:

# Run all tests
invoke test-all

# Run specific test suites
cd test/test_command && poetry run pytest
cd test/test_e2e && poetry run pytest

# Quick smoke tests
invoke quick-test

Version History

1.9.0 (Current)

Comprehensive documentation (2,700+ lines)
Version centralization with single source of truth
Read the Docs integration
Performance improvements (20-35% faster)
Test coverage improvements (72%)
Bug fixes for --version flag

1.8.2

Previous stable release

See CHANGELOG for complete version history.

Support

Email: joe@joedrumgoole.com
X/Twitter: @jdrumgoole
GitHub Issues: Report bugs or request features

License

Apache License 2.0 - See LICENSE file for details.

Made with ❤️ by Joe Drumgoole

Release history

Version	Released
2.0.9	Oct 29, 2025
2.0.8	Oct 29, 2025
2.0.7	Oct 15, 2025
2.0.6	Oct 14, 2025
2.0.3	Oct 14, 2025
2.0.2	Oct 14, 2025
2.0.1	Oct 14, 2025
1.10.9	Oct 13, 2025
1.10.8	Oct 13, 2025
1.10.7	Oct 13, 2025
1.10.6	Oct 13, 2025
1.10.5	Oct 13, 2025
1.10.4	Oct 13, 2025
1.10.3	Oct 13, 2025
1.10.2	Oct 13, 2025
1.10.1	Oct 13, 2025
1.10.0	Oct 13, 2025
1.9.1	Oct 12, 2025
1.9.0 (this version)	Oct 12, 2025
1.8.2	Jun 30, 2024
1.8.1	Jun 22, 2024
1.8	Jun 22, 2024
1.7b1 (pre-release)	Jun 21, 2024
0.1.0	May 27, 2024

Download files

Source Distribution

pyimport-1.9.0.tar.gz (58.6 kB)

Uploaded Oct 12, 2025 Source

Built Distribution

pyimport-1.9.0-py3-none-any.whl (77.9 kB)

Uploaded Oct 12, 2025 Python 3

File details

Details for the file pyimport-1.9.0.tar.gz.

File metadata

Download URL: pyimport-1.9.0.tar.gz
Upload date: Oct 12, 2025
Size: 58.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for pyimport-1.9.0.tar.gz
Algorithm	Hash digest
SHA256	`876a1b79b046943edd58e17f5a6a6f3d6ba0055e426e8b29001041397e006a84`
MD5	`e0f7641cc3951360b49ce6bb2775ffbb`
BLAKE2b-256	`19853e198edd183496d95bf4f416b24435b78a28fbeaf00dc01383592ea24c4a`

File details

Details for the file pyimport-1.9.0-py3-none-any.whl.

File metadata

Download URL: pyimport-1.9.0-py3-none-any.whl
Upload date: Oct 12, 2025
Size: 77.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.9 Darwin/24.6.0

File hashes

Hashes for pyimport-1.9.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`af3a590d51df5d1ae54ac9297b0b13f8192fd7f6c777f2bf27db61fdc6894b6e`
MD5	`f1f5663a68ff275786ff84141edc807a`
BLAKE2b-256	`2702f0a32925b611312e324fec1ea56c329023dc64b6c08917b68e480c4a0603`
