A CSV importer for MongoDB

PyImport - A Powerful CSV Importer for MongoDB

PyImport is a Python command-line tool for importing CSV data into MongoDB with automatic type detection, parallel processing, and graceful handling of "dirty" data.

Unlike MongoDB's native mongoimport, PyImport focuses on handling real-world messy data, automatic type inference, and high-performance parallel imports.

Version: 1.9.0
Author: Joe Drumgoole (joe@joedrumgoole.com | @jdrumgoole)
License: Apache 2.0
Source: github.com/jdrumgoole/pyimport
Documentation: pyimport.readthedocs.io

Key Features

  • Automatic Type Detection - Generate field files with inferred types using --genfieldfile
  • Graceful Error Handling - Falls back to strings on type conversion errors instead of failing
  • Multiple Import Strategies - Sync, async, multi-process, and threaded imports
  • Parallel Processing - Split large files and import in parallel for maximum throughput
  • Restart Capability - Resume failed imports from where they left off
  • Flexible Date Parsing - Multiple date formats, plus fast ISO date parsing (100x faster than generic date parsing)
  • Performance Optimized - Version 1.9.0 brings 20-35% faster imports
  • URL Support - Import directly from URLs or local files

Performance

  • Sync: ~24,000-32,000 docs/sec
  • Async: ~30,000-40,000 docs/sec
  • Multi-process: ~50,000+ docs/sec

Requirements

  • Python: 3.11 or higher
  • MongoDB: 4.0 or higher

Installation

From PyPI (Recommended)

pip install pyimport

From Source

git clone https://github.com/jdrumgoole/pyimport.git
cd pyimport
poetry install

Verify Installation

pyimport --version
# Output: pyimport 1.9.0

Quick Start

Step 1: Create a Simple CSV File

# Create a test CSV file
echo "name,age,city" > test.csv
echo "Alice,30,NYC" >> test.csv
echo "Bob,25,LA" >> test.csv

Step 2: Generate Field File (Type Definitions)

pyimport --genfieldfile test.csv
# Output: Created field filename 'test.tff' from 'test.csv'

This creates a test.tff file that defines the type of each column (string, int, date, etc.).

Step 3: Import to MongoDB

pyimport --database mydb --collection people test.csv
# Imports data using the auto-generated test.tff field file

Step 4: Verify Import

mongosh mydb --eval "db.people.find().pretty()"

Advanced Usage

Fast Parallel Import for Large Files

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --database mydb --collection mycol largefile.csv

This splits the file into 8 chunks and processes them with 4 parallel workers.
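
The split-then-pool idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not PyImport's implementation: the helper names `split_into_chunks` and `import_chunk` are hypothetical, the "import" just counts rows, and a thread pool stands in for the worker processes that --multi uses.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(rows[start:end])
        start = end
    return chunks


def import_chunk(chunk):
    """Hypothetical stand-in for a real import: count the rows in the chunk."""
    return len(chunk)


rows = [f"row-{i}" for i in range(20)]
chunks = split_into_chunks(rows, 8)          # like --autosplit 8

with ThreadPoolExecutor(max_workers=4) as pool:   # like --poolsize 4
    counts = list(pool.map(import_chunk, chunks))

total = sum(counts)
print(len(chunks), total)
```

With more chunks than workers, a worker that finishes early picks up the next chunk, which is one reason to split into more pieces than the pool size.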

Async Import (High Performance)

pyimport --asyncpro --database mydb --collection mycol data.csv

Import from URL

pyimport --database mydb --collection taxi \
         https://jdrumgoole.s3.eu-west-1.amazonaws.com/2018_Yellow_Taxi_Trip_Data_1000.csv

Resume Failed Imports

# First import with audit enabled
pyimport --audit --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

# Resume from where it left off
pyimport --restart --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

Why PyImport?

MongoDB's native mongoimport is excellent, but PyImport offers several additional capabilities:

PyImport Advantages

| Feature                  | PyImport                                | mongoimport                    |
|--------------------------|-----------------------------------------|--------------------------------|
| Type inference           | Automatic with --genfieldfile           | Manual with --columnsHaveTypes |
| Dirty data handling      | Graceful fallback to string             | Strict, may fail               |
| Date formats             | Multiple formats, automatic detection   | Limited                        |
| Parallel processing      | Built-in --multi, --asyncpro, --threads | Requires external scripting    |
| Restart capability       | Built-in --restart and --audit          | Not built-in                   |
| URL imports              | Direct URL support                      | Requires pre-download          |
| File splitting           | Automatic with --splitfile              | Manual                         |
| Performance optimization | Pre-compiled converters, fast ISO dates | Standard                       |

mongoimport Advantages

  • Richer security options (Kerberos, LDAP, x.509)
  • MongoDB Enterprise Advanced features
  • JSON file imports (in addition to CSV)
  • Official MongoDB support

When to Use PyImport

Choose PyImport when you need to:

  • Handle messy, inconsistent, or "dirty" CSV data
  • Automatically infer types from CSV columns
  • Import large files quickly with parallel processing
  • Resume failed imports without starting over
  • Import data directly from URLs
  • Add metadata (timestamps, filenames, line numbers) to documents

Field Files (.tff)

Field files are TOML-formatted files that define column types and formats for CSV imports. They enable automatic type conversion during import.

Automatic Generation

The easiest way to create a field file is to generate it automatically:

pyimport --genfieldfile data.csv
# Creates data.tff with inferred types

Supported Types

  • str - String (text)
  • int - Integer
  • float - Floating point number
  • date - Date without time
  • datetime - Date with time
  • isodate - ISO format date (YYYY-MM-DD) - fastest parsing
  • bool - Boolean (true/false)
  • timestamp - Unix timestamp

Field File Naming

PyImport automatically looks for field files with the .tff extension:

  • For data.csv, it looks for data.tff
  • You can specify a custom field file with --fieldfile

Example Field File

For a CSV file with inventory data:

Inventory Item,Amount,Last Order
Screws,300,1-Jan-2016
Bolts,150,3-Feb-2017
Nails,25,31-Dec-2017

Running pyimport --genfieldfile inventory.csv generates:

# Created 'inventory.tff'
# at UTC: 2025-10-12 by pyimport.fieldfile

["Inventory Item"]
type = "str"
name = "Inventory Item"

["Amount"]
type = "int"
name = "Amount"

["Last Order"]
type = "date"
name = "Last Order"
format = "%d-%b-%Y"  # Date format string

[DEFAULTS_SECTION]
delimiter = ","
has_header = true

Type Inference

PyImport analyzes the first data row after the header to infer types:

  1. Tries to parse as int
  2. If that fails, tries float
  3. If that fails, tries date
  4. Falls back to str

You can manually edit .tff files to correct types if inference is incorrect.
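
The inference order described above can be sketched as a simple try-in-sequence function. This is an illustration only: `infer_type` is a hypothetical name, and the single `date_format` it tests is an assumption, since the real tool recognises multiple date formats.

```python
from datetime import datetime


def infer_type(value, date_format="%d-%b-%Y"):
    """Return 'int', 'float', 'date' or 'str' for one CSV cell.

    date_format is an assumption made for this sketch.
    """
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        pass
    try:
        datetime.strptime(value, date_format)
        return "date"
    except ValueError:
        pass
    return "str"  # nothing else matched


first_row = ["Screws", "300", "1-Jan-2016", "2.5"]
inferred = [infer_type(cell) for cell in first_row]
print(inferred)
```

Because only one row is examined, a column whose first value looks like an integer but later contains text would be typed as int, which is the kind of case where editing the .tff file by hand helps.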

Graceful Error Handling

If type conversion fails during import, PyImport falls back to storing the value as a string instead of failing the entire import (unless --onerror fail is specified).
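
A minimal sketch of this fallback behaviour, assuming a per-type converter table: the names `CONVERTERS`, `convert`, and the `on_error` parameter are hypothetical and only mirror the --onerror option described in this document, not PyImport's internal API.

```python
from datetime import datetime

# Hypothetical converter table, one callable per field-file type.
CONVERTERS = {
    "str": str,
    "int": int,
    "float": float,
    "date": lambda s: datetime.strptime(s, "%Y-%m-%d"),
}


def convert(value, type_name, on_error="warn"):
    """Convert a CSV cell, keeping the raw string if conversion fails."""
    try:
        return CONVERTERS[type_name](value)
    except ValueError:
        if on_error == "fail":
            raise          # behave like --onerror fail
        return value       # graceful fallback: store the original string


row = {"name": "Bolts", "amount": "n/a"}
types = {"name": "str", "amount": "int"}
doc = {key: convert(cell, types[key]) for key, cell in row.items()}
print(doc)
```

The dirty "n/a" value survives as a string, so the document is still inserted and the bad cell can be found and fixed later.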

Date Format Strings

Date and datetime fields support strptime format strings:

["order_date"]
type = "date"
format = "%Y-%m-%d"  # 2024-12-31

Common format codes:

  • %Y - 4-digit year (2024)
  • %m - Month (01-12)
  • %d - Day (01-31)
  • %H - Hour (00-23)
  • %M - Minute (00-59)
  • %S - Second (00-59)

Date Parsing Performance

For best performance, choose the right date type:

  1. isodate (fastest) - Use for ISO format dates (YYYY-MM-DD); 100x faster than generic date parsing

    ["created_date"]
    type = "isodate"
    
  2. date/datetime with format (fast) - Use when all dates have the same format

    ["order_date"]
    type = "datetime"
    format = "%Y-%m-%d %H:%M:%S"
    
  3. date/datetime without format (slow) - Use only for inconsistent date formats

    ["flexible_date"]
    type = "date"  # No format - uses slow dateutil.parser
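
To get a feel for why the tiers differ, the first two can be compared using only the standard library. This sketch assumes nothing about how PyImport implements isodate; it contrasts a dedicated ISO parser (`datetime.fromisoformat`) with format-driven `strptime`. The third tier relies on the third-party dateutil parser and is left out to keep the example dependency-free. Timings vary by machine and are not meant to reproduce the 100x figure.

```python
import timeit
from datetime import datetime


def parse_isodate(text):
    """Tier 1: fixed ISO layout (YYYY-MM-DD), handled by a dedicated parser."""
    return datetime.fromisoformat(text)


def parse_with_format(text, fmt="%Y-%m-%d"):
    """Tier 2: one known format string, interpreted by strptime."""
    return datetime.strptime(text, fmt)


sample = "2024-12-31"
iso_result = parse_isodate(sample)
fmt_result = parse_with_format(sample)

# Both tiers produce the same value; only the cost of getting there differs.
iso_seconds = timeit.timeit(lambda: parse_isodate(sample), number=20_000)
fmt_seconds = timeit.timeit(lambda: parse_with_format(sample), number=20_000)
print(f"fromisoformat: {iso_seconds:.4f}s, strptime: {fmt_seconds:.4f}s")
```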
    

Complete Documentation

For comprehensive documentation including all CLI options, advanced features, and examples, visit:

📖 Full Documentation at pyimport.readthedocs.io

Common Options

Basic Options

-h, --help              Show help message
--version               Show version number
--database NAME         Database name [default: PYIM]
--collection NAME       Collection name [default: imported]
--mdburi URI            MongoDB connection URI [default: mongodb://localhost:27017]

Field File Options

--genfieldfile          Generate field file from CSV
--fieldfile FILE        Specify custom field file path
--delimiter CHAR        Field delimiter [default: ,]
--hasheader             CSV has header line

Performance Options

--multi                Multi-process parallel import
--asyncpro             Async parallel import (high performance)
--threads              Thread-based parallel import
--poolsize N           Number of parallel workers [default: 4]
--batchsize N          Batch size for bulk inserts [default: 1000]

File Splitting Options

--splitfile            Split file for parallel processing
--autosplit N          Split into N chunks
--keepsplits           Don't delete split files after import

Restart Options

--audit                Enable audit tracking
--restart              Resume from last successful import
--audithost URI        MongoDB URI for audit records

Data Enrichment Options

--addfilename          Add filename to each document
--addtimestamp now     Add current timestamp
--addtimestamp gen     Add generated ObjectId timestamp
--locator              Add filename and line number
--addfield key=value   Add custom field to all documents

Error Handling Options

--onerror fail         Stop on first error
--onerror warn         Log errors and continue [default]
--onerror ignore       Silently skip errors

Example Workflows

Simple Import

pyimport --genfieldfile data.csv
pyimport --database mydb --collection mycol data.csv

High-Performance Import

pyimport --multi --splitfile --autosplit 8 --poolsize 4 \
         --batchsize 5000 --database mydb --collection mycol \
         largefile.csv
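
The effect of --batchsize can be pictured as grouping documents before each bulk insert. The generator below is a generic sketch of that grouping; `batches` is a hypothetical helper, not a PyImport function. With pymongo, each batch would typically be handed to `Collection.insert_many`.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Twelve toy documents, grouped five at a time.
docs = ({"n": i} for i in range(12))
sizes = [len(batch) for batch in batches(docs, 5)]
print(sizes)
```

Larger batches mean fewer round trips to the server at the cost of more memory per batch, which is why the value is worth tuning for large files.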

Import with Metadata

pyimport --addfilename --addtimestamp now --locator \
         --database mydb --collection mycol data.csv
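
Conceptually, these enrichment options attach provenance fields to every document before it is inserted. The sketch below shows the idea only; the function `enrich` and the field names `filename`, `timestamp`, and `locator` are assumptions, and the names PyImport actually writes may differ.

```python
from datetime import datetime, timezone


def enrich(doc, filename, line_number, now=None):
    """Return a copy of doc with hypothetical provenance fields added."""
    enriched = dict(doc)
    enriched["filename"] = filename                            # like --addfilename
    enriched["timestamp"] = now or datetime.now(timezone.utc)  # like --addtimestamp now
    enriched["locator"] = {"f": filename, "n": line_number}    # like --locator
    return enriched


fixed_time = datetime(2025, 1, 1, tzinfo=timezone.utc)
original = {"name": "Alice", "age": 30}
result = enrich(original, "data.csv", 2, now=fixed_time)
print(result)
```

Storing the source file and line number with each document makes it possible to trace a bad value back to the exact CSV row.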

Resume Failed Import

pyimport --audit --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

# If it fails, resume with:
pyimport --restart --audithost mongodb://localhost:27017 \
         --database mydb --collection mycol largefile.csv

Contributing

Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.

Development Setup

git clone https://github.com/jdrumgoole/pyimport.git
cd pyimport
poetry install --with dev

# Run tests
poetry run pytest

# Run all tests with coverage
invoke test-all

Testing

PyImport's test suite covers 72%+ of the code:

# Run all tests
invoke test-all

# Run specific test suites
cd test/test_command && poetry run pytest
cd test/test_e2e && poetry run pytest

# Quick smoke tests
invoke quick-test

Version History

1.9.0 (Current)

  • Comprehensive documentation (2,700+ lines)
  • Version centralization with single source of truth
  • Read the Docs integration
  • Performance improvements (20-35% faster)
  • Test coverage improvements (72%)
  • Bug fixes for --version flag

1.8.2

  • Previous stable release

See CHANGELOG for complete version history.

License

Apache License 2.0 - See LICENSE file for details.


Made with ❤️ by Joe Drumgoole
