A simple DBF file converter for Python3
A Python3 utility for converting DBF version 5 files to CSV files, Pandas DataFrames, SQL tables, or HDF5 tables. (There is limited Python2 support. See below.) This code was designed to be very simple, fast and memory efficient; therefore, it lacks many features (such as a DBF file writer) that other packages might provide. The conversion to CSV format is entirely written in Python, so no additional dependencies are necessary. For other formats, see Requirements. DBF version 7, the most recent DBF file spec, is not currently supported by this package.
Python 2 Support Except for HDF file export, this code should work fine with Python2. HDF files created using simpledbf in Python3 are compatible with Python2 HDF packages, so in principle, you could make any HDF files in a temporary Python3 environment. If you are using the Anaconda Python distribution (recommended), then you can make a small Python3 environment as follows:
$ conda create -n dbf python=3 pip pandas pytables sqlalchemy # Lots of output... $ source activate dbf dbf>$ pip install simpledbf dbf>$ python my_py3_hdf_creation_script.py # This is using Python3 dbf>$ source deactivate $ python my_py2_stuff_with_hdf.py # This is using Python2 again
Bug fixes, questions, and update requests are encouraged and can be filed at the GitHub repo.
This code is derived from an ActiveState DBF example that works with Python2 and is distributed under a PSF license.
Note on Empty/Bad Data
The current version of this package attempts to convert blank strings and poorly formatted values to an empty value of your choosing. This is controlled by the na keyword argument to all export functions. The default for CSV is an empty string (‘’), and for all other exports, it is ‘nan’ which is converted to float(‘nan’). Pandas has very powerful methods and algorithms for working with missing data, including converting NaN to other values (e.g. empty strings).
This module was tested with the following package versions. Python is the only requirement if you only want to export to CSV. In that case, comment out the other imports in the source code (only one file) to avoid using the optional packages.
Python >=3.4, >=2.7.9 (no HDF export)
Pandas >= 0.15.2 (Required for DataFrame)
PyTables >= 3.1 (with Pandas required for HDF tables)
SQLalchemy >= 0.9 (with Pandas required for SQL tables)
It’s probably easiest to install this package using pip:
$ pip install simpledbf
Or from GitHub:
$ pip install git+https://github.com/rnelsonchem/simpledbf.git
However, this package is currently a single file, so you can just download it into any folder of your choosing.
Load a DBF file
This module currently only defines a single class, Dbf5, which is instantiated with a DBF file name (can contain path info).
In : from simpledbf import Dbf5 In : dbf = Dbf5('fake_file_name.dbf')
The Dbf5 object will initially only read the header information from the file, so you can inspect some of the properties. For example, numrec is the number of records in the DBF file, and fields is a list of tuples with information about the data columns. (See the DBF file spec for info on the column type characters. The “DeletionFlag” column is not exported, but simply checks if a record has been deleted.)
In : dbf.numrec Out: 10000 In : dbf.fields Out: [('DeletionFlag', 'C', 1), ('col_1', 'C', 15), ('col_2', 'N', 2)]
The docstring for this object contains a complete listing of attributes and their descriptions.
The mem method gives an approximate memory requirement for processing this DBF file. (~2x the total file size.) In addition, all of the output methods in this object take a chunksize keyword argument, which lets you split up the processing of large files into smaller chunks, to limit the total memory usage of the conversion process. When this keyword argument is passed into mem, the approximate memory footprint of the chunk will also be given, which can be useful when trying to determine the maximum chunksize your memory will allow.
In : dbf.mem() This total process would require more than 350.2 MB of RAM. In : dbf.mem(chunksize=1000) Each chunk will require 4.793 MB of RAM. This total process would require more than 350.2 MB of RAM.
To export the data to a CSV file, use the to_csv method, which takes the name of a CSV file as an input. The default behavior is to append new data to an existing file, so be careful if the file already exists. If chunksize is passed as a keyword argument, the file buffer will be flushed after processing that many records. (May not be necessary.) The na keyword changes the value used for missing/bad entries (default is ‘’).
In : dbf = Dbf5('fake_file_name.dbf') In : dbf.to_csv('junk.csv')
If you are unhappy with the default CSV output of this module, Pandas also has very powerful CSV export capabilities, among other formats.
The to_dataframe method returns the DBF records as a Pandas DataFrame. Obviously, this method requires that Pandas is installed. If the size of the DBF file exceeds available memory, then passing the chunksize keyword argument will return a generator function. This generator yields DataFrames of len(<=chunksize) until all of the records have been processed. The na keyword changes the value used for missing/bad entries (default is ‘nan’ which inserts float(‘nan’)).
In : dbf = Dbf5('fake_file_name.dbf') In : df = dbf.to_dataframe() # df is a DataFrame with all records In : dbf = Dbf5('fake_file_name.dbf') In : for df in dbf.to_dataframe(chunksize=10000) .... do_cool_stuff(df) # Here a generator is returned
To an SQL Table
The to_pandassql method will transfer the DBF entries to an SQL database table of your choice. This method uses a combination of Pandas DataFrames and SQLalchemy, so both of these packages must be installed. This method requires an SQLalchemy engine string, which is used to initialize the database connection. This will be limited to the SQL databases supported by SQLalchemy, see the SQLalchemy docs for more info. (This has been tested with SQLite and Postgresql.)
In : dbf = Dbf5('fake_file_name.dbf') In : dbf = dbf.to_pandassql('sqlite:///foo.db')
This method takes three optional arguments. table is the name of the table you’d like to use. If this is not passed, your new table will have the same name as the DBF file without file extension. Again, the default here is to append to an existing table. If you want to start fresh, delete the existing table before using this function. The chunksize keyword processes the DBF file in chunks of records no larger than this size. The na keyword changes the value used for missing/bad entries (default is ‘nan’ which inserts float(‘nan’)).
In : dbf = Dbf5('fake_file_name.dbf') In : dbf = dbf.to_pandassql('sqlite:///foo.db', table="fake_tbl", .... chunksize=100000)
To an HDF5 Table
The to_pandashdf method will transfer the DBF entries to an HDF5 table of your choice. This method uses a combination of Pandas DataFrames and PyTables, so both of these packages must be installed. This method requires a file name string for the HDF file you’d like to use. This file will be created if it does not exist. Again, the default is to append to an existing file of that name, so be careful here. The HDF file will be created using the highest level of compression (9) with the ‘blosc’ compression lib. This saves an enormous amount of disk space, with little degradation of performance.
In : dbf = Dbf5('fake_file_name.dbf') In : dbf = dbf.to_pandashdf('fake.h5')
This method uses the same optional arguments, and corresponding defaults, as to_pandassql. See above.
In : dbf = Dbf5('fake_file_name.dbf') In : dbf = dbf.to_pandassql('fake.h5', table="fake_tbl", chunksize=100000)
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.