ARFF-CSV Converter

Requires Python 3.11+.

A Python library for converting between CSV and ARFF (Weka) file formats. ARFF (Attribute-Relation File Format) is the standard file format used by the Weka machine learning toolkit.

Features

  • Bidirectional conversion: Convert CSV to ARFF and ARFF to CSV
  • Automatic type detection: Infers numeric, nominal, string, and date types
  • CSV analysis mode: Analyze CSV files and get suggestions for column types
  • Missing value handling: Handles missing values using the ARFF standard marker (?)
  • Sparse format support: Read and write sparse ARFF format
  • Command-line interface: Easy-to-use CLI for quick conversions
  • Pandas integration: Works directly with pandas DataFrames
  • Type hints: Full type annotation support for better IDE integration
  • Well tested: Comprehensive test suite with high coverage

Installation

pip install arff-csv-converter

For development dependencies:

pip install "arff-csv-converter[dev]"

Quick Start

Python API

Convert CSV to ARFF

from arff_csv import csv_to_arff

# Basic conversion
csv_to_arff("data.csv", "data.arff")

# With options
csv_to_arff(
    "data.csv",
    "data.arff",
    relation_name="my_dataset",
    nominal_columns=["class", "category"],
    comments=["Generated by my application"]
)

Convert ARFF to CSV

from arff_csv import arff_to_csv

# Basic conversion
df = arff_to_csv("data.arff", "data.csv")

# Access the DataFrame directly
print(df.head())

Using the Converter Class

from arff_csv import ArffConverter

converter = ArffConverter()

# CSV to ARFF
arff_data = converter.csv_to_arff("input.csv", "output.arff")
print(f"Relation: {arff_data.relation_name}")
print(f"Attributes: {len(arff_data.attributes)}")
print(f"Instances: {len(arff_data.data)}")

# ARFF to CSV
df = converter.arff_to_csv("input.arff", "output.csv")

# Work with DataFrames directly
df = converter.arff_to_dataframe("data.arff")
converter.dataframe_to_arff(df, "output.arff", relation_name="my_data")

# Get ARFF as string
arff_string = converter.dataframe_to_arff_string(df, relation_name="my_data")

Working with ArffData

from arff_csv import ArffParser

parser = ArffParser()
arff_data = parser.parse_file("data.arff")

# Access metadata
print(f"Relation: {arff_data.relation_name}")
print(f"Comments: {arff_data.comments}")

# Access attributes
for attr in arff_data.attributes:
    print(f"  {attr.name}: {attr.type.name}")
    if attr.nominal_values:
        print(f"    Values: {attr.nominal_values}")

# Access data as DataFrame
df = arff_data.data
print(df.describe())

# Get attribute lists
numeric_attrs = arff_data.get_numeric_attributes()
nominal_attrs = arff_data.get_nominal_attributes()

Command Line Interface

The package installs a command-line tool arff-csv:

Analyze CSV (Recommended First Step)

Before converting, you can analyze your CSV file to get suggestions for column types:

arff-csv csv2arff iris.csv --analyze

This will output:

======================================================================
CSV ANALYSIS: iris.csv
======================================================================

Rows: 150
Columns: 6

DATA PREVIEW (first 5 rows):
----------------------------------------------------------------------
   Unnamed_0  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
0          0                5.1               3.5                1.4               0.2      0
1          1                4.9               3.0                1.4               0.2      0
2          2                4.7               3.2                1.3               0.2      0
3          3                4.6               3.1                1.5               0.2      0
4          4                5.0               3.6                1.4               0.2      0

COLUMN ANALYSIS:
----------------------------------------------------------------------
Column                    Type       Unique   Nulls    Reason
----------------------------------------------------------------------
Unnamed_0                 INTEGER    150      0        Integer values
sepal length (cm)         NUMERIC    35       0        Floating point values
sepal width (cm)          NUMERIC    23       0        Floating point values
petal length (cm)         NUMERIC    43       0        Floating point values
petal width (cm)          NUMERIC    22       0        Floating point values
class                     NOMINAL    3        0        Common target/class column name

COLUMNS SUGGESTED FOR EXCLUSION:
----------------------------------------------------------------------
  - Unnamed_0: Unique value for every row

SUGGESTED COMMAND:
----------------------------------------------------------------------

arff-csv csv2arff iris.csv iris.arff --relation "iris" --nominal \
    class --exclude Unnamed_0

SUMMARY:
----------------------------------------------------------------------
  Numeric columns:  5
  Nominal columns:  1
  String columns:   0
  Suggested excludes: 1

  Nominal: class
  Exclude: Unnamed_0

Analysis options:

Option                  Description                             Default
-a, --analyze           Enable analysis mode (no conversion)    -
--preview-rows N        Number of rows to preview               5
--nominal-threshold N   Max unique values to consider nominal   10

Detection criteria:

  • Nominal columns: Binary values (0/1, yes/no, true/false), columns named "class"/"target"/"label", integer columns with few unique values
  • String columns: Text with many unique values, long text (avg > 50 chars)
  • Numeric columns: Floating point values, integers with many unique values
  • Exclusion suggestions: Columns with a single unique value, or identifier-like columns with a different value in every row
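
The criteria above can be sketched in plain Python. This is an illustrative approximation only, not the package's actual implementation: the function name suggest_column_type, the constants, the order in which the rules are checked, and the handling of an all-missing column are all assumptions made for this example.

```python
from statistics import mean

# Assumed constants, chosen to mirror the criteria listed above.
BINARY_SETS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]
TARGET_NAMES = {"class", "target", "label"}


def _parses_as(caster, text):
    """Return True if caster(text) succeeds (caster is int or float)."""
    try:
        caster(text)
        return True
    except ValueError:
        return False


def suggest_column_type(name, values, nominal_threshold=10):
    """Return (suggested_type, exclusion_reason_or_None) for one CSV column.

    Hypothetical helper: `values` is a list of raw cell strings.
    """
    # Ignore empty cells and the ARFF missing marker.
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    if not present:
        # Not covered by the criteria above; an assumption of this sketch.
        return "STRING", "No non-missing values"

    unique = {v.lower() for v in present}
    all_int = all(_parses_as(int, v) for v in present)
    all_float = all(_parses_as(float, v) for v in present)

    # Exclusion suggestions: constant columns and integer row identifiers.
    exclude = None
    if len(unique) == 1:
        exclude = "Single unique value"
    elif all_int and len(unique) == len(present):
        exclude = "Unique value for every row"

    # Type suggestion, checked in an assumed order of precedence.
    if name.lower() in TARGET_NAMES or unique in BINARY_SETS:
        kind = "NOMINAL"
    elif all_int:
        kind = "NOMINAL" if len(unique) <= nominal_threshold else "NUMERIC"
    elif all_float:
        kind = "NUMERIC"
    elif len(unique) > nominal_threshold or mean(len(v) for v in present) > 50:
        kind = "STRING"
    else:
        kind = "NOMINAL"
    return kind, exclude
```

With this sketch, an index column holding the strings "0" through "149" comes back as numeric with an exclusion reason, while a column named class is suggested as nominal regardless of its values. The real tool may weigh these rules differently, so treat the --analyze report as the authority.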

Convert CSV to ARFF

# Basic conversion
arff-csv csv2arff input.csv output.arff

# With options
arff-csv csv2arff input.csv output.arff \
    --relation "my_dataset" \
    --nominal class category \
    --string description \
    --exclude id \
    --comment "Generated on 2024-01-15" \
    --verbose

Conversion options:

Option                   Description                          Default
-r, --relation NAME      Relation name                        Input filename
-n, --nominal COL...     Columns to treat as nominal          -
-s, --string COL...      Columns to treat as string           -
--exclude COL...         Columns to exclude from conversion   -
-m, --missing VALUE      Missing value representation         ?
-c, --comment TEXT...    Comments to add                      -
--delimiter CHAR         CSV delimiter                        ,
--encoding ENC           File encoding                        utf-8
-v, --verbose            Verbose output                       -

Convert ARFF to CSV

# Basic conversion
arff-csv arff2csv input.arff output.csv

# With options
arff-csv arff2csv input.arff output.csv \
    --delimiter ";" \
    --include-index \
    --verbose

Display ARFF file information

arff-csv info data.arff

Output:

ARFF File: data.arff
Relation: iris
Instances: 150
Attributes: 5

Attribute Information:
------------------------------------------------------------
  sepallength: NUMERIC
  sepalwidth: NUMERIC
  petallength: NUMERIC
  petalwidth: NUMERIC
  class: NOMINAL {Iris-setosa, Iris-versicolor, Iris-virginica}

Data Preview (first 5 rows):
------------------------------------------------------------
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
...

ARFF Format Reference

ARFF (Attribute-Relation File Format) is a text format that describes a dataset as a relation with named attributes. The format consists of:

  1. Header section: Relation name and attribute definitions
  2. Data section: The actual data instances

Example ARFF File

% This is a comment
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
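
To make the header/data split concrete, here is a minimal standard-library sketch that renders a file with the same layout as the example above. The function render_arff is hypothetical and is not part of this package; it also skips quoting of names or values that contain spaces or commas, which a real writer has to handle.

```python
def render_arff(relation, attributes, rows):
    """Render a small ARFF document as a string (illustrative sketch only).

    attributes: list of (name, spec) pairs, where spec is either a type
    keyword such as "NUMERIC" or a list of nominal values.
    rows: list of value lists; None is written as the missing marker "?".
    """
    # Header section: relation name followed by attribute definitions.
    lines = [f"@RELATION {relation}", ""]
    for name, spec in attributes:
        if isinstance(spec, (list, tuple)):
            spec = "{" + ", ".join(spec) + "}"
        lines.append(f"@ATTRIBUTE {name} {spec}")

    # Data section: one comma-separated instance per line.
    lines += ["", "@DATA"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_arff(
        "iris",
        [("sepallength", "NUMERIC"),
         ("class", ["Iris-setosa", "Iris-versicolor"])],
        [[5.1, "Iris-setosa"], [None, "Iris-versicolor"]],
    ))
```

For real conversions, prefer the package's own csv_to_arff or ArffConverter, which deal with type detection and quoting for you.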

Supported Attribute Types

Type      Description              Example
NUMERIC   Floating-point numbers   @ATTRIBUTE value NUMERIC
INTEGER   Integer numbers          @ATTRIBUTE count INTEGER
REAL      Alias for NUMERIC        @ATTRIBUTE value REAL
STRING    Text strings             @ATTRIBUTE name STRING
NOMINAL   Categorical values       @ATTRIBUTE class {a, b, c}
DATE      Date/time values         @ATTRIBUTE date DATE 'yyyy-MM-dd'
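
As an illustration of how the declarations in this table break down, the sketch below parses a single @ATTRIBUTE line into its name, type, and any nominal values or date format. The function parse_attribute and the dictionary it returns are assumptions made for this example; the package's own ArffParser and Attribute classes are the supported way to read attribute definitions.

```python
import re

# Name may be bare, single-quoted, or double-quoted; the rest is the type spec.
ATTRIBUTE_RE = re.compile(
    r"@attribute\s+('[^']*'|\"[^\"]*\"|\S+)\s+(.+)", re.IGNORECASE
)
SIMPLE_TYPES = {"NUMERIC", "INTEGER", "REAL", "STRING"}


def parse_attribute(line):
    """Parse one @ATTRIBUTE declaration into a dict (illustrative sketch)."""
    match = ATTRIBUTE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Not an @ATTRIBUTE line: {line!r}")
    name = match.group(1).strip("'\"")
    spec = match.group(2).strip()

    # Nominal attributes list their allowed values in braces.
    # Note: this simple split does not handle commas inside quoted values.
    if spec.startswith("{") and spec.endswith("}"):
        values = [v.strip().strip("'\"") for v in spec[1:-1].split(",")]
        return {"name": name, "type": "NOMINAL", "values": values}

    parts = spec.split(None, 1)
    kind = parts[0].upper()
    if kind == "DATE":
        # The date format is optional in the declaration.
        fmt = parts[1].strip().strip("'\"") if len(parts) > 1 else None
        return {"name": name, "type": "DATE", "format": fmt}
    if kind in SIMPLE_TYPES:
        return {"name": name, "type": kind}
    raise ValueError(f"Unknown attribute type: {spec!r}")
```

Keywords are matched case-insensitively here, so `@attribute count integer` and `@ATTRIBUTE count INTEGER` parse to the same result.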

Missing Values

Missing values are represented by ? in ARFF format:

@DATA
5.1,3.5,?,0.2,Iris-setosa
?,3.0,1.4,0.2,?
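
A small sketch of what reading such rows involves, using only the standard library: each cell equal to the missing marker becomes None. The helper parse_data_rows is hypothetical and ignores ARFF quoting rules and sparse rows; it is not how the package itself is implemented.

```python
import csv
import io


def parse_data_rows(text, missing="?"):
    """Split ARFF data lines into cells, mapping the missing marker to None."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        [None if cell.strip() == missing else cell.strip() for cell in row]
        for row in reader
        if row  # skip blank lines
    ]


if __name__ == "__main__":
    rows = parse_data_rows("5.1,3.5,?,0.2,Iris-setosa\n?,3.0,1.4,0.2,?\n")
    for row in rows:
        print(row)
```

Values are left as strings in this sketch; converting them to numbers or categories depends on the attribute types declared in the header.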

API Reference

Main Functions

  • csv_to_arff(csv_path, arff_path, ...) - Convert CSV file to ARFF
  • arff_to_csv(arff_path, csv_path, ...) - Convert ARFF file to CSV

Classes

  • ArffConverter - Main converter class with full functionality
  • ArffParser - Parser for reading ARFF files
  • ArffWriter - Writer for creating ARFF files
  • ArffData - Container for parsed ARFF data
  • Attribute - ARFF attribute definition

Exceptions

  • ArffCsvError - Base exception for all errors
  • ArffParseError - Error parsing ARFF files
  • ArffWriteError - Error writing ARFF files
  • CsvParseError - Error parsing CSV files
  • InvalidAttributeError - Invalid attribute definition
  • MissingDataError - Required data missing
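
Because ArffCsvError is described as the base exception, a common pattern is to catch the specific error you can act on first and fall back to the base class last. The sketch below uses local stand-in classes with the documented names so that it runs on its own; that the specific errors subclass ArffCsvError, and that the real classes can be imported from the top-level package, are assumptions not confirmed here.

```python
# Stand-in classes that mirror the documented names. In real code you would
# import the package's own exceptions instead (import path assumed).
class ArffCsvError(Exception):
    """Stand-in for the documented base exception."""


class ArffParseError(ArffCsvError):
    """Stand-in for the documented ARFF parsing error."""


class CsvParseError(ArffCsvError):
    """Stand-in for the documented CSV parsing error."""


def describe_failure(action):
    """Run a zero-argument callable and summarize how it failed, if it did."""
    try:
        action()
    except ArffParseError as exc:
        # Most specific handler first.
        return f"bad ARFF input: {exc}"
    except ArffCsvError as exc:
        # Base class last: catches every other error in the hierarchy.
        return f"conversion failed: {exc}"
    return "ok"
```

Ordering matters: if the ArffCsvError handler came first, it would also catch ArffParseError and the more specific branch would never run.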

Development

Setup

# Clone the repository
git clone https://github.com/rmontanana/arff-csv-converter.git
cd arff-csv-converter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=arff_csv --cov-report=html

# Run specific test file
pytest tests/test_parser.py

# Run with verbose output
pytest -v

Code Quality

# Run linter
ruff check src tests

# Run formatter
ruff format src tests

# Run type checker
mypy src

Building

# Install build tools
pip install build twine

# Build the package
python -m build

# Check the package
twine check dist/*

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Related Projects

  • Weka - The original machine learning toolkit that uses ARFF format
  • liac-arff - Another Python library for ARFF files
  • scipy.io.arff - SciPy's ARFF reader

Changelog

See CHANGELOG.md for a list of changes.
