ARFF-CSV Converter

Supported Python versions: 3.11+

A Python library for converting between CSV and ARFF (Weka) file formats. ARFF (Attribute-Relation File Format) is the standard file format used by the Weka machine learning toolkit.

Features

  • Bidirectional conversion: Convert CSV to ARFF and ARFF to CSV
  • Automatic type detection: Infers numeric, nominal, string, and date attribute types
  • CSV analysis mode: Analyze CSV files and get suggestions for column types
  • Missing value handling: Represents missing values with the standard ARFF marker (?)
  • Sparse format support: Read and write sparse ARFF format
  • Command-line interface: Easy-to-use CLI for quick conversions
  • Pandas integration: Works directly with pandas DataFrames
  • Type hints: Full type annotation support for better IDE integration
  • Well tested: Comprehensive test suite with high coverage

Installation

pip install arff-csv-converter

For development dependencies:

pip install "arff-csv-converter[dev]"

Quick Start

Python API

Convert CSV to ARFF

from arff_csv import csv_to_arff

# Basic conversion
csv_to_arff("data.csv", "data.arff")

# With options
csv_to_arff(
    "data.csv",
    "data.arff",
    relation_name="my_dataset",
    nominal_columns=["class", "category"],
    comments=["Generated by my application"]
)

Convert ARFF to CSV

from arff_csv import arff_to_csv

# Basic conversion
df = arff_to_csv("data.arff", "data.csv")

# Access the DataFrame directly
print(df.head())

Using the Converter Class

from arff_csv import ArffConverter

converter = ArffConverter()

# CSV to ARFF
arff_data = converter.csv_to_arff("input.csv", "output.arff")
print(f"Relation: {arff_data.relation_name}")
print(f"Attributes: {len(arff_data.attributes)}")
print(f"Instances: {len(arff_data.data)}")

# ARFF to CSV
df = converter.arff_to_csv("input.arff", "output.csv")

# Work with DataFrames directly
df = converter.arff_to_dataframe("data.arff")
converter.dataframe_to_arff(df, "output.arff", relation_name="my_data")

# Get ARFF as string
arff_string = converter.dataframe_to_arff_string(df, relation_name="my_data")

Working with ArffData

from arff_csv import ArffParser

parser = ArffParser()
arff_data = parser.parse_file("data.arff")

# Access metadata
print(f"Relation: {arff_data.relation_name}")
print(f"Comments: {arff_data.comments}")

# Access attributes
for attr in arff_data.attributes:
    print(f"  {attr.name}: {attr.type.name}")
    if attr.nominal_values:
        print(f"    Values: {attr.nominal_values}")

# Access data as DataFrame
df = arff_data.data
print(df.describe())

# Get attribute lists
numeric_attrs = arff_data.get_numeric_attributes()
nominal_attrs = arff_data.get_nominal_attributes()

Command Line Interface

The package installs a command-line tool, arff-csv, with the subcommands shown below.

Analyze CSV (Recommended First Step)

Before converting, you can analyze your CSV file to get suggestions for column types:

arff-csv csv2arff iris.csv --analyze

This will output:

======================================================================
CSV ANALYSIS: iris.csv
======================================================================

Rows: 150
Columns: 6

DATA PREVIEW (first 5 rows):
----------------------------------------------------------------------
   Unnamed_0  sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  class
0          0                5.1               3.5                1.4               0.2      0
1          1                4.9               3.0                1.4               0.2      0
2          2                4.7               3.2                1.3               0.2      0
3          3                4.6               3.1                1.5               0.2      0
4          4                5.0               3.6                1.4               0.2      0

COLUMN ANALYSIS:
----------------------------------------------------------------------
Column                    Type       Unique   Nulls    Reason
----------------------------------------------------------------------
Unnamed_0                 INTEGER    150      0        Integer values
sepal length (cm)         NUMERIC    35       0        Floating point values
sepal width (cm)          NUMERIC    23       0        Floating point values
petal length (cm)         NUMERIC    43       0        Floating point values
petal width (cm)          NUMERIC    22       0        Floating point values
class                     NOMINAL    3        0        Common target/class column name

COLUMNS SUGGESTED FOR EXCLUSION:
----------------------------------------------------------------------
  - Unnamed_0: Unique value for every row

SUGGESTED COMMAND:
----------------------------------------------------------------------

arff-csv csv2arff iris.csv iris.arff --relation "iris" --nominal \
    class --exclude Unnamed_0

SUMMARY:
----------------------------------------------------------------------
  Numeric columns:  5
  Nominal columns:  1
  String columns:   0
  Suggested excludes: 1

  Nominal: class
  Exclude: Unnamed_0

Analysis options:

Option                   Description                             Default
-a, --analyze            Enable analysis mode (no conversion)    -
--preview-rows N         Number of rows to preview               5
--nominal-threshold N    Max unique values to consider nominal   10

Detection criteria:

  • Nominal columns: Binary values (0/1, yes/no, true/false), columns named "class"/"target"/"label", integer columns with few unique values
  • String columns: Text with many unique values, long text (avg > 50 chars)
  • Numeric columns: Floating point values, integers with many unique values
  • Exclusion suggestions: Columns that hold a single constant value, and identifier-like columns with a different value in every row
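
The criteria above can be sketched as a small standard-library function. This is an illustration only, not the library's implementation: the names `suggest_type`, `BINARY_SETS`, and `TARGET_NAMES` are hypothetical, the order in which the rules are checked is an assumption, and treating short text with few unique values as nominal is also an assumption. Date detection and exclusion suggestions are left out.

```python
from statistics import mean

# Hypothetical constants mirroring the criteria listed above.
BINARY_SETS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]
TARGET_NAMES = {"class", "target", "label"}


def _parses_as(converter, text):
    """Return True if converter(text) succeeds, e.g. int("3") or float("3.5")."""
    try:
        converter(text)
        return True
    except ValueError:
        return False


def suggest_type(name, values, nominal_threshold=10):
    """Suggest NOMINAL, STRING, or NUMERIC for one column of raw CSV strings."""
    # Ignore empty cells and the ARFF missing marker when inspecting values.
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    if not present:
        return "STRING"  # assumption: nothing to infer from an empty column
    unique = {v.lower() for v in present}

    # Nominal: well-known target names or binary value sets.
    if name.lower() in TARGET_NAMES or any(unique <= s for s in BINARY_SETS):
        return "NOMINAL"
    # Integers: nominal when few unique values, numeric otherwise.
    if all(_parses_as(int, v) for v in present):
        return "NOMINAL" if len(unique) <= nominal_threshold else "NUMERIC"
    # Floating point values are numeric.
    if all(_parses_as(float, v) for v in present):
        return "NUMERIC"
    # Text: string when there are many unique values or the text is long.
    if len(unique) > nominal_threshold or mean(len(v) for v in present) > 50:
        return "STRING"
    return "NOMINAL"  # assumption: short text with few unique values


print(suggest_type("class", ["0", "1", "2"]))
print(suggest_type("sepal length (cm)", ["5.1", "4.9", "4.7"]))
```

The `nominal_threshold` parameter plays the same role as `--nominal-threshold` in the options table: raising it makes more integer columns come out as nominal.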

Convert CSV to ARFF

# Basic conversion
arff-csv csv2arff input.csv output.arff

# With options
arff-csv csv2arff input.csv output.arff \
    --relation "my_dataset" \
    --nominal class category \
    --string description \
    --exclude id \
    --comment "Generated on 2024-01-15" \
    --verbose

Conversion options:

Option                   Description                          Default
-r, --relation NAME      Relation name                        Input filename
-n, --nominal COL...     Columns to treat as nominal          -
-s, --string COL...      Columns to treat as string           -
--exclude COL...         Columns to exclude from conversion   -
-m, --missing VALUE      Missing value representation         ?
-c, --comment TEXT...    Comments to add                      -
--delimiter CHAR         CSV delimiter                        ,
--encoding ENC           File encoding                        utf-8
-v, --verbose            Verbose output                       -
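
When the CLI is driven from a script, it can help to assemble the argument list in Python rather than by string concatenation. The sketch below uses only flags from the table above. The helper name `build_csv2arff_command` is hypothetical and not part of the package.

```python
import shlex


def build_csv2arff_command(csv_path, arff_path, relation=None, nominal=(),
                           string=(), exclude=(), delimiter=None, verbose=False):
    """Assemble an argv list for `arff-csv csv2arff` from the documented flags."""
    argv = ["arff-csv", "csv2arff", csv_path, arff_path]
    if relation is not None:
        argv += ["--relation", relation]
    if nominal:
        argv += ["--nominal", *nominal]
    if string:
        argv += ["--string", *string]
    if exclude:
        argv += ["--exclude", *exclude]
    if delimiter is not None:
        argv += ["--delimiter", delimiter]
    if verbose:
        argv.append("--verbose")
    return argv


argv = build_csv2arff_command(
    "input.csv",
    "output.arff",
    relation="my dataset",
    nominal=["class", "category"],
    exclude=["id"],
)
# shlex.join quotes arguments that contain spaces, so the line is copy-paste safe.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv, check=True)` runs the conversion without going through a shell, so column names with spaces or semicolons need no extra escaping.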

Convert ARFF to CSV

# Basic conversion
arff-csv arff2csv input.arff output.csv

# With options
arff-csv arff2csv input.arff output.csv \
    --delimiter ";" \
    --include-index \
    --verbose

Display ARFF file information

arff-csv info data.arff

Output:

ARFF File: data.arff
Relation: iris
Instances: 150
Attributes: 5

Attribute Information:
------------------------------------------------------------
  sepallength: NUMERIC
  sepalwidth: NUMERIC
  petallength: NUMERIC
  petalwidth: NUMERIC
  class: NOMINAL {Iris-setosa, Iris-versicolor, Iris-virginica}

Data Preview (first 5 rows):
------------------------------------------------------------
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
...

ARFF Format Reference

ARFF (Attribute-Relation File Format) is a text format that describes a dataset as a relation with named attributes. The format consists of:

  1. Header section: Relation name and attribute definitions
  2. Data section: The actual data instances

Example ARFF File

% This is a comment
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
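
To make the header/data split concrete, here is a minimal standard-library sketch that reads the example above. It is not how ArffParser works internally; the function name `split_arff` is hypothetical, and the sketch ignores quoted names, quoted values, and sparse rows.

```python
EXAMPLE = """\
% This is a comment
@RELATION iris

@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""


def split_arff(text):
    """Split ARFF text into (relation, attribute declarations, data rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([cell.strip() for cell in line.split(",")])
            continue
        parts = line.split(None, 1)
        keyword = parts[0].lower()  # header keywords are case-insensitive
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "@relation":
            relation = rest
        elif keyword == "@attribute":
            name, _, type_spec = rest.partition(" ")
            attributes.append((name, type_spec.strip()))
        elif keyword == "@data":
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


relation, attributes, rows = split_arff(EXAMPLE)
print(relation, len(attributes), len(rows))
```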

Supported Attribute Types

Type      Description              Example
NUMERIC   Floating-point numbers   @ATTRIBUTE value NUMERIC
INTEGER   Integer numbers          @ATTRIBUTE count INTEGER
REAL      Alias for NUMERIC        @ATTRIBUTE value REAL
STRING    Text strings             @ATTRIBUTE name STRING
NOMINAL   Categorical values       @ATTRIBUTE class {a, b, c}
DATE      Date/time values         @ATTRIBUTE date DATE 'yyyy-MM-dd'
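
As an illustration of the declarations in the table, the sketch below formats one @ATTRIBUTE line per type. The function `attribute_declaration` is hypothetical and unrelated to the package's ArffWriter. It quotes names that contain whitespace but does not escape quote characters inside names or values.

```python
SIMPLE_TYPES = {"NUMERIC", "INTEGER", "REAL", "STRING"}


def attribute_declaration(name, arff_type, nominal_values=None, date_format=None):
    """Format a single @ATTRIBUTE line for the given type."""
    kind = arff_type.upper()
    # ARFF requires names containing spaces to be quoted.
    if any(ch.isspace() for ch in name):
        name = "'" + name + "'"
    if kind == "NOMINAL":
        if not nominal_values:
            raise ValueError("nominal attributes need at least one value")
        return "@ATTRIBUTE " + name + " {" + ", ".join(nominal_values) + "}"
    if kind == "DATE":
        suffix = " '" + date_format + "'" if date_format else ""
        return "@ATTRIBUTE " + name + " DATE" + suffix
    if kind in SIMPLE_TYPES:
        return "@ATTRIBUTE " + name + " " + kind
    raise ValueError("unsupported ARFF type: " + arff_type)


print(attribute_declaration("value", "numeric"))
print(attribute_declaration("class", "nominal", nominal_values=["a", "b", "c"]))
print(attribute_declaration("date", "date", date_format="yyyy-MM-dd"))
```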

Missing Values

Missing values are represented by ? in ARFF format:

@DATA
5.1,3.5,?,0.2,Iris-setosa
?,3.0,1.4,0.2,?
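
The round trip between Python values and the ? marker can be sketched as follows. The helper names (`is_missing`, `to_arff_row`, `from_arff_row`) are hypothetical, and the sketch assumes unquoted values that contain no commas.

```python
import math

MISSING = "?"


def is_missing(value):
    """Treat None, empty strings, and float NaN as missing."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def to_arff_row(values):
    """Render one instance as an ARFF data line, writing ? for missing cells."""
    return ",".join(MISSING if is_missing(v) else str(v) for v in values)


def from_arff_row(line):
    """Parse one ARFF data line, mapping ? back to None."""
    cells = [cell.strip() for cell in line.split(",")]
    return [None if cell == MISSING else cell for cell in cells]


print(to_arff_row([5.1, 3.5, None, 0.2, "Iris-setosa"]))
print(from_arff_row("?,3.0,1.4,0.2,?"))
```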

API Reference

Main Functions

  • csv_to_arff(csv_path, arff_path, ...) - Convert CSV file to ARFF
  • arff_to_csv(arff_path, csv_path, ...) - Convert ARFF file to CSV

Classes

  • ArffConverter - Main converter class with full functionality
  • ArffParser - Parser for reading ARFF files
  • ArffWriter - Writer for creating ARFF files
  • ArffData - Container for parsed ARFF data
  • Attribute - ARFF attribute definition

Exceptions

  • ArffCsvError - Base exception for all errors
  • ArffParseError - Error parsing ARFF files
  • ArffWriteError - Error writing ARFF files
  • CsvParseError - Error parsing CSV files
  • InvalidAttributeError - Invalid attribute definition
  • MissingDataError - Required data missing
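
Because ArffCsvError is described as the base exception for all errors, a caller can catch it once instead of listing every subclass. The sketch below assumes that ArffCsvError is importable from the package root alongside csv_to_arff, which this page does not state explicitly. The wrapper name `convert_or_report` is hypothetical.

```python
import sys


def convert_or_report(csv_path, arff_path):
    """Return True on success, False if conversion fails or arff_csv is absent."""
    try:
        # Assumption: the base exception is exported from the package root.
        from arff_csv import ArffCsvError, csv_to_arff
    except ImportError:
        print("arff_csv is not installed", file=sys.stderr)
        return False
    try:
        csv_to_arff(csv_path, arff_path)
    except (ArffCsvError, OSError) as exc:
        # OSError covers missing files or unwritable paths that the library
        # may not wrap in its own exception types.
        print(f"conversion failed: {type(exc).__name__}: {exc}", file=sys.stderr)
        return False
    return True


ok = convert_or_report("no_such_input_file.csv", "no_such_dir/out.arff")
```

Catching a specific subclass such as CsvParseError before the base class allows a more targeted message for malformed input while still handling every other library error in one place.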

Development

Setup

# Clone the repository
git clone https://github.com/rmontanana/arff-csv-converter.git
cd arff-csv-converter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows

# Install in development mode
pip install -e ".[dev]"

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=arff_csv --cov-report=html

# Run specific test file
pytest tests/test_parser.py

# Run with verbose output
pytest -v

Code Quality

# Run linter
ruff check src tests

# Run formatter
ruff format src tests

# Run type checker
mypy src

Building

# Install build tools
pip install build twine

# Build the package
python -m build

# Check the package
twine check dist/*

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Related Projects

  • Weka - The original machine learning toolkit that uses ARFF format
  • liac-arff - Another Python library for ARFF files
  • scipy.io.arff - SciPy's ARFF reader

Changelog

See CHANGELOG.md for a list of changes.
