ARFF-CSV Converter
A Python library for converting between CSV and ARFF (Weka) file formats. ARFF (Attribute-Relation File Format) is the standard file format used by the Weka machine learning toolkit.
Features
- Bidirectional conversion: Convert CSV to ARFF and ARFF to CSV
- Automatic type detection: Automatically infers numeric, nominal, string, and date types
- CSV analysis mode: Analyze CSV files and get suggestions for column types
- Missing value handling: Reads and writes missing values using the standard ARFF marker (?)
- Sparse format support: Read and write sparse ARFF format
- Command-line interface: Easy-to-use CLI for quick conversions
- Pandas integration: Seamlessly works with pandas DataFrames
- Type hints: Full type annotation support for better IDE integration
- Well tested: Comprehensive test suite with high coverage
Installation
pip install arff-csv-converter
For development dependencies:
pip install arff-csv-converter[dev]
Quick Start
Python API
Convert CSV to ARFF
from arff_csv import csv_to_arff
# Basic conversion
csv_to_arff("data.csv", "data.arff")
# With options
csv_to_arff(
    "data.csv",
    "data.arff",
    relation_name="my_dataset",
    nominal_columns=["class", "category"],
    comments=["Generated by my application"],
)
Convert ARFF to CSV
from arff_csv import arff_to_csv
# Basic conversion
df = arff_to_csv("data.arff", "data.csv")
# Access the DataFrame directly
print(df.head())
Using the Converter Class
from arff_csv import ArffConverter
converter = ArffConverter()
# CSV to ARFF
arff_data = converter.csv_to_arff("input.csv", "output.arff")
print(f"Relation: {arff_data.relation_name}")
print(f"Attributes: {len(arff_data.attributes)}")
print(f"Instances: {len(arff_data.data)}")
# ARFF to CSV
df = converter.arff_to_csv("input.arff", "output.csv")
# Work with DataFrames directly
df = converter.arff_to_dataframe("data.arff")
converter.dataframe_to_arff(df, "output.arff", relation_name="my_data")
# Get ARFF as string
arff_string = converter.dataframe_to_arff_string(df, relation_name="my_data")
Working with ArffData
from arff_csv import ArffParser
parser = ArffParser()
arff_data = parser.parse_file("data.arff")
# Access metadata
print(f"Relation: {arff_data.relation_name}")
print(f"Comments: {arff_data.comments}")
# Access attributes
for attr in arff_data.attributes:
    print(f" {attr.name}: {attr.type.name}")
    if attr.nominal_values:
        print(f" Values: {attr.nominal_values}")
# Access data as DataFrame
df = arff_data.data
print(df.describe())
# Get attribute lists
numeric_attrs = arff_data.get_numeric_attributes()
nominal_attrs = arff_data.get_nominal_attributes()
Command Line Interface
The package installs a command-line tool named arff-csv.
Analyze CSV (Recommended First Step)
Before converting, you can analyze your CSV file to get suggestions for column types:
arff-csv csv2arff iris.csv --analyze
This will output:
======================================================================
CSV ANALYSIS: iris.csv
======================================================================
Rows: 150
Columns: 6
DATA PREVIEW (first 5 rows):
----------------------------------------------------------------------
Unnamed_0 sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) class
0 0 5.1 3.5 1.4 0.2 0
1 1 4.9 3.0 1.4 0.2 0
2 2 4.7 3.2 1.3 0.2 0
3 3 4.6 3.1 1.5 0.2 0
4 4 5.0 3.6 1.4 0.2 0
COLUMN ANALYSIS:
----------------------------------------------------------------------
Column Type Unique Nulls Reason
----------------------------------------------------------------------
Unnamed_0 INTEGER 150 0 Integer values
sepal length (cm) NUMERIC 35 0 Floating point values
sepal width (cm) NUMERIC 23 0 Floating point values
petal length (cm) NUMERIC 43 0 Floating point values
petal width (cm) NUMERIC 22 0 Floating point values
class NOMINAL 3 0 Common target/class column name
COLUMNS SUGGESTED FOR EXCLUSION:
----------------------------------------------------------------------
- Unnamed_0: Unique value for every row
SUGGESTED COMMAND:
----------------------------------------------------------------------
arff-csv csv2arff iris.csv iris.arff --relation "iris" --nominal \
class --exclude Unnamed_0
SUMMARY:
----------------------------------------------------------------------
Numeric columns: 5
Nominal columns: 1
String columns: 0
Suggested excludes: 1
Nominal: class
Exclude: Unnamed_0
Analysis options:
| Option | Description | Default |
|---|---|---|
| -a, --analyze | Enable analysis mode (no conversion) | - |
| --preview-rows N | Number of rows to preview | 5 |
| --nominal-threshold N | Max unique values to consider nominal | 10 |
Detection criteria:
- Nominal columns: Binary values (0/1, yes/no, true/false), columns named "class"/"target"/"label", integer columns with few unique values
- String columns: Text with many unique values, long text (avg > 50 chars)
- Numeric columns: Floating point values, integers with many unique values
- Exclusion suggestions: Columns with a single unique value or an identifier-like unique value for every row
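The detection criteria can be approximated in a few lines of standard-library Python. This is a minimal sketch for illustration, not the library's actual implementation: the function name `suggest_type`, the order of the checks, and the handling of empty columns are assumptions, while `nominal_threshold` mirrors the `--nominal-threshold` default of 10.

```python
from typing import Optional, Sequence

# Value sets treated as binary (compared case-insensitively).
BINARY_SETS = [{"0", "1"}, {"yes", "no"}, {"true", "false"}]
# Column names that usually hold the prediction target.
TARGET_NAMES = {"class", "target", "label"}


def _is_int(text: str) -> bool:
    try:
        int(text)
        return True
    except ValueError:
        return False


def _is_float(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def suggest_type(
    name: str,
    values: Sequence[Optional[str]],
    nominal_threshold: int = 10,
) -> str:
    """Suggest an ARFF type for one CSV column given its raw string values."""
    # Ignore missing cells: None, empty strings and the ARFF marker "?".
    present = [v.strip() for v in values if v is not None and v.strip() not in ("", "?")]
    if not present:
        return "STRING"  # assumption: nothing to infer from an empty column
    unique = {v.lower() for v in present}

    if name.strip().lower() in TARGET_NAMES:
        return "NOMINAL"
    if any(unique <= allowed for allowed in BINARY_SETS):
        return "NOMINAL"
    if all(_is_int(v) for v in present):
        # Few distinct integers look like category codes.
        return "NOMINAL" if len(unique) <= nominal_threshold else "INTEGER"
    if all(_is_float(v) for v in present):
        return "NUMERIC"

    # Remaining columns are text: long or highly varied text is a string.
    avg_len = sum(len(v) for v in present) / len(present)
    if avg_len > 50 or len(unique) > nominal_threshold:
        return "STRING"
    return "NOMINAL"
```

Calling `suggest_type("class", ["0", "1", "2"])` suggests a nominal column because of its name, while a column of 150 distinct integers is suggested as INTEGER and would be a natural candidate for `--exclude`.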
Convert CSV to ARFF
# Basic conversion
arff-csv csv2arff input.csv output.arff
# With options
arff-csv csv2arff input.csv output.arff \
--relation "my_dataset" \
--nominal class category \
--string description \
--exclude id \
--comment "Generated on 2024-01-15" \
--verbose
Conversion options:
| Option | Description | Default |
|---|---|---|
| -r, --relation NAME | Relation name | Input filename |
| -n, --nominal COL... | Columns to treat as nominal | - |
| -s, --string COL... | Columns to treat as string | - |
| --exclude COL... | Columns to exclude from conversion | - |
| -m, --missing VALUE | Missing value representation | ? |
| -c, --comment TEXT... | Comments to add | - |
| --delimiter CHAR | CSV delimiter | , |
| --encoding ENC | File encoding | utf-8 |
| -v, --verbose | Verbose output | - |
Convert ARFF to CSV
# Basic conversion
arff-csv arff2csv input.arff output.csv
# With options
arff-csv arff2csv input.arff output.csv \
--delimiter ";" \
--include-index \
--verbose
Display ARFF file information
arff-csv info data.arff
Output:
ARFF File: data.arff
Relation: iris
Instances: 150
Attributes: 5
Attribute Information:
------------------------------------------------------------
sepallength: NUMERIC
sepalwidth: NUMERIC
petallength: NUMERIC
petalwidth: NUMERIC
class: NOMINAL {Iris-setosa, Iris-versicolor, Iris-virginica}
Data Preview (first 5 rows):
------------------------------------------------------------
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
...
ARFF Format Reference
ARFF (Attribute-Relation File Format) is a text format that describes a dataset as a relation with named attributes. The format consists of:
- Header section: Relation name and attribute definitions
- Data section: The actual data instances
Example ARFF File
% This is a comment
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
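To make the header/data split concrete, here is a deliberately simplified reader for files like the example above, using only the standard library. It is a sketch under stated assumptions, not how the package's ArffParser works: the function name `parse_simple_arff` is hypothetical, and quoted attribute names, sparse rows and type conversion are not handled.

```python
import csv
import io

ARFF_TEXT = """\
% This is a comment
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""


def parse_simple_arff(text):
    """Return (relation, attributes, rows) for a simple, dense ARFF document."""
    relation = None
    attributes = []  # list of (name, type_spec) tuples, in declaration order
    data_lines = []
    in_data = False

    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and % comments carry no data.
        if not line or line.startswith("%"):
            continue
        if in_data:
            data_lines.append(line)
            continue
        # Header keywords are case-insensitive.
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, type_spec = line.split(None, 2)
            attributes.append((name, type_spec))
        elif keyword == "@data":
            in_data = True

    # Data rows are comma-separated; values are kept as strings.
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    return relation, attributes, rows


relation, attributes, rows = parse_simple_arff(ARFF_TEXT)
print(relation, len(attributes), len(rows))
```

Everything before @DATA describes the columns, and every row after it must supply one value per declared attribute, in the same order.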
Supported Attribute Types
| Type | Description | Example |
|---|---|---|
| NUMERIC | Floating-point numbers | @ATTRIBUTE value NUMERIC |
| INTEGER | Integer numbers | @ATTRIBUTE count INTEGER |
| REAL | Alias for NUMERIC | @ATTRIBUTE value REAL |
| STRING | Text strings | @ATTRIBUTE name STRING |
| NOMINAL | Categorical values | @ATTRIBUTE class {a, b, c} |
| DATE | Date/time values | @ATTRIBUTE date DATE 'yyyy-MM-dd' |
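The declarations in the table follow a regular pattern, which the following sketch turns into a small helper. The function `attribute_line` and its parameters are hypothetical and not part of this package's API. It also quotes names that contain whitespace, since ARFF requires such names to be quoted; this matters for CSV headers such as "sepal length (cm)".

```python
def attribute_line(name, arff_type, nominal_values=None, date_format=None):
    """Build one @ATTRIBUTE declaration for the given name and type."""
    # Names containing whitespace must be quoted in ARFF.
    if any(ch.isspace() for ch in name):
        name = "'" + name.replace("'", "\\'") + "'"

    kind = arff_type.upper()
    if kind == "NOMINAL":
        if not nominal_values:
            raise ValueError("nominal attributes need at least one value")
        spec = "{" + ", ".join(nominal_values) + "}"
    elif kind == "DATE":
        # The date format is optional in a DATE declaration.
        spec = "DATE" if date_format is None else f"DATE '{date_format}'"
    elif kind in {"NUMERIC", "INTEGER", "REAL", "STRING"}:
        spec = kind
    else:
        raise ValueError(f"unsupported ARFF type: {arff_type}")
    return f"@ATTRIBUTE {name} {spec}"


print(attribute_line("count", "INTEGER"))
print(attribute_line("class", "NOMINAL", nominal_values=["a", "b", "c"]))
print(attribute_line("date", "DATE", date_format="yyyy-MM-dd"))
```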
Missing Values
Missing values are represented by ? in ARFF format:
@DATA
5.1,3.5,?,0.2,Iris-setosa
?,3.0,1.4,0.2,?
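As an illustration of the round trip, the sketch below maps ? to None when reading a data row and back to ? when writing one. The helper names `read_row` and `write_row` are assumptions made for this example and are not functions exported by the package; quoted values that contain a literal question mark are not treated specially.

```python
import csv

MISSING = "?"  # ARFF marker for a missing value


def read_row(line):
    """Split one ARFF data line and replace the missing marker with None."""
    fields = next(csv.reader([line]))
    return [None if f.strip() == MISSING else f.strip() for f in fields]


def write_row(values):
    """Join values into one ARFF data line, writing None as the missing marker."""
    return ",".join(MISSING if v is None else str(v) for v in values)


print(read_row("5.1,3.5,?,0.2,Iris-setosa"))
print(write_row([None, 3.0, 1.4, 0.2, None]))
```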
API Reference
Main Functions
- csv_to_arff(csv_path, arff_path, ...): Convert a CSV file to ARFF
- arff_to_csv(arff_path, csv_path, ...): Convert an ARFF file to CSV
Classes
- ArffConverter: Main converter class with full functionality
- ArffParser: Parser for reading ARFF files
- ArffWriter: Writer for creating ARFF files
- ArffData: Container for parsed ARFF data
- Attribute: ARFF attribute definition
Exceptions
- ArffCsvError: Base exception for all errors
- ArffParseError: Error parsing ARFF files
- ArffWriteError: Error writing ARFF files
- CsvParseError: Error parsing CSV files
- InvalidAttributeError: Invalid attribute definition
- MissingDataError: Required data missing
Development
Setup
# Clone the repository
git clone https://github.com/rmontanana/arff-csv-converter.git
cd arff-csv-converter
# Create virtual environment
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
# Install in development mode
pip install -e ".[dev]"
Running Tests
# Run all tests
pytest
# Run with coverage
pytest --cov=arff_csv --cov-report=html
# Run specific test file
pytest tests/test_parser.py
# Run with verbose output
pytest -v
Code Quality
# Run linter
ruff check src tests
# Run formatter
ruff format src tests
# Run type checker
mypy src
Building
# Install build tools
pip install build twine
# Build the package
python -m build
# Check the package
twine check dist/*
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
License
This project is licensed under the MIT License - see the LICENSE file for details.
Related Projects
- Weka - The original machine learning toolkit that uses ARFF format
- liac-arff - Another Python library for ARFF files
- scipy.io.arff - SciPy's ARFF reader
Changelog
See CHANGELOG.md for a list of changes.