yamconv converts the file formats of machine learning datasets

yamconv

yamconv converts a machine learning dataset from one format to another.

Installation

yamconv is published on PyPI. You can install yamconv using pip as follows:

pip install yamconv

Alternatively, you can install it from the source code by running pip in the project directory where setup.py is located:

pip install .

Usage

yamconv -c converter_name -i input_file -o output_file -v
  • -c: converter name
  • -i: input file path
  • -o: output file path
  • -v: verbose

Supported converters

The following are the supported converters:

  • fasttext2sqlite: fastText text file to SQLite database file
  • sqlite2fasttext: SQLite database file to fastText text file

Supported dataset formats

Multi-label text classification

fastText text file

The fastText format is a text file that contains a series of lines. Each line represents a text classified by multiple labels. A line starts with multiple labels, followed by the text content. Each label is marked with the __label__ prefix and the labels are separated by a space. The following is a fragment of an example fastText dataset file:

__label__food __label__region Dimsum is popular in Hong Kong restaurants.
__label__region __label__plant __label__business The Netherlands is center of the production for the European floral market.
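The line layout described above can be read back with a few lines of Python. This is only a minimal sketch and not yamconv's own parser: the function name parse_fasttext_line is hypothetical, and it assumes that every leading token carrying the __label__ prefix is a label and that everything after them is the text.

```python
LABEL_PREFIX = "__label__"


def parse_fasttext_line(line):
    """Split one fastText line into (labels, text).

    Hypothetical helper for illustration; not part of yamconv.
    """
    rest = line.strip()
    labels = []
    # Consume leading tokens for as long as they carry the label prefix.
    while rest.startswith(LABEL_PREFIX):
        token, _, rest = rest.partition(" ")
        labels.append(token[len(LABEL_PREFIX):])
        rest = rest.lstrip()
    # Whatever remains after the labels is the text content.
    return labels, rest


labels, text = parse_fasttext_line(
    "__label__food __label__region Dimsum is popular in Hong Kong restaurants."
)
print(labels)  # ['food', 'region']
print(text)    # Dimsum is popular in Hong Kong restaurants.
```

A line without any label yields an empty label list and the whole line as text.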

SQLite database

A SQLite database is used to store the classifications of texts. The database schema is as follows:

CREATE TABLE IF NOT EXISTS texts (
    id TEXT NOT NULL PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS labels (
    label TEXT NOT NULL,
    text_id TEXT NOT NULL,
    FOREIGN KEY (text_id) REFERENCES texts(id)
);
CREATE INDEX IF NOT EXISTS label_index ON labels (label);
CREATE INDEX IF NOT EXISTS text_id_index ON labels (text_id);

The texts table stores the text content in the text field, and each row is uniquely identified by the id field. The labels table stores one label per row in the label field. Each row also has a text_id foreign key that references the row in the texts table that is classified with that label. In other words, each row in texts is associated with zero or more rows in labels.
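To illustrate how the two tables relate, the following sketch creates the schema in an in-memory database with Python's standard sqlite3 module, inserts one labeled and one unlabeled text, and reads the labels back per text. The id values ("t1", "t2"), the second sample sentence, and the helper name labels_by_text are assumptions made for the example; how yamconv itself generates ids is not specified here.

```python
import sqlite3

# Schema as documented above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS texts (
    id TEXT NOT NULL PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS labels (
    label TEXT NOT NULL,
    text_id TEXT NOT NULL,
    FOREIGN KEY (text_id) REFERENCES texts(id)
);
CREATE INDEX IF NOT EXISTS label_index ON labels (label);
CREATE INDEX IF NOT EXISTS text_id_index ON labels (text_id);
"""


def labels_by_text(conn):
    """Return {text_id: (text, [labels])}. Hypothetical helper, not part of yamconv."""
    # LEFT JOIN keeps texts that have zero rows in labels.
    rows = conn.execute(
        "SELECT t.id, t.text, l.label FROM texts AS t "
        "LEFT JOIN labels AS l ON l.text_id = t.id "
        "ORDER BY t.id, l.label"
    )
    result = {}
    for text_id, text, label in rows:
        entry = result.setdefault(text_id, (text, []))
        if label is not None:
            entry[1].append(label)
    return result


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Example ids are made up for this sketch.
conn.execute(
    "INSERT INTO texts (id, text) VALUES (?, ?)",
    ("t1", "Dimsum is popular in Hong Kong restaurants."),
)
conn.executemany(
    "INSERT INTO labels (label, text_id) VALUES (?, ?)",
    [("food", "t1"), ("region", "t1")],
)
conn.execute(
    "INSERT INTO texts (id, text) VALUES (?, ?)",
    ("t2", "An unlabeled sentence."),
)
conn.commit()

for text_id, (text, labels) in labels_by_text(conn).items():
    print(text_id, labels, text)
```

The LEFT JOIN is what makes the "zero or more" relationship visible: a text without labels still appears, with an empty label list.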

Professional services

If you need any supporting resources or consultancy services from YAM AI Machinery, please find us at:

Download files

Source Distribution

yamconv-0.1.2.tar.gz (2.6 kB)
