yamconv converts the file formats of machine learning datasets

These details have not been verified by PyPI

Project links

Homepage

Project description

yamconv

yamconv converts a machine learning dataset from one file format to another.

Installation

yamconv is published on PyPI. You can install yamconv using pip as follows:

pip install yamconv

Alternatively, you can install it from the source code by running pip in the project directory where setup.py is located:

pip install .

Usage

yamconv.py -c converter -i input_file -o output_file -s settings -v

-c: converter name
-i: input file path
-o: output file path
-s: converter settings in JSON
-v: verbose, to display the processing progress and information
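
The options above can also be assembled programmatically. The following is a minimal sketch, not part of yamconv itself: `build_yamconv_command` is a hypothetical helper, and the file names `dataset.csv` and `dataset.txt` are placeholders. It only uses the flags listed above and serializes the settings with `json.dumps`, so that the value passed to -s is valid JSON.

```python
import json
import shlex


def build_yamconv_command(converter, input_path, output_path,
                          settings=None, verbose=False):
    """Assemble the argument list for a yamconv.py call (hypothetical helper)."""
    args = ["yamconv.py", "-c", converter, "-i", input_path, "-o", output_path]
    if settings:
        # -s expects the converter settings as a JSON string.
        args += ["-s", json.dumps(settings)]
    if verbose:
        args.append("-v")
    return args


# Placeholder file names; the converter name is taken from the list of
# supported converters.
command = build_yamconv_command(
    "mlt.csv2fasttext", "dataset.csv", "dataset.txt",
    settings={"cache_labels": True}, verbose=True,
)

# shlex.join (Python 3.8+) quotes the JSON string for a POSIX shell.
print(shlex.join(command))
```

The resulting list can be passed as-is to `subprocess.run`, which avoids shell quoting problems with the JSON string.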

Supported converters

The following are the supported converters:

mlt.sqlite2fasttext: SQLite database file to fastText text file
mlt.sqlite2csv: SQLite database file to CSV text file
mlt.fasttext2sqlite: fastText text file to SQLite database file
mlt.csv2sqlite: CSV text file to SQLite database file
mlt.csv2fasttext: CSV text file to fastText text file
mlt.sqlite2sqlite: SQLite database file to SQLite database file (with normalization)
mlt.fasttext2fasttext: fastText text file to fastText text file (with normalization)
mlt.csv2csv: CSV text file to CSV text file (with normalization)

Settings

Settings for converters are given in the -s option as a JSON string, e.g., '{"cache_labels": true}'.

Setting	Values	Description	Applicable converters
`normalize_labels`	`true` (default), `false`	When `normalize_labels` is `true`, all labels are normalized: all symbols are removed and all letters are converted to lowercase.	Any
`word_seq`	`true`, `false` (default)	When `word_seq` is `true`, each text is normalized into a sequence of lowercase words: all symbols are removed, all letters are converted to lowercase, and all Unicode word characters (e.g., Chinese characters) are delimited by a space.	Any
`cache_labels`	`true`, `false` (default)	When `cache_labels` is `true`, the normalized labels are cached in memory. It can be set to `false` if there is insufficient memory to cache a huge number of different labels in the dataset.	Any
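
To make the two normalization settings more concrete, here is a rough sketch of one possible reading of them. It is not yamconv's actual implementation: the function names `normalize_label` and `to_word_seq` are hypothetical, "symbols" is assumed to mean every character that is not a letter or digit, and only CJK ideographs in the range U+4E00 to U+9FFF are assumed to be split into single-character words.

```python
import re

# Assumption: "symbols" means every character that is not a letter or digit
# (whitespace and underscores included).
_SYMBOLS = re.compile(r"[\W_]+")

# Assumption: a CJK ideograph (U+4E00..U+9FFF) is a one-character word;
# any other run of letters and digits forms a single word.
_TOKENS = re.compile(r"[\u4e00-\u9fff]|[^\W_\u4e00-\u9fff]+")


def normalize_label(label):
    """Remove symbols and convert the label to lowercase."""
    return _SYMBOLS.sub("", label).lower()


def to_word_seq(text):
    """Turn a text into lowercase words separated by single spaces."""
    return " ".join(_TOKENS.findall(text.lower()))


print(normalize_label("Hong-Kong!"))
print(to_word_seq("Dim Sum, 點心!"))
```

Under these assumptions the label becomes `hongkong` and the text becomes `dim sum 點 心`. The real converters may treat whitespace, digits, or other scripts differently.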

Supported dataset formats

Multi-label text classification

SQLite database

A SQLite database is used to store the classifications of texts. The database schema is as follows:

CREATE TABLE IF NOT EXISTS texts (
    id TEXT NOT NULL PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS labels (
    label TEXT NOT NULL,
    text_id TEXT NOT NULL,
    FOREIGN KEY (text_id) REFERENCES texts(id)
);
CREATE INDEX IF NOT EXISTS label_index ON labels (label);
CREATE INDEX IF NOT EXISTS text_id_index ON labels (text_id);

The texts table contains the text contents in the text field, and each row is uniquely identified by the id field. The labels table contains the labels in the label field. Each row has a text_id foreign key that links the label to the text in the texts table, where the text is classified with the label. In other words, each row in texts is associated with zero or more rows in labels.
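
The relationship between the two tables can be tried out with Python's built-in `sqlite3` module. The sketch below creates the schema in an in-memory database, stores one text with two labels and one text without labels, and then collects the labels of each text. The second text and its id are made-up sample data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS texts (
    id TEXT NOT NULL PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS labels (
    label TEXT NOT NULL,
    text_id TEXT NOT NULL,
    FOREIGN KEY (text_id) REFERENCES texts(id)
);
CREATE INDEX IF NOT EXISTS label_index ON labels (label);
CREATE INDEX IF NOT EXISTS text_id_index ON labels (text_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One text with two labels, and one text without any label (sample data).
conn.executemany(
    "INSERT INTO texts (id, text) VALUES (?, ?)",
    [
        ("10", "Many people love having dim sum in Hong Kong restaurants."),
        ("11", "A text that has not been classified yet."),
    ],
)
conn.executemany(
    "INSERT INTO labels (label, text_id) VALUES (?, ?)",
    [("food", "10"), ("region", "10")],
)

# LEFT JOIN keeps texts that have zero labels (their label is NULL).
rows = conn.execute(
    "SELECT t.id, l.label FROM texts AS t "
    "LEFT JOIN labels AS l ON l.text_id = t.id "
    "ORDER BY t.id, l.label"
).fetchall()

labels_by_text = {}
for text_id, label in rows:
    labels_by_text.setdefault(text_id, [])
    if label is not None:
        labels_by_text[text_id].append(label)

print(labels_by_text)
conn.close()
```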

fastText text file

The fastText format is a text file that contains a series of lines. Each line represents a text classified by multiple labels. A line starts with multiple labels, followed by the text content. Each label is marked with the __label__ prefix and the labels are separated by a space. The following is a fragment of an example fastText dataset file:

__label__food __label__region Many people love having dim sum in Hong Kong restaurants.
__label__region __label__plant __label__business The Netherlands is the major supplier to the European floral market.
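
A line in this format can be split back into its labels and its text content with a few lines of Python. This is a minimal sketch based only on the description above; `parse_fasttext_line` is a hypothetical helper and not a yamconv function.

```python
LABEL_PREFIX = "__label__"


def parse_fasttext_line(line):
    """Split one fastText line into (labels, text)."""
    tokens = line.strip().split(" ")
    labels = []
    index = 0
    # The labels are the leading tokens that carry the __label__ prefix.
    while index < len(tokens) and tokens[index].startswith(LABEL_PREFIX):
        labels.append(tokens[index][len(LABEL_PREFIX):])
        index += 1
    # Everything after the last leading label is the text content.
    return labels, " ".join(tokens[index:])


labels, text = parse_fasttext_line(
    "__label__food __label__region "
    "Many people love having dim sum in Hong Kong restaurants."
)
print(labels)
print(text)
```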

CSV text file

The dataset is in the form of a CSV (Comma-Separated Values) file. The first row is the header. Each subsequent row stores a single record. The CSV file can be in one of the following two formats.

Format 1

Suppose the header row looks like the following:

"id", "text", "region", "business", "food", "plant"

That is:

Cell 1: id
Cell 2: any arbitrary value
Cell n where n >= 3: the name of label n, e.g., region, business, food, plant.

Each record row looks like:

"10", "Many people love having dim sum in Hong Kong restaurants.", 1, 0, 1, 0

That is:

Cell 1: the id string
Cell 2: the text content
Cell n where n >= 3: 1 or 0 representing whether the text is classified with label n or not respectively.
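
As an illustration, a Format 1 file can be read with Python's standard `csv` module. The sketch below is not yamconv code, and `read_format1` is a hypothetical helper. It assumes that cells may be preceded by a space after the comma, as in the rows shown above, which is why `skipinitialspace=True` is set.

```python
import csv
import io

SAMPLE = '''"id", "text", "region", "business", "food", "plant"
"10", "Many people love having dim sum in Hong Kong restaurants.", 1, 0, 1, 0
'''


def read_format1(stream):
    """Yield (id, text, labels) for each record row of a Format 1 file."""
    # skipinitialspace lets the reader accept a space after each comma.
    reader = csv.reader(stream, skipinitialspace=True)
    header = next(reader)
    label_names = header[2:]  # cells 3..n hold the label names
    for row in reader:
        if not row:
            continue  # ignore blank lines
        text_id, text, flags = row[0], row[1], row[2:]
        labels = [name for name, flag in zip(label_names, flags)
                  if flag.strip() == "1"]
        yield text_id, text, labels


records = list(read_format1(io.StringIO(SAMPLE)))
print(records)
```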

Format 2

Suppose the header row looks like the following:

"text", "region", "business", "food", "plant"

That is:

Cell 1: any arbitrary value
Cell n where n >= 2: the name of label n, e.g., region, business, food, plant.

Each record row looks like:

"Many people love having dim sum in Hong Kong restaurants.", 1, 0, 1, 0

That is:

Cell 1: the text content
Cell n where n >= 2: 1 or 0 representing whether the text is classified with label n or not respectively.
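
To show how the formats relate to each other, the following sketch turns Format 2 rows into fastText lines. It is only an illustration of the idea and not the implementation of the mlt.csv2fasttext converter: `format2_to_fasttext` is a hypothetical helper, and no label or text normalization is applied.

```python
import csv
import io

SAMPLE = '''"text", "region", "business", "food", "plant"
"Many people love having dim sum in Hong Kong restaurants.", 1, 0, 1, 0
'''


def format2_to_fasttext(stream):
    """Yield one fastText line for each record row of a Format 2 file."""
    # skipinitialspace lets the reader accept a space after each comma.
    reader = csv.reader(stream, skipinitialspace=True)
    label_names = next(reader)[1:]  # cells 2..n hold the label names
    for row in reader:
        if not row:
            continue  # ignore blank lines
        text, flags = row[0], row[1:]
        prefixed = ["__label__" + name
                    for name, flag in zip(label_names, flags)
                    if flag.strip() == "1"]
        # Labels come first, followed by the text content.
        yield " ".join(prefixed + [text])


lines = list(format2_to_fasttext(io.StringIO(SAMPLE)))
print(lines[0])
```

The labels appear in header order, so this record yields `__label__region __label__food` followed by the text.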

Professional services

If you need any supporting resources or consultancy services from YAM AI Machinery, please find us at:

Project details

Release history

0.1.7: Sep 10, 2019 (this version)
0.1.5: Sep 6, 2019
0.1.4: Aug 29, 2019
0.1.3: Aug 28, 2019
0.1.2: Aug 27, 2019
0.1.1: Aug 27, 2019
0.1: Aug 27, 2019

Download files

Source Distribution

yamconv-0.1.7.tar.gz (9.4 kB)

Uploaded Sep 10, 2019 Source

File details

Details for the file yamconv-0.1.7.tar.gz.

File metadata

Download URL: yamconv-0.1.7.tar.gz
Upload date: Sep 10, 2019
Size: 9.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/1.13.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.35.0 CPython/3.6.8

File hashes

Hashes for yamconv-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`185395dd8c3f6ef787e7ce8d728604690e002e828a3deb3d8ad6ac1a9cf0fb41`
MD5	`76c7214eaab11de113457609d3137492`
BLAKE2b-256	`300cfc176c6661b1e533c031491021fdac8f3dcab93a2d6dda7b67ca5698e3c0`
