Skip to main content

A utility library for working with Table Schema in Python

Project description

# tableschema-py

[![Travis](https://travis-ci.org/frictionlessdata/tableschema-py.svg?branch=master)](https://travis-ci.org/frictionlessdata/tableschema-py)
[![Coveralls](http://img.shields.io/coveralls/frictionlessdata/tableschema-py.svg?branch=master)](https://coveralls.io/r/frictionlessdata/tableschema-py?branch=master)
[![PyPi](https://img.shields.io/pypi/v/tableschema.svg)](https://pypi.python.org/pypi/tableschema)
[![SemVer](https://img.shields.io/badge/versions-SemVer-brightgreen.svg)](http://semver.org/)
[![Gitter](https://img.shields.io/gitter/room/frictionlessdata/chat.svg)](https://gitter.im/frictionlessdata/chat)

A library for working with [Table Schema](http://specs.frictionlessdata.io/table-schema/) in Python.

## Features

- `Table` to work with data tables described by Table Schema
- `Schema` representing Table Schema
- `Field` representing Table Schema field
- `validate` to validate Table Schema
- `infer` to infer Table Schema from data
- built-in command-line interface to validate and infer schemas
- storage/plugins system to connect tables to different storage backends like SQL Database

## Important Notes

- There are BREAKING changes in `v1` (pre-release):
- package on PyPi has been renamed to `tableschema`
- following deprecated API has been removed the package:
- `tableschema.push/pull_resource` (use `tableschema.Table`)
- `tableschema.Validator` (use `tableschema.validate`)
- `tableschema.storage` (use `tableschema.Storage`)
- `tableschema.model` (use `tableschema.Schema`)
- `tableschema.types` (use `tableschema.Field`)
- rebased on Table Schema v1 null/types/constraints symantics
- `Field.cast/test_value` now accepts `constraints=bool/list` argument instead of `skip_constraints=bool` and `constraint=str`
- other changes could be introduced before final release
- documentation for previous release (`v0.10`) could be found [here](https://github.com/frictionlessdata/tableschema-py/tree/v0.10.0)
- There are deprecating changes in `v0.7`:
- renewed API has been introduced in non breaking manner
- documentation for deprecated API could be found [here](https://github.com/frictionlessdata/tableschema-py/tree/0.6.5)

## Gettings Started

### Installation

```bash
$ pip install jsontableschema # v0.10
$ pip install tableschema --pre # v1.0-alpha
```
### Example

```python
from tableschema import Table

# Create table
table = Table('path.csv', schema='schema.json')

# Print schema descriptor
print(table.schema.descriptor)

# Print cast rows in a dict form
for keyed_row in table.iter(keyed=True):
print(keyed_row)
```

### Table

Table represents data described by Table Schema:

```python
# pip install sqlalchemy tableschema-sql
import sqlalchemy as sa
from pprint import pprint
from tableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/frictionlessdata/tableschema-py/master/data/data_infer.csv'

# Create SQL database
db = sa.create_engine('sqlite://')

# Data processor
def skip_under_30(erows):
for number, headers, row in erows:
krow = dict(zip(headers, row))
if krow['age'] >= 30:
yield (number, headers, row)

# Work with table
table = Table(SOURCE, post_cast=[skip_under_30])
table.schema.save('tmp/persons.json') # Save INFERRED schema
table.save('persons', backend='sql', engine=db) # Save data to SQL
table.save('tmp/persons.csv') # Save data to DRIVE

# Check the result
pprint(Table('persons', backend='sql', engine=db).read(keyed=True))
pprint(Table('tmp/persons.csv').read(keyed=True))
# Will print (twice)
# [{'age': 39, 'id': 1, 'name': 'Paul'},
# {'age': 36, 'id': 3, 'name': 'Jane'}]
```

### Schema

A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a filepath or url to a JSON file, or a Python dict. The schema is initially validated (see [validate](#validate) below), and will raise an exception if not a valid Table Schema.

```python
from tableschema import Schema

# Init schema
schema = Schema('path.json')

# Cast a row
schema.cast_row(['12345', 'a string', 'another field'])
```

Methods available to `Schema` instances:

- `descriptor` - return schema descriptor
- `fields` - an array of the schema's Field instances
- `headers` - an array of the schema headers
- `primary_key` - the primary key field for the schema as an array
- `foreignKey` - the foreign key property for the schema as an array
- `get_field(name)` - return the field object for given name
- `has_field(name)` - return a bool if the field exists in the schema
- `cast_row(row, no_fail_fast=False)` - return row cast against schema
- `save(target)` - save schema to filesystem

Where the option `no_fail_fast` is given, it will collect all errors it encouters and an exceptions.MultipleInvalid will be raised (if there are errors).

### Field

```python
from tableschema import Field

# Init field
field = Field({'name': 'name', type': 'number'})

# Cast a value
field.cast_value('12345') # -> 12345
```

Data values can be cast to native Python objects with a Field instance. Type instances can be initialized with [field descriptors](https://specs.frictionlessdata.io/table-schema/). This allows formats and constraints to be defined.

Casting a value will check the value is of the expected type, is in the correct format, and complies with any constraints imposed by a schema. E.g. a date value (in ISO 8601 format) can be cast with a DateType instance. Values that can't be cast will raise an `InvalidCastError` exception.

Casting a value that doesn't meet the constraints will raise a `ConstraintError` exception.

### validate

Given a schema as JSON file, url to JSON file, or a Python dict, `validate` returns `True` for a valid Table Schema, or raises an exception, `SchemaValidationError`. It validates only **schema**, not data against schema!

```python
import io
import json

from tableschema import validate

with io.open('schema_to_validate.json') as stream:
descriptor = json.load(stream)

try:
tableschema.validate(descriptor)
except tableschema.exceptions.SchemaValidationError as exception:
# handle error

```

It may be useful to report multiple errors when validating a schema. This can be done with `no_fail_fast` flag set to True.

```python
try:
tableschema.validate(descriptor, no_fail_fast=True)
except tableschema.exceptions.MultipleInvalid as exception:
for error in exception.errors:
# handle error
```

### infer

Given headers and data, `infer` will return a Table Schema as a Python dict based on the data values. Given the data file, data_to_infer.csv:

```
id,age,name
1,39,Paul
2,23,Jimmy
3,36,Jane
4,28,Judy
```

Call `infer` with headers and values from the datafile:

```python
import io
import csv

from tableschema import infer

filepath = 'data_to_infer.csv'
with io.open(filepath) as stream:
headers = stream.readline().rstrip('\n').split(',')
values = csv.reader(stream)

schema = infer(headers, values)
```

`schema` is now a schema dict:

```python
{u'fields': [
{
u'description': u'',
u'format': u'default',
u'name': u'id',
u'title': u'',
u'type': u'integer'
},
{
u'description': u'',
u'format': u'default',
u'name': u'age',
u'title': u'',
u'type': u'integer'
},
{
u'description': u'',
u'format': u'default',
u'name': u'name',
u'title': u'',
u'type': u'string'
}]
}
```

The number of rows used by `infer` can be limited with the `row_limit` argument.

### CLI

> It's a provisional API excluded from SemVer. If you use it as a part of other program please pin concrete `goodtables` version to your requirements file.

Table Schema features a CLI called `tableschema`. This CLI exposes the `infer` and `validate` functions for command line use.

Example of `validate` usage:

```
$ tableschema validate path/to-schema.json
```

Example of `infer` usage:

```
$ tableschema infer path/to/data.csv
```

The response is a schema as JSON. The optional argument `--encoding` allows a character encoding to be specified for the data file. The default is utf-8.

### Storage

The library includes interface declaration to implement tabular `Storage`:

![Storage](data/storage.png)

An implementor should follow `tableschema.Storage` interface to write his
own storage backend. This backend could be used with `Table` class. See `plugins`
system below to know how to integrate custom storage plugin.

### plugins

Table Schema has a plugin system. Any package with the name like `tableschema_<name>` could be imported as:

```python
from tableschema.plugins import <name>
```

If a plugin is not installed `ImportError` will be raised with a message describing how to install the plugin.

A list of officially supported plugins:
- BigQuery Storage - https://github.com/frictionlessdata/tableschema-bigquery-py
- Pandas Storage - https://github.com/frictionlessdata/tableschema-pandas-py
- SQL Storage - https://github.com/frictionlessdata/tableschema-sql-py

## API Reference

### Snapshot

```
Table(source, schema=None, post_cast=None, backend=None, **options)
stream -> tabulator.Stream
schema -> Schema
name -> str
iter(keyed/extended=False) -> (generator) (keyed/extended)row[]
read(keyed/extended=False, limit=None) -> (keyed/extended)row[]
save(target, backend=None, **options)
Schema(descriptor)
descriptor -> dict
fields -> Field[]
headers -> str[]
primary_key -> str[]
foreign_keys -> str[]
get_field(name) -> Field
has_field(name) -> bool
cast_row(row, no_fail_fast=False) -> row
save(target)
Field(descriptor)
descriptor -> dict
name -> str
type -> str
format -> str
constraints -> dict
cast_value(value, constraints=True) -> value
test_value(value, constraints=True) -> bool
validate(descriptor, no_fail_fast=False) -> bool
infer(headers, values) -> descriptor
exceptions
~cli
---
Storage(**options)
buckets -> str[]
create(bucket, descriptor, force=False)
delete(bucket=None, ignore=False)
describe(bucket, descriptor=None) -> descriptor
iter(bucket) -> (generator) row[]
read(bucket) -> row[]
write(bucket, rows)
plugins
```

### Detailed

- [Docstrings](https://github.com/frictionlessdata/tableschema-py/tree/master/tableschema)
- [Changelog](https://github.com/frictionlessdata/tableschema-py/commits/master)

## Contributing

Please read the contribution guideline:

[How to Contribute](CONTRIBUTING.md)

Thanks!

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tableschema-1.0.0a5.tar.gz (35.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

tableschema-1.0.0a5-py2.py3-none-any.whl (55.1 kB view details)

Uploaded Python 2Python 3

File details

Details for the file tableschema-1.0.0a5.tar.gz.

File metadata

  • Download URL: tableschema-1.0.0a5.tar.gz
  • Upload date:
  • Size: 35.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No

File hashes

Hashes for tableschema-1.0.0a5.tar.gz
Algorithm Hash digest
SHA256 891ff9b3ea4a01de7ef5f9608dd70929106836f9ef49706324040a2082cb7063
MD5 2f2c8dd79f222dd33ca697d7e04f9696
BLAKE2b-256 42e589007e147c6eb80f85fba1b31ad476936eb404713f4f6a4943443e7d2c49

See more details on using hashes here.

File details

Details for the file tableschema-1.0.0a5-py2.py3-none-any.whl.

File metadata

File hashes

Hashes for tableschema-1.0.0a5-py2.py3-none-any.whl
Algorithm Hash digest
SHA256 58436b4042c536f17541017d96b1001358450f402e9b0f90d3364b9581c2d121
MD5 a23d3dd91d82ca5ee75db07b16de3182
BLAKE2b-256 14d1ac49f9515af40ce03cc764a80d882bb05ac22f7260262c618acf47eb1baa

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page