YAML-based data loader

author: Lele Gaifax
contact: lele@metapensiero.it
license: GNU General Public License version 3 or later

Data loader

Load new instances into the database, or update/delete existing ones, given a data structure represented by a YAML stream such as the following:

- entity: gam.model.Fascicolo
  key: descrizione
  # no data, just "declare" the entity

- entity: gam.model.TipologiaFornitore
  key: tipologiafornitore
  rows:
    - &tf_onesto
      tipologiafornitore: Test fornitori onesti

- entity: gam.model.ClienteFornitore
  key: descrizione
  rows:
    - descrizione: Test altro fornitore onesto
      tipologiafornitore: *tf_onesto
      partitaiva: 01234567890
    - &cf_lele
      codicefiscale: GFSMNL68C18H612V
      descrizione: Dipendente A

- entity: gam.model.Dipendente
  key: codicefiscale
  rows:
    - &lele
      codicefiscale: GFSMNL68C18H612V
      nome: Emanuele
      cognome: Gaifas
      clientefornitore: *cf_lele
      foto: !File {path: ../img/lele.jpg}

- entity: gam.model.Attrezzature
  key: descrizione
  rows:
    - &macchina
      descrizione: Fiat 500
      foto: !File
        compressor: lzma
        content: !!binary |
          /Td6WFoAAATm1rRGAgAhA...

- entity: gam.model.Prestiti
  key:
    - dipendente
    - attrezzatura
  rows:
    - dipendente: *lele
      attrezzatura: *macchina

As you can see, the YAML document is a sequence of entries, each one defining the content of a set of instances of a particular entity.

The entity must be the fully qualified dotted name of the SQLAlchemy mapped class.

The key entry may be either a single attribute name or a list of them, not necessarily corresponding to the primary key of the entity, provided that it uniquely identifies a single instance. To handle the simplest case of structured values (for example, when a field is backed by a PostgreSQL HSTORE), the key attribute name may be in the form name->slot:

- entity: model.Product
  key: description->en
  rows:
    - &cage
      description:
        en: "Roadrunner cage"
        it: "Gabbia per struzzi"

The rows (or data) may be either a single item or a list of them, each containing the data of a single instance, usually a dictionary.

When all (or most of) the instances share the same fields, a more compact representation may be used:

- entity: model.Values
  key:
    - product
    - attribute
  fields: [ product, attribute, value ]
  rows:
    - [ *cage, *size, 110cm x 110cm x 120cm ]
    - [ *cage, *weight, 230kg ]

where fields contains a list of field names and rows is a sequence of lists, each containing the values of a single instance. The two syntaxes may be mixed though, so you can say:

- entity: model.Person
  key: [ lastname, firstname ]
  fields: [ lastname, firstname, password ]
  rows:
    - [ gaifax, lele, "123456" ]
    - [ foobar, john, "abcdef" ]
    - lastname: rossi
      firstname: paolo
      birthdate: 1950-02-03

If you have a tab-separated-values file, you may say:

- entity: model.Cities
  key:
    - name
    - country
  fields: [ name, country ]
  rows: !TSV {path: ../data/cities.txt, encoding: utf-8}

and if the field names are included in the first row of the file, simply omit the fields slot:

- entity: model.Countries
  key:
    - code
  rows: !TSV {path: ../data/countries.txt, encoding: utf-8}

The dbloady tool iterates over all the entities and, for each instance, determines whether it already exists by querying the database with the given key: if it does, it updates the existing instance; otherwise it creates a new one and initializes it with its data.
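
In other words, for each row dbloady performs an upsert keyed on the given attributes. A minimal sketch of that logic in plain SQLAlchemy terms (illustrative only, not the actual implementation; load_row is a made-up name):

def load_row(session, entity, key, row):
    # Build the lookup filter from the key attribute(s)
    keys = key if isinstance(key, list) else [key]
    filters = {k: row[k] for k in keys}
    instance = session.query(entity).filter_by(**filters).one_or_none()
    if instance is None:
        # Not found: create a brand new instance
        instance = entity()
        session.add(instance)
    # In both cases assign (or refresh) the given data
    for attr, value in row.items():
        setattr(instance, attr, value)
    return instance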

Test fixture facility

With the option --save-new-instances, newly created instances will be written (actually added) to the given file in YAML format, so that at some point they can be deleted using the option --delete on that file. Ideally

dbloady -u postgresql://localhost/test -s new.yaml fixture.yaml
dbloady -u postgresql://localhost/test -D new.yaml

should remove the fixture's traces from the database, provided it contains only new data.

Pre and post load scripts

The option --preload may be used to execute an arbitrary Python script before any load happens. This is useful either to tweak the YAML context or to alter the set of file names specified on the command line (received as the fnames global variable).

The following script registers a custom constructor that recognizes the tag !time, or a value like T12:34, as a datetime.time value:

import datetime, re
from ruamel import yaml

def time_constructor(loader, node):
    # Accept both the plain "12:34[:56]" and the "T12:34[:56]" forms
    value = loader.construct_scalar(node)
    if value.startswith('T'):
        value = value[1:]
    parts = map(int, value.split(':'))
    return datetime.time(*parts)

yaml.add_constructor('!time', time_constructor)
# Implicitly resolve untagged scalars starting with "T" that look like a time
yaml.add_implicit_resolver('!time', re.compile(r'^T?\d{2}:\d{2}(:\d{2})?$'), ['T'])
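
Assuming the constructor above has been registered, a quick interactive check (not part of dbloady) shows the implicit resolver at work:

from ruamel import yaml

print(yaml.load('start: T09:30', Loader=yaml.Loader))
# should print: {'start': datetime.time(9, 30)}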

As another example, the following script handles input files with a .gpg suffix, decrypting them on the fly to a temporary file that will be deleted when the program exits:

import atexit, os, subprocess, tempfile

def decipher(fname):
    print("Input file %s is encrypted, please enter passphrase" % fname)
    # Reserve a temporary file name: the file itself is removed as soon as
    # the context manager closes it, only its name is used below
    with tempfile.NamedTemporaryFile(suffix='.yaml') as f:
        tmpfname = f.name
    subprocess.run(['gpg', '-q', '-o', tmpfname, fname], check=True)
    # Remove the decrypted file when the program exits
    atexit.register(lambda n=tmpfname: os.unlink(n))
    return tmpfname

fnames = [decipher(fname) if fname.endswith('.gpg') else fname for fname in fnames]

A session then looks like this:

dbloady -u postgresql://localhost/test -p preload.py data.yaml.gpg
Input file data.yaml.gpg is encrypted, please enter passphrase
/tmp/tmpfhjrdqgf.yaml: ......
Committing changes

The option --postload may be used to perform additional steps after all YAML files have been loaded but before the DB transaction is committed.

The pre/post load scripts are executed with a context containing the following variables:

session
    the SQLAlchemy session

dry_run
    the value of the --dry-run option

fnames
    the list of file names specified on the command line
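
For instance, a --postload script might use those variables along these lines (a sketch; the materialized view name is made up):

from sqlalchemy import text

print("Loaded %d file(s): %s" % (len(fnames), ", ".join(fnames)))
if not dry_run:
    # Recompute derived data before the final commit
    session.execute(text("REFRESH MATERIALIZED VIEW stats"))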

Generic foreign keys

Version 1.6 introduced rudimentary and experimental support for the generic foreign keys trick. It assumes that they are implemented with a hybrid property that exposes a custom comparator. See tests/generic_fk/model.py for an example.
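
On the Python side, the mapping might look roughly like the following sketch (hypothetical: it only illustrates the hybrid property/custom comparator arrangement the loader relies on, not the actual content of tests/generic_fk/model.py):

from sqlalchemy import Column, Integer, String, and_
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.hybrid import Comparator, hybrid_property

Base = declarative_base()

class RelatedObjectComparator(Comparator):
    "Compare the (related_table, related_id) pair against a mapped instance."

    def __init__(self, cls):
        self.cls = cls

    def __eq__(self, obj):
        return and_(self.cls.related_table == obj.__table__.name,
                    self.cls.related_id == obj.id)

class Address(Base):
    __tablename__ = 'addresses'

    id = Column(Integer, primary_key=True)
    street = Column(String)
    # The two columns implementing the "generic" reference
    related_table = Column(String)
    related_id = Column(Integer)

    @hybrid_property
    def related_object(self):
        # Instance level getter, e.g. a lookup in a table registry:
        # omitted for brevity
        raise NotImplementedError

    @related_object.setter
    def related_object(self, obj):
        self.related_table = obj.__table__.name
        self.related_id = obj.id

    @related_object.comparator
    def related_object(cls):
        return RelatedObjectComparator(cls)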

With a proper configuration, the following works:

- entity: model.Customer
  key: name
  data:
    - &customer
      name: Best customer

- entity: model.Supplier
  key: company_name
  data:
    - &supplier
      company_name: ACME

- entity: model.Address
  key:
    - related_object
    - street
  data:
    - related_object: *customer
      street: 123 anywhere street
    - related_object: *supplier
      street: 321 long winding road

Direct assignment of primary keys

When the attribute does not correspond to a relationship property, assignment of an instance reference will set the attribute to the instance’s primary key:

- entity: model.Person
  key:
    - lastname
    - firstname
  fields:
    - lastname
    - firstname
  data:
    - &johndoe [ Doe, John ]

- entity: model.CannedFilter
  key: description
  data:
    - &onlyjohndoe
      description: "Only John Doe"

- entity: model.Condition
  key:
    - filter
    - fieldname
  data:
    - filter: *onlyjohndoe
      fieldname: "persons.id"
      fieldvalue: *johndoe

Raw SQL values

Sometimes a value requires executing an arbitrary query against the database, perhaps because it is computed by a trigger or, more generally, because it cannot be determined from the YAML content:

- entity: model.Number
  key: id
  data:
    - id: 1
      absolute: !SQL {query: "SELECT abs(:num)", params: {num: -1}}
    - id: !SQL {query: "SELECT abs(:num)", params: {num: -2}}
      absolute: !SQL {query: "SELECT abs(:num)", params: {num: -2}}
    - id: 3
      absolute: !SQL {query: "SELECT count(*) FROM numbers"}

The specified query must return a single value, as it is executed with session.scalar().
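
In other words, each !SQL value is resolved with something along these lines (a sketch, not the actual code):

from sqlalchemy import text

value = session.scalar(text("SELECT abs(:num)"), {"num": -1})
assert value == 1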

Data dumper

With the complementary tool, dbdumpy, you can obtain a YAML representation of a database in the same format used by dbloady. It is rather simple and, in particular, it does not handle reference cycles.

The tool is driven by a specs file, a YAML document composed of two parts: the first defines the pivot instances (that is, the entry points), the second describes how each entity must be serialized and in which order.

Consider the following document:

- entity: model.Result
---
- entity: model.Person
  key:
    - lastname
    - firstname

- entity: model.Exam
  fields: description
  key: description

- entity: model.Result
  key:
    - person
    - exam
  other:
    - vote

It tells dbdumpy to consider all instances of model.Result as the pivots, then defines how each entity must be serialized, simply by listing the key attribute(s) and any other field. Alternatively, you can specify a list of field names to obtain the more compact form described above.

Executing

dbdumpy -u sqlite:////foo/bar.sqlite spec.yaml

will emit the following on stdout:

- entity: model.Person
  key:
  - lastname
  - firstname
  rows:
  - &id002
    firstname: John
    lastname: Doe
  - &id003
    firstname: Bar
    lastname: Foo
- entity: model.Exam
  fields: description
  key: description
  rows:
  - &id001
    - Drive license
- entity: model.Result
  key:
  - person
  - exam
  rows:
  - exam: *id001
    person: *id002
    vote: 10
  - exam: *id001
    person: *id003
    vote: 5

Changes

2.10 (2020-01-18)

  • Fix an issue with loading many-to-many relationships

2.9 (2019-06-24)

  • Mimic how PostgreSQL decodes \N as None in the TSV tag

2.8 (2019-06-24)

  • Ability to load data from an external tab-separated-values file

2.7 (2019-05-10)

  • Emit a critical log on attribute assignment failure, to aid debugging bad input

2.6 (2018-04-17)

  • Remove the fixup to progressbar2 issue #162, solved in its 3.7.1 release

2.5 (2018-04-09)

  • Try to fix different behaviour in progressbar2 3.7.0 w.r.t. multiple progress bars

2.4 (2018-04-08)

  • Now File elements can read text files

  • Support dumping hstore values (not tested enough, though)

2.3 (2017-06-07)

  • Fix handling of property based attributes

2.2 (2017-05-18)

  • The File elements may now contain their content, without accessing external files

2.1 (2017-05-02)

  • New --quiet option to omit the progress bar

2.0 (2017-04-06)

1.11 (2017-03-22)

  • Spring cleanup, no externally visible changes

1.10 (2016-11-16)

1.9 (2016-11-15)

  • Ability to execute raw SQL statements to fetch a value from the database

1.8 (2016-11-15)

  • Better tests

  • Handle assignments to non-relationship properties

1.7 (2016-11-05)

  • Make Python 3 happy by explicitly using binary mode to read external files

1.6 (2016-10-29)

1.5 (2016-03-12)

  • New complementary dump functionality, exposed by a new CLI tool, dbdumpy

  • Cosmetic, backward-compatible changes to the YAML format, for nicer sorting

1.4 (2016-02-10)

  • Data files and preload/postload scripts may also be specified as package-relative resources

1.3 (2016-01-14)

  • New --preload and --postload options to execute arbitrary Python scripts before or after the load

1.2 (2016-01-09)

  • Fix source distribution

1.1 (2016-01-09)

  • Fix data refs when loading from compact representation

1.0 (2016-01-07)

  • Allow more compact representation when all instances share the same fields

  • Extract dbloady from metapensiero.sphinx.patchdb 1.4.2 into a standalone package
