The OpenRefine Python Client Library provides an interface for communicating with an OpenRefine server. This fork extends the command line interface (CLI).

Project description

OpenRefine Python Client with extended command line interface

The OpenRefine Python Client from PaulMakepeace provides a library for communicating with an OpenRefine server. This fork extends the command line interface (CLI) and is distributed as a convenient one-file-executable (Windows, Linux, macOS). It is also available via Docker Hub, PyPI and Binder.

Works with OpenRefine 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.4 and 3.4.1.

Download

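One-file-executables:

  • Windows: openrefine-client_0-3-10_windows.exe
  • macOS: openrefine-client_0-3-10_macos
  • Linux: openrefine-client_0-3-10_linux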

For Docker containers, a native Python installation, or a free on-demand Binder server, see the corresponding sections below.

Peek

A short video loop that demonstrates the basic features (list, create, apply, export):

Usage

Ensure you have OpenRefine running (i.e. available at http://localhost:3333 or another URL).

To use the client:

  1. Open a terminal pointing to the folder where you have downloaded the one-file-executable (e.g. Downloads in your home directory).

    • Windows: Open PowerShell and enter the following command:

      cd ~\Downloads
      
    • macOS: Open Terminal (Finder > Applications > Utilities > Terminal) and enter the following command:

      cd ~/Downloads
      
    • Linux: Open a terminal app (Terminal, Konsole, xterm, ...) and enter the following command:

      cd ~/Downloads
      
  2. Make the file executable.

    • Windows: not necessary

    • macOS:

      chmod +x openrefine-client_0-3-10_macos
      
    • Linux:

      chmod +x openrefine-client_0-3-10_linux
      
  3. Execute the file.

    • Windows:

      .\openrefine-client_0-3-10_windows.exe
      
    • macOS:

      ./openrefine-client_0-3-10_macos
      
    • Linux:

      ./openrefine-client_0-3-10_linux
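
For scripted use, the platform-specific file name from the steps above can be chosen programmatically. The following is a minimal sketch in Python 3 that is independent of the client itself; the helper name executable_name is hypothetical, and the file names assume release 0.3.10 as used in this README.

```python
import platform

# File names of the one-file-executables as used in this README (release 0.3.10).
EXECUTABLES = {
    "Windows": "openrefine-client_0-3-10_windows.exe",
    "Darwin": "openrefine-client_0-3-10_macos",  # platform.system() reports macOS as "Darwin"
    "Linux": "openrefine-client_0-3-10_linux",
}


def executable_name(system=None):
    """Return the executable file name for the given (or the current) platform."""
    system = system or platform.system()
    try:
        return EXECUTABLES[system]
    except KeyError:
        raise ValueError("no one-file-executable known for platform: %s" % system)
```

Calling executable_name() without an argument uses the platform the script is running on.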
      

Using tab completion and command history is highly recommended:

  • autocomplete filenames: enter a few characters and press Tab
  • recall previous command: press ↑ (up arrow)

Basic commands

Execute the client by entering its filename followed by the desired command.

The following example will download two small files (duplicates.csv and duplicates-deletion.json) into the current directory and will create a new OpenRefine project from file duplicates.csv.

Download example data (--download) and create project from file (--create):

  • Windows:

    .\openrefine-client_0-3-10_windows.exe --download "https://git.io/fj5hF" --output=duplicates.csv
    .\openrefine-client_0-3-10_windows.exe --download "https://git.io/fj5ju" --output=duplicates-deletion.json
    .\openrefine-client_0-3-10_windows.exe --create duplicates.csv
    
  • macOS:

    ./openrefine-client_0-3-10_macos --download "https://git.io/fj5hF" --output=duplicates.csv
    ./openrefine-client_0-3-10_macos --download "https://git.io/fj5ju" --output=duplicates-deletion.json
    ./openrefine-client_0-3-10_macos --create duplicates.csv
    
  • Linux:

    ./openrefine-client_0-3-10_linux --download "https://git.io/fj5hF" --output=duplicates.csv
    ./openrefine-client_0-3-10_linux --download "https://git.io/fj5ju" --output=duplicates-deletion.json
    ./openrefine-client_0-3-10_linux --create duplicates.csv
    

Other commands:

  • list all projects: --list
  • show project metadata: --info "duplicates"
  • export project to terminal: --export "duplicates"
  • apply rules from json file: --apply duplicates-deletion.json "duplicates"
  • export project to file: --export --output=deduped.xls "duplicates"
  • delete project: --delete "duplicates"
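
The commands above can also be chained from a script. The sketch below (Python 3, standard library only) builds the argument lists for a typical sequence and runs them with subprocess; the function names are hypothetical, the options are the ones listed in this section, and running the workflow assumes an OpenRefine server is reachable and the example files have been downloaded.

```python
import subprocess


def client_command(executable, *options):
    """Build the argument list for one client call, e.g. --list or --export."""
    return [executable] + list(options)


def example_workflow(executable, project="duplicates"):
    """Return the basic commands from this section in a sensible order."""
    return [
        client_command(executable, "--list"),
        client_command(executable, "--info", project),
        client_command(executable, "--apply", "duplicates-deletion.json", project),
        client_command(executable, "--export", "--output=deduped.xls", project),
        client_command(executable, "--delete", project),
    ]


def run_workflow(executable):
    """Run each command in turn and stop at the first failure."""
    for command in example_workflow(executable):
        subprocess.check_call(command)
```

Passing argument lists instead of one command string avoids shell quoting issues with project names that contain spaces.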

Getting help

Check --help for further options.

Please file an issue if you miss a feature in the command line interface or if you have found a bug. You are also welcome to ask any questions!

Change URL

By default, the client connects to OpenRefine's usual URL, http://localhost:3333. If your OpenRefine server is running somewhere else, set the hostname and port with additional command line options (e.g. for http://example.com):

  • set host: -H example.com
  • set port: -P 80
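
If the server address is only known as a URL, it can be split into the two options with the standard library. This is a minimal sketch in Python 3; the helper url_to_options is hypothetical, and it assumes plain HTTP with port 80 as the fallback when the URL names no port.

```python
from urllib.parse import urlsplit


def url_to_options(url):
    """Translate a server URL into the client's -H and -P options."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError("URL must include a scheme and a host, e.g. http://example.com")
    # Assumption: fall back to the HTTP default port when the URL names none.
    port = parts.port or 80
    return ["-H", parts.hostname, "-P", str(port)]


print(url_to_options("http://example.com"))
```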

Templating

OpenRefine's templating export supports exporting data in any text format (e.g. to construct JSON or XML). The graphical user interface offers four input fields:

  1. prefix
  2. row template
    • supports GREL inside two curly brackets, e.g. {{jsonize(cells["name"].value)}}
  3. row separator
  4. suffix

This templating functionality is available via the openrefine-client command line interface. It even provides an additional feature for splitting results into multiple files.

To try out the functionality, create another project from the example file above.

--create duplicates.csv --projectName=advanced

The following example code will export...

  • the columns "name" and "purchase" in JSON format
  • from the project "advanced"
  • for rows matching the regex text filter ^F$ in column "gender"

macOS/Linux Terminal (multi-line input with \ ):

"advanced" \
--prefix='{ "events" : [
' \
--template='    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender'

Windows PowerShell (multi-line input with `; quotes need to be doubled):

"advanced" `
--prefix='{ ""events"" : [
' `
--template='    { ""name"" : {{jsonize(cells[""name""].value)}}, ""purchase"" : {{jsonize(cells[""purchase""].value)}} }' `
--rowSeparator=',
' `
--suffix='
] }' `
--filterQuery='^F$' `
--filterColumn='gender'
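
To see how the four fields and the filter combine, the export can be emulated locally. The sketch below is plain Python 3 and does not talk to OpenRefine: json.dumps stands in for GREL's jsonize(), the regex filter is assumed to behave like a search on the column value, and the sample rows are hypothetical rather than the contents of duplicates.csv.

```python
import json
import re


def render_template_export(rows, prefix, row_template, row_separator, suffix,
                           filter_column=None, filter_query=None):
    """Emulate how prefix, row template, row separator and suffix are combined."""
    if filter_column is not None and filter_query is not None:
        pattern = re.compile(filter_query)
        rows = [row for row in rows if pattern.search(row[filter_column])]
    rendered = [row_template(row) for row in rows]
    return prefix + row_separator.join(rendered) + suffix


def row_template(row):
    # json.dumps stands in for jsonize(): it quotes and escapes the value.
    return '    { "name" : %s, "purchase" : %s }' % (
        json.dumps(row["name"]), json.dumps(row["purchase"]))


# Hypothetical sample rows.
rows = [
    {"name": "Ann", "purchase": "Book", "gender": "F"},
    {"name": "Bob", "purchase": "Pen", "gender": "M"},
    {"name": "Cleo", "purchase": "Lamp", "gender": "F"},
]

output = render_template_export(
    rows,
    prefix='{ "events" : [\n',
    row_template=row_template,
    row_separator=',\n',
    suffix='\n] }',
    filter_column="gender",
    filter_query="^F$",
)
print(output)
```

The row separator is only placed between rows, which is why the result parses as valid JSON without a trailing comma.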

Add the following options to the last command (recall with ↑) to store the results in multiple files. Each file will contain the prefix, one processed row, and the suffix.

--output=advanced.json --splitToFiles=true

Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json etc.). There is another option to use the value in the first column instead:

--output=advanced.json --splitToFiles=true --suffixById=true

Because our project "advanced" contains duplicates in the first column "email", this command will overwrite files (e.g. advanced_melanie.white@example2.edu.json). When using this option, the first column should contain unique identifiers.
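
The naming scheme and the overwrite problem can be illustrated without a server. This is a sketch in Python 3 that only mirrors the file names described above; the function split_filenames is hypothetical, the sample values are made up, and numbering rows from 1 in export order is an assumption based on the example names.

```python
import os


def split_filenames(output, first_column_values, suffix_by_id=False):
    """Return the file name each exported row would be written to, in row order."""
    base, extension = os.path.splitext(output)
    names = []
    for number, value in enumerate(first_column_values, start=1):
        suffix = value if suffix_by_id else number
        names.append("%s_%s%s" % (base, suffix, extension))
    return names


# Hypothetical first-column values containing one duplicate.
emails = [
    "melanie.white@example2.edu",
    "gabe@example.com",
    "melanie.white@example2.edu",
]

by_row = split_filenames("advanced.json", emails)
by_id = split_filenames("advanced.json", emails, suffix_by_id=True)
print(by_row)
print(len(set(by_id)), "distinct files for", len(by_id), "rows")
```

With row numbers every row gets its own file; with identifiers the duplicate value maps to the same file name twice, so the later row replaces the earlier one.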

Append data to an existing project

OpenRefine does not support appending rows to an existing project. Until the corresponding feature request is implemented, you can use the openrefine-client to script a workaround:

  1. export existing project as csv
  2. put old and new data into a zip archive
  3. create new project by importing the zip archive

Here is an example that replaces the existing project:

openrefine-client --export myproject --output old.csv
openrefine-client --delete myproject
zip combined.zip old.csv new.csv
openrefine-client --create combined.zip --format csv --projectName myproject

Note that the project id will change. If you want to distinguish between old and new data, you can use the additional flag includeFileSources:

openrefine-client --create combined.zip --format csv --projectName myproject --includeFileSources true
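
On systems without the zip command, step 2 can be done with Python's standard library. The following is a minimal Python 3 sketch of that single step; the function names are hypothetical and the tiny CSV files in demo() are made-up sample data.

```python
import os
import tempfile
import zipfile


def combine_csv_files(archive_path, csv_paths):
    """Write the given CSV files into one zip archive."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in csv_paths:
            # Store only the file name so the archive has no directory structure.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path


def demo():
    """Create two tiny CSV files in a temporary directory and combine them."""
    with tempfile.TemporaryDirectory() as workdir:
        paths = []
        for name, content in [("old.csv", "email\na@example.com\n"),
                              ("new.csv", "email\nb@example.com\n")]:
            path = os.path.join(workdir, name)
            with open(path, "w") as handle:
                handle.write(content)
            paths.append(path)
        archive_path = combine_csv_files(os.path.join(workdir, "combined.zip"), paths)
        with zipfile.ZipFile(archive_path) as archive:
            return archive.namelist()


print(demo())
```

In a real run, combine_csv_files("combined.zip", ["old.csv", "new.csv"]) would take the place of the zip call between the --delete and --create commands.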

Docker

Docker image: felixlohmeier/openrefine-client

docker pull felixlohmeier/openrefine-client:v0.3.10

Option 1: Dockerized client

Run client and mount current directory as workspace:

docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10

The docker option --network=host allows you to connect to a local or remote OpenRefine via the host network:

  • list projects on default URL (http://localhost:3333)

    docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --list
    
  • list projects on a remote server (http://example.com)

    docker run --rm --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H example.com -P 80 --list
    

Usage: same commands as explained above (see Basic commands and Templating)

Option 2: Dockerized client and dockerized OpenRefine

Run openrefine-client linked to a dockerized OpenRefine (Docker image felixlohmeier/openrefine):

  1. Create docker network

    docker network create openrefine
    
  2. Run server (will be available at http://localhost:3333)

    docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.4.1
    
  3. Run client with some basic commands: 1. download example files, 2. create project from file, 3. list projects, 4. show metadata, 5. export to terminal, 6. apply transformation rules (deduplication), 7. export again to terminal, 8. export to xls file and 9. delete project

    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --download "https://git.io/fj5hF" --output=duplicates.csv
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --download "https://git.io/fj5ju" --output=duplicates-deletion.json
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --create duplicates.csv
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --list
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --info "duplicates"
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export "duplicates"
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --apply duplicates-deletion.json "duplicates"
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export "duplicates"
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --export --output=deduped.xls "duplicates"
    docker run --rm --network=openrefine -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 -H openrefine-server --delete "duplicates"
    
  4. Stop and delete server:

    docker stop openrefine-server
    docker rm openrefine-server
    
  5. Delete docker network:

    docker network rm openrefine
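
Because every client call in step 3 repeats the same docker run prefix, a small helper can build the command line. This is a sketch in Python 3; docker_client_command is a hypothetical name, and the image tag, network name and volume mapping are the ones used in the steps above.

```python
import os

IMAGE = "felixlohmeier/openrefine-client:v0.3.10"


def docker_client_command(*client_options, network="openrefine", server=None, workdir=None):
    """Build a `docker run` argument list for the dockerized client."""
    workdir = workdir or os.getcwd()
    command = ["docker", "run", "--rm", "--network=" + network,
               "-v", workdir + ":/data:z", IMAGE]
    if server:
        # Container name of the OpenRefine server inside the docker network.
        command += ["-H", server]
    return command + list(client_options)


print(" ".join(docker_client_command("--list", server="openrefine-server")))
```

The resulting list can be passed to subprocess.check_call once the network and the server container from steps 1 and 2 exist.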
    

Customize OpenRefine server:

  • If you want to add an OpenRefine startup option you need to repeat the default commands (cf. Dockerfile)

    • -i 0.0.0.0 sets OpenRefine to be accessible from outside the container, i.e. from host
    • -d /data sets OpenRefine workspace
  • Example for allocating more memory to OpenRefine with additional option -m 4G

    docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:3.4.1 -i 0.0.0.0 -d /data -m 4G
    
  • The OpenRefine version is defined by the docker tag. Check the DockerHub repository for available tags. Example for OpenRefine 2.8 with same options as above:

    docker run -d -p 3333:3333 --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
    
  • If you want OpenRefine to read and write persistent data in host directory (i.e. store projects) you can mount the container path /data. Example for host directory /home/felix/refine:

    docker run -d -p 3333:3333 -v /home/felix/refine:/data:z --network=openrefine --name=openrefine-server felixlohmeier/openrefine:2.8 -i 0.0.0.0 -d /data -m 4G
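
The customizations above (version tag, memory, persistent workspace) differ only in a few arguments, so they can be generated by one function. The sketch below is Python 3; docker_server_command and its parameter names are hypothetical, while the docker and OpenRefine options are the ones shown in this section.

```python
def docker_server_command(tag="3.4.1", memory=None, host_dir=None,
                          network="openrefine", name="openrefine-server"):
    """Build a `docker run` argument list for the dockerized OpenRefine server."""
    command = ["docker", "run", "-d", "-p", "3333:3333"]
    if host_dir:
        # Mount a host directory as the OpenRefine workspace to persist projects.
        command += ["-v", host_dir + ":/data:z"]
    command += ["--network=" + network, "--name=" + name,
                "felixlohmeier/openrefine:" + tag]
    if memory:
        # Adding a startup option means repeating the defaults -i and -d as well.
        command += ["-i", "0.0.0.0", "-d", "/data", "-m", memory]
    return command


print(" ".join(docker_server_command(tag="2.8", memory="4G", host_dir="/home/felix/refine")))
```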
    

Python

openrefine-client PyPI (requires Python 2.x)

python2 -m pip install openrefine-client --user

This will install the package openrefine-client containing modules in google.refine.

A command line script openrefine-client will also be installed.

Option 1: command line script

openrefine-client --help

Usage: same commands as explained above (see Basic commands and Templating)

Option 2: using cli functions in Python 2.x environment

Import module cli:

from google.refine import cli

Change URL (if necessary):

cli.refine.REFINE_HOST = 'localhost'
cli.refine.REFINE_PORT = '3333'

Help screen:

help(cli)

Commands:

  • download (e.g. example data):

    cli.download('https://git.io/fj5hF','duplicates.csv')
    cli.download('https://git.io/fj5ju','duplicates-deletion.json')
    
  • list projects:

    cli.ls()
    
  • create project:

    p1 = cli.create('duplicates.csv')
    
  • show metadata:

    cli.info(p1.project_id)
    
  • apply rules from file to project:

    cli.apply(p1.project_id, 'duplicates-deletion.json')
    
  • export project to terminal:

    cli.export(p1.project_id)
    
  • export project to file in xls format:

    cli.export(p1.project_id, 'deduped.xls')
    
  • export templating (see Templating above):

    cli.templating(
        p1.project_id,
        prefix='''{ "events" : [
    ''',template='''    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''',
        rowSeparator=''',
    ''',suffix='''
    ] }''')
    
  • delete project:

    cli.delete(p1.project_id)
    

Option 3: the upstream way

This fork can be used in the same way as the upstream Python client library.

Some functions in the python client library are not yet compatible with OpenRefine >=3.0 (cf. issue #19 in refine-client-py).

Import modules refine and facet (facet is needed for the text facet example below):

from google.refine import refine, facet

Server Commands:

  • set up connection:

    server1 = refine.Refine('http://localhost:3333')
    
  • show version:

    server1.server.get_version()
    server1.server.version
    
  • list projects:

    server1.list_projects()
    
    • pretty print the returned dict with json.dumps:

      import json
      print(json.dumps(server1.list_projects(), indent=1))
      
  • create project:

    server1.new_project(project_file='duplicates.csv')
    
    • create and open the returned project in one step:

      project1 = server1.new_project(project_file='duplicates.csv')
      

Project commands:

  • open project:

    project1 = server1.open_project('1234567890123')
    
  • print full URL to project:

    project1.project_url()
    
  • list columns:

    project1.columns
    
  • compute text facet on first column (fails with OpenRefine >=3.2):

    project1.compute_facets(facet.TextFacet(project1.columns[0]))
    
    • print returned object

      facets = project1.compute_facets(facet.TextFacet(project1.columns[0])).facets[0]
      for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True):
          print(facets.choices[k].count, k)
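
Conceptually, a text facet counts how often each distinct value occurs in a column, and the loop above prints those counts in descending order. The following Python 3 sketch reproduces that idea locally with collections.Counter; it does not use the client library, and the sample values are hypothetical.

```python
from collections import Counter


def text_facet_counts(values):
    """Count how often each distinct value occurs, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical column values.
emails = [
    "a@example.com", "b@example.com", "a@example.com",
    "c@example.com", "a@example.com", "b@example.com",
]

for value, count in text_facet_counts(emails):
    print(count, value)
```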
      
  • compute clusters on first column:

    project1.compute_clusters(project1.columns[0])
    
  • apply rules from file to project:

    project1.apply_operations('duplicates-deletion.json')
    
  • export project:

    project1.export(export_format='tsv')
    
    • print the returned fileobject:

      print(project1.export(export_format='tsv').read())
      
    • save the returned fileobject to file:

      with open('export.tsv', 'wb') as f:
          f.write(project1.export(export_format='tsv').read())
      
  • templating export (function was added in this fork, see Templating above):

    data = project1.export_templating(
        prefix='''{ "events" : [
    ''',template='''    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }''',
        rowSeparator=''',
    ''',suffix='''
    ] }''')
    print(data.read())
    
  • print help screen with available commands (many more!):

    help(project1)
    
  • example for custom commands:

    project1.do_json('get-rows')['total']
    
  • delete project:

    project1.delete()
    

Binder

  • free to use on-demand server with Jupyter notebook, OpenRefine and Bash
  • no registration needed, will start within a few minutes
  • restricted to 2 GB RAM and server will be deleted after 10 minutes of inactivity
  • bash_kernel demo notebook for using the openrefine-client in a Linux Bash environment
  • python2 demo notebook for using the openrefine-client in a Python 2 environment

Development

If you would like to contribute to the Python client library please consider a pull request to the upstream repository refine-client-py.

Tests

Ensure you have OpenRefine running (i.e. available at http://localhost:3333). If necessary set the environment variables OPENREFINE_HOST and OPENREFINE_PORT to change the URL.
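
A script can resolve the server URL from the same environment variables. This is a minimal sketch in Python 3; the helper server_url is hypothetical, and the fallback values localhost and 3333 are assumptions taken from the default URL mentioned in this README, so the library's own fallbacks may differ.

```python
import os


def server_url(environ=None):
    """Build the OpenRefine URL from OPENREFINE_HOST and OPENREFINE_PORT."""
    environ = os.environ if environ is None else environ
    # Assumed fallbacks, based on the default URL http://localhost:3333.
    host = environ.get("OPENREFINE_HOST", "localhost")
    port = environ.get("OPENREFINE_PORT", "3333")
    return "http://%s:%s" % (host, port)


print(server_url())
```

Passing a dictionary instead of relying on os.environ makes the function easy to check without changing the real environment.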

The Python client library includes several unit tests.

  • run all tests

    python setup.py test
    
  • run subset test_facet

    python setup.py test --test-suite tests.test_facet
    

There is also a script that uses docker images to run the unit tests with different versions of OpenRefine.

  • run tests on all OpenRefine versions (from 2.0 up to 3.4.1)

    ./tests.sh -a
    
  • run tests on tag 3.4.1

    ./tests.sh -t 3.4.1
    
  • run tests on tag 3.4.1 interactively (pause before and after tests)

    ./tests.sh -t 3.4.1 -i
    
  • run tests on tags 3.4.1 and 2.7

    ./tests.sh -t 3.4.1 -t 2.7
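
The four invocations above follow one pattern, which the sketch below captures for use in automation. It is written in Python 3; tests_sh_command is a hypothetical helper, and because this README only shows -i together with -t, the sketch refuses to combine -i with -a rather than guess.

```python
def tests_sh_command(tags=None, interactive=False):
    """Build the argument list for ./tests.sh from a list of docker tags."""
    if not tags:
        if interactive:
            raise ValueError("interactive mode is only shown together with -t in the README")
        return ["./tests.sh", "-a"]  # no tags given: run on all versions
    command = ["./tests.sh"]
    for tag in tags:
        command += ["-t", tag]  # one -t option per tag
    if interactive:
        command.append("-i")
    return command


print(" ".join(tests_sh_command(["3.4.1", "2.7"])))
```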
    

For Linux there are also functional tests for all command line options.

  • run all functional tests on OpenRefine 3.4.1

    ./tests-cli.sh 3.4.1
    
  • run all functional tests on OpenRefine 3.4.1 with one-file-executable

    ./tests-cli.sh 3.4.1 openrefine-client_0-3-10_linux
    

Distributing

Note to myself: When releasing a new version...

  1. Run functional tests

    for v in 2.7 2.8 3.0 3.1 3.2 3.3 3.4 3.4.1; do
       ./tests-cli.sh $v
    done
    
  2. Make final changes in Git

  3. Build executables with PyInstaller

    • Run PyInstaller in Python 2 environments on native Windows, macOS and Linux. Should be "the oldest version of the OS you need to support"! Current release is built with:

      • Ubuntu 16.04 LTS (64-bit)
      • macOS Sierra 10.12 (64-bit)
      • Windows 7 (32-bit)
    • One-file-executables will be available in dist/.

      git clone https://github.com/opencultureconsulting/openrefine-client.git
      cd openrefine-client
      python2 -m pip install pyinstaller --user
      python2 -m pip install urllib2_file --user
      python2 -m PyInstaller --onefile refine.py --hidden-import google.refine.__main__
      
  4. Run functional tests with Linux executable

    for v in 2.7 2.8 3.0 3.1 3.2 3.3 3.4 3.4.1; do
       ./tests-cli.sh $v openrefine-client_0-3-10_linux
    done
    
  5. Create release in GitHub

  6. Build package and upload to PyPI

    python3 setup.py sdist bdist_wheel
    python3 -m twine upload dist/*
    
  7. Update Docker container

    • add new autobuild for release version
    • trigger latest build
  8. Bump openrefine-client version in related projects

Credits

Paul Makepeace, author

David Huynh, initial cut (http://markmail.org/message/jsxzlcu3gn6drtb7)

Artfinder, inspiration

Felix Lohmeier, extended the CLI features

Some data used in the test suite was taken from publicly available sources.
