Skip to main content

Extract a list from JSON or XML, save to excel, csv, etc.

Project description

extract-list

Background

extract-list is a command line tool for turning selected data from JSON or XML files into a list of columns. It is useful when the source data is structured, but the result is needed in a spreadsheet, CSV file, text file, or another table-like format.

What it does

extract-list:

  • reads data from an XML file or a JSON file
  • extracts the configured parts of that input data
  • writes the extracted data as rows and columns

Examples of output formats include:

  • Excel
  • CSV
  • ODS
  • HTML
  • Markdown
  • plain text
  • JSON
  • XML

The exact output choices available in an installed environment can be listed with the cfg-example --help command. Some output formats are provided through TableIO, and some are handled directly by extract-list.

Installing it

extract-list requires Python 3.13 or newer. Python can be downloaded from https://www.python.org/downloads/.

Installing on macOS and Linux

pip3 install --upgrade extract-list

Installing on Microsoft Windows

pip install --upgrade extract-list

Version history

Version Date Python version Description
0.2 06 Jan 2025 3.12.6 or newer First released version
0.2.2 23 Mar 2025 3.13.2 or newer Adapted to Python 3.13.2
0.2.3 23 Mar 2025 3.13.2 or newer Fix in README only
0.2.5 09 Apr 2025 3.13.3 or newer Using newer dependencies
0.2.7 09 Jun 2025 3.12.x add version sub-command
0.2.8 09 Jun 2025 3.13 or newer add version sub-command
0.2.10 23 Jul 2025 3.12.x updated dependencies
0.2.11 23 Jul 2025 3.13 or newer updated dependencies
0.2.13 23 Nov 2025 3.12.x updated dependencies
0.2.14 23 Nov 2025 3.13 or newer updated dependencies
0.3 01 Jun 2026 3.13 or newer New output config; more formats

Running the application

Running on macOS and Linux

python3 -m extract_list --help
python3 -m extract_list cfg-example --help
python3 -m extract_list extract --help
python3 -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python3 -m extract_list extract -c example.cfg -i input.json -o output.xlsx

Running on Microsoft Windows

python -m extract_list --help
python -m extract_list cfg-example --help
python -m extract_list extract --help
python -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python -m extract_list extract -c example.cfg -i input.json -o output.xlsx

Suggested way to get started

  1. Use the cfg-example sub-command to generate a few example configuration files.
  2. Read each generated .cfg file together with its generated .txt description file.
  3. Find the example that is closest to what you want to extract.
  4. Modify that configuration file to match your input data and desired output.
  5. Use the extract sub-command to read your data and write the output file.
  6. Check the produced output. If needed, adjust the configuration and run the extraction again.

Example configuration files

When cfg-example creates a configuration file, it also creates a text file that describes the example and the configuration syntax. If the configuration file is named example.cfg, the description file is named example.txt.

The generated .txt file is the best place to look for detailed configuration reference text, because it is generated by the same installed version of extract-list that will read the configuration.

Migrating old configuration files

Version 0.3 uses the nested output configuration object. Older 0.2-style configuration files that use settings such as outfile_type, outfile_encoding, out_csv_dialect, or outfile_excel_library can be migrated with the migrate-cfg sub-command.

Migrating on macOS and Linux

python3 -m extract_list migrate-cfg -i old.cfg -o new.cfg

Migrating on Microsoft Windows

python -m extract_list migrate-cfg -i old.cfg -o new.cfg

extract-list can still read the old configuration format with backward compatibility handling, but migration is recommended before editing the configuration further.

Configuration file overview

The configuration file is written in JSON syntax: https://en.wikipedia.org/wiki/JSON. The keywords and nesting are important. The order of keywords is not significant. Indentation and line breaks are not significant.

The encoding for the configuration file must be UTF-8. US-ASCII is a subset of UTF-8.

It is recommended to generate a configuration file with cfg-example and then edit that file. Writing a configuration file from scratch is possible, but the generated examples are a much safer starting point.

Type of input file

The type of input file to read is determined by infile_type. infile_type can have these values:

  • "JSON"
  • "XML"

Both JSON and XML are text file formats, so the input file has a character encoding. This is specified with infile_encoding. Unless you know that you need another encoding, leave this as it is in the generated example configuration.

Type of output file

Output files are configured in the nested output object. The output format is selected with output.format_name. The value is case-insensitive and is normalized when the configuration is read.

Examples of commonly useful output.format_name values include:

  • "Excel"
  • "CSV"
  • "ODS"
  • "HTML"
  • "md"
  • "txt"
  • "JSON"
  • "XML"
  • "Plaintxt"

Run python3 -m extract_list cfg-example --help on macOS or Linux, or python -m extract_list cfg-example --help on Windows, to see the full list of output formats available in your installed environment.

JSON, XML, and Plaintxt output are handled inside extract-list. Other table-style output is handled through TableIO. The same output.format_name configuration member is used for both groups.

Text-based output formats can have a character encoding. This is configured with output.character_encoding. Unless you know that you need another encoding, leave this as it is in the generated example configuration.

Comma separated values files may differ slightly depending on the programs used to read and write them and the locale used. CSV settings are configured in the optional nested output.csv object.

TableIO may provide several implementations for a file format. output.implementation can force a specific implementation. If it is omitted, TableIO chooses the best available implementation.

The outfile_border and outfile_filtered_area values can be "NO", "IF_AVAILABLE", or "NEEDED". "NO" disables the feature request. "IF_AVAILABLE" requests the feature when the selected output implementation can provide it. "NEEDED" requires support for the feature.

Data to extract

The input file is likely to contain more data than should appear in the output. The data to extract is specified with main_line and linked_lines.

The main_line setting describes the main input records that become output rows. Linked lines describe additional records that are tied to the main line. A linked line is tied to a main line when a configured value in the linked line has the same value as a configured value in the main line.

The line member of main_line and linked_lines is a list of strings. This list is the path of keys to the input records. Directly below that path there is either a list or a dictionary of records.

The records of the main_line are indexed in the input, either by integer position in a list or by key in a dictionary. If you want this index or key in the output, set include_key to true. If not, set it to false. The output column name for this key is configured with column_name_for_key.

The columns member of main_line and linked_lines maps output column names to paths inside each input record. Each path is written as a list of strings.

Sometimes one input record contains several nested records. For example, a purchase order may contain several purchased items. Because the output is a list of columns, such an input record must be expanded into several output rows. The shared values from the input record are duplicated on those output rows. The expand_at member contains the relative paths where this expansion should happen.

There can be only one main_line. There may be any number of linked lines. The linked lines are configured as an array in linked_lines.

For each linked line, linked_column and linked_main_column define how it is tied to the main line. Both values are relative paths in the input records. The value at linked_column in the linked-line record must equal the value at linked_main_column in the main-line record.

The values used to link records may also be extracted as output columns, but they do not have to be.

Several linked lines can be tied to the same main line. one_output_line_per_main_line controls how this is handled. If it is false, the main-line data is duplicated so the output has one row for each tied combination of main-line and linked-line records. If it is true, multiple linked lines for one main line are reported as an error.

Input records may be missing optional values. missing_input_for_column controls how missing input data is handled. Its values are "EMPTY" and "ERROR". If it is "EMPTY", the output column resulting from missing input data is left empty.

XML attributes

XML member values may be written either as nested objects or as attributes. If the input has XML attributes, the attribute key has an @ prefix. To handle this, in_xml_strip_at can be set to true or false. If it is true, a leading @ character is stripped from each key.

The out_xml_attributes setting specifies column names that should be written as XML attributes in XML output instead of as nested objects.

Output column order

The order of the columns in the output is specified with column_order. The value is a list of strings.

Specifying a column in column_order that has not been extracted is an error. Extracting a column without specifying it in column_order is also an error.

Output line order

The order_rows_by setting specifies which columns should be used for sorting the output rows. The most significant column should be first in the list.

The default row order is based on the list of columns in column_order. Leave order_rows_by as an empty list unless you need another specific order.

Source code

Source code and tests are available at https://bitbucket.org/tom-bjorkholm/extract-list.

Test summary

  • Test result: 1247 passed in 35s
  • No flake8 warnings.
  • No mypy errors found.
  • No python layout warnings.
  • Built version(s): 0.4
  • Build and test using Python 3.14.5

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

extract_list-0.4.tar.gz (32.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

extract_list-0.4-py3-none-any.whl (40.5 kB view details)

Uploaded Python 3

File details

Details for the file extract_list-0.4.tar.gz.

File metadata

  • Download URL: extract_list-0.4.tar.gz
  • Upload date:
  • Size: 32.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.14.5

File hashes

Hashes for extract_list-0.4.tar.gz
Algorithm Hash digest
SHA256 e8ccc0f38fd36fe810aa39bd4a544f7fb8ad021fa9dd6e30179a5b16f16bef78
MD5 d97696a36c73341c2b3299b2cabf2503
BLAKE2b-256 9748e1a7048f784e938f7ed0da95258bc723f5224423bfde03d2aaab33562130

See more details on using hashes here.

File details

Details for the file extract_list-0.4-py3-none-any.whl.

File metadata

  • Download URL: extract_list-0.4-py3-none-any.whl
  • Upload date:
  • Size: 40.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.0.1 CPython/3.14.5

File hashes

Hashes for extract_list-0.4-py3-none-any.whl
Algorithm Hash digest
SHA256 6cdb9608eb391aa42aaa0067c6805191080a3e177efdcfe9a6415e46538896ae
MD5 ea5eaee66f6c8560c3f721eaf480faec
BLAKE2b-256 b66d29cdc59cec8a6727cd28671851cdb120f834f8f3b4ffe4770d00cccc7370

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page