Extract a list from JSON or XML, save to excel, csv, etc.
Project description
extract-list
Background
extract-list is a command line tool for turning selected data from JSON or XML files into a list of columns. It is useful when the source data is structured, but the result is needed in a spreadsheet, CSV file, text file, or another table-like format.
What it does
extract-list:
- reads data from an XML file or a JSON file
- extracts the configured parts of that input data
- writes the extracted data as rows and columns
Examples of output formats include:
- Excel
- CSV
- ODS
- HTML
- Markdown
- plain text
- JSON
- XML
The exact output choices available in an installed environment can be listed
with the cfg-example --help command. Some output formats are provided through
TableIO, and some are handled directly by extract-list.
Installing it
extract-list requires Python 3.13 or newer. Python can be downloaded from https://www.python.org/downloads/.
Installing on macOS and Linux
pip3 install --upgrade extract-list
Installing on Microsoft Windows
pip install --upgrade extract-list
Version history
| Version | Date | Python version | Description |
|---|---|---|---|
| 0.2 | 06 Jan 2025 | 3.12.6 or newer | First released version |
| 0.2.2 | 23 Mar 2025 | 3.13.2 or newer | Adapted to Python 3.13.2 |
| 0.2.3 | 23 Mar 2025 | 3.13.2 or newer | Fix in README only |
| 0.2.5 | 09 Apr 2025 | 3.13.3 or newer | Using newer dependencies |
| 0.2.7 | 09 Jun 2025 | 3.12.x | add version sub-command |
| 0.2.8 | 09 Jun 2025 | 3.13 or newer | add version sub-command |
| 0.2.10 | 23 Jul 2025 | 3.12.x | updated dependencies |
| 0.2.11 | 23 Jul 2025 | 3.13 or newer | updated dependencies |
| 0.2.13 | 23 Nov 2025 | 3.12.x | updated dependencies |
| 0.2.14 | 23 Nov 2025 | 3.13 or newer | updated dependencies |
| 0.3 | 01 Jun 2026 | 3.13 or newer | New output config; more formats |
Running the application
Running on macOS and Linux
python3 -m extract_list --help
python3 -m extract_list cfg-example --help
python3 -m extract_list extract --help
python3 -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python3 -m extract_list extract -c example.cfg -i input.json -o output.xlsx
Running on Microsoft Windows
python -m extract_list --help
python -m extract_list cfg-example --help
python -m extract_list extract --help
python -m extract_list cfg-example -k example_json -t Excel -o example.cfg
python -m extract_list extract -c example.cfg -i input.json -o output.xlsx
Suggested way to get started
- Use the
cfg-examplesub-command to generate a few example configuration files. - Read each generated
.cfgfile together with its generated.txtdescription file. - Find the example that is closest to what you want to extract.
- Modify that configuration file to match your input data and desired output.
- Use the
extractsub-command to read your data and write the output file. - Check the produced output. If needed, adjust the configuration and run the extraction again.
Example configuration files
When cfg-example creates a configuration file, it also creates a text file
that describes the example and the configuration syntax. If the configuration
file is named example.cfg, the description file is named example.txt.
The generated .txt file is the best place to look for detailed configuration
reference text, because it is generated by the same installed version of
extract-list that will read the configuration.
Migrating old configuration files
Version 0.3 uses the nested output configuration object. Older 0.2-style
configuration files that use settings such as outfile_type,
outfile_encoding, out_csv_dialect, or outfile_excel_library can be
migrated with the migrate-cfg sub-command.
Migrating on macOS and Linux
python3 -m extract_list migrate-cfg -i old.cfg -o new.cfg
Migrating on Microsoft Windows
python -m extract_list migrate-cfg -i old.cfg -o new.cfg
extract-list can still read the old configuration format with backward compatibility handling, but migration is recommended before editing the configuration further.
Configuration file overview
The configuration file is written in JSON syntax: https://en.wikipedia.org/wiki/JSON. The keywords and nesting are important. The order of keywords is not significant. Indentation and line breaks are not significant.
The encoding for the configuration file must be UTF-8. US-ASCII is a subset of UTF-8.
It is recommended to generate a configuration file with cfg-example and then
edit that file. Writing a configuration file from scratch is possible, but the
generated examples are a much safer starting point.
Type of input file
The type of input file to read is determined by infile_type. infile_type
can have these values:
"JSON""XML"
Both JSON and XML are text file formats, so the input file has a character
encoding. This is specified with infile_encoding. Unless you know that you
need another encoding, leave this as it is in the generated example
configuration.
Type of output file
Output files are configured in the nested output object. The output format is
selected with output.format_name. The value is case-insensitive and is
normalized when the configuration is read.
Examples of commonly useful output.format_name values include:
"Excel""CSV""ODS""HTML""md""txt""JSON""XML""Plaintxt"
Run python3 -m extract_list cfg-example --help on macOS or Linux, or
python -m extract_list cfg-example --help on Windows, to see the full list of
output formats available in your installed environment.
JSON, XML, and Plaintxt output are handled inside extract-list. Other
table-style output is handled through TableIO. The same output.format_name
configuration member is used for both groups.
Text-based output formats can have a character encoding. This is configured
with output.character_encoding. Unless you know that you need another
encoding, leave this as it is in the generated example configuration.
Comma separated values files may differ slightly depending on the programs
used to read and write them and the locale used. CSV settings are configured in
the optional nested output.csv object.
TableIO may provide several implementations for a file format.
output.implementation can force a specific implementation. If it is omitted,
TableIO chooses the best available implementation.
The outfile_border and outfile_filtered_area values can be "NO",
"IF_AVAILABLE", or "NEEDED". "NO" disables the feature request.
"IF_AVAILABLE" requests the feature when the selected output implementation
can provide it. "NEEDED" requires support for the feature.
Data to extract
The input file is likely to contain more data than should appear in the
output. The data to extract is specified with main_line and linked_lines.
The main_line setting describes the main input records that become output
rows. Linked lines describe additional records that are tied to the main line.
A linked line is tied to a main line when a configured value in the linked line
has the same value as a configured value in the main line.
The line member of main_line and linked_lines is a list of strings. This
list is the path of keys to the input records. Directly below that path there
is either a list or a dictionary of records.
The records of the main_line are indexed in the input, either by integer
position in a list or by key in a dictionary. If you want this index or key in
the output, set include_key to true. If not, set it to false. The output
column name for this key is configured with column_name_for_key.
The columns member of main_line and linked_lines maps output column names
to paths inside each input record. Each path is written as a list of strings.
Sometimes one input record contains several nested records. For example, a
purchase order may contain several purchased items. Because the output is a
list of columns, such an input record must be expanded into several output
rows. The shared values from the input record are duplicated on those output
rows. The expand_at member contains the relative paths where this expansion
should happen.
There can be only one main_line. There may be any number of linked lines.
The linked lines are configured as an array in linked_lines.
For each linked line, linked_column and linked_main_column define how it is
tied to the main line. Both values are relative paths in the input records.
The value at linked_column in the linked-line record must equal the value at
linked_main_column in the main-line record.
The values used to link records may also be extracted as output columns, but they do not have to be.
Several linked lines can be tied to the same main line.
one_output_line_per_main_line controls how this is handled. If it is false,
the main-line data is duplicated so the output has one row for each tied
combination of main-line and linked-line records. If it is true, multiple
linked lines for one main line are reported as an error.
Input records may be missing optional values. missing_input_for_column
controls how missing input data is handled. Its values are "EMPTY" and
"ERROR". If it is "EMPTY", the output column resulting from missing input
data is left empty.
XML attributes
XML member values may be written either as nested objects or as attributes. If
the input has XML attributes, the attribute key has an @ prefix. To handle
this, in_xml_strip_at can be set to true or false. If it is true, a
leading @ character is stripped from each key.
The out_xml_attributes setting specifies column names that should be written
as XML attributes in XML output instead of as nested objects.
Output column order
The order of the columns in the output is specified with column_order. The
value is a list of strings.
Specifying a column in column_order that has not been extracted is an error.
Extracting a column without specifying it in column_order is also an error.
Output line order
The order_rows_by setting specifies which columns should be used for sorting
the output rows. The most significant column should be first in the list.
The default row order is based on the list of columns in column_order. Leave
order_rows_by as an empty list unless you need another specific order.
Source code
Source code and tests are available at https://bitbucket.org/tom-bjorkholm/extract-list.
Test summary
- Test result: 1247 passed in 36s
- No flake8 warnings.
- No mypy errors found.
- No python layout warnings.
- Built version(s): 0.3
- Build and test using Python 3.14.5
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file extract_list-0.3.tar.gz.
File metadata
- Download URL: extract_list-0.3.tar.gz
- Upload date:
- Size: 32.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
733f70ed4d516c2a746757d94b69aab4e00dab95fbf65ffea5b77f76a034b10b
|
|
| MD5 |
598cd827fd776a9b61fb24eb1ac8bf99
|
|
| BLAKE2b-256 |
0e318d855b1e4399b7bf9299c8ed03995c0e0bb33c5c3580fc711c6f0fa08170
|
File details
Details for the file extract_list-0.3-py3-none-any.whl.
File metadata
- Download URL: extract_list-0.3-py3-none-any.whl
- Upload date:
- Size: 40.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.14.5
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
c71ac0dc2d96c250afb5b4a0d8a16d119b80cb50152a83e27ae77cc288c0521c
|
|
| MD5 |
04e553b6f2dcbff87d42c27f7c9446ef
|
|
| BLAKE2b-256 |
f1a2ac4b03519c864c1532fa700d01cb7e51f027fa2faf34a922bf23abb1850b
|