Extract plain text and minimal metadata from ALTO XML files

Project description

Extract plain text from newspapers (alto2txt 0.3.1)

Converts XML (in METS 1.8/ALTO 1.4, METS 1.3/ALTO 1.4, BLN or UKP format) publications to plaintext articles and generates minimal metadata.

Full documentation and demo instructions.

Installation

Installation using an Anaconda environment

We recommend installation via Anaconda:

Refer to the Anaconda website and follow the instructions.
Create a new environment for alto2txt:

conda create -n py37alto python=3.7

Activate the environment:

conda activate py37alto

Install alto2txt itself

Install alto2txt using pip:

pip install alto2txt

(For now it is still necessary to install using pip. In due course we plan to make alto2txt available through a conda channel, meaning that it can be installed directly using conda commands.)

Installation using pip, outside an Anaconda environment

Note: the use of alto2txt outside a conda environment has not been tested as extensively as within one. Whilst we believe that this should work, please use with caution.

pip install alto2txt

Installation of a test release

If you need (or want) to install a test release of alto2txt, you will likely be advised of the specific version number to install. This example command will install v0.3.1-alpha.20:

pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20

Usage

Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one XML metadata file.

extract_publications_text.py [-h] [-d [DOWNSAMPLE]]
                                    [-p [PROCESS_TYPE]]
                                    [-l [LOG_FILE]]
                                    [-n [NUM_CORES]]
                                    xml_in_dir txt_out_dir

Converts XML publications to plaintext articles

positional arguments:
  xml_in_dir            Input directory with XML publications
  txt_out_dir           Output directory for plaintext articles

optional arguments:
  -h, --help            show this help message and exit
  -d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
                        Downsample. Default 1
  -l [LOG_FILE], --log-file [LOG_FILE]
                        Log file. Default out.log
  -p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
                        Process type.
                        One of: single,serial,multi,spark
                        Default: multi
  -n [NUM_CORES], --num-cores [NUM_CORES]
                        Number of cores (Spark only). Default 1

xml_in_dir is expected to hold XML for multiple publications, in the following structure:

xml_in_dir
|-- publication
|   |-- year
|   |   |-- issue
|   |   |   |-- xml_content
|   |-- year
|-- publication

However, if -p|--process-type single is provided then xml_in_dir is expected to hold XML for a single publication, in the following structure:

xml_in_dir
|-- year
|   |-- issue
|   |   |-- xml_content
|-- year

txt_out_dir is created with an analogous structure to xml_in_dir.
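
To make the layout concrete, here is a minimal Python sketch of walking the multi-publication structure and deriving the analogous output location. It is an illustration only, not alto2txt's own code: the function names `iter_issues` and `mirrored_output_dir` are hypothetical, and the sketch assumes every publication, year and issue is a plain directory.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_issues(xml_in_dir: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (publication, year, issue) names for each issue directory.

    Hypothetical helper: assumes the publication/year/issue layout shown above.
    """
    for publication in sorted(p for p in xml_in_dir.iterdir() if p.is_dir()):
        for year in sorted(y for y in publication.iterdir() if y.is_dir()):
            for issue in sorted(i for i in year.iterdir() if i.is_dir()):
                yield publication.name, year.name, issue.name


def mirrored_output_dir(txt_out_dir: Path, publication: str, year: str, issue: str) -> Path:
    """Return the output directory that mirrors an input issue directory."""
    return txt_out_dir / publication / year / issue
```

For the single-publication layout, the outermost loop would be dropped, since xml_in_dir then holds the year directories directly.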

PROCESS_TYPE can be one of:

single: Process single publication.
serial: Process publications serially.
multi: Process publications using multiprocessing (default).
spark: Process publications using Spark.

DOWNSAMPLE must be a positive integer, default 1.
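
The options described above could be expressed with Python's argparse roughly as follows. This is a sketch reconstructed from the help text, not the parser that alto2txt actually ships: `positive_int` and `build_parser` are hypothetical names, and rejecting non-positive values for both DOWNSAMPLE and NUM_CORES is an assumption of the sketch.

```python
import argparse


def positive_int(value: str) -> int:
    """Parse a command-line value, rejecting anything below 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError(f"{value} is not a positive integer")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="extract_publications_text.py",
        description="Converts XML publications to plaintext articles",
    )
    parser.add_argument("xml_in_dir", help="Input directory with XML publications")
    parser.add_argument("txt_out_dir", help="Output directory for plaintext articles")
    parser.add_argument("-d", "--downsample", type=positive_int, default=1,
                        help="Downsample. Default 1")
    parser.add_argument("-p", "--process-type", default="multi",
                        choices=["single", "serial", "multi", "spark"],
                        help="Process type. Default: multi")
    parser.add_argument("-l", "--log-file", default="out.log",
                        help="Log file. Default out.log")
    parser.add_argument("-n", "--num-cores", type=positive_int, default=1,
                        help="Number of cores (Spark only). Default 1")
    return parser
```

With this sketch, `build_parser().parse_args(["-p", "single", "in", "out", "-d", "100"])` yields a process type of single and a downsample of 100, while `-d 0` is rejected with a usage error.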

The following XSLT files need to be in an extract_text.xslts module:

extract_text_mets18.xslt: METS 1.8 XSL file.
extract_text_mets13.xslt: METS 1.3 XSL file.
extract_text_bln.xslt: BLN XSL file.
extract_text_ukp.xslt: UKP XSL file.

Process publications

Assume ~/BNA exists and matches the structure above.

Extract text from every publication:

./extract_publications_text.py ~/BNA txt

Extract text from every 100th issue of every publication:

./extract_publications_text.py ~/BNA txt -d 100
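
As an illustration of what "every 100th issue" means, the selection can be pictured as a stride over the issues of a publication. The sketch below is not taken from alto2txt: the function name `downsample` is hypothetical, and starting from the first issue in the given order is an assumption.

```python
from typing import List, Sequence


def downsample(issues: Sequence[str], n: int = 1) -> List[str]:
    """Keep every nth issue, starting with the first (assumed behaviour)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return list(issues[::n])
```

Under that assumption, a publication with 250 issues and `n=100` would keep three of them: the 1st, 101st and 201st.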

Process a single publication

Extract text from every issue of a single publication:

./extract_publications_text.py -p single ~/BNA/0000151 txt

Extract text from every 100th issue of a single publication:

./extract_publications_text.py -p single ~/BNA/0000151 txt -d 100

Configure logging

By default, logs are put in out.log.

To specify an alternative location for logs, use the -l flag e.g.

./extract_publications_text.py -l mylog.txt ~/BNA txt -d 100 2> err.log
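
The same invocation can be driven from Python, which is handy when the run is part of a larger pipeline. The sketch below only uses the flags shown above. The helper names `build_command` and `run_with_stderr_file` are hypothetical, and it assumes the script is in the current directory, as in the examples.

```python
import os
import subprocess
from typing import List, Optional


def build_command(xml_in_dir: str, txt_out_dir: str, downsample: int = 1,
                  log_file: Optional[str] = None,
                  process_type: Optional[str] = None) -> List[str]:
    """Assemble the argument list for extract_publications_text.py."""
    command = ["./extract_publications_text.py"]
    if process_type is not None:
        command += ["-p", process_type]
    if log_file is not None:
        command += ["-l", log_file]
    # No shell is involved, so expand a leading ~ ourselves.
    command += [os.path.expanduser(xml_in_dir), txt_out_dir]
    if downsample != 1:
        command += ["-d", str(downsample)]
    return command


def run_with_stderr_file(command: List[str], err_path: str) -> int:
    """Run the command, writing stderr to err_path (like `2> err.log`)."""
    with open(err_path, "w") as err_file:
        return subprocess.run(command, stderr=err_file, check=False).returncode
```

For example, `run_with_stderr_file(build_command("~/BNA", "txt", downsample=100, log_file="mylog.txt"), "err.log")` corresponds to the shell command above.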

Process publications via Spark

Information on running on Spark.

Future work

For a complete list of future plans see the GitHub issues list. Some highlights include:

Export more metadata from ALTO, probably by parsing METS first.
Check and ensure that articles that span multiple pages are pulled into a single article file.
Smarter handling of articles spanning multiple pages.

Copyright

Software

See LICENSE for more details.

Example Datasets

This repo contains example datasets, which have been taken from the British Library Research Repository (DOI link).

This data is "CC0 1.0 Universal Public Domain" - No Copyright - Other Known Legal Restrictions

There is a subset of the example data in the demo-files directory.
There are adapted copies of the data in the tests/tests/test_files directory. These have been edited to test errors and edge cases.

Funding and Acknowledgements

This software has been developed as part of the Living with Machines project.

This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.

Project details

Release history

0.3.4 (this version): Jul 1, 2022
0.3.3: Jul 1, 2022
0.3.2: Jul 1, 2022

Download files

Source Distribution

alto2txt-0.3.4.tar.gz (17.0 kB)

Uploaded Jul 1, 2022 Source

Built Distribution

alto2txt-0.3.4-py3-none-any.whl (23.9 kB)

Uploaded Jul 1, 2022 Python 3

File details

Details for the file alto2txt-0.3.4.tar.gz.

File metadata

Download URL: alto2txt-0.3.4.tar.gz
Upload date: Jul 1, 2022
Size: 17.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for alto2txt-0.3.4.tar.gz
Algorithm	Hash digest
SHA256	`e96500f1703c51fd4aff8d03b6481261a7247f16d7b30a755806fafd53fba15c`
MD5	`447a642e8c0bc881e24e6580f7c42817`
BLAKE2b-256	`57838d87f48683e2169915e58847aafd42eacbd47110c1ef7c63474bf4a06d17`

File details

Details for the file alto2txt-0.3.4-py3-none-any.whl.

File metadata

Download URL: alto2txt-0.3.4-py3-none-any.whl
Upload date: Jul 1, 2022
Size: 23.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.9.13

File hashes

Hashes for alto2txt-0.3.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`17e7182226d71c05bb335ef9a38d3d52fe86f29863a2c00b84f12cce89b9bd27`
MD5	`be2cdf1e723f242fa3040354c323afec`
BLAKE2b-256	`cba93c3882d29edb47f64d70346b21ee5367f737d11bcd4d435ea91115b9ca45`
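
A downloaded file can be checked against the SHA256 digests listed above with Python's standard hashlib module. This is a minimal sketch: the names `EXPECTED_SHA256`, `sha256_of` and `matches_published_hash` are hypothetical, and the digests are copied from the file details in this document.

```python
import hashlib
from pathlib import Path

# SHA256 digests as published in the file details above.
EXPECTED_SHA256 = {
    "alto2txt-0.3.4.tar.gz":
        "e96500f1703c51fd4aff8d03b6481261a7247f16d7b30a755806fafd53fba15c",
    "alto2txt-0.3.4-py3-none-any.whl":
        "17e7182226d71c05bb335ef9a38d3d52fe86f29863a2c00b84f12cce89b9bd27",
}


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path) -> bool:
    """True only if the file name is known and its digest matches."""
    expected = EXPECTED_SHA256.get(path.name)
    return expected is not None and sha256_of(path) == expected
```

A file whose name is not in the table, or whose contents differ from the published release, makes `matches_published_hash` return False.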
