extract plain text and minimal metadata from ALTO xml files
Extract plain text from newspapers (alto2txt 0.3.1)
Converts XML (in METS 1.8/ALTO 1.4, METS 1.3/ALTO 1.4, BLN or UKP format) publications to plaintext articles and generates minimal metadata.
Full documentation and demo instructions.
Installation
Installation using an Anaconda environment
We recommend installation via Anaconda:
- Refer to the Anaconda website and follow the instructions.
- Create a new environment for alto2txt:
conda create -n py37alto python=3.7
- Activate the environment:
conda activate py37alto
- Install alto2txt itself using pip:
pip install alto2txt
(For now it is still necessary to install using pip. In due course we plan to make alto2txt available through a conda channel, meaning that it can be installed directly using conda commands.)
Installation using pip, outside an Anaconda environment
Note: the use of `alto2txt` outside a conda environment has not been tested as extensively as within one. We believe it should work, but please use it with caution.
pip install alto2txt
Installation of a test release
If you need (or want) to install a test release of alto2txt, you will likely be advised of the specific version number to install. This example command installs v0.3.1-alpha.20:
pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
Usage
Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one XML metadata file.
extract_publications_text.py [-h] [-d [DOWNSAMPLE]]
[-p [PROCESS_TYPE]]
[-l [LOG_FILE]]
[-n [NUM_CORES]]
xml_in_dir txt_out_dir
Converts XML publications to plaintext articles
positional arguments:
xml_in_dir Input directory with XML publications
txt_out_dir Output directory for plaintext articles
optional arguments:
-h, --help show this help message and exit
-d [DOWNSAMPLE], --downsample [DOWNSAMPLE]
Downsample. Default 1
-l [LOG_FILE], --log-file [LOG_FILE]
Log file. Default out.log
-p [PROCESS_TYPE], --process-type [PROCESS_TYPE]
Process type.
One of: single,serial,multi,spark
Default: multi
-n [NUM_CORES], --num-cores [NUM_CORES]
Number of cores (Spark only). Default 1
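The help text above can be mirrored with a small `argparse` parser. This is a sketch reconstructed only from the documented options and defaults, not the tool's actual source; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented command-line interface (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="extract_publications_text.py",
        description="Converts XML publications to plaintext articles",
    )
    parser.add_argument("xml_in_dir", help="Input directory with XML publications")
    parser.add_argument("txt_out_dir", help="Output directory for plaintext articles")
    # nargs="?" matches the "[-d [DOWNSAMPLE]]" form shown in the usage line.
    parser.add_argument("-d", "--downsample", type=int, nargs="?", default=1,
                        help="Downsample. Default 1")
    parser.add_argument("-p", "--process-type", nargs="?", default="multi",
                        choices=["single", "serial", "multi", "spark"],
                        help="Process type. Default: multi")
    parser.add_argument("-l", "--log-file", nargs="?", default="out.log",
                        help="Log file. Default out.log")
    parser.add_argument("-n", "--num-cores", type=int, nargs="?", default=1,
                        help="Number of cores (Spark only). Default 1")
    return parser


# Parse an explicit argument list rather than sys.argv so the sketch is testable.
defaults = build_parser().parse_args(["xml_in", "txt_out"])
custom = build_parser().parse_args(["-p", "single", "-d", "100", "xml_in", "txt_out"])
```

Here `defaults` carries the documented default values, while `custom` corresponds to processing every 100th issue of a single publication.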
xml_in_dir is expected to hold XML for multiple publications, in the following structure:
xml_in_dir
|-- publication
| |-- year
| | |-- issue
| | | |-- xml_content
| |-- year
|-- publication
However, if -p|--process-type single is provided, then xml_in_dir is expected to hold XML for a single publication, in the following structure:
xml_in_dir
|-- year
| |-- issue
| | |-- xml_content
|-- year
txt_out_dir is created with a structure analogous to xml_in_dir.
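The directory layouts above can be illustrated with `pathlib`. This is a minimal sketch, assuming issue directories sit exactly three levels below xml_in_dir (two levels in the single-publication case); the helper names `find_issue_dirs` and `output_dir_for` are hypothetical and are not part of alto2txt.

```python
from pathlib import Path


def find_issue_dirs(xml_in_dir: Path, single: bool = False) -> list:
    """List issue directories under xml_in_dir.

    Multi-publication layout: publication/year/issue (three levels).
    Single-publication layout: year/issue (two levels).
    """
    pattern = "*/*" if single else "*/*/*"
    return sorted(p for p in xml_in_dir.glob(pattern) if p.is_dir())


def output_dir_for(issue_dir: Path, xml_in_dir: Path, txt_out_dir: Path) -> Path:
    """Map an issue directory to the analogous path under txt_out_dir."""
    return txt_out_dir / issue_dir.relative_to(xml_in_dir)
```

For example, an issue directory `xml_in_dir/0000151/<year>/<issue>` would map to `txt_out_dir/0000151/<year>/<issue>` under this assumption.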
PROCESS_TYPE can be one of:
- single: Process single publication.
- serial: Process publications serially.
- multi: Process publications using multiprocessing (default).
- spark: Process publications using Spark.
DOWNSAMPLE must be a positive integer, default 1.
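Downsampling to every Nth issue can be sketched as a list slice. This example assumes that the first issue is kept and then every Nth one after it; the source does not say which issue in each group of N is selected, and the function name `downsample` is hypothetical.

```python
def downsample(issues: list, n: int = 1) -> list:
    """Keep every n-th issue, starting with the first (an assumption)."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("DOWNSAMPLE must be a positive integer")
    return issues[::n]


# Ten placeholder issue names, downsampled with N = 3.
sampled = downsample([f"issue{i}" for i in range(10)], 3)
```

With the default of 1, every issue is kept.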
The following XSLT files need to be in an extract_text.xslts module:
- extract_text_mets18.xslt: METS 1.8 XSL file.
- extract_text_mets13.xslt: METS 1.3 XSL file.
- extract_text_bln.xslt: BLN XSL file.
- extract_text_ukp.xslt: UKP XSL file.
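The correspondence between input formats and stylesheets can be written as a lookup table. The file names come from the list above; the format keys (`mets18`, `mets13`, `bln`, `ukp`) and the function `xslt_for` are hypothetical labels chosen for this sketch and may not match how alto2txt identifies formats internally.

```python
# File names are taken from the list above; the dictionary keys are assumptions.
XSLT_FILES = {
    "mets18": "extract_text_mets18.xslt",  # METS 1.8
    "mets13": "extract_text_mets13.xslt",  # METS 1.3
    "bln": "extract_text_bln.xslt",        # BLN
    "ukp": "extract_text_ukp.xslt",        # UKP
}


def xslt_for(fmt: str) -> str:
    """Return the stylesheet file name for a format label."""
    try:
        return XSLT_FILES[fmt]
    except KeyError:
        raise ValueError(f"Unsupported format: {fmt!r}") from None
```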
Process publications
Assume ~/BNA exists and matches the structure above.
Extract text from every publication:
./extract_publications_text.py ~/BNA txt
Extract text from every 100th issue of every publication:
./extract_publications_text.py ~/BNA txt -d 100
Process a single publication
Extract text from every issue of a single publication:
./extract_publications_text.py -p single ~/BNA/0000151 txt
Extract text from every 100th issue of a single publication:
./extract_publications_text.py -p single ~/BNA/0000151 txt -d 100
Configure logging
By default, logs are put in out.log.
To specify an alternative location for logs, use the -l flag, e.g.:
./extract_publications_text.py -l mylog.txt ~/BNA txt -d 100 2> err.log
Process publications via Spark
Information on running on Spark.
Future work
For a complete list of future plans see the GitHub issues list. Some highlights include:
- Export more metadata from alto, probably by parsing mets first.
- Check and ensure that articles that span multiple pages are pulled into a single article file.
- Smarter handling of articles spanning multiple pages.
Copyright
Software
Copyright 2022 The Alan Turing Institute, British Library Board, Queen Mary University of London, University of Exeter, University of East Anglia and University of Cambridge.
See LICENSE for more details.
Example Datasets
This repo contains example datasets, which have been taken from the British Library Research Repository (DOI link).
This data is "CC0 1.0 Universal Public Domain" - No Copyright - Other Known Legal Restrictions
- There is a subset of the example data in the demo-files directory.
- There are adapted copies of the data in the tests/tests/test_files directory. These have been edited to test errors and edge cases.
Funding and Acknowledgements
This software has been developed as part of the Living with Machines project.
This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.