Extract amendments from European Parliament docx files
EuroParl-Amendment-Extract
This is a simple script that converts European Parliament (EP) amendments to JSON. It follows the format defined in the dataset at https://zenodo.org/record/4709248#.YXesJS8itqs, which accompanies the War of Words article: https://github.com/indy-lab/war-of-words
Prerequisites
Build the MEPs dataset with the script:
python3 meps.py
Debugger
To run the debugger, run 'streamlit run diff_visualizer.py'. It runs the am labeler on the ep8 dataset and visualizes the first error it comes across.
Sequence matcher update
diff.py now contains both extract_opcodes and extract_opcodes_v2. The v2 variant merges consecutive 'replace' operations, as well as a 'delete' operation followed by a 'replace'. An accuracy metric is now also displayed at the end of a run, giving a rough picture of the am labeler's performance. The accuracy is calculated by penalizing the algorithm for each edit it gets wrong, in proportion to the length of the text attributed to that edit. With this metric we observed a high accuracy (99+%) on the ep8 dataset.
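The merging rule and the length-weighted accuracy described above can be sketched as follows. This is a minimal illustration, not the code in diff.py: the names merge_opcodes and weighted_accuracy are hypothetical, the opcode tuples are assumed to have the (tag, i1, i2, j1, j2) shape used by Python's difflib.SequenceMatcher.get_opcodes(), and weighting each wrong edit by its text length is only one possible reading of the metric.

```python
def merge_opcodes(opcodes):
    """Merge a 'replace' into a directly preceding 'replace' or 'delete'.

    Each opcode is assumed to be a (tag, i1, i2, j1, j2) tuple, the shape
    used by difflib.SequenceMatcher.get_opcodes(). Neighbouring opcodes are
    assumed to be contiguous (one ends where the next begins).
    """
    merged = []
    for tag, i1, i2, j1, j2 in opcodes:
        if merged and tag == "replace" and merged[-1][0] in ("replace", "delete"):
            # Keep the start of the previous edit, take the end of this one.
            _, prev_i1, _, prev_j1, _ = merged[-1]
            merged[-1] = ("replace", prev_i1, i2, prev_j1, j2)
        else:
            merged.append((tag, i1, i2, j1, j2))
    return merged


def weighted_accuracy(edits):
    """Length-weighted accuracy over (text_length, is_correct) pairs.

    Assumption: a wrong edit is penalized by the length of the text
    attributed to it, relative to the total length of all edits.
    """
    total = sum(length for length, _ in edits)
    if total == 0:
        return 1.0  # nothing to get wrong
    wrong = sum(length for length, is_correct in edits if not is_correct)
    return 1.0 - wrong / total


# Hand-written opcode list for illustration: a delete followed by two replaces.
example_opcodes = [
    ("equal", 0, 2, 0, 2),
    ("delete", 2, 3, 2, 2),
    ("replace", 3, 4, 2, 3),
    ("replace", 4, 5, 3, 5),
    ("equal", 5, 6, 5, 6),
]
merged_example = merge_opcodes(example_opcodes)
```

In this example the 'delete' and the two 'replace' operations collapse into a single 'replace' spanning positions 2 to 5 in both sequences, while the surrounding 'equal' operations are left untouched.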
Usage
from ep_amendment_extract import extract_amendments
extract_amendments('file.docx')
Hashes for europarl_amendment_extract-1.0.3.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 577855f78e006decc4114d75190a956c8c0586407c2ae721f5993a3897b9836c
MD5 | f693239a4fbf937d42288adfe23c7c2e
BLAKE2b-256 | f8522778034b6832edc1f78f4126d9b6e1f2992c36a9bf1adf225f03f1e06b2b
Hashes for europarl_amendment_extract-1.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | c9acea35d4b960140951e9158c5bc45006ec8cd14f8201c294c131e06584ee64
MD5 | 78bc27901ad0b8a7d2ac651868cb126a
BLAKE2b-256 | 809e52ef97730d91d943e7ae82f1b1111beebc463e61eeaade5f762597d91d8f