Convert text format RFCs and Internet-Drafts to .xml format
Project description
Internet-Draft text to XML Conversion Tool
This tool, ‘id2xml’, is intended for use by the RFC-Editor staff, in order to produce a first xml2rfc-compatible XML version from text-only Internet-Draft submissions.
id2xml may also be useful for Internet-Draft authors who wish to start working on a new version of an older draft or RFC, for which no xml2rfc-compatible XML source is available.
Releases in the 0.9.x series are preview releases with a number of known deficiencies. They are published so that potential users can provide feedback on the most desired improvements ahead of a 1.0.0 release.
Version 0.9.0 can process the drafts specified in the development Statement of Work to XML files acceptable to xml2rfc, and can also process a number of other test files to acceptable XML. However, adding new drafts to the test set still reveals weaknesses in many cases, so trouble-free processing of any arbitrary draft should not be expected from the 0.9.x series.
The XML produced follows RFC 7749 [1] in versions 0.9.x and 1.x of the tool, and will follow RFC 7991 [2] in version 2.x, which will be released once support is available to process XML sources that follow the RFC 7991 vocabulary.
Changelog
Version 1.0.0 (30 May 2017)
Across the corpus of test documents, the text generated from id2xml’s XML output now differs from the original input file in just over 2% of lines, and in some cases the generated text is an improvement over the original. The tool should now be functionally complete for vocabulary v2 output, so this seems like a good time for a 1.0.0 release.
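As a rough illustration of how such a figure could be measured, the sketch below computes the percentage of input lines that do not reappear unchanged in the regenerated text. It uses Python's standard difflib and is only an assumption about the method; the function name percent_lines_changed is hypothetical and is not taken from id2xml or its test suite.

```python
import difflib


def percent_lines_changed(original_lines, regenerated_lines):
    """Return the percentage of original lines with no unchanged counterpart."""
    if not original_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(
        a=original_lines, b=regenerated_lines, autojunk=False
    )
    # Sum the sizes of all matching blocks; the final dummy block has size 0.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * (len(original_lines) - unchanged) / len(original_lines)
```

Here the original draft would be read into original_lines and the text rendered by xml2rfc from the generated XML into regenerated_lines, for example with str.splitlines().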
Changes since 1.0.0rc3:
Split the functionality up into separate run.py, parser.py and utils.py files, and adjusted Makefile and MANIFEST accordingly.
Entries in the <references/> sections are now entity references for drafts and RFCs, instead of inserting the reference xml as generated from the input document.
There’s a slight refactoring of how the reference_anchors and section_anchors lists are generated.
Added xref elements for Section N.nn strings which reference document sections.
There have been multiple rounds of refactoring, to clean up and organise the code better.
The generated xml has also been cleaned up, to avoid long lines and tags bunched up on the same line. It’s still not super pretty, but should be readable.
Added a check on coupled debug trace switches, where setting a trace start option also requires that a trace stop option be set.
The regular expression which identifies code has been further refined.
Refined the header stripping to not join paragraphs where the first part has a short line.
Added more cases where list hangIndent is derived and set.
Added modification of the text-list-symbols PI in order to better match the source. Since this is a global setting, it can’t handle inconsistent bullet styles in a document (for instance created with hangText="*" …).
Improved the error message for missing stream information when attempting to process older RFCs.
Fixed a bug in the handling of the xml tree for xrefs found in text interspersed with vspace elements.
Code optimisations.
Added the last two changelog sections to the release information shown on PyPI.
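To illustrate the xref change listed above, here is a minimal sketch of turning "Section N.nn" strings into xref elements. It is not id2xml's actual code: the function name, the regular expression, the shape of the section_anchors mapping (section number to anchor) and the anchor naming are all assumptions made for the example.

```python
import re

# Matches "Section 3", "Section 3.2", "Section 10.4.1"; a trailing
# sentence-ending period is not consumed.
SECTION_REF = re.compile(r"\bSection (\d+(?:\.\d+)*)")


def link_section_refs(text, section_anchors):
    """Replace references to known sections with empty xref elements.

    section_anchors is a hypothetical mapping from section number
    (e.g. "3.2") to the anchor of that section in the generated XML.
    """
    def replace(match):
        anchor = section_anchors.get(match.group(1))
        if anchor is None:
            # Unknown section number: leave the text untouched.
            return match.group(0)
        return '<xref target="%s"/>' % anchor

    return SECTION_REF.sub(replace, text)
```

A real implementation would also need to skip strings such as "Section 3 of RFC 2119", which point into other documents; the sketch does not attempt that.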
Version 1.0.0rc3 (26 May 2017)
This release further reduces the diff between the text input file and the text file rendered from the generated xml. On average, fewer than 3% of the lines in the input are now rendered differently in the output.
From the changelog:
Committed updated (smaller) diff files for test baseline.
Added more alternatives to the code recognition regex, for xml tags and C statements.
Refined the header/footer stripping a bit, to not join text broken across pages into one paragraph when there are too many intervening blank lines, or when the last line is a table or figure label.
Added handling of blank lines in list items, by inserting <vspace> as needed.
Added insertion of subcompact PIs for compact lists. Fixed some warning message issues.
Added another comment delimiter to the code regex, and applied it to whole text blocks, not only to their first line.
Moved list block normalisation functions into the DraftParser class, and added recognition of compact lists. Also some refactoring.
Added more descriptive manpage text, and tweaked the making of the manpage.
Added switches for trace start and stop on line number, and renamed the trace-related switches.
Refined guess_list_style().
Added code to recognise ‘centered’ titles when they span the whole line.
Rewrote the code which parses the top left column of the titlepage to not assume any ordering of the lines, but permit them to occur in almost any order. The only exception is that if there’s a working group string, it must occur first, as it has no recognizable keyword to identify it.
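The order-independent parsing of the top left column described in the last entry can be sketched roughly as follows. This is an illustration under stated assumptions, not the DraftParser code: the keyword table, the field names and the function name are hypothetical, and only a few common title-page keywords are covered.

```python
import re

# Hypothetical keyword table; the patterns used by the real tool may differ.
KEYWORD_FIELDS = {
    "doctype": re.compile(r"^(Internet[- ]Draft)$", re.IGNORECASE),
    "obsoletes": re.compile(r"^Obsoletes:\s*(.+)$", re.IGNORECASE),
    "updates": re.compile(r"^Updates:\s*(.+)$", re.IGNORECASE),
    "category": re.compile(r"^Intended status:\s*(.+)$", re.IGNORECASE),
    "expires": re.compile(r"^Expires:\s*(.+)$", re.IGNORECASE),
}


def parse_left_column(lines):
    """Parse title-page left-column lines, accepting them in any order."""
    fields = {}
    first = True
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        for name, pattern in KEYWORD_FIELDS.items():
            match = pattern.match(line)
            if match:
                fields[name] = match.group(1).strip()
                break
        else:
            # No keyword matched. Only the first line may be keyword-free,
            # in which case it is taken to be the working group string.
            if not first:
                raise ValueError("unrecognised title-page line: %r" % line)
            fields["workgroup"] = line
        first = False
    return fields
```

Keeping the keyword patterns in a table means a line is identified by what it says rather than by where it appears, which is why only the keyword-free working group string still depends on position.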
Download files
File details
Details for the file id2xml-1.0.0.tar.gz.
File metadata
- Download URL: id2xml-1.0.0.tar.gz
- Upload date:
- Size: 52.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7b2cf6173c7e7f755f6009743bfdc826da0e1eda695ad4731d698d1162dbd025
MD5 | 88d69311da61aeefe5306d313b2bf837
BLAKE2b-256 | fc7d4aff10002b6db466544721f8c3e5419ffd484ee21d572698f7b99f2f9ff2
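A downloaded file can be checked against the SHA256 digest listed above. The following is a minimal sketch using Python's standard hashlib; the helper name sha256_of_file is hypothetical, and the local path is assumed to be wherever the file was saved.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published for id2xml-1.0.0.tar.gz in the table above.
EXPECTED_SHA256 = "7b2cf6173c7e7f755f6009743bfdc826da0e1eda695ad4731d698d1162dbd025"
```

Comparing sha256_of_file("id2xml-1.0.0.tar.gz") with EXPECTED_SHA256 should then yield True for an intact download.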
File details
Details for the file id2xml-1.0.0-py2.7.egg.
File metadata
- Download URL: id2xml-1.0.0-py2.7.egg
- Upload date:
- Size: 93.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest
---|---
SHA256 | e9b6189dc3b325ef7dad2a510e07ba99ff6d3a6f74029730766ece079b8ae109
MD5 | af120d396448c8c9a962ad5405eaaef5
BLAKE2b-256 | 50a6fab6a064df509d98bb9bc1e032a6055662eeebf04f436d851469c4766901