A parsing package for dblp using the Simple API for XML (SAX)

DBLP SAX Parser

What is it?

A parsing package using the Simple API for XML (SAX).

The dblp data has a total of 10 elements: "article", "inproceedings", "proceedings", "book", "incollection", "phdthesis", "mastersthesis", "www", "person", "data".

Across the elements, the following feature types are available: "address", "author", "booktitle", "cdrom", "chapter", "cite", "crossref", "editor", "ee", "isbn", "journal", "month", "note", "number", "pages", "publisher", "publnr", "school", "series", "title", "url", "volume", "year".
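To make the distinction between elements and features concrete, the two lists above can be written as Python sets. This is only an illustrative sketch: the names ELEMENTS, FEATURES and classify_tag are hypothetical and are not part of the package's API.

```python
# Hypothetical constants mirroring the two lists above; not the package's own names.
ELEMENTS = {
    "article", "inproceedings", "proceedings", "book", "incollection",
    "phdthesis", "mastersthesis", "www", "person", "data",
}

FEATURES = {
    "address", "author", "booktitle", "cdrom", "chapter", "cite",
    "crossref", "editor", "ee", "isbn", "journal", "month", "note",
    "number", "pages", "publisher", "publnr", "school", "series",
    "title", "url", "volume", "year",
}


def classify_tag(tag):
    """Return 'element', 'feature', or 'other' for an XML tag name."""
    if tag in ELEMENTS:
        return "element"
    if tag in FEATURES:
        return "feature"
    return "other"


print(classify_tag("article"))  # element
print(classify_tag("author"))   # feature
print(classify_tag("dblp"))     # other
```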

Features

  • download dblp files from the dblp website directly
  • parse the dblp XML file into a dataframe, which can be exported in either CSV or pickle format.

Future features for consideration

  • add more methods to parse data by a specific attribute, e.g. only records from the year 2016
  • select which elements or features to be included/excluded
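Until such filters exist in the package, a similar selection can be done on the exported CSV. The following is a minimal sketch using only the standard library. It assumes the export has a header row with "element" and "year" columns; both column names are assumptions, so check the header of your own export. The function filter_rows is hypothetical and not part of the package.

```python
import csv
import io


def filter_rows(csv_text, year=None, elements=None):
    """Yield rows (as dicts) that match the given year and/or element names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if year is not None and row.get("year") != str(year):
            continue
        if elements is not None and row.get("element") not in elements:
            continue
        yield row


# Tiny made-up sample standing in for an exported file.
sample = (
    "element,title,year\n"
    "article,Paper A,2016\n"
    "book,Book B,2015\n"
    "inproceedings,Paper C,2016\n"
)

rows_2016 = list(filter_rows(sample, year=2016))
articles_2016 = list(filter_rows(sample, year=2016, elements={"article"}))
print([r["title"] for r in rows_2016])
print([r["title"] for r in articles_2016])
```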

Context and Purpose

I created this package while working on a project for a course module. The aim of this package is to provide a quick way to parse DBLP elements directly, with the contents exported as a CSV file for further preprocessing based on each user's use case.

Installation

pip install dblp-sax-parser

# import package
from dblp_parser import DBLP_Parser as dp

Usage

The first step in using this parser is to instantiate DBLP_Parser.

# Instantiate the dblp class 
dblp = dp()

You can also use DBLP_Parser to download the dblp data assets from the dblp website.

# download latest data sets from dblp website
dblp.download_latest_dump()

Parsing the xml file

filename = 'dblp.xml'

# execute the parser from the dblp class
parser, handler = dblp.execute_parser(filename=filename)

# you can use the handler to convert the handler output to dataframe
handler.to_df()

# the dataframe can be persisted as a pickle file or exported as csv file
handler.to_csv() # export to csv
handler.save() # persist as pickle

DBLP Methods

class DBLP_Parser

  • This is the main class to be instantiated before using the parser.

method DBLP_Parser.download_latest_dump

  • Downloads the latest dblp files from the dblp website. If the URL where the files are hosted has changed or is incorrect, a different URL can be supplied instead.
  • Downloads the dblp .dtd and .xml.gz files, and decompresses the .gz file into .xml.
  • dtd_url [str]: URL of the .dtd file to download.
  • xml_zip_url [str]: URL of the .xml.gz file to download.
  • xml_zip_filename [str]: filename for the downloaded .xml.gz file.
  • xml_filename [str]: filename for the decompressed .xml file.
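The download-and-decompress steps described above can be sketched with the standard library alone. This is not the package's implementation: the helper names download_file, decompress_gz and fetch_dump are hypothetical, and the dtd_filename parameter is an assumption added for illustration. The other parameter names mirror the ones documented above, and the URLs are left as required arguments rather than guessed.

```python
import gzip
import shutil
import urllib.request


def download_file(url, dest):
    """Download url and write it to the local path dest."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def decompress_gz(gz_path, xml_path):
    """Decompress a .gz file into xml_path, streaming to limit memory use."""
    with gzip.open(gz_path, "rb") as src, open(xml_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def fetch_dump(
    dtd_url,
    xml_zip_url,
    xml_zip_filename="dblp.xml.gz",
    xml_filename="dblp.xml",
    dtd_filename="dblp.dtd",  # assumption: not a documented parameter
):
    """Download the .dtd and .xml.gz files, then decompress the archive."""
    download_file(dtd_url, dtd_filename)
    download_file(xml_zip_url, xml_zip_filename)
    decompress_gz(xml_zip_filename, xml_filename)
    return xml_filename
```

fetch_dump is deliberately not called here, since the dblp dump is a large download. Streaming with shutil.copyfileobj avoids loading the whole file into memory.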

method DBLP_Parser.execute_parser

  • Runs the underlying SAX parser with an xml.sax.handler.ContentHandler.
  • filename [str]: path and name of the XML file to be parsed. If download_latest_dump() was used, the file to be parsed will be "dblp.xml".
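For readers unfamiliar with SAX, the sketch below shows the general pattern of a ContentHandler that collects one record per dblp element. It is an illustration of the technique under stated assumptions, not the handler shipped in this package: the class name RecordHandler, its attributes, and the sample XML are all hypothetical. The sample avoids character entities, because the real dump relies on entities declared in the .dtd file and entity handling is left out here.

```python
import xml.sax
import xml.sax.handler

# The ten elements listed in this README.
ELEMENTS = {
    "article", "inproceedings", "proceedings", "book", "incollection",
    "phdthesis", "mastersthesis", "www", "person", "data",
}


class RecordHandler(xml.sax.handler.ContentHandler):
    """Collect each record as a dict mapping feature name -> list of values."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None   # record currently being built
        self._feature = None  # feature tag currently open
        self._text = []       # text chunks of the open feature

    def startElement(self, name, attrs):
        if self._record is None:
            if name in ELEMENTS:
                self._record = {"element": [name]}
        elif self._feature is None:
            self._feature = name
            self._text = []

    def characters(self, content):
        # SAX may deliver a single text node in several chunks.
        if self._feature is not None:
            self._text.append(content)

    def endElement(self, name):
        if self._record is None:
            return
        if name == self._feature:
            value = "".join(self._text).strip()
            self._record.setdefault(name, []).append(value)
            self._feature = None
        elif self._feature is None and name == self._record["element"][0]:
            self.records.append(self._record)
            self._record = None


# Made-up sample record in the general shape of a dblp entry.
sample = b"""<dblp>
<article key="journals/x/Doe16">
<author>Jane Doe</author>
<author>John Roe</author>
<title>An Example Title</title>
<year>2016</year>
</article>
</dblp>"""

handler = RecordHandler()
xml.sax.parseString(sample, handler)
print(handler.records)
```

Values are stored as lists because features such as "author" can repeat within one record. SAX processes the file as a stream of events, which suits very large XML files since the whole document is never held in memory at once.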

License

This code is published under the MIT licence.

References

Two main references contributed to this package. The outer dblp class that downloads dblp materials directly came from angelosalatino. Some components of the SAX parsing logic were borrowed from hibernator11.

This version

0.1
