Repository for parsing data from files and sites.

Project description

Pulling

Pulling is an open-source Python repository for parsing data from files and web pages. English documentation is available at https://github.com/ItYaS/pulling/wiki. The repository currently supports the .txt, .rtf, .pdf, .docx, .csv, .avro, and .json formats, and parses data from the p, h, a, img, and span tags of web pages.
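
As a rough illustration of what per-format parsing involves, the sketch below picks a reader based on the file extension, using only the Python standard library. It is not Pulling's own API (see the wiki for that): the names `read_file` and `READERS` are hypothetical, and only .txt, .csv, and .json are covered here, since the other formats would need third-party libraries.

```python
import csv
import json
from pathlib import Path


def read_txt(path):
    # Plain text: return the whole file as one string.
    return path.read_text(encoding="utf-8")


def read_csv(path):
    # CSV: return a list of rows, each row a list of strings.
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def read_json(path):
    # JSON: return the decoded Python object.
    with path.open(encoding="utf-8") as f:
        return json.load(f)


# Hypothetical mapping from extension to reader function.
READERS = {".txt": read_txt, ".csv": read_csv, ".json": read_json}


def read_file(filename):
    # Choose a reader by (lowercased) file extension.
    path = Path(filename)
    reader = READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"unsupported extension: {path.suffix!r}")
    return reader(path)
```

Adding a format in this design means writing one more reader function and registering it in the mapping, so the calling code stays the same.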

Future

A repository like this can be extended indefinitely, and that is what I plan to do. However, the next version (which will add parsing for other formats) will not be released soon, because in 2020 and 2021 I am preparing for exams and admission to the Institute. So keep this repository and be patient.

In the future, I want to parse .orc, .rcf, .parquet, and .feather (and one day .doc and .odt), add conversion to other extensions for all formats, and add new functions and new formats.

Creation idea

The idea for this repository came to me, one might say, by accident. I was writing my own site, which would check for matches between a link and uploaded files. In the end I had a bug that I never fixed, and because of it I never published the site. But the code I wrote to check files and links cost me a lot of effort: at that time there were no materials on the Internet covering all the extensions, so I spent a long time searching for help and had to write everything myself. It would be a shame not to post it.

Besides, I don’t think anybody else would want to spend so much time working out how to extract text from these extensions. For now, there are no such libraries that really work on Windows, for example. Just none!

Version 1.1.1

  • First, several things were added:
    1. Link parsing from audio and video tags

    2. Improved data parsing from the img tag

    3. Parsing text from the b, big, small, strong, i, sub, sup, span, ins, del, and th tags

  • Second, the parsing function now returns two dictionaries: the first with data from tags containing text, and the second with links.

  • Code readability was also improved, and all unnecessary loops were removed.
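
The two-dictionary return shape described above can be sketched with the standard library's `html.parser`. This is only an assumption about the general idea, not Pulling's actual implementation: the names `TagCollector` and `parse_html`, the choice of keys (tag names), and the use of the `src`/`href` attributes are all hypothetical.

```python
from html.parser import HTMLParser

# Tags whose text is collected (taken from the lists in this description).
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "a", "span", "b", "big",
             "small", "strong", "i", "sub", "sup", "ins", "del", "th"}
# Tags that carry a link, and the attribute assumed to hold it.
LINK_ATTRS = {"a": "href", "img": "src", "audio": "src", "video": "src"}


class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = {}    # tag name -> list of text fragments
        self.links = {}   # tag name -> list of link values
        self._open = []   # stack of currently open text tags

    def handle_starttag(self, tag, attrs):
        attr = LINK_ATTRS.get(tag)
        if attr:
            value = dict(attrs).get(attr)
            if value:
                self.links.setdefault(tag, []).append(value)
        if tag in TEXT_TAGS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching tag, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attribute text to the innermost open text tag.
        data = data.strip()
        if data and self._open:
            self.text.setdefault(self._open[-1], []).append(data)


def parse_html(html):
    # Return (text_by_tag, links_by_tag) for an HTML string.
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.text, collector.links
```

Returning text and links separately lets a caller work with either one without filtering the other out.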

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pulling-1.2.tar.gz (7.9 kB)

Uploaded Source