Repository for parsing data from files and sites.
Pulling
Pulling is an open-source Python repository for parsing data from files and web pages. English documentation is available at https://github.com/ItYaS/pulling/wiki. The repository currently supports the .txt, .rtf, .pdf, .docx, .csv, .avro, and .json formats, and can parse data from the p, h, a, img, and span tags of web pages.
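To give a rough idea of what parsing data from tags involves, here is a minimal sketch that uses only the Python standard library. It is not Pulling's own API (see the wiki for that): the names TagTextExtractor and extract are hypothetical, and reading "h" as the headings h1 to h6 is an assumption.

```python
from html.parser import HTMLParser

# Tags named in the description; "h" is read here as h1..h6 (an assumption).
TEXT_TAGS = {"p", "a", "span", "h1", "h2", "h3", "h4", "h5", "h6"}


class TagTextExtractor(HTMLParser):
    """Hypothetical helper: collect text of selected tags and the src of img tags."""

    def __init__(self):
        super().__init__()
        self.found = []     # (tag, value) pairs, in the order the tags close
        self._stack = []    # currently open tags of interest
        self._buffers = []  # one text buffer per open tag

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # img has no text content, so keep its src attribute instead
            src = dict(attrs).get("src")
            if src:
                self.found.append(("img", src))
        elif tag in TEXT_TAGS:
            self._stack.append(tag)
            self._buffers.append([])

    def handle_data(self, data):
        # Text belongs to every open tag of interest, so nested tags all receive it
        for buf in self._buffers:
            buf.append(data)

    def handle_endtag(self, tag):
        # Only well-nested closing tags are handled in this sketch
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()
            text = "".join(self._buffers.pop()).strip()
            if text:
                self.found.append((tag, text))


def extract(html):
    """Return (tag, value) pairs found in an HTML string."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling extract on the HTML of a downloaded page returns a list of pairs such as ("p", "some text") or ("img", "pic.png"). Real pages are often not well nested, so a production parser needs more care than this sketch takes.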
Future
A repository like this can be extended indefinitely, and that is what I plan to do. However, the next version (which will add parsing for other formats) will not be released soon, because in 2020 and 2021 I am preparing for exams and admission to the institute. So keep this repository and be patient.
In the future, I want to parse .orc, .rcf, .parquet, and .feather files (and one day .doc and .odt), add conversion to other extensions for all formats, and add new functions and new formats.
Creation idea
The idea for this repository came to me, one might say, by accident. I was writing my own site, which was meant to check for matches between a link and uploaded files. In the end I had a bug that I never fixed, and because of it I never published the site. But I suffered a lot over the code I wrote to check files and links: at that time there were no materials on the Internet covering all the extensions, so I had to write everything myself, and I spent a long time writing it and looking for help online. It would be a shame not to post it.
Besides, I don’t think anybody else would want to spend so much time figuring out how to extract text from these extensions. So far, there are no libraries that really work for this, for example on Windows. Just none!