framler
=======
[![PyPi](https://img.shields.io/pypi/v/framler.svg)](https://pypi.python.org/pypi/framler)
[![Build Status](https://travis-ci.org/huyhoang17/framler.svg?branch=master)](https://travis-ci.org/huyhoang17/framler)
[![Updates](https://pyup.io/repos/github/huyhoang17/framler/shield.svg)](https://pyup.io/repos/github/huyhoang17/framler/)
[![Python 3](https://pyup.io/repos/github/huyhoang17/framler/python-3-shield.svg)](https://pyup.io/repos/github/huyhoang17/framler/)
[![Documentation Status](https://readthedocs.org/projects/framler/badge/?version=latest)](https://framler.readthedocs.io/en/latest/?badge=latest)
Python package for crawling data and extracting the main information
- Free software: MIT license
- Documentation: https://framler.readthedocs.io.
Features
--------
### Package to crawl online newspapers and extract their main information
- Some online newspapers:
    - Dan Tri: https://dantri.com.vn/
    - VnExpress: https://vnexpress.net/
    - vietnamnet: https://vietnamnet.vn/
    - Nhan Dan: http://www.nhandan.com.vn/
    - Tuoi Tre: https://tuoitre.vn/
    - Lao Dong: https://laodong.vn/
    - Doi song phap luat: http://www.doisongphapluat.com/
    - Thanh Nien: https://thanhnien.vn/
    - VOV: https://vov.vn/
    - Zing: https://news.zing.vn/
    - ....
- Main information:
    - URL
    - Title
    - Content
    - Authors
    - Publish date
    - Top image
    - Images
    - Tags
    - ....
- Additional information:
    - Keyword extraction
    - Content summary
    - ....
- Folder structure
```
├── articles.py    - holds the article's meta information
├── cleaners.py    - base object for cleaning an article's content: HTML, text, stopwords, ...
├── extractors.py  - base extractor that automatically extracts the main information from any article; must include: url, title, content, author
├── parsers.py     - base class with short helper methods for extracting information from HTML elements, e.g. regex definitions; finding elements by tag, id, class, ...
└── utils.py       - common utility methods
```
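
To make the module roles above more concrete, here is a minimal sketch of how a parser helper, a cleaner, and an extractor might fit together. It is not framler's actual API: the names `Article`, `ClassTextParser`, `text_by_class`, `clean_text`, and `extract` are hypothetical, the CSS class names `title`, `content`, and `author` are assumed for the sample markup, and the standard-library `html.parser` stands in for beautifulsoup4 so the example runs without extra dependencies.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so nesting depth stays correct.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


@dataclass
class Article:
    """Hypothetical record for some of the listed fields (articles.py role)."""
    url: str
    title: str = ""
    content: str = ""
    authors: list = field(default_factory=list)


class ClassTextParser(HTMLParser):
    """Collect text inside elements carrying a given class (parsers.py role)."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_class(html, class_name):
    """Return whitespace-normalized text of all elements with class_name."""
    parser = ClassTextParser(class_name)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def clean_text(text, stopwords):
    """Drop stopwords and collapse whitespace (cleaners.py role)."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)


def extract(url, html):
    """Build an Article from raw HTML (extractors.py role)."""
    author = text_by_class(html, "author")
    return Article(
        url=url,
        title=text_by_class(html, "title"),
        content=text_by_class(html, "content"),
        authors=[author] if author else [],
    )


# Assumed sample markup; real newspaper pages use site-specific structure.
SAMPLE = """
<html><body>
<h1 class="title">Sample  headline</h1>
<div class="content"><p>First paragraph.</p><br><p>Second one.</p></div>
<span class="author">A. Writer</span>
</body></html>
"""

if __name__ == "__main__":
    print(extract("https://example.com/a", SAMPLE))
```

Each newspaper would need its own selectors in practice, which is presumably why the package separates a base extractor from the lower-level parsing helpers.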
- Some prerequisite libraries:
    - Selenium
    - Requests
    - beautifulsoup4
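
The keyword extraction and content summary listed under additional information are not described further in this README. As one possible illustration, the sketch below uses plain term frequency to pick keywords and to choose the highest-scoring sentences. This is an assumed technique, not necessarily what framler does; `STOPWORDS`, `tokenize`, `extract_keywords`, and `summarize` are hypothetical names, and only the standard library is used.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real one would be longer and language-specific.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())


def keyword_counts(text):
    """Count every token that is not a stopword."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stopword tokens."""
    return [word for word, _ in keyword_counts(text).most_common(top_n)]


def summarize(text, max_sentences=1):
    """Keep the sentences with the highest keyword-frequency score, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = keyword_counts(text)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(counts[t] for t in tokenize(sentences[i])),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    text = "Floods hit the city. The city council met. Floods damaged roads in the city."
    print(extract_keywords(text, top_n=2))
    print(summarize(text))
```

Frequency counting is crude for Vietnamese text, where words often span several syllables separated by spaces, so a real implementation would likely need a proper word segmenter first.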
### TODO
- Add documentation
Reference
---------
Based on the API of the newspaper library: https://github.com/codelucas/newspaper
Credits
-------
This package was created with [Cookiecutter](https://github.com/audreyr/cookiecutter) and the [`audreyr/cookiecutter-pypackage`](https://github.com/audreyr/cookiecutter-pypackage) project template.
=======
History
=======
0.0.1 (2019-02-12)
------------------
* First release on PyPI.