Explore Web Pages - Scrapers and Crawlers

WebXplore (v1.0.3)

WebXplore offers a multitude of tools for web scraping and crawling, and for analyzing the scraped text to estimate its sentiment or the author's tone.

This package helps in retrieving information from these sources:

  1. Google Search: Get links from any Google search query.

  2. Website Text: Use an intelligent parser to strip the HTML markup from a webpage and keep its text content.

  3. Twitter: Given a word or phrase, get related tweets.

  4. Reddit: Get the hottest posts given the subreddit and a key phrase.

  5. NewsAPI: Retrieve news articles for a given topic or phrase.
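
To illustrate what "stripping HTML" in item 2 means, here is a minimal sketch built only on Python's standard-library html.parser module. It is not WebXplore's own parser, whose internals are not described here; the names TextExtractor and strip_html are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Tags whose contents are never visible text.
SKIPPED_TAGS = {"script", "style"}
# Tags that should act as word boundaries in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(strip_html(sample))
```

A real scraper typically does more than this (boilerplate removal, encoding detection, malformed-markup handling), so treat the sketch as an explanation of the idea rather than a replacement for WebScraper.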

Installation

$ pip install webxplore

or clone the repository.

$ git clone https://github.com/arnavn101/WebXplore.git

Getting Started

The following steps show how to use webxplore.

1. Get Links from Google Search

from webxplore import WebSearcher

searchQuery = WebSearcher.SearchWeb('Artificial Intelligence', 5)
print(searchQuery.returnListLinks())

2. Scrape a Website

from webxplore import WebScraper

webScraper = WebScraper.ScrapeWebsite('https://en.wikipedia.org/wiki/Artificial_intelligence')
print(webScraper.return_article())

3. Get Sentiments from Text

from webxplore.utils import SentimentAnalyzer

sentimentAnalyzer = SentimentAnalyzer.RetrieveSentiments('This is a good situation.')
print(sentimentAnalyzer.returnFinalSentiment())
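
Steps 1 to 3 can be chained into a small pipeline. The sketch below is one possible way to do that, not part of the package: score_search_results is a hypothetical helper that takes the three operations as plain callables, so it can be tried without network access. The commented wiring at the bottom assumes that returnListLinks() yields URL strings and that return_article() returns the page text; neither return type is confirmed here.

```python
def score_search_results(query, limit, search, scrape, score):
    """Return (url, sentiment) pairs for the results of a search query.

    search(query, limit) -> iterable of URLs
    scrape(url)          -> article text
    score(text)          -> sentiment value
    """
    results = []
    for url in search(query, limit):
        try:
            text = scrape(url)
        except Exception as error:
            # One unreachable or malformed page should not stop the whole run.
            print(f"Skipping {url}: {error}")
            continue
        if text:
            results.append((url, score(text)))
    return results


# Possible wiring with webxplore (assumed return types, see the note above):
#
# from webxplore import WebSearcher, WebScraper
# from webxplore.utils import SentimentAnalyzer
#
# pairs = score_search_results(
#     'Artificial Intelligence', 5,
#     search=lambda q, n: WebSearcher.SearchWeb(q, n).returnListLinks(),
#     scrape=lambda url: WebScraper.ScrapeWebsite(url).return_article(),
#     score=lambda text: SentimentAnalyzer.RetrieveSentiments(text).returnFinalSentiment(),
# )
```

Passing the operations in as arguments keeps the loop independent of any one scraper or analyzer, which also makes it easy to test with stand-in functions.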

4. Get Summary of the Text

from webxplore.utils import TextSummarizer

textSummarizer = TextSummarizer.SummarizeText('He feels very scared. He wants to protect himself.', 1)
print(textSummarizer.returnFinalSummary())

5. Get Tone of the Text (for each sentence)

from webxplore.utils import ToneAnalyzer

# The trailing space keeps the two sentences separate after concatenation.
textTone = ToneAnalyzer.ToneAnalysis('Laugh and the world laughs with you. ' +
                                     'Weep and you weep alone.', 'watsonApiKey')
print(textTone.returnTone())

6. Get the Latest Articles from NewsAPI

from webxplore.searchBeyond import SearchNews

newsArticles = SearchNews.RetrieveNewsArticle('Politics', 5, 'newsApiKey')
print(newsArticles.return_articleSentences())

7. Get Posts from a Subreddit

from webxplore.searchBeyond import SearchReddit

redditPosts = SearchReddit.CrawlSubReddit('stocks', 'amazon', 10, 'RedditClientId',
                                          'RedditClientSecret', 'RedditUserAgent')
print(redditPosts.return_listSentences())

8. Get Tweets That Contain a Keyword

from webxplore.searchBeyond import SearchTwitter

retrieveTweets = SearchTwitter.CrawlTwitter('tesla', 10, 'TwitterConsumerKey', 'TwitterConsumerSecret',
                                            'TwitterAccountKey', 'TwitterAccountSecret')
print(retrieveTweets.return_tweets())
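
The examples in steps 5 to 8 pass API keys as string literals. One common alternative is to read them from environment variables so that secrets stay out of source control. The sketch below assumes that approach; the helper load_credentials and the variable names in REDDIT_ENV_VARS are hypothetical and are not defined by WebXplore.

```python
import os

# Hypothetical environment variable names, chosen for this example only.
REDDIT_ENV_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_credentials(names, environ=os.environ):
    """Return the values for `names`, failing early if any is unset or empty."""
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return [environ[name] for name in names]


# Possible use with the Reddit example from step 7:
#
# from webxplore.searchBeyond import SearchReddit
#
# client_id, client_secret, user_agent = load_credentials(REDDIT_ENV_VARS)
# redditPosts = SearchReddit.CrawlSubReddit('stocks', 'amazon', 10,
#                                           client_id, client_secret, user_agent)
```

Failing with a single message that lists every missing variable tends to be easier to debug than an authentication error raised later by the remote API.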

Contributions

Contributions to this repository are welcome. Please open a pull request and make sure it passes all the CI tests.

License

MIT License. Copyright (c) 2020, Arnav Nidumolu
