SCRAPEGOAT

Scrape data in one shot.

Scrapegoat is a Python library that scrapes websites and keeps the text that is relevant to a given topic, irrespective of language, using Natural Language Processing. It is mainly intended for non-English languages, where it aims to return accurate and relevant scraped text.

Concept

The data is first scraped from a website and processed (for example, English words are removed if the required data is in another language). The processed data and the topic are then fed to the BERT model, which computes the cosine similarity of the topic with each word of the scraped data; the mean of these cosine similarity scores is then computed. If the mean is greater than the threshold, the scraped data is returned as output. There is also a section that uses an adaptive threshold.
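The relevance check described above can be sketched as follows. This is a minimal illustration, not Scrapegoat's actual implementation: the function names, the toy 2-D vectors standing in for BERT embeddings, and the fixed threshold of 0.5 are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_a * norm_b)


def mean_relevance(topic_vec, word_vecs):
    """Mean cosine similarity between the topic and every word vector."""
    if not word_vecs:
        return 0.0
    scores = [cosine_similarity(topic_vec, w) for w in word_vecs]
    return sum(scores) / len(scores)


def is_relevant(topic_vec, word_vecs, threshold=0.5):
    """Keep the scraped text only if the mean score exceeds the threshold."""
    return mean_relevance(topic_vec, word_vecs) > threshold


# Toy vectors standing in for BERT embeddings (made up for illustration).
topic = [1.0, 0.0]
on_topic_words = [[1.0, 0.0], [2.0, 0.0]]   # same direction as the topic
off_topic_words = [[0.0, 1.0], [0.0, 3.0]]  # orthogonal to the topic

print(is_relevant(topic, on_topic_words))   # True
print(is_relevant(topic, off_topic_words))  # False
```

In a real pipeline the vectors would come from a BERT model, and the threshold would need tuning per language and topic, which is presumably what the adaptive threshold addresses.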

BERT Model

BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on the Transformer, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection. The BERT framework was pre-trained using text from Wikipedia. The Transformer is the part of the model that gives BERT its increased capacity for understanding context and ambiguity in language. It does this by processing any given word in relation to all other words in a sentence, rather than processing them one at a time. By looking at all surrounding words, the Transformer allows BERT to understand the full context of a word, and therefore to better understand the intent behind a piece of text.
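The idea that every output is a dynamically weighted mix of every input can be illustrated with a stripped-down self-attention step. This is only a sketch under simplifying assumptions: real BERT applies learned query, key, and value projections, uses multiple attention heads, and stacks many layers, all of which are omitted here. The function names and the toy input vectors are hypothetical.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Returns (outputs, weights). weights[i][j] is how much input j
    contributes to output i.
    """
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:
        # Score this position against every position, itself included.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all input vectors.
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy "word" vectors (made up for illustration).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(words)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and every row has a nonzero entry for every input, which is the "connected to every input element" property described above.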

Cosine Similarity

Cosine similarity is a metric used in Natural Language Processing to measure the similarity between two texts irrespective of their size. A word can be represented as a vector, so text documents can be represented in an n-dimensional vector space. A cosine similarity score of 1 means the two vectors have the same orientation. A value closer to 0 indicates that the two documents have less similarity. In general the score ranges from -1 to 1; for vectors with only non-negative components, such as word-count vectors, it ranges from 0 to 1.
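For reference, the standard definition for two n-dimensional vectors A and B (the symbols are chosen here for illustration) is:

```latex
\cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

As a worked example, take A = (1, 2) and B = (2, 4). The dot product is 1·2 + 2·4 = 10, the norms are √5 and √20, and their product is √100 = 10, so the score is 10 / 10 = 1: the vectors point in the same direction even though their lengths differ. For A = (1, 0) and B = (0, 1) the dot product is 0, so the score is 0.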

Multi Processing

The multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. The basic idea of multiprocessing is that if an algorithm can be divided among independent workers (processes running on separate cores), the program can be sped up. Machines today commonly come with 4, 6, 8, or 16 cores, so parts of the code can run in parallel.
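A minimal sketch of this pattern with the standard library's multiprocessing.Pool is shown below. The worker function count_words is a hypothetical stand-in for whatever per-link work is done (for example, scraping and scoring one page); it is not part of Scrapegoat.

```python
from multiprocessing import Pool


def count_words(text):
    """Work done for one document; stands in for processing one link."""
    return len(text.split())


def count_words_parallel(texts, workers=2):
    """Distribute the texts across worker processes; results keep input order."""
    with Pool(processes=workers) as pool:
        return pool.map(count_words, texts)


if __name__ == "__main__":
    docs = ["one two three", "four five", "six"]
    print(count_words_parallel(docs))  # prints [3, 2, 1]
```

The worker must be a top-level function so it can be sent to the child processes, and the `if __name__ == "__main__":` guard keeps child processes from re-running the launch code on platforms that start workers by importing the main module.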

Using Scrapegoat

The examples/test.py file contains the following:

from scrapegoat.main import getLinkData, generateData


if __name__ == "__main__":
    # Scrape a single link and get its relevance score for the topic.
    topic = " cricket"
    language = 'kn'  # language code for Kannada
    url = "https://vijaykarnataka.com/sports/cricket/news/ind-vs-eng-brian-lara-congratulates-jasprit-bumrah-for-breaking-his-world-record-in-test-cricket/articleshow/92628545.cms"
    text, score = getLinkData(url, topic, language=language)
    print(score)

    # Scrape and download data for the topic.
    topic = " cricket"
    language = 'hi'  # language code for Hindi
    generateData(topic, language, n_links=20)


Source Distribution

scrapegoat-1.0.0.7.tar.gz (6.5 kB)

Built Distribution

scrapegoat-1.0.0.7-py3-none-any.whl (7.4 kB, Python 3)
