
A simple newspaper3k Haystack node wrapper: a scraper node for extracting articles from given links, and a crawler node for discovering and scraping many pages.

Project description

Newspaper3k Haystack

Newspaper3k Haystack is a simple wrapper around the newspaper3k library for the Haystack framework. It lets you scrape articles from given URLs using the scraper node, or crawl many pages using the crawler node.

Installation:

You can install Newspaper3k Haystack using pip:

pip install newspaper3k-haystack

Usage:

Scraper node:

from newspaper3k_haystack import newspaper3k_scraper
scraper = newspaper3k_scraper()

You can also provide request headers and a timeout for page loading:

scraper = newspaper3k_scraper(
    headers={'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'},
    request_timeout=10)

To run the node standalone, use run with a single URL or run_batch with multiple URLs passed in a list (a run_batch sketch follows the example below).

Available parameters:

:param query: list of strings containing the webpages to scrape.
:param lang: (None by default) language to process the article with; if None, it is auto-detected.
    Available languages (more info at https://newspaper.readthedocs.io/en/latest/):
    input code      full name

    ar              Arabic
    ru              Russian
    nl              Dutch
    de              German
    en              English
    es              Spanish
    fr              French
    he              Hebrew
...
:param summary: (False by default) Whether to summarize the document (through newspaper3k) and save it as document metadata.
:param path: (None by default) Path where to store the downloaded article HTML; if None, nothing is downloaded. Ignored if load=True.
:param load: (False by default) If True, query should be a local path to an HTML file to scrape.

In standalone:

scraper.run(query="https://www.lonelyplanet.com/articles/getting-around-norway",
    metadata=True,
    summary=True,
    keywords=True,
    path="articles")

In a pipeline:


from qdrant_haystack.document_stores import QdrantDocumentStore
from haystack.nodes import EntityExtractor
from haystack.pipelines import Pipeline
from haystack.nodes import PreProcessor

document_store = QdrantDocumentStore(
    ":memory:",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
)

entity_extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER",flatten_entities_in_meta_data=True)

processor = PreProcessor(
    clean_empty_lines=False,
    clean_whitespace=False,
    clean_header_footer=False,
    split_by="sentence",
    split_length=30,
    split_respect_sentence_boundary=False,
    split_overlap=0
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=scraper, name="scraper", inputs=['File'])
indexing_pipeline.add_node(component=processor, name="processor", inputs=['scraper'])
indexing_pipeline.add_node(component=entity_extractor, name="EntityExtractor", inputs=["processor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['EntityExtractor'])

# We can also pass the arguments shown above
indexing_pipeline.run(query="https://www.roughguides.com/norway/",
    params={
        "scraper":{
            "metadata":True,
            "summary":True,
            "keywords":True
        }
    })
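
If the pipeline ran without errors, the scraped documents are now in the document store. A quick sanity check, assuming the standard Haystack document store API is available on QdrantDocumentStore:

# Assumes the standard Haystack DocumentStore API (get_document_count).
print(document_store.get_document_count(), "documents indexed")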

Crawler node:

from newspaper3k_haystack import newspaper3k_crawler

When initializing the crawler, you can pass the same parameters as to the scraper node.

crawler = newspaper3k_crawler(
    headers={'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0',
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'},
    request_timeout=10)

Available parameters:

        :param query: list of initial URLs to start scraping from.
        :param n_articles: number of articles to scrape per initial URL.
        :param beam: number of articles from each scraped website to prioritize in the crawl queue.
            If 0, newly found links are simply appended to the crawl queue after each scrape (breadth-first search, BFS).
            If 1, the crawl behaves like a depth-first search (DFS). An illustrative sketch of beam and filters follows this parameter list.
        :param filters: dictionary with lists of substrings that the URLs should or should not contain. Keys: positive and negative.
            URLs are checked to contain at least one positive filter and none of the negatives.
            e.g.
            {"positive": [".com", ".es"],
             "negative": ["facebook", "instagram"]}
        :param keep_links: (False by default) Whether to keep the links found on each page as document metadata.

        :param lang: (None by default) language to process the article with, if None autodetected.
            Available languages are: (more info at https://newspaper.readthedocs.io/en/latest/)
            input code      full name

            ar              Arabic
            ru              Russian
            nl              Dutch
            de              German
            en              English
            es              Spanish
            fr              French
            he              Hebrew
            it              Italian
            ko              Korean
            no              Norwegian
            fa              Persian
            pl              Polish
            pt              Portuguese
            sv              Swedish
            hu              Hungarian
            fi              Finnish
            da              Danish
            zh              Chinese
            id              Indonesian
            vi              Vietnamese
            sw              Swahili
            tr              Turkish
            el              Greek
            uk              Ukrainian
            bg              Bulgarian
            hr              Croatian
            ro              Romanian
            sl              Slovenian
            sr              Serbian
            et              Estonian
            ja              Japanese
            be              Belarusian

        :param metadata: (False by default) Whether to get article metadata.
        :param keywords: (False by default) Whether to save the detected article keywords as document metadata.
        :param summary: (False by default) Whether to summarize the document (through newspaper3k) and save it as document metadata.
        :param path: (None by default) Path where to store the downloaded article HTML; if None, nothing is downloaded.
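
The beam and filters parameters control which discovered links enter the crawl queue and in what order. The snippet below is only an illustrative sketch of that idea, not the package's actual implementation; every name in it is hypothetical:

from collections import deque

def url_passes(url, filters):
    # A URL must contain at least one positive substring and none of the negative ones.
    positives = filters.get("positive", [])
    negatives = filters.get("negative", [])
    if positives and not any(p in url for p in positives):
        return False
    return not any(n in url for n in negatives)

def enqueue_links(queue, found_links, beam, filters):
    # With beam=0, accepted links are simply appended (BFS-like ordering).
    # With beam>=1, the first `beam` accepted links jump to the front of the queue,
    # so the crawl keeps drilling into the most recently scraped site (DFS-like for beam=1).
    accepted = [url for url in found_links if url_passes(url, filters)]
    for url in reversed(accepted[:beam]):
        queue.appendleft(url)
    queue.extend(accepted[beam:])

queue = deque()  # the seed URL has already been popped and scraped
enqueue_links(
    queue,
    ["https://www.roughguides.com/norway/oslo/", "https://www.facebook.com/roughguides"],
    beam=1,
    filters={"positive": ["norway"], "negative": ["facebook", "instagram"]})
print(list(queue))  # the facebook link is filtered out; the oslo link is crawled next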

In standalone:

You can also use run_batch and pass a list of URLs in the query argument; it will scrape n_articles articles for each provided URL.

docs = crawler.run(
    query = "https://www.roughguides.com/norway/ ",
    n_articles = 10,
    beam = 5,
    filters = {
        "positive":["norway"],
        "negative":["facebook","instagram"]
    },
    keep_links = False,
    metadata=True,
    summary=True,
    keywords=True,
    path = "articles")

In a pipeline:

from qdrant_haystack.document_stores import QdrantDocumentStore
from haystack.nodes import EntityExtractor
from haystack.pipelines import Pipeline
from haystack.nodes import PreProcessor

document_store = QdrantDocumentStore(
    ":memory:",
    index="Document",
    embedding_dim=768,
    recreate_index=True,
)

entity_extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER",flatten_entities_in_meta_data=True)

processor = PreProcessor(
    clean_empty_lines=False,
    clean_whitespace=False,
    clean_header_footer=False,
    split_by="sentence",
    split_length=30,
    split_respect_sentence_boundary=False,
    split_overlap=0
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=crawler, name="crawler", inputs=['File'])
indexing_pipeline.add_node(component=processor, name="processor", inputs=['crawler'])
indexing_pipeline.add_node(component=entity_extractor, name="EntityExtractor", inputs=["processor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['EntityExtractor'])

# We can also pass the arguments shown above
indexing_pipeline.run(query="https://www.roughguides.com/norway/",
    params={
        "crawler":{
            "n_articles" : 500,
            "beam" : 5,
            "filters" : {
                "positive":["norway"],
                "negative": ["facebook"]
            },
            "keep_links" : False,
            "metadata":True,
            "summary":True,
            "keywords":True,
            "path": "articles"
        }
    })
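
As with the scraper pipeline, you can pull a few documents back out of the store to check what was written, including the metadata added by the summary, keywords and EntityExtractor steps. A minimal sketch assuming the standard Haystack DocumentStore API; the exact metadata keys depend on the options used, so they are printed rather than assumed:

# Assumes the standard Haystack DocumentStore API (get_all_documents).
stored_docs = document_store.get_all_documents()
print(len(stored_docs), "documents indexed")
for doc in stored_docs[:3]:
    print(len(doc.content), "characters; meta keys:", sorted(doc.meta.keys()))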

