
Advanced web scraper for machine learning and data science, built around BeautifulSoup and Pandas

Project description

# Introduction

Zineb is a lightweight tool for simple and efficient web scraping and crawling, built around BeautifulSoup and Pandas. Its main purpose is to help you quickly structure your data so that it can be used as fast as possible in data science or machine learning projects.

# Understanding how Zineb works

Zineb takes your custom spider, creates a set of HTTPRequest objects for each url, sends the requests and caches a BeautifulSoup object of the page within an HTMLResponse class of that request.

Most of your interactions with the HTML page will be done through the HTMLResponse class.

When the spider starts crawling the page, each response and request is passed through the start function:

```python
def start(self, response, **kwargs):
    request = kwargs.get('request')
    images = response.images
```

# Getting started

## Creating a project

To create a project, run python -m zineb start_project <project name>, which will create a directory with the following structure:

```
myproject
|-- media
|-- models
|   |-- base.py
|-- __init__.py
|-- manage.py
|-- settings.py
|-- spiders.py
```

Once the project folder is created, all your interactions with Zineb are made through the management commands, executed with python manage.py from your project's directory.

The models directory is where you place the elements that help structure the data that you have scraped from the internet.

The manage.py file allows you to run all the required commands from your project.

Finally, the spiders module contains all the spiders for your project.

## Configuring your project

On startup, Zineb loads a set of base settings (zineb.settings.base) that get overridden by the values you define in the settings.py file located in your project.

You can read more
about this in the settings section of this file.

## Creating a spider

Creating a spider is extremely easy and requires a set of starting urls that can be used to scrape one or many HTML pages.

```python
class Celebrities(Zineb):
    start_urls = ['http://example.com']

    def start(self, response, request=None, soup=None, **kwargs):
        # Do something here
        pass
```

Once the Celebrities class is called, each request is passed through the start method. In other words, the zineb.http.responses.HTMLResponse, the zineb.http.request.HTTPRequest and the BeautifulSoup HTML page object are sent through the function.

You are not required to use all these parameters at once. They are just there for convenience.

If you only need one of them, you can also write the start method like so:

```python
def start(self, response, **kwargs):
    # Do something here
    pass
```

Other objects can be passed through the function, such as the models that you have created, the settings of the application etc.

### Adding meta options

Meta options allow you to customize certain very specific behaviours [not found in the settings.py file] related to the spider.

```python
class Celebrities(Zineb):
    start_urls = ['http://example.com']

    class Meta:
        domains = []
```

#### Domains

This option limits a spider to a very specific set of domains.

#### Verbose name

This option, written as verbose_name, gives a different name to your spider.

## Running commands

### Start

Triggers the execution of all the spiders present in the given project.

### Shell

Starts an IPython shell in which you can test various elements on the HTML page.

When the shell is started, the zineb.http.HTTPRequest, the zineb.response.HTMLResponse and the BeautifulSoup instance of the page are injected.

Extractors are passed using aliases:

* links: LinkExtractor
* images: ImageExtractor
* multilinks: MultiLinkExtractor
* tables: TableExtractor

The extractors are also all passed into the shell, in addition to the project settings.

In that regard, the shell becomes an interesting place to test various queries before using them in your project. For example, using the shell with http://example.com.

We can get a simple url:

```python
IPython 7.19.0

In [1]: response.find("a")
Out[1]: <a href="https://www.iana.org/domains/example">More information...</a>
```

We can find all urls on the page:

```python
IPython 7.19.0

In [2]: extractor = links()
In [3]: extractor.resolve(response)
In [4]: str(extractor)
Out[4]: [Link(url=https://www.iana.org/domains/example, valid=True)]

In [5]: response.links
Out[5]: [Link(url=https://www.iana.org/domains/example, valid=True)]
```

Or simply get the page title:

```python
IPython 7.19.0

In [6]: response.page_title
Out[6]: 'Example Domain'
```

Remember that, in addition to the custom functions created for the class, everything else called on zineb.response.HTMLResponse is a BeautifulSoup method (find, find_all, find_next, next_sibling...).

## Queries on the page

As said previously, the majority of your interactions with the HTML page will be done through the HTMLResponse object, or zineb.http.responses.HTMLResponse class.

This class implements some very basic general functionalities that you can use throughout your project. To illustrate this, let's create a basic Zineb HTTP response from a request:

```python
from zineb.http.requests import HTTPRequest

request = HTTPRequest("http://example.com")
```

Requests, when created, are not sent [or resolved] automatically unless the _send function is called. Until then, they are marked as unresolved, e.g.
HTTPRequest("http://example.com", resolved=False).

Once the _send method is called, by using the html_page attribute or by calling any BeautifulSoup function on the class, you can do all the classic querying on the page, e.g. find, find_all...

```python
request._send()

request.html_response
# -> Zineb HTMLResponse object

request.html_response.html_page
# -> BeautifulSoup object

request.find("a")
# -> BeautifulSoup Tag
```

If you do not know about BeautifulSoup, please read the documentation here.

For instance, suppose you have a spider and want to get the first link present on http://example.com. This is what you would do:

```python
from zineb.app import Zineb

class MySpider(Zineb):
    start_urls = ["http://example.com"]

    def start(self, response=None, request=None, soup=None, **kwargs):
        link = response.find("a")

        # Or, you can also use this technique through
        # the request object
        link = request.html_response.find("a")

        # Or you can directly use the soup
        # object as so
        link = soup.find("a")
```

In order to understand what the Link, Image and Table objects represent, please read the following section of this page.

Zineb HTTPRequest objects are better explained in the following section.

### Getting all the links

```python
request.html_response.links
# -> [Link(url=http://example.com, valid=True)]
```

### Getting all the images

```python
request.html_response.images
# -> [Image(url=https://example.com/1.jpg)]
```

### Getting all the tables

```python
request.html_response.tables
# -> [Table(url=https://example.com/1)]
```

### Getting all the text

Finally, you can retrieve all the text of the web page at once.

```python
request.html_response.text

# -> '\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'
```

## FileCrawler

There might be situations where you have a set of HTML files in your project directory that you want to crawl. Zineb provides a spider for such an event.

__NOTE:__ Ensure that the directory to use is within your project.

```python
class Spider(FileCrawler):
    start_files = ["media/folder/myfile.html"]
```

You might have thousands of files and certainly might not want to reference each file one by one. You can then use the utility function collect_files.

```python
from zineb.utils.iterator import collect_files

class Spider(FileCrawler):
    start_files = collect_files("media/folder")
```

Read more on collect_files here.

# Models

Models are a simple way to structure your scraped data before saving it to a file.

## Creating a custom Model

In order to create a model, subclass the Model object from zineb.models.Model and then add fields to it. For example:

```python
from zineb.models.datastructure import Model
from zineb.models import fields

class Player(Model):
    name = fields.CharField()
    date_of_birth = fields.DateField()
    height = fields.IntegerField()
```

### Using the custom model

On its own, a model does nothing. In order to make it work, you have to add values to it and then resolve the fields.

#### Adding a free custom value

The first method consists of adding values through the add_value method. This method does not rely on the BeautifulSoup HTML page object, which means that values can be added freely.

```python
player.add_value('name', 'Kendall Jenner')
```

#### Adding a value based on an expression

Adding expression-based values requires a BeautifulSoup HTML page object.
You can add one value at a time.

```python
player.add_using_expression('name', 'a', attrs={'class': 'title'})
```

#### Add case based values

If you want to add a value to the model based on certain conditions, use add_case in combination with an expression class.

For instance, suppose you are scraping a fashion website and, for certain prices, let's say 25, you want to replace them with 25.5.

```python
from zineb.models.expressions import When

my_model.add_case(25, When('price__eq=25', 25.5))
```

#### Adding multiple values with expressions

#### Adding calculated values

If you wish to run a calculation on a field before passing it to your model, you can use expression classes in combination with add_calculated_value.

```python
from zineb.models.expressions import Add

my_model.add_calculated_value('price', Add(25, 5))
```

#### Adding related values

In cases where you want to add a value to your model based on the last inserted value, this function serves exactly that purpose.
Suppose you are retrieving dates of birth on a website and want to automatically derive the person's age from that model field:

```python
class MyModel(Model):
    date_of_birth = fields.DateField("%d-%M-%Y")
    age = fields.AgeField("%Y-%M-%d")
```

Without add_related_value, this is what you would do:

```python
model.add_value("date_of_birth", value)
model.add_value("age", value)
```

However, with add_related_value you can automatically insert the age value in the model based on the value returned for the date of birth:

```python
model.add_related_value("date_of_birth", "age", value)
```

This will insert the date of birth based on the DateField and then insert another value on the AgeField.

## Meta options

By adding a Meta to your model, you can pass custom behaviours:

* Ordering
* Template model

### Template model

If a model's only purpose is to implement additional fields for a child model, use the template_model option to indicate this state.

```python
class TemplateModel(Model):
    name = fields.CharField()

    class Meta:
        template_model = True


class MainModel(TemplateModel):
    surname = fields.CharField()
```

### Ordering

Order your data in a specific way based on certain fields before saving your model.

## Fields

Fields are a very simple way of passing HTML data to your model in a structured way.
Zineb comes with a number of preset fields that you can use out of the box:

* CharField
* TextField
* NameField
* EmailField
* UrlField
* ImageField
* IntegerField
* DecimalField
* DateField
* AgeField
* FunctionField
* CommaSeparatedField
* ListField
* BooleanField

### How fields work

Each field comes with a resolve function which, when called, stores the resulting value within itself.

By default, the resolve function does the following things.

First, it runs all cleaning functions on the original value, for example stripping tags like "<" or ">", which normalizes the value before additional processing.

Second, a deep_clean method is run on the result, taking out any useless spaces, removing escape characters and finally reconstructing the value to ensure that any non-detected white space is eliminated.

Finally, all the registered validators (default and custom) are called on the final value.

### CharField

The CharField represents the normal character element on an HTML page.

CharField(max_length=None, null=None, default=None, validators=[])

### TextField

The text field is longer, which allows you to add paragraphs of text.

TextField(max_length=None, null=None, default=None, validators=[])

### NameField

The name field allows you to implement capitalized text in your model. The title method is called on the string in order to represent the value correctly, e.g. Kendall Jenner.

NameField(max_length=None, null=None, default=None, validators=[])

### EmailField

The email field represents emails.
The default validator, validators.validate_email, is automatically called by the resolve function of the class in order to ensure that the value is indeed an email.

* limit_to_domains: check that the email corresponds to the list of specified domains

EmailField(limit_to_domains=[], max_length=None, null=None, default=None, validators=[])

### UrlField

The url field is specific to urls. Just like the email field, the default validator, validators.validate_url, is called in order to validate the url.

### ImageField

The image field holds the url of an image, exactly like the UrlField, with the sole difference that you can download the image directly when the field is evaluated.

* download: download the image to your media folder while the scraping is performed
* as_thumbnail: download the image as a thumbnail
* download_to: download the image to a specific path

```python
class MyModel(Model):
    avatar = ImageField(download=True, download_to="/this/path")
```

### IntegerField

This field allows you to pass an integer into your model.

* default: default value if None
* max_value: implements a maximum value constraint
* min_value: implements a minimum value constraint

### DecimalField

This field allows you to pass a float value into your model.

* default: default value if None
* max_value: implements a maximum value constraint
* min_value: implements a minimum value constraint

### DateField

The date field allows you to pass dates to your model.
In order to use this field, you have to pass a date format so that the field knows how to resolve the value.

* date_format: indicates how to parse the incoming data value
* default: default value if None
* tz_info: timezone information

```python
class MyModel(Model):
    date = DateField("%d-%m-%Y")
```

### AgeField

The age field works like the DateField but, instead of returning the date, it returns the difference between that date and the current date, which corresponds to the age.

* date_format: indicates how to parse the incoming data value
* default: default value if None
* tz_info: timezone information

### FunctionField

The function field is a special field that you can use when you have a set of functions to run on the value before returning the final result. For example, let's say you have the value Kendall J. Jenner and you want to run a specific function that takes out the middle letter on every incoming value:

```python
def strip_middle_letter(value):
    # Do something here
    return value

class MyModel(Model):
    name = FunctionField(strip_middle_letter, output_field=CharField())
```

Every time the resolve function is called on this field, the methods provided are applied to the value sequentially. Each method should return the new value.

An output field is not compulsory but, if it is not provided, each value will be returned as a character.

### ListField

A list field stores an array of values that are all evaluated against an output field that you would have specified.

__N.B.__ Note that the value of a ListField is implemented as is in the final DataFrame. Make sure you are using this field correctly in order to avoid unwanted results.

### CommaSeparatedField

Create a comma-separated field in your model.

__N.B.__ Note that the value of a CommaSeparatedField is implemented as is in the final DataFrame.
Make sure you are using this field correctly in order to avoid unwanted results.

### RegexField

Parse an element within a given value using a regex expression before storing it in your model.

```python
RegexField(r'(\d+)(?<=€)')
```

### BooleanField

Adds a boolean-based value to your model. Uses classic boolean representations such as on, off, 1, 0, True, true, False or false to resolve the value.

### Creating your own field

You can also create a custom field by subclassing zineb.models.fields.Field. When doing so, your custom field has to provide a resolve function in order to determine how the value should be parsed.

```python
class MyCustomField(Field):
    def resolve(self, value):
        initial_result = super().resolve(value)

        # Rest of your code here
```

__NOTE:__ If you want to use the cleaning functionality from the super class in your own resolve function, make sure to call super beforehand, as indicated above.

## Validators [initial validators]

Validators make sure that the value that was passed respects the constraints that were implemented as keyword arguments on the field class. There are four basic validations that can run if you specify a constraint for them:

* Maximum length (max_length)
* Nullity (null)
* Defaultness (default)
* Validity (validators)

### Maximum or Minimum length

The maximum or minimum length check ensures that the value does not exceed a certain length, using validators.max_length_validator or validators.min_length_validator.

### Nullity

The nullity validation ensures that the value is not null and that, if a default is provided, the null value is replaced by the latter.
It uses validators.validate_is_not_null.

The defaultness validation provides a default value for null or non-existing ones.

### Practical examples

For instance, suppose you only want values that do not exceed a certain length:

```python
name = CharField(max_length=50)
```

Or suppose you want a default value for fields that are empty or blank:

```python
name = CharField(default='Kylie Jenner')
```

Remember that validators validate the value itself, for example by making sure that a url is indeed a url or that an email follows the pattern that you would expect from an email.

Suppose you only want values that are Kendall Jenner. Then you could create a custom validator that does the following:

```python
def check_name(value):
    if value == "Kylie Jenner":
        return None
    return value

name = CharField(validators=[check_name])
```

You can also create validators that match a specific regex pattern using the zineb.models.validators.regex_compiler decorator:

```python
from zineb.models.datastructure import Model
from zineb.models.fields import IntegerField
from zineb.models.validators import regex_compiler

@regex_compiler(r'\d+')
def custom_validator(value):
    if value > 10:
        return value
    return 0

class Player(Model):
    age = IntegerField(validators=[custom_validator])
```

__NOTE:__ It is important to understand that the result of the regex compiler is reinjected into your custom validator, on which you can then do various other checks.

#### Field resolution

In order to get the complete structured data, you need to call resolve_values, which returns a pandas.DataFrame object:

```python
player.add_value("name", "Kendall Jenner")
player.resolve_values()

# -> pandas.DataFrame
```

Practically though, you'll be using the save method, which also calls resolve_values under the
hood:

```python
player.save(commit=True, filename=None, **kwargs)

# -> pandas.DataFrame or new file
```

By calling the save method, you'll be able to store the data directly to a JSON or CSV file.

## Expressions

Expressions are built-in functions that can modify the incoming value in some way before it is stored in your model.

### Math

Run a calculation, such as addition, subtraction, division or multiplication, on the value.

```python
from zineb.models.expressions import Add

player.add_calculated_field('height', Add(175, 5))

# -> {'height': [180]}
```

### ExtractYear, ExtractDate, ExtractDay

From a date string, extract the year, the date or the day.

```python
from zineb.models.expressions import ExtractYear

player.add_value('competition_year', ExtractYear('11-1-2021'))

# -> {'competition_year': [2021]}
```

# Extractors

Extractors are utilities that make it easy to quickly extract specific pieces of data from a web page, such as links and images.

Some extractors can be used in various manners.
First, as a context manager:

```python
extractor = LinkExtractor()
with extractor:
    # Do something here
    pass
```

Second, through iteration:

```python
for link in extractor:
    # Do something here
    pass
```

Finally, with next:

```python
next(extractor)
```

You can also check whether an extractor contains a specific value, and even concatenate extractors together:

```python
# Contains
if x in extractor:
    # Do something here
    pass

# Addition
concatenated_extractors = extractor1 + extractor2
```

## LinkExtractor

* url_must_contain - only keep urls that contain a specific string
* unique - return a unique set of urls (no duplicates)
* base_url - reconcile a domain with a path
* only_valid_links - only keep links (Link) that are marked as valid

```python
extractor = LinkExtractor()
extractor.finalize(response.html_response)

# -> [Link(url=http://example.com, valid=True)]
```

There might be times when the extracted links are relative paths. This can cause an issue when running additional requests.
In which case, use the base_url parameter:

```python
extractor = LinkExtractor(base_url="http://example.com")
extractor.finalize(response.html_response)

# Instead of getting this result, which would also
# be marked as a non-valid link
# -> [Link(url=/relative/path, valid=False)]

# You will get the following, with the full url
# -> [Link(url=http://example.com/relative/path, valid=True)]
```

NOTE: By definition, a relative path is not a valid link, hence valid being set to False.

## MultiLinkExtractor

A MultiLinkExtractor works exactly like the LinkExtractor, with the only difference being that it also identifies and collects emails contained within the HTML page.

## TableExtractor

Extract all the rows from the first table that is matched on the HTML page.

* class_name - intercept a table with a specific class name
* has_headers - specify whether the table has headers, in order to ignore them in the final data
* filter_empty_rows - ignore any rows that do not have values
* processors - a set of functions to run on the data once it is all extracted

## ImageExtractor

Extract all the images on the HTML page.

You can filter down the images that you get by using a specific set of parameters:

* unique - return only a unique set of urls
* as_type - only return images having a specific extension
* url_must_contain - only return images whose url contains a specific string
* match_height - only return images that match a specific height
* match_width - only return images that match a specific width

## TextExtractor

Extract all the text on an HTML page.

First, the text is retrieved as a raw value, then tokenized and vectorized using nltk.tokenize.PunktSentenceTokenizer and nltk.tokenize.WordPunctTokenizer.

To know more about NLTK, please read the following documentation.

# Zineb special wrappers

## HTTPRequest

Zineb uses a special built-in HTTPRequest class which wraps the following for better cohesion:

* The requests.Request response class
* The bs4.BeautifulSoup object

In general, you will not need to interact with this class much because it is just an interface that implements additional functionalities, especially on top of the Request class.

* follow: creates a new instance of the class whose response will be the one of a new url
* follow_all: creates new instances of the class whose responses will be the ones of the new urls
* urljoin: joins a path to the domain

## HTMLResponse

It wraps the BeautifulSoup object in order to implement some small additional functionalities:

* page_title: returns the page's title
* links: returns all the links of the page
* images: returns all the images of the page
* tables: returns all the tables of the page

# Signals

Signals are a very simple yet efficient way to run functions during the lifecycle of your project when certain events occur at very specific moments.

Internally, signals are sent on the following events:

* When the registry is populated
* Before the spider starts
* After the spider has started
* Before an HTTP request is sent
* After an HTTP request is sent
* Before the model downloads anything
* After the model has downloaded something

## Creating a custom signal

To create a custom signal, you need to mark a method as being a receiver for any incoming signals.
For example, if you want to create a signal to intercept one of the events above, you should do:

```python
from zineb.signals import receiver

@receiver(tag="Signal Name")
def my_custom_signal(sender, **kwargs):
    pass
```

The signal function has to be able to accept a sender object and additional parameters, such as the current url or the current HTML page.

Your custom signals do not have to return anything.

# Utilities

## Link reconciliation

Most of the time, when you retrieve links from a page, they are returned as relative paths. The urljoin method reconciles the url of the visited page with that path.

```python
# <a href="/kendall-jenner">Kendall Jenner</a>

# Now we want to reconcile the relative path from this link with
# the main url that we are visiting, e.g. https://example.com

request.urljoin("/kendall-jenner")

# -> https://example.com/kendall-jenner
```

## File collection

Collect files within a specific directory using collect_files. collect_files also takes an additional function that can be used to filter or alter the final results.

# Settings

This section covers all the settings that are available for your project and how to use them for web scraping.

### PROJECT_PATH

Represents the current path of your project. This setting should not be changed.

### SPIDERS

In order for a spider to be executed, every created spider should be registered here. The name of the class serves as the name of the spider to be used.

```python
SPIDERS = [
    "MySpider"
]
```

### DOMAINS

You can restrict your project to a specific set of domains by ensuring that no request is sent unless it matches one of the domains within this list.

```python
DOMAINS = [
    "example.com"
]
```

### ENSURE_HTTPS

Enforce that every link in your project is a secure HTTPS link.
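For example, a project could opt in to this check with a one-line entry in its settings.py (the value shown is illustrative):

```python
# settings.py -- enabling the HTTPS-only check described above (illustrative)
ENSURE_HTTPS = True
```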
This setting is set to False by default.

### MIDDLEWARES

Middlewares are functions/classes that are executed when a signal is sent from any part of the project. Middlewares implement extra functionalities without affecting the core parts of the project. They can therefore be disabled safely if you do not need them.

```python
MIDDLEWARES = [
    "zineb.middlewares.handlers.Handler",
    "myproject.middlewares.MyMiddleware"
]
```

The main Zineb middlewares are the following:

* zineb.middlewares.referer.Referer
* zineb.middlewares.handlers.Handler
* zineb.middlewares.automation.Automation
* zineb.middlewares.history.History
* zineb.middlewares.statistics.GeneralStatistics
* zineb.middlewares.wireframe.WireFrame

### USER_AGENTS

A user agent is a characteristic string that lets servers and network peers identify the application, operating system, vendor and/or version of the requester (MDN Web Docs).

Implement additional sets of user agents for your project, in addition to those that are already provided.

### RANDOMIZE_USER_AGENTS

Specifies whether to use one user agent for every request or to randomize the user agent on every request. This setting is set to False by default.

### DEFAULT_REQUEST_HEADERS

Specify additional default headers to use for each request.

The default initial headers are:

* Accept-Language - en
* Accept - text/html,application/json,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
* Referrer - None

### PROXIES

Use a set of proxies for each request. When a request is sent, a random proxy is selected from the list and attached to the request.

```python
PROXIES = [
    ("http", "127.0.0.1"),
    ("https", "127.0.0.1")
]
```

### RETRY

Specifies the retry policy. This is set to False by default.
In other words, the request silently fails and is never retried.

### RETRY_TIMES

Specifies the number of times a request is sent before eventually failing.

### RETRY_HTTP_CODES

Indicates which status codes should trigger a retry. By default, the following codes will trigger it: 500, 502, 503, 504, 522, 524, 408 and 429.

### TIME_ZONE
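As a recap, the three retry settings described above could be combined in a hypothetical settings.py fragment like this (the values shown are illustrative, not the defaults):

```python
# settings.py -- illustrative retry policy using the settings described above
RETRY = True                                  # opt in to retrying failed requests
RETRY_TIMES = 3                               # attempts before the request finally fails
RETRY_HTTP_CODES = [500, 502, 503, 504, 429]  # only these statuses trigger a retry
```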



Download files


Source Distribution

zineb-scrapper-6.0.1.tar.gz (31.7 kB)

