LangChain integration for advertools

LangChain integration with advertools

This package integrates advertools into the LangChain ecosystem.

It currently provides one class, WebsiteLoader, which is a document loader.

Installation

python3 -m pip install langchain-advertools

Typical workflow

Crawl a website

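import advertools as adv
import pandas as pd

# Crawl the site, following links, and write one JSON object per page to langchain.jsonl
adv.crawl("https://www.langchain.com/", "langchain.jsonl", follow_links=True)
# Read the crawl output (JSON Lines format) into a DataFrame
crawldf = pd.read_json("langchain.jsonl", lines=True)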

The whole website is now crawled, and the output file can be read into a DataFrame, crawldf:

url title meta_desc viewport charset h1 h2 h3 canonical og:title og:description og:image og:type twitter:card body_text size download_timeout download_slot download_latency depth status links_url links_text links_nofollow nav_links_url nav_links_text nav_links_nofollow header_links_url header_links_text header_links_nofollow footer_links_url footer_links_text footer_links_nofollow img_src img_loading img_width img_alt img_sizes img_srcset img_height ip_address crawl_time resp_headers_Date resp_headers_Content-Type resp_headers_Cf-Ray resp_headers_Cf-Cache-Status resp_headers_Age resp_headers_Last-Modified resp_headers_Content-Security-Policy resp_headers_Surrogate-Control resp_headers_Surrogate-Key resp_headers_X-Frame-Options resp_headers_X-Lambda-Id resp_headers_Vary resp_headers_Set-Cookie resp_headers_Alt-Svc resp_headers_X-Cluster-Name request_headers_Accept request_headers_Accept-Language request_headers_User-Agent request_headers_Accept-Encoding request_headers_Referer h6 h4 h5
0 https://www.langchain.com/ LangChain LangChain’s suite of products suppo width=device-width, initial-scale=1 utf-8 Applications that can reason. Power From startups to global enterprises Hear from our happy customers https://www.langchain.com/ LangChain LangChain’s suite of products suppo https://cdn.prod.website-files.com/ website summary_large_image LangChain’s suite of products suppo 105173 180 www.langchain.com 0.0991158 0 200 https://www.langchain.com/@@https:/ @@LangGraph@@LangSmith False@@False@@False@@False@@False@@ https://www.langchain.com/langgraph LangGraph@@LangSmith@@LangChain@@Re False@@False@@False@@False@@False@@ https://www.langchain.com/contact-s Get a demo@@See customer stories False@@False@@False@@False@@False https://www.langchain.com/langchain LangChain@@LangSmith@@LangGraph@@Ag False@@False@@False@@False@@False@@ https://cdn.prod.website-files.com/ lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ 35.5@@@@84@@@@@@66@@38@@@@72.5@@86@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ @@@@@@@@@@(max-width: 1279px) 65.99 @@@@@@@@@@https://cdn.prod.website- @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@Aut 54.243.86.28 2025-05-13 03:17:37 Tue, 13 May 2025 03:17:37 GMT text/html 93ef00fbdbcac998-IAD HIT 247862 Sat, 10 May 2025 06:26:35 GMT frame-ancestors 'self' max-age=432000 www.langchain.com 65b8cd72835ceeacd SAMEORIGIN 59418863-7ac9-4380-a760-0e3817a430c Accept-Encoding _cfuvid=l5QDJcN0tziza860Y8K9y2SRXZo h3=":443"; ma=86400 us-east-1-prod-hosting-red text/html,application/xhtml+xml,app en advertools/0.16.6 gzip, deflate, zstd nan nan nan nan
1 https://www.langchain.com/contact-s Talk to our team You can expect a conversation with width=device-width, initial-scale=1 utf-8 Talk to our team Trusted by the best teams building nan https://www.langchain.com/contact-s Talk to our team You can expect a conversation with https://cdn.prod.website-files.com/ website summary_large_image Trusted by the best teams building 40694 180 www.langchain.com 0.053534 1 200 https://www.langchain.com/@@https:/ @@LangGraph@@LangSmith False@@False@@False@@False@@False@@ https://www.langchain.com/langgraph LangGraph@@LangSmith@@LangChain@@Re False@@False@@False@@False@@False@@ nan nan nan nan nan nan https://cdn.prod.website-files.com/ lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ @@@@@@@@@@@@@@@@@@1 @@@@@@@@@@@@@@@@@@ @@@@@@@@@@@@@@@@(max-width: 1919px) @@@@@@@@@@@@@@@@https://cdn.prod.we @@@@@@@@@@@@@@@@@@1 54.243.86.28 2025-05-13 03:17:37 Tue, 13 May 2025 03:17:37 GMT text/html 93ef00fd9ca9f28b-IAD HIT 798472 Sat, 03 May 2025 21:29:45 GMT frame-ancestors 'self' max-age=2147483647 www.langchain.com 65b8cd72835ceeacd SAMEORIGIN 9b369e26-9fb7-4b62-af02-a4b8bcea45d Accept-Encoding _cfuvid=MO5ivQuIZ0V.alvdhTofYmEnlk6 h3=":443"; ma=86400 us-east-1-prod-hosting-red text/html,application/xhtml+xml,app en advertools/0.16.6 gzip, deflate, zstd https://www.langchain.com/ nan nan nan
2 https://www.langchain.com/resources Resources Curated content for the AI engineer width=device-width, initial-scale=1 utf-8 Resources Built with LangGraph@@Built with La nan https://www.langchain.com/resources Resources Curated content for the AI engineer https://cdn.prod.website-files.com/ website summary_large_image Resources 62532 180 www.langchain.com 0.031961 1 200 https://www.langchain.com/@@https:/ @@LangGraph@@LangSmith False@@False@@False@@False@@False@@ https://www.langchain.com/langgraph LangGraph@@LangSmith@@LangChain@@Re False@@False@@False@@False@@False@@ https://www.langchain.com/built-wit Use cases & inspirationUpcomingBuil False@@False@@False@@False@@False@@ https://www.langchain.com/langchain LangChain@@LangSmith@@LangGraph@@Ag False@@False@@False@@False@@False@@ https://cdn.prod.website-files.com/ lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ @@@@@@@@@@@@@@@@@@@@@@@@@@1 Built with LangGraph@@Built with La 100vw@@100vw@@100vw@@100vw@@100vw@@ https://cdn.prod.website-files.com/ @@@@@@@@@@@@@@@@@@@@@@@@@@1 54.243.86.28 2025-05-13 03:17:37 Tue, 13 May 2025 03:17:37 GMT text/html 93ef00fdde1ac957-IAD HIT 73040 Mon, 12 May 2025 07:00:17 GMT frame-ancestors 'self' max-age=86383 www.langchain.com 65b8cd72835ceeacd SAMEORIGIN a5c89e0e-9196-4be5-b41c-041765aab03 Accept-Encoding _cfuvid=TdQBy1PjEDzsHLwBgiCjlNdsC_n h3=":443"; ma=86400 us-east-1-prod-hosting-red text/html,application/xhtml+xml,app en advertools/0.16.6 gzip, deflate, zstd https://www.langchain.com/ nan nan nan
3 https://www.langchain.com/pricing-l LangGraph Platform Pricing LangGraph Platform plans for teams width=device-width, initial-scale=1 utf-8 LangGraph Platform plansfor teams o LangSmith for Startups and Educatio nan https://www.langchain.com/pricing-l LangGraph Platform Pricing LangGraph Platform plans for teams https://cdn.prod.website-files.com/ website summary_large_image LangGraph Platform plans for teams 91367 180 www.langchain.com 0.1007 1 200 https://www.langchain.com/@@https:/ @@LangGraph@@LangSmith False@@False@@False@@False@@False@@ https://www.langchain.com/langgraph LangGraph@@LangSmith@@LangChain@@Re False@@False@@False@@False@@False@@ https://langchain-ai.github.io/lang Get started@@Get started@@Contact u False@@False@@False@@False@@False@@ https://www.langchain.com/langchain LangChain@@LangSmith@@LangGraph@@Ag False@@False@@False@@False@@False@@ https://cdn.prod.website-files.com/ lazy@@ @@1 @@ nan nan @@1 54.243.86.28 2025-05-13 03:17:37 Tue, 13 May 2025 03:17:37 GMT text/html 93ef00fdea022d0f-IAD HIT 86950 Mon, 12 May 2025 03:08:27 GMT frame-ancestors 'self' max-age=432000 www.langchain.com 65b8cd72835ceeacd SAMEORIGIN 75d9c6c2-17b1-44f5-9090-efe59cf4db1 Accept-Encoding _cfuvid=BM46MCuUt4XXJJ5ZJu3TBt3DjQq h3=":443"; ma=86400 us-east-1-prod-hosting-red text/html,application/xhtml+xml,app en advertools/0.16.6 gzip, deflate, zstd https://www.langchain.com/ nan nan nan
4 https://www.langchain.com/langchain LangChain An all-in-one developer platform fo width=device-width, initial-scale=1 utf-8 The largest community building the A complete set of interoperable bui Why choose LangChain? https://www.langchain.com/langchain LangChain An all-in-one developer platform fo https://cdn.prod.website-files.com/ website summary_large_image The largest community building the 65327 180 www.langchain.com 0.0322678 1 200 https://www.langchain.com/@@https:/ @@LangGraph@@LangSmith False@@False@@False@@False@@False@@ https://www.langchain.com/langgraph LangGraph@@LangSmith@@LangChain@@Re False@@False@@False@@False@@False@@ https://python.langchain.com/docs/t Get started with Python@@Get starte False@@False https://www.langchain.com/langchain LangChain@@LangSmith@@LangGraph@@Ag False@@False@@False@@False@@False@@ https://cdn.prod.website-files.com/ lazy@@lazy@@lazy@@lazy@@lazy@@lazy@ 656@@Auto@@Auto@@Auto@@Auto@@Auto@@ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ (max-width: 767px) 100vw, 656px@@@@ https://cdn.prod.website-files.com/ @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 54.243.86.28 2025-05-13 03:17:37 Tue, 13 May 2025 03:17:37 GMT text/html 93ef00fe0eb8c957-IAD HIT 86944 Mon, 12 May 2025 03:08:33 GMT frame-ancestors 'self' max-age=432000 www.langchain.com 65b8cd72835ceeacd SAMEORIGIN 8a40c6f8-f244-4511-8ac3-b9b553a12cc Accept-Encoding _cfuvid=CuzspoSNIU_ALUsqpea_wSMI2rR h3=":443"; ma=86400 us-east-1-prod-hosting-red text/html,application/xhtml+xml,app en advertools/0.16.6 gzip, deflate, zstd https://www.langchain.com/ nan nan nan

The WebsiteLoader class is a thin wrapper that provides this rich representation as LangChain Document objects. Documents are read lazily, and each one contains all the available crawl data under its metadata:

>>> from langchain_advertools import WebsiteLoader
>>> loader = WebsiteLoader("langchain.jsonl")  # the crawl is a separate step that has already happened
>>> lazy = loader.lazy_load()
>>> home = next(lazy)
>>> home.id
'https://www.langchain.com/'

>>> home.page_content[:800]
LangChains suite of products supports developers along each step of the LLM application lifecycle. Applications that can reason. Powered by LangChain. Get a demo Sign up to be the first to access recordings from  Interrupt, The AI Agent Conference ! Learn More From startups to global enterprises,  ambitious builders choose  LangChain products. Build LangChain is a composable framework to build with LLMs. LangGraph is the orchestration framework for controllable agentic workflows. Run Deploy your LLM applications at scale with LangGraph Platform, our infrastructure purpose-built for agents. Manage LangSmith is a unified agent observability and evals platform to optimize the performance of your AI agents - whether they're built with a LangChain framework or not.  Build your app with LangChain ...

>>> home.metadata.keys()
dict_keys(['title', 'meta_desc', 'viewport', 'charset', 'h1', 'h2', 'h3', 'canonical', 'og:title', 'og:description', 'og:image', 'og:type', 'twitter:card', 'size', 'download_timeout', 'download_slot', 'download_latency', 'depth', 'status', 'links_url', 'links_text', 'links_nofollow', 'nav_links_url', 'nav_links_text', 'nav_links_nofollow', 'header_links_url', 'header_links_text', 'header_links_nofollow', 'footer_links_url', 'footer_links_text', 'footer_links_nofollow', 'img_src', 'img_loading', 'img_width', 'img_alt', 'img_sizes', 'img_srcset', 'img_height', 'ip_address', 'crawl_time', 'resp_headers_Date', 'resp_headers_Content-Type', 'resp_headers_Cf-Ray', 'resp_headers_Cf-Cache-Status', 'resp_headers_Age', 'resp_headers_Last-Modified', 'resp_headers_Content-Security-Policy', 'resp_headers_Surrogate-Control', 'resp_headers_Surrogate-Key', 'resp_headers_X-Frame-Options', 'resp_headers_X-Lambda-Id', 'resp_headers_Vary', 'resp_headers_Set-Cookie', 'resp_headers_Alt-Svc', 'resp_headers_X-Cluster-Name', 'request_headers_Accept', 'request_headers_Accept-Language', 'request_headers_User-Agent', 'request_headers_Accept-Encoding'])
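The session above suggests how a crawl row maps to a document: the url appears as the id, the body text as page_content, and the remaining columns as metadata. The following is a minimal sketch of that mapping using only the standard library. It is not the package's actual implementation: SimpleDocument and lazy_load_crawl are hypothetical names, the mapping is inferred from the output shown above, and the sample rows are made up for illustration.

```python
import json
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Iterator


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""

    id: str
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def lazy_load_crawl(path: str) -> Iterator[SimpleDocument]:
    """Yield one document per line of a JSON Lines crawl file, reading lazily."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            # Assumed mapping: url -> id, body_text -> page_content,
            # every other key -> metadata.
            url = row.pop("url")
            body = row.pop("body_text", "")
            yield SimpleDocument(id=url, page_content=body, metadata=row)


# Tiny hand-written sample standing in for real crawl output.
sample_rows = [
    {"url": "https://example.com/", "title": "Home", "body_text": "Welcome", "status": 200},
    {"url": "https://example.com/about", "title": "About", "body_text": "About us", "status": 200},
]
sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
sample_path.write_text("\n".join(json.dumps(r) for r in sample_rows), encoding="utf-8")

docs = lazy_load_crawl(str(sample_path))
first = next(docs)  # only the first line has been parsed at this point
print(first.id)
print(first.page_content)
print(sorted(first.metadata))
```

Because the function is a generator, a large crawl file is never loaded into memory all at once, which is the practical benefit of lazy loading.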

We can now explore the rich metadata, which tells us a lot about the crawled webpage:

>>> home.metadata['title']
'LangChain'
>>> home.metadata['h1']
'Applications that can reason. Powered by LangChain.'
>>> home.metadata['h2'].split('@@') # multiple elements on the same page are delimited with @@
['From startups to global enterprises, ambitious builders choose LangChain products.', 'Build your app with LangChain', 'Run at scale with LangGraph\xa0Platform', 'Manage agent observability & performance with\xa0LangSmith', 'The reference architecture enterprises adopt for success.', 'The biggest developer community in GenAI', "Get started with LangChain's suite of products.", 'Get inspired by companies who have done it.', 'Ready to start shipping \u2028reliable GenAI apps faster?']
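Since many metadata fields pack several values into one "@@"-delimited string, a small helper can make the splitting reusable. This is a sketch only: split_multi is a hypothetical name and not part of advertools or this package, and treating None or NaN as "no values" is an assumption about how missing fields may appear (NaN is how pandas shows them in the DataFrame above).

```python
import math
from typing import Any

DELIMITER = "@@"  # delimiter used for multi-valued fields in the crawl output


def split_multi(value: Any, sep: str = DELIMITER) -> list[str]:
    """Split a multi-valued crawl field into a list; missing values become []."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []  # assumed representation of a missing field
    return str(value).split(sep)


h2 = "Build your app with LangChain@@Run at scale@@Manage agents"
print(split_multi(h2))
# ['Build your app with LangChain', 'Run at scale', 'Manage agents']

# Empty positions are kept, so values stay aligned with sibling fields.
print(split_multi("35.5@@@@84"))
# ['35.5', '', '84']

print(split_multi(float("nan")))
# []
```

Keeping the empty strings matters for fields such as img_width, where a blank entry means "this image has no width attribute" and its position still corresponds to the same image in img_src.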

>>> home.metadata['links_url'].split('@@')[:10]
['https://www.langchain.com/', 'https://www.langchain.com/langgraph', 'https://www.langchain.com/langsmith', 'https://www.langchain.com/langchain', 'https://www.langchain.com/resources', 'https://blog.langchain.dev/', 'https://www.langchain.com/customers', 'https://academy.langchain.com/', 'https://www.langchain.com/community', 'https://www.langchain.com/experts']
>>> home.metadata['links_text'].split('@@')[:10]
['\n\n\n\n\n\n\n\n\n\n\n\n\n', 'LangGraph', 'LangSmith', 'LangChain', 'Resources Hub', 'Blog', 'Customer Stories', 'LangChain Academy', 'Community', 'Experts']
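In the output above, links_url and links_text appear to be aligned by position (the second URL goes with the text 'LangGraph', and so on). Assuming that alignment holds, the two fields can be zipped into (url, text) pairs. This is an illustrative sketch: Link and pair_links are hypothetical names, and the sample metadata is abbreviated from the session above.

```python
from itertools import zip_longest
from typing import Any, NamedTuple


class Link(NamedTuple):
    url: str
    text: str


def pair_links(metadata: dict[str, Any], prefix: str = "links", sep: str = "@@") -> list[Link]:
    """Pair each URL with its anchor text, assuming the fields are position-aligned."""
    urls = metadata.get(f"{prefix}_url")
    texts = metadata.get(f"{prefix}_text")
    if not isinstance(urls, str) or not urls:
        return []  # field missing or empty
    if not isinstance(texts, str):
        texts = ""
    # zip_longest pads with "" if one field has fewer entries than the other.
    pairs = zip_longest(urls.split(sep), texts.split(sep), fillvalue="")
    # Strip whitespace-only anchor text (e.g. a logo link with no visible text).
    return [Link(url, text.strip()) for url, text in pairs]


# Abbreviated sample based on the home page metadata shown above.
sample_metadata = {
    "links_url": "https://www.langchain.com/@@https://www.langchain.com/langgraph@@https://www.langchain.com/langsmith",
    "links_text": "\n\n\n@@LangGraph@@LangSmith",
}

for link in pair_links(sample_metadata):
    print(link.url, repr(link.text))
```

The same helper can be pointed at the other link groups by changing the prefix, for example pair_links(metadata, prefix="nav_links") or prefix="footer_links".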

Download files

Source Distribution

langchain_advertools-0.0.2.tar.gz (41.8 kB)

Built Distribution

langchain_advertools-0.0.2-py3-none-any.whl (5.7 kB)
