A package to parse raw HTML and return structured information.

html2info

html2info is a Python package that parses LinkedIn and Kaggle profile pages from raw HTML and returns structured information in JSON format.

Features

  • Extracts profile information such as name, title, location, profile photo, about, experience, and education.
  • Returns a JSON object containing the parsed data.

Installation

Install html2info using pip:

pip install html2info

Usage

The examples below show how to use html2info for each supported site:

LinkedIn

from html2info.linkedin import Person

url = "https://www.linkedin.com/in/iglovikov"
raw_data = "..."  # Raw HTML content of the LinkedIn page

person = Person(url, raw_data)
person.parse()
print(person.to_dict())

Example output:

{
    "linkedin_url": "https://www.linkedin.com/in/iglovikov",
    "name": "Vladimir Iglovikov",
    "title": "Kaggle Grandmaster. Co-creator of Albumentations.AI",
    "location": "San Francisco, California, United States",
    "profile_photo_link": "https://media.licdn.com/dms/image/C4D03AQFDvheHDkAQlw/profile-displayphoto-shrink_400_400/0/1654539436934?e=1687392000&v=beta&t=OX7WrIprduo-xWEvrRKNzYdqcqG6bdzDtlm6LWuHbIE",
    "about": "• Advisor and Angel investor.\n• Co-creator, Albumentations.AI: Open-source library with 30k daily downloads, adopted by top Computer Vision companies & Kaggle competition winners\n• Former Staff ML Engineer, Lyft Level5 (Autonomous Vehicles): Led Deep Learning model development & integration for Self-Driving & Ride Sharing\n• Kaggle Grandmaster: Multiple ML competition wins\n• Author: 20+ publications in Deep Learning for Medical, Satellite, Street View, and Natural Images",
    "experience": [
      {
        "title": "Chief Executive Officer",
        "company": "Ternaus Inc · Full-time",
        "image_link": null,
        "company_link": "https://www.linkedin.com/search/results/all/?keywords=Ternaus+Inc",
        "dates": "Aug 2022 - Present · 9 mos",
        "description": null
      },
      {
        "title": "Evangelist",
        "company": "OpenDataScience",
        "image_link": "https://media.licdn.com/dms/image/C510BAQFU1fTt5tE6Ug/company-logo_100_100/0/1554042536921?e=1689811200&v=beta&t=-sIbC_T8hZjxf5TNgO_H0ClRcYb7Y_oow6dAdW8xMHg",
        "company_link": "https://www.linkedin.com/company/11241268/",
        "dates": "Aug 2016 - Mar 2022 · 5 yrs 8 mos",
        "description": "OpenDataScience, or ODS, is a Russian-speaking community of over 50,000 data scientists, researchers, and engineers. ODS freely disseminates knowledge, and promotes professional development and exchange of ideas and opportunities in all areas of Data Science through live events, online classes and discussions, and other resources. Join us at http://ods.ai."
      },
      {
        "title": "Staff ML Engineer",
        "company": "Lyft · Full-time",
        "image_link": "https://media.licdn.com/dms/image/C560BAQFoMDej0VdZVA/company-logo_100_100/0/1545416046198?e=1689811200&v=beta&t=JV79uOIdgcbYcAeg0YAklLLZ6c5VkldGSG-Zu3G42xI",
        "company_link": "https://www.linkedin.com/company/2620735/",
        "dates": "Oct 2017 - Aug 2021 · 3 yrs 11 mos",
        "description": null
      },
      {
        "title": "Advisor",
        "company": "Iterative.ai · Part-time",
        "image_link": "https://media.licdn.com/dms/image/C4E0BAQGnnEVzx81kBg/company-logo_100_100/0/1653056165184?e=1689811200&v=beta&t=dNl2Q2CDgmX2r3KiymYIqjPtXJQXIYeTzgdNduZLLTs",
        "company_link": "https://www.linkedin.com/company/18657719/",
        "dates": "Nov 2018 - Nov 2020 · 2 yrs 1 mo",
        "description": null
      },
      {
        "title": "Senior Data Scientist (Machine Learning)",
        "company": "TrueAccord",
        "image_link": "https://media.licdn.com/dms/image/C560BAQEo_A523IxkGQ/company-logo_100_100/0/1656418732741?e=1689811200&v=beta&t=YRjhRCxnfijmSz40qvRCeKxkfoMHYGU1oiPGIJht-aw",
        "company_link": "https://www.linkedin.com/company/3249455/",
        "dates": "Jun 2016 - Sep 2017 · 1 yr 4 mos",
        "description": "Developed a supervised machine learning algorithm that predicts what personalized emails should be sent to each user to drive him to the target website. ROC AUC score 0.88. Prototyped, implemented, deployed and tested machine learning algorithm that helped to prioritize outbound phone traffic, improving conversion through phone calls by 80%."
      }
    ],
    "education_list": [
      {
        "university_name": "University of California, Davis",
        "degree_and_major": "Doctor of Philosophy (Ph.D.), Physics",
        "dates": "2010 - 2015",
        "university_link": "https://www.linkedin.com/company/2842/",
        "image_link": "https://media.licdn.com/dms/image/C4E0BAQEBG25KNBwuCQ/company-logo_100_100/0/1616103040374?e=1689811200&v=beta&t=sUF5ars4S8ek3vZs01usUvGwSJsU01KYtANnMkkZFdQ"
      },
      {
        "university_name": "Saint Petersburg State University",
        "degree_and_major": "Master's degree, Physics",
        "dates": "2001 - 2010",
        "university_link": "https://www.linkedin.com/company/15099991/",
        "image_link": "https://media.licdn.com/dms/image/C560BAQHWUjwogE235A/company-logo_100_100/0/1519863922741?e=1689811200&v=beta&t=DSpsTKY_AcMrmzWY1592EvCClph4M_TVOLdNSDpOg2I"
      }
    ]
}
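
In the output above, some fields combine two pieces of information separated by " · ", for example "Ternaus Inc · Full-time" in "company" and "Aug 2022 - Present · 9 mos" in "dates". The following is a minimal post-processing sketch that separates them. It works on a plain dictionary shaped like the example output and does not call html2info; the helper names split_on_dot and normalize_experience and the added keys "employment_type" and "duration" are hypothetical, not part of the package.

```python
from typing import Optional


def split_on_dot(value: str) -> tuple[str, Optional[str]]:
    """Split 'main · extra' into (main, extra); extra is None when absent."""
    main, sep, extra = value.partition(" · ")
    return main.strip(), (extra.strip() if sep else None)


def normalize_experience(entry: dict) -> dict:
    """Return a copy of one experience entry with the combined fields separated."""
    company, employment_type = split_on_dot(entry["company"])
    date_range, duration = split_on_dot(entry["dates"])
    return {
        **entry,
        "company": company,
        "employment_type": employment_type,  # hypothetical extra key
        "dates": date_range,
        "duration": duration,  # hypothetical extra key
    }


# Entry copied from the example output above.
entry = {
    "title": "Chief Executive Officer",
    "company": "Ternaus Inc · Full-time",
    "image_link": None,
    "company_link": "https://www.linkedin.com/search/results/all/?keywords=Ternaus+Inc",
    "dates": "Aug 2022 - Present · 9 mos",
    "description": None,
}

normalized = normalize_experience(entry)
print(normalized["company"], "|", normalized["employment_type"], "|", normalized["duration"])
```

Entries without a separator, such as "OpenDataScience", pass through unchanged with the second value set to None.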

Kaggle

from html2info.kaggle import Person

url = "https://www.kaggle.com/iglovikov"
raw_data = "..."  # Raw HTML content of the Kaggle page

person = Person(url, raw_data)
person.parse()
print(person.to_dict())

Example output:

{
    "url": "https://www.kaggle.com/iglovikov",
    "name": "Vladimir Iglovikov",
    "title": "CEO  at ternaus.com",
    "location": "San Francisco, California, United States",
    "profile_photo_link": "https://storage.googleapis.com/kaggle-avatars/images/286455-fb.jpg",
    "social_network_links": [
      "https://github.com/ternaus",
      "https://twitter.com/viglovikov",
      "https://www.linkedin.com/in/iglovikov",
      "https://salesbrain.tech/"
    ],
    "personal_website_link": "https://salesbrain.tech/",
    "num_followers": 1534,
    "competitions_summary": {
      "tier": "grandmaster",
      "tier_image": "/static/images/tiers/grandmaster@48.png",
      "medals": {
        "gold": 5,
        "silver": 9,
        "bronze": 8
      },
      "highest_rank": 19
    },
    "datasets_summary": {
      "tier": "contributor",
      "tier_image": "/static/images/tiers/contributor@48.png",
      "medals": {
        "gold": 0,
        "silver": 0,
        "bronze": 0
      },
      "highest_rank": -1
    },
    "notebooks_summary": {
      "tier": "contributor",
      "tier_image": "/static/images/tiers/contributor@48.png",
      "medals": {
        "gold": 1,
        "silver": 1,
        "bronze": 1
      },
      "highest_rank": -1
    },
    "discussion_summary": {
      "tier": "master",
      "tier_image": "/static/images/tiers/master@48.png",
      "medals": {
        "gold": 52,
        "silver": 26,
        "bronze": 177
      },
      "highest_rank": 6
    },
    "bio": "* CEO at Ternaus Inc\n* Staff Computer Vision Engineer at Level5 Engineering Center, Lyft Inc (2017-2021)\n* Senior Data Scientist at TrueAccord (2016-2017)\n* Data Scientist at Bidgely (2015-2016)\n* PhD in theoretical Condensed Matter Physics at University of California, Davis (2010-2015)\n* MS in theoretical High Energy Physics at Saint Petersburg State University (2001-2010)\n* Спецназ ВДВ . Медаль за воинскую доблесть за вторую Чеченскую. (2002-2004)\n"
}
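
The four "*_summary" sections above share the same "medals" layout, so they can be aggregated with a short loop. The sketch below assumes a dictionary shaped like the example output and does not call html2info; total_medals is a hypothetical helper, not part of the package. It also shows one way to serialize the result with the standard json module so that non-ASCII text, such as the Cyrillic line in "bio", stays readable.

```python
import json

# Section names taken from the example output above.
SUMMARY_KEYS = (
    "competitions_summary",
    "datasets_summary",
    "notebooks_summary",
    "discussion_summary",
)


def total_medals(profile: dict) -> dict:
    """Sum gold, silver, and bronze medals across all summary sections."""
    totals = {"gold": 0, "silver": 0, "bronze": 0}
    for key in SUMMARY_KEYS:
        medals = profile.get(key, {}).get("medals", {})
        for color in totals:
            totals[color] += medals.get(color, 0)
    return totals


# Trimmed copy of the example output above.
profile = {
    "name": "Vladimir Iglovikov",
    "competitions_summary": {"medals": {"gold": 5, "silver": 9, "bronze": 8}},
    "datasets_summary": {"medals": {"gold": 0, "silver": 0, "bronze": 0}},
    "notebooks_summary": {"medals": {"gold": 1, "silver": 1, "bronze": 1}},
    "discussion_summary": {"medals": {"gold": 52, "silver": 26, "bronze": 177}},
}

totals = total_medals(profile)
print(totals)

# ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
as_json = json.dumps(profile, ensure_ascii=False, indent=2)
print(as_json)
```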

Download files

Source Distribution

html2info-0.2.0.tar.gz (11.4 kB)

Uploaded Source

Built Distribution

html2info-0.2.0-py3-none-any.whl (9.1 kB)

Uploaded Python 3

File details

Details for the file html2info-0.2.0.tar.gz.

File metadata

  • Download URL: html2info-0.2.0.tar.gz
  • Upload date:
  • Size: 11.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for html2info-0.2.0.tar.gz
Algorithm    Hash digest
SHA256       49c8200eabde604577592ef3528629bcb3f368d48de8bfe0a4dddaf72730e9f4
MD5          a344b6ea913dd80da63f56da07522c0c
BLAKE2b-256  c41848a94e598852fd1c027913a01919a0f7147828b2fa705d177acb4261dc3d
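
A downloaded file can be checked against the SHA256 digest listed above using Python's standard hashlib module. This is a minimal sketch: the helper names sha256_of and matches_published_hash are hypothetical, and the local path html2info-0.2.0.tar.gz is an assumption about where the archive was saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "49c8200eabde604577592ef3528629bcb3f368d48de8bfe0a4dddaf72730e9f4"

# Assumed download location; adjust to wherever the archive was saved.
archive = Path("html2info-0.2.0.tar.gz")
if archive.exists():
    print("hash matches:", matches_published_hash(archive, EXPECTED_SHA256))
else:
    print("archive not found, nothing to verify")
```

The same helpers apply to the wheel file with its own SHA256 value.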

File details

Details for the file html2info-0.2.0-py3-none-any.whl.

File metadata

  • Download URL: html2info-0.2.0-py3-none-any.whl
  • Upload date:
  • Size: 9.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.10

File hashes

Hashes for html2info-0.2.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       13797a34bafb2d761db130bc086604f3842da7b7ab93ea4aab6d1832a34723dd
MD5          e459a26d668173826a67667973d40c61
BLAKE2b-256  4a3b3a85f6bb142a04ee9d0f9c7aa86e1e0c401900b5d683cd8b4080447f579a
