mrkdwn_analysis

mrkdwn_analysis is a Python library for analyzing Markdown files. It extracts and categorizes the elements of a Markdown document, including headers, sections, links, images, blockquotes, code blocks, and lists. This makes it useful for data analysis, content generation, or building other tools that consume Markdown.

Features

  • File Loading: The MarkdownAnalyzer can load any given Markdown file provided through the file path.

  • Header Identification: The tool can extract all headers from the Markdown file, from H1 to H6. This gives users a quick overview of the document's structure.

  • Section Identification: The analyzer can recognize different sections of the document. It defines a section as a block of text followed by a line composed solely of = or - characters.

  • Paragraph Identification: The tool can distinguish between regular text and other elements such as lists, headers, etc., thereby identifying all the paragraphs present in the document.

  • Blockquote Identification: The analyzer can identify and extract all blockquotes in the Markdown file.

  • Code Block Identification: The tool can extract all code blocks defined in the document, allowing you to separate the programming code from the regular text easily.

  • List Identification: The analyzer can identify both ordered and unordered lists in the Markdown file, providing information about the hierarchical structure of the list items.

  • Table Identification: The tool can identify and extract tables from the Markdown file, enabling users to separate and analyze tabular data quickly.

  • Link Identification and Validation: The analyzer can identify all links present in the Markdown file, categorizing them into text links and image links. It can also check whether these links are broken.

  • Todo Identification: The tool is capable of recognizing and extracting todos (tasks or action items) present in the document.

  • Element Counting: The analyzer can count the total number of a specific element type in the file. This can help in quantifying the extent of different elements in the document.

  • Word Counting: The tool can count the total number of words in the file, providing an estimate of the document's length.

  • Character Counting: The analyzer can count the total number of characters (excluding spaces) in the file, giving a detailed measure of the document's size.

Installation

You can install mrkdwn_analysis from PyPI:

pip install mrkdwn_analysis

We hope mrkdwn_analysis helps you with all your Markdown analyzing needs!

Usage

Using mrkdwn_analysis is simple. Just import the MarkdownAnalyzer class, create an instance with your Markdown file, and you're good to go!

from mrkdwn_analysis import MarkdownAnalyzer

analyzer = MarkdownAnalyzer("path/to/your/markdown.md")

headers = analyzer.identify_headers()
sections = analyzer.identify_sections()
...

Class MarkdownAnalyzer

The MarkdownAnalyzer class is designed to analyze Markdown files. It has the ability to extract and categorize various elements of a Markdown document.

__init__(self, file_path)

The constructor of the class. It opens the specified Markdown file and stores its content line by line.

  • file_path: the path of the Markdown file to analyze.

identify_headers(self)

Analyzes the file and identifies all headers (from h1 to h6). Headers are returned as a dictionary where the key is "Header" and the value is a list of all headers found.
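
The header rule can be illustrated with a short standalone sketch. This is not the library's implementation: the function name find_headers, the regular expression, and the choice to store only the header text are assumptions made for the example.

```python
import re

# ATX-style header: 1 to 6 '#' characters, at least one space, then the text.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def find_headers(lines):
    """Return {"Header": [...]} holding the text of each h1-h6 header line."""
    headers = []
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            headers.append(match.group(2))
    return {"Header": headers}


sample = ["# Title", "Some text", "## Usage", "####### too deep"]
print(find_headers(sample))  # → {'Header': ['Title', 'Usage']}
```

Seven or more # characters do not match, because Markdown only defines six header levels. The sketch does not skip lines inside code blocks, so a # comment in a code block would be reported as a header.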

identify_sections(self)

Analyzes the file and identifies all sections. Sections are defined as a block of text followed by a line composed solely of = or - characters. Sections are returned as a dictionary where the key is "Section" and the value is a list of all sections found.
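
A minimal sketch of the underline rule, written independently of the library. It assumes that the section name is the single non-empty line directly above the underline; find_sections and UNDERLINE_RE are hypothetical names.

```python
import re

# A line made only of '=' characters or only of '-' characters.
UNDERLINE_RE = re.compile(r"^(=+|-+)\s*$")


def find_sections(lines):
    """Return {"Section": [...]} for every text line followed by an underline."""
    sections = []
    for previous, current in zip(lines, lines[1:]):
        is_text = previous.strip() and not UNDERLINE_RE.match(previous)
        if is_text and UNDERLINE_RE.match(current):
            sections.append(previous.strip())
    return {"Section": sections}


sample = ["Introduction", "============", "", "Body text", "Details", "-------", "", "---"]
print(find_sections(sample))  # → {'Section': ['Introduction', 'Details']}
```

The final --- in the sample is ignored because the line above it is empty, which keeps a horizontal rule from being read as a section underline.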

identify_paragraphs(self)

Analyzes the file and identifies all paragraphs. Paragraphs are defined as a block of text that is not a header, list, blockquote, etc. Paragraphs are returned as a dictionary where the key is "Paragraph" and the value is a list of all paragraphs found.
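
One way to picture this rule is to treat a paragraph as a run of consecutive non-empty lines that do not start another element. The sketch below is an assumption-laden illustration, not the library's code: find_paragraphs and NON_PARAGRAPH_RE are hypothetical, and code blocks and underlined sections are not handled.

```python
import re

# Line prefixes that start another element: header, blockquote,
# unordered or ordered list item, table row.
NON_PARAGRAPH_RE = re.compile(r"^\s*(#{1,6}\s|>|[-*+]\s|\d+\.\s|\|)")


def find_paragraphs(lines):
    """Return {"Paragraph": [...]}, joining consecutive plain-text lines."""
    paragraphs, current = [], []
    for line in lines:
        if line.strip() and not NON_PARAGRAPH_RE.match(line):
            current.append(line.strip())
        elif current:
            # A blank line or another element closes the running paragraph.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return {"Paragraph": paragraphs}


sample = ["# Title", "First paragraph,", "continued here.", "", "- item", "> quote", "Second paragraph."]
print(find_paragraphs(sample))
```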

identify_blockquotes(self)

Analyzes the file and identifies all blockquotes. Blockquotes are defined by a line starting with the > character. Blockquotes are returned as a dictionary where the key is "Blockquote" and the value is a list of all blockquotes found.
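
As a rough illustration of the > rule, the following standalone sketch groups consecutive quoted lines into one blockquote. The grouping behavior and the name find_blockquotes are assumptions; the library may instead return each quoted line separately.

```python
def find_blockquotes(lines):
    """Return {"Blockquote": [...]}, one entry per run of '>' lines."""
    quotes, current = [], []
    for line in lines:
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Drop the '>' marker and surrounding whitespace.
            current.append(stripped[1:].strip())
        elif current:
            quotes.append(" ".join(current))
            current = []
    if current:
        quotes.append(" ".join(current))
    return {"Blockquote": quotes}


sample = ["> First line", "> second line", "", "plain", "> Another quote"]
print(find_blockquotes(sample))
# → {'Blockquote': ['First line second line', 'Another quote']}
```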

identify_code_blocks(self)

Analyzes the file and identifies all code blocks. Code blocks are defined by a block of text surrounded by lines containing only the "```" text. Code blocks are returned as a dictionary where the key is "Code block" and the value is a list of all code blocks found.
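
The fence rule can be sketched as a small state machine that toggles between "inside" and "outside" a block. This is an illustration only, with hypothetical names. Unlike the rule stated above, it also accepts an opening fence that carries a language name, and it discards a block that is never closed.

```python
FENCE = "`" * 3  # the three-backtick fence marker


def find_code_blocks(lines):
    """Return {"Code block": [...]} with the body of each fenced block."""
    blocks = []
    current = None  # None means we are outside a fenced block
    for line in lines:
        if line.strip().startswith(FENCE):
            if current is None:
                current = []  # opening fence
            else:
                blocks.append("\n".join(current))  # closing fence
                current = None
        elif current is not None:
            current.append(line)
    return {"Code block": blocks}


sample = ["text", FENCE + "python", "x = 1", "print(x)", FENCE, "more", FENCE, "unclosed"]
print(find_code_blocks(sample))  # → {'Code block': ['x = 1\nprint(x)']}
```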

identify_ordered_lists(self)

Analyzes the file and identifies all ordered lists. Ordered lists are defined by lines starting with a number followed by a dot. Ordered lists are returned as a dictionary where the key is "Ordered list" and the value is a list of all ordered lists found.

identify_unordered_lists(self)

Analyzes the file and identifies all unordered lists. Unordered lists are defined by lines starting with a -, *, or +. Unordered lists are returned as a dictionary where the key is "Unordered list" and the value is a list of all unordered lists found.
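
Both list rules can be sketched with one helper and two patterns. The code below is a hedged illustration rather than the library's implementation: find_lists, ORDERED_RE, and UNORDERED_RE are hypothetical, and grouping consecutive items into one list is an assumption.

```python
import re

ORDERED_RE = re.compile(r"^\s*\d+\.\s+(.*)$")      # "1. item"
UNORDERED_RE = re.compile(r"^\s*[-*+]\s+(.*)$")    # "- item", "* item", "+ item"


def find_lists(lines, pattern):
    """Group consecutive lines matching `pattern` into lists of item texts."""
    lists, current = [], []
    for line in lines:
        match = pattern.match(line)
        if match:
            current.append(match.group(1))
        elif current:
            lists.append(current)
            current = []
    if current:
        lists.append(current)
    return lists


sample = ["1. one", "2. two", "", "- apple", "* pear", "text", "+ plum"]
ordered = {"Ordered list": find_lists(sample, ORDERED_RE)}
unordered = {"Unordered list": find_lists(sample, UNORDERED_RE)}
print(ordered)    # → {'Ordered list': [['one', 'two']]}
print(unordered)  # → {'Unordered list': [['apple', 'pear'], ['plum']]}
```

Because the marker must be followed by whitespace, a horizontal rule such as --- is not mistaken for a list item. Todo lines (- [ ] ...) do match the unordered pattern in this sketch.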

identify_tables(self)

Analyzes the file and identifies all tables. Tables are defined by lines that use | to delimit cells, together with separator lines made of - characters that mark the borders. Tables are returned as a dictionary where the key is "Table" and the value is a list of all tables found.
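
A possible reading of the table rule is the common pipe-table layout: a header row, a separator row of - characters, then body rows. The sketch below follows that reading. The names, the separator pattern, and the header/rows dictionary are all assumptions and say nothing about the structure the library returns.

```python
import re

# Separator row such as "|------|----:|" (optional ':' marks alignment).
SEPARATOR_RE = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def split_row(line):
    """Split one table line into stripped cell texts."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def find_tables(lines):
    """Return {"Table": [...]}, each table a dict with 'header' and 'rows'."""
    tables = []
    i = 0
    while i + 1 < len(lines):
        separator = lines[i + 1]
        if "|" in lines[i] and "|" in separator and SEPARATOR_RE.match(separator):
            header = split_row(lines[i])
            rows = []
            i += 2
            # Body rows continue until a line without a cell delimiter.
            while i < len(lines) and "|" in lines[i]:
                rows.append(split_row(lines[i]))
                i += 1
            tables.append({"header": header, "rows": rows})
        else:
            i += 1
    return {"Table": tables}


sample = ["| Name | Qty |", "|------|----:|", "| nut  | 3   |", "| bolt | 12  |", "", "not | a table"]
print(find_tables(sample))
```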

identify_links(self)

Analyzes the file and identifies all links. Links are defined by the format [text](url). Links are returned as a dictionary where the keys are "Text link" and "Image link" and the values are lists of all links found.
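
The [text](url) format, and its image variant ![alt](url), can be matched with two regular expressions. This is a standalone sketch under stated assumptions: the names are hypothetical, and the line/text/url fields of each entry are chosen for illustration, not taken from the library.

```python
import re

IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# Same shape, but the '[' must not be preceded by '!'.
TEXT_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]+)\)")


def find_links(lines):
    """Return {"Text link": [...], "Image link": [...]} with line numbers."""
    result = {"Text link": [], "Image link": []}
    for number, line in enumerate(lines, start=1):
        for text, url in TEXT_RE.findall(line):
            result["Text link"].append({"line": number, "text": text, "url": url})
        for alt, url in IMAGE_RE.findall(line):
            result["Image link"].append({"line": number, "text": alt, "url": url})
    return result


sample = [
    "See [the docs](https://example.com/docs).",
    "![logo](img/logo.png) and [home](https://example.com)",
]
print(find_links(sample))
```

Nested constructs, such as an image wrapped in a link, are not handled by these patterns.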

check_links(self)

Checks all links identified by identify_links to see whether they are broken, that is, whether they return a 404 error. Broken links are returned as a list, each item being a dictionary containing the line number, link text, and URL.
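
A 404 check of this kind can be sketched with the standard library alone. The code below is an assumption-based illustration, not the library's implementation: http_status and find_broken_links are hypothetical names, and using a HEAD request is a design choice made here (some servers reject HEAD, in which case a GET would be needed).

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered with an error status, e.g. 404
    except (urllib.error.URLError, ValueError, OSError):
        return None  # DNS failure, timeout, malformed URL, ...


def find_broken_links(links, get_status=http_status):
    """Return the entries of `links` whose URL answers with status 404."""
    return [link for link in links if get_status(link["url"]) == 404]


# Offline demonstration: a dictionary lookup stands in for real requests.
links = [
    {"line": 3, "text": "docs", "url": "https://example.com/docs"},
    {"line": 9, "text": "old page", "url": "https://example.com/gone"},
]
fake_statuses = {"https://example.com/docs": 200, "https://example.com/gone": 404}
broken = find_broken_links(links, get_status=fake_statuses.get)
print(broken)  # → [{'line': 9, 'text': 'old page', 'url': 'https://example.com/gone'}]
```

Passing the status lookup as a parameter keeps the check testable without network access.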

identify_todos(self)

Analyzes the file and identifies all todos. Todos are defined by lines starting with - [ ] (a hyphen followed by an empty checkbox). Todos are returned as a list, each item being a dictionary containing the line number and todo text.
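
The todo rule can be sketched as follows. The function name, the dictionary keys line and text, and the decision to accept indented todos are assumptions made for this example.

```python
import re

# "- [ ] " followed by the task text; completed items ("- [x]") do not match.
TODO_RE = re.compile(r"^\s*-\s\[ \]\s+(.*)$")


def find_todos(lines):
    """Return a list of {"line": n, "text": ...} for each open todo."""
    todos = []
    for number, line in enumerate(lines, start=1):
        match = TODO_RE.match(line)
        if match:
            todos.append({"line": number, "text": match.group(1).strip()})
    return todos


sample = ["- [ ] write tests", "- [x] set up CI", "- plain item", "  - [ ] nested task"]
print(find_todos(sample))
# → [{'line': 1, 'text': 'write tests'}, {'line': 4, 'text': 'nested task'}]
```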

count_elements(self, element_type)

Counts the total number of a specific element type in the file. The element_type should match the name of one of the identification methods (for example, "headers" for identify_headers). Returns the total number of elements of this type.
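
Mapping an element_type string to the matching identify_ method could look like the sketch below. It is purely illustrative: count_by_type and StubAnalyzer are hypothetical, and the way results are totaled (summing every list in a returned dictionary) is an assumption, not documented behavior.

```python
def count_by_type(analyzer, element_type):
    """Call analyzer.identify_<element_type>() and count what it returns."""
    method = getattr(analyzer, f"identify_{element_type}", None)
    if method is None:
        raise ValueError(f"unknown element type: {element_type!r}")
    found = method()
    if isinstance(found, dict):
        # Dictionary results hold one list per key; add them up.
        return sum(len(items) for items in found.values())
    return len(found)


class StubAnalyzer:
    """Stand-in object that mimics the documented return shapes."""

    def identify_headers(self):
        return {"Header": ["Title", "Usage"]}

    def identify_links(self):
        return {"Text link": ["a"], "Image link": ["b", "c"]}

    def identify_todos(self):
        return [{"line": 1, "text": "write tests"}]


stub = StubAnalyzer()
print(count_by_type(stub, "headers"))  # → 2
print(count_by_type(stub, "links"))    # → 3
print(count_by_type(stub, "todos"))    # → 1
```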

count_words(self)

Counts the total number of words in the file. Returns the word count.

count_characters(self)

Counts the total number of characters (excluding spaces) in the file. Returns the character count.
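
For reference, both counts can be reproduced in a few lines of plain Python. This sketch assumes that words are whitespace-separated tokens and that "excluding spaces" covers all whitespace, including newlines; the library's own definitions may differ, and the function names here are hypothetical.

```python
def word_count(text):
    """Count whitespace-separated tokens, Markdown markers included."""
    return len(text.split())


def character_count(text):
    """Count every character that is not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


sample = "# Title\n\nHello, Markdown world!\n"
print(word_count(sample))       # → 5
print(character_count(sample))  # → 26
```

Note that the # marker counts as a word in this sketch, since no Markdown syntax is stripped before counting.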

Contributions

Contributions are always welcome! If you have a feature request, bug report, or just want to improve the code, feel free to create a pull request or open an issue.

