🚀 E2M: Everything to Markdown
E2M is a Python library that parses and converts various file types into Markdown. Built on a parser-converter architecture, it supports many formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a.
✨ The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning.
Core Architecture of the Project:
- Parser: Responsible for parsing various file types into text or image data.
- Converter: Responsible for converting text or image data into Markdown format.
Generally, for any type of file, the parser is run first to extract internal data such as text and images. Then, the converter is used to transform this data into Markdown format.
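As a minimal sketch of that two-step flow (assuming a local `./test.pdf` and a valid LLM API key; the individual classes are covered in the Quick Start sections below):

```python
from wisup_e2m import PdfParser, TextConverter

# Step 1: the parser extracts raw text from the file.
parser = PdfParser(engine="unstructured")
pdf_data = parser.parse("./test.pdf")

# Step 2: the converter refines the extracted text into Markdown.
converter = TextConverter(
    engine="litellm",
    model="deepseek/deepseek-chat",
    api_key="your api key",
)
print(converter.convert(pdf_data.text))
```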
📂 All Converters and Parsers
Parser

| Parser Type | Engine | Supported File Type |
|---|---|---|
| PdfParser | surya_layout, marker, unstructured | pdf |
| DocParser | xml | doc |
| DocxParser | xml | docx |
| PptParser | unstructured | ppt |
| PptxParser | unstructured | pptx |
| UrlParser | unstructured, jina, firecrawl | url |
| EpubParser | unstructured | epub |
| HtmlParser | unstructured | html, htm |
| VoiceParser | openai_whisper_api, openai_whisper_local, SpeechRecognition | mp3, m4a |
Converter

| Converter Type | Engine | Strategy |
|---|---|---|
| ImageConverter | litellm, zhipuai (performs poorly at image recognition; not recommended) | default |
| TextConverter | litellm, zhipuai | default |
📦 Installation
Create environment:

```bash
conda create -n e2m python=3.10
conda activate e2m
```

Update pip:

```bash
pip install --upgrade pip
```
Install E2M:

```bash
# Option 1: Install via pip
pip install wisup_e2m

# Option 2: Install via git
pip install git+https://github.com/wisupai/e2m.git --index-url https://pypi.org/simple

# Option 3: Manual installation
git clone https://github.com/wisupai/e2m.git
cd e2m
pip install poetry
poetry build
pip install dist/wisup_e2m-*-py3-none-any.whl  # installs whichever version poetry built
```
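A quick way to confirm the installation succeeded (a minimal check using the standard-library `importlib.metadata`; `wisup_e2m` is the distribution name used above):

```python
from importlib.metadata import version

import wisup_e2m  # raises ImportError if the installation failed

print(version("wisup_e2m"))
```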
⚡️ Parser Quick Start
Here are simple examples demonstrating how to use the E2M parsers:
📄 Pdf Parser
```python
from wisup_e2m import PdfParser

pdf_path = "./test.pdf"
parser = PdfParser(engine="marker")  # pdf engines: marker, unstructured, surya_layout
pdf_data = parser.parse(pdf_path)
print(pdf_data.text)
```
📝 Doc Parser
```python
from wisup_e2m import DocParser

doc_path = "./test.doc"
parser = DocParser(engine="xml")  # doc engines: xml
doc_data = parser.parse(doc_path)
print(doc_data.text)
```
📜 Docx Parser
```python
from wisup_e2m import DocxParser

docx_path = "./test.docx"
parser = DocxParser(engine="xml")  # docx engines: xml
docx_data = parser.parse(docx_path)
print(docx_data.text)
```
📚 Epub Parser
```python
from wisup_e2m import EpubParser

epub_path = "./test.epub"
parser = EpubParser(engine="unstructured")  # epub engines: unstructured
epub_data = parser.parse(epub_path)
print(epub_data.text)
```
🌐 Html Parser
```python
from wisup_e2m import HtmlParser

html_path = "./test.html"
parser = HtmlParser(engine="unstructured")  # html engines: unstructured
html_data = parser.parse(html_path)
print(html_data.text)
```
🔗 Url Parser
```python
from wisup_e2m import UrlParser

url = "https://www.example.com"
parser = UrlParser(engine="jina")  # url engines: jina, firecrawl, unstructured
url_data = parser.parse(url)
print(url_data.text)
```
🖼️ Ppt Parser
```python
from wisup_e2m import PptParser

ppt_path = "./test.ppt"
parser = PptParser(engine="unstructured")  # ppt engines: unstructured
ppt_data = parser.parse(ppt_path)
print(ppt_data.text)
```
🖼️ Pptx Parser
```python
from wisup_e2m import PptxParser

pptx_path = "./test.pptx"
parser = PptxParser(engine="unstructured")  # pptx engines: unstructured
pptx_data = parser.parse(pptx_path)
print(pptx_data.text)
```
🎤 Voice Parser
```python
from wisup_e2m import VoiceParser

voice_path = "./test.mp3"
parser = VoiceParser(
    engine="openai_whisper_local",  # voice engines: openai_whisper_api, openai_whisper_local
    model="large",  # available models: https://github.com/openai/whisper#available-models-and-languages
)
voice_data = parser.parse(voice_path)
print(voice_data.text)
```
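If you prefer the hosted Whisper service, the example `config.yaml` below lists an `openai_whisper_api` option for the voice parser. Here is a sketch passing the same settings directly to `VoiceParser` (the keyword names are mirrored from that config and should be treated as assumptions, not a verified signature):

```python
from wisup_e2m import VoiceParser

# Hosted-API variant; parameters mirror the voice_parser section of the
# example config.yaml below.
parser = VoiceParser(
    engine="openai_whisper_api",
    api_base="https://api.openai.com/v1",
    api_key="your_api_key",
    model="whisper",
)
voice_data = parser.parse("./test.mp3")
print(voice_data.text)
```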
🔄 Converter Quick Start
Here are simple examples demonstrating how to use the E2M converters:
📝 Text Converter
```python
from wisup_e2m import TextConverter

text = "Parsed text data from any parser"
converter = TextConverter(
    engine="litellm",  # text engines: litellm
    model="deepseek/deepseek-chat",
    api_key="your api key",
    base_url="your base url",
)
text_data = converter.convert(text)
print(text_data)
```
🖼️ Image Converter
```python
from wisup_e2m import ImageConverter

images = ["./test1.png", "./test2.png"]
converter = ImageConverter(
    engine="litellm",  # image engines: litellm
    model="gpt-4o",
    api_key="your api key",
    base_url="your base url",
)
image_data = converter.convert(images)
print(image_data)
```
🆙 Next Level
🛠️ E2MParser
`E2MParser` is an integrated parser that supports multiple file types. It can be used to parse a wide range of file types into Markdown format.
```python
from wisup_e2m import E2MParser

# Initialize the parser with your configuration file
ep = E2MParser.from_config("config.yaml")

# Parse the desired file
data = ep.parse(file_name="/path/to/file.pdf")

# Print the parsed data as a dictionary
print(data.to_dict())
```
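Because `E2MParser` dispatches by file type, one configured instance can sweep a mixed set of documents. A small sketch (the file paths here are hypothetical):

```python
from wisup_e2m import E2MParser

ep = E2MParser.from_config("config.yaml")

# Hypothetical mixed batch: the same instance handles each supported type.
for path in ["./report.pdf", "./notes.docx", "./talk.mp3"]:
    data = ep.parse(file_name=path)
    print(path, "->", list(data.to_dict().keys()))
```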
🛠️ E2MConverter
`E2MConverter` is an integrated converter that supports text and image conversion. It can be used to convert text and images into Markdown format.
```python
from wisup_e2m import E2MConverter

ec = E2MConverter.from_config("./config.yaml")

text = "Parsed text data from any parser"
ec.convert(text=text)

images = ["test.jpg", "test.png"]
ec.convert(images=images)
```
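Assuming `E2MConverter.convert` returns the Markdown string like the standalone converters do (see the `ImageConverter` example in the Q&A below), you can persist the result directly:

```python
# Capture the converted Markdown and write it to a file.
md_text = ec.convert(text=text)
with open("output.md", "w") as f:
    f.write(md_text)
```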
You can use a `config.yaml` file to specify the parsers and converters you want to use. Here is an example of a `config.yaml` file:
```yaml
parsers:
  doc_parser:
    engine: "xml"
    langs: ["en", "zh"]
  docx_parser:
    engine: "xml"
    langs: ["en", "zh"]
  epub_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  html_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  url_parser:
    engine: "jina"
    langs: ["en", "zh"]
  pdf_parser:
    engine: "marker"
    langs: ["en", "zh"]
  pptx_parser:
    engine: "unstructured"
    langs: ["en", "zh"]
  voice_parser:
    # option 1: use openai whisper api
    # engine: "openai_whisper_api"
    # api_base: "https://api.openai.com/v1"
    # api_key: "your_api_key"
    # model: "whisper"
    # option 2: use local whisper model
    engine: "openai_whisper_local"
    model: "large"  # available models: https://github.com/openai/whisper#available-models-and-languages
converters:
  text_converter:
    engine: "litellm"
    model: "deepseek/deepseek-chat"
    api_key: "your_api_key"
    # base_url: ""
  image_converter:
    engine: "litellm"
    model: "gpt-4o-mini"
    api_key: "your_api_key"
    # base_url: ""
```
❓ Q&A
- Why set up parsers and converters instead of directly converting to Markdown in one step?
  - The core purpose of a parser is to extract data such as text and images without heavily processing it. In some projects, such as knowledge bases, not every file needs to be converted into Markdown: if the extracted text and images already meet basic retrieval-augmented generation (RAG) needs, there is no reason to pay for an extra format-conversion step. A sketch of this parser-only path is shown below.
  - Building on the extracted images and text, converters can further refine and format the data, making it more suitable for RAG and for model training or fine-tuning.
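  Below is a minimal sketch of the parser-only path (naive fixed-size chunking, purely for illustration; a real RAG pipeline would use its own chunker and vector store):

  ```python
  from wisup_e2m import PdfParser

  # Step 1 only: parse the file and index the raw text, skipping the converter.
  data = PdfParser(engine="unstructured").parse("./test.pdf")

  # Naive fixed-size chunking of the raw text for embedding/indexing.
  chunk_size = 500
  chunks = [data.text[i:i + chunk_size] for i in range(0, len(data.text), chunk_size)]
  print(f"{len(chunks)} chunks ready for a RAG index")
  ```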
- Why does the `PdfParser` produce poor Markdown text results?
  - The primary function of `PdfParser` is parsing, not directly converting to Markdown. `PdfParser` supports three engines:
    - `marker`: Inspired by the well-known marker project, it can convert directly to Markdown. However, due to suboptimal results with complex text, it serves only as part of the parser.
    - `unstructured`: The parsed output is raw text with almost no formatting, recommended for well-structured PDFs.
    - `surya_layout`: The output is not text but images annotated with layout information, which must then be converted with the `ImageConverter`. If the `ImageConverter` uses a multimodal model such as `gpt-4o`, the conversion to Markdown yields the best results, comparable to some commercial conversion software.
  - Below is a code example that produces the best conversion results:

    ```python
    import os

    from wisup_e2m import PdfParser, ImageConverter

    work_dir = os.getcwd()  # Use the current directory as the working directory
    image_dir = os.path.join(work_dir, "figure")
    pdf = "./test.pdf"

    # Load the parser
    pdf_parser = PdfParser(engine="surya_layout")

    # Load the converter
    image_converter = ImageConverter(
        engine="litellm",
        api_key="<your API key>",  # Replace with your API key
        model="gpt-4o",
        base_url="<your base URL>",  # If using a model proxy, provide the base URL
        caching=True,
        cache_type="disk-cache",
    )

    # Parse the PDF into images
    pdf_data = pdf_parser.parse(
        pdf,
        start_page=0,  # Starting page number
        end_page=20,  # Ending page number
        work_dir=work_dir,
        image_dir=image_dir,  # Directory to save extracted images
        relative_path=True,  # Whether image paths are relative to work_dir
    )

    # Convert images to text using ImageConverter
    md_text = image_converter.convert(
        images=pdf_data.images,
        attached_images_map=pdf_data.attached_images_map,
        work_dir=work_dir,  # Image paths in the Markdown are relative to work_dir; the default is absolute paths
    )

    # Save the resulting Markdown
    with open("test.md", "w") as f:
        f.write(md_text)
    ```
- Unable to connect to `https://huggingface.co`
  - Method 1: Try accessing it through a VPN or proxy.
  - Method 2: Use a mirror in your code:

    ```python
    import os
    os.environ['CURL_CA_BUNDLE'] = ''
    os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
    ```

  - Method 3: Set environment variables in the terminal:

    ```bash
    export CURL_CA_BUNDLE=''
    export HF_ENDPOINT='https://hf-mirror.com'
    ```
- Resource xxx not found. Please use the NLTK Downloader to obtain the resource:

  ```python
  import nltk
  nltk.download('all')  # you can directly download all resources
  ```
- Resource wordnet not found.
  - Uninstall `nltk` completely: `pip uninstall nltk`
  - Reinstall `nltk`: `pip install nltk`
  - Download corpora/wordnet.zip manually and unzip it to the directory specified in the error message. Alternatively, download it with the following commands:
    - Windows: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip` and `unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\`
    - Unix: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip` and `unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/`
📜 License
This project is licensed under the MIT License. See the LICENSE file for details.
📧 Contact
For any questions or inquiries, please open an issue on GitHub or contact us at team@wisup.ai.