MegaParse - Your Parser for Every Type of Document

MegaParse is a powerful and versatile parser that can handle various types of documents with ease. Whether you're dealing with text, PDFs, PowerPoint presentations, or Word documents, MegaParse has got you covered. Its focus is on losing no information during parsing.

Key Features 🎯

Versatile Parser: Handles various types of documents with ease.
No Information Loss: Focuses on losing no information during parsing.
Fast and Efficient: Designed with speed and efficiency at its core.
Wide File Compatibility: Supports text, PDF, PowerPoint presentations, Excel, CSV, and Word documents.
Open Source: Freedom is beautiful, and so is MegaParse. Open source and free to use.

Support

Files: ✅ PDF ✅ PowerPoint ✅ Word
Content: ✅ Tables ✅ TOC ✅ Headers ✅ Footers ✅ Images

Installation

pip install megaparse

Usage

Add your OpenAI or Anthropic API key to the .env file.
Install poppler on your computer (needed for images and PDFs).
Install tesseract on your computer (needed for images and PDFs).
If you are on a Mac, you also need to install libmagic: brew install libmagic
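The prerequisites above can be checked before the first run. The following is a minimal sketch using only the standard library; the helper name `missing_prerequisites` is hypothetical, and checking for `pdftoppm` (a binary shipped with poppler) is an assumption about what needs to be on the PATH, not something MegaParse documents. It inspects the process environment only, so keys that live solely in a .env file must be loaded first.

```python
import os
import shutil

# Executables looked up on PATH. "pdftoppm" ships with poppler; which
# binaries MegaParse actually invokes is an assumption of this sketch.
REQUIRED_BINARIES = {"poppler": "pdftoppm", "tesseract": "tesseract"}
API_KEY_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for package, binary in REQUIRED_BINARIES.items():
        if which(binary) is None:
            problems.append(f"{package}: '{binary}' not found on PATH")
    # One key is enough: either OpenAI or Anthropic.
    if not any(env.get(name) for name in API_KEY_VARS):
        problems.append("no API key set: " + " or ".join(API_KEY_VARS))
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing ->", problem)
```

Passing `env` and `which` as parameters keeps the check easy to exercise without touching the real system.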

import os

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.unstructured_parser import UnstructuredParser

# Any LangChain-compatible chat model can be used here
model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))
parser = UnstructuredParser(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")  # saves the last processed document in Markdown format
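To convert several documents, the load call above can be wrapped in a small loop. This is a sketch under stated assumptions: `convert_folder` is a hypothetical helper, not part of MegaParse, and it assumes that `load` is synchronous and returns something printable, as in the usage snippet. The parsing function is passed in as a plain callable, so the helper does not depend on any particular parser.

```python
from pathlib import Path


def convert_folder(folder, load, pattern="*.pdf"):
    """Run load(path) on every matching file and write a .md next to it.

    `load` is any callable that takes a file path string and returns the
    parsed document, for example `megaparse.load` (assumed synchronous).
    """
    written = []
    for source in sorted(Path(folder).glob(pattern)):
        markdown = load(str(source))
        target = source.with_suffix(".md")
        # str() guards against loaders that return a non-string object.
        target.write_text(str(markdown), encoding="utf-8")
        written.append(target)
    return written
```

With the objects from the usage snippet, a call would look like `convert_folder("./docs", megaparse.load)`, where `./docs` is a placeholder folder.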

Use MegaParse Vision

Change the parser to MegaParseVision

import os

from megaparse.core.megaparse import MegaParse
from langchain_openai import ChatOpenAI
from megaparse.core.parser.megaparse_vision import MegaParseVision

model = ChatOpenAI(model="gpt-4o", api_key=os.getenv("OPENAI_API_KEY"))  # type: ignore
parser = MegaParseVision(model=model)
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md")

Note: The models supported by MegaParse Vision are the multimodal ones, such as Claude 3.5, Claude 4, GPT-4o and GPT-4.
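Since either an OpenAI or an Anthropic key can be configured, the model can be chosen at runtime. The sketch below is illustrative only: `pick_provider` and `build_vision_model` are hypothetical helpers, the Claude model identifier is an assumption (substitute whichever multimodal model you have access to), and `langchain_anthropic` is a separate package that must be installed for the Anthropic branch. Both chat model classes read their API key from the environment when none is passed.

```python
import os


def pick_provider(env=None):
    """Return "openai" or "anthropic" depending on which key is set."""
    env = os.environ if env is None else env
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY first")


def build_vision_model():
    """Build a multimodal LangChain chat model for MegaParseVision."""
    if pick_provider() == "openai":
        from langchain_openai import ChatOpenAI  # imported lazily
        return ChatOpenAI(model="gpt-4o")
    from langchain_anthropic import ChatAnthropic  # separate package
    # The model identifier below is an assumption of this sketch.
    return ChatAnthropic(model="claude-3-5-sonnet-20240620")
```

The returned object can then be passed as `MegaParseVision(model=build_vision_model())`.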

(Optional) Use LlamaParse for Improved Results

Create an account on Llama Cloud and get your API key.
Change the parser to LlamaParser

import os

from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.llama import LlamaParser

# No chat model is needed here; LlamaParser only takes the Llama Cloud key
parser = LlamaParser(api_key=os.getenv("LLAMA_CLOUD_API_KEY"))
megaparse = MegaParse(parser)
response = megaparse.load("./test.pdf")
print(response)
megaparse.save("./test.md") #saves the last processed doc in md format

Use as an API

A Makefile is provided: run make dev at the root of the project and you are good to go.

See localhost:8000/docs for more information on the available endpoints.
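The endpoints can also be listed from a script. This sketch assumes the server is a FastAPI application that serves its schema at the default `/openapi.json` route; that is inferred from the `/docs` page and is not confirmed by this README. The helper names are hypothetical, and no endpoint names are assumed.

```python
import json
from urllib.request import urlopen

# Assumption: FastAPI's default schema route is enabled on the dev server.
SCHEMA_URL = "http://localhost:8000/openapi.json"

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema):
    """Return (METHOD, path) pairs from an OpenAPI schema, sorted by path."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items may also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints, key=lambda pair: (pair[1], pair[0]))


def fetch_schema(url=SCHEMA_URL, timeout=5):
    """Download and decode the OpenAPI schema from a running server."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage, once `make dev` is running:
#     for method, path in list_endpoints(fetch_schema()):
#         print(method, path)
```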

Benchmark

Parser	similarity_ratio
megaparse_vision	0.87
unstructured_with_check_table	0.77
unstructured	0.59
llama_parser	0.33

Higher is better.
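The table does not say how similarity_ratio is computed. One common way to score a parsed document against a reference text is the ratio from Python's `difflib`, sketched below; whether evaluations/script.py uses this exact measure is an assumption.

```python
from difflib import SequenceMatcher


def similarity_ratio(parsed, reference):
    """Return a score in [0, 1]; 1.0 means the two texts are identical."""
    # autojunk=False disables the popularity heuristic that can distort
    # scores on long documents.
    return SequenceMatcher(None, parsed, reference, autojunk=False).ratio()
```

The ratio is twice the number of matching characters divided by the combined length of both texts, so missing or extra content lowers the score.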

Note: Want to evaluate your own MegaParse module and compare it with ours? Add your config to evaluations/script.py and then run python evaluations/script.py. If it scores better, open a PR and let's go higher together 🚀.

In Construction 🚧

Improve table checker
Create Checkers to add modular postprocessing ⚙️
Add structured output, let's get computers talking 🤖
