marker
Repository: datalab-to/marker
Description: Fast PDF to Markdown/JSON conversion with high accuracy and table extraction.
Key Features:
- High Accuracy: Outperforms LLM-only parsers on complex layouts and tables.
- Table Extraction: Intelligent table identification and formatting into Markdown/JSON.
- OCR Integration: Uses Surya for high-quality OCR when text extraction fails.
- Markdown/JSON/Chunks Output: Flexible output formats for RAG and data ingestion.
- Model Pipeline: Combines layout detection, cleaning, and formatting models.
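To illustrate how one of these output formats might be selected from a script, here is a minimal sketch that assembles a `marker_single` command line. It assumes the `marker_single` CLI is installed and accepts `--output_format` and `--output_dir`, as described in the project's README at the time of writing; check this against the installed version. The helper names `build_marker_command` and `convert` are hypothetical and not part of marker.

```python
import subprocess
from pathlib import Path

# Formats named in this entry; the installed CLI may accept others.
ALLOWED_FORMATS = {"markdown", "json", "chunks"}


def build_marker_command(pdf_path, output_dir, output_format="markdown"):
    """Return the argument list for a single-file conversion (hypothetical helper)."""
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return [
        "marker_single",              # assumed CLI entry point
        str(Path(pdf_path)),
        "--output_format", output_format,
        "--output_dir", str(Path(output_dir)),
    ]


def convert(pdf_path, output_dir, output_format="markdown"):
    """Run the conversion; raises CalledProcessError if the CLI fails."""
    cmd = build_marker_command(pdf_path, output_dir, output_format)
    subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the command easy to inspect or log before any conversion starts.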
Primary Use Cases:
- Converting complex scientific papers or financial reports for RAG.
- Extracting structured data from legacy PDF documents.
- Fast, high-volume document ingestion for LLM training or search.
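For the RAG use case, converted Markdown usually has to be split into retrievable pieces. The sketch below shows one simple way to do that by cutting at Markdown headings. It is not marker's own chunking logic; `chunk_markdown` is a hypothetical helper, and it assumes ATX-style headings (`#` through `######`) with no heading-like lines inside code fences.

```python
import re

# Matches ATX headings such as "# Title" or "### Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown_text):
    """Split Markdown into a list of {"heading", "text"} dicts, one per heading."""
    chunks = []
    heading, lines = None, []

    def flush():
        # Keep a chunk if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Title\nIntro text.\n\n## Results\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    for chunk in chunk_markdown(sample):
        print(chunk["heading"], "->", len(chunk["text"]), "chars")
```

Splitting at headings keeps each table together with the section that introduces it, which tends to matter for the reports and papers listed above.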
Tags: #pdf-parsing #rag #knowledge-extraction #ocr
Added: 2026-06-18
Source: GitHub
Notes / Why Notable
Marker is significantly faster than many cloud-based PDF parsing services while maintaining comparable or better accuracy on structured elements like tables and formulas.