marker

Repository: datalab-to/marker

Description: Fast PDF to Markdown/JSON conversion with high accuracy and table extraction.

Key Features:

  • High Accuracy: Outperforms LLM-only parsers on complex layouts and tables.
  • Table Extraction: Intelligent table identification and formatting into Markdown/JSON.
  • OCR Integration: Uses Surya for high-quality OCR when text extraction fails.
  • Markdown/JSON/Chunks Output: Flexible output formats for RAG and data ingestion.
  • Model Pipeline: Combines layout detection, cleaning, and formatting models.

Primary Use Cases:

  • Converting complex scientific papers or financial reports for RAG.
  • Extracting structured data from legacy PDF documents.
  • Fast, high-volume document ingestion for LLM training or search.
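
For the RAG use case, converted Markdown usually has to be split into retrievable pieces before indexing. The sketch below is an illustration only: it uses the Python standard library, does not call Marker, and is not Marker's own chunk output. The names `Chunk` and `chunk_markdown` are hypothetical, and the input is assumed to be Markdown text that has already been produced by a converter.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    heading: str  # nearest heading above the text ("" if there is none)
    text: str     # body text found under that heading


# ATX-style Markdown heading: one to six '#' characters, then whitespace.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def chunk_markdown(markdown: str) -> List[Chunk]:
    """Split Markdown into one chunk per heading section (hypothetical helper)."""
    chunks: List[Chunk] = []
    heading = ""
    body: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(heading, text))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Title\nIntro text.\n\n## Results\n| a | b |\n|---|---|\n| 1 | 2 |\n"
    for chunk in chunk_markdown(sample):
        print(repr(chunk.heading), len(chunk.text))
```

Splitting on headings keeps each table together with the section that introduces it. A production pipeline would likely also cap chunk length and handle headings that appear inside code blocks, which this sketch does not.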

Tags: #pdf-parsing #rag #knowledge-extraction #ocr
Added: 2026-06-18
Source: GitHub

Notes / Why Notable

Marker is significantly faster than many cloud-based PDF parsing services while maintaining comparable or better accuracy on structured elements like tables and formulas.

Maintained with Yeda — Karpathy LLM Wiki paradigm.