Architools Wiki

marker

Repository: datalab-to/marker
Description: Fast PDF to Markdown/JSON conversion with high accuracy and table extraction.

Key Features:

High Accuracy: Outperforms LLM-only parsers on complex layouts and tables.
Table Extraction: Intelligent table identification and formatting into Markdown/JSON.
OCR Integration: Uses Surya for high-quality OCR when text extraction fails.
Markdown/JSON/Chunks Output: Flexible output formats for RAG and data ingestion.
Model Pipeline: Combines layout detection, cleaning, and formatting models.
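
As a rough illustration of the output formats listed above, the sketch below builds a command line for a single-file conversion. It assumes marker is installed and exposes the `marker_single` command with `--output_format` and `--output_dir` options, as described in the project's README. Check the flag names and accepted values against the installed version. The helper `build_marker_command` and the file names are hypothetical and exist only for this example.

```python
import shlex
from pathlib import Path

# Output formats named in this entry. The exact flag values are an
# assumption and should be verified against the installed marker version.
OUTPUT_FORMATS = ("markdown", "json", "chunks")


def build_marker_command(pdf_path, output_dir, output_format="markdown"):
    """Return the argv list for a single-file conversion (hypothetical helper)."""
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format!r}")
    return [
        "marker_single",
        str(Path(pdf_path)),
        "--output_format", output_format,
        "--output_dir", str(Path(output_dir)),
    ]


if __name__ == "__main__":
    # "report.pdf" and "out" are placeholder names, not real paths.
    cmd = build_marker_command("report.pdf", "out", "json")
    print(shlex.join(cmd))
    # To actually run the conversion (requires marker to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning the argument list rather than a shell string keeps paths with spaces intact when the list is later passed to `subprocess.run`.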

Primary Use Cases:

Converting complex scientific papers or financial reports for RAG.
Extracting structured data from legacy PDF documents.
Fast, high-volume document ingestion for LLM training or search.

Tags: #pdf-parsing #rag #knowledge-extraction #ocr
Added: 2026-06-18
Source: GitHub

Notes / Why Notable

Marker is significantly faster than many cloud-based PDF parsing services while maintaining comparable or better accuracy on structured elements like tables and formulas.
