
Marker, hosted: a scientific PDF parser API with LaTeX equations preserved
The problem

I kept hitting the same wall when building RAG pipelines over research papers: every generic PDF parser I tried mangled the equations. Adobe Extract, AWS Textract, pdfplumber, and PyMuPDF all collapse display math into plain-text garbage.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

becomes something like:

QKT √dk Attention(Q,K,V ) = softmax( )V (1)

Unusable. Your embedding model sees a soup of tokens. Your LLM has no idea what the equation means. Your RAG answers are wrong on anything math-heavy.

What I tried

I benchmarked the obvious options on a handful of arXiv papers I cared about:

Docling (IBM): drops every display equation as a placeholder. ~5/12 on a controlled equation-extraction benchmark.
Nougat (Meta): the results were actually good when it worked, but the repo is essentially unmaintained and the dependency tree is a minefield.
Mistral OCR: cheap and general-purpose, but equation fidelity is inconsistent on papers with dense notation.
LlamaParse: optimized for "



