Docling
IBM's open-source document parser for PDFs, DOCX, and 20+ formats
Visit Docling
https://github.com/docling-project/docling
About Docling
Docling is an open-source document processing library from IBM Research that parses diverse formats, including PDF, DOCX, PPTX, XLSX, HTML, images, and audio. It provides advanced PDF understanding: page layout analysis, table structure recognition, and code and formula extraction. It also integrates with LangChain, LlamaIndex, CrewAI, and Haystack for RAG pipelines.
Key Features
✓ Parse PDF, DOCX, PPTX, XLSX, HTML, images, audio, and more
✓ Advanced PDF layout and table structure recognition
✓ OCR for scanned documents and images
✓ Export to Markdown, HTML, JSON, DocTags
✓ LangChain, LlamaIndex, CrewAI integrations
✓ Local execution for sensitive data / air-gapped environments
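The parse-and-export features above can be sketched in a few lines of Python. This is a minimal sketch, assuming Docling is installed from PyPI (`pip install docling`) and that its `DocumentConverter` class and the `export_to_markdown`, `export_to_html`, and `export_to_dict` document methods behave as described in the project's documentation. The `convert` and `output_path` helpers, the `SUFFIXES` table, and the file name `report.pdf` are hypothetical names introduced only for illustration.

```python
import json
from pathlib import Path

# Hypothetical table: export format name -> output file suffix.
SUFFIXES = {"markdown": ".md", "html": ".html", "json": ".json"}


def output_path(source: str, fmt: str) -> Path:
    """Derive an output file name from the source name (illustrative helper)."""
    return Path(Path(source).name).with_suffix(SUFFIXES[fmt])


def convert(source: str, fmt: str = "markdown") -> str:
    """Parse `source` (a local path or URL) and return it in the chosen format."""
    if fmt not in SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")

    # Imported lazily so the helpers above work without Docling installed.
    from docling.document_converter import DocumentConverter

    # The converter runs locally; the first call may download model weights.
    document = DocumentConverter().convert(source).document

    if fmt == "markdown":
        return document.export_to_markdown()
    if fmt == "html":
        return document.export_to_html()
    # JSON: serialize the dictionary form of the parsed document.
    return json.dumps(document.export_to_dict())


if __name__ == "__main__":
    src = "report.pdf"  # assumed to exist; replace with your own file
    output_path(src, "markdown").write_text(convert(src), encoding="utf-8")
```

DocTags export is left out of the sketch; check the Docling documentation for the matching export method in your installed version.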
Tags
document parsing, pdf, open source, rag, data extraction, ibm