deepdoctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does
not implement models itself but lets you build pipelines from well-established libraries for object detection, OCR
and selected NLP tasks, and it provides an integrated framework for fine-tuning, evaluating and running models. For more
specific text processing tasks, use one of the many other great NLP libraries.
deepdoctection focuses on applications and is made for those who want to solve real world problems related to
document extraction from PDFs or scans in various image formats.
Check the demo of a document layout analysis pipeline with OCR on
deepdoctection provides model wrappers of supported libraries for various tasks to be integrated into
pipelines. Its core function does not depend on any specific deep learning library. Selected models for the following
tasks are currently supported:
- Document layout analysis including table recognition in Tensorflow with Tensorpack,
or PyTorch with Detectron2,
- OCR with support of Tesseract, DocTr
(Tensorflow and PyTorch implementations available) and a wrapper to an API for a commercial solution,
- Text mining for native PDFs with pdfplumber,
- Language detection with fastText,
- Deskewing and rotating images with jdeskew,
- Document and token classification with all LayoutLM models
provided by the Transformer library.
(Yes, you can use any LayoutLM model with any of the provided OCR or pdfplumber tools straight away!) Check the notebook repo or
the documentation on how to train a model on your custom task or how to set up a pipeline.
- Table detection and table structure recognition with
table-transformer. You can try a pipeline using
On top of that, deepdoctection provides methods for pre-processing model inputs, such as cropping or resizing, and for
post-processing results, such as validating duplicate outputs, relating words to detected layout segments or ordering words
into contiguous text. The output is in JSON format, which you can customize further.
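To make the post-processing idea concrete, here is a minimal sketch of relating words to layout segments and ordering them into text. It is an illustration only, not deepdoctection's implementation: the `Word` class, the function names, the overlap threshold and the line tolerance are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    box: Box


def overlap_ratio(word_box: Box, segment_box: Box) -> float:
    """Return the fraction of the word box area that lies inside the segment box."""
    width = min(word_box[2], segment_box[2]) - max(word_box[0], segment_box[0])
    height = min(word_box[3], segment_box[3]) - max(word_box[1], segment_box[1])
    intersection = max(0.0, width) * max(0.0, height)
    area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return intersection / area if area > 0 else 0.0


def assign_words(words: List[Word], segments: Dict[str, Box], threshold: float = 0.5) -> Dict[str, List[Word]]:
    """Assign each word to the segment that covers it best, if coverage reaches the threshold."""
    assigned: Dict[str, List[Word]] = {name: [] for name in segments}
    if not segments:
        return assigned
    for word in words:
        best = max(segments, key=lambda name: overlap_ratio(word.box, segments[name]))
        if overlap_ratio(word.box, segments[best]) >= threshold:
            assigned[best].append(word)
    return assigned


def to_text(words: List[Word], line_tolerance: float = 5.0) -> str:
    """Group words into lines by their top coordinate, then read each line left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.box[1], w.box[0])):
        if lines and abs(lines[-1][0].box[1] - word.box[1]) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(" ".join(w.text for w in sorted(line, key=lambda w: w.box[0])) for line in lines)


# Made-up sample data: two layout segments and four OCR words in arbitrary order.
segments = {"title": (0, 0, 200, 30), "text": (0, 40, 200, 100)}
words = [
    Word("Report", (60, 5, 110, 20)),
    Word("Annual", (10, 5, 55, 20)),
    Word("Revenue", (10, 45, 70, 60)),
    Word("grew.", (75, 46, 110, 60)),
]
assigned = assign_words(words, segments)
for name, segment_words in assigned.items():
    print(name, "->", to_text(segment_words))
```

A real pipeline has to handle multi-column layouts, rotated pages and words that straddle segments, which this top-to-bottom, left-to-right heuristic does not attempt.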
Check the release notes for recent updates.
deepdoctection or its supporting libraries provide pre-trained models that are in most cases available at the
Hugging Face Model Hub or that will be downloaded automatically once
requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework
for coarse layout analysis, table cell detection and table recognition.
Training is a substantial part of getting pipelines ready for a specific domain, be it document layout analysis,
document classification or NER. deepdoctection provides training scripts for models; these scripts are based on the trainers
developed by the library that hosts the model code. Moreover, deepdoctection hosts code for some well-established
datasets like Publaynet, which makes it easy to experiment. It also contains mappings from widely used data
formats like COCO and has a dataset framework (akin to datasets), so that
setting up training on a custom dataset becomes very easy. This notebook
shows you how to do this.
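As a rough illustration of what mapping from a format like COCO involves, the sketch below groups COCO-style annotations by image and converts `[x, y, width, height]` boxes into corner coordinates. It does not use deepdoctection's dataset framework; the function name `coco_to_records` and the output record layout are assumptions made for this example.

```python
from collections import defaultdict
from typing import Any, Dict, List


def coco_to_records(coco: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a COCO-style dict into one record per image with its annotations attached."""
    category_names = {category["id"]: category["name"] for category in coco["categories"]}

    annotations_by_image: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for annotation in coco["annotations"]:
        x, y, width, height = annotation["bbox"]  # COCO stores boxes as [x, y, width, height]
        annotations_by_image[annotation["image_id"]].append(
            {
                "category": category_names[annotation["category_id"]],
                "box": [x, y, x + width, y + height],  # corner format: x_min, y_min, x_max, y_max
            }
        )

    return [
        {
            "file_name": image["file_name"],
            "width": image["width"],
            "height": image["height"],
            "annotations": annotations_by_image[image["id"]],
        }
        for image in coco["images"]
    ]


# Made-up sample in COCO layout with two layout categories.
sample = {
    "images": [{"id": 1, "file_name": "page_1.png", "width": 600, "height": 800}],
    "categories": [{"id": 1, "name": "text"}, {"id": 4, "name": "table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 50]},
        {"id": 11, "image_id": 1, "category_id": 4, "bbox": [50, 300, 400, 200]},
    ],
}
records = coco_to_records(sample)
print(records[0]["annotations"])
```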
deepdoctection comes equipped with a framework that allows you to evaluate the predictions of a single model or of multiple
models in a pipeline against some ground truth. Check again here how it is done.
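To show what comparing predictions against ground truth can look like, here is a simplified sketch that matches predicted boxes to ground-truth boxes by intersection over union (IoU) and reports precision and recall. This is a generic illustration rather than the metric deepdoctection's evaluation framework computes; the function names and the 0.5 threshold are assumptions.

```python
from typing import List, Sequence, Tuple

Box = Sequence[float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = width * height
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(predictions: List[Box], ground_truth: List[Box], iou_threshold: float = 0.5) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box."""
    unmatched = list(ground_truth)
    true_positives = 0
    for prediction in predictions:
        best = max(unmatched, key=lambda truth: iou(prediction, truth), default=None)
        if best is not None and iou(prediction, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each ground-truth box can be matched only once
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Made-up boxes: two good detections and one false positive.
ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0, 0, 10, 9), (50, 50, 60, 60), (21, 20, 30, 30)]
precision, recall = precision_recall(predictions, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Greedy matching keeps the sketch short. Benchmark-style metrics usually also rank predictions by confidence score and average over several IoU thresholds.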
Once a pipeline is set up, it takes only a few lines of code to instantiate it, and a for loop then processes all pages
through the pipeline.
    import deepdoctection as dd
    from IPython.core.display import HTML
    from matplotlib import pyplot as plt

    analyzer = dd.get_dd_analyzer()  # instantiate the built-in analyzer similar to the Hugging Face space demo

    df = analyzer.analyze(path="/path/to/your/doc.pdf")  # setting up pipeline
    df.reset_state()  # Trigger some initialization

    doc = iter(df)
    page = next(doc)

    image = page.viz()
    plt.figure(figsize=(25, 17))
    plt.axis('off')
    plt.imshow(image)
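The snippet above pulls only the first page. Processing every page with a for loop could look like the sketch below. It is written against stand-in objects so that it runs without deepdoctection installed. The helper `collect_page_texts` is hypothetical, and it assumes that each page object exposes its recognized text through a `text` attribute.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def collect_page_texts(pages: Iterable[Any]) -> List[str]:
    """Iterate over page objects and collect the text of each one.

    Assumption: every page has a `text` attribute; pages without it yield "".
    """
    texts = []
    for page in pages:
        texts.append(getattr(page, "text", ""))
    return texts


# Stand-in pages for illustration only. With the library installed, you would
# pass the iterable returned by the analyzer (`df` in the example above).
fake_pages = [SimpleNamespace(text="first page"), SimpleNamespace(text="second page")]
page_texts = collect_page_texts(fake_pages)
print(page_texts)
```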
Extensive documentation is available,
containing tutorials, design concepts and the API. We want to present things as comprehensively and understandably
as possible. However, we are aware that there are still many areas where significant improvements can be made in terms
of clarity, grammar and correctness. We look forward to every hint and comment that increases the quality of the documentation.
Everything in the overview listed below the deepdoctection layer is a necessary requirement and has to be installed:
- Linux or macOS. (Windows is not supported but there is a Dockerfile available)
- Python >= 3.8
- PyTorch >= 1.8 or Tensorflow >= 2.9 and CUDA. If you want to run the models provided by Tensorpack a GPU is
required. You can run on PyTorch with a CPU only.
- deepdoctection uses Python wrappers for Poppler to convert PDF documents into images.
- With respect to the Deep Learning framework, you must decide between Tensorflow and PyTorch.
- Tesseract OCR engine will be used through a Python wrapper. The core
engine has to be installed separately.
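Before installing, it can help to check which of these requirements are already present. The following is a small sketch using only the standard library. It is not part of deepdoctection, and the function name `check_requirements` is made up for this example. It looks for the `tesseract` executable and for `pdftoppm`, one of the command line tools shipped with Poppler, on the `PATH`.

```python
import importlib.util
import shutil
import sys
from typing import Dict


def check_requirements() -> Dict[str, bool]:
    """Report which of the listed requirements can be found on this machine."""
    return {
        "linux or macOS": sys.platform.startswith(("linux", "darwin")),
        "python >= 3.8": sys.version_info >= (3, 8),
        # find_spec returns None when a top-level package is not installed.
        "torch": importlib.util.find_spec("torch") is not None,
        "tensorflow": importlib.util.find_spec("tensorflow") is not None,
        # External tools are looked up on the PATH.
        "tesseract": shutil.which("tesseract") is not None,
        "poppler (pdftoppm)": shutil.which("pdftoppm") is not None,
    }


report = check_requirements()
for name, available in report.items():
    print(f"{name:20s} {'found' if available else 'missing'}")
```

The check only tests for presence. It does not verify the minimum PyTorch or Tensorflow versions or whether CUDA is usable.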
We recommend using a virtual environment. You can install the package via pip or from source. Bug fixes and enhancements
are deployed to PyPI every 4 to 6 weeks.
Depending on which Deep Learning library you have available, use the following installation option:
For Tensorflow, run
pip install "deepdoctection[tf]"
For PyTorch, first install Detectron2 separately, as it is not distributed via PyPI. Check the instructions
here. Then run
pip install "deepdoctection[pt]"
This will install deepdoctection with all dependencies listed above the deepdoctection layer. Use this setting
if you want to get started or want to explore all features.
If you want more control over your installation and are looking for fewer dependencies,
install deepdoctection with the basic setup only.
pip install deepdoctection
This will ignore all model libraries (layers above the deepdoctection layer in the diagram), and you
will be responsible for installing them yourself. Note that you will not be able to run any pipeline with this setup.
For further information, please consult the full installation instructions.
Download the repository or clone via
git clone https://github.com/deepdoctection/deepdoctection.git
To get started with Tensorflow, run:
pip install ".[tf]"
Installing the full PyTorch setup from source will also install Detectron2 for you:
pip install ".[source-pt]"
We thank all libraries that provide high-quality code and pre-trained models. Without them, it would have been impossible
to develop this framework.
We try hard to eliminate bugs. We also know that the code is not free of issues. We welcome all issues relevant to this
repo and try to address them as quickly as possible.
…you can easily support the project by making it more visible. Leaving a star or a recommendation will help.
Distributed under the Apache 2.0 License. Check LICENSE
for additional information.