Python Packages to Convert PDF to Different Formats
In this article, I recommend a few Python packages that can convert PDFs to other useful formats. Plenty of online converters exist, but when it comes to research, many people are not comfortable uploading their work to a random website because of privacy concerns. The packages below run locally, so nothing needs to be uploaded.
PDF TO WORD
PDF TO MARKDOWN
PDF TO XML
PDF TO HTML
PDF TO IMAGE
PDF TO CSV
PDF to Word
Package name : pdf2docx
Installation Code: pip install pdf2docx
Usage
$ pdf2docx --help
NAME
pdf2docx - Run the pdf2docx parser.
SYNOPSIS
pdf2docx PDF_FILE <flags>
DESCRIPTION
Run the pdf2docx parser.
POSITIONAL ARGUMENTS
PDF_FILE
PDF filename to read from
FLAGS
--docx_file=DOCX_FILE
DOCX filename to write to
--start=START
first page to process, starting from zero
--end=END
last page to process, starting from zero
--pages=PAGES
range of pages
--multi_processing=MULTI_PROCESSING
NOTES
You can also use flags syntax for POSITIONAL ARGUMENTS
By range of pages
$ pdf2docx test.pdf test.docx --start=5 --end=10
By page numbers
$ pdf2docx test.pdf test.docx --pages=5,7,9
Multi-Processing
$ pdf2docx test.pdf --multi_processing=True
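The commands above can also be driven from a script, which is handy when a whole folder of papers needs converting. The following is a minimal sketch, not part of pdf2docx itself: the helper names build_pdf2docx_command and convert_folder are hypothetical, and it assumes the installed pdf2docx accepts exactly the flags listed in the help text above (newer releases may expect a different invocation, so check pdf2docx --help first).

```python
import subprocess
from pathlib import Path


def build_pdf2docx_command(pdf_path, docx_path=None, start=None, end=None, pages=None):
    """Build a pdf2docx command line using only the flags from the help text."""
    cmd = ["pdf2docx", str(pdf_path)]
    if docx_path is not None:
        cmd.append(f"--docx_file={docx_path}")
    if start is not None:  # 0 is a valid page index, so compare against None
        cmd.append(f"--start={start}")
    if end is not None:
        cmd.append(f"--end={end}")
    if pages:
        cmd.append("--pages=" + ",".join(str(p) for p in pages))
    return cmd


def convert_folder(folder):
    """Convert every PDF in `folder` to a .docx file next to it, one at a time."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        docx_path = pdf_path.with_suffix(".docx")
        # check=True raises CalledProcessError if pdf2docx exits with an error.
        subprocess.run(build_pdf2docx_command(pdf_path, docx_path), check=True)
```

Calling convert_folder("papers") would then run one pdf2docx process per PDF in the papers directory. Keeping the command construction in its own function makes it easy to inspect the exact command before anything is executed.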
PDF to Markdown
Package name : pdf-to-markdown 0.1.0
Installation Code: pip install pdf-to-markdown
Usage
pdf2md <pdf path>
PDF to XML
Package name : pypdf2xml 0.3
Installation Code: pip install pypdf2xml
Usage
pypdf2xml <pdf path>
PDF to HTML
Parse PDFs into HTML-like trees.
Package name : pdftotree 0.4.1
Installation Code: pip install pdftotree
Dependencies
You’ll need to install the Python 3 Tk toolkit (Tkinter):
$ sudo apt install python3-tk
Installation
To install this package from PyPi:
$ pip install pdftotree
Usage
pdftotree as a Python package
import pdftotree
pdftotree.parse(pdf_file, html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)
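As a rough illustration of the signature above, here is a small wrapper that parses one PDF and saves the result next to it. The function names default_html_path and pdf_to_html are hypothetical helpers, and the sketch assumes that pdftotree.parse writes the HTML to html_path when one is given, as the parameter name and the -o option of the command-line tool suggest.

```python
from pathlib import Path


def default_html_path(pdf_path):
    """Return the PDF's path with the extension swapped to .html."""
    return Path(pdf_path).with_suffix(".html")


def pdf_to_html(pdf_path, html_path=None):
    """Parse one PDF with pdftotree and save the HTML-like tree to disk."""
    # Imported lazily so this module still loads where pdftotree is missing.
    import pdftotree

    html_path = Path(html_path) if html_path else default_html_path(pdf_path)
    # model_type stays at its default (None), i.e. the heuristics approach.
    # Assumption: parse() writes to html_path when it is provided.
    pdftotree.parse(str(pdf_path), html_path=str(html_path))
    return html_path
```

With this in place, pdf_to_html("paper.pdf") would aim to produce paper.html in the same folder.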
pdftotree
This is the primary command-line utility provided with this Python package. This takes a PDF file as input, and produces an HTML-like representation of the data:
usage: pdftotree [options] pdf_file
Script to extract tree structure from PDF files. Takes a PDF as input and
outputs an HTML-like representation of the document's structure. By default,
this conversion is done using heuristics. However, a model can be provided as
a parameter to use a machine-learning-based approach.
positional arguments:
pdf_file PDF file name for which tree structure needs to be
extracted
optional arguments:
-h, --help show this help message and exit
-mt {vision,ml,None}, --model_type {vision,ml,None}
Model type to use. None (default) for heuristics
approach.
-m MODEL_PATH, --model_path MODEL_PATH
Pretrained model, generated by extract_tables tool
-o OUTPUT, --output OUTPUT
Path where tree structure should be saved. If none,
HTML is printed to stdout.
-f FAVOR_FIGURES, --favor_figures FAVOR_FIGURES
Whether figures must be favored over other parts such
as tables and section headers
-V, --visualize Whether to output visualization images for the tree
-d, --dry-run Run pdftotree, but do not save any output or print to
console.
-v, --verbose Output INFO level logging.
-vv, --veryverbose Output DEBUG level logging.
extract_tables
This tool trains a machine-learning model to extract tables. The output model can be used as an input to pdftotree:
usage: extract_tables [-h] [--mode MODE] --model-path MODEL_PATH
[--train-pdf TRAIN_PDF] --test-pdf TEST_PDF
[--gt-train GT_TRAIN] --gt-test GT_TEST --datapath
DATAPATH [--iou-thresh IOU_THRESH] [-v] [-vv]
Script to extract tables bounding boxes from PDF files using machine learning.
If `model.pkl` is saved in the model-path, the pickled model will be used for
prediction. Otherwise the model will be retrained. If --mode is test (by
default), the script will create a .bbox file containing the tables for the
pdf documents listed in the file --test-pdf. If --mode is dev, the script will
also extract ground truth labels for the test data and compute statistics.
optional arguments:
-h, --help show this help message and exit
--mode MODE Usage mode dev or test, default is test
--model-path MODEL_PATH
Path to the model. If the file exists, it will be
used. Otherwise, a new model will be trained.
--train-pdf TRAIN_PDF
List of pdf file names used for training. These files
must be saved in the --datapath directory. Required if
no pretrained model is provided.
--test-pdf TEST_PDF List of pdf file names used for testing. These files
must be saved in the --datapath directory.
--gt-train GT_TRAIN Ground truth train tables. Required if no pretrained
model is provided.
--gt-test GT_TEST Ground truth test tables.
--datapath DATAPATH Path to directory containing the input documents.
--iou-thresh IOU_THRESH
Intersection over union threshold to remove duplicate
tables
-v Output INFO level logging
-vv Output DEBUG level logging
PDF List Format
The list of PDFs is simply one filename per line. For example:
1-s2.0-S000925411100369X-main.pdf
1-s2.0-S0009254115301030-main.pdf
1-s2.0-S0012821X12005717-main.pdf
1-s2.0-S0012821X15007487-main.pdf
1-s2.0-S0016699515000601-main.pdf
Ground Truth File Format
The ground truth is formatted to mirror the PDF list. That is, the first line of the ground truth file provides the labels for the first document in the corresponding PDF list. Labels take the form of semicolon-separated tuples containing the values (page_num, page_width, page_height, top, left, bottom, right). For example:
(10, 696, 951, 634, 366, 832, 653);(14, 696, 951, 720, 62, 819, 654);(4, 696, 951, 152, 66, 813, 654);(7, 696, 951, 415, 57, 833, 647);(8, 696, 951, 163, 370, 563, 652)
(11, 713, 951, 97, 47, 204, 676);(11, 713, 951, 261, 45, 357, 673);(3, 713, 951, 110, 44, 355, 676);(8, 713, 951, 763, 55, 903, 687)
(5, 672, 951, 88, 57, 203, 578);(5, 672, 951, 593, 60, 696, 579)
(5, 718, 951, 131, 382, 403, 677)
(13, 713, 951, 119, 56, 175, 364);(13, 713, 951, 844, 57, 902, 363);(14, 713, 951, 109, 365, 164, 671);(8, 713, 951, 663, 46, 890, 672)
One method to label these tables is to use DocumentAnnotation, which allows you to select table regions in your web browser and produces the bounding box file.
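To sanity-check a ground truth file before training, it can help to parse it back into structured values. The sketch below is not part of pdftotree: TableBox, parse_ground_truth_line and pair_with_pdfs are hypothetical names, and it assumes every value is an integer, as in the example above.

```python
from typing import List, NamedTuple


class TableBox(NamedTuple):
    """One labelled table, in the field order used by the ground truth file."""
    page_num: int
    page_width: int
    page_height: int
    top: int
    left: int
    bottom: int
    right: int


def parse_ground_truth_line(line: str) -> List[TableBox]:
    """Parse one ground truth line into a list of table bounding boxes."""
    boxes = []
    for chunk in line.strip().split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing semicolon or an empty line
        values = [int(v) for v in chunk.strip("()").split(",")]
        if len(values) != 7:
            raise ValueError(f"expected 7 values per tuple, got {len(values)}: {chunk!r}")
        boxes.append(TableBox(*values))
    return boxes


def pair_with_pdfs(pdf_names, truth_lines):
    """Match each PDF name with the labels found on the same line number."""
    if len(pdf_names) != len(truth_lines):
        raise ValueError("PDF list and ground truth must have the same number of lines")
    return {name: parse_ground_truth_line(line)
            for name, line in zip(pdf_names, truth_lines)}
```

Because the two files mirror each other line by line, a length mismatch is treated as an error rather than silently truncated.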
Example Dataset: Paleontological Papers
A full set of documents and ground truth labels can be downloaded here: PaleoDocs. You can train a machine-learning model to extract table regions by downloading this dataset, extracting it into a directory named data, and then running the command below. Double check that the paths in the command match wherever you have downloaded the data:
$ extract_tables --train-pdf data/paleo/ml/train.pdf.list.paleo.not.scanned --gt-train data/paleo/ml/gt.train --test-pdf data/paleo/ml/test.pdf.list.paleo.not.scanned --gt-test data/paleo/ml/gt.test --datapath data/paleo/documents/ --model-path data/model.pkl
The resulting model of this example command would be saved as data/model.pkl.
PDF to Image
Package name : pdf2image 1.14.0
Installation Code: pip install pdf2image
How to install
pip install pdf2image
Windows
Windows users will have to build or download poppler for Windows. I recommend @oschwartz10612's version, which is the most up-to-date. You will then have to add the bin/ folder to PATH or pass poppler_path = r"C:\path\to\poppler-xx\bin" as an argument to convert_from_path.
Mac
Mac users will have to install poppler for Mac.
Linux
Most distros ship with pdftoppm and pdftocairo. If they are not installed, refer to your package manager to install poppler-utils.
Platform-independent (using conda)
- Install poppler:
conda install -c conda-forge poppler
- Install pdf2image:
pip install pdf2image
How does it work?
from pdf2image import convert_from_path, convert_from_bytes
from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError,
)
Then simply do:
images = convert_from_path('/home/belval/example.pdf')
OR
images = convert_from_bytes(open('/home/belval/example.pdf', 'rb').read())
OR better yet
import tempfile
with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here
images will be a list of PIL Image objects, one for each page of the PDF document.
Here are the definitions:
convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)
convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', jpegopt=None, thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False, size=None, paths_only=False, use_pdftocairo=False, timeout=600)
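Putting the pieces together, a typical task is to render every page to a PNG file on disk. Here is a minimal sketch, assuming poppler is installed as described earlier; page_filename and save_pages_as_png are hypothetical helper names, while convert_from_path, its dpi and poppler_path parameters, and the two exceptions come from the definitions and imports shown above.

```python
from pathlib import Path


def page_filename(stem, page_number, total_pages):
    """Build a zero-padded file name such as 'report_page_03.png'."""
    width = max(2, len(str(total_pages)))
    return f"{stem}_page_{page_number:0{width}d}.png"


def save_pages_as_png(pdf_path, out_dir, dpi=200, poppler_path=None):
    """Render each page of a PDF to a PNG file and return the written paths."""
    # Imported lazily so page_filename works even without pdf2image installed.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import PDFInfoNotInstalledError, PDFPageCountError

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        images = convert_from_path(str(pdf_path), dpi=dpi, poppler_path=poppler_path)
    except PDFInfoNotInstalledError as exc:
        raise RuntimeError("poppler was not found; install it or pass poppler_path") from exc
    except PDFPageCountError as exc:
        raise RuntimeError(f"could not read the page count of {pdf_path}") from exc

    written = []
    for number, image in enumerate(images, start=1):
        target = out_dir / page_filename(Path(pdf_path).stem, number, len(images))
        image.save(target, "PNG")  # each item is a PIL Image
        written.append(target)
    return written
```

Zero-padding the page number keeps the files in page order when they are sorted by name. For long documents, the output_folder approach shown earlier may be gentler on memory, since this sketch holds every page image in a list at once.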
PDF to CSV
Package name : pdf-rules 1.0.0
Installation Code: pip install pdf-rules
Usage
pdf-rules <pdf path>