Hello, friends. In this simple post, you will learn how to extract embedded images from a PDF.
Occasionally, it is necessary to extract the images that are inside a PDF file. This may seem difficult, but it is easier than you think.
Best of all, you can do it from the terminal, which keeps resource usage low and makes the process very fast.
Let’s get started.
Install Poppler on Linux
According to the Poppler website:
"Poppler is a PDF rendering library based on the xpdf-3.0 code base."
This library comes with a set of command-line tools for working with PDF files.
The most sensible way to install it is from the official repositories of your distribution, although you can also compile it from source or download the binaries.
On Debian, Ubuntu and their derivatives, such as Linux Mint, you can run:
sudo apt update
sudo apt install poppler-utils
Once the package is installed, we can use one of its tools, pdfimages, to accomplish the task.
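Before going further, it can help to confirm that the tool is actually on your PATH. The following is a small sketch; have_cmd is just an illustrative helper name, not something Poppler provides.

```shell
# Return success if the given command can be found on PATH.
# (have_cmd is a hypothetical helper name used only in this post.)
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd pdfimages; then
    # -v prints the pdfimages version banner
    pdfimages -v
else
    echo "pdfimages not found; install poppler-utils first" >&2
fi
```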
Extract embedded images from a PDF file
The procedure is simple. Just follow this syntax.
pdfimages -all input.pdf images/prefix
The above command takes all the images from the input.pdf file and writes them to files whose names start with images/prefix, that is, inside an images directory relative to where you run the command. Of course, you can also use an absolute path for the PDF file and another one for the output.
As for images/prefix, the ideal is to choose a prefix that identifies the images well. With the -all option, each image is saved in its native format, such as JPEG or PNG. Of these two, PNG is lossless, so it preserves more quality.
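As far as I know, pdfimages does not create the output directory for you, so it is safer to create it first. Here is a minimal sketch under that assumption; image_root is an illustrative helper name, and input.pdf, images and prefix are the sample names used above.

```shell
# Create the output directory and print the image-root argument DIR/PREFIX.
# (image_root is a hypothetical helper name.)
image_root() {
    mkdir -p "$1" && printf '%s/%s\n' "$1" "$2"
}

# Only run the extraction when the sample file and the tool are present.
if [ -f input.pdf ] && command -v pdfimages >/dev/null 2>&1; then
    pdfimages -all input.pdf "$(image_root images prefix)"
fi
```

The guard around the last command only keeps the snippet from failing when input.pdf is missing; in everyday use, mkdir -p images followed by the pdfimages command is enough.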
Then, the command would look like this
pdfimages -all input.pdf sample
This will produce image files named like sample-nnn.png in the current directory. With -all, the extension depends on the native format of each image.
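If you would like to know what is inside the file before extracting anything, pdfimages also has a -list option that prints a table of the images instead of saving them. The sketch below counts the rows of that table, assuming the listing begins with two header lines (column names and a dashed rule); count_listed is an illustrative helper name.

```shell
# Count the data rows of `pdfimages -list` output read from stdin,
# skipping the two header lines at the top of the listing.
# (count_listed is a hypothetical helper name.)
count_listed() {
    awk 'NR > 2 { n++ } END { print n + 0 }'
}

if [ -f input.pdf ] && command -v pdfimages >/dev/null 2>&1; then
    pdfimages -list input.pdf                  # show the table itself
    pdfimages -list input.pdf | count_listed   # number of rows listed
fi
```

Keep in mind that the listing can include entries such as masks, so the count is only a rough guide to how many files you will get.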
If you want JPEG files, use the -j option on its own, since -all already includes it:
pdfimages -j input.pdf sample
Note that the -j option might not give you the results you expect. Here is what the man page says about it:
"Normally, all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files. With this option, images in DCT format are saved as JPEG files. All non-DCT images are saved in PBM/PPM format as usual."
More options available for extracting images
The above commands extract all images, but we often want to limit the extraction to a range of pages. This is especially useful if the file is very long.
For this, the -f and -l options define the first and the last page from which to extract images:
pdfimages -f 1 -l 5 -png input.pdf images
These are perhaps the most useful options, because they let us limit the number of output files.
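To pick a sensible range, you need to know how many pages the file has. The pdfinfo tool, which is also part of poppler-utils, reports this on a line starting with Pages:. Here is a small sketch that extracts images from the last three pages only; page_count is an illustrative helper name.

```shell
# Read pdfinfo output on stdin and print the value of its "Pages:" line.
# (page_count is a hypothetical helper name.)
page_count() {
    awk '/^Pages:/ { print $2 }'
}

# Extract images from the last 3 pages, if the sample file is present.
if [ -f input.pdf ] && command -v pdfinfo >/dev/null 2>&1; then
    last=$(pdfinfo input.pdf | page_count)
    # Start two pages before the last one, but never before page 1.
    first=$(( last > 3 ? last - 2 : 1 ))
    pdfimages -f "$first" -l "$last" -png input.pdf images
fi
```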
Another very interesting option is -p, which includes page numbers in the output file names:
pdfimages -f 1 -l 5 -png -p input.pdf images
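The same options can be combined in a loop when you have several PDF files. The following is only a sketch: it assumes the PDFs are in the current directory and that you want everything written to an images directory, with each file name used as the prefix. prefix_for is an illustrative helper name.

```shell
# Derive an output prefix from a PDF path: "docs/report.pdf" -> "report".
# (prefix_for is a hypothetical helper name.)
prefix_for() {
    basename "$1" .pdf
}

for pdf in ./*.pdf; do
    # Skip the literal pattern when no PDF files match.
    [ -f "$pdf" ] || continue
    mkdir -p images
    pdfimages -png -p "$pdf" "images/$(prefix_for "$pdf")"
done
```

Using the file name as the prefix keeps the images from different PDFs apart, and -p adds the page number to each name.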
So, as you can see, it is simple.