Hello, friends. In this simple post, you will learn how to extract embedded images from a PDF.
Occasionally, it is necessary to extract the images that are inside a PDF file. This may seem difficult, but it is easier than you think.
Best of all, you can do it from the terminal, which uses few resources and keeps the process very fast.
Let’s get started.
Install Poppler on Linux
According to the Poppler website:
Poppler is a PDF rendering library based on the xpdf-3.0 code base.
This library gives us access to a set of command-line tools for working with PDF files.
The most sensible way to install it is to use the package included in the official repositories of your distribution, although you can also compile it or download the binaries.
On Debian, Ubuntu, and derivatives such as Linux Mint, you can run:
sudo apt update
sudo apt install poppler-utils
Once the library is installed, we can use one of its tools, pdfimages, to accomplish the task.
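Before going further, it can help to confirm that the tool is actually available. The following is a minimal sketch, assuming a POSIX shell; have_pdfimages is a hypothetical helper name chosen for this example, not part of Poppler.

```shell
#!/bin/sh
# have_pdfimages is a hypothetical helper for this example.
# It succeeds only if a pdfimages executable can be found on the PATH.
have_pdfimages() {
    command -v pdfimages >/dev/null 2>&1
}

if have_pdfimages; then
    # -v prints version information; the output is redirected because
    # many builds write it to standard error.
    pdfimages -v 2>&1 | head -n 1
else
    echo "pdfimages not found; install poppler-utils first" >&2
fi
```

If the second message appears, the package is probably missing or the shell cannot find it on the PATH.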
Extract embedded images from a PDF file
The procedure is simple. Just follow this syntax:
pdfimages -all input.pdf images/prefix
The above command takes all the images from the input.pdf file and writes them under the path you give as the last argument, relative to the directory where you run the command. Of course, you can use an absolute path for the PDF file and another one for the output.
As for images/prefix, ideally choose a prefix that identifies the images well, and a format such as png, which gives better quality than jpg.
Then, the command would look like this:
pdfimages -all input.pdf sample
This produces image files named like sample-nnn.png in the current directory, where nnn is the image number.
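When the last argument includes a directory, as in images/prefix, that directory needs to exist before pdfimages runs; as far as I know, the tool does not create it for you. Here is a small sketch of a wrapper that handles this. The function name extract_images and its three arguments are assumptions made for this example.

```shell
#!/bin/sh
# extract_images PDF OUTDIR PREFIX
# Hypothetical helper: creates OUTDIR if needed, then extracts every image
# from PDF into OUTDIR, using PREFIX as the start of each file name.
extract_images() {
    pdf=$1
    outdir=$2
    prefix=$3

    # Stop early if the PDF does not exist, before creating anything.
    if [ ! -f "$pdf" ]; then
        echo "no such file: $pdf" >&2
        return 1
    fi

    mkdir -p "$outdir" || return 1
    pdfimages -all "$pdf" "$outdir/$prefix"
}

# Example call (input.pdf is a placeholder name):
# extract_images input.pdf images sample
```

With the example call, the extracted files would land in the images directory with names starting with sample.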
If you want to use jpg, then add the -j option:
pdfimages -all -j input.pdf sample
You might not get the desired results, though. See what the man page says about it:
"Normally, all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files. With this option, images in DCT format are saved as JPEG files. All non-DCT images are saved in PBM/PPM format as usual."
More options available for extracting images
The above command extracts all images, but often we only want a range of pages. This matters when the file is very long.
For this, there are the -f and -l options, which define the first and the last page from which to extract the images:
pdfimages -f 1 -l 5 -png input.pdf images
These are perhaps the most useful options because they let us limit the number of output files.
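If you are not sure which pages hold the images you need, recent Poppler versions of pdfimages also offer a -list option, which prints information about the images instead of writing them. The sketch below combines it with -f and -l; list_images is a hypothetical helper name, and the range check is an extra precaution added for this example.

```shell
#!/bin/sh
# list_images PDF FIRST LAST
# Hypothetical helper: prints information about the images found on pages
# FIRST..LAST of PDF, without writing any image files.
list_images() {
    pdf=$1
    first=$2
    last=$3

    # Reject a range that starts before page 1 or ends before it starts.
    if [ "$first" -lt 1 ] || [ "$last" -lt "$first" ]; then
        echo "invalid page range: $first-$last" >&2
        return 2
    fi

    pdfimages -list -f "$first" -l "$last" "$pdf"
}

# Example call (input.pdf is a placeholder name):
# list_images input.pdf 1 5
```

Once the listing shows the pages you care about, the same -f and -l values can be reused for the actual extraction.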
Another very interesting option is -p, which includes page numbers in the output file names:
pdfimages -f 1 -l 5 -png -p input.pdf images
So, as you can see, it is simple.