How to extract embedded images from a PDF


Hello, friends. In this simple post, you will learn how to extract embedded images from a PDF.

Occasionally, it is necessary to extract the images that are inside a PDF file. This may seem difficult, but it is easier than you think.

Best of all, you can do it from the terminal, which keeps the process fast and light on resources.

Let’s get started.

Install Poppler on Linux

According to the Poppler website:

Poppler is a PDF rendering library based on the xpdf-3.0 code base.

The command-line tools that accompany this library, including pdfimages, are what we will use to work with PDF files.

The most sensible way to install it is to use the package from your distribution's official repositories, although you can also compile it yourself or download the binaries.

On Debian, Ubuntu and their derivatives such as Linux Mint, you can run:

sudo apt update
sudo apt install poppler-utils

Once the package is installed, we can use some of its tools to accomplish the task.
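Before going further, it can help to confirm that the tool is really on your PATH. The following is only a minimal sketch: the function name have_pdfimages is made up for this example, and the message text is arbitrary.

```shell
# Hypothetical helper: succeeds only if pdfimages can be found on the PATH.
have_pdfimages() {
    command -v pdfimages >/dev/null 2>&1
}

if have_pdfimages; then
    echo "pdfimages found at $(command -v pdfimages)"
else
    echo "pdfimages not found; install poppler-utils first"
fi
```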

Extract embedded images from a PDF file

The procedure is simple. Just follow this syntax.

pdfimages -all input.pdf images/prefix

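The above command extracts every image from input.pdf into the images directory and starts each file name with the given prefix. The images directory must already exist, because pdfimages does not create it. You can also use absolute paths for both the PDF file and the output prefix.

As for the prefix, choose one that identifies the images well. The prefix does not decide the file format: with -all, JPEG, JPEG2000, JBIG2 and CCITT images are written in their native format, CMYK images are written as TIFF, and everything else is written as PNG, which is lossless.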

With a plain prefix and no directory, the command looks like this:

pdfimages -all input.pdf sample

This produces files named sample-nnn.ext in the current directory, where nnn is the image number and ext depends on the image format (for example png or jpg).
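If you would rather keep the extracted files in their own directory, something like the sketch below should work. The variable values (input.pdf, images, sample) are placeholders, and the guard simply skips extraction when the PDF or the tool is missing.

```shell
pdf="input.pdf"     # placeholder: path to your PDF
outdir="images"     # placeholder: output directory
prefix="sample"     # placeholder: file-name prefix

# Create the output directory first; pdfimages will not do it for us.
mkdir -p "$outdir"

if [ -f "$pdf" ] && command -v pdfimages >/dev/null 2>&1; then
    # Writes files such as images/sample-000.png, images/sample-001.jpg, ...
    pdfimages -all "$pdf" "$outdir/$prefix"
    ls -l "$outdir"
else
    echo "Skipping extraction: $pdf or pdfimages is missing" >&2
fi
```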

The -all option already includes -j. If you only want the JPEG images saved as JPEG files, use the -j option on its own:

pdfimages -j input.pdf sample

The -j option may not give the results you expect, so see what the man page says about it:

"Normally, all images are written as PBM (for monochrome images) or PPM (for non-monochrome images) files. With this option, images in DCT format are saved as JPEG files. All non-DCT images are saved in PBM/PPM format as usual."
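To find out in advance which images are stored in DCT (JPEG) format, pdfimages has a -list option that prints a table of the embedded images instead of writing any files. The sketch below wraps it in a small function; the name list_images is invented for this example. In the output, the enc column should show jpeg for DCT images.

```shell
# Hypothetical helper: list_images PDF
# Prints one table row per embedded image without extracting anything.
list_images() {
    pdf="$1"
    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi
    pdfimages -list "$pdf"
}
```

You would call it as list_images input.pdf and then decide whether -j, -png or -all suits that file best.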

More options available for extracting images

The above commands extract all images, but often we only want a range of pages. This matters most when the file is very long.

For this, there are the options -f and -l, which set the first and the last page to extract images from:

pdfimages -f 1 -l 5 -png input.pdf images

These are perhaps the most useful options because they limit the number of output files.

Another very interesting option is -p, which includes the page number in each output file name (for example images-001-000.png):

pdfimages -f 1 -l 5 -png -p input.pdf images
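If you run this kind of command often, the options above can be bundled into a small shell function. This is just a sketch under a few assumptions: the name extract_pages and its argument order are made up here, and the PDF's base name is used as the prefix.

```shell
# Hypothetical helper: extract_pages FIRST LAST PDF OUTDIR
# Extracts images from pages FIRST..LAST as PNG, with page numbers in the names.
extract_pages() {
    first="$1"
    last="$2"
    pdf="$3"
    outdir="$4"

    if [ ! -f "$pdf" ]; then
        echo "No such file: $pdf" >&2
        return 1
    fi

    # pdfimages does not create the output directory itself.
    mkdir -p "$outdir" || return 1

    # Use the PDF name without its .pdf extension as the file-name prefix.
    prefix=$(basename "$pdf" .pdf)

    pdfimages -f "$first" -l "$last" -png -p "$pdf" "$outdir/$prefix"
}
```

For example, extract_pages 1 5 input.pdf images would write the images from the first five pages into the images directory.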

So, as you can see, it is simple.

Angelo
I am Angelo. A systems engineer passionate about Linux and all open-source software. Although here I'm just another member of the family.
