Extract Embedded Images from PDFs in Linux

Extracting embedded images from PDFs on Linux can be done with a handful of command-line tools. The most direct option is poppler-utils, the command-line package of the Poppler PDF library, which includes pdfimages and pdftoppm. This guide also covers pdfgrep, a related text-search tool that helps you find the pages you want.

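Here are the key differences between these three tools:

  • pdfimages extracts the images embedded in a PDF file. By default it writes PPM or PBM files; options such as -png, -j, and -all select PNG output or keep JPEG and other native formats.
  • pdftoppm renders whole pages to raster images (PPM by default, or PNG, JPEG, and TIFF). It does not pull out individual embedded images.
  • pdfgrep searches the text of PDF files. It cannot extract images, but it can report which pages contain a given phrase.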

Installing poppler-utils gives you pdfimages and pdftoppm. pdfgrep is a separate project built on the Poppler library and is installed on its own.

This guide covers installing the required tools and using them to extract images from PDF files. By the end, you will know which tool fits which job and how to run it on Linux.

Install Poppler-Utils

Before starting, you may need to install the required tools. Some Linux distributions ship them preinstalled; if yours does not, install them with your package manager. Below are the commands to install poppler-utils on some commonly used distributions:

Debian/Ubuntu based

sudo apt-get install poppler-utils

Fedora/RHEL based

sudo dnf install poppler-utils

Gentoo (the command-line tools come from app-text/poppler built with the utils USE flag)

sudo emerge --ask app-text/poppler

openSUSE

sudo zypper install poppler-tools

Arch based

sudo pacman -S poppler

Alpine Linux

sudo apk add poppler-utils

Note that package names may differ slightly depending on your Linux distribution and version. pdfgrep, if you want it, is usually available as a package of the same name. If you run into installation problems, consult your distribution's package management documentation.
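To confirm that the installation worked, you can check whether each command is on your PATH. The following is a minimal sketch; the function name check_tools is made up for this example and is not part of any package.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns 1 if any is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, running check_tools pdfimages pdftoppm pdfgrep prints nothing when all three tools are installed.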

Method 1: pdfimages

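pdfimages is a command-line tool from the poppler-utils package. It extracts the images embedded in a PDF file. Without a format option it writes PPM or PBM files; with -all it keeps JPEG, JPEG 2000, JBIG2, and CCITT images in their native format, writes CMYK images as TIFF, and writes everything else as PNG. To extract all images from a PDF file using pdfimages, run the following command: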

pdfimages -all input.pdf output_prefix

This command extracts all images from the input PDF file and saves them as separate files whose names start with output_prefix. For example, if output_prefix is “image”, the output files are named “image-000”, “image-001”, and so on, each with an extension that matches its format, such as .png or .jpg.
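Before writing any files, it can help to see what a PDF contains. pdfimages has a -list option that prints a table of the embedded images (page, size, encoding) without extracting anything. The sketch below lists the images first and then extracts them into their own directory; the function name extract_images and the “image” prefix are illustrative choices, not part of poppler-utils.

```shell
# Hypothetical helper: preview and then extract the images of one PDF.
# Usage: extract_images input.pdf output_dir
extract_images() {
    pdf=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # -list prints a table of the embedded images and writes no files.
    pdfimages -list "$pdf" || return 1
    # -all keeps each image in its native format where possible.
    pdfimages -all "$pdf" "$outdir/image"
}
```

Calling extract_images input.pdf extracted would leave the files in a directory named extracted, which keeps them apart from the source PDF.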

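You can also restrict pdfimages to a page range with -f (first page) and -l (last page). No format option is given in the example below, so the images are written as PPM or PBM files; add -all or -png to change that. To extract images from pages 5 to 10, run the following command: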

pdfimages -f 5 -l 10 input.pdf output_prefix
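The same page-range options also work with -list, which gives a quick way to check how many images a range holds before extracting them. This is a rough sketch that assumes the table printed by -list starts with a two-line header, as recent Poppler releases do; count_images is a made-up name.

```shell
# Hypothetical helper: count the embedded images in a page range.
# Usage: count_images input.pdf first_page last_page
count_images() {
    # -list prints one row per image below a two-line header (assumed layout),
    # so count every line after the second.
    pdfimages -list -f "$2" -l "$3" "$1" | awk 'NR > 2 { n++ } END { print n + 0 }'
}
```

A call such as count_images input.pdf 5 10 prints a single number; if it prints 0, there is nothing to extract from those pages.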

Method 2: pdftoppm

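pdftoppm is another command-line tool from the poppler-utils package. Unlike pdfimages, it does not extract embedded images; it renders each page, including text and vector graphics, to a raster image in PPM (the default), PNG, JPEG, or TIFF format. This is useful when the picture you want is drawn with vector graphics or when you need the page exactly as it appears. To render every page of a PDF file as PNG using pdftoppm, run the following command: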

pdftoppm -png input.pdf output_prefix

This command converts each page of the input PDF file to a PNG image and saves it as a separate file named with the specified prefix and the page number. You can also use pdftoppm to render only a specific page range. For example, to render pages 5 to 10, run the following command:

pdftoppm -png -f 5 -l 10 input.pdf output_prefix
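Because pdftoppm rasterizes the page, the output resolution matters. The -r option sets it in DPI, and pdftoppm uses 150 DPI when it is omitted. The following sketch wraps that in a small function; render_pages is a made-up name for this example.

```shell
# Hypothetical helper: render the pages of a PDF to PNG at a chosen resolution.
# Usage: render_pages input.pdf output_prefix [dpi]
render_pages() {
    pdf=$1
    prefix=$2
    dpi=${3:-150}   # 150 DPI is pdftoppm's own default
    pdftoppm -png -r "$dpi" "$pdf" "$prefix"
}
```

For example, render_pages input.pdf page 300 would render every page at 300 DPI, which gives larger files but sharper detail.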

Method 3: pdfgrep

pdfgrep searches the text of PDF files, much as grep does for plain text. It cannot extract images on its own: it matches the rendered page text rather than raw PDF objects such as /XObject image dictionaries, and pdfdetach, which is sometimes paired with it, saves file attachments, not images. pdfgrep is still useful for finding the pages that hold the images you want. To list every line that mentions “Figure” together with its page number, run the following command:

pdfgrep -n 'Figure' input.pdf

Each match is printed with its page number in front. Pass those page numbers to pdfimages with the -f and -l options to extract images from just those pages.
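Since pdfgrep can print the page number of each text match with -n, one workable pattern is to search for a keyword and hand the matching page numbers to pdfimages. The sketch below assumes that each output line of pdfgrep -n starts with the page number followed by a colon, which is the format pdfgrep uses when given a single file; the function name and the “-p” naming scheme are illustrative.

```shell
# Hypothetical helper: extract images only from pages whose text matches a pattern.
# Usage: extract_from_matching_pages PATTERN input.pdf output_prefix
extract_from_matching_pages() {
    pattern=$1
    pdf=$2
    prefix=$3
    # pdfgrep -n prefixes each matching line with "PAGE:"; keep each page once.
    pdfgrep -n "$pattern" "$pdf" | cut -d: -f1 | sort -un |
    while read -r page; do
        # Restrict pdfimages to that single page.
        pdfimages -all -f "$page" -l "$page" "$pdf" "$prefix-p$page"
    done
}
```

A call such as extract_from_matching_pages 'Figure' input.pdf fig would write the images of each matching page under a prefix like fig-p3, so you can tell which page each file came from.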

Conclusion

Extracting images from PDF files on Linux is straightforward with a few command-line tools. Which one to use depends on what you need: pdfimages extracts the embedded images themselves, pdftoppm renders whole pages as images, and pdfgrep helps you find the pages worth extracting from. With these tools, you can pull images out of PDF files and use them in your projects.