
How to search multiple PDF files simultaneously with pdfgrep

1 Answer

Best answer

Today we look at a free and open source application for Linux that lets us search several PDF files at once. The program is called pdfgrep, and it locates words, phrases or text strings in PDF files directly from the console.


In short, it works like the classic grep but is designed for PDF files. It is a very useful tool because PDFs are not plain text files, so ordinary text-search tools such as grep cannot search them reliably.

How to install pdfgrep?

The installation process is very simple, since the application is available in the official repositories of most Linux distributions (Debian, Fedora, Ubuntu, openSUSE, Arch Linux, Gentoo, etc.), as well as on FreeBSD.

It can also be compiled from source, but in our case we will install it on Ubuntu with the command:
sudo apt-get install pdfgrep

Note, however, that the Ubuntu repositories carry a somewhat old version, 1.4.1, while the project's official website already offers version 2.0.1. The website also has detailed instructions for compiling pdfgrep, in case you want the latest version.
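
Because the packaged version can lag behind the upstream release, it can help to check what is actually installed. The following is a minimal sketch, assuming a POSIX shell. check_pdfgrep is a hypothetical helper name (not part of pdfgrep), and --version is the long form of the -V option listed in the program's help:

```shell
# check_pdfgrep: hypothetical helper, not shipped with pdfgrep.
# Prints the installed version, or reports that the program is missing.
check_pdfgrep() {
    if command -v pdfgrep >/dev/null 2>&1; then
        pdfgrep --version
    else
        echo "pdfgrep is not installed" >&2
        return 1
    fi
}
```

Calling check_pdfgrep after the installation should print the version string, which you can compare against the one announced on the project's website.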

How to search multiple PDF files simultaneously with pdfgrep

The most basic pdfgrep command is:
pdfgrep <word> <file.pdf>

The previous command looks for the "word" we specify inside the given "file.pdf". Every match found is printed on the screen.


But the really interesting part is searching several PDF documents at once. To do this, we run the command:
pdfgrep <word> *.pdf

For example, running pdfgrep computer *.pdf searches for the word "computer" in all the PDF files in the current folder.
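
When many files match, a per-file count can be easier to read than the full list of matching lines. The sketch below wraps the -c option from the program's help in a small function. count_word is a hypothetical name chosen for illustration, and the ./ prefix on the glob is only a precaution against file names that begin with a dash:

```shell
# count_word WORD: hypothetical helper that prints a per-file count of
# matches for WORD in the PDF files of the current directory.
count_word() {
    # Quote "$1" so a phrase containing spaces is passed as one pattern.
    pdfgrep -c "$1" ./*.pdf
}
```

For instance, count_word "computer science" would pass the whole phrase to pdfgrep as a single pattern.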

We can go further still and search recursively through the current directory and its subdirectories. To do so, use the -r option together with --include, which restricts the search to files whose names match a pattern, or --exclude, which skips files whose names match a pattern. The following examples make this clearer:

  • Search recursively in all PDF files: pdfgrep -r --include "*.pdf" <word>
  • Search recursively in all PDF files, but excluding those whose name begins with "invoice": pdfgrep -r --exclude "invoice*.pdf" <word>
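
The two recursive examples above can be folded into one reusable function. This is only a sketch: search_tree is a hypothetical name, and it assumes that pdfgrep accepts --include and --exclude together and takes a starting directory as the FILE argument, as grep does. The patterns are quoted so that the shell passes them to pdfgrep unexpanded:

```shell
# search_tree WORD [DIR]: hypothetical helper that searches DIR (default:
# the current directory) recursively, skipping files named invoice*.pdf.
search_tree() {
    word=$1
    dir=${2:-.}
    # Assumption: --include and --exclude can be combined, as in grep.
    pdfgrep -r --include "*.pdf" --exclude "invoice*.pdf" "$word" "$dir"
}
```

With this in place, search_tree computer ~/Documents would search that folder and everything below it.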

Delving a little deeper into the different options of pdfgrep

This is where some of the other pdfgrep options come into play, such as -i, which makes the search case-insensitive. Another interesting option is -n, which shows the page number on which each match was found.

For example, we can combine the above options and execute the following command:
pdfgrep -in -r --include "*.pdf" computer


To see all the available options, I recommend reading the program's help carefully by running pdfgrep --help or man pdfgrep. You can also check the official online documentation if you find it more convenient.

zeokat@ubuntu:~$ pdfgrep --help
Usage: pdfgrep [OPTION]... PATTERN FILE...

Search for PATTERN in each FILE.
PATTERN is, by default, an extended regular expression.

Options:
 -i, --ignore-case              Ignore case distinctions
 -P, --pcre                     Use Perl compatible regular expressions (PCRE)
 -H, --with-filename            Print the file name for each match
 -h, --no-filename              Suppress the prefixing of file name on output
 -n, --page-number              Print page number with output lines
 -c, --count                    Print only a count of matches per file
 -C, --context NUM              Print at most NUM chars of context
     --color WHEN               Use colors for highlighting;
                                WHEN can be `always', `never' or `auto'
 -p, --page-count               Print only a count of matches per page
 -m, --max-count NUM            Stop reading after NUM matching lines (per file)
 -q, --quiet                    Suppress normal output
 -r, --recursive                Search directories recursively
 -R, --dereference-recursive    Likewise, but follow all symlinks
     --help                     Print this help
 -V, --version                  Show version information
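
Of the options listed, -q is the one that lends itself to scripting, since it suppresses output and leaves only the exit status. The sketch below assumes that pdfgrep, like grep, exits with status 0 when at least one match is found; check man pdfgrep for your version. list_matching_pdfs is a hypothetical helper name:

```shell
# list_matching_pdfs WORD FILE...: hypothetical helper that prints the
# name of every given PDF that contains WORD at least once.
list_matching_pdfs() {
    word=$1
    shift
    for file in "$@"; do
        # Assumption: exit status 0 means "match found", as with grep.
        if pdfgrep -q "$word" "$file"; then
            printf '%s\n' "$file"
        fi
    done
}
```

For example, list_matching_pdfs computer *.pdf prints only the names of the files that mention the word, one per line.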
