Today we are going to talk about a free and open source application for Linux systems that lets us search several PDF files simultaneously. The program is called pdfgrep, and it locates words, phrases or text strings in PDF files directly from the console.
In short, it works like the classic grep but is designed for PDF files. It is a very useful tool because PDFs are not plain text files, so searching their contents requires a dedicated tool like this one.
How to install pdfgrep?
The installation process is very simple, since the application is available in the official repositories of most Linux distributions (Debian, Fedora, Ubuntu, openSUSE, Arch Linux, Gentoo, etc.) as well as on FreeBSD.
It can also be compiled from source, but in our case we will install it on Ubuntu with the command:
sudo apt-get install pdfgrep
Bear in mind, though, that the Ubuntu repositories carry a somewhat old version, 1.4.1, while the official website of the project already lists version 2.0.1. The official website is worth a visit, because it also has detailed instructions for compiling pdfgrep in case you want the latest version.
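Before deciding whether to compile from source, it can help to check whether pdfgrep is already installed and which version your system provides. The following is only a minimal sketch: the helper name check_tool is hypothetical and simply wraps the shell's standard command -v lookup, while --version is the option listed in the program's own help.

```shell
#!/bin/sh
# check_tool is a hypothetical helper (not part of pdfgrep):
# it succeeds when the given command is found on the PATH.
check_tool() {
    command -v "$1" >/dev/null 2>&1
}

if check_tool pdfgrep; then
    # Print the installed version to compare with the latest release.
    pdfgrep --version
else
    echo "pdfgrep is not installed; on Ubuntu: sudo apt-get install pdfgrep"
fi
```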
How to search multiple PDF files simultaneously with pdfgrep.
The most basic usage of pdfgrep is:
pdfgrep <word> <file.pdf>
This command looks for the "word" we define inside the specified "file.pdf". Every occurrence found is shown on the screen.
But the really interesting part is that the search can be carried out on several PDF documents at once. To do this, we run:
pdfgrep <word> *.pdf
For example, running pdfgrep computer *.pdf searches for the word "computer" in all the PDF files in the current folder.
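One detail worth knowing: with the default settings of most shells, if the folder contains no PDF files the pattern *.pdf is passed to the program literally, which typically ends in a "file not found" style error. As a hedged sketch, the hypothetical helper search_pdfs below (the name and the return code 2 are my own choices, not part of pdfgrep) collects the PDF files first and only calls pdfgrep when there is something to search.

```shell
#!/bin/sh
# search_pdfs is a hypothetical helper: search every PDF in a directory,
# but only call pdfgrep when at least one PDF actually exists there.
search_pdfs() {
    word=$1
    dir=${2:-.}
    set --                          # reuse the positional parameters as a file list
    for f in "$dir"/*.pdf; do
        # Skip the literal, unexpanded pattern when nothing matched.
        if [ -e "$f" ]; then
            set -- "$@" "$f"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no PDF files found in $dir" >&2
        return 2
    fi
    pdfgrep "$word" "$@"
}
```

It could then be called as search_pdfs computer . to search the current folder.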
But we can go further still and perform a recursive search in the current directory and its subdirectories. To do so, use the -r option, optionally together with --include to search only files whose names match a pattern, or --exclude to skip files whose names match a pattern. The following examples make this clearer:
- Search recursively in all PDF files:
pdfgrep -r --include "*.pdf" <word>
- Search recursively in all PDF files, excluding those whose name begins with «invoice»:
pdfgrep -r --exclude "invoice*.pdf" <word>
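If you are unsure which files a pattern will leave in or out, you can preview the selection before searching. The sketch below does not use pdfgrep at all: list_candidate_pdfs is a hypothetical helper built on find, and it assumes that the --include and --exclude patterns are matched against file names the way find's -name test does, so treat its output as an approximation.

```shell
#!/bin/sh
# list_candidate_pdfs is a hypothetical helper: it prints the PDF files
# under a directory, skipping names that match an exclusion glob.
# It relies on find, not pdfgrep, so it only approximates the filters.
list_candidate_pdfs() {
    dir=${1:-.}
    exclude=${2:-'invoice*.pdf'}
    find "$dir" -type f -name '*.pdf' ! -name "$exclude"
}
```

For instance, list_candidate_pdfs . 'invoice*.pdf' lists every PDF below the current directory except those whose name starts with "invoice".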
Delving a little deeper into the different options of pdfgrep.
This is where some of pdfgrep's options come into play, such as -i, which makes the search case-insensitive. Another interesting option is -n, which shows the page number where the word or text string was found.
For example, we can combine the above options and execute the following command:
pdfgrep -in -r --include "*.pdf" computer
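When several words need to be looked up in the same collection of documents, the combined command can be wrapped in a small loop. This is just a sketch under stated assumptions: search_terms is a hypothetical wrapper of my own, and it assumes that pdfgrep, like grep, returns a non-zero exit status when nothing matches or an error occurs.

```shell
#!/bin/sh
# search_terms is a hypothetical wrapper: it runs the combined command
# shown above once per search term, inside the given directory.
search_terms() {
    dir=$1
    shift
    for term in "$@"; do
        echo "== $term =="
        # Assumption: as with grep, a non-zero exit status means
        # "no match" or an error.
        if ! (cd "$dir" && pdfgrep -in -r --include "*.pdf" "$term"); then
            echo "(no matches, or an error occurred)"
        fi
    done
}
```

A call such as search_terms ~/Documents computer linux would run the search twice, once for each word, and print a heading before each set of results.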
To see all available options, I recommend reading the program's help carefully by running pdfgrep --help or man pdfgrep. You can also check the official online documentation page if you find it more convenient.
zeokat@ubuntu:~$ pdfgrep --help
Usage: pdfgrep [OPTION]... PATTERN FILE...
Search for PATTERN in each FILE.
PATTERN is, by default, an extended regular expression.
Options:
 -i, --ignore-case             Ignore case distinctions
 -P, --pcre                    Use Perl compatible regular expressions (PCRE)
 -H, --with-filename           Print the file name for each match
 -h, --no-filename             Suppress the prefixing of file name on output
 -n, --page-number             Print page number with output lines
 -c, --count                   Print only a count of matches per file
 -C, --context NUM             Print at most NUM chars of context
     --color WHEN              Use colors for highlighting;
                               WHEN can be `always', `never' or `auto'
 -p, --page-count              Print only a count of matches per page
 -m, --max-count NUM           Stop reading after NUM matching lines (per file)
 -q, --quiet                   Suppress normal output
 -r, --recursive               Search directories recursively
 -R, --dereference-recursive   Likewise, but follow all symlinks
     --help                    Print this help
 -V, --version                 Show version information