Working with PDF in Python

The Portable Document Format (PDF) is a widely used file format for presenting and exchanging documents. PDFs are used for many things, such as e-books, digitally signed agreements, and password-protected files. Adobe originally created PDF, but it is now an open standard maintained by the International Organization for Standardization (ISO).

In this tutorial, we will learn to perform the following operations on PDFs using Python.

  • Extracting document information (metadata) from a PDF
  • Rotating pages
  • Merging PDFs
  • Splitting PDFs
  • Adding Watermarks
  • Encrypting and Decrypting PDFs
  • Extracting all images from a PDF
  • Extracting text from PDF

Before discussing each of these topics, we need to install the requirements.

PDF tools Installation

In this tutorial, we will use the PyPDF2 and PyMuPDF libraries, two widely used libraries for working with PDF files in Python. Neither comes pre-installed with Python, so we need to install them manually. We can easily install them using pip, Python's package manager.

To begin installing PyPDF2, run the following command in the terminal.

pip install -U pypdf2

Similarly, to install PyMuPDF using the pip tool, use the following command.

pip install -U pymupdf

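After the installation is finished, you can work with PDFs by importing these libraries in your code. Note that the examples in this tutorial use the older camelCase names such as PdfFileReader and getPage(). PyPDF2 3.0 removed these names in favor of PdfReader and snake_case methods, so if the examples raise a deprecation error, install an older release with pip install "PyPDF2<3.0". Recent PyMuPDF releases have likewise moved to snake_case names such as load_page() and get_text().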
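Because the class and method names in both libraries have changed between releases, it can help to check which versions are actually installed before running the examples. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper names installed_version and report are illustrative and not part of either library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("PyPDF2", "PyMuPDF")):
    """Build one status line per package name."""
    lines = []
    for name in packages:
        version = installed_version(name)
        lines.append(f"{name}: {version if version else 'not installed'}")
    return lines


# prints one line per library with its version, or "not installed"
print("\n".join(report()))
```

If a package is reported as not installed, run the corresponding pip command again in the same Python environment.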

Extracting MetaData From PDF File

Many types of documents, including PDFs, contain useful information in their metadata. Details such as the author name, creator, producer, subject, and number of pages are useful for forensic tasks and for collecting digital artifacts. We can use the PyPDF2 library to extract the metadata from any PDF file easily.

In this tutorial, I am using a PDF file named sample.pdf with some dummy content. You can use any PDF file and provide its name in place of sample.pdf in the code. Run the following code to extract the document information from sample.pdf.

# importing required modules
import PyPDF2 
# creating a pdf file object
# change the name sample.pdf to your pdf file 
pdfFileObj = open('sample.pdf', 'rb')
# creating a pdf reader object 
pdf = PyPDF2.PdfFileReader(pdfFileObj)
# extracting document information
doc_info = pdf.getDocumentInfo()
# displaying the information
print("author :", doc_info.author)
print("creator :", doc_info.creator)
print("producer :", doc_info.producer)
print("subject :", doc_info.subject)
print("title :", doc_info.title)
print("Total number of pages :", pdf.getNumPages())
# closing the pdf file object 
pdfFileObj.close()

In the above code, we first import the PyPDF2 library using the import statement. Then we open the sample.pdf file using Python's built-in open() function. When calling open(), we need to specify the mode in which to open the file. Here the mode is rb, which means the file is opened for reading in binary mode. The open() function returns a file object, which we store in the pdfFileObj variable. Then we pass this file object to the PdfFileReader class of the PyPDF2 library, which reads the PDF file.

The reader object has a method named getDocumentInfo() that extracts the metadata from the PDF file. We use the print() function to display the fields of the dictionary-like object returned by getDocumentInfo(), along with the page count from getNumPages(). In the last line of the code block, we call the file object's close() method to close the PDF file.

On running the above code in python we will get the following output.

[Image: extracting PDF metadata using Python]
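Not every PDF sets every metadata field, and some files carry no document information at all. The sketch below shows one way to print the fields with a placeholder for missing values. It assumes the information object behaves like a dictionary keyed by the standard PDF entries (/Author, /Title, and so on), and uses a plain dict as a stand-in; summarize_info is an illustrative helper, not a PyPDF2 function.

```python
# standard keys of the PDF document information dictionary
INFO_KEYS = {
    "author": "/Author",
    "creator": "/Creator",
    "producer": "/Producer",
    "subject": "/Subject",
    "title": "/Title",
}


def summarize_info(doc_info, missing="(not set)"):
    """Return 'label : value' lines, using a placeholder for absent fields."""
    lines = []
    for label, key in INFO_KEYS.items():
        # an empty or missing info object yields the placeholder for every field
        value = doc_info.get(key) if doc_info else None
        lines.append(f"{label} : {value if value is not None else missing}")
    return lines


# a plain dict stands in for the object returned by getDocumentInfo()
example = {"/Author": "Jane Doe", "/Title": "Sample"}
print("\n".join(summarize_info(example)))
```

With a real file, the same helper could be called as summarize_info(pdf.getDocumentInfo()).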

Rotate Pages of PDF using Python

In many scenarios, we need to rotate a particular page in a PDF by some angle (usually 90 degrees). We can do this easily with the PyPDF2 library by calling the rotateClockwise() or rotateCounterClockwise() method of a page object. The angle must be a multiple of 90 degrees.

See the below code for illustration.

# importing required modules 
import PyPDF2 
# creating a pdf file object
# change the name sample.pdf to your pdf file 
pdfFileObj = open('sample.pdf', 'rb') 
# creating a pdf reader object 
pdf_reader = PyPDF2.PdfFileReader(pdfFileObj)
# creating a pdf file writer object
pdf_writer = PyPDF2.PdfFileWriter()
# accessing the first page of the PDF
first_page = pdf_reader.getPage(0)
# rotating the first page to 90 degree anti-clockwise
first_page_rotate = first_page.rotateCounterClockwise(90)
# adding the rotated page to the pdf_writer 
pdf_writer.addPage(first_page_rotate)
# opening a new file for writing in binary mode
newfile = open('new.pdf', 'wb')
# writing pdf data of the rotated page to the new file
pdf_writer.write(newfile)
# closing the new file
newfile.close()
# closing the previous pdf file
pdfFileObj.close()

In the above code, we first import the PyPDF2 library. Then we open the sample.pdf file for reading in binary mode by passing the string “rb” as the mode argument. We create pdf reader and writer objects using the PdfFileReader and PdfFileWriter classes of the PyPDF2 module. After that, we call the reader's getPage() method with the argument 0 to get the first page of the input PDF file.

Next, we call rotateCounterClockwise() with the argument 90 to rotate the first page by 90 degrees in the anti-clockwise direction. We then add the rotated page to the pdf writer object and write it to a new PDF file named new.pdf. Finally, we close each of the open files by calling their close() methods.

On running the above code, we get a new file named new.pdf containing the first page of the sample PDF rotated 90 degrees counter-clockwise.
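A PDF stores page rotation as a multiple of 90 degrees, so a 90-degree counter-clockwise turn is equivalent to a 270-degree clockwise one. The small sketch below works out the resulting rotation value; rotated_angle is an illustrative helper and not part of PyPDF2.

```python
def rotated_angle(current, degrees, clockwise=True):
    """Return the page rotation (0, 90, 180 or 270) after turning a page."""
    if degrees % 90 != 0:
        raise ValueError("PDF pages can only be rotated in multiples of 90 degrees")
    delta = degrees if clockwise else -degrees
    # the modulo keeps the result within one full turn
    return (current + delta) % 360


print(rotated_angle(0, 90, clockwise=False))  # 270
```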

Merging PDF

There are situations where we need to merge two or more PDFs into a single file. We can do this easily with Python. The code below demonstrates merging two PDFs into a single file.

# importing required modules 
import PyPDF2 
# creating a list with pdf names
pdfs = ['sample1.pdf', 'sample2.pdf']
# creating a pdf merger object
pdf_merger = PyPDF2.PdfFileMerger()
# keeping track of every input file so that all of them can be closed
opened_files = []
# merging the files
for pdf in pdfs:
    f = open(pdf, 'rb')
    opened_files.append(f)
    pdf_merger.append(f)
# creating an output file
outputfile = open('merged.pdf', 'wb')
# writing the merged pdf in the output file
pdf_merger.write(outputfile)
# closing all the opened files
for f in opened_files:
    f.close()
outputfile.close()

We first import the PyPDF2 module. Then we create a Python list that contains the names of the two PDFs that we want to merge. You can put any number of PDFs in the list. Next, we create a PDF merger object using the PdfFileMerger class of the PyPDF2 library. We open each PDF in binary read mode and use the merger's append() method to add it to the merger object. We then use the merger's write() method to write the merged content to a PDF file named merged.pdf, and finally close the opened files.

On running the above code, we get a new PDF file named merged.pdf in the current working directory.
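When many input files are opened in a loop, it is easy to leave some of them open. One way to guarantee that every file is closed, even if an error occurs midway, is contextlib.ExitStack from the standard library. The sketch below assumes merger is any object with append(fileobj) and write(fileobj) methods, as the PyPDF2 merger object has; merge_files is an illustrative name.

```python
from contextlib import ExitStack


def merge_files(paths, output_path, merger):
    """Append every input file to `merger`, write the result, close all files."""
    with ExitStack() as stack:
        for path in paths:
            # each file is registered with the stack and closed when it exits
            merger.append(stack.enter_context(open(path, "rb")))
        with open(output_path, "wb") as output_file:
            merger.write(output_file)
```

With the variables from the example above, this could be called as merge_files(pdfs, 'merged.pdf', pdf_merger).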

Splitting PDF

In the previous topic, we saw how to merge multiple PDFs into a single PDF. We can also split a single PDF into multiple PDFs. The following example shows how to split a PDF into one file per page.

# importing the required modules
import PyPDF2
import os
# name of the pdf file we want to split
path = "sample.pdf"
# prefix name of the split files
split_name = 'sample_page'
# directory where we want to store the split files
split_dir = "split"
# creating a pdf reader object
pdf = PyPDF2.PdfFileReader(path)
# creating the split directory if it does not exist
if not os.path.exists(split_dir):
    os.mkdir(split_dir)
# changing path to the split directory
os.chdir(split_dir)
# splitting the pdf into individual pages
for page in range(pdf.getNumPages()):
    pdf_writer = PyPDF2.PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page))
    output = f'{split_name}_{page+1}.pdf'
    with open(output, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)
print(f'[+] The PDF file has been split into the {split_dir} directory')

In the above code, we first import the PyPDF2 and os modules. Next, we declare three string variables, namely path, split_name, and split_dir, which contain the path to the PDF we want to split, the prefix for the names of the split files, and the directory in which to keep the split files, respectively. Then we create a pdf reader object using the PdfFileReader class.

We also use the os module to create the directory for the split files and then change the working directory to it using the os module's chdir() function. Next, we split the single PDF into multiple PDFs by looping over the pages with a for loop and writing each page to a new PDF using the write() method of a PdfFileWriter object.

On running the above code, we will get a new directory created in the current working directory with the name split, which will contain all the split pdf files.
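Changing the working directory with chdir() affects every relative path used later in the same program. An alternative is to build the full output paths up front with pathlib and leave the working directory alone. The following is a minimal sketch of that idea; split_output_paths is an illustrative helper, and its default arguments mirror the names used in the example above.

```python
from pathlib import Path


def split_output_paths(num_pages, out_dir="split", prefix="sample_page"):
    """Create `out_dir` if needed and return one output path per page."""
    directory = Path(out_dir)
    # exist_ok avoids an error when the directory is already present
    directory.mkdir(parents=True, exist_ok=True)
    return [directory / f"{prefix}_{page + 1}.pdf" for page in range(num_pages)]
```

Each returned path can be passed directly to open(path, 'wb') inside the splitting loop, so the call to os.chdir() is no longer needed.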

Adding Watermarks to PDF

Adding a watermark to a PDF is typically done for copyright purposes. We can easily add a watermark to a PDF by using the PyPDF2 module of Python. We need two PDFs: the input PDF to which we want to add the watermark, and a second PDF that contains the watermark itself. We use the mergePage() method of a PyPDF2 page object to merge the watermark page into a page of the input file. The following code shows how to add a watermark to the first page of a PDF.

# importing required modules 
import PyPDF2 
# The input PDF file 
input_pdf="sample.pdf"
# The PDF file name which will be output
output_pdf="output.pdf"
# the pdf file containing watermark
watermark_pdf="watermark.pdf"
# creating a reader object for the input pdf
pdf_reader = PyPDF2.PdfFileReader(input_pdf)
# creating a pdf writer object
pdf_writer = PyPDF2.PdfFileWriter()
# reading the watermark
watermark_reader = PyPDF2.PdfFileReader(watermark_pdf)
watermark_page = watermark_reader.getPage(0)
# reading the first page of the input pdf
first_page = pdf_reader.getPage(0)
first_page.mergePage(watermark_page)
# writing the watermarked output pdf file
pdf_writer.addPage(first_page)
output_file = open(output_pdf,'wb')
pdf_writer.write(output_file)
# closing all the open files
output_file.close()

In the above code, we first import the PyPDF2 module. Next, we declare three string variables named input_pdf, output_pdf, and watermark_pdf, which contain the names of the input file, the output file, and the file that holds the watermark, respectively. Then we create pdf reader objects for the input PDF and the watermark PDF using the PdfFileReader class.

We use the getPage() method with the argument 0 to get the first page of each PDF. Then we add the watermark to the first page of the input PDF by calling its mergePage() method. Finally, we add the watermarked page to the pdf writer object created with PdfFileWriter() and use its write() method to write the result to an output PDF file named output.pdf.

On running the above code with your own input and watermark files, we get a new PDF file named output.pdf that contains the watermarked first page. To watermark every page, loop over all the pages of the input PDF in the same way.

Encrypting and Decrypting PDFs

We use PDF documents to store many personal identity documents, company documents, and so on. To keep the information private, we can secure a PDF by encrypting it. A PDF encrypted with a strong password cannot be opened without that password. We can use Python's PyPDF2 library to encrypt or decrypt a PDF file easily. The following code block demonstrates encrypting a PDF file using the PyPDF2 library.

# importing the required modules
import PyPDF2
# creating a PDF writer object
pdf_writer = PyPDF2.PdfFileWriter()
# creating a pdf reader object for the input pdf
pdf_reader = PyPDF2.PdfFileReader('sample2.pdf')
# creating a new pdf with the content of the input pdf
for page in range(pdf_reader.getNumPages()):
    pdf_writer.addPage(pdf_reader.getPage(page))
# encrypting the whole document once, after all the pages are added
pdf_writer.encrypt(user_pwd='codeunderscored', owner_pwd=None, use_128bit=True)
# writing the output file
output_file = open('output.pdf', 'wb')
pdf_writer.write(output_file)
# closing the opened files
output_file.close()

In the above code, we first import the PyPDF2 module. We then create a pdf writer object for the output file and a pdf reader object for the input file. Next, we use a for loop to copy every page of the input PDF to the writer, and then call the writer's encrypt() method once to encrypt the whole document. The encrypt() method accepts the user password as its first argument, the owner password as its second, and a flag that selects 128-bit encryption as its third. We can set these according to our requirements. Finally, we use the write() method of the pdf writer object to write the encrypted output to a PDF file named output.pdf.

On running the above code, we get a file named output.pdf that is encrypted with a user password. On opening the output file in a PDF reader, you will be prompted for the password.
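The example above hardcodes the password in the script, which is convenient for a demo but exposes the password to anyone who can read the file. A common alternative is to read it from an environment variable or prompt for it at run time. The sketch below uses only the standard library; the function name get_pdf_password and the variable name PDF_PASSWORD are assumptions made for illustration.

```python
import getpass
import os


def get_pdf_password(env_var="PDF_PASSWORD"):
    """Return the password from the environment, or prompt for it."""
    password = os.environ.get(env_var)
    if password:
        return password
    # getpass hides the typed characters on the terminal
    return getpass.getpass("PDF password: ")
```

The returned value can then be passed as the user_pwd argument of encrypt() instead of a string literal.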

We can also decrypt an encrypted PDF file using Python by providing the password. To decrypt a PDF file with the PyPDF2 library, see the code below.

# importing the required modules
import PyPDF2
# reading the encrypted pdf file
pdf_reader = PyPDF2.PdfFileReader('output.pdf')
# checking if the file is encrypted
is_encrypt = pdf_reader.getIsEncrypted()
if is_encrypt:
    print("The file is Encrypted")
    print("decrypting...")
    # decrypting the encrypted pdf file
    pdf_reader.decrypt('codeunderscored')
    pdf_writer = PyPDF2.PdfFileWriter()
    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))
    output_file = open("output.pdf", "wb")
    pdf_writer.write(output_file)
    output_file.close()
    print("The file has been decrypted")
else:
    print("The file is not encrypted")

In the above code, we first import the PyPDF2 library. Then we use the PdfFileReader class to read the encrypted output.pdf file. Next, we use the reader's getIsEncrypted() method to check whether the file is encrypted. If it is, we call the decrypt() method, which accepts the password of the PDF as an argument. We then create a pdf writer object using the PdfFileWriter class and add all the pages of the decrypted file to it. Finally, we open output.pdf in “wb” mode and write all the pages from the pdf writer object to it, which overwrites the encrypted file. If you open output.pdf in a PDF reader, you will find it decrypted.

Extracting all images from a PDF

In all of the above topics, we used the PyPDF2 library, but we will use the PyMuPDF library for the remaining examples. PyMuPDF can be used to extract images or text easily from a PDF. The code below shows a practical example of extracting images from a PDF file. It creates a new directory named images, extracts every image from the PDF file, and stores the images in that directory.

# importing the required modules
import fitz
import os
# the PDF file from which images will be extracted
input_pdf = "sample1.pdf"
# The directory where we want to output the images
output_dir = "images"
# The prefix of the image names
image_prefix = "codeunderscored"
# opening the pdf file
pdf = fitz.open(input_pdf)
# creating the output directory if not present already
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
# changing the path to the output directory
os.chdir(output_dir)
# extracting all the images from the pdf
for page in range(len(pdf)):
    for image in pdf.getPageImageList(page):
        xref = image[0]
        pix = fitz.Pixmap(pdf, xref)
        if pix.n < 5:
            pix.writePNG(f"{image_prefix}{page}-{xref}.png")
        else:
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG(f"{image_prefix}{page}-{xref}.png")
            pix1 = None
        pix = None
# end of the program

In the above code, we import the fitz module, which is the import name of PyMuPDF. Then we use nested for loops to iterate over the pages and call getPageImageList() with each page number to find the images on that page. For each image, we build a Pixmap from its cross-reference number (xref). If pix.n is less than 5, the image is grayscale or RGB and can be saved as a PNG directly; otherwise it is CMYK, so we convert it to RGB first. Each image is written to its own file in the images directory.

Extracting Text From PDF

We can extract text from a PDF easily using the PyMuPDF library. PyPDF2 can also extract text from a PDF, but in many cases PyMuPDF gives better results. The code below demonstrates extracting text from the first page of a PDF.

# importing the required modules
import fitz
# the pdf file we want to extract text from
input_pdf = "sample2.pdf"
# opening the input pdf file
pdf = fitz.open(input_pdf)
# loading the first page
firstpage = pdf.loadPage(0)
# extracting the text of the first page
firstpagetext = firstpage.getText("text")
# displaying the text of the first page
print(firstpagetext)

In the above code, we first use the open() function of the PyMuPDF library to open the input PDF file. Then we call the loadPage() method with the argument 0 to load the first page. After that, we call the getText() method on that page to extract its text.

On running the above code on a sample PDF file, I got the following output.

[Image: extracting text from PDF using Python]

Conclusion

In this tutorial, we have learned to work with PDF files in Python. We have seen practical code examples for important operations on PDF files, such as extracting images from PDFs, encrypting PDFs, and rotating PDF pages. You may also like to see our step-by-step guide on working with JSON files in Python.
