Use PyPDF2 - open PDF file or encrypted PDF file

Motivation

Since I want to work with PDF files in Python at my job, I looked into which library can do that and how to use it.

Preparation

The runtime and module versions are as follows.

  • Python 3.6
  • PyPDF2 1.26.0

Install PyPDF2

PyPDF2 is a commonly used library for working with PDF files in Python.

PyPDF2 can

  • Extract text from a PDF file
  • Edit an existing PDF file and create a new one

Let’s install it with the pip command.

pip install PyPDF2
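
Of the two capabilities listed above, text extraction can be sketched as follows. This is a minimal sketch, not an official recipe: the helper name extract_all_text and the sample path are hypothetical, and it assumes the reader interface of PyPDF2 1.26.0 (getNumPages, getPage, extractText). Later PyPDF2 releases renamed these methods.

```python
def extract_all_text(reader):
    """Return the text of every page, in page order, as a list of strings.

    `reader` is assumed to offer the PyPDF2 1.26.0 reader interface:
    getNumPages(), and getPage(i) returning an object with extractText().
    """
    return [reader.getPage(i).extractText()
            for i in range(reader.getNumPages())]


# Typical use (hypothetical path; requires PyPDF2 1.26.0):
#
#     import PyPDF2
#     with open('./files/sample.pdf', mode='rb') as f:
#         pages = extract_all_text(PyPDF2.PdfFileReader(f))
#         print(pages[0])
```

The helper only relies on the methods it calls, so it can be tried with any object that offers the same three methods. The quality of the extracted text depends on how the PDF was produced.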

Prepare a PDF file

Prepare a PDF file to work with. This time, I downloaded an Executive Order. It has three pages in total.

(Image: executive_order)

Read a PDF file

In this section, we open and read a normal (unencrypted) PDF file. The following sample code prints the number of pages in the PDF file.

import PyPDF2

FILE_PATH = './files/executive_order.pdf'

with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    print(f"Number of pages: {reader.getNumPages()}")

After importing PyPDF2, open the PDF file in binary read mode, and then create a PdfFileReader object to work with the PDF.

Check the result.

Number of pages: 3
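
Because the file is opened in binary mode, it is easy to check that it really is a PDF before handing it to PdfFileReader. The sketch below assumes the usual case where a PDF file begins with the bytes %PDF-; the helper name looks_like_pdf is hypothetical and is not part of PyPDF2.

```python
PDF_MAGIC = b"%PDF-"  # a PDF file normally starts with this marker


def looks_like_pdf(path):
    """Return True if the file at `path` begins with the PDF header marker."""
    with open(path, mode="rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC
```

Calling looks_like_pdf('./files/executive_order.pdf') before creating the reader gives a clearer failure than a parse error when the path points at the wrong kind of file.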

Read a PDF file with a password (encrypted PDF)

In this section, we open and read an encrypted PDF file that requires a password to open. To create such a file, set a password with the encryption option enabled when saving the PDF.

Failed example

Save a PDF file named executive_order_encrypted.pdf with the password hoge1234. Then run the previous code, which reads the PDF without supplying a password, against this file.

# Failed example
import PyPDF2

FILE_PATH = './files/executive_order_encrypted.pdf'

with open(FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    print(f"Number of pages: {reader.getNumPages()}")

The following error is raised.

PdfReadError: File has not been decrypted

Success example

The decrypt method takes the password string as its argument and decrypts an encrypted PDF file. It is better to check whether the file is encrypted with the isEncrypted property before calling decrypt.

import PyPDF2

ENCRYPTED_FILE_PATH = './files/executive_order_encrypted.pdf'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    # Only encrypted files need to be decrypted before reading.
    if reader.isEncrypted:
        reader.decrypt('hoge1234')
    print(f"Number of pages: {reader.getNumPages()}")

Check the result.

Number of pages: 3
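
The success example does not check whether the password was accepted. In PyPDF2 1.26.0, decrypt returns 0 when the password does not match, so a small wrapper can turn that into an explicit error. This is a minimal sketch: the helper name ensure_decrypted is hypothetical, and it assumes the 1.26.0 reader interface (isEncrypted and decrypt).

```python
def ensure_decrypted(reader, password):
    """Decrypt `reader` if it is encrypted and return it.

    `reader` is assumed to behave like PyPDF2.PdfFileReader (1.26.0):
    an `isEncrypted` attribute and a `decrypt(password)` method that
    returns 0 when the password does not match.
    """
    if not reader.isEncrypted:
        return reader
    if reader.decrypt(password) == 0:
        raise ValueError("wrong password: the PDF could not be decrypted")
    return reader


# Typical use (requires PyPDF2 1.26.0):
#
#     import PyPDF2
#     with open('./files/executive_order_encrypted.pdf', mode='rb') as f:
#         reader = ensure_decrypted(PyPDF2.PdfFileReader(f), 'hoge1234')
#         print(f"Number of pages: {reader.getNumPages()}")
```

With this wrapper, a wrong password fails immediately with a clear message instead of surfacing later as "File has not been decrypted".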

Troubleshooting: NotImplementedError is raised when calling decrypt

The following error may be raised when working with an encrypted PDF file.

NotImplementedError: only algorithm code 1 and 2 are supported

This error means that PyPDF2 does not implement the algorithm that was used to encrypt the PDF file. When this happens, it is difficult to open the PDF file with PyPDF2 alone.

Decrypt with qpdf

Using qpdf is a quick solution. qpdf is a command-line tool for working with PDF files. On Windows, you can download its installer from SourceForge; on macOS, you can install it with the brew install qpdf command.
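
After installing, it can help to confirm from Python that qpdf is reachable on the PATH before relying on it. The sketch below uses only the standard library; the helper name find_qpdf is hypothetical.

```python
import shutil


def find_qpdf():
    """Return the full path of the qpdf executable, or None if it is not on PATH."""
    return shutil.which("qpdf")


if __name__ == "__main__":
    path = find_qpdf()
    if path:
        print(f"qpdf found at {path}")
    else:
        print("qpdf is not installed or not on PATH")
```

If the function returns None, the installation directory probably needs to be added to the PATH environment variable.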

The following sample code decrypts a PDF file with qpdf.

import subprocess

import PyPDF2

ENCRYPTED_FILE_PATH = './files/executive_order_encrypted.pdf'
FILE_OUT_PATH = './files/executive_order_out.pdf'

PASSWORD = 'hoge1234'

with open(ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        try:
            reader.decrypt(PASSWORD)
        except NotImplementedError:
            # PyPDF2 cannot handle this encryption algorithm, so let qpdf
            # write a decrypted copy and read that copy instead.
            # Passing a list avoids shell quoting problems with the password.
            subprocess.run(['qpdf', f'--password={PASSWORD}', '--decrypt',
                            ENCRYPTED_FILE_PATH, FILE_OUT_PATH])
            with open(FILE_OUT_PATH, mode='rb') as fp:
                reader = PyPDF2.PdfFileReader(fp)
                print(f"Number of pages: {reader.getNumPages()}")

The point is that Python runs qpdf as an external command, which saves a decrypted copy of the PDF as a new file without a password. Then we create a PdfFileReader instance to work with that file in PyPDF2.
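
The qpdf step can also be wrapped in a function that checks the exit status, so that a failed decryption is noticed before the output file is opened. This is a sketch under stated assumptions: the names build_qpdf_command and decrypt_with_qpdf are hypothetical, and it relies on qpdf's documented exit codes (0 for success, 3 for success with warnings, 2 for errors).

```python
import subprocess


def build_qpdf_command(src, dst, password):
    """Build the qpdf argument list that writes a decrypted copy of `src` to `dst`."""
    return ["qpdf", f"--password={password}", "--decrypt", src, dst]


def decrypt_with_qpdf(src, dst, password, run=subprocess.run):
    """Run qpdf and return True if it reported success.

    `run` is injectable so the function can be tried without qpdf installed.
    Exit code 0 means success and 3 means success with warnings;
    exit code 2 means an error (for example, a wrong password).
    """
    result = run(build_qpdf_command(src, dst, password))
    return result.returncode in (0, 3)
```

If qpdf is not installed, subprocess.run raises FileNotFoundError, which makes the missing tool easy to spot.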

Conclusion

This post covered how to

  • Open a PDF file with PdfFileReader in PyPDF2
  • Decrypt an encrypted PDF file with the decrypt method
  • Decrypt an encrypted PDF file with qpdf when NotImplementedError is raised