Use Python to Fill PDF Files!

PDFs are hard to work with. Over the years I've tried several approaches to filling them out in an automated way. It's amazing my job has so many manual tasks that require filling out PDFs. It's fairly routine for me to be manually filling out PDF files to process transactions. Needless to say I've either created or borrowed several solutions. First let me say I'm no VBA expert but I have experimented with solutions here as well.

I've wrote a VBA script to fill out a PDF using "send keys". Think "how could I do everything by use of just the keyboard shortcuts?" Once you know how to open a PDF with shortcuts, tab through the form fields and use shortcuts to save, you can automate this in VBA. The problem here is that it's painstaking to set up. Plus if the form changes or you want to add a new form it's basically like starting from scratch.

Next I used VBA and the Acrobat reference to access and manipulate PDFs. This works much better because you can access the PDF form fields using VBA and Java Script. I would highly recommend this route if you're going to use VBA. I still felt as if every PDF template had to be setup completely separate. Some of this is likely due to my experience level with VBA. Either way there was a lot of copying a pasting code.

Then came my experience with Python, PyPDF2 and reportlab. I won't go into too much detail about exactly how I did this. In short you create your PDF template, create blank PDF with just your data fields, and paste the new PDF as a watermark on top of your PDF template. Again, this is painstaking because you're using grid coordinates to position where text should be placed on the page. This worked, it was fast, but it wasn't great if the PDF template changed or if you wanted to manipulate the PDF file afterward.

It was great when I found you could fill PDF form fields with python using PyPDF2 and pdfrw. Both of these libraries look to be able to do similar tasks but I chose pdfrw because it appears to be maintained better. PyPDF2 actually is no longer maintained. There is a PyPDF3 and PyPDF4; however, I already settled on pdfrw. The only issue I ran into is that you could fill in the fields but those values wouldn't show until you refreshed the field in Acrobat. I found two ways around this; one was to click into every field and hit Enter. This option isn't doable if you have several PDFs. The next was to open the PDFs in a web browser which causes a refresh of the fields.

Because of these challenges I gave up for a while... However, while digging into Python and PDFs again I found the solution that refreshes the fields!

So now I have a working solution I can pass around the office easily. A basic macro reference a Python exe file located on a shared network drive. Meaning there is no python install! And we can populate PDF forms with a simple excel macro while still getting all the flexibility and functionality of Python. The rest of this post will be going through an example of how to fill out a PDF using python.

PDF Setup

I'm using Adobe Acrobat DC. I'm going to create a sample PDF file for this example. If you have an existing PDF you want to use just open, click on Tools > Prepare Form. This action will create a fillable PDF form.

Now let's create a simple PDF for this example. We have the following fields.

  • name
  • phone
  • date
  • account_number
  • cb_1 (check box "Yes")
  • cb_2 (check box "No")

Now that we have a sample PDF we will get started with a little Python.

image003.png

pdfrw Setup

First thing to do is install pdfrw using !pip install pdfrw

!pip3 install pdfrw
import pdfrw
pdfrw.__version__
'0.4'

Accessing our PDF

pdf_template = "template.pdf"
pdf_output = "output.pdf"
template_pdf = pdfrw.PdfReader(pdf_template)  # create a pdfrw object from our template.pdf
# template_pdf  # uncomment to see all the data captured from this PDF.

You should print out template_pdf to see everything availabe in the PDF. There is a lot so for ease of reading I'll comment out.

For now let's just try to get the form fields of the PDF we created. To do this we will set some of the variable we find important. I grabbed this code from a random snippet online but you can find several similar setups on stack overflow.

ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'

Next, we can loop through the page(s). Here we only have one but you it's a good idea to prepare for future functionality. We grab all the annotations to grab just the form field keys.

for page in template_pdf.pages:
    annotations = page[ANNOT_KEY]
    for annotation in annotations:
        if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
            if annotation[ANNOT_FIELD_KEY]:
                key = annotation[ANNOT_FIELD_KEY][1:-1]
                print(key)
name
phone
date
account_number
cb_1
cb_2

There you can see we were able to grab our form field names!

Filling a PDF

To fill a PDF we can create a dictionary of what we want to populate the PDF. The dictionary keys will be the form field names and the values will be what we want to fill into the PDF.

from datetime import date

data_dict = {
    'name': 'Andrew Krcatovich',
    'phone': '(123) 123-1234',
    'date': date.today(),
    'account_number': '123123123',
    'cb_1': True,
    'cb_2': False,
}

Let's setup a function to handle grabbing the keys, populating the values, and saving out the output.pdf file

def fill_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    for page in template_pdf.pages:
        annotations = page[ANNOT_KEY]
        for annotation in annotations:
            if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                if annotation[ANNOT_FIELD_KEY]:
                    key = annotation[ANNOT_FIELD_KEY][1:-1]
                    if key in data_dict.keys():
                        if type(data_dict[key]) == bool:
                            if data_dict[key] == True:
                                annotation.update(pdfrw.PdfDict(
                                    AS=pdfrw.PdfName('Yes')))
                        else:
                            annotation.update(
                                pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                            )
                            annotation.update(pdfrw.PdfDict(AP=''))
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)
fill_pdf(pdf_template, pdf_output, data_dict)

Okay! That just filled out a PDF. Opening in preview on my Mac shows.

image004.png

However, opening the very same PDF in Acrobat doesn't show the values of the form fields. If you click into the field you can see it did fill but for some reason the field isn't refreshed to show the value. Printing the PDF here won't help either as it will print blank. After a long while searching for an answer I found the following solution. Worked like a charm and the form fields are now showing in Acrobat as well.

Tip: add Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject("true")))

Honestly, I don't know why this isn't the default setting. It seems like everyone online runs into the same issue and this solution seems hidden away to where there are several hard work-arounds that are being used. Either way just add the above reference line to the fill_pdf function like so.

def fill_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    for page in template_pdf.pages:
        annotations = page[ANNOT_KEY]
        for annotation in annotations:
            if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                if annotation[ANNOT_FIELD_KEY]:
                    key = annotation[ANNOT_FIELD_KEY][1:-1]
                    if key in data_dict.keys():
                        if type(data_dict[key]) == bool:
                            if data_dict[key] == True:
                                annotation.update(pdfrw.PdfDict(
                                    AS=pdfrw.PdfName('Yes')))
                        else:
                            annotation.update(
                                pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                            )
                            annotation.update(pdfrw.PdfDict(AP=''))
    template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))  # NEW
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)

I added one additional function fill_simple_pdf_file as I found it very useful to manipulate a data dictionary, especially if it came from an excel file, first before populating the data. This way you can create many fillable forms from the same data source, do formating on the fields and set default values if nothing was supplied.

Bringing it all together

import pdfrw
from datetime import date


ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'


def fill_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    for page in template_pdf.pages:
        annotations = page[ANNOT_KEY]
        for annotation in annotations:
            if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                if annotation[ANNOT_FIELD_KEY]:
                    key = annotation[ANNOT_FIELD_KEY][1:-1]
                    if key in data_dict.keys():
                        if type(data_dict[key]) == bool:
                            if data_dict[key] == True:
                                annotation.update(pdfrw.PdfDict(
                                    AS=pdfrw.PdfName('Yes')))
                        else:
                            annotation.update(
                                pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                            )
                            annotation.update(pdfrw.PdfDict(AP=''))
    template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)
    

# NEW
def fill_simple_pdf_file(data, template_input, template_output):
    some_date = date.today()
    data_dict = {
        'name': data.get('name', ''),
        'phone': data.get('phone', ''),
        'date': some_date,
        'account_number': data.get('account_number', ''),
        'cb_1': data.get('cb_1', False),
        'cb_2': data.get('cb_2', False),
    }
    return fill_pdf(template_input, template_output, data_dict)


if __name__ == '__main__':
    pdf_template = "template.pdf"
    pdf_output = "output.pdf"
    
    sample_data_dict = {
        'name': 'Andrew Krcatovich',
        'phone': '(123) 123-1234',
#         'date': date.today(),  # Removed date so we can dynamically set in python.
        'account_number': '123123123',
        'cb_1': True,
        'cb_2': False,
    }
    fill_simple_pdf_file(sample_data_dict, pdf_template, pdf_output)

Thanks for reading! Hope this can help someone else!