PDF stands for Portable Document Format, one of the most widely used formats for sharing files. Its popularity comes from several advantages: graphical integrity, convenience, security, and compactness. Because PDFs are so common, a programmer should know how to handle them in code. In this article, we will look at the tools available for handling PDF files in the Python programming language, in other words, Python PDF parser tools, with a quick overview of the libraries involved. So, let’s start.
Libraries for Parsing PDF Files
Python has many third-party libraries that help us handle PDF files through a Python API. With them, we can read a file, extract the content we want, or make changes to a PDF file. Some of these libraries are:
- PDFMiner
- PyPDF2
- pdfrw
- slate
PDFMiner Module
PDFMiner is a text extraction module for PDF files in Python. It is written entirely in Python and can obtain the exact location of text along with other layout information (fonts, etc.) from a PDF file. It can also convert a PDF into other formats such as HTML and plain text. Let’s see how to install it and an example of its use.
Installation
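To install the module, we will use the following command. Note that the actively maintained fork is published on PyPI as pdfminer.six and exposes the same pdfminer import paths used in the example below.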
pip install pdfminer
Example 1: Extracting Text from a PDF file and Converting into Text File
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
import io

def pdf_to_text(input_file, output):
    # The resource manager caches shared resources such as fonts
    resMgr = PDFResourceManager()
    # In-memory buffer that collects the extracted text
    retData = io.StringIO()
    TxtConverter = TextConverter(resMgr, retData, laparams=LAParams())
    interpreter = PDFPageInterpreter(resMgr, TxtConverter)
    # Open the PDF in binary mode and process it page by page
    with open(input_file, 'rb') as i_f:
        for page in PDFPage.get_pages(i_f):
            interpreter.process_page(page)
    txt = retData.getvalue()
    TxtConverter.close()
    print(txt)
    # Write the extracted text to the output file
    with open(output, 'w') as of:
        of.write(txt)

input_pdf = 'sample.pdf'
output_txt = 'sample.txt'
pdf_to_text(input_pdf, output_txt)
Output:
This is a simple pdf file.
Continued to next page...
Page2 started....
This is second page of the pdf.
In the above example, we created a function that reads a PDF file and converts it into a text file. Inside the function, we open the input file and initialize a PDFResourceManager object, which manages the shared resources needed during conversion. We also initialize a TextConverter object, and then a PDFPageInterpreter object that receives the resource manager and the text converter as arguments. The interpreter processes each page in turn. Once that is done, we read the extracted text from the buffer using the getvalue() function and write it to the output file.
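The text returned by the function above contains every page in one string. The sketch below splits it back into pages. It assumes that TextConverter writes a form feed character ("\f") after each page, which is how pdfminer.six behaves; the helper name split_pages and the sample string are made up for illustration and are not part of PDFMiner.

```python
def split_pages(text):
    """Split pdfminer-style text output into a list of per-page strings.

    Assumption: each page is terminated by a form feed character.
    """
    pages = text.split("\f")
    # The output ends with a form feed, so drop the empty trailing piece
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


# Hypothetical sample mimicking the two-page output shown above
sample = (
    "This is a simple pdf file.\nContinued to next page...\n\f"
    "Page2 started....\nThis is second page of the pdf.\n\f"
)

for number, page in enumerate(split_pages(sample), start=1):
    print(f"--- page {number} ---")
    print(page.strip())
```

If the converter in use does not emit form feeds, the whole text simply comes back as a single page.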
PyPDF2 Module
Although PDFMiner is considered one of the best ways to handle PDF files in Python, PyPDF2 is considered one of the easiest interfaces for the same job. It is also a third-party module with a lot of functionality, so we need to install it explicitly before using it. To do that, we will use the following command.
pip install PyPDF2
With this module, we can perform several operations, such as extracting content from a PDF document, splitting and merging documents, cropping pages, adding watermarks, and more. It can also work entirely on in-memory file-like objects (for example, io.BytesIO) rather than file streams, which allows documents to be manipulated in memory. Let’s see an example of it.
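import PyPDF2

# Note: PdfFileReader, numPages, getPage() and extractText() are the older
# PyPDF2 names. PyPDF2 3.x replaces them with PdfReader, len(reader.pages),
# reader.pages[0] and extract_text().
file = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(file)
# printing number of pages in pdf file
print("Total number of pages in sample.pdf", pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
file.close()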
Output:
This is a simple pdf file.
Continued to next page...
In the above example, we first opened the file and then read it using the PdfFileReader class. After that, we fetched the first page and extracted its text, which we can print on the console or write to a file in another format.
pdfrw Module
pdfrw is another module that offers the functionality mentioned above, such as reading PDF documents, splitting and merging documents, cropping pages, and adding watermarks. In addition, pdfrw can be used together with ReportLab. It is also a third-party library and requires a separate installation.
pip install pdfrw
Let’s see an example.
from pdfrw import PdfReader

def get_pdf_info(path):
    pdf = PdfReader(path)
    print(pdf.keys())
    print(pdf.Info)
    print(pdf.Root.keys())
    print('PDF has {} pages'.format(len(pdf.pages)))

if __name__ == '__main__':
    get_pdf_info('sample.pdf')
Output:
['/Size', '/Root', '/Info']
{'/Creator': '(Rave \\(http://www.pythonpool.com/))', '/Producer': '(Python Pool)', '/CreationDate': '(D:20060301072826)'}
['/Type', '/Outlines', '/Pages']
PDF has 2 pages
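The /CreationDate value in the output above is a raw PDF date string. As a rough illustration, the following sketch turns such a value into a Python datetime using only the standard library. The helper name parse_pdf_date is hypothetical (it is not part of pdfrw), and the sketch only handles the full 14-digit form shown above, ignoring any time zone suffix.

```python
import re
from datetime import datetime


def parse_pdf_date(raw):
    """Turn a value such as '(D:20060301072826)' into a datetime, or None.

    Assumption: the value carries all 14 digits (YYYYMMDDHHMMSS).
    Shorter dates and time zone offsets are not handled here.
    """
    match = re.search(r"D:(\d{14})", raw)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


print(parse_pdf_date("(D:20060301072826)"))  # prints 2006-03-01 07:28:26
```

Returning None for unrecognized values keeps the helper safe to call on documents whose Info dictionary has no usable date.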
Slate
Slate is a third-party Python library used to extract text from PDF files. It depends on the pdfminer library to read PDF files and extract their contents. Slate provides one class, PDF, which takes a file-like object and extracts all text from the document, presenting each page as a string of text. We will not cover this library further, as it appears to be unmaintained and has not been updated for four years.
PDF to CSV Parser Python
We will use a third-party module, pdftables_api, to convert a PDF file into a CSV file. Let’s see an example of it.
# Import Module
import pdftables_api
# Create a client with your API key
conversion = pdftables_api.Client('API KEY')
# Converting pdf to CSV file
conversion.csv('src.pdf', 'output.csv')
To convert a file from PDF to CSV, we first import pdftables_api. Then we create a client by passing our API key to the Client() class. After that, we call the csv() method to convert the file into a CSV file.
PDF to XML / HTML / XLSX Parser Python
As described above, we can also convert a PDF file into an XML, HTML, or Excel file using the pdftables_api module. We just need to replace the csv() method with the xml(), html(), or xlsx() method according to our preference. Let’s see an example.
# Import Module
import pdftables_api
# Create a client with your API key
c = pdftables_api.Client('API KEY')
# Converting pdf to xml file
c.xml('src.pdf', 'output.xml')
# Converting pdf to html file
c.html('src.pdf', 'output.html')
# Converting pdf to xlsx file
c.xlsx('src.pdf', 'output.xlsx')
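Since the four conversion methods share the same call shape, one possible convenience is to pick the method from the output file's extension. The sketch below is only an illustration of that idea: convert_by_extension and FakeClient are hypothetical names, and FakeClient merely stands in for pdftables_api.Client so that the example runs without an API key or network access.

```python
import os

# Formats taken from the examples above: csv, xml, html, xlsx
SUPPORTED = {"csv", "xml", "html", "xlsx"}


def convert_by_extension(client, src, dst):
    """Call client.csv / client.xml / ... based on the extension of dst."""
    ext = os.path.splitext(dst)[1].lstrip(".").lower()
    if ext not in SUPPORTED:
        raise ValueError(f"unsupported output format: {ext!r}")
    # Look up the method with the same name as the extension and call it
    getattr(client, ext)(src, dst)
    return ext


class FakeClient:
    """Stand-in for pdftables_api.Client that only records calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute becomes a method that records its arguments
        return lambda src, dst: self.calls.append((name, src, dst))


fake = FakeClient()
convert_by_extension(fake, "src.pdf", "output.xlsx")
print(fake.calls)  # prints [('xlsx', 'src.pdf', 'output.xlsx')]
```

With a real client, the call would presumably look like convert_by_extension(pdftables_api.Client('API KEY'), 'src.pdf', 'output.csv').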
Parse PDF to JSON using Python
In the above section, we saw how to convert a PDF file to XML and HTML files. Converting a PDF file into a JSON file cannot be done as directly. It is a two-step process, though not a difficult one for someone with some development experience: first we convert the PDF file into a text file, and then we convert that text file into a JSON file. We have already seen how to convert a PDF file to a text file, so this section shows how to convert a text file into a JSON file. The code below assumes that each line of the text file starts with a key, followed by whitespace and a value.
import json

filename = 'output.txt'
dict1 = {}
with open(filename) as fh:
    for line in fh:
        # split each line into a key and the remaining text
        parts = line.strip().split(None, 1)
        # skip blank lines and lines without a second field
        if len(parts) < 2:
            continue
        command, description = parts
        dict1[command] = description.strip()

# creating json file
# the JSON file is named output.json
with open("output.json", "w") as out_file:
    json.dump(dict1, out_file, indent=4, sort_keys=False)
So, we first read the text file and converted its contents into a dictionary object. Once that is done, we use the dump() method to serialize the Python dictionary and write it to the JSON file.
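Text extracted from a PDF is often free-form rather than a list of key and value pairs. For that case, one alternative is to store the lines themselves as a JSON array. The following is a minimal sketch of that idea; the function name text_to_json, the output layout, and the sample string are assumptions chosen for illustration.

```python
import json


def text_to_json(text):
    """Return a JSON string holding the non-empty lines of text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return json.dumps({"line_count": len(lines), "lines": lines}, indent=4)


# Hypothetical sample resembling text extracted from a PDF
sample = "This is a simple pdf file.\nContinued to next page...\n\nPage2 started....\n"
print(text_to_json(sample))
```

Keeping the lines in a list preserves their order and avoids losing lines that happen to start with the same word, which a dictionary keyed on the first word would overwrite.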
FAQs on Python PDF Parser
We can’t read a PDF file line by line directly, because these modules read a whole page at once. However, we can split the extracted page text into lines using the split() method. Use the following code after reading a page of the PDF file.
# Split the page text on newlines (adjust the separator if the
# extracted text delimits lines differently)
text = pageObj.extractText().split("\n")
# Finally the lines are stored into list
# For iterating over list a loop is used
for line in text:
    print(line, end="\n\n")
To parse images present in a PDF file, we can use the PyMuPDF and Pillow libraries.
The following code reads images from the PDF file.
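import fitz  # PyMuPDF
import io
from PIL import Image

file = "sample.pdf"
# open the file
pdf_file = fitz.open(file)
for page_index in range(len(pdf_file)):
    page = pdf_file[page_index]
    # list the images on the page (older PyMuPDF releases name this getImageList())
    image_list = page.get_images()
    for img in image_list:
        # the first item of each entry is the image's xref number
        xref = img[0]
        # extract the raw image bytes and load them with Pillow
        base_image = pdf_file.extract_image(xref)
        image = Image.open(io.BytesIO(base_image["image"]))
        print(page_index, xref, image.size)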
Conclusion
In this article, we had a quick introduction to different libraries that help us read and manipulate PDF files. We saw demonstrations of how to read files, change file formats, and extract content from PDF files using these libraries. I hope this article has helped you. Thank you.