How to Extract Text From a PDF Using Python
In this article, we will learn how to extract text from a given PDF using Python. We will be using the PyPDF2 module for extracting text from PDF files.
Note: The PDF can be a multipage PDF too…
Installing the Module:
To install the PyPDF2 module, we can use the pip command. The PdfFileReader interface used in this article was removed in PyPDF2 3.0, so we install an earlier release:
pip install "PyPDF2<3.0"
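Which class names are available depends on the PyPDF2 release that pip installed, so it can help to check the version before running the rest of the code. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # pip has not installed this package in the current environment
        return None


# Prints the version string, or None when PyPDF2 is not installed
print("PyPDF2 version : ", installed_version("PyPDF2"))
```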
To extract text from a PDF, we will use the PdfFileReader class, which creates a reader object from a PDF file opened in binary mode.
# Importing Library:
from PyPDF2 import PdfFileReader
# Opening the PDF file in Read Binary Mode:
file = open(r"Narendra Modi Speech.pdf", "rb")
# Initializing Object:
reader = PdfFileReader(file)
# For Printing the Number of Pages present in our PDF File:
print("Number of Pages : ", reader.getNumPages())
# Printing the First Page of the File (page numbering starts at 0):
pageObj = reader.getPage(0)
print(pageObj.extractText())
# For Printing All the Pages One by One, We Have to Write the Code in a For Loop:
for i in range(reader.getNumPages()):
    # Make sure to follow indentation
    print("Page Number : ", i+1)
    pageObj = reader.getPage(i)
    print(pageObj.extractText())
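The camelCase names used in this article (PdfFileReader, getPage, extractText) belong to the older PyPDF2 interface. Newer releases expose PdfReader, a reader.pages sequence, and extract_text() instead. Below is a rough sketch of the same loop written against those newer names; the helper names page_texts and print_all_pages are hypothetical, and the reader argument is assumed to be any object that offers .pages and extract_text().

```python
def page_texts(reader):
    """Return the text of every page, in order, as a list of strings.

    `reader` is assumed to expose `.pages`, where each page has an
    `extract_text()` method (as PdfReader does in newer PyPDF2 releases).
    """
    # Fall back to an empty string when a page yields no text
    return [page.extract_text() or "" for page in reader.pages]


def print_all_pages(reader):
    """Print each page's text under a 1-based page heading."""
    for number, text in enumerate(page_texts(reader), start=1):
        print("Page Number : ", number)
        print(text)


# Usage with a newer PyPDF2 release (assumed to be installed):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("Narendra Modi Speech.pdf")
#   print_all_pages(reader)
```

Iterating over reader.pages removes the need to track the page count and the index by hand.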
After reading the file, make sure to close it.
# Closing the PDF file:
file.close()
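An alternative to calling file.close() by hand is a with block, which closes the file automatically even if text extraction raises an error. The sketch below uses the same PdfFileReader calls as this article; the function name read_pages and its reader_factory parameter are assumptions added for illustration, so that another reader class can be swapped in.

```python
def read_pages(path, reader_factory=None):
    """Open `path`, return the text of each page as a list, and close the file."""
    if reader_factory is None:
        # Default to the class used in this article (needs PyPDF2 older than 3.0)
        from PyPDF2 import PdfFileReader
        reader_factory = PdfFileReader

    # The with block closes the file on exit, even if extraction fails
    with open(path, "rb") as file:
        reader = reader_factory(file)
        return [reader.getPage(i).extractText()
                for i in range(reader.getNumPages())]


# Usage (file name taken from this article):
#   for number, text in enumerate(read_pages("Narendra Modi Speech.pdf"), start=1):
#       print("Page Number : ", number)
#       print(text)
```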
— — — — — — — — — — — — — — — — — — — — — — — — — —
Reference:
PDF File Used: Prime Minister Narendra Modi Speech
Official Documentation: PyPDF2
Originally published at https://inblog.in.