How to Extract Text From PDF using Python

SagarDhandare
2 min readJun 18, 2021


In this article, we will learn how to extract text from a given PDF using Python. We will be using the PyPDF2 module for extracting text from PDF files.

Note: The PDF can have multiple pages.

Installing the Module:

To install the PyPDF2 module and its dependencies, use the pip command:

pip install PyPDF2

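To extract text from a PDF, we will use the PdfFileReader class, which reads a PDF from an open file object and gives access to its pages. Note: PdfFileReader and the method names used below belong to PyPDF2 releases before 3.0; newer releases renamed the class to PdfReader.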

# Importing Library:

from PyPDF2 import PdfFileReader

# Opening the PDF file in Read Binary Mode:

file = open(r"Narendra Modi Speech.pdf" , "rb")

# Initializing the Reader Object:

reader = PdfFileReader(file)

# For Printing the Number of Pages present in our PDF File:

print("Number of Pages : ", reader.getNumPages())

# Printing the First Page of the File (page indexes start at 0):

pageObj = reader.getPage(0)

print(pageObj.extractText())

Prime Minister Narendra Modi Speech
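Because getPage() counts pages from 0, the first page is index 0 and the last one is getNumPages() - 1. As a minimal sketch, a small helper can translate everyday page numbers (1, 2, 3, ...) into those indexes and reject values that are out of range. The name page_text is hypothetical and not part of PyPDF2; it only assumes a reader object with the getNumPages() and getPage() methods used in this article.

```python
def page_text(reader, page_number):
    """Return the text of a page given its 1-based page number.

    `page_text` is a hypothetical helper, not a PyPDF2 function. `reader`
    is assumed to be a PdfFileReader-style object as used in this article.
    """
    total = reader.getNumPages()
    if not 1 <= page_number <= total:
        raise ValueError(f"page_number must be between 1 and {total}")
    # getPage() is zero-based, so page 1 lives at index 0.
    return reader.getPage(page_number - 1).extractText()
```

With the reader created above, page_text(reader, 1) would return the text of the first page.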

# For Printing All the Pages one by one, We have to Write the Code in a For Loop:

for i in range(0, reader.getNumPages()):

    print("Page Number : ", i+1) # Make sure to follow indentation

    pageObj = reader.getPage(i)

    print(pageObj.extractText())
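PyPDF2 3.0 and later no longer accept the PdfFileReader, getNumPages(), getPage() and extractText() names; the equivalents are PdfReader, reader.pages and page.extract_text(). As a sketch under that assumption, the same loop could be written as a generator. The name iter_page_text is hypothetical, and reader is assumed to be a PdfReader instance.

```python
def iter_page_text(reader):
    """Yield (page_number, text) pairs, numbering pages from 1.

    Hypothetical helper; assumes `reader` exposes a `pages` sequence whose
    items have an `extract_text()` method, as PdfReader does.
    """
    for page_number, page in enumerate(reader.pages, start=1):
        yield page_number, page.extract_text()


# Usage (assumes PyPDF2 >= 3.0 is installed and `file` is the open PDF):
# from PyPDF2 import PdfReader
# for number, text in iter_page_text(PdfReader(file)):
#     print("Page Number : ", number)
#     print(text)
```

Using enumerate with start=1 removes the i+1 arithmetic, so the printed page number and the page being read cannot drift apart.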

After reading the file, make sure to close it.

# Closing the PDF file:

file.close()
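Calling file.close() by hand is easy to forget, and it is skipped if an exception is raised earlier in the script. One common alternative is a with block, which closes the file automatically. The sketch below keeps the PdfFileReader names used in this article, so it assumes a PyPDF2 release that still provides them. The function name extract_all_text and its reader_cls parameter are hypothetical, added only so the function can be tried with a stand-in reader class.

```python
def extract_all_text(path, reader_cls=None):
    """Return a list holding the extracted text of every page in the PDF.

    Hypothetical helper. `reader_cls` defaults to PyPDF2's PdfFileReader
    (the class used in this article) and exists only for substitution.
    """
    if reader_cls is None:
        # Imported lazily so PyPDF2 is only needed for the default case.
        from PyPDF2 import PdfFileReader
        reader_cls = PdfFileReader

    # The file is closed when the block exits, even if extraction fails.
    with open(path, "rb") as pdf_file:
        reader = reader_cls(pdf_file)
        return [
            reader.getPage(i).extractText()
            for i in range(reader.getNumPages())
        ]
```

The text is extracted inside the with block because the reader still needs the open file while it reads each page. A call such as extract_all_text("Narendra Modi Speech.pdf") would then return one string per page.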

— — — — — — — — — — — — — — — — — — — — — — — — — —

Reference:

PDF File Used: Prime Minister Narendra Modi Speech

Official Documentation: PyPDF2

Originally published at https://inblog.in.
