42.Working with Excel & PDFs (openpyxl, PyMuPDF)
Working with Excel and PDF files is a common requirement in data analysis, reporting, and automation tasks. Python provides powerful libraries such as `openpyxl` for Excel and `PyMuPDF` (also known as `fitz`) for PDF manipulation. This lesson explores how to read, write, and extract data from these file formats using Python.
Working with Excel using openpyxl
`openpyxl` is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files. It allows you to create new Excel files, read existing ones, and modify their contents.
Installing openpyxl:
To install openpyxl, use the following command:
- pip install openpyxl
Reading an Excel file:
Here’s how to read data from an Excel file using openpyxl:
from openpyxl import load_workbook
wb = load_workbook(‘example.xlsx’)
sheet = wb.active
for row in sheet.iter_rows(values_only=True):
print(row)
Writing to an Excel file:
You can create and write to an Excel file as follows:
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws[‘A1’] = ‘Hello’
ws[‘B1’] = ‘World’
wb.save(‘new_file.xlsx’)
Use Cases:
– Automating report generation
– Reading data from spreadsheets for analysis
– Updating Excel files with new data
Best Practices:
– Always close workbooks after use
– Validate data before writing to Excel
– Use styles and formatting for better readability
Common Pitfalls:
– Trying to open unsupported Excel formats (e.g., .xls)
– Not saving the workbook after making changes
– Accessing cells incorrectly (e.g., using wrong indices)
Working with PDFs using PyMuPDF
– Trying to open unsupported Excel formats (e.g., .xls)
– Not saving the workbook after making changes
– Accessing cells incorrectly (e.g., using wrong indices)
Installing PyMuPDF:
To install PyMuPDF, use the following command:
- pip install pymupdf
Reading and extracting text from a PDF:
Here’s how to extract text from a PDF file:
import fitz # PyMuPDF
doc = fitz.open(‘sample.pdf’)
for page in doc:
text = page.get_text()
print(text)
Extracting images from a PDF:
You can extract images from PDF pages as follows:
for page_index in range(len(doc)):
page = doc[page_index]
images = page.get_images(full=True)
for img_index, img in enumerate(images):
xref = img[0]
base_image = doc.extract_image(xref)
image_bytes = base_image[‘image’]
with open(f’image_{page_index}_{img_index}.png’, ‘wb’) as f:
f.write(image_bytes)
Use Cases:
– Extracting data from invoices and reports
– Automating PDF content analysis
– Building document processing pipelines
Best Practices:
– Handle exceptions for corrupted or encrypted PDFs
– Use page indexing carefully (starts from 0)
– Optimize image extraction for large files
Common Pitfalls:
– Not closing the PDF document after processing
– Assuming all PDFs have extractable text
– Ignoring image formats during extraction