42.Working with Excel & PDFs (openpyxl, PyMuPDF)

Working with Excel and PDF files is a common requirement in data analysis, reporting, and automation tasks. Python provides powerful libraries such as `openpyxl` for Excel and `PyMuPDF` (also known as `fitz`) for PDF manipulation. This lesson explores how to read, write, and extract data from these file formats using Python.

Working with Excel using openpyxl

`openpyxl` is a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files. It allows you to create new Excel files, read existing ones, and modify their contents.

Installing openpyxl:

To install openpyxl, use the following command:

  • pip install openpyxl

Reading an Excel file:

Here’s how to read data from an Excel file using openpyxl:

from openpyxl import load_workbook
wb = load_workbook(‘example.xlsx’)
sheet = wb.active
for row in sheet.iter_rows(values_only=True):
    print(row)

Writing to an Excel file:

You can create and write to an Excel file as follows:

from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws[‘A1’] = ‘Hello’
ws[‘B1’] = ‘World’
wb.save(‘new_file.xlsx’)

Use Cases:

– Automating report generation
– Reading data from spreadsheets for analysis
– Updating Excel files with new data

Best Practices:

– Always close workbooks after use
– Validate data before writing to Excel
– Use styles and formatting for better readability

Common Pitfalls:

– Trying to open unsupported Excel formats (e.g., .xls)
– Not saving the workbook after making changes
– Accessing cells incorrectly (e.g., using wrong indices)

Working with PDFs using PyMuPDF

– Trying to open unsupported Excel formats (e.g., .xls)
– Not saving the workbook after making changes
– Accessing cells incorrectly (e.g., using wrong indices)

Installing PyMuPDF:

To install PyMuPDF, use the following command:

  • pip install pymupdf

Reading and extracting text from a PDF:

Here’s how to extract text from a PDF file:

import fitz  # PyMuPDF
doc = fitz.open(‘sample.pdf’)
for page in doc:
    text = page.get_text()
    print(text)

Extracting images from a PDF:

You can extract images from PDF pages as follows:

for page_index in range(len(doc)):
    page = doc[page_index]
    images = page.get_images(full=True)
    for img_index, img in enumerate(images):
        xref = img[0]
        base_image = doc.extract_image(xref)
        image_bytes = base_image[‘image’]
        with open(f’image_{page_index}_{img_index}.png’, ‘wb’) as f:
            f.write(image_bytes)

Use Cases:

– Extracting data from invoices and reports
– Automating PDF content analysis
– Building document processing pipelines

Best Practices:

– Handle exceptions for corrupted or encrypted PDFs
– Use page indexing carefully (starts from 0)
– Optimize image extraction for large files

Common Pitfalls:

– Not closing the PDF document after processing
– Assuming all PDFs have extractable text
– Ignoring image formats during extraction

Scroll to Top
Tutorialsjet.com