40. Data Analysis with pandas

Pandas is a powerful and widely-used open-source data analysis and manipulation library for Python. It provides data structures and functions needed to efficiently manipulate structured data, especially tabular data. Pandas is built on top of NumPy and integrates well with other data science libraries such as Matplotlib and Scikit-learn.

How pandas Works

Pandas introduces two primary data structures: Series and DataFrame. These structures are designed to handle and analyze data efficiently. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Series and DataFrames

A Series is similar to a column in a spreadsheet or a database table. It has an index and a single data column. A DataFrame is akin to a table with rows and columns, where each column can be a different data type.

Example of Series:

import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)

Example of DataFrame:

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
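
To make the "labeled" part of these definitions concrete, the following minimal sketch looks up a Series value by label and by position, then pulls one column out of a DataFrame. The index labels "a" to "d" are made up for this illustration and are not part of the examples above.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0..n-1 index.
# The labels "a".."d" are hypothetical, chosen only for illustration.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
value_b = s["b"]          # label-based lookup
first_value = s.iloc[0]   # position-based lookup

# A DataFrame whose columns hold different types (text and integers).
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
ages = df["Age"]          # selecting a single column yields a Series

print(value_b, first_value)
print(df.dtypes)
print(type(ages).__name__)
```

Selecting a column returns a Series that shares the DataFrame's row index, which is why the two structures are usually described together.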

Data Loading

Pandas can load data from various file formats including CSV, Excel, JSON, and SQL databases.

Example:

df = pd.read_csv('data.csv')
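
The sketch below shows the same call without needing a file on disk: read_csv also accepts a file-like object, so an in-memory string stands in for data.csv. The CSV content is hypothetical. The other formats mentioned above have matching readers (pd.read_excel, pd.read_json, pd.read_sql), which are not exercised here.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file such as data.csv.
csv_text = "name,age\nAlice,25\nBob,30\n"

# read_csv accepts a path or any file-like object; the first row
# is used as the column names by default.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.shape)
```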

Data Cleaning

Data cleaning involves handling missing values, correcting data types, and removing duplicates.

Examples:

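# Each method returns a new DataFrame and leaves df unchanged, so keep the result.
no_missing = df.dropna()             # drop rows that contain missing values
filled = df.fillna(0)                # or replace missing values with 0
deduplicated = df.drop_duplicates()  # drop duplicate rows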
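
A small worked example may help show what each of these calls does to the row count. The data is invented for illustration. Note that all three methods return a new DataFrame by default, so the original is left as it was.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing score and one exact duplicate row.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Cara"],
    "score": [90.0, 85.0, 85.0, np.nan],
})

dropped = df.dropna()            # rows with any missing value removed
filled = df.fillna(0)            # missing values replaced with 0
deduped = df.drop_duplicates()   # exact duplicate rows removed

# df itself still has all four rows and its one missing value.
print(len(df), len(dropped), len(filled), len(deduped))
print(int(df["score"].isna().sum()))
```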

Data Transformation

Transforming data includes operations like renaming columns, changing data types, and applying functions to columns.

Examples:

df = df.rename(columns={'old_name': 'new_name'})  # rename returns a new DataFrame
df['column'] = df['column'].astype(float)
df['new_col'] = df['col'].apply(lambda x: x * 2)
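
The three operations can be combined on one small DataFrame, as in the sketch below. The sample values are assumptions made for illustration. It also shows that a simple arithmetic transformation like doubling can be written directly on the column, which gives the same result as apply with a lambda.

```python
import pandas as pd

# Hypothetical data: numbers stored as text in one column, integers in another.
df = pd.DataFrame({"old_name": ["1.5", "2.0", "3.25"], "col": [1, 2, 3]})

# Rename a column; the result must be assigned to take effect.
df = df.rename(columns={"old_name": "new_name"})

# Convert the text column to floating-point numbers.
df["new_name"] = df["new_name"].astype(float)

# Apply a function element by element...
df["doubled_apply"] = df["col"].apply(lambda x: x * 2)
# ...or express the same thing as a vectorized operation on the column.
df["doubled_vectorized"] = df["col"] * 2

print(df)
```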

Data Aggregation

Aggregation involves summarizing data with operations such as sum, mean, and count, either over a whole column or within groups.

Examples:

df.groupby('category').sum()
df['column'].mean()
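
As a sketch of how grouped and whole-column aggregation differ, the example below computes per-category totals, several statistics at once with agg, and an overall mean. The column names category and amount and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with a grouping column and a numeric column.
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "amount": [10, 20, 30, 40, 50],
})

# One statistic per group.
totals = df.groupby("category")["amount"].sum()

# Several statistics per group in a single call.
summary = df.groupby("category")["amount"].agg(["sum", "mean", "count"])

# One statistic over the whole column, ignoring groups.
overall_mean = df["amount"].mean()

print(totals)
print(summary)
print(overall_mean)
```

Selecting the numeric column before aggregating keeps the result limited to the values being summarized.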

Data Visualization

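Pandas integrates with Matplotlib and Seaborn for data visualization: the built-in plot() method draws with Matplotlib, and Seaborn's plotting functions accept DataFrames directly.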

Examples:

import matplotlib.pyplot as plt
df['column'].plot(kind='bar')
plt.show()
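
A slightly fuller sketch, assuming Matplotlib is installed: plot() returns a Matplotlib Axes object, so labels and titles can be set on it afterwards. The data and labels are invented for illustration, and the non-interactive "Agg" backend is used here only so the example runs without a display. In an interactive session you would call plt.show() instead of saving.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: one bar per category.
df = pd.DataFrame({"category": ["a", "b", "c"], "amount": [3, 7, 5]})

# DataFrame.plot returns the Matplotlib Axes it drew on.
ax = df.plot(kind="bar", x="category", y="amount", legend=False)
ax.set_ylabel("amount")
ax.set_title("Amount per category")

# Render the figure to an in-memory PNG instead of opening a window.
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format="png")
plt.close(ax.get_figure())
```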

Best Practices

  • Always inspect your data using df.head(), df.info(), and df.describe().
  • Handle missing data appropriately before analysis.
  • Use vectorized operations for better performance.
  • Avoid loops; use pandas built-in functions.
  • Document your data transformations for reproducibility.
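
The first few practices above can be sketched as follows. The price and qty columns are hypothetical. The example inspects the data, counts missing values before doing any analysis, and then computes a derived column with a vectorized expression rather than a loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price.
df = pd.DataFrame({"price": [2.0, 3.5, np.nan, 4.0], "qty": [1, 2, 3, 4]})

# Inspect: first rows, column types and non-null counts, summary statistics.
print(df.head(2))
df.info()
stats = df.describe()
print(stats)

# Check for missing data before analysis.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Vectorized arithmetic on whole columns; no explicit Python loop.
df["total"] = df["price"] * df["qty"]
print(df)
```

The missing price propagates into the total column as a missing value, which is one reason to decide how to handle gaps before computing derived columns.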

Common Pitfalls

  • Ignoring missing values can lead to incorrect analysis.
  • Using loops instead of vectorized operations can slow down performance.
  • Not understanding the difference between copy and view can lead to unexpected behavior.
  • Failing to reset index after filtering can cause issues in further operations.
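
The last two pitfalls can be illustrated with a short sketch on made-up data. Filtering keeps the original index labels, so the result no longer starts at 0 until the index is reset. Taking an explicit copy of a filtered subset makes it clear that later changes are meant to stay in the subset and not reach the original.

```python
import pandas as pd

# Hypothetical data with the default 0..3 index.
df = pd.DataFrame({"value": [5, 15, 25, 35]})

# Filtering keeps the original labels: the result is indexed 1, 2, 3.
filtered = df[df["value"] > 10]
print(list(filtered.index))

# reset_index(drop=True) renumbers from 0 and discards the old labels.
reset = filtered.reset_index(drop=True)
print(list(reset.index))

# An explicit copy is independent of df, so modifying it is unambiguous.
subset = df[df["value"] > 10].copy()
subset["value"] = 0
print(df["value"].tolist())
```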