40. Data Analysis with pandas
Pandas is a widely used open-source data analysis and manipulation library for Python. It provides the data structures and functions needed to manipulate structured data efficiently, especially tabular data. Pandas is built on top of NumPy and integrates well with other data science libraries such as Matplotlib and scikit-learn.
How pandas Works
Pandas introduces two primary data structures: Series and DataFrame. These structures are designed to handle and analyze data efficiently. A Series is a one-dimensional labeled array, while a DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
Series and DataFrames
A Series is similar to a column in a spreadsheet or a database table. It has an index and a single data column. A DataFrame is akin to a table with rows and columns, where each column can be a different data type.
Example of Series:
import pandas as pd
s = pd.Series([10, 20, 30, 40])
print(s)
Example of DataFrame:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
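The two examples above use the default integer index. As a minimal sketch of what "labeled" means in practice, the following gives a Series explicit labels and inspects the column types of a DataFrame. The labels "a" to "d" and the variable names are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0..n-1 index
# (the labels "a".."d" are made up for illustration)
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
value_at_b = s["b"]  # label-based lookup

# A DataFrame whose columns hold different types (text and integers)
df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
ages = df["Age"]  # selecting a single column yields a Series

print(value_at_b)
print(type(ages).__name__)
print(df.dtypes)  # one dtype per column
```

Selecting one column of a DataFrame returns a Series, which is why the two structures share most of their methods.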
Data Loading
Pandas can load data from various file formats including CSV, Excel, JSON, and SQL databases.
Example:
df = pd.read_csv('data.csv')
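Because data.csv may not exist on the reader's machine, the sketch below loads the same kind of data from in-memory text instead. The CSV and JSON contents are invented sample data; read_csv and read_json accept file-like objects as well as paths, so the calls are otherwise the same as for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in a file such as data.csv
csv_text = "name,age,city\nAlice,25,Paris\nBob,30,Berlin\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# The same kind of table loaded from JSON records (also invented sample data)
json_text = '[{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df)
print(df_json)
```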
Data Cleaning
Data cleaning involves handling missing values, correcting data types, and removing duplicates.
Examples:
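df.dropna()           # returns a new DataFrame without rows that contain missing values
df.fillna(0)          # returns a new DataFrame with missing values replaced by 0
df.drop_duplicates()  # returns a new DataFrame without duplicate rows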
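To show how these calls might combine, here is a small sketch on invented data with one duplicate row, one missing age, one missing score, and ages stored as text. The column names and the order of the steps are assumptions chosen for illustration; the right order depends on the dataset.

```python
import numpy as np
import pandas as pd

# Invented sample data: a duplicated row, a missing age, a missing score
raw = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", "Carol", "Dave"],
    "age": ["25", "30", "30", None, "41"],  # stored as text
    "score": [88.0, 92.5, 92.5, 95.0, np.nan],
})

# Each method returns a new DataFrame, so keep the result
deduped = raw.drop_duplicates()                               # removes the second Bob row
typed = deduped.assign(age=pd.to_numeric(deduped["age"]))     # text -> numbers, None -> NaN
filled = typed.fillna({"score": 0})                           # fill only the score column
complete = filled.dropna()                                    # drop rows still missing a value

print(complete)
```

Passing a dictionary to fillna limits the fill to the named columns, which avoids replacing a missing age with a misleading 0.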
Data Transformation
Transforming data includes operations like renaming columns, changing data types, and applying functions to columns.
Examples:
df = df.rename(columns={'old_name': 'new_name'})  # rename returns a new DataFrame
df['column'] = df['column'].astype(float)
df['new_col'] = df['col'].apply(lambda x: x * 2)
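The three operations can be tried together on a small invented DataFrame. The column names "price", "doubled_apply", and "doubled_vectorized" are hypothetical; the last line shows that, for simple arithmetic, a plain column expression gives the same result as apply.

```python
import pandas as pd

# Invented sample data: prices stored as text, plus an integer column
df = pd.DataFrame({"old_name": ["1.5", "2.0", "3.25"], "col": [1, 2, 3]})

# Rename a column (assign the result, since rename returns a new DataFrame)
df = df.rename(columns={"old_name": "price"})

# Convert the text column to floating-point numbers
df["price"] = df["price"].astype(float)

# Apply a function element by element
df["doubled_apply"] = df["col"].apply(lambda x: x * 2)

# Equivalent arithmetic written directly on the column
df["doubled_vectorized"] = df["col"] * 2

print(df)
```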
Data Aggregation
Aggregation involves summarizing data with operations such as sum, mean, and count, either over a whole column or within groups.
Examples:
df.groupby('category').sum()
df['column'].mean()
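As a worked sketch on invented data, the following computes per-group totals, several statistics at once with agg, and an overall mean. The "category" and "sales" columns are assumptions made for the example.

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({
    "category": ["A", "B", "A", "B", "A"],
    "sales": [10, 20, 30, 40, 50],
})

# Total sales per category
totals = df.groupby("category")["sales"].sum()

# Several summary statistics per category in one call
summary = df.groupby("category")["sales"].agg(["sum", "mean", "count"])

# A single statistic over the whole column
overall_mean = df["sales"].mean()

print(totals)
print(summary)
print(overall_mean)
```

Selecting the column before aggregating keeps the result focused on numeric data and avoids summing unrelated columns.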
Data Visualization
Pandas plotting methods are built on Matplotlib, and libraries such as Seaborn accept DataFrames directly for visualization.
Examples:
import matplotlib.pyplot as plt
df['column'].plot(kind='bar')
plt.show()
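A slightly fuller sketch of the bar chart, on invented data, is shown below. It assumes Matplotlib is installed and selects the non-interactive "Agg" backend so that it runs without a display; the column names and title are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data
df = pd.DataFrame({"category": ["A", "B", "C"], "sales": [10, 20, 30]})

# DataFrame.plot returns a Matplotlib Axes object that can be customized further
ax = df.plot(kind="bar", x="category", y="sales", legend=False)
ax.set_ylabel("sales")
ax.set_title("Sales by category")

# Each bar is a rectangle patch on the Axes; read back the heights
bar_heights = [patch.get_height() for patch in ax.patches]
print(bar_heights)

plt.close(ax.figure)  # release the figure once done
```

In an interactive session, the matplotlib.use line can be dropped and plt.show() called in place of plt.close to display the chart.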
Best Practices
- Always inspect your data using df.head(), df.info(), and df.describe().
- Handle missing data appropriately before analysis.
- Use vectorized operations for better performance.
- Avoid explicit Python loops over rows; use pandas built-in methods instead.
- Document your data transformations for reproducibility.
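The inspection and vectorization advice in the list above can be sketched as follows, on invented data. The loop is included only for comparison; both approaches compute the same totals, but the column expression leaves the iteration to pandas.

```python
import pandas as pd

# Invented sample data
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "quantity": [3, 2, 5]})

# Inspect the data before working with it
first_rows = df.head(2)   # first two rows
stats = df.describe()     # count, mean, std, min, quartiles, max per numeric column
df.info()                 # prints column names, dtypes, and non-null counts

# Row-by-row loop, shown only for comparison
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["quantity"])

# Vectorized equivalent: one expression over whole columns
df["total"] = df["price"] * df["quantity"]

print(first_rows)
print(df)
```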
Common Pitfalls
- Ignoring missing values can lead to incorrect analysis.
- Using loops instead of vectorized operations can slow down performance.
- Not understanding the difference between a copy and a view can lead to unexpected behavior, such as changes that never reach the original DataFrame.
- Failing to reset index after filtering can cause issues in further operations.
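Three of the pitfalls above can be made concrete with a short sketch on invented data: missing values being skipped silently, index labels surviving a filter, and the use of an explicit copy before modifying a subset. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented sample data with one missing score
df = pd.DataFrame({"score": [90.0, np.nan, 70.0, 50.0]})

# Missing values are skipped by default, so the mean covers only three rows
mean_skipping_nan = df["score"].mean()
mean_with_nan = df["score"].mean(skipna=False)  # NaN, making the gap visible

# Filtering keeps the original index labels (here 0 and 2)
high = df[df["score"] >= 70]
high_reset = high.reset_index(drop=True)  # relabel the rows 0, 1

# Take an explicit copy before modifying a filtered subset,
# so the change cannot be confused with a change to df
subset = df[df["score"] >= 70].copy()
subset["score"] = subset["score"] + 5

print(mean_skipping_nan, mean_with_nan)
print(list(high.index), list(high_reset.index))
```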