Data Analysis with Pandas
Learn essential data analysis techniques using Python and Pandas. This tutorial covers loading data, exploration, filtering, grouping, and pivot tables using a real sales dataset.
Prerequisites:
- Basic Python knowledge
- Pandas library installed
What you’ll learn:
- Loading and exploring CSV data
- Filtering and querying DataFrames
- Group by operations and aggregations
- Creating pivot tables for analysis
1. Setup: Import Libraries
First, let’s import the required libraries for data analysis.
import pandas as pd
import numpy as np
from datetime import datetime
# Display settings for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2f}'.format)
print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Libraries imported successfully!
Pandas version: 3.0.0
NumPy version: 2.4.2
2. Load and Explore Data
Let’s load our sales dataset and take a first look at the data structure.
# Load the sales data
df = pd.read_csv('/Users/bamr87/github/zer0-mistakes/assets/data/notebooks/sales_data.csv')
# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Display first few rows
print("📊 Sales Data - First 10 rows:")
df.head(10)
📊 Sales Data - First 10 rows:
| date | product | category | quantity | unit_price | revenue | region | salesperson | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-05 | Laptop Pro | Electronics | 3 | 1299.99 | 3899.97 | North | Alice Johnson |
| 1 | 2025-01-07 | Wireless Mouse | Electronics | 15 | 29.99 | 449.85 | South | Bob Smith |
| 2 | 2025-01-08 | Office Chair | Furniture | 5 | 249.99 | 1249.95 | East | Carol Davis |
| 3 | 2025-01-10 | Standing Desk | Furniture | 2 | 599.99 | 1199.98 | West | David Wilson |
| 4 | 2025-01-12 | Monitor 27inch | Electronics | 8 | 399.99 | 3199.92 | North | Alice Johnson |
| 5 | 2025-01-15 | Keyboard Mechanical | Electronics | 12 | 149.99 | 1799.88 | South | Bob Smith |
| 6 | 2025-01-17 | Desk Lamp | Furniture | 20 | 49.99 | 999.80 | East | Carol Davis |
| 7 | 2025-01-20 | Webcam HD | Electronics | 10 | 79.99 | 799.90 | West | David Wilson |
| 8 | 2025-01-22 | Laptop Pro | Electronics | 5 | 1299.99 | 6499.95 | North | Eve Martinez |
| 9 | 2025-01-25 | Wireless Mouse | Electronics | 25 | 29.99 | 749.75 | South | Frank Brown |
# Get basic information about the dataset
print("📋 Dataset Information:")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn names: {list(df.columns)}")
print("\n" + "="*60)
df.info()
📋 Dataset Information:
Shape: 98 rows × 8 columns
Column names: ['date', 'product', 'category', 'quantity', 'unit_price', 'revenue', 'region', 'salesperson']
============================================================
<class 'pandas.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 98 non-null datetime64[us]
1 product 98 non-null str
2 category 98 non-null str
3 quantity 98 non-null int64
4 unit_price 98 non-null float64
5 revenue 98 non-null float64
6 region 98 non-null str
7 salesperson 98 non-null str
dtypes: datetime64[us](1), float64(2), int64(1), str(4)
memory usage: 6.3 KB
# Statistical summary of numeric columns
print("📈 Statistical Summary:")
df.describe()
📈 Statistical Summary:
| date | quantity | unit_price | revenue | |
|---|---|---|---|---|
| count | 98 | 98.00 | 98.00 | 98.00 |
| mean | 2025-05-08 14:56:19.591836 | 16.55 | 324.99 | 2840.04 |
| min | 2025-01-05 00:00:00 | 2.00 | 29.99 | 449.85 |
| 25% | 2025-03-08 12:00:00 | 8.00 | 49.99 | 1199.74 |
| 50% | 2025-05-09 00:00:00 | 15.00 | 174.99 | 2124.89 |
| 75% | 2025-07-09 12:00:00 | 22.00 | 399.99 | 3862.44 |
| max | 2025-09-10 00:00:00 | 50.00 | 1299.99 | 11699.91 |
| std | NaN | 10.74 | 389.56 | 2236.43 |
3. Filtering and Querying Data
Learn different ways to filter and query your DataFrame.
# Filter: Get all Electronics sales
electronics_sales = df[df['category'] == 'Electronics']
print(f"🔌 Electronics Sales: {len(electronics_sales)} transactions")
electronics_sales.head()
🔌 Electronics Sales: 68 transactions
| date | product | category | quantity | unit_price | revenue | region | salesperson | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-05 | Laptop Pro | Electronics | 3 | 1299.99 | 3899.97 | North | Alice Johnson |
| 1 | 2025-01-07 | Wireless Mouse | Electronics | 15 | 29.99 | 449.85 | South | Bob Smith |
| 4 | 2025-01-12 | Monitor 27inch | Electronics | 8 | 399.99 | 3199.92 | North | Alice Johnson |
| 5 | 2025-01-15 | Keyboard Mechanical | Electronics | 12 | 149.99 | 1799.88 | South | Bob Smith |
| 7 | 2025-01-20 | Webcam HD | Electronics | 10 | 79.99 | 799.90 | West | David Wilson |
# Multiple conditions: High-value sales (revenue > $3000) in the North region
high_value_north = df[(df['revenue'] > 3000) & (df['region'] == 'North')]
print(f"💰 High-value North sales: {len(high_value_north)} transactions")
print(f"Total revenue: ${high_value_north['revenue'].sum():,.2f}")
high_value_north
💰 High-value North sales: 14 transactions
Total revenue: $75,348.48
| date | product | category | quantity | unit_price | revenue | region | salesperson | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-05 | Laptop Pro | Electronics | 3 | 1299.99 | 3899.97 | North | Alice Johnson |
| 4 | 2025-01-12 | Monitor 27inch | Electronics | 8 | 399.99 | 3199.92 | North | Alice Johnson |
| 8 | 2025-01-22 | Laptop Pro | Electronics | 5 | 1299.99 | 6499.95 | North | Eve Martinez |
| 32 | 2025-03-28 | Monitor 27inch | Electronics | 12 | 399.99 | 4799.88 | North | Eve Martinez |
| 36 | 2025-04-08 | Laptop Pro | Electronics | 6 | 1299.99 | 7799.94 | North | Carol Davis |
| 52 | 2025-05-18 | Monitor 27inch | Electronics | 15 | 399.99 | 5999.85 | North | Alice Johnson |
| 56 | 2025-05-28 | Laptop Pro | Electronics | 3 | 1299.99 | 3899.97 | North | Eve Martinez |
| 68 | 2025-06-28 | Headphones Wireless | Electronics | 20 | 199.99 | 3999.80 | North | Eve Martinez |
| 72 | 2025-07-08 | Monitor 27inch | Electronics | 14 | 399.99 | 5599.86 | North | Carol Davis |
| 76 | 2025-07-18 | Laptop Pro | Electronics | 5 | 1299.99 | 6499.95 | North | Alice Johnson |
| 80 | 2025-07-28 | Office Chair | Furniture | 15 | 249.99 | 3749.85 | North | Eve Martinez |
| 88 | 2025-08-18 | Headphones Wireless | Electronics | 22 | 199.99 | 4399.78 | North | Alice Johnson |
| 92 | 2025-08-28 | Monitor 27inch | Electronics | 18 | 399.99 | 7199.82 | North | Eve Martinez |
| 96 | 2025-09-08 | Laptop Pro | Electronics | 6 | 1299.99 | 7799.94 | North | Carol Davis |
# Using query() method - more readable for complex filters
q3_sales = df.query("date >= '2025-07-01' and date <= '2025-09-30'")
print(f"📅 Q3 2025 Sales: {len(q3_sales)} transactions")
print(f"Q3 Total Revenue: ${q3_sales['revenue'].sum():,.2f}")
q3_sales.head()
📅 Q3 2025 Sales: 28 transactions
Q3 Total Revenue: $107,193.76
| date | product | category | quantity | unit_price | revenue | region | salesperson | |
|---|---|---|---|---|---|---|---|---|
| 70 | 2025-07-02 | Office Chair | Furniture | 11 | 249.99 | 2749.89 | East | Alice Johnson |
| 71 | 2025-07-05 | Standing Desk | Furniture | 8 | 599.99 | 4799.92 | West | Bob Smith |
| 72 | 2025-07-08 | Monitor 27inch | Electronics | 14 | 399.99 | 5599.86 | North | Carol Davis |
| 73 | 2025-07-10 | Keyboard Mechanical | Electronics | 28 | 149.99 | 4199.72 | South | David Wilson |
| 74 | 2025-07-12 | Desk Lamp | Furniture | 35 | 49.99 | 1749.65 | East | Eve Martinez |
4. Group By Operations
Group data by one or more columns and perform aggregations.
# Group by category and calculate total revenue
category_revenue = df.groupby('category')['revenue'].sum().sort_values(ascending=False)
print("💵 Revenue by Category:")
category_revenue
💵 Revenue by Category:
category
Electronics 208677.93
Furniture 69645.85
Name: revenue, dtype: float64
# Group by region and calculate multiple metrics
region_stats = df.groupby('region').agg({
'revenue': ['sum', 'mean', 'count'],
'quantity': 'sum'
}).round(2)
# Flatten column names
region_stats.columns = ['total_revenue', 'avg_revenue', 'num_transactions', 'total_units']
region_stats = region_stats.sort_values('total_revenue', ascending=False)
print("🌍 Sales Statistics by Region:")
region_stats
🌍 Sales Statistics by Region:
| total_revenue | avg_revenue | num_transactions | total_units | |
|---|---|---|---|---|
| region | ||||
| North | 96996.70 | 3879.87 | 25 | 330 |
| East | 85646.80 | 3568.62 | 24 | 320 |
| South | 48294.72 | 1931.79 | 25 | 528 |
| West | 47385.56 | 1974.40 | 24 | 444 |
# Top salespeople by revenue
salesperson_revenue = df.groupby('salesperson')['revenue'].sum().sort_values(ascending=False)
print("🏆 Top Salespeople by Total Revenue:")
salesperson_revenue
🏆 Top Salespeople by Total Revenue:
salesperson
Eve Martinez 62347.93
Alice Johnson 61397.89
Carol Davis 58897.68
David Wilson 34786.65
Bob Smith 33356.65
Frank Brown 27536.98
Name: revenue, dtype: float64
5. Pivot Tables
Create pivot tables for multi-dimensional analysis.
# Pivot table: Revenue by Region and Category
pivot_region_category = pd.pivot_table(
df,
values='revenue',
index='region',
columns='category',
aggfunc='sum',
fill_value=0
)
print("📊 Revenue by Region and Category:")
pivot_region_category
📊 Revenue by Region and Category:
| category | Electronics | Furniture |
|---|---|---|
| region | ||
| East | 63798.81 | 21847.99 |
| North | 81598.30 | 15398.40 |
| South | 35094.94 | 13199.78 |
| West | 28185.88 | 19199.68 |
# Add month column for time-based analysis
df['month'] = df['date'].dt.to_period('M')
# Pivot table: Monthly revenue by product
pivot_monthly = pd.pivot_table(
df,
values='revenue',
index='product',
columns='month',
aggfunc='sum',
fill_value=0
)
print("📅 Monthly Revenue by Product:")
pivot_monthly
📅 Monthly Revenue by Product:
| month | 2025-01 | 2025-02 | 2025-03 | 2025-04 | 2025-05 | 2025-06 | 2025-07 | 2025-08 | 2025-09 |
|---|---|---|---|---|---|---|---|---|---|
| product | |||||||||
| Desk Lamp | 999.80 | 749.85 | 1249.75 | 2399.52 | 1099.78 | 1399.72 | 1749.65 | 1999.60 | 1599.68 |
| Headphones Wireless | 0.00 | 1399.93 | 1999.90 | 2399.88 | 2999.85 | 5599.72 | 3599.82 | 4399.78 | 0.00 |
| Keyboard Mechanical | 1799.88 | 2699.82 | 5249.65 | 3299.78 | 3749.75 | 2399.84 | 4199.72 | 9749.35 | 0.00 |
| Laptop Pro | 10399.92 | 2599.98 | 5199.96 | 7799.94 | 14299.89 | 9099.93 | 6499.95 | 11699.91 | 7799.94 |
| Monitor 27inch | 3199.92 | 2399.94 | 8799.78 | 3599.91 | 5999.85 | 4399.89 | 5599.86 | 13599.66 | 0.00 |
| Office Chair | 1249.95 | 3499.86 | 2499.90 | 1749.93 | 2999.88 | 2249.91 | 6499.74 | 3249.87 | 0.00 |
| Standing Desk | 1199.98 | 4199.93 | 2999.95 | 3599.94 | 2399.96 | 4199.93 | 8399.86 | 5399.91 | 0.00 |
| USB Hub | 0.00 | 879.78 | 719.82 | 999.75 | 1199.70 | 1399.65 | 1399.65 | 1119.72 | 0.00 |
| Webcam HD | 799.90 | 959.88 | 639.92 | 1199.85 | 2319.71 | 1119.86 | 1599.80 | 1279.84 | 1759.78 |
| Wireless Mouse | 1199.60 | 899.70 | 599.80 | 1049.65 | 2039.32 | 959.68 | 1349.55 | 1499.50 | 1139.62 |
6. Summary Statistics and Key Insights
Generate a comprehensive summary of the sales data.
# Generate key business insights
print("=" * 60)
print("📊 SALES ANALYSIS SUMMARY")
print("=" * 60)
# Overall metrics
total_revenue = df['revenue'].sum()
avg_transaction = df['revenue'].mean()
total_units = df['quantity'].sum()
num_transactions = len(df)
print(f"\n💰 Total Revenue: ${total_revenue:,.2f}")
print(f"📦 Total Units Sold: {total_units:,}")
print(f"🧾 Number of Transactions: {num_transactions:,}")
print(f"💵 Average Transaction Value: ${avg_transaction:,.2f}")
# Best performing
best_product = df.groupby('product')['revenue'].sum().idxmax()
best_region = df.groupby('region')['revenue'].sum().idxmax()
best_salesperson = df.groupby('salesperson')['revenue'].sum().idxmax()
print(f"\n🏆 Best Selling Product: {best_product}")
print(f"🌍 Top Region: {best_region}")
print(f"⭐ Top Salesperson: {best_salesperson}")
# Time range
print(f"\n📅 Date Range: {df['date'].min().date()} to {df['date'].max().date()}")
print("=" * 60)
============================================================
📊 SALES ANALYSIS SUMMARY
============================================================
💰 Total Revenue: $278,323.78
📦 Total Units Sold: 1,622
🧾 Number of Transactions: 98
💵 Average Transaction Value: $2,840.04
🏆 Best Selling Product: Laptop Pro
🌍 Top Region: North
⭐ Top Salesperson: Eve Martinez
📅 Date Range: 2025-01-05 to 2025-09-10
============================================================
Next Steps
Now that you’ve learned the basics of Pandas data analysis, you can:
- Visualize your data - Check out the Matplotlib Visualization tutorial
- Perform statistical analysis - See the Python Statistics tutorial
- Work with APIs - Learn to fetch data in the API Requests tutorial
Key Takeaways:
- Use
head(),info(), anddescribe()for initial data exploration - Filter with boolean indexing or the
query()method - Use
groupby()for aggregations by category - Create
pivot_table()for multi-dimensional analysis
Comments