Data Analysis with Pandas

13 min read Updated: July 01, 2026 Jupyter Notebook
Learn essential data analysis techniques using Python and Pandas. This tutorial covers loading data, exploration, filtering, grouping, and pivot tables using a real sales dataset.

Data Analysis with Pandas

Learn essential data analysis techniques using Python and Pandas. This tutorial covers loading data, exploration, filtering, grouping, and pivot tables using a real sales dataset.

Prerequisites:

  • Basic Python knowledge
  • Pandas library installed

What you’ll learn:

  • Loading and exploring CSV data
  • Filtering and querying DataFrames
  • Group by operations and aggregations
  • Creating pivot tables for analysis

1. Setup: Import Libraries

First, let’s import the required libraries for data analysis.

import pandas as pd
import numpy as np
from datetime import datetime

# Display settings for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.float_format', '{:.2f}'.format)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
Libraries imported successfully!
Pandas version: 3.0.0
NumPy version: 2.4.2

2. Load and Explore Data

Let’s load our sales dataset and take a first look at the data structure.

# Load the sales data
df = pd.read_csv('/Users/bamr87/github/zer0-mistakes/assets/data/notebooks/sales_data.csv')

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Display first few rows
print("📊 Sales Data - First 10 rows:")
df.head(10)
📊 Sales Data - First 10 rows:
date product category quantity unit_price revenue region salesperson
0 2025-01-05 Laptop Pro Electronics 3 1299.99 3899.97 North Alice Johnson
1 2025-01-07 Wireless Mouse Electronics 15 29.99 449.85 South Bob Smith
2 2025-01-08 Office Chair Furniture 5 249.99 1249.95 East Carol Davis
3 2025-01-10 Standing Desk Furniture 2 599.99 1199.98 West David Wilson
4 2025-01-12 Monitor 27inch Electronics 8 399.99 3199.92 North Alice Johnson
5 2025-01-15 Keyboard Mechanical Electronics 12 149.99 1799.88 South Bob Smith
6 2025-01-17 Desk Lamp Furniture 20 49.99 999.80 East Carol Davis
7 2025-01-20 Webcam HD Electronics 10 79.99 799.90 West David Wilson
8 2025-01-22 Laptop Pro Electronics 5 1299.99 6499.95 North Eve Martinez
9 2025-01-25 Wireless Mouse Electronics 25 29.99 749.75 South Frank Brown
# Get basic information about the dataset
print("📋 Dataset Information:")
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn names: {list(df.columns)}")
print("\n" + "="*60)
df.info()
📋 Dataset Information:
Shape: 98 rows × 8 columns

Column names: ['date', 'product', 'category', 'quantity', 'unit_price', 'revenue', 'region', 'salesperson']

============================================================
<class 'pandas.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         98 non-null     datetime64[us]
 1   product      98 non-null     str           
 2   category     98 non-null     str           
 3   quantity     98 non-null     int64         
 4   unit_price   98 non-null     float64       
 5   revenue      98 non-null     float64       
 6   region       98 non-null     str           
 7   salesperson  98 non-null     str           
dtypes: datetime64[us](1), float64(2), int64(1), str(4)
memory usage: 6.3 KB
# Statistical summary of numeric columns
print("📈 Statistical Summary:")
df.describe()
📈 Statistical Summary:
date quantity unit_price revenue
count 98 98.00 98.00 98.00
mean 2025-05-08 14:56:19.591836 16.55 324.99 2840.04
min 2025-01-05 00:00:00 2.00 29.99 449.85
25% 2025-03-08 12:00:00 8.00 49.99 1199.74
50% 2025-05-09 00:00:00 15.00 174.99 2124.89
75% 2025-07-09 12:00:00 22.00 399.99 3862.44
max 2025-09-10 00:00:00 50.00 1299.99 11699.91
std NaN 10.74 389.56 2236.43

3. Filtering and Querying Data

Learn different ways to filter and query your DataFrame.

# Filter: Get all Electronics sales
electronics_sales = df[df['category'] == 'Electronics']
print(f"🔌 Electronics Sales: {len(electronics_sales)} transactions")
electronics_sales.head()
🔌 Electronics Sales: 68 transactions
date product category quantity unit_price revenue region salesperson
0 2025-01-05 Laptop Pro Electronics 3 1299.99 3899.97 North Alice Johnson
1 2025-01-07 Wireless Mouse Electronics 15 29.99 449.85 South Bob Smith
4 2025-01-12 Monitor 27inch Electronics 8 399.99 3199.92 North Alice Johnson
5 2025-01-15 Keyboard Mechanical Electronics 12 149.99 1799.88 South Bob Smith
7 2025-01-20 Webcam HD Electronics 10 79.99 799.90 West David Wilson
# Multiple conditions: High-value sales (revenue > $3000) in the North region
high_value_north = df[(df['revenue'] > 3000) & (df['region'] == 'North')]
print(f"💰 High-value North sales: {len(high_value_north)} transactions")
print(f"Total revenue: ${high_value_north['revenue'].sum():,.2f}")
high_value_north
💰 High-value North sales: 14 transactions
Total revenue: $75,348.48
date product category quantity unit_price revenue region salesperson
0 2025-01-05 Laptop Pro Electronics 3 1299.99 3899.97 North Alice Johnson
4 2025-01-12 Monitor 27inch Electronics 8 399.99 3199.92 North Alice Johnson
8 2025-01-22 Laptop Pro Electronics 5 1299.99 6499.95 North Eve Martinez
32 2025-03-28 Monitor 27inch Electronics 12 399.99 4799.88 North Eve Martinez
36 2025-04-08 Laptop Pro Electronics 6 1299.99 7799.94 North Carol Davis
52 2025-05-18 Monitor 27inch Electronics 15 399.99 5999.85 North Alice Johnson
56 2025-05-28 Laptop Pro Electronics 3 1299.99 3899.97 North Eve Martinez
68 2025-06-28 Headphones Wireless Electronics 20 199.99 3999.80 North Eve Martinez
72 2025-07-08 Monitor 27inch Electronics 14 399.99 5599.86 North Carol Davis
76 2025-07-18 Laptop Pro Electronics 5 1299.99 6499.95 North Alice Johnson
80 2025-07-28 Office Chair Furniture 15 249.99 3749.85 North Eve Martinez
88 2025-08-18 Headphones Wireless Electronics 22 199.99 4399.78 North Alice Johnson
92 2025-08-28 Monitor 27inch Electronics 18 399.99 7199.82 North Eve Martinez
96 2025-09-08 Laptop Pro Electronics 6 1299.99 7799.94 North Carol Davis
# Using query() method - more readable for complex filters
q3_sales = df.query("date >= '2025-07-01' and date <= '2025-09-30'")
print(f"📅 Q3 2025 Sales: {len(q3_sales)} transactions")
print(f"Q3 Total Revenue: ${q3_sales['revenue'].sum():,.2f}")
q3_sales.head()
📅 Q3 2025 Sales: 28 transactions
Q3 Total Revenue: $107,193.76
date product category quantity unit_price revenue region salesperson
70 2025-07-02 Office Chair Furniture 11 249.99 2749.89 East Alice Johnson
71 2025-07-05 Standing Desk Furniture 8 599.99 4799.92 West Bob Smith
72 2025-07-08 Monitor 27inch Electronics 14 399.99 5599.86 North Carol Davis
73 2025-07-10 Keyboard Mechanical Electronics 28 149.99 4199.72 South David Wilson
74 2025-07-12 Desk Lamp Furniture 35 49.99 1749.65 East Eve Martinez

4. Group By Operations

Group data by one or more columns and perform aggregations.

# Group by category and calculate total revenue
category_revenue = df.groupby('category')['revenue'].sum().sort_values(ascending=False)
print("💵 Revenue by Category:")
category_revenue
💵 Revenue by Category:





category
Electronics   208677.93
Furniture      69645.85
Name: revenue, dtype: float64
# Group by region and calculate multiple metrics
region_stats = df.groupby('region').agg({
    'revenue': ['sum', 'mean', 'count'],
    'quantity': 'sum'
}).round(2)

# Flatten column names
region_stats.columns = ['total_revenue', 'avg_revenue', 'num_transactions', 'total_units']
region_stats = region_stats.sort_values('total_revenue', ascending=False)

print("🌍 Sales Statistics by Region:")
region_stats
🌍 Sales Statistics by Region:
total_revenue avg_revenue num_transactions total_units
region
North 96996.70 3879.87 25 330
East 85646.80 3568.62 24 320
South 48294.72 1931.79 25 528
West 47385.56 1974.40 24 444
# Top salespeople by revenue
salesperson_revenue = df.groupby('salesperson')['revenue'].sum().sort_values(ascending=False)
print("🏆 Top Salespeople by Total Revenue:")
salesperson_revenue
🏆 Top Salespeople by Total Revenue:





salesperson
Eve Martinez    62347.93
Alice Johnson   61397.89
Carol Davis     58897.68
David Wilson    34786.65
Bob Smith       33356.65
Frank Brown     27536.98
Name: revenue, dtype: float64

5. Pivot Tables

Create pivot tables for multi-dimensional analysis.

# Pivot table: Revenue by Region and Category
pivot_region_category = pd.pivot_table(
    df,
    values='revenue',
    index='region',
    columns='category',
    aggfunc='sum',
    fill_value=0
)

print("📊 Revenue by Region and Category:")
pivot_region_category
📊 Revenue by Region and Category:
category Electronics Furniture
region
East 63798.81 21847.99
North 81598.30 15398.40
South 35094.94 13199.78
West 28185.88 19199.68
# Add month column for time-based analysis
df['month'] = df['date'].dt.to_period('M')

# Pivot table: Monthly revenue by product
pivot_monthly = pd.pivot_table(
    df,
    values='revenue',
    index='product',
    columns='month',
    aggfunc='sum',
    fill_value=0
)

print("📅 Monthly Revenue by Product:")
pivot_monthly
📅 Monthly Revenue by Product:
month 2025-01 2025-02 2025-03 2025-04 2025-05 2025-06 2025-07 2025-08 2025-09
product
Desk Lamp 999.80 749.85 1249.75 2399.52 1099.78 1399.72 1749.65 1999.60 1599.68
Headphones Wireless 0.00 1399.93 1999.90 2399.88 2999.85 5599.72 3599.82 4399.78 0.00
Keyboard Mechanical 1799.88 2699.82 5249.65 3299.78 3749.75 2399.84 4199.72 9749.35 0.00
Laptop Pro 10399.92 2599.98 5199.96 7799.94 14299.89 9099.93 6499.95 11699.91 7799.94
Monitor 27inch 3199.92 2399.94 8799.78 3599.91 5999.85 4399.89 5599.86 13599.66 0.00
Office Chair 1249.95 3499.86 2499.90 1749.93 2999.88 2249.91 6499.74 3249.87 0.00
Standing Desk 1199.98 4199.93 2999.95 3599.94 2399.96 4199.93 8399.86 5399.91 0.00
USB Hub 0.00 879.78 719.82 999.75 1199.70 1399.65 1399.65 1119.72 0.00
Webcam HD 799.90 959.88 639.92 1199.85 2319.71 1119.86 1599.80 1279.84 1759.78
Wireless Mouse 1199.60 899.70 599.80 1049.65 2039.32 959.68 1349.55 1499.50 1139.62

6. Summary Statistics and Key Insights

Generate a comprehensive summary of the sales data.

# Generate key business insights
print("=" * 60)
print("📊 SALES ANALYSIS SUMMARY")
print("=" * 60)

# Overall metrics
total_revenue = df['revenue'].sum()
avg_transaction = df['revenue'].mean()
total_units = df['quantity'].sum()
num_transactions = len(df)

print(f"\n💰 Total Revenue: ${total_revenue:,.2f}")
print(f"📦 Total Units Sold: {total_units:,}")
print(f"🧾 Number of Transactions: {num_transactions:,}")
print(f"💵 Average Transaction Value: ${avg_transaction:,.2f}")

# Best performing
best_product = df.groupby('product')['revenue'].sum().idxmax()
best_region = df.groupby('region')['revenue'].sum().idxmax()
best_salesperson = df.groupby('salesperson')['revenue'].sum().idxmax()

print(f"\n🏆 Best Selling Product: {best_product}")
print(f"🌍 Top Region: {best_region}")
print(f"⭐ Top Salesperson: {best_salesperson}")

# Time range
print(f"\n📅 Date Range: {df['date'].min().date()} to {df['date'].max().date()}")
print("=" * 60)
============================================================
📊 SALES ANALYSIS SUMMARY
============================================================

💰 Total Revenue: $278,323.78
📦 Total Units Sold: 1,622
🧾 Number of Transactions: 98
💵 Average Transaction Value: $2,840.04

🏆 Best Selling Product: Laptop Pro
🌍 Top Region: North
⭐ Top Salesperson: Eve Martinez

📅 Date Range: 2025-01-05 to 2025-09-10
============================================================

Next Steps

Now that you’ve learned the basics of Pandas data analysis, you can:

  1. Visualize your data - Check out the Matplotlib Visualization tutorial
  2. Perform statistical analysis - See the Python Statistics tutorial
  3. Work with APIs - Learn to fetch data in the API Requests tutorial

Key Takeaways:

  • Use head(), info(), and describe() for initial data exploration
  • Filter with boolean indexing or the query() method
  • Use groupby() for aggregations by category
  • Create pivot_table() for multi-dimensional analysis
Comments