title: “Statistical Analysis with Python” description: “Learn statistical analysis using Python’s scipy and pandas libraries with real survey data” date: 2025-01-27 lastmod: 2025-01-27 author: “Zer0-Mistakes Team” layout: notebook difficulty: intermediate tags: [python, statistics, scipy, data-analysis, surveys] categories: [Notebooks, Tutorials] toc: true comments: true —
Statistical Analysis with Python
Learn to perform statistical analysis using Python’s powerful libraries. This tutorial covers descriptive statistics, hypothesis testing, correlation analysis, and more using real survey response data.
What you’ll learn:
- Descriptive statistics (mean, median, mode, variance)
- Correlation analysis
- Hypothesis testing (t-tests, chi-square)
- Normal distribution and normality testing
- Confidence intervals
Setup and Imports
# Import statistical and data libraries
import pandas as pd
import numpy as np
import scipy
from scipy import stats
from scipy.stats import ttest_ind, chi2_contingency, pearsonr, spearmanr
import warnings
warnings.filterwarnings('ignore')
print("✅ Libraries imported successfully!")
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
print(f"SciPy: {scipy.__version__}")
✅ Libraries imported successfully!
Pandas: 3.0.0
NumPy: 2.4.2
SciPy: 1.17.0
Load Survey Data
# Load the survey response dataset
survey = pd.read_csv('/Users/bamr87/github/zer0-mistakes/assets/data/notebooks/survey_responses.csv')
print("📊 Survey Data Preview:")
print(f"Shape: {survey.shape[0]} respondents × {survey.shape[1]} questions\n")
survey.head(10)
📊 Survey Data Preview:
Shape: 75 respondents × 13 questions
| respondent_id | age | gender | education | employment | income_bracket | product_satisfaction | service_rating | would_recommend | purchase_frequency | category_preference | feedback_length | response_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 28 | Female | Bachelor's | Full-time | 50000-75000 | 4 | 5 | Yes | Monthly | Electronics | 142 | 2025-01-15 |
| 1 | 2 | 35 | Male | Master's | Full-time | 75000-100000 | 5 | 4 | Yes | Weekly | Electronics | 89 | 2025-01-16 |
| 2 | 3 | 42 | Female | Bachelor's | Part-time | 25000-50000 | 3 | 3 | Maybe | Quarterly | Furniture | 156 | 2025-01-17 |
| 3 | 4 | 23 | Male | High School | Student | Under 25000 | 4 | 4 | Yes | Monthly | Electronics | 45 | 2025-01-18 |
| 4 | 5 | 51 | Female | Doctorate | Full-time | 100000+ | 5 | 5 | Yes | Weekly | Electronics | 203 | 2025-01-19 |
| 5 | 6 | 31 | Non-binary | Bachelor's | Full-time | 50000-75000 | 4 | 4 | Yes | Monthly | Furniture | 78 | 2025-01-20 |
| 6 | 7 | 45 | Male | Master's | Self-employed | 75000-100000 | 3 | 2 | No | Rarely | Electronics | 312 | 2025-01-21 |
| 7 | 8 | 27 | Female | Bachelor's | Full-time | 50000-75000 | 5 | 5 | Yes | Monthly | Electronics | 67 | 2025-01-22 |
| 8 | 9 | 38 | Male | Bachelor's | Full-time | 75000-100000 | 4 | 4 | Yes | Quarterly | Furniture | 95 | 2025-01-23 |
| 9 | 10 | 56 | Female | High School | Retired | 25000-50000 | 4 | 5 | Yes | Monthly | Furniture | 124 | 2025-01-24 |
Descriptive Statistics
Let’s calculate key descriptive statistics for our numerical columns:
# Calculate comprehensive descriptive statistics
numeric_cols = ['age', 'product_satisfaction', 'service_rating', 'feedback_length']
print("📈 Descriptive Statistics for Survey Responses:")
print("=" * 70)
for col in numeric_cols:
data = survey[col]
print(f"\n{col.upper().replace('_', ' ')}:")
print(f" Mean: {data.mean():.2f}")
print(f" Median: {data.median():.2f}")
print(f" Mode: {data.mode().values[0]}")
print(f" Std Dev: {data.std():.2f}")
print(f" Variance: {data.var():.2f}")
print(f" Range: {data.min()} - {data.max()}")
print(f" IQR: {data.quantile(0.75) - data.quantile(0.25):.2f}")
📈 Descriptive Statistics for Survey Responses:
======================================================================
AGE:
Mean: 38.28
Median: 37.00
Mode: 26
Std Dev: 11.24
Variance: 126.31
Range: 20 - 62
IQR: 18.00
PRODUCT SATISFACTION:
Mean: 4.16
Median: 4.00
Mode: 4
Std Dev: 0.74
Variance: 0.54
Range: 2 - 5
IQR: 1.00
SERVICE RATING:
Mean: 4.07
Median: 4.00
Mode: 4
Std Dev: 0.86
Variance: 0.74
Range: 2 - 5
IQR: 1.00
FEEDBACK LENGTH:
Mean: 130.56
Median: 112.00
Mode: 98
Std Dev: 71.89
Variance: 5167.84
Range: 32 - 312
IQR: 97.50
Correlation Analysis
Examine relationships between satisfaction metrics:
# Calculate correlation matrix for satisfaction metrics
satisfaction_cols = ['product_satisfaction', 'service_rating', 'feedback_length', 'age']
correlation_matrix = survey[satisfaction_cols].corr()
print("🔗 Correlation Matrix (Pearson):")
print(correlation_matrix.round(3))
# Detailed pairwise correlations with significance
print("\n\n📊 Detailed Correlation Analysis:")
print("=" * 60)
pairs = [
('product_satisfaction', 'service_rating'),
('age', 'product_satisfaction'),
('age', 'service_rating'),
('feedback_length', 'product_satisfaction')
]
for col1, col2 in pairs:
r, p_value = pearsonr(survey[col1], survey[col2])
significance = "***" if p_value < 0.001 else "**" if p_value < 0.01 else "*" if p_value < 0.05 else "ns"
print(f"\n{col1} vs {col2}:")
print(f" Pearson r = {r:.4f} ({significance})")
print(f" p-value = {p_value:.4f}")
if abs(r) >= 0.7:
strength = "strong"
elif abs(r) >= 0.4:
strength = "moderate"
else:
strength = "weak"
direction = "positive" if r > 0 else "negative"
print(f" → {strength.capitalize()} {direction} correlation")
🔗 Correlation Matrix (Pearson):
product_satisfaction service_rating feedback_length \
product_satisfaction 1.000 0.709 -0.309
service_rating 0.709 1.000 -0.233
feedback_length -0.309 -0.233 1.000
age -0.218 -0.005 0.647
age
product_satisfaction -0.218
service_rating -0.005
feedback_length 0.647
age 1.000
📊 Detailed Correlation Analysis:
============================================================
product_satisfaction vs service_rating:
Pearson r = 0.7093 (***)
p-value = 0.0000
→ Strong positive correlation
age vs product_satisfaction:
Pearson r = -0.2179 (ns)
p-value = 0.0604
→ Weak negative correlation
age vs service_rating:
Pearson r = -0.0048 (ns)
p-value = 0.9677
→ Weak negative correlation
feedback_length vs product_satisfaction:
Pearson r = -0.3092 (**)
p-value = 0.0069
→ Weak negative correlation
Hypothesis Testing: T-Tests
Compare satisfaction scores between different groups:
# Independent samples t-test: Compare satisfaction between genders
male_satisfaction = survey[survey['gender'] == 'Male']['product_satisfaction']
female_satisfaction = survey[survey['gender'] == 'Female']['product_satisfaction']
t_stat, p_value = ttest_ind(male_satisfaction, female_satisfaction)
print("🧪 Independent Samples T-Test: Product Satisfaction by Gender")
print("=" * 60)
print(f"\nGroup Statistics:")
print(f" Male (n={len(male_satisfaction)}): M = {male_satisfaction.mean():.2f}, SD = {male_satisfaction.std():.2f}")
print(f" Female (n={len(female_satisfaction)}): M = {female_satisfaction.mean():.2f}, SD = {female_satisfaction.std():.2f}")
print(f"\nTest Results:")
print(f" t-statistic = {t_stat:.4f}")
print(f" p-value = {p_value:.4f}")
print(f"\nConclusion at α = 0.05:")
if p_value < 0.05:
print(" ✓ REJECT null hypothesis - significant difference exists")
else:
print(" ✗ FAIL TO REJECT null hypothesis - no significant difference")
🧪 Independent Samples T-Test: Product Satisfaction by Gender
============================================================
Group Statistics:
Male (n=35): M = 4.00, SD = 0.77
Female (n=36): M = 4.28, SD = 0.70
Test Results:
t-statistic = -1.5932
p-value = 0.1157
Conclusion at α = 0.05:
✗ FAIL TO REJECT null hypothesis - no significant difference
Chi-Square Test
Test for association between categorical variables:
# Chi-square test: Association between purchase frequency and category preference
contingency_table = pd.crosstab(survey['purchase_frequency'], survey['category_preference'])
print("📋 Contingency Table: Purchase Frequency × Category Preference")
print(contingency_table)
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print("\n\n🧪 Chi-Square Test of Independence")
print("=" * 60)
print(f"\nResults:")
print(f" Chi-square statistic = {chi2:.4f}")
print(f" Degrees of freedom = {dof}")
print(f" p-value = {p_value:.4f}")
print(f"\nConclusion at α = 0.05:")
if p_value < 0.05:
print(" ✓ REJECT null hypothesis - variables are DEPENDENT")
print(" → Purchase frequency IS associated with category preference")
else:
print(" ✗ FAIL TO REJECT null hypothesis - variables are INDEPENDENT")
print(" → No significant association between purchase frequency and category")
📋 Contingency Table: Purchase Frequency × Category Preference
category_preference Electronics Furniture
purchase_frequency
Monthly 23 19
Quarterly 4 8
Rarely 3 4
Weekly 14 0
🧪 Chi-Square Test of Independence
============================================================
Results:
Chi-square statistic = 14.0252
Degrees of freedom = 3
p-value = 0.0029
Conclusion at α = 0.05:
✓ REJECT null hypothesis - variables are DEPENDENT
→ Purchase frequency IS associated with category preference
Normality Testing
Check if satisfaction scores follow a normal distribution:
# Test normality of satisfaction scores using multiple methods
data = survey['product_satisfaction']
print("📐 Normality Tests for Product Satisfaction Scores")
print("=" * 60)
# Shapiro-Wilk Test (best for n < 5000)
shapiro_stat, shapiro_p = stats.shapiro(data)
print(f"\n1. Shapiro-Wilk Test:")
print(f" W-statistic = {shapiro_stat:.4f}")
print(f" p-value = {shapiro_p:.4f}")
# D'Agostino's K-squared Test
dagostino_stat, dagostino_p = stats.normaltest(data)
print(f"\n2. D'Agostino-Pearson Test:")
print(f" K² statistic = {dagostino_stat:.4f}")
print(f" p-value = {dagostino_p:.4f}")
# Skewness and Kurtosis
skew = stats.skew(data)
kurt = stats.kurtosis(data)
print(f"\n3. Distribution Shape:")
print(f" Skewness = {skew:.4f} ({'right-skewed' if skew > 0 else 'left-skewed' if skew < 0 else 'symmetric'})")
print(f" Kurtosis = {kurt:.4f} ({'leptokurtic' if kurt > 0 else 'platykurtic' if kurt < 0 else 'mesokurtic'})")
print(f"\n📊 Conclusion:")
if shapiro_p > 0.05:
print(" Data appears to be normally distributed (p > 0.05)")
else:
print(" Data deviates significantly from normal distribution (p < 0.05)")
📐 Normality Tests for Product Satisfaction Scores
============================================================
1. Shapiro-Wilk Test:
W-statistic = 0.8159
p-value = 0.0000
2. D'Agostino-Pearson Test:
K² statistic = 3.1266
p-value = 0.2094
3. Distribution Shape:
Skewness = -0.4623 (left-skewed)
Kurtosis = -0.3638 (platykurtic)
📊 Conclusion:
Data deviates significantly from normal distribution (p < 0.05)
Confidence Intervals
Calculate confidence intervals for key metrics:
# Calculate 95% confidence intervals for satisfaction metrics
def confidence_interval(data, confidence=0.95):
"""Calculate confidence interval for mean"""
n = len(data)
mean = np.mean(data)
se = stats.sem(data) # Standard error of the mean
h = se * stats.t.ppf((1 + confidence) / 2, n - 1) # Margin of error
return mean, mean - h, mean + h
print("📏 95% Confidence Intervals")
print("=" * 60)
metrics = {
'Product Satisfaction': survey['product_satisfaction'],
'Service Rating': survey['service_rating'],
'Feedback Length': survey['feedback_length'],
'Age': survey['age']
}
for name, data in metrics.items():
mean, lower, upper = confidence_interval(data)
print(f"\n{name}:")
print(f" Sample Mean: {mean:.2f}")
print(f" 95% CI: [{lower:.2f}, {upper:.2f}]")
print(f" → We are 95% confident the true population mean")
print(f" falls between {lower:.2f} and {upper:.2f}")
📏 95% Confidence Intervals
============================================================
Product Satisfaction:
Sample Mean: 4.16
95% CI: [3.99, 4.33]
→ We are 95% confident the true population mean
falls between 3.99 and 4.33
Service Rating:
Sample Mean: 4.07
95% CI: [3.87, 4.26]
→ We are 95% confident the true population mean
falls between 3.87 and 4.26
Feedback Length:
Sample Mean: 130.56
95% CI: [114.02, 147.10]
→ We are 95% confident the true population mean
falls between 114.02 and 147.10
Age:
Sample Mean: 38.28
95% CI: [35.69, 40.87]
→ We are 95% confident the true population mean
falls between 35.69 and 40.87
Summary Statistics by Group
# Generate comprehensive summary statistics by demographic groups
print("📊 SURVEY ANALYSIS SUMMARY")
print("=" * 70)
# Overall statistics
print(f"\n📋 Dataset Overview:")
print(f" Total Respondents: {len(survey)}")
print(f" Average Age: {survey['age'].mean():.1f} years")
print(f" Gender Distribution: {dict(survey['gender'].value_counts())}")
# Key findings
print(f"\n🎯 Key Satisfaction Metrics:")
print(f" Product Satisfaction: {survey['product_satisfaction'].mean():.2f}/5")
print(f" Service Rating: {survey['service_rating'].mean():.2f}/5")
recommend_yes = (survey['would_recommend'] == 'Yes').sum()
print(f" Would Recommend: {recommend_yes}/{len(survey)} ({100*recommend_yes/len(survey):.1f}%)")
# Category preferences
print(f"\n💻 Category Preferences:")
category_counts = survey['category_preference'].value_counts()
for category, count in category_counts.items():
pct = (count / len(survey)) * 100
print(f" {category}: {count} ({pct:.1f}%)")
# Purchase patterns
print(f"\n⏱️ Purchase Frequency:")
frequency_counts = survey['purchase_frequency'].value_counts()
for freq, count in frequency_counts.items():
pct = (count / len(survey)) * 100
print(f" {freq}: {count} ({pct:.1f}%)")
print("\n" + "=" * 70)
📊 SURVEY ANALYSIS SUMMARY
======================================================================
📋 Dataset Overview:
Total Respondents: 75
Average Age: 38.3 years
Gender Distribution: {'Female': np.int64(36), 'Male': np.int64(35), 'Non-binary': np.int64(4)}
🎯 Key Satisfaction Metrics:
Product Satisfaction: 4.16/5
Service Rating: 4.07/5
Would Recommend: 58/75 (77.3%)
💻 Category Preferences:
Electronics: 44 (58.7%)
Furniture: 31 (41.3%)
⏱️ Purchase Frequency:
Monthly: 42 (56.0%)
Weekly: 14 (18.7%)
Quarterly: 12 (16.0%)
Rarely: 7 (9.3%)
======================================================================
Next Steps
This tutorial covered the fundamentals of statistical analysis with Python. To continue learning:
- Visualize your statistics - Check out the Matplotlib Visualization tutorial
- Analyze larger datasets - See the Pandas Data Analysis tutorial
- Fetch external data - Learn about APIs in the API Requests tutorial
Key Takeaways:
- Use
describe()for quick descriptive statistics scipy.statsprovides comprehensive hypothesis testing tools- Always check assumptions (normality, equal variances) before parametric tests
- Correlation ≠ causation - always interpret results carefully
- Report confidence intervals alongside point estimates
Comments