πŸ“ˆ Regression Analysis

Definition: A statistical method for estimating the relationships between a dependent variable (outcome) and one or more independent variables (predictors). It is the foundation of quantitative business analysis.

Key courses: Wharton STAT 613, Booth BUSN 41100, HBS Analytics


πŸ”‘ The Core Intuition

Question regression answers: β€œHow does Y change when X changes, holding everything else constant?”

This β€œholding everything else constant” (ceteris paribus) is what makes regression powerful β€” it controls for confounders.


πŸ“ Simple Linear Regression

TermMeaning
YDependent variable (what we’re predicting)
XIndependent variable (the predictor)
Ξ²β‚€Intercept (Y when X = 0)
β₁Slope (change in Y for 1-unit change in X)
Ξ΅Error term (unexplained variation)

Example: Predicting Sales from Advertising Spend

Interpretation: Each additional 3.2K more in sales.


πŸ“Š Multiple Regression

Controls for multiple variables simultaneously:


πŸ“ Evaluating a Regression

RΒ² (R-squared) β€” Goodness of Fit

  • RΒ² = 0: Model explains nothing
  • RΒ² = 1: Model explains everything perfectly
  • RΒ² = 0.72: Model explains 72% of variation in Y

Warning: RΒ² always increases when adding variables β†’ use Adjusted RΒ² for multiple regression.

Statistical Significance (p-value)

  • p < 0.05: The coefficient is statistically significant β†’ X reliably predicts Y
  • p > 0.05: Could be noise

Confidence Intervals

  • 95% CI: The true β₁ lies in this range with 95% confidence
  • Narrow CI = more precise estimate

⚠️ Assumptions (OLS)

The Ordinary Least Squares (OLS) estimator requires:

AssumptionIf Violated
LinearityUse polynomial or log transformation
Independence of errorsAutocorrelation in time series data β†’ use time series models
Homoscedasticity (constant variance)Heteroscedasticity β†’ use robust standard errors
No multicollinearityCorrelated predictors β†’ unstable coefficients
Normality of errorsMostly needed for small samples

πŸ”’ Logistic Regression (Binary Outcomes)

When Y is binary (0 or 1), use logistic regression:

  • Outputs a probability between 0 and 1
  • Common in: Credit default prediction, customer churn, fraud detection, disease diagnosis

πŸ’Ό Business Applications

Business ProblemRegression TypeY Variable
Sales forecastingMultiple regressionRevenue
Price elasticityLog-log regressionQuantity sold
Customer churnLogistic regressionChurned (0/1)
House pricing modelMultiple regressionHome price
Ad attributionMultiple/ridge regressionConversions
Credit scoringLogistic regressionDefault (0/1)

🧠 Key Business Interpretation Rules

  1. Coefficients show marginal effects β€” the impact of one variable holding all others fixed
  2. Signs matter: Positive Ξ² = positive relationship, negative Ξ² = negative
  3. Scale matters: Compare standardized coefficients for relative importance
  4. Causation β‰  correlation: Regression shows association; need good design for causality (A-B Testing)
  5. Out-of-sample validity: Always test on held-out data

πŸ”— Connected Concepts


← πŸ“‰ Data & Analytics MOC | Related: A-B Testing Β· Hypothesis Testing Β· Decision Trees