ML From Scratch: Gradient Descent to Gradient Boosting
Implementing core ML algorithms from first principles using only NumPy
Why Build ML From Scratch?
Anyone can call `.fit()` in sklearn. But can you explain why gradient boosting works? Can you derive the update rule for logistic regression? This project forced me to understand the math, not just the API.
What I Built
Seven core algorithms, implemented using only NumPy:
Linear Models
- Linear Regression: Both closed-form (Normal Equation) and gradient descent
- Ridge & Lasso: L2/L1 regularization with coordinate descent for Lasso
- Logistic Regression: Cross-entropy loss with Newton-Raphson optimization
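To make the Newton-Raphson bullet concrete, here is a minimal sketch of the update (IRLS form). The function names, defaults, and the small damping term are illustrative, not the repo's exact code; it assumes `X` already includes an intercept column and `y` contains 0/1 labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=20, tol=1e-8):
    """Fit logistic regression by Newton-Raphson on the cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)                   # predicted probabilities
        grad = X.T @ (p - y)                 # gradient of the cross-entropy loss
        W = p * (1 - p)                      # per-sample Hessian weights
        H = X.T @ (X * W[:, None])           # Hessian: X^T diag(W) X
        step = np.linalg.solve(H + 1e-8 * np.eye(d), grad)  # tiny damping for stability
        w -= step
        if np.linalg.norm(step) < tol:
            break
    return w
```

Each iteration solves a weighted least-squares problem, which is why Newton-Raphson typically converges in far fewer steps than plain gradient descent on this loss.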
Tree-Based Models
- Decision Trees (CART): Gini impurity and entropy splitting criteria
- Random Forest: Bagging with feature subsampling
- Gradient Boosted Trees: Residual fitting with shrinkage (see the sketch after this list)
- XGBoost-style GBM: Second-order Taylor expansion and leaf regularization
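The residual-fitting loop from the Gradient Boosted Trees bullet, sketched with depth-1 stumps in place of full CART trees to keep it short. All names and defaults here are illustrative, not the repo's API.

```python
import numpy as np

def fit_stump(X, r):
    """Depth-1 regression tree: best single split minimizing squared error on residuals r.
    Brute-force over unique feature values -- fine for a sketch."""
    n, d = X.shape
    best = (np.inf, None, None, r.mean(), r.mean())
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = r[X[:, j] <= t], r[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, left.mean(), right.mean())
    return best[1:]  # (feature, threshold, left_value, right_value)

def stump_predict(stump, X):
    j, t, lv, rv = stump
    if j is None:                         # no valid split found
        return np.full(len(X), lv)
    return np.where(X[:, j] <= t, lv, rv)

def gradient_boost(X, y, n_trees=100, lr=0.1):
    """Least-squares boosting: each stump fits the residuals of the running prediction."""
    pred = np.full(len(y), y.mean())      # start from the mean
    stumps = []
    for _ in range(n_trees):
        residuals = y - pred              # negative gradient of squared loss
        stump = fit_stump(X, residuals)
        pred += lr * stump_predict(stump, X)   # shrinkage via the learning rate
        stumps.append(stump)
    return y.mean(), stumps

def gb_predict(base, stumps, X, lr=0.1):
    pred = np.full(len(X), base)
    for s in stumps:
        pred += lr * stump_predict(s, X)
    return pred
```

The learning rate is the "shrinkage" in the bullet: each tree only corrects a fraction of the remaining error, which trades more trees for better generalization.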
The Math That Matters
Gradient Descent Derivation
For linear regression with predictions $\hat{y} = Xw$, the gradient of the MSE loss $\frac{1}{n}\lVert Xw - y \rVert^2$ is

$$\nabla_w \,\text{MSE} = \frac{2}{n} X^\top (Xw - y),$$

and the update rule is $w \leftarrow w - \eta \, \nabla_w \,\text{MSE}$.
This simple formula powers everything from linear regression to deep learning.
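A minimal NumPy sketch of that update loop; the learning rate and iteration count are placeholder choices, not tuned values.

```python
import numpy as np

def linear_regression_gd(X, y, lr=0.01, n_iter=1000):
    """Minimize MSE by gradient descent: w <- w - lr * (2/n) * X^T (Xw - y)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ w - y)   # gradient of MSE from the formula above
        w -= lr * grad
    return w

# Sanity check against the closed-form Normal Equation: w = (X^T X)^{-1} X^T y
# w_closed = np.linalg.solve(X.T @ X, X.T @ y)
```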
Why XGBoost is Different
Standard gradient boosting fits residuals, i.e. the negative gradient of the loss. XGBoost instead approximates the loss at iteration $t$ with a second-order Taylor expansion:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i \, f_t(x_i) + \tfrac{1}{2} h_i \, f_t^2(x_i) \right] + \Omega(f_t)$$

where $g_i$ is the gradient and $h_i$ is the Hessian of the loss with respect to the previous prediction. Minimizing this per leaf gives the optimal weight $w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$, which leads to better splits and built-in regularization.
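A sketch of the quantities this objective yields, assuming squared-error loss so that $g_i = \hat{y}_i - y_i$ and $h_i = 1$; the regularization constants `lam` and `gamma` are illustrative defaults, not the repo's settings.

```python
import numpy as np

def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -g.sum() / (h.sum() + lam)

def leaf_score(g, h, lam=1.0):
    """Contribution of one leaf to the (negated) objective: G^2 / (H + lambda)."""
    return g.sum() ** 2 / (h.sum() + lam)

def split_gain(g, h, mask, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left (mask) and right (~mask) children."""
    return 0.5 * (leaf_score(g[mask], h[mask], lam)
                  + leaf_score(g[~mask], h[~mask], lam)
                  - leaf_score(g, h, lam)) - gamma

# For squared-error loss: g = y_pred - y, h = np.ones_like(y)
```

The $\lambda$ term shrinks leaf weights and the $\gamma$ term penalizes every extra leaf, which is where the "built-in regularization" comes from.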
Benchmarks
I tested my implementations against sklearn on standard datasets:
| Algorithm | My Implementation | sklearn | Accuracy Gap |
|---|---|---|---|
| Logistic Regression | 91.2% | 91.5% | 0.3% |
| Random Forest | 94.8% | 95.1% | 0.3% |
| Gradient Boosting | 96.1% | 96.8% | 0.7% |
| XGBoost-style | 96.5% | 97.2% | 0.7% |
Less than 1% gap — and I can explain every line of code.
Key Learnings
- Regularization is geometry — L2 shrinks coefficients toward the origin, L1 induces sparsity by hitting the axes (see the sketch after this list)
- Trees are greedy — they find locally optimal splits, not globally optimal
- Boosting reduces bias — each tree corrects the previous, unlike bagging which reduces variance
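To illustrate the first point, here is a minimal sketch contrasting the Lasso coordinate-descent soft-threshold step (which sets coefficients exactly to zero) with the closed-form ridge solution (which only shrinks them). The `alpha` scaling assumes a $\frac{1}{2n}$ loss convention and is an assumption, not necessarily the repo's exact formulation.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """The geometry of L1: values within [-alpha, alpha] snap exactly to zero."""
    return np.sign(rho) * np.maximum(np.abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha=0.1, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xw||^2 + alpha * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            residual_j = y - X @ w + w[j] * X[:, j]   # residual excluding feature j
            rho = X[:, j] @ residual_j / n
            z = X[:, j] @ X[:, j] / n
            w[j] = soft_threshold(rho, alpha) / z     # exact zeros for weak features
    return w

def ridge_closed_form(X, y, alpha=0.1):
    """Closed-form ridge: coefficients shrink toward zero but rarely hit it exactly."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * alpha * np.eye(d), X.T @ y)
```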
What’s Next
Extending to neural networks: backpropagation, batch normalization, and attention mechanisms — all from scratch.
Tech Stack
NumPy · Matplotlib · Pytest