AI/ML | Learning Notes Day 5 | Exploratory Data Analysis (EDA)
From plots to production decisions
Why EDA Exists (Real FAANG Context)
In FAANG ML work, EDA is not about curiosity.
EDA exists to answer risk questions:
Can this data be trusted?
Are there hidden leaks?
Will this model fail silently in production?
What assumptions am I about to bake into the system?
EDA is often the difference between:
A model that “looks good offline”
And a model that survives deployment
This lab trains you to do EDA like an ML engineer, not a data analyst.
Dataset Context (Important)
The dataset is:
Synthetic
Small
Intentionally flawed
Why?
Because FAANG interviews do not test scale — they test thinking.
This dataset contains:
Missing values
Heavy-tailed distributions
Leakage-like columns
Slice-specific churn patterns
These are exactly the flaws real production data has.
Section 1 — Sanity Checks (Mandatory, Always First)
If you skip this section in an interview, you fail silently.
1.1 Data Grain & Schema
Key Question:
What does one row represent?
In this dataset:
One row = one user
All features must be user-level, not event-level
📌 Why this matters
If the grain is wrong:
GroupBy results become meaningless
Models accidentally see duplicated users
Metrics inflate due to leakage
Interview Gotcha:
Many candidates start plotting before answering this question.
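A minimal sketch of a grain check, assuming a pandas DataFrame keyed by user_id. The toy rows and the check_grain helper are hypothetical, made up for illustration; they are not part of the lab's dataset or code.

```python
import pandas as pd

# Hypothetical toy frame; only the column names echo the lab's dataset.
df = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "plan": ["free", "pro", "pro", "free"],
    "sessions_last_7d": [3, 10, 4, 0],
})


def check_grain(frame: pd.DataFrame, key: str) -> dict:
    """Report whether `key` identifies exactly one row each."""
    n_rows = len(frame)
    rows_per_key = frame[key].value_counts(dropna=False)
    n_keys = len(rows_per_key)
    return {
        "n_rows": n_rows,
        "n_unique_keys": n_keys,
        "max_rows_per_key": int(rows_per_key.max()),
        "is_one_row_per_key": n_rows == n_keys,
    }


grain = check_grain(df, "user_id")
print(grain)
```

If is_one_row_per_key comes back False, the table is not at user level, and every groupby or join that assumes one row per user needs a second look before any plotting starts.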
1.2 Missingness & Duplicates
You explicitly build a missingness table instead of eyeballing.
Why?
Because:
Missingness is often non-random
Missing values can encode behavior (e.g., churned users stop generating data)
Also:
“No missing values” can still be wrong: missingness may hide behind sentinel values (-1, "unknown", empty strings)
📌 FAANG Insight:
Always check:
Duplicate rows
Duplicate primary keys (user_id)
Even 1 duplicate can break downstream joins.
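One way to build the missingness table is sketched below, assuming pandas. The toy data, the per-column SENTINELS mapping, and the missingness_table helper are all hypothetical; which sentinel values apply to which column has to be decided per dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with both real nulls and disguised ones.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 3, 5],
    "country": ["US", "unknown", "", "DE", None],
    "avg_session_min": [12.5, -1.0, np.nan, 7.0, 3.0],
})

# Assumption: sentinels are declared per column, since -1 may be a
# legitimate value in one column and a "missing" marker in another.
SENTINELS = {
    "country": ["unknown", ""],
    "avg_session_min": [-1],
}


def missingness_table(frame: pd.DataFrame, sentinels: dict) -> pd.DataFrame:
    """Count real nulls and sentinel-encoded nulls for every column."""
    rows = []
    for col in frame.columns:
        n_null = int(frame[col].isna().sum())
        n_sentinel = int(frame[col].isin(sentinels.get(col, [])).sum())
        rows.append({
            "column": col,
            "n_null": n_null,
            "n_sentinel": n_sentinel,
            "pct_missing": 100.0 * (n_null + n_sentinel) / len(frame),
        })
    table = pd.DataFrame(rows).set_index("column")
    return table.sort_values("pct_missing", ascending=False)


table = missingness_table(df, SENTINELS)
n_duplicate_rows = int(df.duplicated().sum())
n_duplicate_keys = int(df["user_id"].duplicated().sum())

print(table)
print("duplicate rows:", n_duplicate_rows)
print("duplicate user_id values:", n_duplicate_keys)
```

Note that the two duplicate counts answer different questions: here no row is an exact copy of another, yet user_id 3 still appears twice, which is the case that breaks joins.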
Section 2 — Distributions & Outliers
2.1 Numeric Summaries
You compute:
Mean
Median
Quantiles
Why not just mean?
Because the mean alone hides distribution shape, and many models are sensitive to skew and outliers.
Look for:
Heavy tails (sessions_last_7d)
Long right tails (avg_session_min)
Zero-inflation (many users with zero sessions)
📌 Interview Tip:
If mean ≫ median → expect right skew → consider a log transform (log1p when zeros are present).
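The mean-versus-median check can be sketched as follows, assuming pandas and numpy. The ten values are invented to mimic a zero-inflated, heavy-tailed sessions_last_7d column; they are not taken from the lab's data.

```python
import numpy as np
import pandas as pd

# Hypothetical values: mostly small counts, several zeros, one extreme user.
sessions = pd.Series([0, 0, 0, 1, 1, 2, 3, 5, 8, 120], name="sessions_last_7d")

summary = {
    "mean": sessions.mean(),
    "median": sessions.median(),
    "p90": sessions.quantile(0.90),
    "p99": sessions.quantile(0.99),
    "zero_share": (sessions == 0).mean(),
}

# log1p(x) = log(1 + x), so zero sessions map to 0 instead of -inf.
logged = np.log1p(sessions)

# Ratio of mean to median before and after the transform.
raw_gap = summary["mean"] / summary["median"]
log_gap = logged.mean() / logged.median()

print(summary)
print("mean/median raw:", raw_gap)
print("mean/median after log1p:", log_gap)
```

The gap between mean and median shrinks sharply after the transform, which is the practical sign that the log scale tames the tail. Whether to transform at all still depends on the model family.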
2.2 High-Signal Plots (Not Plot Spam)
You are explicitly asked for only 2 plots.
Why?
Because:
FAANG engineers value signal density
Plot spam = lack of clarity
Good plots answer one clear question:
Is this feature skewed?
Are there extreme outliers?
Is the range suspicious?
Section 3 — Target & Slice Analysis
3.1 Label Imbalance
You compute:
Churn rate
Class counts
Then answer:
Which metric should I use?
Correct thinking:
Accuracy is misleading under imbalance
Prefer PR-AUC / F1 / Recall-at-K
📌 FAANG Mental Model:
Metrics are data-dependent decisions, not defaults.
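A small sketch of why accuracy misleads under imbalance, assuming pandas. The 10% churn rate and the always-predict-no-churn baseline are made up for illustration.

```python
import pandas as pd

# Hypothetical labels: 1 = churned, 0 = retained (10 churners out of 100).
y = pd.Series([1] * 10 + [0] * 90, name="churned")

class_counts = y.value_counts()
churn_rate = y.mean()

# Trivial baseline: predict "no churn" for every user.
y_pred = pd.Series(0, index=y.index)

accuracy = (y_pred == y).mean()
true_positives = int(((y_pred == 1) & (y == 1)).sum())
recall = true_positives / int((y == 1).sum())

print("class counts:", class_counts.to_dict())
print("churn rate:", churn_rate)
print("baseline accuracy:", accuracy)
print("baseline recall:", recall)
```

The baseline scores high on accuracy while catching no churners at all, which is why the churn rate should be known before a metric is chosen.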
3.2 Slice Analysis (Where Models Fail)
You analyze churn by:
Country
Plan
Tenure buckets
Why slices matter:
Global metrics hide local failures
Models often fail on minority slices
⚠️ FAANG Gotcha — Simpson’s Paradox
A feature's relationship with churn can point one way globally and the opposite way within every slice.
This is how models ship broken behavior.
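The reversal is easiest to see with numbers. The sketch below assumes pandas and uses invented counts; used_promo is a hypothetical feature that does not appear in the lab's dataset, and plan stands in for any slice column.

```python
import pandas as pd

# Hypothetical aggregated counts, chosen so the global and per-slice
# comparisons disagree. Promo users are concentrated in the low-churn plan.
counts = pd.DataFrame({
    "plan": ["free", "free", "pro", "pro"],
    "used_promo": [0, 1, 0, 1],
    "n_users": [80, 20, 20, 80],
    "n_churned": [40, 12, 2, 16],
})


def churn_rate(frame: pd.DataFrame, by) -> pd.Series:
    """Churn rate per group, computed from summed counts."""
    agg = frame.groupby(by)[["n_users", "n_churned"]].sum()
    return agg["n_churned"] / agg["n_users"]


global_rate = churn_rate(counts, "used_promo")
slice_rate = churn_rate(counts, ["plan", "used_promo"])

print("Global churn rate by used_promo:")
print(global_rate)
print("Churn rate by plan and used_promo:")
print(slice_rate)
```

Globally, promo users churn less (28% versus 42%). Within both plans, promo users churn more (60% versus 50% on free, 20% versus 10% on pro). The global number mostly reflects which plan promo users are on, not the promo itself.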
Section 4 — Leakage & Time (Most Important Section)
4.1 Leakage-Prone Features
You are asked to identify leaky columns, not remove them.
Example:
refund_after_churn_flag
Why this leaks:
It is post-outcome information
Would never exist at prediction time
📌 Golden Rule:
If a feature knows the future, your model will too.
Interview Red Flag:
Candidates who say “we’ll let the model figure it out.”
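A rough screening heuristic for leakage candidates is sketched below, assuming pandas. The toy rows and the label_determining_values helper are hypothetical. The check only flags feature values whose rows all share one label; it is a prompt for investigation, not proof of leakage, and it cannot replace asking when each column is populated relative to prediction time.

```python
import pandas as pd

# Hypothetical toy data; column names echo the lab, values are invented.
df = pd.DataFrame({
    "churned": [1, 1, 1, 0, 0, 0, 0, 0],
    "refund_after_churn_flag": [1, 1, 0, 0, 0, 0, 0, 0],
    "sessions_last_7d": [0, 1, 2, 5, 0, 7, 3, 1],
})


def label_determining_values(frame, feature, label, min_support=2):
    """Return feature values whose rows all carry the same label.

    min_support ignores values seen on too few rows to mean anything.
    """
    stats = frame.groupby(feature)[label].agg(["mean", "count"])
    is_pure = stats["mean"].isin([0.0, 1.0])
    has_support = stats["count"] >= min_support
    return stats[is_pure & has_support]


suspicious = label_determining_values(df, "refund_after_churn_flag", "churned")
benign = label_determining_values(df, "sessions_last_7d", "churned")

print("refund_after_churn_flag:")
print(suspicious)
print("sessions_last_7d:")
print(benign)
```

Here every row with refund_after_churn_flag equal to 1 is a churner, so the column gets flagged, while sessions_last_7d does not. A flagged column still needs a human answer to one question: would this value exist at prediction time?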
Section 5 — Homework: EDA Write-Up
This is deliberate.
In real FAANG work:
EDA results are communicated, not just computed
You must justify modeling decisions
Your write-up answers:
What can go wrong?
What should we fix first?
What features are worth building?
📌 FAANG Evaluation Lens:
Clear thinking > fancy plots.
Final EDA Mental Model (Very Important)
EDA is not about:
Histograms
Pairplots
Correlations
EDA is about:
Validating assumptions
Preventing leakage
Identifying risk
Guiding modeling strategy
If you can explain:
What you checked
Why you checked it
What decision it informs
You are doing FAANG-level EDA.
What Comes Next
Next labs will build on this foundation:
Feature engineering (safe vs unsafe)
Time-aware validation
Model diagnostics
Offline vs online metric gaps
EDA is not a phase.
It is a discipline.


