AI/ML Learning Notes, Day 5: Exploratory Data Analysis (EDA)

From plots to production decisions

Jaspinder Singh

Jan 18, 2026

Why EDA Exists (Real FAANG Context)

In FAANG ML work, EDA is not about curiosity.
EDA exists to answer risk questions:

Can this data be trusted?
Are there hidden leaks?
Will this model fail silently in production?
What assumptions am I about to bake into the system?

EDA is often the difference between:

A model that “looks good offline”
And a model that survives deployment

This lab trains you to do EDA like an ML engineer, not a data analyst.

Dataset Context (Important)

The dataset is:

Synthetic
Small
Intentionally flawed

Why?

Because FAANG interviews do not test scale — they test thinking.

This dataset contains:

Missing values
Heavy-tailed distributions
Leakage-like columns
Slice-specific churn patterns

This is exactly what real production data looks like.

Section 1 — Sanity Checks (Mandatory, Always First)

If you skip this section in an interview, you fail silently.

1.1 Data Grain & Schema

Key Question:

What does one row represent?

In this dataset:

One row = one user
All features must be user-level, not event-level

📌 Why this matters

If the grain is wrong:

GroupBy results become meaningless
Models accidentally see the same user more than once
Offline metrics inflate when copies of one user land in both the train and test splits

Interview Gotcha:
Many candidates start plotting before answering this question.
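A minimal sketch of the grain check, assuming a pandas DataFrame with a user_id key. The tiny table and the check_grain helper are made up for illustration; they are not part of the lab.

```python
import pandas as pd

# Hypothetical user-level table; values and columns are illustrative only.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 3],
    "plan": ["free", "pro", "free", "free"],
    "churned": [0, 1, 0, 0],
})

def check_grain(frame: pd.DataFrame, key: str) -> dict:
    """Report whether `key` identifies exactly one row each."""
    n_rows = len(frame)
    n_keys = int(frame[key].nunique())
    return {
        "rows": n_rows,
        "unique_keys": n_keys,
        "one_row_per_key": n_rows == n_keys,
    }

grain = check_grain(df, "user_id")
print(df.dtypes)   # schema: one dtype per column
print(grain)       # user 3 appears twice, so the grain is violated
```

If one_row_per_key comes back False, stop and resolve the grain before any plot or GroupBy.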

1.2 Missingness & Duplicates

You explicitly build a missingness table instead of eyeballing the data.

Why?

Because:

Missingness is often non-random
Missing values can encode behavior (e.g., churned users stop generating data)

Also:

“No missing values” can still be wrong
Missing data often hides behind sentinel values (-1, "unknown", empty strings)

📌 FAANG Insight:
Always check:

Duplicate rows
Duplicate primary keys (user_id)

Even one duplicate key can fan out rows in downstream joins and silently inflate counts.
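One way to build that table, sketched with pandas. The sentinel lists, the sample data, and the missingness_table helper are assumptions for illustration; the real sentinels depend on how the dataset was produced.

```python
import numpy as np
import pandas as pd

# Hypothetical data with explicit NaN, disguised missing values, and a duplicate row.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 3, 4],
    "country": ["US", "unknown", "", "", "IN"],
    "sessions_last_7d": [5, -1, 2, 2, np.nan],
})

# Assumed sentinels; confirm them against the data's documentation or owner.
NUMERIC_SENTINELS = [-1]
TEXT_SENTINELS = ["unknown", ""]

def missingness_table(frame: pd.DataFrame) -> pd.DataFrame:
    """Count explicit NaNs and sentinel-encoded missing values per column."""
    disguised = {}
    for col in frame.columns:
        if pd.api.types.is_numeric_dtype(frame[col]):
            sentinels = NUMERIC_SENTINELS
        else:
            sentinels = TEXT_SENTINELS
        disguised[col] = int(frame[col].isin(sentinels).sum())
    table = pd.DataFrame({
        "explicit_nan": frame.isna().sum(),
        "disguised": pd.Series(disguised),
    })
    table["missing_frac"] = (table["explicit_nan"] + table["disguised"]) / len(frame)
    return table

table = missingness_table(df)
dup_rows = int(df.duplicated().sum())              # fully identical rows
dup_keys = int(df["user_id"].duplicated().sum())   # repeated primary keys
print(table)
print("duplicate rows:", dup_rows, "duplicate user_ids:", dup_keys)
```

Here country has no NaN at all, yet most of its values are effectively missing. That is the case a plain isna() check would miss.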

Section 2 — Distributions & Outliers

2.1 Numeric Summaries

You compute:

Mean
Median
Quantiles

Why not just mean?

Because a mean collapses the distribution to one number, and ML models are sensitive to its shape (skew, tails, zero mass).

Look for:

Heavy tails (sessions_last_7d)
Long right tails (avg_session_min)
Zero-inflation (many users with zero sessions)

📌 Interview Tip:
If mean ≫ median → expect right skew → consider a log transform (log1p if the feature contains zeros).
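A small sketch of these summaries, assuming pandas and NumPy. The sessions_last_7d values are invented to show a zero-inflated, heavy-tailed feature; they are not the lab's data.

```python
import numpy as np
import pandas as pd

# Invented values: several zeros plus one extreme user.
sessions = pd.Series([0, 0, 0, 1, 2, 3, 4, 5, 8, 120], name="sessions_last_7d")

summary = sessions.describe(percentiles=[0.5, 0.9, 0.99])
mean = sessions.mean()               # 14.3, pulled up by the single 120
median = sessions.median()           # 2.5
zero_frac = (sessions == 0).mean()   # share of users with zero sessions

# log1p(x) = log(1 + x), so zeros stay defined (log1p(0) == 0).
logged = np.log1p(sessions)

print(summary)
print("mean / median before log:", mean / median)
print("mean / median after log1p:", logged.mean() / logged.median())
print("zero fraction:", zero_frac)
```

The mean-to-median ratio drops sharply after log1p, which is the practical signal that the transform tames the tail.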

2.2 High-Signal Plots (Not Plot Spam)

You are explicitly asked for only 2 plots.

Why?

Because:

FAANG engineers value signal density
Plot spam = lack of clarity

Good plots answer one clear question:

Is this feature skewed?
Are there extreme outliers?
Is the range suspicious?
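Each of those three questions also has a numeric answer you can compute before deciding which two plots earn their place. A hedged sketch with pandas: the avg_session_min values are invented, the 1.5 × IQR fence is a common rule of thumb rather than anything the lab prescribes, and the 24-hour ceiling is an assumption about what a plausible session looks like.

```python
import pandas as pd

# Invented values with two suspicious entries at the top.
avg_session_min = pd.Series(
    [3.0, 4.5, 5.0, 6.0, 7.5, 8.0, 9.0, 12.0, 240.0, 2000.0],
    name="avg_session_min",
)

# Q1: is this feature skewed? Positive skewness means a long right tail.
skewness = avg_session_min.skew()

# Q2: are there extreme outliers? Rule-of-thumb 1.5 * IQR fences.
q1, q3 = avg_session_min.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = avg_session_min[
    (avg_session_min < q1 - 1.5 * iqr) | (avg_session_min > q3 + 1.5 * iqr)
]

# Q3: is the range suspicious? Assumed bounds: non-negative, under 24 hours.
MAX_PLAUSIBLE_MIN = 24 * 60
out_of_range = avg_session_min[
    (avg_session_min < 0) | (avg_session_min > MAX_PLAUSIBLE_MIN)
]

print("skewness:", skewness)
print("outliers:", outliers.tolist())
print("out of range:", out_of_range.tolist())
```

A plot is worth keeping only if it shows one of these answers more clearly than the number does.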

Section 3 — Target & Slice Analysis

3.1 Label Imbalance

You compute:

Churn rate
Class counts

Then answer:

Which metric should I use?

Correct thinking:

Accuracy is misleading under imbalance (always predicting “no churn” scores high while catching zero churners)
Prefer PR-AUC / F1 / Recall-at-K

📌 FAANG Mental Model:
Metrics are data-dependent decisions, not defaults.
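To see why accuracy misleads, a sketch with pandas on an invented label column with 10% churn. The baseline "model" here is just a constant prediction, used only to make the point.

```python
import pandas as pd

# Invented labels: 18 retained users, 2 churned.
churned = pd.Series([0] * 18 + [1] * 2, name="churned")

class_counts = churned.value_counts()
churn_rate = churned.mean()

# Trivial baseline: predict "no churn" for everyone.
baseline_pred = pd.Series(0, index=churned.index)

accuracy = (baseline_pred == churned).mean()
true_positives = int(((baseline_pred == 1) & (churned == 1)).sum())
recall = true_positives / int((churned == 1).sum())

print(class_counts)
print("churn rate:", churn_rate)
print("baseline accuracy:", accuracy)   # high, despite learning nothing
print("baseline recall:", recall)       # zero churners caught
```

The baseline's accuracy equals one minus the churn rate, so the rarer churn is, the better this useless model looks.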

3.2 Slice Analysis (Where Models Fail)

You analyze churn by:

Country
Plan
Tenure buckets

Why slices matter:

Global metrics hide local failures
Models often fail on minority slices

⚠️ FAANG Gotcha — Simpson’s Paradox

A feature may look helpful in the aggregate
yet be harmful within every slice, because the slices are mixed in different proportions.

This is how models ship broken behavior.
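A constructed example makes the reversal concrete. This sketch uses pandas; the promo column and all counts are invented so that the effect appears, and they say nothing about the lab's actual data.

```python
import pandas as pd

# Invented cell counts: (plan, promo, n_users, n_churned).
cells = [
    ("pro", 1, 80, 16),
    ("pro", 0, 20, 2),
    ("free", 1, 20, 12),
    ("free", 0, 80, 40),
]

# Expand the counts into one row per user.
rows = []
for plan, promo, n_users, n_churned in cells:
    rows += [{"plan": plan, "promo": promo, "churned": 1}] * n_churned
    rows += [{"plan": plan, "promo": promo, "churned": 0}] * (n_users - n_churned)
df = pd.DataFrame(rows)

# Global view: promo users churn less.
global_rate = df.groupby("promo")["churned"].mean()

# Sliced view: within each plan, promo users churn more.
# Keep the count next to the rate so tiny slices are visible.
slice_rate = df.groupby(["plan", "promo"])["churned"].agg(["mean", "count"])

print(global_rate)
print(slice_rate)
```

Globally, promo users churn at 28% versus 42%. Within each plan, they churn 10 points more. The reversal comes from promo users being concentrated in the low-churn pro plan.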

Section 4 — Leakage & Time (Most Important Section)

4.1 Leakage-Prone Features

You are asked to identify leaky columns, not remove them.

Example:

refund_after_churn_flag

Why this leaks:

It is post-outcome information
Would never exist at prediction time

📌 Golden Rule:
If a feature knows the future, your model will too.

Interview Red Flag:
Candidates who say “we’ll let the model figure it out.”
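One cheap screen for this kind of column, sketched with pandas. The data, the is_pro_plan column, the helper name, and the 0.95 threshold are all assumptions. It is a heuristic for surfacing suspects, not a proof of leakage: the real test is whether the column is populated before prediction time.

```python
import pandas as pd

# Invented data: the refund flag is only ever set for churned users.
df = pd.DataFrame({
    "churned":                 [0, 0, 0, 0, 1, 1, 1, 0, 1, 0],
    "refund_after_churn_flag": [0, 0, 0, 0, 1, 0, 1, 0, 1, 0],
    "is_pro_plan":             [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

def near_deterministic_flags(frame: pd.DataFrame, target: str,
                             threshold: float = 0.95) -> dict:
    """Flag binary columns where value == 1 almost always implies target == 1."""
    suspects = {}
    for col in frame.columns.drop(target):
        target_when_set = frame.loc[frame[col] == 1, target]
        if len(target_when_set) > 0 and target_when_set.mean() >= threshold:
            suspects[col] = float(target_when_set.mean())
    return suspects

suspects = near_deterministic_flags(df, "churned")
print(suspects)
```

A column that predicts the label this cleanly deserves a question about when it gets written, before it gets anywhere near a model.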

Section 5 — Homework: EDA Write-Up

Ending with a write-up is deliberate.

In real FAANG work:

EDA results are communicated, not just computed
You must justify modeling decisions

Your write-up answers:

What can go wrong?
What should we fix first?
What features are worth building?

📌 FAANG Evaluation Lens:
Clear thinking > fancy plots.

Final EDA Mental Model (Very Important)

EDA is not about:

Histograms
Pairplots
Correlations

EDA is about:

Validating assumptions
Preventing leakage
Identifying risk
Guiding modeling strategy

If you can explain:

What you checked
Why you checked it
What decision it informs

You are doing FAANG-level EDA.

What Comes Next

Next labs will build on this foundation:

Feature engineering (safe vs unsafe)
Time-aware validation
Model diagnostics
Offline vs online metric gaps

EDA is not a phase.
It is a discipline.

Jaspinder's Substack
