MODULE 03 · FULL SESSION · ALGORITHMS & NEURAL NETS

Core Machine Learning

Expanded student reference: supervised pipeline (data → train → evaluate → inference), algorithm trade-offs, ensembles teaser, unsupervised depth (k-means, association metrics, PCA peek), neural architectures (CNN / RNN / Transformer), pitfalls, activities — links to Google ML intro + sklearn map + 3B1B.

Supervised & unsupervised
Neural net picture
10 Quiz Questions
8 Real-World Examples

What is machine learning?

One sentence students keep: learning patterns from data — not magic, not “the computer knows everything,” but statistics + compute at scale.

🧠

Self-check

Q: “What is machine learning?” A: Learning from examples (data) so the system can generalise to new cases — same spirit as studying past exam papers, but automated.

Foundation
↔️

Same story everywhere

Data → Model → Prediction. You collect inputs (and often correct answers for training). The model stores learned patterns. Prediction scores new inputs at runtime.

Pipeline
📌

Why it matters

Hand-written rules hit a wall (spam, speech, images). ML turns “show me thousands of examples” into behaviour — under the right assumptions and enough quality data.

Motivation

Figure — Every ML product you use still fits this loop; only the data type and model family change.

Supervised learning in five moves (reference framing)

Introductory courses (including Google’s Intro to Machine Learning for developers) describe supervised ML as the same story repeated at every company: data → model → training → evaluation → inference. You do not need the equations yet — you need the vocabulary.

Stage | What happens (plain language)
Data | Rows of features (inputs the model reads) and often a label (the outcome to learn). Tables, pixels, audio: all become numbers eventually.
Model | A big bundle of parameters (weights) plus an architecture choice: the machine’s current “guess” at the rule from features to label.
Training | Show many labeled examples; compare prediction vs truth; use loss (how wrong we are) to nudge parameters so the next guess is better on average.
Evaluating | Test on held-out data the model did not tune on, for an honest estimate of real-world behaviour.
Inference | Deploy: feed new features only, get predictions fast (recommendation, risk score, category).
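The five stages in the table can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not a real training recipe: the study-hours data, the one-parameter threshold "model", and the function names `predict`, `error_count`, `train`, and `evaluate` are all invented for this example.

```python
# Toy supervised pipeline: data -> model -> training -> evaluating -> inference.
# Data, model, and function names are illustrative, not a real library API.

# DATA: (hours_studied, passed) pairs; feature = hours, label = 0 or 1.
data = [(1, 0), (2, 0), (3, 0), (4, 0), (5, 1), (6, 1), (7, 1), (8, 1),
        (2.5, 0), (6.5, 1)]

# Hold out the last two rows so evaluation uses rows the model never tuned on.
train_rows, test_rows = data[:8], data[8:]


def predict(threshold, hours):
    """MODEL: one parameter; predict 'pass' when hours >= threshold."""
    return 1 if hours >= threshold else 0


def error_count(threshold, rows):
    """LOSS: number of wrong predictions on the given rows."""
    return sum(predict(threshold, x) != y for x, y in rows)


def train(rows):
    """TRAINING: try each training feature value as a threshold, keep the best."""
    candidates = [x for x, _ in rows]
    return min(candidates, key=lambda t: error_count(t, rows))


def evaluate(threshold, rows):
    """EVALUATING: accuracy on held-out rows."""
    return 1 - error_count(threshold, rows) / len(rows)


threshold = train(train_rows)
test_accuracy = evaluate(threshold, test_rows)

# INFERENCE: a new feature value arrives with no label attached.
new_prediction = predict(threshold, 5.5)
print(threshold, test_accuracy, new_prediction)
```

Real projects swap the threshold rule for a library model, but the order of the stages stays the same.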
📏

Loss, without calculus

Think “penalty points” when the model is off. Training = adjust knobs to reduce average penalty on training data, without cheating on the final test set.

Core idea
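One common way to count "penalty points" for numeric predictions is mean squared error. The sketch below is a minimal example with made-up rent figures; the card does not say which loss a given model uses, so treat mean squared error as one assumed choice among several.

```python
# Mean squared error as "penalty points": bigger misses cost much more.
def mean_squared_error(predictions, targets):
    """Average of squared differences between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


true_rent = [100, 150, 200]      # hypothetical labels
rough_model = [90, 170, 230]     # predictions before tuning
tuned_model = [98, 153, 201]     # predictions after some training

loss_before = mean_squared_error(rough_model, true_rent)
loss_after = mean_squared_error(tuned_model, true_rent)
print(loss_before, loss_after)
```

Training is the search for parameters that move the first number toward the second.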
🌍

Data quality beats buzzwords

Larger, more diverse datasets usually generalise better — if labels are trustworthy. Garbage labels teach garbage rules; rare events need extra care (fraud, rare disease).

Reality

Training ≠ inference

Training can be heavy (GPU hours). Inference is what runs in the app on every request — must be fast, monitored, and sometimes simplified for phones.

Deployment
Rule-based vs learned: “If temperature > 40°C then alert” is classic code. ML shines when the pattern is fuzzy: spam, accents, hair styles in photos — too many edge cases to hand-write.

Where to go deeper (free, reputable)

Google’s Supervised Learning page walks through features, labels, and training in the same order as above. For a longer self-paced path, see the Machine Learning Crash Course (videos + exercises). For “which sklearn class should I try first?”, keep the official choosing the right estimator map open in another tab.

Self-check (2 min): write one supervised and one unsupervised problem you personally care about (campus app, health, sport analytics). Compare with a peer — did you both agree on whether labels exist?


Learning when “correct answers” exist

Supervised learning = training with question–answer pairs (labels). The system learns a pattern that maps inputs to targets — like a teacher showing worked examples before the exam.

💡 Remember: Regression → predict a number. Classification → pick a category. Many famous algorithms do one or the other (or probabilities over categories).
Two supervised “shapes”
Type | Label looks like | Beginner examples
Regression | A quantity (continuous) | rent, delivery ETA, power load tomorrow, exam score
Classification | A class (finite choices) | spam / not spam, fraud / OK, plant species from a photo

Some models output probabilities (e.g. “87% chance of rain”) — you still pick a decision threshold when you need a yes/no for the product.
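A short sketch of how a threshold turns probabilities into yes/no decisions. The probabilities and the helper name `decide` are hypothetical; the point is only that the same model outputs give different decisions at different thresholds.

```python
# Turning predicted probabilities into yes/no decisions with a threshold.
def decide(probabilities, threshold=0.5):
    """Return True wherever the probability meets or exceeds the threshold."""
    return [p >= threshold for p in probabilities]


rain_probabilities = [0.87, 0.40, 0.55, 0.10]   # hypothetical model outputs

default_alerts = decide(rain_probabilities)         # threshold 0.5
cautious_alerts = decide(rain_probabilities, 0.3)   # warn more often
strict_alerts = decide(rain_probabilities, 0.8)     # warn only when confident
print(default_alerts, cautious_alerts, strict_alerts)
```

Lowering the threshold catches more true cases at the price of more false alarms, which is a product decision as much as a modelling one.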

Labels drive learning

Each training row looks like: input features (size of house, exam scores…) + known outcome (price, pass/fail). The algorithm searches for a rule that minimises mistakes on those examples — then you apply it to new rows.

A labeled example is one row where both sides exist: features and the answer you want the model to imitate. During training the model proposes outputs; wherever it is wrong, a loss function measures “how wrong,” and optimisation nudges internal numbers (weights) so the next epoch does better on average.

That is the same story Google’s crash course tells with weather: predict rainfall from temperature, pressure, humidity — compare to measured rain; repeat across thousands of days until errors shrink.

Contrast: unsupervised (preview)

Unsupervised methods get inputs without per-row answers. They find clusters, shopping rules, or low-dimensional structure. Covered in Part 3 — keep the contrast in mind: “answers given” vs “patterns discovered.”

Semi-supervised and reinforcement learning sit in between (not the focus here): a little labeling, or learning from rewards — revisit in advanced courses.

Three beginner-friendly algorithms
📏
Linear regression
Predict a number · draw a “best line” through points

Idea (no formulas): pretend the relationship is “smooth enough” that a straight line (or hyperplane in many dimensions) captures the trend.

Example: bigger flats in the same area → higher rent. For a new flat, find its size on the horizontal axis, read the height of the line at that point, and use that as the rent estimate.

“Regression = predicting numbers.”
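The "best line" idea can be sketched with ordinary least squares for a single feature. The flat sizes and rents below are invented and deliberately lie on a perfect line so the result is easy to check by hand; `fit_line` is a name chosen for this example.

```python
# Ordinary least squares for one feature: rent is modelled as slope * sqft + intercept.
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


sqft = [500, 700, 900, 1100]             # hypothetical flats
rent = [12000, 16000, 20000, 24000]      # hypothetical monthly rents

slope, intercept = fit_line(sqft, rent)
estimate = slope * 950 + intercept       # height of the line at 950 sqft
print(slope, intercept, estimate)
```

With real, noisy data the line will not pass through every point; it is the line with the smallest total squared miss.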

🌳
Decision tree
Yes/no questions · sequence of splits

Idea: repeatedly ask simple questions that split the data (“Is humidity > 70%?”). Leaves give predictions.

Story: cricket match — raining? → maybe cancel; pitch dry? → play. Same structure for loan approval, churn, diagnostics.

“Decision tree = step-by-step thinking.”
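A decision tree can be pictured as nested yes/no questions ending in leaves. The tree below is written by hand for the cricket story, so it only shows how prediction walks down the splits; a learning algorithm would pick the questions from data. The "wait" leaf and the field names `raining` and `pitch_dry` are assumptions added for illustration.

```python
# A hand-written tree: each internal node asks a question, each leaf is a prediction.
tree = {
    "question": lambda match: match["raining"],
    "yes": "cancel",
    "no": {
        "question": lambda match: match["pitch_dry"],
        "yes": "play",
        "no": "wait",          # hypothetical leaf, not from the original story
    },
}


def predict(node, example):
    """Walk from the root to a leaf by answering each question in turn."""
    while isinstance(node, dict):
        branch = "yes" if node["question"](example) else "no"
        node = node[branch]
    return node


print(predict(tree, {"raining": False, "pitch_dry": True}))
```

The same structure carries over to loan approval or churn: only the questions and leaves change.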

👥
k-Nearest Neighbors (KNN)
Look around you · vote with neighbours

Idea: plot training points in feature space. For a new point, look at the k closest known points and take a majority vote (classification) or average (regression).

Intuition: “birds of a feather” — similar feature vectors often share the same label.

“KNN = similar things stay together.”
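The neighbour vote can be sketched in a few lines. This is a minimal version that sorts every training point by distance, which is fine for tiny data; the points, labels, and the function name `knn_predict` are made up for this example.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """training: list of (point, label). Majority vote among the k nearest points."""
    by_distance = sorted(training, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


training = [((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.5), "blue"),
            ((6.0, 6.0), "red"), ((6.5, 7.0), "red"), ((7.0, 6.5), "red")]

near_blue = knn_predict(training, (2.0, 2.0))
near_red = knn_predict(training, (6.0, 6.5))
print(near_blue, near_red)
```

There is no training step beyond storing the data, which is why prediction gets slow as the stored set grows.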

How practitioners choose among the three (first pass)

Trade-offs (simplified — real projects try baselines + cross-validation)
Algorithm | Shines when… | Watch out for…
Linear regression | Trend is roughly linear; you want fast, interpretable weights; many numeric features. | Curved relationships; strong outliers pulling the line; features on very different scales, which make weights hard to compare and matter for regularised or gradient-based fitting (fix with scaling).
Decision tree | Mixed numeric + categorical columns; nonlinear interactions; need a white-box rule list. | Can overfit if grown too deep; practitioners limit depth, prune, or use Random Forest (many trees vote).
KNN | Small/medium data; local “who looks like me?” logic; minimal training code (stores data). | Slow on huge sets; sensitive to useless features; distance gets weird in very high dimensions (“curse of dimensionality”).

Industry rule of thumb: try a simple linear model and a shallow tree before a huge neural net — you get a benchmark and sometimes enough accuracy.

🌲 Ensembles (teaser): Random Forest trains many randomised trees and averages their votes — often stronger than one tree, still more interpretable than a deep net. Gradient boosting (XGBoost, LightGBM) builds trees sequentially to fix previous errors — dominates many tabular competitions.

Left: numeric prediction along a trend. Right: a tiny tree for a binary choice — real trees go deeper.

Figure — New point (orange): majority of the three closest training dots decides the label.

KNN practical notes: choose odd k for binary votes to reduce ties. Scale features so “distance” is not dominated by one huge column (e.g. income in thousands vs age 18–25). In production, nearest-neighbour search uses indexing structures (KD-trees, approximate methods) when data is large.
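The scaling advice can be checked on a tiny example. The four people below are hypothetical, and min-max scaling is only one of several common scaling choices; the sketch shows that the nearest neighbour of the first person changes once income stops dominating the distance.

```python
import math


def min_max_scale(rows):
    """Rescale each column to the 0-1 range (assumes no column is constant)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
            for row in rows]


def nearest_index(rows, query_index):
    """Index of the row closest to rows[query_index], excluding itself."""
    others = [i for i in range(len(rows)) if i != query_index]
    return min(others, key=lambda i: math.dist(rows[i], rows[query_index]))


# Columns: (income in currency units, age in years); hypothetical people.
people = [(30000, 18), (30500, 25), (34000, 18), (60000, 25)]

raw_neighbour = nearest_index(people, 0)                    # income dominates
scaled_neighbour = nearest_index(min_max_scale(people), 0)  # both columns count
print(raw_neighbour, scaled_neighbour)
```

On raw numbers the 7-year age gap is invisible next to a 500-unit income gap; after scaling, the person of the same age becomes the closest match.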

📧

Classification story · spam

Features might be word frequencies, sender reputation, time of day. Label = human-marked spam or not. Model outputs probability → threshold decides inbox vs junk folder.

Supervised
🏥

Risk scores

Hospitals and insurers use supervised models for triage or readmission risk — high stakes: calibration, fairness across groups, and human oversight matter as much as accuracy.

Ethics
🛒

Churn

Telco / streaming: will this subscriber leave? Trees and gradient boosting are common on tabular CRM data; features encode billing, usage drops, support tickets.

Business ML

Vocabulary (with tiny examples):

Term | Meaning | Example
Feature | One input signal the model reads. | Square feet, word “free” count in email.
Label | Target outcome in training. | Monthly rent; “spam” vs “ham”.
Training | Fitting parameters on labeled rows. | Overnight batch on a server cluster.
Validation | Tune choices without touching final test. | Pick tree depth using a held-out slice.
Inference | Prediction on new live data. | Scoring a loan application in 50 ms.
Optional · simulate a toy “regression feeling”

Pretend Python output for a toy regression, still the same Data → Learn → Predict story:

features = [sqft, rooms]
labels = [rent]
model.fit(features, labels)   # …learning…
predict([950, 2]) → 24_500 (local currency)
Optional · classification printout (same pipeline, different label type)
X = vectorize(email_text)
y = [spam, ham, spam, …]
clf.fit(X, y)
predict(new_mail) → "spam" (P=0.94)
Metrics teaser: accuracy alone misleads on imbalanced data (99% “OK” transactions). Practitioners pair precision, recall, F1, ROC-AUC — Module 2 introduced this; deployment ethics revisits it in Module 7.
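The imbalance trap is easy to reproduce. The sketch below scores a "model" that never flags fraud on 100 hypothetical transactions; the helper `precision_recall_f1` is written out by hand here so the arithmetic is visible, and it returns 0.0 when a ratio is undefined, which is a convention chosen for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 100 hypothetical transactions: 99 OK (0) and 1 fraud (1).
y_true = [0] * 99 + [1]
always_ok = [0] * 100          # a "model" that never flags fraud

accuracy = sum(t == p for t, p in zip(y_true, always_ok)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_ok)
print(accuracy, precision, recall, f1)
```

Accuracy looks excellent while recall shows that the one fraud case was missed, which is the case the product exists to catch.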

Finding structure without per-row answers

You still feed data, but there is no single “correct label” column to copy. The algorithm optimises an internal goal — e.g. tight groups or frequent item sets.

🎯

k-means clustering

Idea: fix the number of groups k. The method alternates between assigning each point to its nearest centre and moving centres to the mean of their points — “groups of similar customers,” “performance bands,” etc.

“Clustering = finding hidden groups.”

Unsupervised
🛒

Association (market basket)

Idea: mine rules like “if bread then butter” from receipts. Retailers use this for shelf layout and bundles — same logic powers “customers who bought X also bought Y.”

“If A happens → B is likely.”

Patterns

k-means — a little more depth (still intuition-first)

Steps in words: (1) Place k centre points in feature space (often random init). (2) Assign every data point to the nearest centre. (3) Move each centre to the average of its members. (4) Repeat until assignments stop changing much.
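The four steps can be sketched directly. This is a bare-bones version for 2-D points with the starting centres fixed by hand (instead of random) so the run is repeatable; it stops when the centres stop moving, which here means the assignments have stopped changing too. The data and the function name `k_means` are illustrative.

```python
import math


def k_means(points, centres, max_iterations=100):
    """Plain k-means on 2-D points, starting from the given centres (step 1)."""
    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)),
                          key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Step 3: move each centre to the mean of its members
        # (an empty cluster keeps its old centre).
        new_centres = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centre
            for members, centre in zip(clusters, centres)
        ]
        # Step 4: stop once the centres no longer move.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centres, clusters = k_means(points, centres=[(0, 0), (10, 10)])
print(centres)
print(clusters)
```

Different starting centres can lead to different final groups, which is why libraries usually run several random starts and keep the best.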

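Choosing k: domain knowledge first (“we want exactly three pricing tiers”). Data teams also eyeball an elbow plot of within-cluster error (total squared distance from points to their centres) against k, or use internal cluster-quality scores such as the silhouette score. There is no single magic number without context.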

Limits: assumes roughly globular groups; struggles with elongated shapes unless you preprocess or pick another algorithm (DBSCAN, spectral, hierarchical). For a taste of hierarchy: agglomerative clustering merges the two closest mini-clusters repeatedly — produces a tree (dendrogram) of similarity.

Association rules — three handy words

Support, confidence, lift (plain English)
Term | Think of it as…
Support | How often itemset {A,B} appears in baskets: raw popularity.
Confidence | Among baskets with A, how often B also appears: conditional frequency.
Lift | Confidence divided by baseline rate of B; >1 means positive association beyond chance.
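The three metrics are simple ratios, shown here on five hypothetical receipts. The function names mirror the table; the baskets are invented.

```python
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"milk"},
    {"butter", "milk"},
]   # hypothetical receipts


def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(a, b):
    """Among baskets with a, the fraction that also contain b."""
    return support(a | b) / support(a)


def lift(a, b):
    """Confidence divided by the baseline rate of b."""
    return confidence(a, b) / support(b)


bread, butter = {"bread"}, {"butter"}
print(support(bread | butter), confidence(bread, butter), lift(bread, butter))
```

Here bread and butter share 2 of 5 baskets (support 0.4), butter appears in 2 of the 3 bread baskets (confidence about 0.67), and lift is about 1.11, slightly above chance.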

Classic algorithms such as Apriori and FP-Growth prune the explosion of possible rules — you do not enumerate every combination naïvely.

📊

Dimensionality reduction (peek)

PCA finds new axes that keep maximum variance in fewer numbers — used for visualising clusters in 2D and as a preprocessing step. Different goal than k-means (compression vs grouping).

Extra
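For a feel of what "new axes that keep maximum variance" means, here is a 2-D sketch that compresses each point to a single number. It relies on the closed-form direction of the main axis for a 2x2 covariance matrix, which only works in two dimensions; real PCA code uses a general eigendecomposition or SVD. The points and function names are made up.

```python
import math


def first_principal_axis(points):
    """Unit vector along the direction of maximum variance, plus the mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Closed form for the main axis of a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle)), (mean_x, mean_y)


def project(points):
    """Compress each 2-D point to one number: its coordinate along the axis."""
    (ax, ay), (mx, my) = first_principal_axis(points)
    return [(x - mx) * ax + (y - my) * ay for x, y in points]


points = [(1, 1), (2, 2), (3, 3), (4, 4)]   # hypothetical data on a diagonal
axis, mean = first_principal_axis(points)
scores = project(points)
print(axis, scores)
```

Because these points lie exactly on a diagonal, one number per point keeps all of the spread; with real data some variance is lost in the compression.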
🔒

Privacy note

Cluster IDs can still re-identify people if combined with other tables. Basket data reveals lifestyle — treat it as sensitive in product design.

Ethics

Figure — Points group visually; k-means finds centres that explain those groups.

Basket → insight: “Bread + butter + jam” might co-occur in the same trip. That is not causation, but it is useful for stocking and recommendations.
Where unsupervised shows up
🏬

Retail · shelf & bundles

High support + high lift pairs become combo deals; low-lift noise is discarded. Seasonality matters — holiday baskets differ from summer baskets.

Shopping
🎓

Campus · enrolment patterns

Cluster students by elective co-enrolment to spot hidden programmes. No grades needed — naming clusters (“track A”) still needs human interpretation.

Education
🏭

Industry · maintenance regimes

Vibration sensors on motors: cluster normal vs wear regimes. Unsupervised flags strange segments before catastrophic failure.

IoT

Layers inspired by brains (but not a literal copy)

Biological neurons motivated early models; modern deep nets are engineering artefacts trained with gradient-based optimisation — still, the layered picture helps beginners.

Question: “How does a brain compute?” — roughly, many connected cells, signals, adaptation. Artificial neural networks stack layers of simple units: weighted sums + nonlinearities. Depth lets the network build features automatically (especially for images, sound, text).

Historically: a perceptron is one neuron-like threshold unit. Chain many → layer. Chain layers → network. Modern networks contain millions or billions of parameters — impossible to hand-tune; they are learned by gradient-based optimisation (variants of stochastic gradient descent).

Activation functions (ReLU, sigmoid, …) insert a nonlinearity after each layer so the stack is not equivalent to a single linear map. Without them, depth buys nothing: a composition of linear maps is still linear.
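The claim about linear maps can be checked with two one-number "layers". The weights are arbitrary values picked for this sketch.

```python
def layer1(x):
    return 2.0 * x + 1.0


def layer2(x):
    return -3.0 * x + 4.0


def relu(x):
    return max(0.0, x)


def stacked_linear(x):
    """Two linear layers with nothing in between."""
    return layer2(layer1(x))


def single_linear(x):
    """-3 * (2x + 1) + 4 simplifies to -6x + 1: one layer does the same job."""
    return -6.0 * x + 1.0


def stacked_with_relu(x):
    """Same two layers with a ReLU in between."""
    return layer2(relu(layer1(x)))


inputs = [-2.0, -1.0, 0.0, 1.0]
linear_outputs = [stacked_linear(x) for x in inputs]
relu_outputs = [stacked_with_relu(x) for x in inputs]
print(linear_outputs)
print(relu_outputs)
```

The ReLU version stays flat for the two negative inputs and then drops, a bend that no single straight line can reproduce.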

Backpropagation (one sentence): the loss at the output is propagated backward through the graph so each weight knows how much it contributed to the error — calculus automates credit assignment at scale.
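Credit assignment can be traced by hand on the smallest possible chain: two weights in a row, no activation, squared-error loss. All values are arbitrary; real networks do the same bookkeeping automatically across millions of weights.

```python
def forward(w1, w2, x):
    hidden = w1 * x           # layer 1
    output = w2 * hidden      # layer 2
    return hidden, output


def loss(w1, w2, x, target):
    return (forward(w1, w2, x)[1] - target) ** 2


def backward(w1, w2, x, target):
    """Chain rule, applied from the output back toward the input."""
    hidden, output = forward(w1, w2, x)
    d_output = 2 * (output - target)   # how the loss changes with the output
    d_w2 = d_output * hidden           # credit assigned to w2
    d_hidden = d_output * w2           # error passed back to layer 1
    d_w1 = d_hidden * x                # credit assigned to w1
    return d_w1, d_w2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
grad_w1, grad_w2 = backward(w1, w2, x, target)
print(grad_w1, grad_w2)
```

A gradient descent step then subtracts a small multiple of each gradient from its weight, which lowers the loss for this example.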

Human brain (intuition) | Neural network (engineering)
Neurons | Nodes / units
Electrical signals | Numbers flowing through the graph
Plasticity / learning | Updating weights from data

Figure — Information flows left to right; “deep” simply means multiple learned layers.

🔥 Deep learning = neural nets with enough depth that hierarchical features matter — speech assistants, medical imaging, autonomous driving stacks, etc. You still need data, compute, and careful evaluation.

Specialised architectures (names to recognise)

🖼️

CNNs · vision

Convolutional networks reuse small filters across the image — translation-friendly, fewer parameters than a fully dense layer on every pixel. Standard for photos, video frames, medical scans.

Spatial data
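Filter reuse is easiest to see on a tiny image. The sketch below slides one 2x2 filter over a 4x4 grid with no padding and a stride of 1; the image and filter values are invented, and in a real CNN the filter weights are learned rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.
    Like most deep learning libraries, this does not flip the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[r + i][c + j] * kernel[i][j]
             for i in range(kh) for j in range(kw))
         for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical 4x4 image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Vertical-edge filter: responds where brightness rises from left to right.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map)
```

The same four weights are applied at all nine positions, and the output lights up only in the middle column, where the edge is.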
📝

RNNs / LSTM · sequences

Older workhorse for time series and text before Transformers — hidden state carries memory along a sequence. Still appears in edge devices with tight memory.

Sequences
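The "hidden state carries memory" idea fits in one line of arithmetic. This is a minimal recurrent cell with a single number as state and fixed, hypothetical weights; an LSTM adds gates on top of the same loop.

```python
import math


def run_rnn(inputs, w_input=1.0, w_hidden=0.5):
    """new_state = tanh(w_input * x + w_hidden * old_state), one step per input.
    The weights are fixed illustrative values; a real RNN learns them."""
    state = 0.0
    states = []
    for x in inputs:
        state = math.tanh(w_input * x + w_hidden * state)
        states.append(state)
    return states


with_early_signal = run_rnn([1.0, 0.0, 0.0])
without_signal = run_rnn([0.0, 0.0, 0.0])
print(with_early_signal)
print(without_signal)
```

The last two inputs are identical in both runs, yet the final states differ because the first run still remembers, faintly, the signal it saw at the start.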

Transformers · attention

Power most large language models (GPT-style) and vision transformers. Self-attention lets every token look at every other token, so the cost of attention grows with the square of the sequence length; in exchange, the architecture scales well with data and compute. See Google Crash Course “Intro to Large Language Models” for a gentle tour.

LLM era
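"Every token looks at every other token" can be sketched as scaled dot-product self-attention. To keep it short, this version uses the token vectors themselves as queries, keys, and values; real Transformers first pass them through learned projections and use several attention heads. The token vectors are invented.

```python
import math


def softmax(values):
    highest = max(values)                  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)          # attention paid to each token
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three hypothetical token vectors
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and each output is a weighted blend of all tokens, with more weight on the tokens most similar to the one doing the looking.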

Why deep nets need scale: more parameters can fit richer functions — but also overfit small tables. Practitioners use huge public datasets (ImageNet, Common Crawl text), augmentation, dropout, early stopping, and pretrained checkpoints (“fine-tune” on your niche).

Explainability gap: a linear model shows coefficients; a deep net may need saliency maps or probe datasets — governance teams care when decisions affect people.

Further neural explainer (visual): Grant Sanderson’s 3Blue1Brown · Neural networks series builds intuition for weights, layers, and gradient descent; it needs no prerequisites beyond some curiosity about basic calculus.

Activities + real-world thinking

Algorithms are different “tools” for different jobs — picking wisely beats chasing buzzwords.

Cheat sheet · problem → typical tool (first guess)
Problem shape | Good first thought
Predict a quantity (marks, load, delay) | Regression / gradient boosting / …
Transparent yes/no or small rules | Decision trees (or linear models)
Small data + “similar past cases matter” | KNN (watch scaling & noise)
Segment customers / genes / usage with no labels | k-means or other clustering
Basket / click co-occurrence | Association rules
Images, speech, unstructured text at scale | Deep neural networks (often pretrained)
Ranking items for a user | Collaborative filtering + embeddings; hybrid with content features
Detecting “weird” machine behaviour | Unsupervised anomaly detection on sensor streams

The official scikit-learn estimator map is a flowchart: start from “>50 samples?” and follow branches — it is not law, but excellent for homework and interviews.

⚠️ Common beginner traps: (1) Testing on the same rows you trained on — inflated scores. (2) Leaking future information into features (e.g. using post-outcome columns). (3) Optimising accuracy on a 99% negative class. (4) Assuming correlation implies safe automation without human review.

Activity 1 — identify the algorithm (self-check): guess mentally, then reveal.

1. Predict semester marks from attendance + past scores →

Regression (numeric prediction) or related supervised models — evaluation on held-out terms matters.

2. Group students into study bands with no grades given →

Clustering (unsupervised) — you must interpret and name clusters responsibly.

3. Face recognition at building access →

Deep convolutional networks (vision) — plus policy for consent, security, and bias testing.

4. “Customers who bought phone cases also bought screen guards” →

Association rule mining / collaborative-style recommendations — measure support, confidence, lift; not supervised classification.

5. Divide cities into climate zones with no predefined names →

Clustering (k-means or alternatives) — you label clusters after inspecting weather stats.

6. Translate spoken English to Hindi in a phone app →

Sequence-to-sequence deep models (Transformers) trained on massive parallel corpora — far beyond k-means or small trees.

Try with a partner (5 min): list one supervised and one unsupervised idea for a campus app — e.g. timetable stress forecasting vs club discovery from survey text.

Activity 2 — design AI for a college (brainstorm): attendance risk flags, fair placement into project groups, course demand forecasting, library seat hints… For each, note data you would need and harms if wrong.
🚀

Closing lines to remember

“Algorithms are different brains for different problems.” “Choosing the right tool — and validating it — is what makes AI smart.”

Mindset

Quick Knowledge Check

10 beginner-friendly questions on supervised vs unsupervised, regression, trees, KNN, k-means, baskets, and neural / deep learning — aligned to this module’s story-first content.


Key Takeaways

Module 3 distilled: full supervised story (features, labels, loss, validation), regression vs classification, linear/tree/KNN trade-offs, clustering & baskets + metrics, neural depth and specialised nets, common pitfalls, extended activities.

📚 Further reading:
• Google: Intro to ML — developers.google.com/…/intro-to-ml · Supervised learning — …/supervised
• Google ML Crash Course (full track) — developers.google.com/…/crash-course
• scikit-learn: choosing an estimator — scikit-learn.org
• 3Blue1Brown · Neural networks — 3blue1brown.com