MODULE 06 · 2+ HOURS · SEEING WITH MATH

Computer Vision Fundamentals

How computers understand photos and video — pixels, faces, text in pictures, and real uses you already know.

Pictures as numbers
Faces & objects
10 Quiz Questions
8 Real-World Examples
👁️ Vision intro 🧠 Deep CV 🖼️ Processing 🌍 Applications

How computers “see” images

To a computer, a photo is not a memory; it is a grid of numbers. Vision AI learns patterns in those numbers, much as you learned that a banana is yellow and curved.

💡 From Module 3: Machine learning still means “learn from examples.” Here, the examples are photos with labels (cat, scratch, ripe, empty shelf).

What is a pixel? Zoom in on any digital photo and you see tiny coloured squares. Each square is a pixel. The computer stores a number (or a few numbers) for each pixel — how bright it is, and what colour.

Colour in simple terms: Many images use three numbers per pixel — how much red, green, and blue (RGB). Mix those and you get the colour you see on screen.
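
A tiny sketch of this idea in plain Python, with no imaging library: a 2×2 image stored as rows of (red, green, blue) values. It assumes the common case of 8 bits per channel, so each number runs from 0 to 255. The `describe` helper and its colour names are made up for illustration.

```python
# A 2x2 colour image as nested lists: image[row][column] = (red, green, blue).
# Assumes 8-bit channels, so every value lies between 0 and 255.
image = [
    [(255, 0, 0), (0, 255, 0)],      # top row: red pixel, green pixel
    [(0, 0, 255), (255, 255, 0)],    # bottom row: blue pixel, yellow pixel
]

def describe(pixel):
    """Return a rough colour name by checking which channels are strong."""
    r, g, b = pixel
    strong = (r > 127, g > 127, b > 127)
    names = {
        (True, False, False): "red",
        (False, True, False): "green",
        (False, False, True): "blue",
        (True, True, False): "yellow",
        (True, True, True): "white",
        (False, False, False): "black",
    }
    return names.get(strong, "mixed")

for row in image:
    print([describe(p) for p in row])
```

Note that the computer holds only the numbers; the colour names exist purely for the human reading the output.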

Humans vs computers: You recognise a friend’s face in a crowd using memory and context. A computer starts with raw numbers and must learn which number patterns mean “face,” “car,” or “crack in the wall.”

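Three main jobs vision AI can do:

  1. Classification: one label for the whole image.
  2. Detection: a box around each object.
  3. Segmentation: a type assigned to every pixel.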

Image basics — size and file types

Resolution is pixels wide × tall (e.g. 1920×1080). More pixels = more detail and more work for the computer.

Grayscale = one brightness number per pixel. Factories often use it to count dark blobs on a light tray.
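
To make "more work for the computer" concrete, the sketch below counts how many numbers one uncompressed image holds and turns a colour pixel into a single brightness value. The plain average used in `to_gray` is a simplifying assumption; real tools often weight the three channels differently.

```python
def values_per_image(width, height, channels):
    """How many numbers the computer must store for one uncompressed image."""
    return width * height * channels

def to_gray(pixel):
    """Average the three channels into one brightness number (0-255).
    A plain average is a simplification; many tools weight green more heavily."""
    r, g, b = pixel
    return (r + g + b) // 3

full_hd_colour = values_per_image(1920, 1080, 3)   # 6,220,800 numbers
full_hd_gray = values_per_image(1920, 1080, 1)     # 2,073,600 numbers
print(full_hd_colour, full_hd_gray)
print(to_gray((200, 100, 60)))   # (200 + 100 + 60) // 3 = 120
```

Grayscale needs one third of the numbers, which is one reason fixed factory setups often prefer it.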

JPEG, PNG: Common image formats. JPEG uses lossy compression, so photo files are smaller; PNG is lossless, so it keeps sharp edges and suits screenshots.

[Diagram: three questions for one photo. Classify the whole image; detect each object; segment each pixel.]

Figure — Pick one job per project; do not do all three on day one.

Who labels training photos?

| Who | Good for | Watch out |
| --- | --- | --- |
| Students | Leaves, demos | Agree on rules |
| Factory workers | Scratch vs OK | Tired eyes; rotate |
| Experts | Medical outlines | Privacy, law |

Example — barcode vs vision: Barcode is exact if present. Vision handles crumpled labels or fruit with no sticker.
[Diagram: from photo to numbers. Photo on screen → grid of pixels, each with RGB numbers → AI finds patterns (edges, shapes, objects) → answer: “apple”]

Figure — The computer never “sees” like you do — it reads number patterns.

[Diagram: a photo is a grid of tiny squares (pixels). Each square has one or more numbers for colour; AI learns patterns in those numbers.]

Figure — Zoom in enough and every picture looks like coloured squares.

Vision jobs — types and what they are used for

| Job type | What you get | Used for | Example |
| --- | --- | --- | --- |
| Classification | One label for whole image | Sort photos, quality OK / not OK | “Beach” search in phone gallery |
| Detection | Boxes around each thing | Count people, find cars, track ball | Security camera person overlay |
| Segmentation | Colour each pixel by type | Road vs sidewalk, tumour outline | Self-driving research maps |
| Face detection | Find where faces are | Camera focus, blur background | Phone portrait mode |
| Face recognition | Match face to identity | Unlock device (needs care + consent) | Face unlock on phone |
| OCR (read text) | Text from image | Scans, receipts, signs | Deposit a cheque in banking app |
| Pose / gesture | Where body joints are | Fitness apps, sign language research | Dance game scoring |

What training data looks like

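To teach vision AI, people collect labelled images: each photo is paired with the answer a person gave it, such as “cat,” “scratch,” “ripe,” or “empty shelf.”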

If you only train on bright sunny photos, the system may fail on dark rainy photos. That is not “stupid AI” — it is missing examples.

Example — phone gallery search: You type “beach” and see beach photos. The phone learned what beach scenes look like from many labelled images — sand, sky, water patterns — not from reading the word on the photo.
Example — school science fair: Students photograph leaves and label them “oak,” “maple,” “birch.” A simple classifier can guess new leaf photos if lighting is similar to training photos.
Try it · Pick one vision job for a school cafeteria
[Diagram: training loop. Photo → guess → wrong? → adjust; repeat thousands of times.]

Figure — Same as Module 3: learn from labelled examples.

Transfer learning — why it saves time

A model trained on millions of general photos already knows edges and textures. You keep that knowledge and train only a small new final layer (the “head”) for your own labels, such as bruised apple or rust spot, with hundreds of photos instead of millions.

Bad labels teach bad lessons

| Mistake | Result |
| --- | --- |
| “Dog” on cat photo | Confuses both |
| Only sunny photos | Fails at night |
| Same photo in train and test | Fake high score |

Try it · Rules vs camera for “empty chair”?

Would a motion sensor be simpler? What breaks the camera approach?

Would you use classification, detection, or segmentation? What would you label in photos? Who checks mistakes before acting?


How image AI learns (simple idea)

Special programs (most commonly convolutional neural networks) slide small windows over the photo. They learn edges first, then shapes, then whole objects, like learning letters, then words, then sentences.
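
For the curious, here is a minimal sketch of what "scanning a small window" can mean, written in plain Python. The 2×2 window values (`vertical_edge`) are written by hand for this example; in a trained network those values are learned from data. The function name `slide_window` and the tiny test image are made up for illustration.

```python
def slide_window(gray, kernel):
    """Slide a small kernel over a grayscale image (a list of rows).
    Returns one number per window position: the weighted sum of the
    pixels currently under the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for top in range(len(gray) - kh + 1):
        row = []
        for left in range(len(gray[0]) - kw + 1):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += gray[top + i][left + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

# Hand-written kernel that responds to a dark-to-bright step from left to right.
vertical_edge = [[-1, 1],
                 [-1, 1]]

# 4x4 image: dark on the left half, bright on the right half.
gray = [[10, 10, 200, 200] for _ in range(4)]

response = slide_window(gray, vertical_edge)
for row in response:
    print(row)
```

The response is large only where the window sits across the dark-to-bright boundary and zero where the image is flat, which is the sense in which an early layer "sees an edge."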

You do not need the math. The important ideas are:

  1. Show the network many labelled photos.
  2. It guesses the label and checks if it was wrong.
  3. It slowly adjusts internal settings to do better next time.
  4. Repeat until guesses are good enough on new photos it never saw.
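
The four steps above can be sketched with one of the simplest learning rules, a perceptron. This is a teaching toy, not how large vision networks are actually trained: each "photo" is reduced to two invented features (redness and greenness), and all the data is made up.

```python
# Each "photo" is reduced to two made-up features: (redness, greenness), each 0..1.
# Label 1 = "ripe", 0 = "unripe". The data is invented for illustration.
train = [((0.9, 0.2), 1), ((0.8, 0.3), 1), ((0.7, 0.1), 1),
         ((0.2, 0.9), 0), ((0.3, 0.7), 0), ((0.1, 0.8), 0)]
unseen = [((0.85, 0.25), 1), ((0.15, 0.75), 0)]

def guess(weights, bias, features):
    """Weighted sum of the features; a positive score means 'ripe'."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0

def train_perceptron(examples, epochs=20, rate=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):                       # step 4: repeat
        for features, label in examples:          # step 1: show labelled examples
            error = label - guess(weights, bias, features)   # step 2: guess and check
            if error != 0:                        # step 3: adjust the settings slightly
                weights = [w + rate * error * x for w, x in zip(weights, features)]
                bias += rate * error
    return weights, bias

weights, bias = train_perceptron(train)
correct = sum(guess(weights, bias, f) == label for f, label in unseen)
print(f"{correct} of {len(unseen)} unseen examples guessed correctly")
```

The final check uses examples the loop never trained on, which mirrors step 4: what matters is performance on new photos.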

Transfer learning (shortcut): Start from a model already trained on millions of general photos (cats, cars, chairs). Then teach it your smaller job — “bruised apple” vs “good apple” — with fewer pictures of your own.

Things that hurt vision AI: dark rooms, blur, shiny reflections, hidden objects, and labels that disagree (one person says “OK,” another says “defect”).

[Diagram: how AI “learns” from a photo (simple view). Edges → shapes → object → “apple”]

Figure — Early layers see simple parts; later layers combine them into meaning.

[Diagram: train vs test, why we split photos. Training set: AI learns from these photos. Test set (held back): AI never saw these, so the score is honest. If you test on the same photos you trained on, scores look too good and lie.]

Figure — Like studying with practice questions, then taking a new exam.
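
One way to hold back an honest test set is to split by group, so near-identical photos from one source cannot land on both sides. The sketch below groups by camera; the record fields (`file`, `camera`, `label`) and the function name are hypothetical.

```python
import random

def split_by_group(photos, test_fraction=0.25, seed=0):
    """Split photo records so that every photo from one group (here: one camera)
    lands entirely in train or entirely in test, never both."""
    groups = sorted({p["camera"] for p in photos})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [p for p in photos if p["camera"] not in test_groups]
    test = [p for p in photos if p["camera"] in test_groups]
    return train, test

# Hypothetical records: file name, which camera took it, and the human label.
photos = [{"file": f"cam{c}_{i}.jpg", "camera": c, "label": "ok" if i % 2 else "defect"}
          for c in "ABCD" for i in range(5)]

train, test = split_by_group(photos)
print(len(train), "training photos,", len(test), "test photos")
```

Grouping by date or by production batch would follow the same pattern; the point is that the test photos come from a source the model never studied.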

Common data problems — and what to do

| Problem | What goes wrong | What helps |
| --- | --- | --- |
| Too few photos | Guesses random on new images | Collect more; use transfer learning |
| All photos look the same | Works in lab, fails in real room | Add night, blur, different angles |
| Wrong labels | AI learns the wrong lesson | Two people label; spot-check |
| Class imbalance | 99% “OK” → always says OK | Collect more defect photos; measure fairly |
| Leaky test set | Near-duplicates in train and test | Split by time or camera, not random only |

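
The class imbalance row is easy to check with arithmetic. In the sketch below, a "model" that always answers OK scores 99% accuracy on invented data yet catches no defects at all. Recall (the share of real defects that were caught) is one common way to "measure fairly"; the function names are illustrative.

```python
def accuracy(truth, predicted):
    """Fraction of all answers that were right."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)

def recall(truth, predicted, positive="defect"):
    """Of all real positives, what fraction did the model catch?"""
    caught = sum(t == positive and p == positive for t, p in zip(truth, predicted))
    actual = sum(t == positive for t in truth)
    return caught / actual if actual else 0.0

# Invented data: 99 good parts and 1 defective part.
truth = ["ok"] * 99 + ["defect"]
lazy_model = ["ok"] * 100          # a "model" that always answers OK

print("accuracy:", accuracy(truth, lazy_model))        # 0.99
print("defect recall:", recall(truth, lazy_model))     # 0.0
```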
When rules are enough: Bright room, fixed camera, same object every time. Threshold counting may beat a big neural network.
When AI helps: Objects vary in size, colour, or background. Many labelled photos teach patterns that rules cannot capture by hand.
When humans stay boss: Medical, legal, policing. AI suggests; a human decides and explains to the person affected.
Example — fruit sorting belt: A camera over the conveyor takes a photo of each apple. A network trained on thousands of labelled images flags bruises. A puff of air or a gate pushes bad apples aside. Humans still spot-check random trays.
Example — document scanner app: OCR reads account numbers from a cheque photo. The app checks format (right number of digits) before sending — vision plus simple rules.
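
The "vision plus simple rules" idea from the scanner example might look like the sketch below: whatever text the OCR step returns is checked against an expected format before it is used. The 8-digit length is a placeholder assumption, since real account formats vary, and the function names are hypothetical.

```python
import re

def clean_ocr_digits(text):
    """Remove spaces and hyphens that OCR or the printed layout may add."""
    return re.sub(r"[\s-]", "", text)

def looks_valid(text, expected_digits=8):
    """True only if the cleaned text is exactly `expected_digits` digits.
    The length 8 is a placeholder; real account formats vary by bank and country."""
    cleaned = clean_ocr_digits(text)
    return re.fullmatch(rf"[0-9]{{{expected_digits}}}", cleaned) is not None

for reading in ["1234 5678", "1234567B", "123456"]:
    action = "send" if looks_valid(reading) else "ask user to retake photo"
    print(repr(reading), "->", action)
```

A format check cannot prove the digits are correct, only that they are plausible, so it catches obvious misreads cheaply before anything is sent.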

Before AI: blur, edges, and filters

Long before big neural networks, engineers used simple picture steps. Many factory and medical systems still use them because they are fast, cheap, and easy to explain.

Why simple steps still matter: If lighting is controlled (same lamp, same distance), you can count white pills or measure a screw head without training a huge model. You can explain every step to an inspector.

Typical pipeline (classic vision):

  1. Take photo with a fixed camera.
  2. Maybe blur to remove speckle noise.
  3. Adjust brightness if needed.
  4. Find edges or turn black/white (threshold).
  5. Count blobs or measure width in pixels.
  6. Convert pixels to millimetres using a ruler in the scene.
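
Steps 4 to 6 of the pipeline can be sketched in a few lines of plain Python. In practice a library such as OpenCV would do this on full images; the one-row "photo," the cutoff of 128, the ruler calibration, and the accept range below are all invented for illustration.

```python
def threshold(gray, cutoff=128):
    """Turn a grayscale image into black/white: 1 where bright, 0 where dark."""
    return [[1 if value >= cutoff else 0 for value in row] for row in gray]

def widest_run(binary_row):
    """Length in pixels of the longest unbroken run of white (1) pixels in one row."""
    best = current = 0
    for value in binary_row:
        current = current + 1 if value else 0
        best = max(best, current)
    return best

def pixels_to_mm(pixels, ruler_pixels, ruler_mm):
    """Convert using a reference object of known size in the same photo."""
    return pixels * ruler_mm / ruler_pixels

# Invented 1-row "photo" of a bright screw head on a dark background.
gray_row = [20, 25, 210, 220, 215, 205, 30, 22]
binary_row = threshold([gray_row])[0]
width_px = widest_run(binary_row)

# Assumption: a 10 mm ruler mark spans 8 pixels in this camera setup.
width_mm = pixels_to_mm(width_px, ruler_pixels=8, ruler_mm=10)
print(width_px, "pixels =", width_mm, "mm")
print("accept" if 4.5 <= width_mm <= 5.5 else "reject")
```

Every number in the decision can be shown to an inspector, which is the explainability advantage described above.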
[Diagram: classic image pipeline (no big AI). Camera → blur → brightness → threshold → count blobs → accept / reject]

Figure — Same flow in many factories: prepare image → measure → decision.

[Diagram: threshold, then count blobs. Grey image → black/white image → count]

Figure — Pills on tray: white blobs after threshold.

Lighting first: Fix the lamp before buying a bigger model.
Fixed camera: Bolt it down. Any shake ruins the measurement.
OpenCV: A free, open-source library for blur, edge detection, and thresholding. It is very widely used.

Simple filter types — what they do

| Filter / step | What it does | Used for | Example |
| --- | --- | --- | --- |
| Blur | Smooths tiny speckles | Cleaner image before counting | Remove camera noise on grey tray |
| Sharpen | Makes edges crisper | See fine cracks (careful: also sharpens noise) | Inspecting metal surface |
| Edge finder | Highlights outlines | Measure width, find shape | Is the screw head round? |
| Brightness / contrast | Darken or brighten | Same rule morning and afternoon | Factory belt under changing sun |
| Threshold | Black and white only | Count white blobs | Pills on dark tray |
| Crop / resize | Cut or shrink image | Faster processing; focus on belt only | Ignore factory background |
| Colour filter | Keep one colour range | Find red defects on grey part | Tomato sorting by redness |
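
As one worked instance of the table, the sketch below shows why blur often comes before threshold. It uses a simple 3×3 averaging blur written in plain Python; libraries such as OpenCV offer faster built-in versions. The tray values and the function name `box_blur` are invented for the example.

```python
def box_blur(gray):
    """Replace each pixel with the average of its 3x3 neighbourhood.
    Edge pixels average only the neighbours that exist."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [gray[ny][nx]
                          for ny in range(max(0, y - 1), min(height, y + 2))
                          for nx in range(max(0, x - 1), min(width, x + 2))]
            row.append(sum(neighbours) // len(neighbours))
        out.append(row)
    return out

# Invented 5x5 dark tray with a single bright speckle of camera noise in the centre.
tray = [[10] * 5 for _ in range(5)]
tray[2][2] = 255

blurred = box_blur(tray)
print("before:", tray[2][2], "after:", blurred[2][2])
```

Before blurring, a threshold at 128 would count the speckle as a tiny white blob; after blurring, the speckle's brightness is spread over its neighbours and falls well below the cutoff.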

Rules vs AI — when to pick which

| Situation | Often best choice | Why |
| --- | --- | --- |
| Same camera, same lighting, same object | Rules + threshold | Fast, cheap, easy to audit |
| Objects vary in look or background | Trained vision AI | Hard to write rules for every case |
| Need to explain every decision to law | Rules first; AI as helper | “Pixel count > 500” is clearer than “layer 7 said so” |
| Phone app for billions of photos | Big pre-trained AI | Scale and variety need deep learning |

Example — counting pills: Pills are white circles on a dark tray. A high-contrast photo + threshold turns the image black and white. Software counts white blobs. If count is 30, tray is complete. No neural network needed if the tray always looks the same.
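
A minimal sketch of the "count white blobs" step, assuming the photo has already been thresholded to 1 (white) and 0 (dark). It uses a flood fill over pixels that share an edge; OpenCV and similar libraries provide ready-made connected-component functions for real images. The miniature tray holds three pills rather than thirty to keep the example readable.

```python
def count_blobs(binary):
    """Count groups of touching white (1) pixels using a flood fill.
    Pixels touch if they share an edge (4-connectivity)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] and not seen[y][x]:
                blobs += 1                      # found a new, unvisited blob
                stack = [(y, x)]
                seen[y][x] = True
                while stack:                    # visit every pixel in this blob
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        inside = 0 <= ny < height and 0 <= nx < width
                        if inside and binary[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return blobs

# Invented thresholded tray: 1 = white pill pixel, 0 = dark tray. Three pills.
tray = [
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
]
EXPECTED = 3   # a full-size tray in the example above would expect 30
count = count_blobs(tray)
print("complete" if count == EXPECTED else f"check tray: found {count}")
```

Two pills that touch would merge into one blob, which is one way this simple approach breaks when the tray does not always look the same.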
Example — reading a dial: Edge detection finds the needle outline; geometry measures the angle. A fixed camera and good lighting matter more than a fancy model.
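
The geometry in the dial example is only a few lines once the needle's pivot and tip are known. The sketch below assumes an earlier edge-detection step has already produced those two pixel positions, and it assumes a hypothetical dial whose scale runs linearly from 0 (needle straight up) to 100 (needle pointing right).

```python
import math

def needle_angle(pivot, tip):
    """Angle of the needle in degrees, measured clockwise from straight up.
    Image coordinates: x grows to the right, y grows downward."""
    dx = tip[0] - pivot[0]
    dy = pivot[1] - tip[1]          # flip y so "up" is positive
    return math.degrees(math.atan2(dx, dy)) % 360

def angle_to_value(angle, min_angle, max_angle, min_value, max_value):
    """Linear map from needle angle to dial reading."""
    fraction = (angle - min_angle) / (max_angle - min_angle)
    return min_value + fraction * (max_value - min_value)

# Hypothetical dial: needle straight up reads 0, needle pointing right (90 degrees) reads 100.
angle = needle_angle(pivot=(50, 50), tip=(80, 20))   # tip is up and to the right
reading = angle_to_value(angle, 0, 90, 0, 100)
print(round(angle, 1), "degrees ->", round(reading, 1))
```

If the camera moves, the pivot position and the angle calibration both change, which is why the fixed camera matters more than the model.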
Discuss · Could a canteen count sandwiches with only thresholding?

What would have to stay the same every day? What would break the system?


Where you already meet vision AI

You use computer vision more than you think — unlocking your phone, scanning homework, filtering selfies, and getting alerts when a camera sees a person. The ideas from Topics 1–3 show up in all of them.

Same loop everywhere: Camera captures image → software finds patterns → something happens (unlock, beep, stop belt, draw box on screen).

Good uses save time, catch defects, or help people with disabilities (text-to-speech on signs).

Risky uses need extra care: face recognition in public, emotion guessing in hiring, fully automatic medical decisions.

Where vision AI shows up — types and uses

| Where | What vision does | Why people use it | Example product |
| --- | --- | --- | --- |
| Phone | Face unlock, photo search, filters, QR scan | Convenience, fun, security | Gallery search, portrait mode |
| Car | Backup lines, lane hints, sign read | Help driver see danger | Reversing camera, lane assist |
| Factory | Spot scratches, wrong parts, missing cap | Quality 24/7 | Camera over conveyor belt |
| Shop | Recognise items at self-checkout | Faster queues | Camera above basket |
| Home security | Person / animal / package alert | Notify owner, not watch 24/7 | Video doorbell |
| Farming | Weed vs crop, ripeness, pest damage | Target spraying, less waste | Drone over field |
| Healthcare (with doctor) | Highlight region on scan | Second pair of eyes | X-ray assist; doctor decides |
| Accessibility | Describe scene, read text aloud | Help blind or low-vision users | Phone “describe image” feature |

[Diagram: camera → software → action. Camera → vision AI → decision → beep / stop]

Figure — Same idea as a robot from Module 4: see, think, act.

Uses that need extra care

| Use | Risk if wrong | What responsible teams do |
| --- | --- | --- |
| Face recognition in public | Wrong person accused; privacy harm | Consent, law check, human review, audit logs |
| Emotion AI in hiring | Unfair rejection; pseudo-science | Often avoided; humans interview |
| Medical image only | Missed disease | Doctor makes final call; regulated testing |
| Security “weapon” detection | False alarm, bias | Test on diverse data; human verifies alert |

Careful uses: Face recognition in public spaces and medical images need extra testing, consent, and human review. “The AI said so” is not enough for life-changing decisions.

Vision in your Module 8 project

| Idea | Vision job | Keep small |
| --- | --- | --- |
| Recycling bins | Classify material | ~50 photos per class |
| Plant health | OK vs wilted leaf | Same camera distance |
| Parking slot | Car vs empty | One camera angle |
| Bottle cap | Cap missing? | Rules may be enough |

With IoT (Module 5): Camera → vision says defect → MQTT message → chart or belt stop. Draw the full chain on your poster.
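
A sketch of the middle of that chain, using only the Python standard library: the vision result is packed into a JSON message, and a subscriber decides what to do with it. No real MQTT client is used here; in a project the `payload` string would be handed to an MQTT library's publish call. The topic name, field names, and the 0.8 stop threshold are all hypothetical.

```python
import json
import time

def build_defect_message(camera_id, label, confidence):
    """Package one vision result as a JSON string ready to publish."""
    return json.dumps({
        "camera": camera_id,
        "label": label,
        "confidence": round(confidence, 2),
        "timestamp": int(time.time()),
    })

def handle_message(payload, stop_threshold=0.8):
    """What a subscriber might do: stop the belt only for confident defect reports."""
    data = json.loads(payload)
    if data["label"] == "defect" and data["confidence"] >= stop_threshold:
        return "stop belt"
    return "log only"

# Hypothetical topic name; a real project would pass `payload` to an MQTT client library.
topic = "factory/line1/vision"
payload = build_defect_message("cam-01", "defect", 0.93)
print(topic, payload)
print(handle_message(payload))
```

Keeping the decision rule in the subscriber rather than in the camera code makes it easy to show on a poster where the human-adjustable threshold lives.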

Example — shelf tidiness: Photo each hour; “aligned / messy” for display only — not for grading people.
Example — helmet check demo: Detect head region; check helmet visible. Needs consent if filming people.
Example — video doorbell: Camera sees motion. Small AI on device asks “person or car?” If person, send phone alert. Cloud stores clip if you pay for storage. You decide whether to open the door — vision only informs you.
Example — factory cap check: Camera looks down at bottles. Vision detects missing cap or wrong label colour. Belt stops automatically; operator removes bad bottle. Classic rules or AI both work if lighting is stable.
Try it · List one helpful and one risky use of face recognition

Explain why in one sentence each. Who should be allowed to override the system?


Quick Knowledge Check

10 easy questions on how machines see pictures. Instant feedback on every answer.


Key Takeaways

Module 6 in short: photos are grids of numbers; AI learns patterns to recognise things.

📚 Further reading:
• OpenCV documentation: docs.opencv.org
• PyTorch vision tutorials: pytorch.org/vision
• ImageNet and modern CV history: survey papers on arXiv (search “ImageNet deep learning survey”)