Extra Modules: Advanced Topics

Advanced OpenCV features including face recognition, object tracking, and text detection.

Note: The face recognizers (cv2.face) and several of the trackers below are part of the contrib modules and require the opencv-contrib-python package.

Topics Covered

Face detection and recognition
Object tracking algorithms
Text detection and OCR

1. Face Recognition

Face Recognition Pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                    Face Recognition System                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    │
│   │  Input   │    │  Detect  │    │  Align   │    │ Extract  │    │
│   │  Image   │───▶│   Face   │───▶│   Face   │───▶│ Features │    │
│   └──────────┘    └──────────┘    └──────────┘    └──────────┘    │
│                                                         │          │
│                                                         ▼          │
│                                                   ┌──────────┐     │
│   ┌──────────┐                                    │  Match   │     │
│   │  Known   │───────────────────────────────────▶│ Against  │     │
│   │   Faces  │                                    │ Database │     │
│   │ Database │                                    └──────────┘     │
│   └──────────┘                                          │          │
│                                                         ▼          │
│                                                   ┌──────────┐     │
│                                                   │ Identity │     │
│                                                   │   or     │     │
│                                                   │ Unknown  │     │
│                                                   └──────────┘     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Face Detection with DNN

Recommended approach using deep learning:

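import cv2
import numpy as np

# Load SSD face detector (modelFile/configFile are paths to the model files)
net = cv2.dnn.readNetFromTensorflow(modelFile, configFile)

# Prepare input: resize to 300x300 and subtract the per-channel mean
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, 1.0, (300, 300), (104, 177, 123))
net.setInput(blob)

# Detect faces; output shape is (1, 1, N, 7)
detections = net.forward()

# Process detections
for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    if confidence > 0.5:
        # Box corners are normalized to [0, 1]; scale them to pixel coordinates
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        x1, y1, x2, y2 = box.astype(int)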
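The detection loop above can be wrapped into a small helper. This is a minimal sketch, assuming the SSD output layout used above: shape (1, 1, N, 7), each row holding [image_id, class_id, confidence, x1, y1, x2, y2] with corners normalized to [0, 1]. The name `decode_detections` and the synthetic input are illustrative only and not part of OpenCV.

```python
import numpy as np

def decode_detections(detections, width, height, min_confidence=0.5):
    """Convert SSD output of shape (1, 1, N, 7) into pixel-space boxes."""
    boxes = []
    scale = np.array([width, height, width, height], dtype=np.float32)
    for i in range(detections.shape[2]):
        confidence = float(detections[0, 0, i, 2])
        if confidence <= min_confidence:
            continue
        # Scale normalized corners to pixels and clip to the image bounds
        x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * scale).astype(int)
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width - 1, x2), min(height - 1, y2)
        boxes.append((int(x1), int(y1), int(x2), int(y2), confidence))
    return boxes

# Synthetic example: one confident detection and one weak one
fake = np.zeros((1, 1, 2, 7), dtype=np.float32)
fake[0, 0, 0] = [0, 1, 0.9, 0.25, 0.25, 0.75, 0.5]
fake[0, 0, 1] = [0, 1, 0.2, 0.0, 0.0, 1.0, 1.0]
boxes = decode_detections(fake, width=400, height=200)
```

Clipping matters in practice because the network can predict corners slightly outside the image, which would otherwise break cropping.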

Face Recognition Algorithms

Eigenfaces (PCA)

What it does: Projects faces into lower-dimensional eigenspace.

Principal Component Analysis:

Compute mean face: μ = (1/N) × Σᵢ xᵢ

Center the data: Φᵢ = xᵢ - μ

Compute covariance: C = (1/N) × Σᵢ ΦᵢΦᵢᵀ

Find eigenvectors of C: Cv = λv
   (Eigenfaces are eigenvectors with largest eigenvalues)

Project to eigenspace: ω = Uᵀ × (x - μ)
   Where U = matrix of top k eigenvectors

Recognition:

Project test face to eigenspace
Find nearest neighbor in projected training set
Use Euclidean distance: d = ||ω_test - ω_train||

recognizer = cv2.face.EigenFaceRecognizer_create(
    num_components=80,  # Number of eigenfaces
    threshold=10000     # Recognition threshold
)
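To make the PCA steps concrete, here is a minimal NumPy sketch of Eigenfaces on toy data. It is an illustration under stated assumptions, not OpenCV's implementation: faces are assumed to be flattened into rows of a matrix, the eigenvectors are obtained through an SVD of the centered data (which yields the same directions as the eigenvectors of C), and the helper names `fit_eigenfaces`, `project`, and `predict` are hypothetical.

```python
import numpy as np

def fit_eigenfaces(X, k):
    """X: (N, D) matrix, one flattened face per row. Returns mean and U (D, k)."""
    mu = X.mean(axis=0)
    Phi = X - mu                      # center the data
    # SVD of the centered data gives the eigenvectors of C = (1/N) Phi^T Phi
    # without forming the D x D covariance matrix explicitly.
    _, _, Vt = np.linalg.svd(Phi, full_matrices=False)
    U = Vt[:k].T                      # top-k eigenfaces as columns
    return mu, U

def project(x, mu, U):
    return U.T @ (x - mu)             # omega = U^T (x - mu)

def predict(x, mu, U, train_proj, labels):
    omega = project(x, mu, U)
    dists = np.linalg.norm(train_proj - omega, axis=1)   # Euclidean distance
    best = int(np.argmin(dists))
    return labels[best], float(dists[best])

rng = np.random.default_rng(0)
# Toy data: two "people", each a noisy copy of a random 64-pixel template
templates = rng.normal(size=(2, 64))
X = np.vstack([templates[i] + 0.05 * rng.normal(size=64) for i in (0, 0, 1, 1)])
labels = [0, 0, 1, 1]
mu, U = fit_eigenfaces(X, k=2)
train_proj = (X - mu) @ U             # each row is the projection of one face
probe = templates[1] + 0.05 * rng.normal(size=64)
label, dist = predict(probe, mu, U, train_proj, labels)
```

The returned distance plays the same role as the value compared against the recognition threshold: a small distance means a close match.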

Fisherfaces (LDA)

What it does: Finds a projection that maximizes between-class scatter while minimizing within-class scatter.

Linear Discriminant Analysis:

Maximize: J(W) = (Wᵀ S_B W) / (Wᵀ S_W W)

Where:
  S_B = between-class scatter matrix
  S_W = within-class scatter matrix

S_B = Σᵢ Nᵢ × (μᵢ - μ)(μᵢ - μ)ᵀ
S_W = Σᵢ Σₓ∈Cᵢ (x - μᵢ)(x - μᵢ)ᵀ

Advantages over Eigenfaces:

Better handles lighting variations
More discriminative features

recognizer = cv2.face.FisherFaceRecognizer_create(
    num_components=0,   # 0 = keep all (at most c - 1 for c classes)
    threshold=10000     # Recognition threshold
)
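The scatter matrices and the Fisher criterion can be sketched in a few lines of NumPy. This is a toy illustration, not the Fisherfaces implementation in OpenCV: it solves the eigenproblem of S_W⁻¹ S_B directly (using a pseudo-inverse as a guard against a singular S_W), whereas the Fisherfaces method typically applies PCA first. The function names are hypothetical.

```python
import numpy as np

def scatter_matrices(X, y):
    """Return (S_B, S_W) for samples X (N, D) with integer labels y (N,)."""
    mu = X.mean(axis=0)
    D = X.shape[1]
    S_B = np.zeros((D, D))
    S_W = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        diff = (mu_c - mu)[:, None]
        S_B += len(Xc) * (diff @ diff.T)     # N_i (mu_i - mu)(mu_i - mu)^T
        centered = Xc - mu_c
        S_W += centered.T @ centered         # sum of (x - mu_i)(x - mu_i)^T
    return S_B, S_W

def fisher_directions(X, y, k):
    S_B, S_W = scatter_matrices(X, y)
    # Directions maximizing J(W) are the leading eigenvectors of S_W^-1 S_B
    evals, evecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(-evals.real)[:k]
    return evecs[:, order].real

rng = np.random.default_rng(1)
# Two classes separated along the first axis, with large spread on the second
X = np.vstack([rng.normal([0, 0], [0.3, 3.0], size=(50, 2)),
               rng.normal([2, 0], [0.3, 3.0], size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
S_B, S_W = scatter_matrices(X, y)
W = fisher_directions(X, y, k=1)
```

In this toy data PCA would pick the high-variance second axis, while the Fisher direction points mostly along the first axis, where the classes actually differ.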

LBPH (Local Binary Patterns Histograms)

What it does: Extracts texture features using local binary patterns.

LBP Operator Visualization:

┌─────────────────────────────────────────────────────────────────────┐
│                    Local Binary Pattern (LBP)                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Step 1: Get neighborhood        Step 2: Compare with center      │
│                                                                     │
│   ┌───┬───┬───┐                  ┌───┬───┬───┐                     │
│   │ 7 │ 9 │ 3 │                  │ 1 │ 1 │ 0 │   ≥5 → 1            │
│   ├───┼───┼───┤   center = 5     ├───┼───┼───┤   < 5 → 0            │
│   │ 6 │ 5 │ 2 │  ───────────▶   │ 1 │   │ 0 │                      │
│   ├───┼───┼───┤                  ├───┼───┼───┤                     │
│   │ 1 │ 8 │ 4 │                  │ 0 │ 1 │ 0 │                      │
│   └───┴───┴───┘                  └───┴───┴───┘                     │
│                                                                     │
│   Step 3: Read clockwise from top-left → 11000101 = 197 (decimal)   │
│                                                                     │
│   Order:  7 → 9 → 3 → 2 → 4 → 8 → 1 → 6                             │
│   Bits:   1   1   0   0   0   1   0   1                             │
│                                                                     │
│   Binary:  11000101                                                 │
│   Decimal: 197      This becomes the pixel value                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

LBPH Face Histogram:

┌─────────────────────────────────────────────────────────────────────┐
│                    LBPH Face Representation                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Face divided into grid:         Histogram per cell:              │
│                                                                     │
│   ┌───┬───┬───┬───┐               ┌────────────────┐               │
│   │ 1 │ 2 │ 3 │ 4 │               │  ▓░▓▓░▓░░▓    │ Cell 1        │
│   ├───┼───┼───┼───┤               ├────────────────┤               │
│   │ 5 │ 6 │ 7 │ 8 │               │  ░▓░▓▓░▓░     │ Cell 2        │
│   ├───┼───┼───┼───┤               ├────────────────┤               │
│   │ 9 │10 │11 │12 │   ─────▶      │  ▓▓░░▓░░▓     │ Cell 3        │
│   ├───┼───┼───┼───┤               ├────────────────┤               │
│   │13 │14 │15 │16 │               │     ...       │ ...           │
│   └───┴───┴───┴───┘               └────────────────┘               │
│                                          │                         │
│   (4×4 shown; 8×8 = 64 cells)            │                         │
│                                          ▼                         │
│                    Concatenate all histograms → Feature Vector     │
│                    (64 cells × 256 bins = 16,384 features)         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

LBP Operator:

For each pixel p with neighbors n₀...n₇:

LBP(p) = Σᵢ₌₀⁷ s(nᵢ - p) × 2ⁱ

Where s(x) = 1 if x ≥ 0, else 0

Result: 8-bit code (0-255) describing local texture
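A direct transcription of the operator for a single 3×3 patch, as a minimal sketch. The neighbor ordering (n₀ at the top-left, then clockwise) is an assumption made here for illustration; with the formula above, nᵢ contributes weight 2ⁱ, so n₀ is the least significant bit.

```python
def lbp_code(patch):
    """LBP code of the centre of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Neighbours n0..n7, clockwise starting at the top-left corner (assumed ordering)
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for i, (r, c) in enumerate(order):
        s = 1 if patch[r][c] - center >= 0 else 0   # s(n_i - p)
        code += s * (2 ** i)
    return code

patch = [[7, 9, 3],
         [6, 5, 2],
         [1, 8, 4]]
code = lbp_code(patch)   # bits n0..n7 = 1,1,0,0,0,1,0,1 -> 1 + 2 + 32 + 128 = 163
```

The numeric value depends on the chosen starting neighbor and bit order; any convention works as long as it is applied consistently to every pixel.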

Circular LBP:

Sample P points on circle of radius R around center:
  xₚ = x + R × cos(2πp/P)
  yₚ = y + R × sin(2πp/P)
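The sampling positions follow directly from the two formulas. A minimal sketch using only the standard library; the function name `circular_neighbors` is illustrative.

```python
import math

def circular_neighbors(x, y, radius=1.0, points=8):
    """Return the P sampling positions (x_p, y_p) on a circle around (x, y)."""
    coords = []
    for p in range(points):
        theta = 2.0 * math.pi * p / points
        coords.append((x + radius * math.cos(theta),
                       y + radius * math.sin(theta)))
    return coords

samples = circular_neighbors(10, 10, radius=1.0, points=8)
# samples[0] is (11.0, 10.0); diagonal samples such as samples[1] fall between
# pixel centres, so implementations typically interpolate the image there.
```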

Histogram Computation:

Divide face into grid (e.g., 8×8 cells)
Compute LBP for each pixel
Build histogram for each cell
Concatenate histograms → feature vector
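The four steps above can be sketched with NumPy as follows. This is a simplified illustration, assuming the basic 3×3 operator (radius 1, 8 neighbors, no interpolation) and per-cell normalization; it is not the exact procedure used inside cv2.face, and the function names are hypothetical.

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 LBP for the interior pixels of a 2-D uint8 array."""
    g = gray.astype(np.int32)
    center = g[1:-1, 1:-1]
    out = np.zeros_like(center)
    # Neighbour offsets clockwise from top-left; bit i has weight 2**i
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = g.shape
    for i, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        out += (neighbor >= center).astype(np.int32) << i
    return out

def lbph_features(gray, grid_x=8, grid_y=8):
    lbp = lbp_image(gray)
    hists = []
    for rows in np.array_split(lbp, grid_y, axis=0):
        for cell in np.array_split(rows, grid_x, axis=1):
            hist = np.bincount(cell.ravel(), minlength=256).astype(np.float64)
            hists.append(hist / max(cell.size, 1))   # normalise each cell
    return np.concatenate(hists)

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(66, 66), dtype=np.uint8)  # stand-in for a face crop
features = lbph_features(face)
```

With an 8×8 grid and 256 bins per cell, the feature vector has 64 × 256 = 16,384 entries.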

Recognition:

Compare histograms using Chi-squared distance:

χ²(H₁, H₂) = Σᵢ (H₁(i) - H₂(i))² / (H₁(i) + H₂(i))
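A minimal sketch of the distance in plain Python. Skipping bins that are empty in both histograms is an assumption added here to avoid division by zero; the formula above leaves that case undefined.

```python
def chi_squared(h1, h2, eps=1e-10):
    """Chi-squared distance between two equal-length histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        denom = a + b
        if denom > eps:            # skip bins that are empty in both histograms
            total += (a - b) ** 2 / denom
    return total

h_a = [0.5, 0.5, 0.0, 0.0]
h_b = [0.5, 0.0, 0.5, 0.0]
d_same = chi_squared(h_a, h_a)   # 0.0
d_diff = chi_squared(h_a, h_b)   # 0.25/0.5 + 0.25/0.5 = 1.0
```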

recognizer = cv2.face.LBPHFaceRecognizer_create(
    radius=1,       # LBP radius
    neighbors=8,    # Number of neighbors
    grid_x=8,       # Grid cells in x
    grid_y=8,       # Grid cells in y
    threshold=80    # Recognition threshold
)

Advantages:

Can be updated with new faces
Robust to lighting changes
Faster training

Training and Prediction

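import cv2
import numpy as np

# Prepare training data
faces = [gray_face1, gray_face2, ...]  # Grayscale; Eigenfaces/Fisherfaces need identical sizes
labels = np.array([0, 0, 1, 1, 2, ...])  # Person IDs, one per face

# Train
recognizer.train(faces, labels)

# Predict
label, confidence = recognizer.predict(test_face)
# "confidence" is a distance: lower means a better match.
# If it exceeds the recognizer's threshold, label is -1 (unknown).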

# Update (LBPH only)
recognizer.update(new_faces, new_labels)

# Save/Load
recognizer.save('model.yml')
recognizer.read('model.yml')

Face Alignment

Normalize face orientation before recognition:

def align_face(img, left_eye, right_eye):
    # Calculate rotation angle
    dY = right_eye[1] - left_eye[1]
    dX = right_eye[0] - left_eye[0]
    angle = np.degrees(np.arctan2(dY, dX))

    # Eye center
    eye_center = ((left_eye[0] + right_eye[0]) // 2,
                  (left_eye[1] + right_eye[1]) // 2)

    # Rotate
    M = cv2.getRotationMatrix2D(eye_center, angle, 1.0)
    aligned = cv2.warpAffine(img, M, (img.shape[1], img.shape[0]))

    return aligned

2. Object Tracking

Object Tracking Concept:

┌─────────────────────────────────────────────────────────────────────┐
│                    Object Tracking                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Frame 1            Frame 2            Frame 3            Frame N │
│                                                                     │
│   ┌───────────┐     ┌───────────┐     ┌───────────┐     ┌────────┐│
│   │     ┌─┐   │     │       ┌─┐ │     │         ┌─┐     │    ┌─┐ ││
│   │     │●│   │────▶│       │●│ │────▶│         │●│────▶│    │●│ ││
│   │     └─┘   │     │       └─┘ │     │         └─┘     │    └─┘ ││
│   │           │     │           │     │           │     │        ││
│   └───────────┘     └───────────┘     └───────────┘     └────────┘│
│                                                                     │
│   Initialize        Predict new        Update model     Continuous │
│   with bbox         location           with appearance  tracking   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐  │
│   │ Tracking vs Detection:                                      │  │
│   │ • Detection: Find all objects each frame (slow but robust) │  │
│   │ • Tracking: Follow known object between frames (fast)      │  │
│   │ • Best: Combine both (detect periodically, track between)  │  │
│   └─────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Tracker Types

Tracker	Speed	Accuracy	Occlusion handling	Description
BOOSTING	Slow	Low	Poor	AdaBoost-based
MIL	Slow	Medium	Poor	Multiple Instance Learning
KCF	Fast	Medium	Poor	Kernelized Correlation Filters
TLD	Medium	Medium	Good	Tracking-Learning-Detection
MEDIANFLOW	Fast	High	Poor	Optical flow based
MOSSE	Very fast	Low	Poor	Minimum Output Sum of Squared Error
CSRT	Medium	High	Medium	Discriminative Correlation Filter
GOTURN	Slow	High	Good	Deep learning (CNN)

KCF (Kernelized Correlation Filters)

What it does: Tracks the target using correlation filters evaluated in the Fourier domain.

Correlation Filter:

Train filter h that produces high response at target:

g = h ⊛ x

Where:
  x = image patch
  h = filter
  g = response map
  ⊛ = correlation

Fourier Domain (fast computation):

G = H* ⊙ X

Where:
  H* = complex conjugate of filter
  ⊙ = element-wise multiplication
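The Fourier-domain identity can be checked numerically. In this minimal sketch the "filter" is simply the target template itself rather than a trained filter (a simplification for illustration), and the search patch is a circularly shifted copy of it. The peak of the response map then recovers the shift.

```python
import numpy as np

rng = np.random.default_rng(0)
template = rng.normal(size=(32, 32))              # h: appearance of the target
shift = (5, 9)                                    # true displacement (rows, cols)
patch = np.roll(template, shift, axis=(0, 1))     # x: the target, circularly shifted

H = np.fft.fft2(template)
X = np.fft.fft2(patch)
G = np.conj(H) * X                                # G = H* ⊙ X
response = np.real(np.fft.ifft2(G))               # g: correlation response map

peak = np.unravel_index(np.argmax(response), response.shape)
```

Computing the full response map this way costs a few FFTs instead of one dot product per candidate shift, which is where the speed of correlation-filter trackers comes from.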

Kernel Trick:

Use non-linear kernel for better separation:

k(x, x') = exp(-1/σ² × (||x||² + ||x'||² - 2F⁻¹(X* ⊙ X')))
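The kernel formula evaluates the Gaussian kernel between x and every circular shift of x' at once. Below is a minimal single-channel NumPy sketch that follows the formula literally; practical implementations often also normalize the distance by the number of elements and sum over feature channels, which is omitted here. The function name is hypothetical.

```python
import numpy as np

def gaussian_correlation(x, x2, sigma):
    """Gaussian kernel between x and every circular shift of x2 (same-shape 2-D arrays)."""
    X = np.fft.fft2(x)
    X2 = np.fft.fft2(x2)
    cross = np.real(np.fft.ifft2(np.conj(X) * X2))      # F^-1(X* ⊙ X')
    dist2 = np.sum(x ** 2) + np.sum(x2 ** 2) - 2.0 * cross
    dist2 = np.maximum(dist2, 0.0)                       # clamp tiny negative round-off
    return np.exp(-dist2 / sigma ** 2)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 16))
k_xx = gaussian_correlation(x, x, sigma=10.0)
```

For identical inputs the zero-shift entry `k_xx[0, 0]` equals 1 (zero distance) and every other shift gives a smaller value.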

CSRT (Discriminative Correlation Filter with Channel and Spatial Reliability)

Improvements over KCF:

1. Spatial reliability map:
   - Learns which parts of target are most reliable
   - Reduces background interference

2. Channel reliability:
   - Weights different features (HOG, color)
   - Adapts to target appearance

Tracking API

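# Create tracker (choose one)
tracker = cv2.TrackerKCF_create()
# Alternatives:
# tracker = cv2.TrackerCSRT_create()
# tracker = cv2.TrackerMIL_create()
# In OpenCV 4.5.1+, BOOSTING, TLD, MEDIANFLOW and MOSSE are under cv2.legacy,
# e.g. cv2.legacy.TrackerMOSSE_create()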

# Initialize with bounding box
bbox = (x, y, width, height)
tracker.init(frame, bbox)

# Update in each frame
success, bbox = tracker.update(frame)

if success:
    x, y, w, h = [int(v) for v in bbox]
    cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), 2)

Multi-Object Tracking

# Manual approach (recommended)
trackers = []
for bbox in initial_bboxes:
    t = cv2.TrackerKCF_create()
    t.init(frame, bbox)
    trackers.append(t)

# Update all
for tracker in trackers:
    success, bbox = tracker.update(frame)
    if success:
        # Draw bounding box
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

Tracking + Detection Hybrid

Best Practice:

Detect objects periodically (every N frames)
Track between detections (fast)
Re-initialize trackers when detection available
Handle track-detection association (Hungarian algorithm)
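The association step can be sketched as follows. Note the swap: this example uses a simple greedy match on intersection-over-union (IoU) as a dependency-free stand-in for the Hungarian algorithm, which finds the globally optimal assignment. Boxes are assumed to be (x, y, w, h) tuples as in the tracking API above; the function names and the 0.3 IoU cut-off are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, min_iou=0.3):
    """Greedy matching: returns (matches, unmatched_tracks, unmatched_detections)."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < min_iou:
            break                      # remaining pairs overlap even less
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d

tracks = [(10, 10, 50, 50), (200, 200, 40, 40)]
detections = [(205, 198, 40, 40), (12, 11, 50, 50), (400, 50, 30, 30)]
matches, lost_tracks, new_detections = associate(tracks, detections)
```

Matched trackers would be re-initialized from their detections, unmatched detections would start new trackers, and unmatched tracks are candidates for removal.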

3. Text Detection and OCR

Text Detection and Recognition Pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                    OCR Pipeline                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input Image           Text Detection         Text Recognition    │
│                                                                     │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐     │
│   │ Hello      │       │ ┌─────────┐ │       │             │     │
│   │   World    │  ──▶  │ │ Hello   │ │  ──▶  │  "Hello"    │     │
│   │            │       │ └─────────┘ │       │  "World"    │     │
│   │ OpenCV     │  ──▶  │ ┌─────────┐ │  ──▶  │  "OpenCV"   │     │
│   │            │       │ │ World   │ │       │             │     │
│   └─────────────┘       │ └─────────┘ │       └─────────────┘     │
│                         │ ┌─────────┐ │                           │
│                         │ │ OpenCV  │ │                           │
│                         │ └─────────┘ │                           │
│                         └─────────────┘                            │
│                                                                     │
│   ┌───────────────────────────────────────────────────────────┐    │
│   │ Methods:                                                  │    │
│   │ • MSER: Fast text region detection                       │    │
│   │ • EAST: Deep learning text detection (scene text)        │    │
│   │ • Tesseract: OCR engine for character recognition        │    │
│   │ • EasyOCR: All-in-one detection + recognition            │    │
│   └───────────────────────────────────────────────────────────┘    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

MSER (Maximally Stable Extremal Regions)

What it does: Detects stable regions that often correspond to text.

Algorithm:

1. Threshold image at all levels (0-255)
2. Track connected components through levels
3. Find regions that are "stable" (area changes slowly)

Stability criterion:
  q(i) = |Qᵢ₊Δ - Qᵢ₋Δ| / |Qᵢ|

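Where Qᵢ = region at threshold i, |Qᵢ| = its area in pixels, and Δ = the threshold step. A region is maximally stable where q(i) reaches a local minimum.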
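A small worked example of the stability criterion. It assumes the regions are nested as the threshold rises, so the numerator reduces to a difference of areas; the area values are made up for illustration.

```python
def stability(areas, delta):
    """q(i) for each threshold i where both i - delta and i + delta exist.

    areas[i] is the pixel count |Q_i| of one nested region at threshold i.
    """
    q = {}
    for i in range(delta, len(areas) - delta):
        if areas[i] > 0:
            q[i] = abs(areas[i + delta] - areas[i - delta]) / areas[i]
    return q

# Hypothetical areas of one connected component as the threshold rises
areas = [10, 40, 95, 100, 102, 104, 105, 160, 300]
q = stability(areas, delta=2)
most_stable = min(q, key=q.get)   # threshold index with the smallest q(i)
```

Here the area barely changes around index 4 (95 → 102 → 105), so that threshold gives the lowest q and marks the stable region.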

mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)

# Filter by aspect ratio and size
text_candidates = []
for region in regions:
    x, y, w, h = cv2.boundingRect(region)
    aspect = w / float(h)
    if 0.1 < aspect < 10 and w > 10 and h > 10:
        # Likely text region
        text_candidates.append((x, y, w, h))

EAST Text Detector

Efficient and Accurate Scene Text detector using deep learning.

Network Output:

1. Score map: Probability of text at each location
2. Geometry: Rotated bounding box parameters
   - 4 distances (top, right, bottom, left)
   - 1 rotation angle

Usage:

# Load model
net = cv2.dnn.readNet("frozen_east_text_detection.pb")

# Output layer names
outputLayers = ["feature_fusion/Conv_7/Sigmoid",  # Scores
                "feature_fusion/concat_3"]         # Geometry

# Prepare input (width and height must be multiples of 32)
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94),
                             swapRB=True, crop=False)

net.setInput(blob)
scores, geometry = net.forward(outputLayers)

# Decode and apply NMS
# decode_predictions is a user-supplied helper, not an OpenCV function
boxes, confidences = decode_predictions(scores, geometry)
indices = cv2.dnn.NMSBoxesRotated(boxes, confidences, 0.5, 0.4)
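The `decode_predictions` helper used above is not provided by OpenCV. One possible sketch is shown below, under these assumptions: `scores` has shape (1, 1, H, W), `geometry` has shape (1, 5, H, W) with channels ordered top, right, bottom, left, angle (radians), and the output maps are 4 times smaller than the network input. It returns rotated boxes as ((cx, cy), (w, h), angle_in_degrees) tuples in network-input coordinates.

```python
import math
import numpy as np

def decode_predictions(scores, geometry, min_score=0.5):
    """scores: (1, 1, H, W); geometry: (1, 5, H, W). Returns rotated boxes and scores."""
    boxes, confidences = [], []
    height, width = scores.shape[2:4]
    for y in range(height):
        for x in range(width):
            score = float(scores[0, 0, y, x])
            if score < min_score:
                continue
            top, right, bottom, left, angle = (float(v) for v in geometry[0, :, y, x])
            # Assumption: output maps are 4x smaller than the network input
            offset_x, offset_y = x * 4.0, y * 4.0
            cos_a, sin_a = math.cos(angle), math.sin(angle)
            h, w = top + bottom, right + left
            # Reference corner of the rotated box, then two further corners
            ox = offset_x + cos_a * right + sin_a * bottom
            oy = offset_y - sin_a * right + cos_a * bottom
            p1 = (-sin_a * h + ox, -cos_a * h + oy)
            p3 = (-cos_a * w + ox, sin_a * w + oy)
            center = (0.5 * (p1[0] + p3[0]), 0.5 * (p1[1] + p3[1]))
            boxes.append((center, (w, h), -math.degrees(angle)))
            confidences.append(score)
    return boxes, confidences

# Synthetic check: a single confident cell at (x=2, y=3) with zero rotation
scores = np.zeros((1, 1, 4, 4))
geometry = np.zeros((1, 5, 4, 4))
scores[0, 0, 3, 2] = 0.9
geometry[0, :, 3, 2] = [4, 6, 4, 6, 0.0]
boxes, confidences = decode_predictions(scores, geometry)
```

Because the boxes are expressed in the 320×320 input space, they still need to be scaled by the ratio between the original image size and the network input size before drawing.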

OCR with Tesseract

Integration:

import cv2
import pytesseract

# pytesseract assumes RGB; convert images loaded with OpenCV (BGR)
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Simple usage
text = pytesseract.image_to_string(rgb)

# With configuration
config = '--oem 3 --psm 6'
text = pytesseract.image_to_string(rgb, config=config)

# Get character-level bounding boxes
boxes = pytesseract.image_to_boxes(rgb)

# Get detailed data (words, confidences, positions)
data = pytesseract.image_to_data(rgb, output_type=pytesseract.Output.DICT)

OEM (OCR Engine Mode):

Value	Description
0	Legacy engine only
1	Neural nets LSTM only
2	Legacy + LSTM
3	Default (based on available)

PSM (Page Segmentation Mode):

Value	Description
3	Fully automatic page segmentation
6	Assume single uniform block of text
7	Treat image as single text line
8	Treat image as single word
10	Treat image as single character

OCR Preprocessing

def preprocess_for_ocr(img):
    # 1. Grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # 2. Noise removal
    denoised = cv2.fastNlMeansDenoising(gray)

    # 3. Thresholding
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 4. Deskew (if needed)
    # 5. Rescale small text (2-3x)

    return binary

Tips:

Remove noise before OCR
Use adaptive threshold for uneven lighting
Upscale small text
Invert if dark background
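The last tip can be automated with a simple heuristic. This is a minimal sketch that assumes the image is already binarized and that the background covers more area than the text strokes; the function name and the 127 cut-off are illustrative choices.

```python
import numpy as np

def ensure_dark_text_on_light(binary):
    """Invert a binarised uint8 image if most of its pixels are dark."""
    # Assumption: the background covers more area than the text strokes
    if np.mean(binary) < 127:
        return 255 - binary
    return binary

# White text stroke on a black background (synthetic 8x8 example)
img = np.zeros((8, 8), dtype=np.uint8)
img[3:5, 1:7] = 255
fixed = ensure_dark_text_on_light(img)
```

Applied to an image that already has dark text on a light background, the function returns it unchanged.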

EasyOCR Alternative

import easyocr

# Create reader
reader = easyocr.Reader(['en'])  # Languages

# Read text
results = reader.readtext(image)

# Results: [(bbox, text, confidence), ...]
for bbox, text, conf in results:
    print(f"{text} ({conf:.2f})")