Module 8: Deep Learning (DNN)

Using deep neural networks for inference in OpenCV.

Topics Covered

  • DNN module overview
  • Model loading (TensorFlow, Caffe, ONNX, Darknet)
  • Blob preparation
  • Inference pipeline
  • Classification and detection

Algorithm Explanations

1. DNN Module Overview

What it does: Runs pre-trained neural networks for inference (not training).

DNN Inference Pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                     OpenCV DNN Inference                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌───────────┐    ┌───────────┐    ┌───────────┐    ┌──────────┐ │
│   │ Load      │    │ Create    │    │ Run       │    │ Post-    │ │
│   │ Model     │───▶│ Blob      │───▶│ Inference │───▶│ Process  │ │
│   │           │    │           │    │           │    │          │ │
│   └───────────┘    └───────────┘    └───────────┘    └──────────┘ │
│        │                │                │                │        │
│        ▼                ▼                ▼                ▼        │
│   .weights/.pb     blobFromImage    net.forward()    Parse        │
│   .cfg/.onnx       (normalize,      (GPU/CPU)        outputs      │
│                    resize)                                         │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐  │
│   │          OpenCV handles framework differences               │  │
│   │   TensorFlow ←→ Caffe ←→ ONNX ←→ Darknet ←→ PyTorch        │  │
│   └─────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Supported Frameworks:

| Framework    | Model File      | Config File       |
|--------------|-----------------|-------------------|
| TensorFlow   | .pb             | .pbtxt (optional) |
| Caffe        | .caffemodel     | .prototxt         |
| Darknet/YOLO | .weights        | .cfg              |
| ONNX         | .onnx           | -                 |
| PyTorch      | via ONNX export | -                 |

Backends:

| Backend                       | Target  | Description             |
|-------------------------------|---------|-------------------------|
| DNN_BACKEND_OPENCV            | CPU     | Default, pure OpenCV    |
| DNN_BACKEND_CUDA              | GPU     | NVIDIA GPU acceleration |
| DNN_BACKEND_INFERENCE_ENGINE  | CPU/GPU | Intel OpenVINO          |


2. Blob Format

What it does: Converts an image into the tensor format a neural network expects.

Blob Transformation Visualization:

┌─────────────────────────────────────────────────────────────────────┐
│                  blobFromImage() Transformation                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input Image (HWC)                    Output Blob (NCHW)          │
│   OpenCV format                        Neural network format       │
│                                                                     │
│   ┌───────────────┐                    ┌─────────────────────┐     │
│   │ ┌───────────┐ │                    │ Batch 0             │     │
│   │ │   Blue    │ │                    │ ┌───┬───┬───┐       │     │
│   │ │  Channel  │ │                    │ │ R │ G │ B │       │     │
│   │ ├───────────┤ │   blobFromImage()  │ │   │   │   │       │     │
│   │ │   Green   │ │   ───────────────▶ │ │ C │ C │ C │       │     │
│   │ │  Channel  │ │   • resize         │ │ h │ h │ h │       │     │
│   │ ├───────────┤ │   • scale          │ │ a │ a │ a │       │     │
│   │ │   Red     │ │   • mean subtract  │ │ n │ n │ n │       │     │
│   │ │  Channel  │ │   • swap R↔B       │ │   │   │   │       │     │
│   │ └───────────┘ │                    │ └───┴───┴───┘       │     │
│   │    H × W × 3  │                    │   1 × 3 × H × W     │     │
│   └───────────────┘                    └─────────────────────┘     │
│                                                                     │
│   Shape: (480, 640, 3)       →         Shape: (1, 3, 224, 224)     │
│   Range: [0, 255]            →         Range: [0.0, 1.0] or norm   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

NCHW Format:

N = Batch size
C = Channels (3 for RGB)
H = Height
W = Width

Shape: (1, 3, 224, 224) for typical ImageNet input

blobFromImage Parameters:

blob = cv2.dnn.blobFromImage(
    image,          # Input image (BGR)
    scalefactor,    # Pixel value scaling (e.g., 1/255)
    size,           # Output dimensions (width, height)
    mean,           # Mean subtraction values (B, G, R)
    swapRB,         # Swap R and B channels (BGR→RGB)
    crop            # Center crop to size
)

Common Preprocessing:

| Model    | scalefactor | size       | mean                       | swapRB |
|----------|-------------|------------|----------------------------|--------|
| ImageNet | 1/255       | (224, 224) | (0, 0, 0)                  | True   |
| VGG      | 1.0         | (224, 224) | (103.939, 116.779, 123.68) | False  |
| SSD      | 1.0         | (300, 300) | (104, 177, 123)            | False  |
| YOLO     | 1/255       | (416, 416) | (0, 0, 0)                  | True   |
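
To make the transformation concrete, here is a rough NumPy sketch of what blobFromImage does, assuming the image is already resized to the target size (the real function also resizes, which this sketch skips). The function name blob_from_image is ours, not OpenCV's:

```python
import numpy as np

def blob_from_image(image, scalefactor=1.0, mean=(0, 0, 0), swap_rb=False):
    """Rough NumPy equivalent of cv2.dnn.blobFromImage for a pre-resized image.

    image: HxWx3 uint8 array in BGR order (OpenCV convention).
    Returns a float32 blob of shape (1, 3, H, W).
    """
    x = image.astype(np.float32)
    x -= np.array(mean, dtype=np.float32)   # per-channel mean subtraction
    x *= scalefactor                        # e.g. 1/255 to map into [0, 1]
    if swap_rb:
        x = x[:, :, ::-1]                   # BGR -> RGB
    x = x.transpose(2, 0, 1)                # HWC -> CHW
    return x[np.newaxis, ...]               # add batch dimension -> NCHW

# A dummy 224x224 gray "image" in BGR order:
img = np.full((224, 224, 3), 128, dtype=np.uint8)
blob = blob_from_image(img, scalefactor=1/255.0, swap_rb=True)
# blob.shape == (1, 3, 224, 224), values near 0.5
```

Note that OpenCV interprets the mean triplet relative to the swapRB flag, so when reproducing a model's preprocessing it is safest to pass the values straight to cv2.dnn.blobFromImage rather than hand-roll it.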


3. Inference Pipeline

Step-by-Step:

# 1. Load model
net = cv2.dnn.readNet('model.weights', 'model.cfg')

# 2. Set backend/target (optional)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# 3. Prepare input
blob = cv2.dnn.blobFromImage(image, 1/255.0, (416, 416), swapRB=True)

# 4. Set input
net.setInput(blob)

# 5. Forward pass
output = net.forward()  # Single output
# or
outputs = net.forward(output_layer_names)  # Multiple outputs

# 6. Post-process results

Getting Output Layer Names:

layer_names = net.getLayerNames()
# getUnconnectedOutLayers() returns 1-based indices. In recent OpenCV 4.x
# they come back as plain ints; in older releases they are wrapped in
# 1-element arrays, so use i[0] - 1 there instead.
output_layers = [layer_names[i - 1] for i in net.getUnconnectedOutLayers()]

4. Classification

What it does: Assigns an image to one of N categories.

Classification Pipeline:

┌─────────────────────────────────────────────────────────────────────┐
│                    Image Classification                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input Image         Neural Network           Output Vector       │
│                                                                     │
│   ┌───────────┐      ┌─────────────┐      ┌───────────────────┐   │
│   │   🐱     │      │ ┌─────────┐ │      │  cat:     0.92   │   │
│   │   Cat     │  ──▶ │ │ Conv   │ │  ──▶ │  dog:     0.05   │   │
│   │  Image    │      │ ├─────────┤ │      │  bird:    0.02   │   │
│   │           │      │ │ Conv   │ │      │  car:     0.01   │   │
│   └───────────┘      │ ├─────────┤ │      │  ...             │   │
│                       │ │  FC    │ │      │                   │   │
│   224×224×3          │ ├─────────┤ │      │  N classes        │   │
│                       │ │Softmax │ │      │  (probabilities)  │   │
│                       │ └─────────┘ │      └───────────────────┘   │
│                       └─────────────┘                              │
│                                                                     │
│   argmax() → class_id = 0 (cat)                                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Output: Probability vector of shape (1, N)

Processing:

blob = cv2.dnn.blobFromImage(image, 1/255.0, (224, 224), swapRB=True)
net.setInput(blob)
predictions = net.forward()

# Get top prediction
class_id = np.argmax(predictions[0])
confidence = predictions[0][class_id]

Softmax (if not applied in model):

softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
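
The formula above can be applied directly to raw logits; a minimal NumPy sketch, using the standard max-shift for numerical stability (subtracting a constant from every logit leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - np.max(x))   # shift by max(x) to avoid overflow in exp
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
# probs sums to 1; the largest logit maps to the largest probability
```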

5. Object Detection (YOLO)

YOLO Detection Concept:

┌─────────────────────────────────────────────────────────────────────┐
│                    YOLO: You Only Look Once                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Input Image               Grid Division        Per-Cell Output   │
│                                                                     │
│   ┌─────────────┐          ┌───┬───┬───┐       Each cell predicts: │
│   │   🚗       │          │   │ 🚗│   │       • B bounding boxes  │
│   │  ┌───┐     │   ───▶   ├───┼───┼───┤       • Confidence scores │
│   │  │car│     │   S×S    │   │   │   │       • C class probs     │
│   │  └───┘     │   grid   ├───┼───┼───┤                            │
│   │     🐕    │          │   │   │ 🐕│                            │
│   └─────────────┘          └───┴───┴───┘                            │
│                                                                     │
│   Single forward pass → detect all objects at once (fast!)         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

YOLO Output Vector:

┌─────────────────────────────────────────────────────────────────────┐
│                    Detection Output Format                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Each detection = [cx, cy, w, h, obj, c1, c2, c3, ..., cN]        │
│                                                                     │
│   ┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐          │
│   │ cx  │ cy  │  w  │  h  │ obj │ c1  │ c2  │ c3  │ ... │          │
│   └──┬──┴──┬──┴──┬──┴──┬──┴──┬──┴──┬──┴─────┴─────┴─────┘          │
│      │     │     │     │     │     │                                │
│      │     │     │     │     │     └── Class probabilities          │
│      │     │     │     │     │         (person, car, dog, ...)      │
│      │     │     │     │     │                                      │
│      │     │     │     │     └── Objectness (P(object))             │
│      │     │     │     │                                            │
│      │     │     └─────┴── Box size (normalized 0-1)                │
│      │     │                                                        │
│      └─────┴── Box center (normalized 0-1)                          │
│                                                                     │
│   Final confidence = objectness × class_probability                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Output Structure (per detection):

[center_x, center_y, width, height, objectness, class_1_prob, class_2_prob, ...]

Processing:

for detection in output:
    scores = detection[5:]
    class_id = np.argmax(scores)
    confidence = scores[class_id] * detection[4]  # objectness × class_prob

    if confidence > threshold:
        center_x = int(detection[0] * width)
        center_y = int(detection[1] * height)
        w = int(detection[2] * width)
        h = int(detection[3] * height)

        x = center_x - w // 2
        y = center_y - h // 2

Non-Maximum Suppression:

indices = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold, nms_threshold)
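
cv2.dnn.NMSBoxes does this internally; to make the algorithm concrete, here is a pure-NumPy sketch of greedy NMS over [x, y, w, h] boxes (the same top-left + size convention the YOLO post-processing above produces). The function name nms is ours:

```python
import numpy as np

def nms(boxes, scores, score_thr=0.5, iou_thr=0.4):
    """Greedy non-maximum suppression; returns kept indices, best score first."""
    boxes = np.asarray(boxes, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    order = np.argsort(-scores)                 # highest score first
    order = order[scores[order] > score_thr]    # drop low-confidence boxes
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with each remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 0] + boxes[i, 2], boxes[rest, 0] + boxes[rest, 2])
        y2 = np.minimum(boxes[i, 1] + boxes[i, 3], boxes[rest, 1] + boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        union = boxes[i, 2] * boxes[i, 3] + boxes[rest, 2] * boxes[rest, 3] - inter
        iou = inter / np.maximum(union, 1e-9)
        order = rest[iou <= iou_thr]            # suppress heavy overlaps with box i
    return keep

# Two heavily overlapping boxes and one distant box:
boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 10, 10]]
scores = [0.9, 0.8, 0.7]
# nms(boxes, scores) keeps box 0 (suppressing box 1) and box 2
```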

6. SSD Detection Output

SSD vs YOLO Output Comparison:

┌─────────────────────────────────────────────────────────────────────┐
│                    Detection Output Formats                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   YOLO Output:                                                      │
│   ┌────────────────────────────────────────────────┐               │
│   │ [cx, cy, w, h, obj, class_probs...]            │  Relative     │
│   │  └──normalized 0-1──┘                          │  coords       │
│   └────────────────────────────────────────────────┘               │
│                                                                     │
│   SSD Output:                                                       │
│   ┌────────────────────────────────────────────────┐               │
│   │ [batch, class, conf, x1, y1, x2, y2]           │  Corner       │
│   │                       └──normalized 0-1──┘     │  coords       │
│   └────────────────────────────────────────────────┘               │
│                                                                     │
│   Key Differences:                                                  │
│   • YOLO: center + width/height                                    │
│   • SSD: top-left + bottom-right corners                           │
│   • Both normalized to [0, 1]                                      │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Output Format: (1, 1, N, 7) where each detection is:

[batch_id, class_id, confidence, x1, y1, x2, y2]

Coordinates are normalized [0, 1].

Processing:

for detection in output[0, 0]:
    confidence = detection[2]
    if confidence > threshold:
        class_id = int(detection[1])
        x1 = int(detection[3] * width)
        y1 = int(detection[4] * height)
        x2 = int(detection[5] * width)
        y2 = int(detection[6] * height)

7. Performance Optimization

Profiling:

t, _ = net.getPerfProfile()
time_ms = t * 1000 / cv2.getTickFrequency()

Optimization Strategies:

  1. Use GPU:
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    
  2. Reduce Input Size:
    • Smaller blobs = faster inference
    • Trade-off with accuracy
  3. Batch Processing:
    blob = cv2.dnn.blobFromImages(images, ...)  # Multiple images
    
  4. Use FP16 (if supported):
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA_FP16)
    
  5. Model Optimization:
    • Quantization (INT8)
    • Pruning
    • Knowledge distillation
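
To illustrate strategy 3, batching amounts to stacking per-image blobs along the batch axis; a NumPy sketch of the shape cv2.dnn.blobFromImages produces, assuming all images are pre-resized to the same size (the helper name blobs_from_images is ours):

```python
import numpy as np

def blobs_from_images(images, scalefactor=1/255.0):
    """Stack N same-sized HxWx3 images into one (N, 3, H, W) batch blob."""
    batch = np.stack([img.astype(np.float32) * scalefactor for img in images])
    return batch.transpose(0, 3, 1, 2)          # NHWC -> NCHW

imgs = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)]
batch_blob = blobs_from_images(imgs)            # shape (4, 3, 224, 224)
```

A single forward pass over the batch then yields one output row per image, which usually amortizes per-call overhead better than four separate passes.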

8. Common Architectures

Classification:

| Model        | Size   | Speed  | Accuracy  | Use Case        |
|--------------|--------|--------|-----------|-----------------|
| MobileNet    | Small  | Fast   | Good      | Mobile/embedded |
| ResNet       | Large  | Medium | Excellent | High accuracy   |
| EfficientNet | Medium | Medium | Best      | Balanced        |

Detection:

| Model        | Speed | Accuracy  | Use Case      |
|--------------|-------|-----------|---------------|
| YOLO v3-v8   | Fast  | Good      | Real-time     |
| SSD          | Fast  | Good      | Real-time     |
| Faster R-CNN | Slow  | Excellent | High accuracy |

Segmentation:

| Model   | Type     | Use Case     |
|---------|----------|--------------|
| FCN     | Semantic | General      |
| U-Net   | Instance | Medical      |
| DeepLab | Semantic | High quality |


Tutorial Files

| File             | Description                                 |
|------------------|---------------------------------------------|
| 01_dnn_basics.py | Loading models, blob preparation, inference |

Key Functions Reference

| Function                          | Description                |
|-----------------------------------|----------------------------|
| cv2.dnn.readNet()                 | Auto-detect and load model |
| cv2.dnn.readNetFromDarknet()      | Load Darknet/YOLO          |
| cv2.dnn.readNetFromTensorflow()   | Load TensorFlow            |
| cv2.dnn.readNetFromCaffe()        | Load Caffe                 |
| cv2.dnn.readNetFromONNX()         | Load ONNX                  |
| cv2.dnn.blobFromImage()           | Create input blob          |
| net.setInput()                    | Set network input          |
| net.forward()                     | Run inference              |
| net.setPreferableBackend()        | Set computation backend    |
| net.setPreferableTarget()         | Set target device          |
| cv2.dnn.NMSBoxes()                | Non-max suppression        |

Further Reading