A complete, end-to-end machine learning pipeline — from raw video capture to real-time inference on a conveyor belt robotic system.
The end goal of this project is a vision system that feeds a robotic arm: it must detect, classify, and track custom wooden objects moving on a physical conveyor belt, at real-time inference speeds and with sufficient positional accuracy to guide pick-and-place operations. The conveyor belt is fully 3D printed in PETG, using a modular, open-source design from MakerWorld.
The target objects for recognition are the wooden game pieces from the Catan board game (see reference image below). The four custom object classes (Church, House, Bar, and Dice) are non-standard, small-scale, and visually similar under certain lighting conditions. They do not appear among the classes of common pre-trained detectors, which makes a zero-shot approach with such a model impractical and calls for a purpose-built, fine-tuned model.
Reference: The Catan board game. The target wooden pieces (Church, House, Bar) are scattered across the board.
The first step of any robust ML project is establishing a baseline — understanding exactly what the off-the-shelf model cannot do. Running the base YOLOv8n (Nano) model on our conveyor belt scene produced spectacularly wrong results.
The model, trained on COCO's 80 generic classes, had no concept of a "Church"-shaped wooden block. Instead, it mapped visual features to the nearest object in its training distribution, confidently misclassifying the wooden shapes as "scissors," "surfboards," and "knives." A human hand in the scene was tagged as a "person." Bounding boxes were erratic, appearing and disappearing from frame to frame with no temporal consistency.
Baseline inference using pre-trained YOLOv8n. Note the severe misclassifications ("scissors", "surfboard") and erratic, temporally inconsistent bounding boxes.
With no existing dataset for these objects, the entire training corpus had to be built from scratch. The approach was pragmatic: capture video of the actual physical setup — same lighting, same camera angle, same conveyor belt surface — and extract frames at a rate that captures sufficient class diversity without redundancy.
Frames were extracted from a recording of the actual conveyor belt scene. This ensures the model trains on the same domain it will be deployed in, a critical practice that minimizes domain shift between training and inference.
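The sampling idea can be sketched as follows. This is a minimal illustration, not the script used in the project: the function name `frames_to_keep` and the clip length, frame rate, and sampling rate are all hypothetical. It only computes which frame indices to save; the actual decoding would be done with a video library such as OpenCV.

```python
def frames_to_keep(total_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Return the indices of the frames to save so that roughly
    `target_fps` frames are kept per second of recorded video."""
    if total_frames < 0 or video_fps <= 0 or target_fps <= 0:
        raise ValueError("frame count must be >= 0 and rates must be > 0")
    # Keep every `step`-th frame; never sample faster than the recording itself.
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))


# Hypothetical numbers: a 10-second clip at 30 fps, sampled at 2 frames per second.
indices = frames_to_keep(total_frames=300, video_fps=30.0, target_fps=2.0)
print(len(indices), indices[:3])
```

A lower sampling rate reduces near-duplicate frames, at the cost of fewer examples per recording.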
The raw frames were imported into Roboflow, where each object instance was manually annotated by drawing precise bounding boxes and assigning class labels. Each annotation becomes a ground truth data point that the model learns to replicate.
Manual bounding box annotation of custom classes (Church, Bar, Dice, House) within Roboflow. Each shape required individual labeling across all collected frames.
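For readers unfamiliar with how such annotations are stored, here is a small sketch of converting one pixel-space bounding box into the plain-text YOLO label format (class index followed by normalized center x, center y, width, and height). It assumes the dataset is exported in that format, which the text above does not state explicitly; the class ordering in `CLASS_NAMES` and the example box are hypothetical.

```python
# Assumed class ordering; the real order is defined by the exported dataset config.
CLASS_NAMES = ["Bar", "Church", "Dice", "House"]


def to_yolo_line(class_name, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel-space box into one YOLO label line:
    '<class_id> <x_center> <y_center> <width> <height>', all normalized to 0..1."""
    class_id = CLASS_NAMES.index(class_name)
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical annotation: a House occupying pixels (100, 200) to (180, 260) in a 640x640 frame.
line = to_yolo_line("House", 100, 200, 180, 260, img_w=640, img_h=640)
print(line)
```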
Dataset label distribution: Church (276), Bar (213), House (195), Dice (57). The Dice class imbalance was partially addressed through augmentation.
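To put the imbalance in numbers, the sketch below computes each class's share of the labels and a simple inverse-frequency weight from the counts listed above. The weighting is shown only as a way to quantify how under-represented Dice is; it is not something the project applied, since the imbalance was handled through augmentation.

```python
# Label counts taken from the dataset distribution above.
counts = {"Church": 276, "Bar": 213, "House": 195, "Dice": 57}


def class_shares(label_counts):
    """Fraction of all labeled instances that belong to each class."""
    total = sum(label_counts.values())
    return {name: n / total for name, n in label_counts.items()}


def inverse_frequency_weights(label_counts):
    """Weight = total / (num_classes * count): 1.0 for a perfectly balanced
    class, above 1.0 for under-represented classes."""
    total = sum(label_counts.values())
    num_classes = len(label_counts)
    return {name: total / (num_classes * n) for name, n in label_counts.items()}


shares = class_shares(counts)
weights = inverse_frequency_weights(counts)
imbalance_ratio = max(counts.values()) / min(counts.values())

for name in counts:
    print(f"{name:<7} share={shares[name]:.1%}  weight={weights[name]:.2f}")
print(f"largest-to-smallest class ratio: {imbalance_ratio:.2f}")
```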
A fundamental challenge in custom dataset training is overfitting: with a small dataset, the model can memorize training examples rather than learn generalizable features. A standard mitigation is data augmentation, which synthetically expands the dataset by applying controlled transformations.
To ensure robust generalization, the augmentation and preprocessing strategy was configured directly in Roboflow to mirror the specific constraints of the physical conveyor belt environment:
Roboflow Dataset v2 configuration dashboard showing preprocessing parameters, augmentation limits, and the 94% / 6% training-to-validation split.
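The 94% / 6% split was produced by Roboflow. Purely as an illustration of what such a split does, here is a seeded shuffle-and-cut sketch; the function `split_dataset`, the seed, and the file names are hypothetical and not part of the project.

```python
import random


def split_dataset(items, train_fraction=0.94, seed=0):
    """Shuffle deterministically, then cut into (train, validation) lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the extracted frames.
image_names = [f"frame_{i:04d}.jpg" for i in range(100)]
train, val = split_dataset(image_names)
print(len(train), len(val))
```

Splitting before augmenting matters in general: augmented copies of a validation image should not leak into the training set.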
Generating synthetic training data using noise and motion blur augmentations. The dark vignette and grain simulate low-light and motion conditions on the physical belt.
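The two augmentations named in the caption can be illustrated on a toy grayscale image. This is a simplified sketch of the general idea, not Roboflow's implementation: the image is a plain list of pixel rows, the blur is a horizontal box filter, and the kernel size and noise level are hypothetical.

```python
import random


def horizontal_motion_blur(image, kernel_size):
    """Average each pixel with its horizontal neighbours (box kernel).
    Pixels near the border average over the neighbours that exist."""
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    blurred = []
    for row in image:
        new_row = []
        for x in range(len(row)):
            window = row[max(0, x - half): x + half + 1]
            new_row.append(round(sum(window) / len(window)))
        blurred.append(new_row)
    return blurred


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


# Toy 3x5 image: one bright vertical stripe on a dark background.
image = [[0, 0, 255, 0, 0] for _ in range(3)]
blurred = horizontal_motion_blur(image, kernel_size=3)
noisy = add_gaussian_noise(blurred, sigma=10.0, rng=random.Random(0))
print(blurred[0])
```

The blur smears the stripe sideways, much as belt motion smears an object during exposure, and the noise roughly imitates sensor grain in low light.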
Training was executed on Google Colab using a GPU runtime, leveraging the Ultralytics YOLOv8 framework. The chosen architecture was YOLOv8 Nano (YOLOv8n), the smallest model in the YOLOv8 family, deliberately selected to maximize inference speed on the target edge hardware while still achieving sufficient accuracy for the task.
# Fine-tuning command (Ultralytics CLI)
yolo task=detect mode=train \
model=yolov8n.pt \
data=dataset.yaml \
epochs=100 \
imgsz=640 \
batch=16 \
patience=20
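The `patience=20` setting in the command above enables early stopping: training halts if the validation metric has not improved for that many epochs. The sketch below shows the general logic under simplified assumptions; it is a generic illustration, not the Ultralytics implementation, and the score history is made up.

```python
def early_stopping_epoch(scores, patience):
    """Return the 0-based epoch at which training would stop, or None if
    it runs to the end. `scores` holds one validation metric per epoch,
    where higher is better."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs in a row
    return None


# Made-up metric history: improvement stalls after epoch 2.
history = [0.50, 0.60, 0.70, 0.70, 0.69, 0.68]
stop_at = early_stopping_epoch(history, patience=3)
print(stop_at)
```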
The training curves tell a clear story: both training and validation losses (box, classification, DFL) converge rapidly within the first 20 epochs and stabilize with no signs of divergence or overfitting. Precision and Recall climb to near-ceiling values. The mAP@50 reaches ~97% and mAP@50-95 exceeds 90%, indicating robust detection across a range of IoU thresholds.
Training results over 100 epochs. Classification and box losses converge rapidly; mAP@50 stabilises at roughly 0.97.
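Both mAP figures rest on Intersection over Union (IoU), the overlap ratio between a predicted box and its ground-truth box. mAP@50 counts a detection as correct when IoU is at least 0.5, while mAP@50-95 averages over thresholds from 0.50 to 0.95. A minimal IoU sketch with hypothetical box coordinates:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: the prediction is shifted 10 px from the ground truth.
ground_truth = (100, 100, 200, 200)
prediction = (110, 110, 210, 210)
score = iou(ground_truth, prediction)
print(f"IoU = {score:.3f}")
```

In this example the prediction would count as a hit at the 0.5 threshold but as a miss at 0.75, which is why mAP@50-95 is the stricter of the two metrics.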
Normalized Confusion Matrix. Church: 1.00, Bar: 0.93, House: 0.92, Dice: 0.50. The lower Dice score most likely reflects class imbalance (57 instances versus 195 to 276 for the other classes).
The fine-tuned .pt weight file was deployed locally using OpenCV for camera feed capture and the Ultralytics inference API for frame-by-frame detection. The key upgrade over basic detection is the addition of a multi-object tracking algorithm.
Rather than detecting objects independently in each frame (which produces flickering boxes and no temporal consistency), a tracking algorithm like ByteTrack assigns a persistent tracking ID to each detected object across frames. This means the system knows that object #3 (a "House") at position X in frame 150 is the same object at position X+5 in frame 151 — enabling trajectory prediction and confident hand-off to the robotic arm controller.
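To make the idea of persistent IDs concrete, here is a deliberately simple tracker that matches each detection to the nearest centroid from the previous frame. It is a greedy nearest-centroid sketch, not ByteTrack, which is considerably more sophisticated. The class name `CentroidTracker`, the distance threshold, and the coordinates are hypothetical.

```python
import math


class CentroidTracker:
    """Greedy nearest-centroid tracker: keeps an ID while an object moves
    less than `max_distance` pixels between consecutive frames."""

    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance
        self.next_id = 1
        self.tracks = {}  # track_id -> last known (x, y) centroid

    def update(self, centroids):
        """Assign an ID to every centroid of the current frame."""
        assigned = {}
        unmatched = dict(self.tracks)  # tracks not yet claimed in this frame
        for centroid in centroids:
            best_id, best_dist = None, self.max_distance
            for track_id, previous in unmatched.items():
                dist = math.dist(centroid, previous)
                if dist <= best_dist:
                    best_id, best_dist = track_id, dist
            if best_id is None:
                best_id = self.next_id  # nothing close enough: start a new track
                self.next_id += 1
            else:
                del unmatched[best_id]
            assigned[best_id] = centroid
        self.tracks = assigned  # tracks with no match this frame are dropped
        return assigned


tracker = CentroidTracker(max_distance=20.0)
frame_150 = tracker.update([(100.0, 50.0), (300.0, 50.0)])
# Both objects move 5 px along the belt; detection order is shuffled on purpose.
frame_151 = tracker.update([(305.0, 50.0), (105.0, 50.0)])
# A third object enters the scene and receives a fresh ID.
frame_152 = tracker.update([(110.0, 50.0), (310.0, 50.0), (500.0, 50.0)])
print(frame_151, frame_152)
```

Because each ID follows its object from frame to frame, the per-frame displacement (5 px here) gives a velocity estimate that a downstream controller could extrapolate.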
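# Inference with tracking
from ultralytics import YOLO
import cv2

model = YOLO("best.pt")    # fine-tuned weights
cap = cv2.VideoCapture(0)  # webcam / camera feed

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:  # stop when the camera returns no frame
        break
    # persist=True keeps track IDs alive between successive calls
    results = model.track(frame, persist=True, tracker="bytetrack.yaml")
    annotated = results[0].plot()  # draw boxes, labels, and track IDs
    cv2.imshow("Conveyor Tracking", annotated)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # waitKey refreshes the window; q quits
        break
cap.release()
cv2.destroyAllWindows()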
Final deployed model. The fine-tuned YOLOv8n accurately classifies custom wooden objects, assigns persistent tracking IDs (blue text), and maintains high confidence scores even while the user interacts with the scene — demonstrating robustness to partial occlusion.