Editor's Note: Tryolabs is a technology consulting company specializing in machine learning and natural language processing. In this article, the company's researchers introduce Faster R-CNN, an advanced object detection model used in their research, covering its architecture and implementation principles.
Previously, we explained what object detection is and how it is applied in deep learning. Last year, we decided to explore Faster R-CNN in depth. By studying the original paper and related references, we gained a clear understanding of how it functions and how it can be deployed.
We later implemented Faster R-CNN in Luminoth, a computer vision toolkit based on TensorFlow, and shared our findings at the ODSC Europe and ODSC West (Open Data Science Conference) events, where they received significant attention.
Based on our research and development efforts with Luminoth, as well as the results we shared, we believe it’s important to document our findings in a blog post for our readers.
Background
Faster R-CNN was first introduced at NIPS 2015 and has since undergone several revisions. It is the third iteration of the R-CNN series developed by Ross Girshick’s team.
In 2014, the first R-CNN paper, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," introduced a method that used Selective Search to propose regions of interest (RoIs) and a standard convolutional neural network to classify and refine them. In early 2015, R-CNN evolved into Fast R-CNN, which introduced ROI Pooling to share computation and speed up the model. Finally, Faster R-CNN was proposed as the first fully end-to-end trainable model for object detection.
Architecture
The architecture of Faster R-CNN is quite complex due to its multiple components. We’ll start with a general overview and then go into detail about each part.
The core problem is to extract from an image:
- A list of bounding boxes
- Labels for each bounding box
- Probabilities for each label and bounding box
[Image: Faster R-CNN schematic]
The input image is represented as a height × width × depth tensor (a multidimensional array), which is passed through a pre-trained CNN up to an intermediate layer to obtain a convolutional feature map. This map serves as the feature extractor for the next stage.
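To make this step concrete, here is a minimal sketch in TensorFlow/Keras of using a pre-trained network as a feature extractor. The choice of VGG16 and of the block5_conv3 layer is an assumption for illustration; the base network is configurable in practice.

```python
import tensorflow as tf

# Minimal sketch: take VGG16 pre-trained on ImageNet and expose an
# intermediate convolutional layer as the feature extractor.
# 'block5_conv3' sits after four pooling layers, so its output is
# subsampled by a factor of 16 relative to the input image.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block5_conv3").output)

image = tf.random.uniform((1, 600, 800, 3))  # a batch with one RGB image
feature_map = extractor(image)               # shape: (1, 37, 50, 512)
```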
This technique is widely used in transfer learning, especially when leveraging weights trained on large-scale datasets to train models on smaller ones.
Next, we use the Region Proposal Network (RPN) to identify potential regions that may contain objects. The RPN takes the CNN features and generates a set of candidate bounding boxes.
One of the biggest challenges in using deep learning for object detection is generating variable-length bounding box lists. Most deep neural networks produce fixed-size outputs. For example, in image classification, the output is typically a tensor of size (N,), where N is the number of classes.
To address this, Faster R-CNN uses anchors—fixed reference bounding boxes placed uniformly across the image. Instead of directly detecting objects, the problem is split into two parts: determining whether an anchor contains an object and adjusting the anchor to better fit the object.
After obtaining a list of possible objects and their locations in the image, the next step is to use ROI pooling to extract a fixed-size feature tensor for each object from the CNN feature map.
Finally, the R-CNN module classifies the content within the bounding box and adjusts the coordinates to better match the object.
Although this description is brief, it captures the overall workflow of Faster R-CNN. In the following sections, we will delve deeper into the architecture, loss functions, and training process for each component.
Basic Network
As mentioned earlier, the first step involves using a CNN pre-trained on a classification task (such as ImageNet) and extracting features from an intermediate layer. For those familiar with machine learning, this might seem straightforward, but understanding how and why it works is crucial.
Visualizing the outputs of intermediate layers can also help in understanding what the network has learned.
There is no universally best network architecture. Early R-CNN used ZF and VGG, but many other networks have since been developed. For example, MobileNet is a lightweight, efficient network with around 3.3 million parameters, while ResNet-152 has approximately 60 million parameters. Newer architectures like DenseNet have improved performance while reducing parameter counts.
VGG
Before discussing the pros and cons, let’s take VGG-16 as an example to understand how these networks operate.
[Image: VGG architecture]
VGG, named after Oxford's Visual Geometry Group, which used it in the ImageNet ILSVRC 2014 competition, was introduced in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition." While not considered "very deep" by today's standards, it significantly increased the number of layers compared to previous models, helping start the trend of "deeper → more capacity → better" performance.
When using VGG for classification, the input is a 224×224×3 tensor (an RGB image of fixed size). The fixed size is required because the output of the last convolutional layer is flattened before being fed into the fully connected layers, so its dimensions must be known in advance.
Since we only use the output of an intermediate convolutional layer, the input size no longer matters: convolutional layers work on any spatial dimensions, so we don't need to worry about this constraint.
ResNet vs VGG
Today, most people use ResNet as the base network for feature extraction instead of VGG. Kaiming He, Shaoqing Ren, and Jian Sun, the co-authors of Faster R-CNN, also co-authored the original ResNet paper.
ResNet offers several advantages over VGG, including better performance on both classification and object detection tasks. It also makes it easier to train very deep networks thanks to residual connections and batch normalization, which VGG lacks.
Anchors
Once we have the processed image, the next step is to find region proposals (RoIs) for classification. As mentioned, anchors are a key concept in solving the variable-length problem.
Our goal is to detect bounding boxes of different sizes and aspect ratios. One approach would be to train a network that directly predicts four values per object: xmin, ymin, xmax, ymax. However, this approach has several issues, such as difficulty handling varying image sizes and ensuring valid predictions.
Instead, we predict offsets relative to a reference box, which makes the task easier to learn. Anchors are fixed bounding boxes placed across the image that serve as these references for initial object localization.
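Concretely, the paper parameterizes these offsets relative to the anchor's center and size. A small NumPy sketch of that encoding:

```python
import numpy as np

def encode_boxes(anchors, gt_boxes):
    # Both arrays hold rows of (x_min, y_min, x_max, y_max).
    # Returns the (t_x, t_y, t_w, t_h) regression targets from the paper.
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    ax = anchors[:, 0] + 0.5 * aw
    ay = anchors[:, 1] + 0.5 * ah

    gw = gt_boxes[:, 2] - gt_boxes[:, 0]
    gh = gt_boxes[:, 3] - gt_boxes[:, 1]
    gx = gt_boxes[:, 0] + 0.5 * gw
    gy = gt_boxes[:, 1] + 0.5 * gh

    # Center offsets are normalized by the anchor size, and scale
    # changes are predicted in log space, keeping targets well-behaved.
    tx = (gx - ax) / aw
    ty = (gy - ay) / ah
    tw = np.log(gw / aw)
    th = np.log(gh / ah)
    return np.stack([tx, ty, tw, th], axis=1)
```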
Since we are working with a convolutional feature map, anchor centers are spaced at intervals determined by the network's subsampling ratio. In VGG, this ratio is 16, so there is one set of anchors every 16 pixels of the original image.
[Image: Anchor centers mapped onto the original image]
For better coverage, we define a set of sizes (e.g., 64px, 128px, 256px) and aspect ratios between width and height (e.g., 0.5, 1, 1.5), and use all of their combinations.
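A simple (unvectorized) sketch of generating such a grid of anchors, using the example stride, sizes, and ratios mentioned above; real implementations make all of these configurable:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     sizes=(64, 128, 256), ratios=(0.5, 1, 1.5)):
    # One anchor per (size, ratio) combination at every feature-map cell,
    # expressed in original-image coordinates (x_min, y_min, x_max, y_max).
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this cell, mapped back to the original image.
            cx = x * stride + stride // 2
            cy = y * stride + stride // 2
            for size in sizes:
                for ratio in ratios:
                    # ratio = width / height; area stays close to size^2.
                    w = size * np.sqrt(ratio)
                    h = size / np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors(37, 50)  # 9 anchors per cell -> (37 * 50 * 9, 4)
```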
[Image: Left: anchors; middle: the anchors for a single point; right: all anchors]
Region Proposal Network (RPN)
[Image: The RPN takes the convolutional feature map as input and generates proposals over the image]
The RPN processes all anchors and outputs a set of proposals. Each anchor has two outputs: an objectness score (indicating whether it contains an object) and a bounding box regression (adjusting the anchor to better fit the object).
Because it is fully convolutional, the RPN is efficient. It takes the feature map from the base network and applies a 3×3 convolutional layer, followed by two parallel 1×1 convolutional layers: one for classification and one for box regression.
[Image: Convolutional implementation of the RPN architecture, where k is the number of anchors]
For classification, each anchor outputs two probabilities: background or foreground. For regression, four values are predicted to adjust the anchor’s position and size.
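In Keras-style code, the RPN head could be sketched as follows (the 512-channel shared layer matches a VGG-based setup; this is an illustrative sketch, not Luminoth's actual implementation):

```python
import tensorflow as tf

k = 9  # anchors per spatial location (3 sizes x 3 ratios)

# Shared 3x3 convolution, then two parallel 1x1 convolutions: one
# producing the 2 objectness scores and one producing the 4 regression
# offsets for each of the k anchors at every location.
feature_map = tf.keras.Input(shape=(None, None, 512))
shared = tf.keras.layers.Conv2D(512, 3, padding="same",
                                activation="relu")(feature_map)
objectness = tf.keras.layers.Conv2D(2 * k, 1)(shared)  # (H, W, 2k)
box_deltas = tf.keras.layers.Conv2D(4 * k, 1)(shared)  # (H, W, 4k)
rpn = tf.keras.Model(feature_map, [objectness, box_deltas])
```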
Using the final proposal coordinates and their objectness scores, we can select the top proposals for further processing.
Training, Targets, and Loss Function
The RPN makes two types of predictions: binary classification (object vs background) and bounding box regression. During training, we classify anchors into foreground (IoU > 0.5) and background (IoU < 0.1). We then randomly sample 256 anchors while maintaining the balance between foreground and background.
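The IoU used for this labeling is simple to compute; a small NumPy helper (reused by the NMS sketch later on):

```python
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes,
    # all given as (x_min, y_min, x_max, y_max).
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(ix2 - ix1, 0) * np.maximum(iy2 - iy1, 0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)
```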
The loss function for classification is binary cross-entropy; for regression, we use the Smooth L1 loss, computed only over foreground anchors. Because the sampled batch may contain no foreground anchors at all for some images, in those cases we fall back to using the anchors with the highest IoU to the ground-truth boxes.
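For reference, Smooth L1 is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than a plain L2 loss:

```python
import numpy as np

def smooth_l1(x):
    # 0.5 * x^2 when |x| < 1, |x| - 0.5 otherwise.
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```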
Post-processing
Non-Maximum Suppression (NMS)
Because anchors overlap, proposals for the same object tend to overlap as well. To resolve this, we apply NMS: proposals are sorted by score, and any proposal whose IoU with a higher-scoring proposal exceeds a threshold is discarded, keeping only the highest-scoring ones.
Setting the IoU threshold is critical: too low and proposals for nearby objects may be dropped; too high and many duplicates survive. A value of 0.6 is commonly used.
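A minimal sketch of greedy NMS in NumPy, reusing the iou helper defined earlier:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.6):
    # Keep the highest-scoring box, drop every box that overlaps it
    # above the threshold, then repeat with what remains.
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        overlaps = iou(boxes[best], boxes[order[1:]])
        order = order[1:][overlaps <= iou_threshold]
    return keep
```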
Proposal Selection
After applying NMS, we keep the top N proposals. In the paper, N=2000, but even with N=50, good results can still be achieved.
Standalone Application
The RPN can be used alone without the need for a second-stage model. In single-class problems, the objectness score can act as the final class probability.
RPNs are particularly useful in tasks like face and text detection, though they still face challenges.
One advantage of using only the RPN is faster training and prediction: since it is a simple network composed only of convolutional layers, it is faster than a full classification network.
ROI Pooling
After the RPN phase, we have unclassified proposals. The next challenge is to classify these bounding boxes.
A straightforward approach is to crop each proposal and pass it through the pre-trained network. However, this is inefficient for large numbers of proposals.
Faster R-CNN solves this by reusing existing feature maps. ROI pooling extracts fixed-size features for each proposal, enabling efficient classification.
[Image: ROI pooling]
Luminoth takes a slightly different approach from classic ROI pooling: it crops the region from the feature map and resizes it to 14×14×convdepth using bilinear interpolation, then applies 2×2 max pooling to obtain a final 7×7×convdepth feature map for each proposal.
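A sketch of this crop-and-resize variant using TensorFlow's built-in op. It assumes the proposals are already normalized to [0, 1] in (y1, x1, y2, x2) order, which is the format tf.image.crop_and_resize expects:

```python
import tensorflow as tf

def roi_pool(feature_map, proposals, crop_size=14):
    # Crop each proposal from the feature map, resize it to 14x14 with
    # bilinear interpolation, then 2x2 max-pool down to 7x7.
    n = tf.shape(proposals)[0]
    box_indices = tf.zeros([n], dtype=tf.int32)  # all boxes from image 0
    crops = tf.image.crop_and_resize(feature_map, proposals, box_indices,
                                     crop_size=[crop_size, crop_size])
    return tf.nn.max_pool2d(crops, ksize=2, strides=2, padding="SAME")
```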
R-CNN
R-CNN is the final stage of the Faster R-CNN pipeline. After obtaining the proposals, it extracts features for each of them via ROI pooling and classifies them.
R-CNN has two purposes:
- Classify proposals into one of the categories or mark them as background
- Adjust the bounding box based on the predicted category
In the original paper, R-CNN flattens the pooled feature map and feeds it through two 4096-unit fully connected layers with ReLU activations. It then uses two separate fully connected output layers: one for classification and one for bounding box regression.
[Image: Structure of R-CNN]
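A Keras-style sketch of this head. The 7×7×512 input matches the pooled feature from the previous step for a VGG base, and num_classes is only an example value:

```python
import tensorflow as tf

num_classes = 20  # e.g. Pascal VOC; one extra "background" class below

pooled = tf.keras.Input(shape=(7, 7, 512))  # ROI-pooled feature
x = tf.keras.layers.Flatten()(pooled)
x = tf.keras.layers.Dense(4096, activation="relu")(x)
x = tf.keras.layers.Dense(4096, activation="relu")(x)
class_scores = tf.keras.layers.Dense(num_classes + 1)(x)  # + background
box_deltas = tf.keras.layers.Dense(num_classes * 4)(x)    # per-class offsets
rcnn_head = tf.keras.Model(pooled, [class_scores, box_deltas])
```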
Training and Targets
R-CNN's targets are computed much like the RPN's, but taking the different categories into account. Proposals with IoU > 0.5 with a ground-truth box are assigned to that box's class, while those with IoU between 0.1 and 0.5 are labeled as background.
We randomly sample a mini-batch of 64 proposals, with up to 25% foreground and the rest background. The classifier uses multi-class cross-entropy loss, while the regression uses Smooth L1 loss.
Post-processing
Like RPN, R-CNN produces categorized proposals that require further processing. We filter out low-probability proposals and apply NMS per category.
Finally, we set a probability threshold and limit the number of objects per category.
Training
In the original paper, training involved multiple steps, but end-to-end joint training has proven more effective.
After combining the models, we get four losses: two for RPN and two for R-CNN. The base network can be fine-tuned or left as-is.
The training of the base network depends on the target object and available computing power. Training it is time-consuming but can improve performance.
Losses are weighted to prioritize certain components. Regularization techniques like L2 are also used depending on the network and training settings.
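Schematically, the joint objective is just a weighted sum of the four losses; the weights below are placeholders for whatever a particular configuration uses:

```python
def total_loss(rpn_cls, rpn_reg, rcnn_cls, rcnn_reg,
               rpn_reg_weight=1.0, rcnn_reg_weight=1.0):
    # Weighted sum of the four losses; the weights shift emphasis
    # between the RPN and the R-CNN stage during joint training.
    return (rpn_cls + rpn_reg_weight * rpn_reg
            + rcnn_cls + rcnn_reg_weight * rcnn_reg)
```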
We use stochastic gradient descent with the momentum value set to 0.9; in practice, Faster R-CNN trains well with standard optimizers.
The learning rate starts at 0.001 and drops to 0.0001 after 50,000 iterations, though these hyperparameters usually need adjusting for each particular problem.
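In TensorFlow/Keras, that schedule and optimizer can be sketched as:

```python
import tensorflow as tf

# 0.001 for the first 50,000 steps, 0.0001 afterwards.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[50000], values=[1e-3, 1e-4])
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)
```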
Evaluation
The evaluation uses mean average precision (mAP) at a specific IoU threshold (e.g., mAP@0.5). mAP is a standard metric in information retrieval and is commonly used to evaluate ranking and object detection performance.
Conclusion
Now you should have a good understanding of how Faster R-CNN works. If you want to learn more, you can explore the implementation in Luminoth.
Faster R-CNN is one of the models that demonstrated that the same principles behind the deep learning revolution can solve complex computer vision problems. Modern models built on these ideas are used for object detection, semantic segmentation, 3D object detection, and more. Some borrow from the RPN, others from the R-CNN stage, and some combine both. Understanding how they work is essential for tackling future challenges in the field.