Summary (MagFace in 30 seconds)
While ArcFace only considers the “orientation” of feature vectors, MagFace uses both the “orientation” and the “magnitude” of feature vectors and gives them the following meanings:
Orientation: similarity between images
Magnitude: quality of the image
By learning to gather high-quality data near the class center, MagFace achieves high accuracy in face recognition, quality assessment, and clustering experiments.
Motivation of MagFace
We extend ArcFace to simultaneously address the following three motivations:
1. [Quality] Quantifying data quality
Data quality is a complex combination of brightness, pose, occlusion, and many other factors.
→ In MagFace, data quality is quantified by defining it as “the difficulty of face recognition”.
2. [Recognition] Improving the accuracy of face recognition
The figure above shows the geometric interpretation of ArcFace, which achieves high accuracy in face recognition.
However, ArcFace places a heavy weight on noisy sample images.
→ In MagFace, the effect of noisy samples is reduced by adding a constraint that ties “difficulty of face recognition” to “magnitude of the feature vector”.
3. [Clustering] Improving the accuracy of face clustering
→ MagFace collects high quality data, i.e. data with high confidence, at the center of the clusters to help improve the clustering accuracy.
What is the difference from ArcFace?
We focus not only on the “orientation” of the feature vectors, but also on their “magnitude”.
The Mag in MagFace stands for magnitude.
Geometric Interpretation
Compare the optimized feature spaces of ArcFace and MagFace.
ArcFace (no normalization)
A uniform, sample-independent margin m is used for training
→ The three face images of former president Park Geun-hye that fall inside the red region receive no penalty.
→ The three images therefore stay at arbitrary positions regardless of data quality.
→ The intra-class distribution becomes unstable.
MagFace
MagFace is trained under the following two conditions:
- the higher the quality of a sample, the larger the magnitude a of its feature vector
- the larger the feature magnitude a, the larger the margin m
→ The higher the quality of a Park Geun-hye image, the closer it lies to the class center W
The radius of the circle indicates the variance of the features, and a smaller radius means higher quality
Designing the Loss Function
ArcFace
Face recognition loss based on cosine similarity with a fixed angular margin m
MagFace
Face recognition loss based on cosine similarity
m(a_i): the margin increases with the feature magnitude a_i (strictly monotonically increasing convex function)
A regularization term that rewards samples with a larger feature magnitude a_i
g(a_i): becomes small when the feature magnitude a_i is large (strictly monotonically decreasing convex function)
λ_g: hyperparameter that determines the strength of the regularization term
→ This design is not obvious at first glance, but it seems to be a simple scheme that exploits the property that only high-quality samples, i.e. those with a small circle radius, have a high degree of freedom in a_i.
In the left MagFace diagram, B1 and B3 appear to be swapped; this is probably a mistake by the authors.
As a result, ArcFace can be considered a special case of MagFace with m(a_i) = m and g(a_i) = 0.
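To make the design concrete, here is a minimal PyTorch-style sketch of the loss. The specific choices of m(a_i) as a linear function on [l_a, u_a] and g(a_i) = 1/a_i + a_i/u_a², as well as the hyperparameter values, are assumptions that merely satisfy the monotonicity/convexity conditions above; they are not necessarily the authors' exact settings.

```python
import torch
import torch.nn.functional as F

# Illustrative hyperparameters (assumptions, not the authors' exact values)
L_A, U_A = 10.0, 110.0   # bounds [l_a, u_a] on the feature magnitude a_i
L_M, U_M = 0.45, 0.80    # bounds on the margin m(a_i)
SCALE = 64.0             # cosine scale s
LAMBDA_G = 20.0          # strength of the regularizer g(a_i)

def m_of_a(a):
    # Strictly monotonically increasing margin: maps [l_a, u_a] linearly onto [l_m, u_m]
    return (U_M - L_M) / (U_A - L_A) * (a - L_A) + L_M

def g_of_a(a):
    # Monotonically decreasing, convex regularizer on [l_a, u_a]: small when a is large
    return 1.0 / a + a / (U_A ** 2)

def magface_loss(features, weight, labels):
    """features: (N, d) unnormalized embeddings, weight: (C, d) class centers, labels: (N,)."""
    a = features.norm(dim=1).clamp(L_A, U_A)  # feature magnitudes a_i, kept in [l_a, u_a]
    cos = F.normalize(features, dim=1) @ F.normalize(weight, dim=1).t()  # (N, C) cosines
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=weight.size(0)).bool()
    # the magnitude-aware margin m(a_i) is added only to the angle of the ground-truth class
    # (no "easy margin" handling here, for brevity)
    cos_with_margin = torch.where(target, torch.cos(theta + m_of_a(a).unsqueeze(1)), cos)
    ce = F.cross_entropy(SCALE * cos_with_margin, labels)
    return ce + LAMBDA_G * g_of_a(a).mean()
```

Setting m_of_a to a constant m and g_of_a to zero recovers the ArcFace special case mentioned above.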
Mathematical guarantees
Assuming that the feature magnitude a_i lies in a bounded interval [l_a, u_a] and that λ_g is sufficiently large, we prove that the following two properties always hold when optimizing L_i with respect to a_i.
→ The optimal solution for a_i is unique, and fast convergence is guaranteed
The optimal solution a_i* increases monotonically as the cosine distance to its own class center decreases and the cosine distance to the other classes increases.
→ This guarantees that the feature magnitude can be treated as an indicator of data quality.
For details of the proof, please refer to the supplementary materials of the paper.
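Restated informally (the notation below is reconstructed from the description above rather than copied from the paper):

```latex
% Property 1 (uniqueness): under the assumptions above there is exactly one minimizer
\exists!\, a_i^{*} \in [l_a, u_a] : \quad a_i^{*} = \arg\min_{a_i \in [l_a,\,u_a]} L_i(a_i)

% Property 2 (monotonicity): the optimum grows as the sample aligns better with its own
% class center W_{y_i} and worse with the other class centers W_j
a_i^{*} \ \text{is increasing in } \cos\theta_{y_i}
       \ \text{and decreasing in } \cos\theta_{j} \quad (j \neq y_i)
```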
Experimental results
After verifying in Section 0 that the MagFace loss designed above learns as the authors intended, we analyze whether the three motivations discussed earlier have been addressed.
0. Visualizing the effects of MagFace loss
We run experiments on the widely used MS1M-V2 dataset to examine the relationship between the size of the feature vector and the similarity to the class center for training samples at convergence.
(a) Softmax : The distribution is radial (no correlation)
→ Since there is no explicit constraint on the feature magnitude a, the loss value of each sample is almost independent of its magnitude.
(b) ArcFace : concentrated on the right side regardless of the quality of the data
High-quality samples with a large cosine similarity cos θ to the class center W spread widely in feature magnitude (vertical axis)
For low-quality samples that are difficult to recognize (small cos θ), the fixed angular margin prevents the feature magnitude from falling below a lower limit (green dashed line).
(c) MagFace : The motivation of MagFace is correctly reflected.
There is a strong correlation between the feature magnitude a and the cosine similarity to the class center
As the feature magnitude a decreases, samples deviate more strongly from the class center
The model was trained on MS1M-V2, and 512 samples from the last iteration are used for the visualization.
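A minimal sketch of how such a magnitude-vs-cosine plot could be reproduced, assuming a trained MagFace backbone `model` and its class-center matrix `W` (both hypothetical names):

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

@torch.no_grad()
def plot_magnitude_vs_cos(model, W, images, labels):
    """Scatter the feature magnitude a_i against cos(theta) to the sample's own class center."""
    feats = model(images)                                          # (N, d) unnormalized embeddings
    a = feats.norm(dim=1)                                          # feature magnitudes a_i
    cos = F.normalize(feats, dim=1) @ F.normalize(W, dim=1).t()    # (N, C) cosines to all centers
    cos_own = cos.gather(1, labels.unsqueeze(1)).squeeze(1)        # cosine to the own class center
    plt.scatter(cos_own.cpu(), a.cpu(), s=4)
    plt.xlabel("cos(theta) to class center")
    plt.ylabel("feature magnitude a")
    plt.show()
```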
1. [Quality] The feature magnitude a as a quantitative index of data quality
From the figure below, we can see that as the feature magnitude a increases, the corresponding average face becomes more detailed. This is because high-quality faces tend to be more frontal and feature-rich. This means that the feature magnitude a learned by MagFace is a good indicator of data quality.
A visualization of average faces for 100k images extracted from the IJB-C dataset. Each average face corresponds to a group of faces binned by the magnitude of the features learned by MagFace.
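In practice this means a quality score can be read off directly from the embedding, with no extra network. A minimal sketch (`model` is again a hypothetical trained MagFace backbone):

```python
import torch

@torch.no_grad()
def face_quality(model, images):
    """Per-image quality score: the L2 norm of the MagFace embedding (larger = higher quality)."""
    feats = model(images)        # (N, d) unnormalized embeddings
    return feats.norm(dim=1)     # feature magnitude a_i serves as the quality score
```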
2. [Recognition] Improving the accuracy of face recognition
Experiments conducted with the settings of the ArcFace paper show that accuracy is almost saturated, but MagFace gives the best overall results. (The table columns correspond to the different evaluation datasets.)
We also evaluated on the IJB-B dataset and its extension IJB-C, using the True Acceptance Rate (TAR@FAR = 1e-6, 1e-5, 1e-4) as the metric.
As a result, MagFace maintains the top position in all FAR criteria except FAR=1e-6 for IJB-B. This shows that MagFace excels on the most difficult datasets.
Note that when multiple images are available for a single identity, aggregating their features weighted by feature magnitude (MagFace++) further improves accuracy.
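A minimal sketch of such magnitude-weighted aggregation for one identity; the exact aggregation rule of MagFace++ is not spelled out here, so this particular weighting is an assumption:

```python
import torch
import torch.nn.functional as F

def aggregate_by_magnitude(feats: torch.Tensor) -> torch.Tensor:
    """feats: (K, d) unnormalized embeddings of one identity.
    Returns a single template embedding where each image is weighted by its magnitude."""
    weights = feats.norm(dim=1)                   # quality proxy: feature magnitude
    weights = weights / weights.sum()             # normalize the weights
    template = (weights.unsqueeze(1) * F.normalize(feats, dim=1)).sum(dim=0)
    return F.normalize(template, dim=0)           # unit-length template for cosine matching
```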
3. [Clustering] Improving the accuracy of face clustering
MagFace consistently outperforms ArcFace on both F-score and NMI metrics regardless of the dataset, likely due to MagFace’s better intra-class feature distribution.
From the figure below we can see that, for the IJB-B-1d dataset, the magnitude of the MagFace feature is positively correlated with the probability of being a class center. This result reflects the fact that MagFace features show the expected intra-class structure, where high-quality samples are distributed near the class center and low-quality samples lie far from it.
The probability that each sample is a class center is estimated from the adjacency structure defined by the facial features. Samples with dense local connections have a high class-center probability, while samples with sparse connections or those lying on the boundary between multiple clusters have a low class-center probability. See [46] for more details.
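As an illustration only (the paper's clustering protocols follow existing baselines such as [46] and differ in detail), off-the-shelf clustering can be run directly on the normalized MagFace embeddings; here is a sketch using scikit-learn's DBSCAN with cosine distance, where `eps` is an arbitrary illustrative value:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def cluster_faces(embeddings: np.ndarray, eps: float = 0.3) -> np.ndarray:
    """embeddings: (N, d) MagFace features. Returns one cluster label per face (-1 = noise)."""
    X = normalize(embeddings)  # unit-normalize so cosine distance reflects the learned angles
    return DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(X)
```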
Conclusion
MagFace follows two principles to learn face representations that reflect data quality.
Given face images of the same class with different quality, it learns a within-class distribution in which high-quality images are placed close to the class center and low-quality images near the class boundary.
Since quality is measured at the same time as feature computation, the cost of modifying the existing inference architecture is minimized.
This results in better accuracy compared to existing margin-based methods such as ArcFace.
Furthermore, the idea of focusing on the size as well as the angle of the features could be extended to a variety of tasks in the future.