Paper Notes: Data Distillation: Towards Omni-Supervised Learning

I have been reading this paper recently and found that the translations available online are incomplete, missing the paper's core experimental method. So I am posting my own translation here, for what it is worth. If you spot any mistakes, corrections are welcome.

---
(Earlier sections omitted; please refer to the CSDN translation.)

Generating labels on unlabeled data. By aggregating the
results of multi-transform inference, it is often possible to
obtain a single prediction that is superior to any of the
model’s predictions under a single transform (e.g., see Figure
2). Our observation is that the aggregated prediction
generates new knowledge and in principle the model can use
this information to learn from itself by generating labels.

Generating labels on unlabeled data
Combining the inference results from multiple geometric transforms usually yields a better result than any single transform. Our observation is that the combined result generates new knowledge; in principle, the existing model can train itself by using this information to generate new labels.

Given an unlabeled image and a set of predictions from
multi-transform inference, there are multiple ways one
could automatically generate labels on the image. For example,
in the case of a classification problem the image
could be labeled with the average of the class probabilities
[18]. This strategy, however, has two problems. First, it
generates a “soft” label (a probability vector, not a categorical
label) that may not be straightforward to use when
retraining the model. The training loss, for example, may
need to be altered such that it's compatible with soft labels.
Second, for problems with structured output spaces, like object
detection or human pose estimation, it does not make
sense to average the output as care must be taken to respect
the structure of the output space.

Given an unlabeled image and its predictions under several geometric transforms, there are multiple ways to merge them into new labels. For example, in image classification we could label the image with the average of the predicted class probabilities. This has two problems, though. First, such a label is a soft label (a probability vector over classes rather than a categorical label), which is not straightforward to use when retraining; for instance, the loss function may need to be modified so that it can handle soft labels. Second, for problems with structured output spaces, such as object detection or human keypoint detection, directly averaging the outputs makes no sense.
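To make the soft-vs-hard distinction concrete, here is a minimal sketch (not from the paper) of the two labeling options for a classification problem; `probs_per_transform` is a hypothetical array of per-transform class probabilities.

```python
import numpy as np

# Hypothetical class probabilities predicted for one unlabeled image
# under three different transforms: shape (num_transforms, num_classes).
probs_per_transform = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
])

# "Soft" label: the averaged probability vector (requires a loss that
# accepts probability targets).
soft_label = probs_per_transform.mean(axis=0)

# "Hard" label: a single categorical label, usable with the unchanged loss.
hard_label = int(soft_label.argmax())  # -> 0
```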

Given these considerations, we simply ensemble (or aggregate)
the predictions from multi-transform inference in a
way that generates “hard” labels of the same structure and
type of those found in the manually annotated data. Generating
hard labels typically requires a small amount of task-specific
logic that addresses the structure of the problem
(e.g., merging multiple sets of boxes by non-maximum suppression).
Once such labels are generated, they can be used
to retrain the model in a simple plug-and-play fashion, as if
they were authentic ground-truth labels.

Given these considerations, we only combine the multi-transform predictions in ways that produce hard labels, which must have the same structure as the manual annotations. Generating hard labels usually requires a bit of task-specific logic that reflects the structure of the problem (e.g., merging multiple sets of boxes with non-maximum suppression). Once such labels are generated, the new data can be treated as genuine manually annotated data and used to retrain the model.
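As an illustration of such task-specific logic, below is a minimal sketch of merging multi-transform boxes with standard greedy non-maximum suppression; the paper does not spell out its exact merging procedure, so treat this as one plausible instantiation.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over [x1, y1, x2, y2] boxes."""
    order = scores.argsort()[::-1]          # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current best box against the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop near-duplicates
    return keep

# Two transforms predicting nearly the same person, plus one other person.
boxes = np.array([[10, 10, 50, 90], [12, 11, 51, 92], [200, 30, 240, 110]], float)
scores = np.array([0.95, 0.90, 0.80])
print(nms(boxes, scores))  # -> [0, 2]: duplicates merged into hard labels
```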

Finally, we note that while this procedure requires running
inference multiple times, it is actually efficient because
it is generally substantially less expensive than training
multiple models from scratch, as is required by model
distillation.

Finally, we note that although this approach requires running inference many times, it is still efficient compared with model distillation, which requires training multiple models from scratch.

Knowledge distillation. The new knowledge generated
from unlabeled data can be used to improve the model. To
do this, a student model (which can be the same as the original
model or different) is trained on the union set of the
original supervised data and the unlabeled data with automatically
generated labels.

Knowledge distillation
The new knowledge generated from unlabeled data can be used to improve the model. To do this, the original manually labeled data is merged with the automatically labeled data, and a student model (which may be the same as the original model or a different one) is trained on the union.

Training on the union set is straightforward and requires
no change to the loss function. However, we do take two
factors into consideration. First, we ensure that each training
minibatch contains a mixture of manually labeled data
and automatically labeled data. This ensures that every
minibatch has a certain percentage of ground-truth labels,
which results in better gradient estimates. Second, since
more data is available, the training schedule must be lengthened
to take full advantage of it. We discuss these issues in
more detail in the context of the experiments.

Training on the merged data is straightforward and requires no change to the loss function, but two factors need to be considered. First, every training minibatch must contain a mixture of manually labeled and automatically labeled data. This guarantees that each minibatch has a certain proportion of ground-truth labels, which yields better gradient estimates. Second, since more data is available, the training schedule must be lengthened to take full advantage of it. We discuss these issues in more detail later.

4. Data Distillation for Keypoint Detection

This section describes an instantiation of data distillation
for the application of multi-person keypoint detection.

Mask R-CNN. Our teacher and student models are the
Mask R-CNN [15] keypoint detection variant. Mask R-CNN
is a two-stage model. The first stage is a Region Proposal
Network (RPN) [30]. The second stage consists of
three heads for bounding box classification, regression, and
keypoint prediction on each Region of Interest (RoI). The
keypoint head outputs a heatmap that is trained to predict a
one-hot mask for each keypoint type. We use ResNet [16]
and ResNeXt [47] with Feature Pyramid Networks (FPN)
[23] as backbones for Mask R-CNN. All implementations
follow [15], unless specified.

Mask R-CNN
Both our teacher and student models are the keypoint detection variant of Mask R-CNN. Mask R-CNN is a two-stage model. The first stage is a Region Proposal Network (RPN). The second stage consists of three heads, which classify bounding boxes, regress bounding boxes, and predict keypoints on each RoI. The keypoint head is trained to predict a one-hot mask for each keypoint type, output as a heatmap. We use ResNet and ResNeXt with Feature Pyramid Networks (FPN) as the backbones of Mask R-CNN. Unless otherwise specified, all implementations follow [15].

Data transformations. This paper opts for geometric
transformations for multi-transform inference, though other
transformations such as color jittering [20] are possible.
The only requirement is that it must be possible to ensemble
the resulting predictions. For geometric transformations, if
the prediction is a geometric quantity (e.g., coordinates of a
keypoint), then the inverse transformation must be applied
to each prediction before they are merged.

Data transformations
Although other transformations such as color jittering are possible, this paper opts for geometric transformations (translator's note: e.g., rotation, translation, scaling). The only requirement on a transformation is that the resulting predictions can be ensembled. For geometric transformations, if the predicted value is a geometric quantity (e.g., the coordinates of a keypoint), the inverse transformation must be applied to each prediction before merging (to recover the correct prediction).
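For example, a keypoint predicted on a flipped or rescaled copy has to be mapped back before averaging. A minimal sketch with hypothetical helper names:

```python
def invert_hflip(x, y, image_width):
    """Map a keypoint predicted on a horizontally flipped image back to
    the original frame. (Left/right keypoint types, e.g. left vs. right
    ear, must also be swapped; omitted here for brevity.)"""
    return image_width - 1 - x, y

def invert_scale(x, y, scale):
    """Map a keypoint predicted on an image resized by `scale` back to
    the original resolution."""
    return x / scale, y / scale

# A keypoint found at (120, 40) on a 2x upscaled copy corresponds
# to (60, 20) in the original image.
print(invert_scale(120, 40, 2.0))
```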

We use two popular transformations: scaling and horizontal
flipping. We resize the unlabeled image to a predefined
set of scales (denoted by the shorter side of an
image): [400, 1200] pixels with a stepsize of 100, which
was selected by measuring the keypoint AP for the teacher
model when applying these transformations on the validation
set. The selected transformations can improve the
model by a good margin, e.g. for ResNet-50 from 65.1 to
67.8 AP, which is then used as the teacher. Note that unless
stated, we do not apply these transformations at test time for
all baseline/distilled models.

We use two common transformations: horizontal flipping and scaling. We resize each unlabeled image to a predefined set of scales (denoted by the shorter side of the image): 400 to 1200 pixels, in steps of 100 pixels. The scales were selected by measuring the teacher model's keypoint AP on the validation set with these transformations applied. The selected transformations improve the model by a good margin, e.g., ResNet-50's AP rises from 65.1 to 67.8, and this improved model is then used as the teacher to annotate the unlabeled data. Note: unless stated otherwise, we do not apply these transformations at test time for any baseline or distilled model. (Translator's note: apart from choosing the resize scales, transformations are generally not applied at test time.)
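In code, the transform set could look like the sketch below; whether flipping is paired with every scale is my assumption, since the paper only lists the two transformations.

```python
# Predefined test scales (shorter image side, in pixels): 400..1200, step 100.
scales = list(range(400, 1201, 100))          # 9 scales

# Hypothetical transform grid: each scale with and without horizontal flip.
transforms = [(s, flip) for s in scales for flip in (False, True)]
print(len(transforms))                        # -> 18 transformed copies
```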

Ensembling. One could ensemble the multi-transform inference
results from each stage and each head of Mask
R-CNN. In our experiments, however, for simplicity we
only apply multi-transform inference to the keypoint head;
the outputs from the other stage (i.e., RPN) and heads
(i.e., bounding box classification and regression) are from
a single-scale without any transformations.

Ensembling (the transformed predictions)
One could merge the multi-transform inference results from every stage and every head of Mask R-CNN. In our experiments, however, for simplicity we only merge the keypoint head's results; the outputs of the other stage (i.e., the RPN) and heads (i.e., bounding box classification and regression) come from a single scale without any transformation or merging.

Thanks to the above simplification, it is easy for us to
have a consistent set of detection boxes serving as the RoIs
for all transformations (scales/flipping). On a single RoI,
we extract the keypoint heatmaps from all transformations,
and although they are from different geometric transformations,
these heatmaps are with reference to the local coordinate
system of the same RoI. So we can directly average
the output (probability) of these heatmaps for ensembling.
We take the argmax position in this ensembling result and
generate the predicted keypoint location.

Thanks to this simplification, we can easily obtain one consistent set of detection boxes that serves as the RoIs for all transformations (scales/flips). On a single RoI, we extract the keypoint heatmaps from all transformations; although they come from different geometric transformations, these heatmaps all refer to the local coordinate system of the same RoI. We can therefore directly average their (probability) outputs to form the ensemble, then take the argmax position of the ensembled heatmap as the predicted keypoint location.
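A minimal sketch of this heatmap ensembling, assuming the per-transform heatmaps have already been inverse-transformed into the RoI's local frame (the shapes are hypothetical; Mask R-CNN's keypoint head uses 17 COCO keypoint types):

```python
import numpy as np

# Hypothetical keypoint heatmaps for one RoI from all transforms:
# shape (num_transforms, num_keypoints, H, W).
heatmaps = np.random.rand(18, 17, 56, 56)

# Ensemble: directly average the per-transform probabilities ...
avg = heatmaps.mean(axis=0)                       # (17, 56, 56)

# ... and take the argmax position as the predicted keypoint location.
flat = avg.reshape(avg.shape[0], -1).argmax(axis=1)
ys, xs = np.unravel_index(flat, avg.shape[1:])    # per-keypoint (y, x)
```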

Selecting predictions. We expect the predicted boxes and
keypoints to be reliable enough for generating good training
labels. Nevertheless, the predictions will contain false
positives that we hope to identify and discard. We use the
predicted detection score as a proxy for prediction quality
and generate annotations only from the predictions that are
above a certain score threshold. In practice, we found that a
score threshold works well if it makes “the average number
of annotated instances per unlabeled image” roughly equal
to “the average number of instances per labeled image”. Although
this heuristic assumes that the unlabeled and labeled
images follow similar distributions, we found that it is robust
and works well even in cases where the assumption
does not hold.

Selecting predictions
We expect the detected (bounding) boxes and keypoints to be reliable enough for training. Still, the predictions will contain false positives (of bounding boxes), which we hope to identify and discard. We treat the predicted detection score as a proxy for prediction quality and generate annotations only from predictions above a certain score threshold. In practice, we found that a threshold works well if it makes "the average number of annotated instances per unlabeled image" roughly equal to "the average number of instances per labeled image". Although this heuristic assumes that the unlabeled and labeled images follow similar distributions (translator's note: of the number of annotated people per image), we found it robust and effective even when the assumption does not hold.
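One way to realize this heuristic (my sketch, not the paper's code) is to sort all detection scores on the unlabeled set and keep exactly as many detections as the target average implies:

```python
import numpy as np

def pick_score_threshold(scores_per_image, target_avg_instances):
    """Choose a detection-score threshold so that the average number of
    kept detections per unlabeled image roughly equals the average
    number of ground-truth instances per labeled image."""
    all_scores = np.sort(np.concatenate(scores_per_image))[::-1]
    k = int(round(target_avg_instances * len(scores_per_image)))
    k = min(max(k, 1), len(all_scores))
    return all_scores[k - 1]

# Hypothetical per-image detection scores on three unlabeled images.
scores_per_image = [np.array([0.9, 0.4]), np.array([0.8]),
                    np.array([0.95, 0.7, 0.2])]
print(pick_score_threshold(scores_per_image, target_avg_instances=1.5))
```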

As a dual consideration to false positives above, there
may be false negatives (i.e., missing detections) in the extra
data, and the annotations generated should not necessarily
be viewed as complete (i.e., absence of an annotation does
not imply true background). However, in our practice we
have tried either to sample or not sample background regions
from the extra data for training detectors, and have
observed no difference in accuracy. For simplicity, in all
experiments we view the generated data as complete, so the
extra data are simply treated as if all annotations are correct.

As the dual of the false positives above, the extra data may also contain false negatives (i.e., missed detections), so the generated annotations should not necessarily be regarded as complete; the absence of an annotation does not imply true background. In practice, however, we tried both sampling and not sampling background regions (translator's note: areas outside the RoIs) from the extra data when training detectors, and observed no difference in accuracy. For simplicity, in all experiments we treat the generated data as complete (translator's note: ignoring false negatives), i.e., as if all annotations were correct.

Generating keypoint annotations. Each of the selected
predictions consists of K individual keypoints (e.g., left ear,
nose, etc.). Since many of the object views do not show
all of the keypoint types, the predicted keypoints are likely
to contain false positives as well. As above, we choose a
threshold such that the average numbers of keypoints are
approximately equal in the supervised and generated sets.

Generating keypoint annotations
Each selected prediction consists of K individual keypoints (e.g., left ear, nose, etc.). Since many object views do not show all of the keypoint types, the predicted keypoints are likely to contain false positives as well. As above, we choose a threshold such that the average numbers of keypoints are approximately equal in the manually labeled and generated sets.

Retraining. We train a student model on the union set of
the original supervised images and the images with automatically
generated annotations. To maintain supervision
quality at the minibatch level, we use a fixed sampling ratio
for the two kinds of data. Specifically, we randomly sample
images for each minibatch such that the expected ratio of
original images to generated labeled images is 6:4, unless
stated otherwise.

Retraining
We merge the manually labeled and automatically labeled data and train a student model on the union. To maintain supervision quality at the minibatch level, we fix the sampling ratio between the two kinds of data. Specifically, unless stated otherwise, each minibatch is randomly sampled so that the expected ratio of original images to automatically labeled images is 6:4.
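A minimal sketch of such a sampler; drawing each slot independently gives the expected 6:4 ratio (the batch size of 16 is my hypothetical value, not from the paper):

```python
import random

def sample_minibatch(labeled, generated, batch_size=16, p_labeled=0.6):
    """Fill each minibatch slot from the manually labeled pool with
    probability 0.6, otherwise from the automatically labeled pool,
    so the expected ratio per minibatch is 6:4."""
    return [random.choice(labeled) if random.random() < p_labeled
            else random.choice(generated)
            for _ in range(batch_size)]

# Stand-in image ids for the two pools.
batch = sample_minibatch(["co_1", "co_2", "co_3"], ["un_1", "un_2"])
```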

We adopt the learning rate schedule similar to [15] and
increase the total number of iterations to account for extra
images. The learning rate starts from 0.02 and is divided by
10 after 70% and 90% of the total number of iterations. The
impact of the total number of iterations will be discussed in
the next section in context of Table 2.

We adopt a learning rate schedule similar to [15] and increase the total number of iterations to account for the extra images. The learning rate starts at 0.02 and is divided by 10 after 70% and 90% of the total iterations (i.e., it drops to 0.002 and then 0.0002). The impact of the total number of iterations is discussed in the next section in the context of Table 2.
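The schedule described above is simple enough to write down directly; a sketch:

```python
def learning_rate(iteration, total_iters, base_lr=0.02):
    """Step schedule: divide the base learning rate by 10 after 70%
    and again after 90% of the total number of iterations."""
    if iteration < 0.7 * total_iters:
        return base_lr          # 0.02
    if iteration < 0.9 * total_iters:
        return base_lr / 10     # 0.002
    return base_lr / 100        # 0.0002
```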

We use a student model with the same architecture as
the teacher. The student can either be fine-tuned starting
from the teacher model or retrained from the initial weights
(i.e., those pre-trained on ImageNet [34]). We found that
retraining consistently results in better performance, suggesting
that the teacher model could have been in a poor
local optimum. We opt for retraining in all experiments.

We use a student model with the same architecture as the teacher. The student can either be fine-tuned from the teacher's weights or retrained from the initial weights (i.e., weights pre-trained on ImageNet [34]). We found that retraining consistently gives better results, suggesting that the teacher model may have been stuck in a poor local optimum. We opt for retraining (from the ImageNet-initialized weights) in all experiments.

5. Experiments on Keypoint Detection

We evaluate data distillation on the keypoint detection
task of the COCO dataset [24]. We report keypoint Average
Precision following the COCO definitions, including
AP (COCO’s default, averaged over different IoU thresholds), AP50, AP75, APM (medium), and APL (large). In
all experiments we report results on the 2017 validation set
that contains 5k images (called val2017, formerly known
as minival).

We evaluate data distillation on the keypoint detection task of the COCO dataset [24]. Following the COCO definitions, we report keypoint Average Precision (AP), including AP (COCO's default, averaged over different IoU thresholds), AP50, AP75, APM (medium objects), and APL (large objects). All results are computed on the 2017 validation set of 5k images (called val2017, formerly known as minival).

5.1. Data Splits

Our experiments involve several splits of data:

COCO labeled images. These are the original labeled
COCO images that contain ground-truth person and keypoint
annotations. In this paper, we refer to the 80k training
images as co-80, a 35k subset of the 2014 validation images
as co-35, and their union as co-115 (in the 2017
version of COCO, co-115 is the train2017 set). We
do not use the original train/val nomenclature because their
roles may change in different experiments.

Our experiments involve several different splits of the data:

COCO labeled images
These are the original labeled COCO images, containing ground-truth person and keypoint annotations. In this paper, we refer to the 80k training images as co-80, to a 35k subset of the 2014 validation images as co-35, and to their union as co-115 (in the 2017 version of COCO, co-115 is the train2017 set). We avoid the original train/val names because their roles may change across experiments.

COCO unlabeled images. The 2017 version of COCO
provides a collection of 120k unlabeled images, which we
call un-120. These images are expected to have a similar
distribution as the labeled COCO images.

COCO unlabeled images
The 2017 version of COCO provides 120k unlabeled images, which we call un-120. These images are expected to follow a distribution similar to that of the labeled COCO images.

Sports-1M static frames. We will show that our method
can be robust to a dissimilar distribution of unlabeled data.
We collect these images by using static frames from the
Sports-1M [19] video dataset. We randomly sample 180k
videos from this dataset. Then we randomly sample 1 frame
from each video, noting that we do not exploit any temporal
information even if it is possible. This strategy gives us
180k static images. We call this set s1m-180. We do not
use any available labels from this static image set.

Sports-1M static frames
We will show later that our method is also robust to unlabeled data from a dissimilar distribution. We collect static frames from the Sports-1M [19] video dataset: we randomly sample 180k videos and then randomly sample one frame from each video, noting that we do not exploit any temporal information even though we could. This yields 180k static images, which we call s1m-180. We do not use any of the labels available for this image set.

5.2. Main Results

We investigate data distillation in three cases:
(i) Small-scale data as a sanity check: we use co-35 as the
labeled data and treat co-80 as unlabeled.
(ii) Large-scale data with similar distribution: we use
co-115 as the labeled data and un-120 as unlabeled.
(iii) Large-scale data with dissimilar distribution: we use
co-115 as the labeled data and s1m-180 as unlabeled.
The results are in Table 1, discussed as follows:

We investigate data distillation in three scenarios:

1. Small-scale data as a sanity check: co-35 as the labeled data, co-80 treated as unlabeled.
2. Large-scale data with a similar distribution: co-115 as the labeled data, un-120 as unlabeled.
3. Large-scale data with a dissimilar distribution: co-115 as the labeled data, s1m-180 as unlabeled.

The results are shown in Table 1 and discussed below:

Small-scale data. As a sanity-check, we evaluate our approach
in the classic semi-supervised setting by simulating
labeled and unlabeled splits from all labeled images.

Small-scale data
As a sanity check, we split the fully labeled images into two parts to simulate labeled and unlabeled data, and evaluate our approach in the classic semi-supervised setting.

In Table 1a, we show results of data distillation performed
on co-35 as the labeled data and co-80 treated
as unlabeled data. As a comparison, we report supervised
learning results using either co-35 or co-115. This comparison
shows that data distillation is a successful semi-supervised
learning method: it surpasses the co-35-only
counterpart by 5.3 points of AP by using unlabeled data
(60.2 vs. 54.9). On the other hand, as expected, the semi-supervised
learning result is lower than fully-supervised
learning on co-115 (60.2 vs. 65.1).

In Table 1a, we show the results of data distillation with co-35 as labeled data and co-80 treated as unlabeled. As a comparison, we report supervised results using co-35 alone or co-115. The comparison shows that data distillation is a successful semi-supervised learning method: by using the unlabeled data, it surpasses the co-35-only counterpart by 5.3 points of AP (60.2 vs. 54.9). On the other hand, as expected, this semi-supervised result is lower than fully supervised learning on co-115 (60.2 vs. 65.1).

This phenomenon on small-scale data has been widely
observed for many semi-supervised learning methods and
datasets: if labels were available for all training data, then
the accuracy of semi-supervised learning would be upper-bounded
by using all labels. In addition, as the simulated
splits are often at smaller scales, there is a relatively large
gap for the semi-supervised method to improve in (e.g.,
from 54.9 to 65.1).

This phenomenon is common for semi-supervised methods on small-scale data: if labels were available for all the training data, the accuracy of semi-supervised learning would at best match that of fully supervised learning, i.e., it is upper-bounded by using all the labels. In addition, since the simulated splits are usually small, the semi-supervised method has a relatively large gap in which to improve (e.g., from 54.9 to 65.1).

We argue that omni-supervised learning is a real-world
scenario unlike the above simulated semi-supervised setting.
Even though one could label many images, there
are always more unlabeled data available (e.g., at internet-scale).
We can thus pursue an accuracy that is lower-bounded.
In addition, when trained with a larger dataset, the
supervised baseline would be much higher (e.g., 65.1), leaving
less room for models to gain from the unlabeled data.

We argue that omni-supervised learning, unlike the simulated semi-supervised setting above, is the real-world scenario. Even if one could manually label a great many images, there are always more unlabeled images available (e.g., images on the internet). We can therefore pursue an accuracy that is lower-bounded (translator's note: with no fixed ceiling above). In addition, when training on a larger dataset, the supervised baseline is much higher (e.g., 65.1), leaving less room for the model to gain from the unlabeled data.

Therefore, we argue that the large-scale, high-accuracy
regime is more challenging and of more interest in practice.
We investigate it in the following experiments.

We therefore argue that the large-scale, high-accuracy regime is more challenging and of greater practical interest. We investigate it carefully in the following experiments.

Large-scale, similar-distribution data. Table 1b shows
the scenario of a real-world omni-supervised learning application:
we have a large-scale source of 120k COCO
(un-120) images on hand, but we do not have labels for
them. Can we improve over our best baseline results using
these unlabeled data?

Large-scale, similar-distribution data
Table 1b reflects a real-world omni-supervised application: we have 120k unlabeled COCO images on hand. Can we use this unlabeled data to improve on our best baseline results?

Table 1b shows the data distillation results on co-115
plus un-120, comparing with the fully-supervised counterpart on co-115, the largest available annotated set on
hand. Our method is able to improve over the strong baselines
by 1.7 to 2.0 points AP. Our improvement is observed
regardless of the depth/capacity of the backbone models, including
ResNet-50/101 and ResNeXt-101.

Table 1b shows the data distillation results on co-115 plus un-120, compared with fully supervised learning on co-115, i.e., all the annotated data currently available. Our method improves over these strong baselines by 1.7 to 2.0 points of AP. The improvement holds regardless of the depth or capacity of the backbone, including ResNet-50/101 and ResNeXt-101.

We argue that these are non-trivial results. Because the
baselines are very high due to using large amounts of supervised
data (115k images in co-115), they might leave
less room for further improvement, in contrast to the simulated
semi-supervised setting. Actually, in recent work [27]
that exploited an extra 1.5× fully-annotated human keypoint
skeletons (contributed by in-house annotators), the
improvement is ~3 points AP over their baseline. Given
this context, our increase of ~2 points AP, contributed by a
similar amount of extra unlabeled data, is very promising.

We argue that these results are non-trivial. Because a large amount of supervised data was used (the 115k images of co-115), the baselines are already very high and leave little room for further improvement, in contrast to the simulated semi-supervised setting of the previous section. In fact, a recent paper [27] that exploited an extra ~1.5× of manually annotated keypoint labels improved its baseline by about 3 points of AP. In comparison, our gain of nearly 2 points of AP from a similar amount of extra unlabeled data is very promising.

Large-scale, dissimilar-distribution data. Even though
COCO data are images “in the wild”, the co-115 and
un-120 sets are subject to similar data distributions. As
one further step toward omni-supervision in real cases, we
investigate a scenario where the unlabeled images are from
a different distribution.

Large-scale, dissimilar-distribution data
Although the COCO images are "in the wild" (translator's note: crawled from the web, so as to reflect everyday scenes), co-115 and un-120 actually follow similar distributions. As a further step toward omni-supervision in real settings, we investigate the case where the unlabeled images come from a different distribution.

Table 1c shows data distillation results on co-115
plus s1m-180. Comparing with the supervised baselines
trained on co-115, our method shows consistent improvement
with different backbones, achieving 1.2 to 1.5 points
of AP increase. Moreover, the improvements in this case
are reasonably close to those in Table 1b, even though the
data distribution in Sport-1M is different. This experiment
shows that our method, in the application of keypoint detection,
is robust to the misaligned distribution of data. This is
a promising signal for real-world omni-supervised learning.
Figure 4 shows some examples of the fully-supervised
results trained in co-115 and the data distillation results
trained in co-115 plus s1m-180.

Table 1c shows the results on co-115 plus s1m-180. Compared with the baseline (supervised learning on co-115), our method improves accuracy consistently across different backbones, gaining 1.2 to 1.5 points of AP. Moreover, even though the distribution of Sports-1M differs from that of COCO, the improvements are close to those in Table 1b. This experiment shows that, for keypoint detection, our method is robust to a misaligned data distribution, a promising signal for real-world omni-supervised learning. Figure 4 shows examples of the fully supervised results trained on co-115 and the data distillation results trained on co-115 plus s1m-180. (Translator's note: i.e., cases that co-115 alone gets wrong but co-115 plus s1m-180 gets right.)

5.3. Ablation Experiments

In addition to the above main results, we conduct several
ablation experiments as analyzed in the following:

Number of iterations. It is necessary to train for more iterations
when given more (labeled or distilled) data. To show
that our method does not simply take advantage of longer
training, we conduct a careful ablation experiment on the
number of iterations in Table 2.

In addition to the main results above, we conduct several ablation experiments (removing one factor at a time to see whether the results change), analyzed below:

Number of iterations
When the amount of (labeled or distilled) data grows, more training iterations are needed. To show that our results do not simply benefit from longer training, we carefully ablate the number of iterations in Table 2.

For the fully-supervised baseline, we investigated a total
number of iterations of 90k (as done in [15]), 130k (~1.5×
longer), and 270k (3× longer). Table 2 (top) shows that
an appropriately long training indeed leads to better results,
and the original schedule of 90k in [15] is suboptimal. However,
without increasing the dataset size, training longer
gives diminishing return and becomes prone to overfitting.
The optimal number of 130k iterations is chosen and used
in Table 1 for the fully-supervised baselines.

For the fully supervised baseline, we tested total iteration counts of 90k (as in [15]), 130k (~1.5×), and 270k (3×). Table 2 (top) shows that appropriately longer training indeed yields better results, and that the original 90k schedule of [15] is suboptimal. However, without increasing the dataset size, training longer gives diminishing returns and becomes prone to overfitting. The best-performing 130k schedule is the one used for the fully supervised baselines in Table 1.

In contrast, our data distillation method continuously improves
when the number of iterations is increased from 90k
to 360k as shown in Table 2 (bottom). With a short training
of 90k, our method is inferior to its fully-supervised
counterpart (63.6 vs. 64.2), which is understandable: the
generated labels in the extra data have lower quality than
the ground-truth labels, and the model may not benefit from
them unless ground-truth labels have been sufficiently exploited.
On the other hand, our method starts to show
a healthy gain with sufficient training and surpasses its fully-supervised counterpart. Actually, our method’s performance
has not saturated and is likely to improve when
using more iterations. To have manageable experiments,
for all other data distillation results in the paper, our method
uses 360k iterations.

In contrast, our data distillation method keeps improving as the iteration count grows from 90k to 360k, as shown in Table 2 (bottom). With a short 90k schedule, our method is worse than its fully supervised counterpart (63.6 vs. 64.2). This is understandable: the automatically generated labels are of lower quality than the manual labels, so the model may not benefit from them until the knowledge in the manual labels has been sufficiently exploited. On the other hand, with sufficient training our method starts to show a healthy gain and surpasses its fully supervised counterpart. In fact, its performance has not saturated and would likely keep improving with more iterations. To keep the experiments manageable, all other data distillation results in this paper use 360k training iterations.

Amount of unlabeled data. To better understand the importance
of the amount of unlabeled data, in Figure 5 we
investigate using a subset of the un-120 unlabeled data
for data distillation (the labeled data is co-115).

Amount of unlabeled data
In Figure 5 we run data distillation using only a subset of the un-120 unlabeled data (with co-115 as the labeled data), to better understand how the amount of unlabeled data affects the method.

To have a simpler unified rule for handling the various
sizes of the unlabeled set, for this ablation, we adopt a minibatching
and iteration strategy different from the above sections.
Given a fraction ρ of un-120 images used, we sample
each minibatch with on average 1:ρ examples from the
labeled and unlabeled data. The iteration number is adaptively
set as 1+ρ times of that of the supervised baseline
(130k in this figure). As such, the total number of sampled
images from the labeled set is roughly the same regardless
of the fraction ρ. We note that this strategy is suboptimal
compared with the setting in other tables, but it is a simplified
setting that can apply to all fractions investigated.

For this ablation, to have one simple unified rule covering the various unlabeled-set sizes, we adopt a minibatching and iteration strategy different from the previous sections. Given a fraction ρ of the un-120 images (translator's note: ρ is a number between 0 and 1), each minibatch samples labeled and unlabeled data at an expected ratio of 1:ρ. The iteration count is adaptively set to (1+ρ) times that of the supervised baseline (130k in this figure). This way, the total number of images sampled from the labeled set is roughly the same regardless of ρ. We note that this strategy is suboptimal compared with the settings used in the other tables, but it is a simplification that applies to every fraction investigated.
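A sketch of this rule; the 8 labeled images per minibatch is a hypothetical value, and only the 1:ρ ratio and the (1+ρ)× iteration count come from the paper:

```python
def schedule_for_fraction(rho, base_iters=130_000, labeled_per_batch=8):
    """Given a fraction rho of un-120, sample labeled:unlabeled images
    at 1:rho per minibatch and train for (1 + rho) x the supervised
    baseline's iterations, so the total number of labeled images seen
    stays roughly constant for every rho."""
    unlabeled_per_batch = rho * labeled_per_batch   # expected count
    total_iters = int((1 + rho) * base_iters)
    return labeled_per_batch, unlabeled_per_batch, total_iters

print(schedule_for_fraction(0.5))   # -> (8, 4.0, 195000)
```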

Figure 5 shows that for all fractions of unlabeled data,
our method is able to improve over the supervised baseline.

Figure 5 shows that for every fraction of unlabeled data, our method improves over the supervised baseline. (Translator's note: even with only a modest amount of extra data, data distillation's semi-supervised-style training still lifts accuracy.)

Actually, as can be expected, the supervised baseline becomes
a lower-bound of accuracy in omni-supervised learning:
the extra unlabeled data, when exploited appropriately
such as in data distillation, should always provide extra information.
Moreover, Figure 5 shows that there is a general
trend of better results when using more unlabeled data.
A similar trend, in the context of fully-annotated data, has
been observed recently in [40]. However, our trend is observed
in unlabeled data and can be more encouraging for
the future study in computer vision.

Indeed, as expected, the supervised baseline becomes a lower bound on accuracy in omni-supervised learning: when exploited appropriately, as in data distillation, extra unlabeled data should always provide extra (useful) information. Moreover, Figure 5 shows a general trend of better results with more unlabeled data. A similar trend has recently been observed with fully annotated data [40]; that our trend appears with unlabeled data is all the more encouraging for future computer vision research.

Impact of teacher quality. To understand the impact of
the teacher quality on data distillation, we produce different
teacher models with different AP (see Figure 6 caption).
Then we train the same student model on each teacher. Figure
6 shows the student AP vs. the teacher AP.

Impact of teacher quality
To understand how the teacher's quality affects data distillation, we prepare teacher models with different APs (see the caption of Figure 6), then train the same student model on each teacher. Figure 6 plots the student AP against the teacher AP.

As expected, all student models trained by data
distillation surpass the fully-supervised baseline. In addition,
a higher-quality teacher in general results in a better
student. This demonstrates a nice property of the data
distillation method: one could expect a bigger improvement
if a better teacher will be developed.

As expected, all student models trained by data distillation surpass the fully supervised baseline. In addition, a higher-quality teacher generally produces a better student. This is a nice property of data distillation: a bigger improvement can be expected whenever a better teacher is developed.

Test-time augmentations. Our data distillation method exploits
multi-transform inference to generate labels. Multi-transform
inference can also be applied at test-time to
further improve results, a strategy typically called test-time
augmentation. Table 3 shows the results of applying
test-time augmentations on a data distillation model.
The augmentations are the same as those used to generate
distillation labels. It shows that test-time augmentations can
still improve the results over our data distillation model.

Test-time augmentation
Our data distillation method uses multi-transform inference to generate labels. The same technique can also be applied at test time to further improve results, a strategy usually called test-time augmentation. Table 3 shows the results of applying test-time augmentation on top of a data distillation model, using exactly the same augmentations as those used to generate the distillation labels. The results show that test-time augmentation can still improve upon the data distillation model.

Interestingly, the student model’s 68.9 AP (ResNet-50,
in Table 3) is higher than its corresponding (test-time augmented)
teacher’s 67.8 AP. We believe that this is a signal
of our approach being able to learn new knowledge from
the extra unlabeled data, instead of simply learning to be
robust to the transforms. Even though we use multiple data-agnostic
transforms, the distilled labels are data-dependent
and may convey knowledge from the extra data.

Interestingly, the student model's 68.9 AP (ResNet-50, Table 3) is higher than the 67.8 AP of its corresponding (test-time augmented) teacher. We believe this is a signal that our approach learns new knowledge from the extra unlabeled data, rather than merely becoming robust to the transforms. Even though we use multiple data-agnostic geometric transforms, the distilled labels are data-dependent and can carry knowledge from the extra data.

This result also suggests that performing data distillation
in an iterative fashion may improve the results further. We
leave this direction for future work.

This result also suggests that applying data distillation iteratively might improve results further. We leave that direction for future work.

6. Experiments on Object Detection

We investigate the generality of our approach by applying
it to another task with minimal modification. We perform
data distillation for object detection on the COCO
dataset [24]. Here our data splits involve co-35/80/115
as defined above. We test on minival.

To study the generality of our approach, we apply it to another task with minimal modification: data distillation for object detection on the COCO dataset [24]. The data splits are co-35/80/115 as defined above, and we test on minival.

6.1. Implementation

Our object detector is Faster R-CNN [30] with the FPN
backbone [23] and the RoIAlign improvement [15]. We
adopt the joint end-to-end training as described in [31].
Note that this is unlike in our keypoint experiments where
we froze the RPN stage (which created the same set of
boxes for keypoint ensembling). To produce the ensemble
results, we simply take the union set of the boxes predicted
under different transformations, and combine them
using bounding box voting [10] (a process similar to non-maximum
suppression that merges the suppressed boxes).
This ensembling strategy on the union set of boxes shows
the flexibility of our method: it is agnostic to how the results
from multiple transformations are aggregated.

Our object detector is Faster R-CNN [30] with an FPN backbone [23] and the RoIAlign improvement [15]. We adopt the joint end-to-end training described in [31]. Note that, unlike the keypoint experiments, we do not freeze the RPN stage here (freezing it was what gave a single consistent set of boxes for keypoint ensembling). To produce the ensemble, we simply take the union of the boxes predicted under the different transformations and combine them with bounding box voting [10] (a process similar to non-maximum suppression, except that the suppressed boxes are merged rather than discarded). This union-based ensembling shows the flexibility of our method: data distillation itself does not care how the multi-transform predictions are aggregated.
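A minimal sketch of box voting in the spirit of [10]: each box kept by NMS is refined as the score-weighted average of all overlapping boxes from the union set. Treat this as one plausible reading, not the paper's exact code.

```python
import numpy as np

def iou_one_vs_many(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    a = (box[2] - box[0]) * (box[3] - box[1])
    b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (a + b - inter)

def box_voting(kept_boxes, all_boxes, all_scores, iou_thresh=0.5):
    """Refine each NMS-kept box as the score-weighted average of all
    boxes (from every transform) that overlap it, instead of simply
    discarding the suppressed ones. Each kept box overlaps itself,
    so the weight sum is never zero."""
    voted = []
    for box in kept_boxes:
        m = iou_one_vs_many(box, all_boxes) >= iou_thresh
        w = all_scores[m]
        voted.append((all_boxes[m] * w[:, None]).sum(axis=0) / w.sum())
    return np.array(voted)
```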

The object detection task involves multiple categories. A
single threshold of score for generating labels may lead to
strong biases. To address this issue, we set a per-category
threshold of score confidence for annotating objects in the
unlabeled data. We choose a threshold for each category
such that its average number of annotated instances per image
in the unlabeled dataset matches the average number of
instances in the labeled dataset. Figure 7 shows some examples
of the generated annotations on un-120.

Object detection involves multiple categories, and a single score threshold for generating labels could introduce strong biases. To address this, we set a per-category confidence threshold for annotating objects in the unlabeled data. Each category's threshold is chosen so that its average number of annotated instances per unlabeled image matches the average number of instances per labeled image. Figure 7 shows some examples of generated annotations on un-120.
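The per-category version of the earlier threshold heuristic could look like this sketch (the names are hypothetical):

```python
import numpy as np

def per_category_thresholds(scores_by_cat, avg_instances_by_cat, num_unlabeled):
    """Pick one score threshold per category so that the average number
    of generated annotations per unlabeled image matches that category's
    average number of ground-truth instances per labeled image."""
    thresholds = {}
    for cat, scores in scores_by_cat.items():
        k = int(round(avg_instances_by_cat[cat] * num_unlabeled))
        s = np.sort(np.asarray(scores))[::-1]
        thresholds[cat] = s[min(max(k, 1), len(s)) - 1]
    return thresholds
```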

6.2. Object Detection Results

We investigate data distillation in two cases (Table 4):
(i) Small-scale data: we use co-35 as the labeled data and
treat co-80 as unlabeled.
(ii) Large-scale data: we use co-115 as the labeled data
and un-120 as unlabeled.

We investigate data distillation in two cases (Table 4):
(1) Small-scale data: co-35 as the labeled data, co-80 treated as unlabeled.
(2) Large-scale data: co-115 as the labeled data, un-120 as unlabeled.

Small-scale data. Similar to the keypoint case, the semi-supervised
learning result of data distillation (Table 4a) is
higher than that of fully-supervised training in co-35, but
upper-bounded by that in co-115. However, in this case,
the data distillation is closer to the lower bound (32.3 vs.
30.5) and farther away from the upper bound. This result
requires further exploration, which we leave to future work.

Small-scale data
Similar to the keypoint case, the semi-supervised result of data distillation (Table 4a) is better than fully supervised training on co-35, but upper-bounded by training on co-115. In this case, however, data distillation lands closer to the lower bound (32.3 vs. 30.5) and farther from the upper bound. This result calls for further exploration, which we leave to future work.

Large-scale data. Table 4b shows the data distillation result
using co-115 as labeled and un-120 as unlabeled
data, comparing with the fully-supervised result in
co-115. Our method is able to improve over the fully-supervised
baselines. Although the gains may appear small
(0.8-0.9 points in AP and 0.9-1.1 points in AP50), the signal
is consistently observed for all network backbones and for
all metrics. The biggest improvement is seen in the APM
metric, with an increase of up to 1.8 points (from 43.7 to
45.5 in ResNeXt-101-32x4).

Large-scale data
Table 4b shows the data distillation results using co-115 as labeled data plus un-120 as unlabeled data, compared with the fully supervised result on co-115. Our method improves over the fully supervised baselines. Although the gains may look small (0.8 to 0.9 points of AP and 0.9 to 1.1 points of AP50), the improvement is observed consistently across all network backbones and all metrics (translator's note: AP, AP50, AP75, etc.). The biggest improvement appears in APM, with an increase of up to 1.8 points (from 43.7 to 45.5 with ResNeXt-101-32x4).

The results in Table 4a and 4b suggest that object detection
with unlabeled data is a more challenging task, but
unlabeled data with data distillation can still help.

The results in Tables 4a and 4b suggest that object detection with unlabeled data is a more challenging task, but unlabeled data with data distillation still helps.

7. Conclusion

We show that it is possible to surpass large-scale supervised
learning with omni-supervised learning, i.e., using
all available supervised data together with large amounts
of unlabeled data. We achieve this by applying data
distillation to the challenging problems of COCO object
and keypoint detection. We hope our work will attract more
attention to this practical, large-scale setting.

Our experiments show that omni-supervised learning, i.e., using all available supervised data together with large amounts of unlabeled data, can surpass large-scale supervised learning. We demonstrate this by applying data distillation to the challenging problems of COCO object detection and keypoint detection. We hope this work draws more attention to this practical, large-scale setting.
