BEVFormer

Ref: github

BEVFormer is a method that fuses multi-camera images into the current BEV (bird's-eye-view) feature. The current BEV feature can then be used for detection and segmentation tasks.

The architecture from the original paper, shown below, is hard to follow for a reimplementation, since many details are omitted.

(figure: bev_arch, architecture from the original paper)

1. Reorganize the architecture

(figure: reorganized BEVFormer architecture)

1.1 Encoder

(figure: encoder overview)

The encoder takes the following inputs (shapes are made concrete in the sketch after this list):

  • multi-camera multi-level image features as (K, V): a tuple with $N_l$ elements, one per feature level. Each element has shape (batch_size, num_camera, feat_h, feat_w, feat_dim), i.e., $(B, N_{cam}, h_{feat}, w_{feat}, d_{feat})$
  • history BEV feature $BEV_{t-1}$ from the previous frame. Shape: $(B, N_q, e)$, where $N_q = h_{bev} \times w_{bev}$ is the length of the flattened BEV feature map.
  • BEV query $Q_{BEV}$. Shape: $(B, N_q, e)$. Same shape as BEV feature.
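To make the shapes concrete, here is a minimal PyTorch sketch of the encoder inputs; all sizes below are hypothetical examples, not values taken from the repo's configs.

```python
import torch

B, N_cam, e = 1, 6, 256            # batch, cameras, embedding dim
h_bev, w_bev = 200, 200            # BEV grid resolution (assumed)
N_q = h_bev * w_bev                # flattened BEV query length

# multi-camera multi-level image features as (K, V): one tensor per level
level_shapes = [(116, 200), (58, 100), (29, 50), (15, 25)]  # assumed (h, w) per level
img_feats = tuple(torch.randn(B, N_cam, h, w, e) for h, w in level_shapes)

bev_prev = torch.randn(B, N_q, e)    # history BEV feature BEV_{t-1}
bev_query = torch.randn(B, N_q, e)   # learnable BEV query Q_BEV
```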

After temporal self-attention and spatial cross-attention in each layer, the encoder outputs the current fused BEV feature $BEV_t$.
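As a rough sketch of this data flow (the method names on `layer` are hypothetical, not the repo's API), each layer refines the BEV query in place:

```python
def encoder_forward(layers, bev_query, bev_prev, img_feats):
    x = bev_query
    for layer in layers:
        # attend to the history BEV feature (and itself) via deformable attention
        x = layer.temporal_self_attention(query=x, history_bev=bev_prev)
        # attend to multi-camera image features at projected reference points
        x = layer.spatial_cross_attention(query=x, img_feats=img_feats)
        x = layer.ffn(x)
    return x  # current fused BEV feature BEV_t
```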

1.1.1 Encoder Layer

(figure: encoder layer)

1.1.1.1 Temporal Self-Attention

The detailed computation of temporal self-attention is illustrated below. The underlying mechanism is deformable attention, where each query attends to a small set of sampling points in the (K, V) map; a minimal sketch follows the notation list.

(figure: temporal self-attention computation)

The notations are:

  • $B$: batch size
  • $N_q$: query length. In self-attention, it is the length of the flattened BEV feature map of shape $(h_{bev}, w_{bev})$
  • $e$: model/embedding dimension
  • $2B$: the effective batch size, due to stacking $(BEV_{t-1}, Q_{BEV, 0})$ along the batch dimension
  • $N_h$: number of heads, e.g., 8
  • $N_l$: number of image feature levels from the backbone, e.g., 4
  • $N_p$: number of neighbor points to be sampled for each reference point, e.g., 8
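To ground the notation, below is a minimal single-head, single-level deformable-attention sketch. The names `offset_net`/`weight_net` and the pixel-unit offset normalization are assumptions; the real implementation uses $N_h$ heads, $N_l$ levels, and a fused CUDA kernel. In temporal self-attention, `value` is built from the stacked $(BEV_{t-1}, Q_{BEV})$, which is why the effective batch becomes $2B$.

```python
import torch
import torch.nn.functional as F

def deformable_attn_2d(query, value, ref_points, offset_net, weight_net, N_p=8):
    """query:      (B, N_q, e)
    value:      (B, h, w, e)  map to sample from (here: BEV_{t-1} or Q_BEV)
    ref_points: (B, N_q, 2)   normalized (x, y) reference points in [0, 1]
    offset_net: nn.Linear(e, N_p * 2); weight_net: nn.Linear(e, N_p)
    """
    B, N_q, e = query.shape
    h, w = value.shape[1:3]

    offsets = offset_net(query).view(B, N_q, N_p, 2)       # per-query sampling offsets
    weights = weight_net(query).softmax(-1)                # (B, N_q, N_p) attention weights

    # offsets are in pixels; normalize, add to reference points, map to [-1, 1]
    locs = ref_points[:, :, None, :] + offsets / torch.tensor([w, h], dtype=query.dtype)
    grid = 2.0 * locs - 1.0                                # (B, N_q, N_p, 2)

    v = value.permute(0, 3, 1, 2)                          # (B, e, h, w)
    sampled = F.grid_sample(v, grid, align_corners=False)  # (B, e, N_q, N_p)
    out = (sampled * weights[:, None]).sum(-1)             # weighted sum over N_p samples
    return out.permute(0, 2, 1)                            # (B, N_q, e)
```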

1.1.1.2 Spatial Cross-Attention

(figure: spatial cross-attention computation)

The notations are:

  • $N_c$: number of cameras
  • $N_f$: length of the flattened feature map across all levels for each image
  • $N_z$: number of points sampled along the z-axis of a BEV pillar, e.g., 4. It should be smaller than $N_p$
  • $l$: number of BEV queries whose projected reference points hit a given camera view (see the projection sketch below)
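The projection that decides which queries "hit" which cameras can be sketched as follows. The `lidar2img` matrices and the masking threshold are assumptions about the data pipeline; the bounds check against the image size is omitted for brevity.

```python
import torch

def project_pillar_points(bev_xy, z_values, lidar2img):
    """bev_xy:    (N_q, 2)      BEV-plane (x, y) in ego/LiDAR coordinates
    z_values:  (N_z,)        heights sampled along each BEV pillar
    lidar2img: (N_c, 4, 4)   per-camera projection matrices
    returns: uv (N_c, N_q, N_z, 2) pixel coords; mask of points in front of each camera
    """
    N_q, N_z = bev_xy.shape[0], z_values.shape[0]
    xyz = torch.cat([bev_xy[:, None, :].expand(N_q, N_z, 2),
                     z_values[None, :, None].expand(N_q, N_z, 1)], dim=-1)
    xyz1 = torch.cat([xyz, torch.ones(N_q, N_z, 1)], dim=-1)  # homogeneous coords
    cam = torch.einsum('cij,qzj->cqzi', lidar2img, xyz1)      # (N_c, N_q, N_z, 4)
    depth = cam[..., 2:3]
    mask = depth.squeeze(-1) > 1e-5                           # point is in front of camera
    uv = cam[..., :2] / depth.clamp(min=1e-5)                 # perspective division
    return uv, mask
```

Queries whose projected points land inside a camera's image then attend, via deformable attention, to that camera's multi-level features; the per-camera hit count is the $l$ above.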

2. Detection/Segmentation Head

The detection head is a DETR-style transformer decoder, which takes the encoder output (the BEV feature fused from the cameras) together with a set of learnable object queries. More details can be found in the original DETR paper.
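As a hedged sketch, PyTorch's generic `nn.TransformerDecoder` can stand in for the head (the actual head differs in detail, and `num_obj` is an assumed query count):

```python
import torch
import torch.nn as nn

B, e = 1, 256
N_q = 200 * 200                           # flattened BEV feature length (assumed grid)
num_obj = 900                             # number of object queries (assumed)

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=e, nhead=8, batch_first=True),
    num_layers=6)

obj_queries = torch.randn(B, num_obj, e)  # learnable object queries
bev_t = torch.randn(B, N_q, e)            # encoder output BEV_t as memory
hs = decoder(obj_queries, bev_t)          # (B, num_obj, e)
# hs then feeds classification and box-regression heads, as in DETR
```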