SinLane: Siamese Visual Transformer following Pyramid Feature Integration for Lane Detection

1Shanghai Jiao Tong University, 2Zhejiang University, 3University of Notre Dame
ECAI 2024


Overall architecture of our proposed SinLane network. The backbone first extracts multi-scale features from the input image. PFI is then applied to fully integrate global semantic information with local finer-scale features. The Siamese Visual Transformer (encoder and decoder) subsequently generates lane sequences. Specifically, \( e_0 \) is the initial lane sequence, and \( e_1 \), \( e_2 \), and \( e_3 \) denote refined lane sequences optimized by feature maps of different scales from the PFI.
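
For illustration, below is a minimal PyTorch-style sketch of this data flow. The module and parameter names (ToyPFI, ToySinLane, num_queries, dim) are our own placeholders, and the lane-sequence decoding is simplified to one standard Transformer decoder layer per pyramid level; this is a sketch under those assumptions, not the official SinLane implementation.

```python
# Minimal sketch of the SinLane-style data flow described above.
# ToyPFI and ToySinLane are placeholder names; shapes, the query count, and the
# use of plain nn.TransformerDecoderLayer are simplifying assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class ToyPFI(nn.Module):
    """Placeholder pyramid feature integration: fuse multi-scale backbone features."""

    def __init__(self, in_channels, out_channels=64):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        # Project every level to a common width, then fuse top-down so global
        # semantics from coarse maps reach the finer-scale maps.
        outs = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(outs) - 1, 0, -1):
            outs[i - 1] = outs[i - 1] + F.interpolate(
                outs[i], size=outs[i - 1].shape[-2:], mode="nearest")
        return outs  # finest to coarsest, all with out_channels channels


class ToySinLane(nn.Module):
    def __init__(self, num_queries=8, dim=64):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.stages = nn.ModuleList([backbone.layer2, backbone.layer3, backbone.layer4])
        self.pfi = ToyPFI(in_channels=[128, 256, 512], out_channels=dim)
        self.queries = nn.Embedding(num_queries, dim)  # initial lane sequence e_0
        # One decoder step per pyramid level refines the lane sequence: e_1, e_2, e_3.
        self.decoders = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(3)])

    def forward(self, img):
        x, feats = self.stem(img), []
        for stage in self.stages:                       # multi-scale backbone features
            x = stage(x)
            feats.append(x)
        pyramid = self.pfi(feats)                       # integrated multi-scale features
        e = self.queries.weight.unsqueeze(0).expand(img.size(0), -1, -1)  # e_0
        for dec, feat in zip(self.decoders, reversed(pyramid)):           # coarse to fine
            memory = feat.flatten(2).transpose(1, 2)    # (B, H*W, dim)
            e = dec(e, memory)                          # e_1, e_2, e_3
        return e                                        # refined lane sequence


lanes = ToySinLane()(torch.randn(1, 3, 320, 800))
print(lanes.shape)  # torch.Size([1, 8, 64])
```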

Abstract

Lane detection is an important yet challenging task in autonomous driving systems. Building on the development of the Visual Transformer, early Transformer-based lane detection studies have achieved promising results in some scenarios. However, under complex road conditions such as uneven illumination and heavy traffic, the performance of these methods remains limited and can even fall behind that of contemporaneous CNN-based methods. In this paper, we propose a novel Transformer-based end-to-end network, called SinLane, which obtains attention weights that focus on sparse yet meaningful locations and improves the accuracy of lane detection in complex environments. SinLane is composed of a novel Siamese Visual Transformer structure and a novel Feature Pyramid Network (FPN) structure called Pyramid Feature Integration (PFI). We utilize the proposed PFI to better integrate global semantics with finer-scale features and to facilitate the optimization of the Transformer. Moreover, the designed Siamese Visual Transformer is combined with multiple levels of the PFI and is employed to refine the multi-scale lane line features output by the PFI. Extensive experiments on three lane detection benchmarks demonstrate that our SinLane achieves state-of-the-art results with high accuracy and efficiency. In particular, SinLane improves accuracy by over 3% compared with the best-performing Transformer-based lane detection method on CULane. Our code has been released.

Attention Maps


Attention map examples of LSTR and our proposed SinLane. Both models are trained for the same number of epochs. The attention weights of LSTR concentrate on the middle area of the lane lines. In contrast, the attention weights of our method are distributed evenly from top to bottom along each lane line on the road.
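
For readers who want to reproduce this kind of visualization, the following is a minimal sketch (not the authors' plotting code) of how per-query cross-attention weights can be read out of a PyTorch nn.MultiheadAttention module and reshaped into an H x W map; the dimensions and variable names are illustrative assumptions.

```python
# Minimal sketch of extracting decoder cross-attention maps for visualization.
# The dimensions (8 lane queries, a 10x25 feature map, 64 channels) are
# illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

queries = torch.randn(1, 8, 64)        # lane queries (e.g. one refined sequence e_i)
memory = torch.randn(1, 10 * 25, 64)   # flattened feature map of height 10, width 25

# need_weights=True returns the attention weights; averaging over heads gives
# one (num_queries, H*W) map per image.
_, weights = cross_attn(queries, memory, memory,
                        need_weights=True, average_attn_weights=True)

attn_maps = weights.reshape(1, 8, 10, 25)  # one H x W heat map per lane query
print(attn_maps.shape)                     # torch.Size([1, 8, 10, 25])
```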

Qualitative Results


Visualization results of the ground truth (GT), LSTR, CLRNet, and our SinLane on the CULane benchmark. All results are generated with the same ResNet-18 backbone.

Quantitative Results


Comparison of recent methods and our method on the CULane dataset. To compare computation speed in the same environment, we re-measure FPS on the same machine with an RTX 3090 GPU using the open-source code (where available).


Comparison results on the TuSimple dataset.


Comparison results on the LLAMAS dataset.

Video Results on the CULane, TuSimple, and LLAMAS datasets (TODO)

Poster (TODO)

Supplementary Material