Hello,
Thank you again for your excellent work. I have a few questions regarding the model design and dataset:
- In the token decoder, you used 5×5 masked attention for self-attention and a 3×3 convolution block at the end. Have you tried using full (unmasked) attention instead, while keeping the 3×3 convolution block? Masked attention is known to perform better in segmentation tasks due to its local region awareness, but the token decoder here performs index prediction rather than segmentation. (A rough sketch of the two variants I mean follows this list.)
- The BEV labels from the nuScenes dataset do not appear to be rotated so that the ego vehicle's forward direction points upward. The first image is from timestamp 1 and the second is from timestamp 2; as you can see, the vehicle moves from left to right. Did you rotate the BEV map 90 degrees counterclockwise for visualization in the paper? (The second sketch below shows the rotation I mean.)
- For the MLP head: transformer-based autoregressive models typically use a simple linear head with cross-entropy loss, whereas you used an MLP head with focal loss. I understand that focal loss addresses class imbalance, but I'm curious how much performance gain you observed from using the MLP head together with focal loss. (See the third sketch below.)
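For the first question, here is a rough PyTorch sketch of the two self-attention variants I am comparing. The 5×5 window mask construction, the shapes, and all the names are my assumptions for illustration, not your actual code:

```python
import torch

def local_window_mask(h, w, window=5):
    """Boolean (h*w, h*w) mask: query i may attend to key j only if j lies
    inside a window x window neighborhood of i on the 2D token grid."""
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ys, xs = ys.flatten(), xs.flatten()
    r = window // 2
    return (ys[:, None] - ys[None, :]).abs().le(r) & \
           (xs[:, None] - xs[None, :]).abs().le(r)

def attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    if mask is not None:                          # masked (local) attention
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v      # mask=None -> full attention

h, w, d = 8, 8, 32
x = torch.randn(h * w, d)                            # toy token grid, flattened
local = attention(x, x, x, local_window_mask(h, w))  # 5x5 masked variant
full = attention(x, x, x)                            # unmasked variant I am asking about
```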
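For the second question, this is the rotation I mean, on a toy grid (the grid size and the synthetic left-to-right track are placeholders):

```python
import numpy as np

bev = np.zeros((200, 200), dtype=np.uint8)  # placeholder BEV label grid
bev[95:105, 120:180] = 1                    # toy vehicle track moving left to right

# np.rot90 with k=1 rotates 90 degrees counterclockwise, so the
# left-to-right track above becomes a bottom-to-top (upward) track.
bev_vis = np.rot90(bev, k=1)
```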
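For the third question, a minimal sketch of the two head/loss setups I am contrasting; gamma, the head widths, and the vocabulary size are placeholders I picked for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss: down-weights well-classified (easy) tokens."""
    logp = F.log_softmax(logits, dim=-1)
    logp_t = logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)  # log p of true class
    p_t = logp_t.exp()
    return (-(1.0 - p_t) ** gamma * logp_t).mean()

d_model, vocab = 256, 1024
linear_head = nn.Linear(d_model, vocab)                # common AR baseline head
mlp_head = nn.Sequential(nn.Linear(d_model, d_model),  # the MLP variant
                         nn.GELU(),
                         nn.Linear(d_model, vocab))

x = torch.randn(4, d_model)                # dummy decoder features
y = torch.randint(0, vocab, (4,))          # dummy target token indices
baseline = F.cross_entropy(linear_head(x), y)
variant = focal_loss(mlp_head(x), y)
```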
Thank you again for your great work.