HiVT begins by representing the road scene, which consists of agents and map information, as a collection of vectorized elements. These elements are extracted first and include the trajectory segments of road agents and the lane segments from the map data. Based on this structured representation, the model hierarchically aggregates spatiotemporal information.
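To make the idea of vectorized elements concrete, the following is a minimal sketch of how agent trajectories and lane centerlines could be split into segments. It is an illustration under stated assumptions, not the authors' implementation: the `Vector` class, its field names, and the helpers `polyline_to_vectors` and `vectorize_scene` are hypothetical, and the choice to store each segment as a pair of 2D endpoints is assumed for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Vector:
    """One vectorized element: a directed segment between two 2D points.

    Hypothetical container; the field names are assumptions for illustration.
    """
    start: Point
    end: Point
    kind: str      # "agent" for a trajectory segment, "lane" for a lane segment
    owner_id: int  # id of the agent or lane the segment belongs to


def polyline_to_vectors(points: List[Point], kind: str, owner_id: int) -> List[Vector]:
    """Split an ordered list of points into consecutive segments."""
    # zip pairs each point with its successor; fewer than two points yields no segments.
    return [Vector(p, q, kind, owner_id) for p, q in zip(points, points[1:])]


def vectorize_scene(
    agent_tracks: Dict[int, List[Point]],
    lane_centerlines: Dict[int, List[Point]],
) -> List[Vector]:
    """Collect trajectory segments and lane segments into one list of elements."""
    vectors: List[Vector] = []
    for agent_id, track in agent_tracks.items():
        vectors.extend(polyline_to_vectors(track, "agent", agent_id))
    for lane_id, centerline in lane_centerlines.items():
        vectors.extend(polyline_to_vectors(centerline, "lane", lane_id))
    return vectors


# Toy scene: one agent observed at three timesteps, one lane with two centerline points.
scene = vectorize_scene(
    agent_tracks={0: [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5)]},
    lane_centerlines={7: [(0.0, -1.0), (5.0, -1.0)]},
)
```

In this toy scene the agent contributes two trajectory segments and the lane contributes one lane segment. A model would then encode each segment as a feature vector before any aggregation step.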