COMBINING SEGMENTATION, TRACKING, AND CLASSIFICATION MODELS TO SOLVE VIDEO ANALYTICS PROBLEMS

Abstract

The task of detecting obstacles in front of a mobile robot has long been solved with laser and ultrasonic sensors. However, obstacles that these sensors cannot detect may still endanger the robot. To detect them, this work proposes using a computer vision system whose output is processed by a semantic segmentation neural network that returns the mask of each obstacle in the frame together with its class. The network is built on the universal segmentation model SAM, which requires further development before it can be applied to the semantic segmentation task. The distinguishing property of SAM is its universal applicability, that is, the ability to segment arbitrary objects under arbitrary capture conditions; at the same time, SAM does not predict the semantics of the objects it segments. This paper proposes an additional module that enables semantic segmentation by classifying the features of the segmented objects, and substantiates the use of such a module for enriching the network output with class information. The classification result is then passed through the same filtering algorithm as the masks, which keeps the output of the universal network consistent with that of the complementary module. Integrating the module with the model yields a new semantic segmentation model, called RTC-SAM in this work. It was used to perform semantic segmentation on a publicly available dataset of outdoor-area images. The obtained score of 45% by the IoU metric exceeds the results of existing methods by 13%. Result images shown in the paper allow the network's performance to be verified visually. The paper also describes testing of the developed solution, including a study of the model's runtime on a PC and on a mobile computer. On the mobile computer the algorithm is too slow for real-time operation, taking more than 3.5 seconds per frame; accordingly, improving system performance is one of the directions for further research.
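The described pipeline can be summarized in a short sketch. This is a minimal illustration only, assuming the public segment-anything package; the classify_mask callable, the checkpoint path, and the single score threshold are placeholders standing in for the classification module and filtering algorithm proposed in the paper, not their actual implementation.

import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def build_generator(checkpoint: str = "sam_vit_b.pth"):
    # Checkpoint path is hypothetical; any official SAM checkpoint works here.
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    return SamAutomaticMaskGenerator(sam)

def semantic_segment(image: np.ndarray, generator, classify_mask, threshold: float = 0.5):
    """SAM proposes class-agnostic masks; the extra module assigns a class
    to each mask, and both pass through the same acceptance filter."""
    labeled = []
    for proposal in generator.generate(image):      # image: HxWx3 uint8 RGB
        mask = proposal["segmentation"]             # boolean HxW mask from SAM
        label, score = classify_mask(image, mask)   # placeholder for the paper's module
        if score >= threshold:                      # shared filtering step for masks and labels
            labeled.append({"mask": mask, "class": label, "score": score})
    return labeled

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union, the metric used to report the 45% result."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

In the paper's design the label filter mirrors the mask-filtering step so that a mask and its class are accepted or rejected together; here that is reduced to a single score threshold for brevity.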



Published: 2025-04-27

Section: SECTION III. COMMUNICATION, NAVIGATION AND GUIDANCE

Keywords: neural network, segmentation, vector classification, representation vector, vision system, obstacles