INTEGRATION OF SEGMENTATION, TRACKING AND CLASSIFICATION MODELS TO SOLVE VIDEO ANALYTICS PROBLEMS
Abstract
Integrating several models into a single technical vision system makes it possible to solve more complex tasks. In particular, for mobile robotics and unmanned aerial vehicles (UAVs), the lack of datasets covering diverse operating conditions is a pressing problem. This work proposes the integration of three models as a solution: segmentation, tracking, and classification. The segmentation model extracts arbitrary objects from frames, which makes it usable in non-deterministic and dynamic environments. The classification model identifies the objects relevant to navigation or other applications, which are then followed by the tracking model. The paper describes an algorithm for aggregating the models. Besides the models themselves, the key element is a procedure for correcting model predictions, which makes segmentation and tracking of diverse objects sufficiently reliable. The correction procedure solves the following tasks: adding new objects to track, validating segmented object masks, and refining the tracked masks. The versatility of the solution is confirmed by its operation in difficult conditions, such as underwater footage or UAV imagery. An experimental study of each model was carried out both outdoors and indoors. The datasets used make it possible to assess the applicability of the models to mobile robotics tasks, namely detecting potential obstacles in the robot's path, such as a curb, as well as moving objects such as a person or a car. The models demonstrated sufficiently high quality: for most classes the scores exceeded 80% across several metrics, and the main errors are related to object size. The experiments demonstrate the versatility of the solution without additional training of the models. Additionally, performance on a personal computer was studied for various input parameters and resolutions. Increasing the number of models significantly raises the computational load, and the system does not reach real-time speed; therefore, one direction of further research is increasing the speed of the system.
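The aggregation algorithm summarized above can be illustrated with a short sketch. The snippet below shows a minimal per-frame loop in which segmentation proposes masks, classification validates them against the classes of interest, and the tracker propagates the surviving objects. All model interfaces (segmenter.segment, classifier.classify, tracker.add_object, tracker.propagate) and the parameters correction_period and min_score are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the segmentation + classification + tracking
# aggregation loop described in the abstract. The segmenter,
# classifier, and tracker objects and their methods are assumed
# interfaces, not the authors' actual API.
from typing import Dict, Iterable, Iterator

import numpy as np

Mask = np.ndarray  # boolean mask with the same H x W as the frame


def process_video(
    frames: Iterable[np.ndarray],
    segmenter,                    # assumed: segment(frame) -> list of masks
    classifier,                   # assumed: classify(frame, mask) -> (label, score)
    tracker,                      # assumed: add_object(...) and propagate(...)
    target_classes: set,
    correction_period: int = 30,  # assumed: run the correction step every N frames
    min_score: float = 0.5,       # assumed confidence threshold for validation
) -> Iterator[Dict[int, Mask]]:
    """Yield {object_id: mask} for every frame of the video."""
    tracked: Dict[int, Mask] = {}
    for i, frame in enumerate(frames):
        if i % correction_period == 0:
            # Correction, task 1: re-segment the frame so that newly
            # appeared objects can be added to the tracker.
            for mask in segmenter.segment(frame):
                label, score = classifier.classify(frame, mask)
                # Correction, task 2: validate the segmented mask by
                # keeping only confidently classified, relevant objects.
                if label in target_classes and score >= min_score:
                    obj_id = tracker.add_object(frame, mask, label)
                    tracked[obj_id] = mask
        # Correction, task 3: propagate the registered objects to the
        # current frame, refining the stored masks along the way.
        tracked = tracker.propagate(frame, tracked)
        yield tracked
```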
References
1. Yang J., et al. Track anything: Segment anything meets videos, CoRR, 2023, Vol. abs/2304.11968. Available at: http://arxiv.org/abs/2304.11968.
2. Cheng H.K., et al. Tracking anything with decoupled video segmentation, IEEE/CVF International
Conference on Computer Vision, 2023, pp. 1316-1326.
3. Zhu J., et al. Tracking anything in high quality, CoRR, 2023, Vol. abs/2307.13974. Available
at: http://arxiv.org/abs/2307.13974.
4. Liu Y., et al. MobileSAM-Track: Lightweight One-Shot Tracking and Segmentation of Small
Objects on Edge Devices, Remote Sensing, 2023, Vol. 15, No. 24, pp. 5665.
5. Cheng Y., et al. Segment and track anything, CoRR, 2023, Vol. abs/2305.06558. Available at:
http://arxiv.org/abs/2305.06558.
6. Kirillov A., et al. Segment anything, CoRR, 2023, Vol. abs/2304.02643. Available at: http://arxiv.org/abs/2304.02643.
7. Cheng B., Schwing A., Kirillov A. Per-pixel classification is not all you need for semantic segmentation,
Advances in Neural Information Processing Systems, 2021, Vol. 34, pp. 17864-17875.
8. Yang Z., Yang Y. Decoupling features in hierarchical propagation for video object segmentation,
Advances in Neural Information Processing Systems, 2022, Vol. 35, pp. 36324-36336.
9. Yang Z., Wei Y., Yang Y. Associating objects with transformers for video object segmentation,
Advances in Neural Information Processing Systems, 2021, Vol. 34, pp. 2491-2502.
10. Cherti M., et al. Reproducible scaling laws for contrastive language-image learning,
IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818-2829.
11. Awadalla A., et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models, CoRR, 2023, Vol. abs/2308.01390. Available at: http://arxiv.org/abs/2308.01390.
12. Li J., Li D., Xiong C., Hoi S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, International Conference on Machine Learning, 2022, pp. 12888-12900.
13. Radford A., et al. Learning transferable visual models from natural language supervision, International
conference on machine learning, 2021, pp. 8748-8763.
14. Mueller M., Smith N., Ghanem B. A benchmark and simulator for UAV tracking, Computer Vision – ECCV 2016: 14th European Conference, 2016, pp. 445-461.
15. GitHub: fbrs_interactive_segmentation. Available at: https://github.com/SamsungLabs/fbrs_interactive_segmentation.
16. Sofiiuk K., Petrov I., Barinova O., Konushin A. F-BRS: Rethinking Backpropagating Refinement
for Interactive Segmentation, IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2020, pp. 8623-8632.
17. Fomin I., Arhipov A. Selection of Neural Network Algorithms for the Semantic Analysis of
Local Industrial Area, International Russian Automation Conference, 2021, pp. 380-385.
18. Miao J., et al. VSPW: A Large-scale Dataset for Video Scene Parsing in the Wild, IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2021, pp. 4133-4143.
19. Zhang C., et al. Faster Segment Anything: Towards Lightweight SAM for Mobile Applications,
CoRR, 2023, Vol. abs/2306.14289. Available at: http://arxiv.org/abs/2306.14289.
20. Wang A., et al. RepViT-SAM: Towards Real-Time Segmenting Anything, CoRR, 2023, Vol. abs/2312.05760. Available at: http://arxiv.org/abs/2312.05760.