Investigating the effect of distance on the application of the RCNN automatic detection technique to the human body

Abstract: The identification of humans constitutes a crucial component of monitoring systems, given the significance of the timely detection of individuals. Despite advancements in people detection systems, detecting humans at long distances remains challenging. In this study, we employed the Region-based Convolutional Neural Network (RCNN) approach to train a system on images captured at varying distances between the camera and individuals. The results demonstrate promising outcomes, with the system achieving a maximum detection recall of 1 for identifying people at distances of up to 40 meters and a maximum precision of 1 for identifying people at distances of up to 50 meters.
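The recall and precision figures quoted above follow the standard detection-metric definitions. A minimal illustration (the counts below are hypothetical, not the study's data):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Standard detection metrics.

    precision = TP / (TP + FP)  -- fraction of detections that are correct
    recall    = TP / (TP + FN)  -- fraction of true persons that are found
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: every person detected, no false alarms,
# so both metrics reach their maximum value of 1.
p, r = precision_recall(tp=60, fp=0, fn=0)
print(p, r)  # 1.0 1.0
```

A recall of 1 thus means no person was missed, while a precision of 1 means no background region was mistaken for a person.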

Human detection in computer vision involves identifying whether a human body is present in a given image or video footage in order to locate individuals promptly and precisely. It has found widespread application in fields such as intelligent surveillance, security-assisted driving, and other domains. However, the effectiveness of human detection is affected by several factors, including the complexity of the background, variations in lighting conditions, and differences in clothing, posture, and viewing angle. Consequently, it is often challenging to acquire high-quality image feature information, which lowers the recognition rate and detection speed, so there remains a need for improvement in this area [2]. In recent years, it has become possible to detect people in difficult environments, and in particular the accuracy of people detection using Convolutional Neural Networks (CNN) has improved significantly [3].
In the current study, a region-based convolutional neural network (RCNN) technique was used to investigate the effect of different distances on human detection in video scenes, and the results were compared with those of an established method, Aggregate Channel Features (ACF).

Previous Studies:
In 2010, Chern-Horn Sim and his group proposed a scheme for detecting people in random image frames of a video sequence showing a dense scene against a cluttered background. The method used only spatial information: a trained Viola-Jones-type local detector was applied in a first pass over the image to identify people in the dense scene. This pass produced numerous false positives, so a second stage was devoted to reducing their number. The results were presented as receiver operating characteristic curves; for example, at a detection accuracy of 79.0%, the false positive rate was 20.3% [4]. In 2018, Tattapon Surasak and his group extended research on video-based people detection with the Histogram of Oriented Gradients (HOG) method by developing an application to import videos and detect people in them. The HOG algorithm analyzes each frame of the video to find and count people; after processing the video from start to finish, the program produces a histogram showing how many people were found over the duration of the video [5]. In 2020, Ejaz Ul Haq and his group published a robust framework for detecting and tracking people in noisy and cluttered environments using data augmentation techniques, together with softmax layers, to improve the detection and classification performance of the proposed model; the main attention was paid to person detection under unconstrained conditions [6]. In 2021, A. Haider et al. proposed a new regression-based method for human detection in thermal infrared images. A fully convolutional regression network was designed to map the human heat signature in the thermal input image to a spatial density map, which was then post-processed to detect and localize the humans in the image. The regression-based method detects humans with a precision of 99.16% and a recall of 98.69% [7]. In 2022, Pei-Fen Tsai et al. proposed the use of a thermal imaging camera (TIC) together with a deep learning model as an intelligent approach for detecting humans during emergency evacuations in scenarios with low visibility caused by smoke and fire, using YOLOv4 for real-time object detection. A detection accuracy greater than 95% for locating people in a low-visibility smoke scenario was achieved at 30 frames per second (FPS) [8].

Methodology:
This section addresses the essential stages of building a system that detects the human body in digital images captured at varying distances, utilizing the RCNN technique.

The introduced human detection system:
RCNN is an object detection model that uses a high-capacity CNN applied to bottom-up region proposals for object localization and segmentation. Selective search is used to identify a large number of candidate regions ("regions of interest") that may contain objects; features are then extracted from each region separately and classified [7]. RCNN takes an input frame and uses the feature maps generated by the convolutional layers to suggest where objects might be located [3]. For each region proposal, RCNN runs a classifier that estimates the probability that an object is present; if the probability exceeds a threshold, the proposal is flagged and passed on for further processing. From the generated feature maps, the bounding boxes are extracted and the most probable object class for each bounding box is identified [8], as shown in figure (1).

Data collection:
In this study, the data was captured by several videographers at distances from 10 to 70 meters between the camera and the subjects. The videos were converted into frames using the VLC media player, and a large number of frames (240 frames) at distances of 10 m, 20 m, 30 m, and 40 m were selected and introduced to train the system using the RCNN technique. Examples of the extracted frames are shown in table (1).
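The proposal-scoring step described in the methodology above, in which each region proposal is flagged when its object probability exceeds a threshold and labelled with its most probable class, can be sketched as follows. The boxes and probabilities here are illustrative stand-ins for real CNN outputs, not values from the study:

```python
def select_detections(proposals, threshold=0.5):
    """Keep proposals whose best non-background class probability
    exceeds the threshold.

    proposals: list of (box, {class_name: probability}) pairs,
    where box is an (x, y, width, height) tuple.
    """
    detections = []
    for box, class_probs in proposals:
        best_class = max(class_probs, key=class_probs.get)
        best_prob = class_probs[best_class]
        # Flag the proposal only if it is an object (not background)
        # and its probability clears the threshold.
        if best_class != "background" and best_prob >= threshold:
            detections.append((box, best_class, best_prob))
    return detections

# Two hypothetical region proposals with per-class scores.
proposals = [
    ((10, 20, 64, 128), {"person": 0.92, "background": 0.08}),
    ((200, 40, 30, 30), {"person": 0.15, "background": 0.85}),
]
print(select_detections(proposals))  # keeps only the "person" proposal
```

A full RCNN pipeline would additionally refine each kept box with bounding-box regression and suppress overlapping duplicates; this sketch covers only the thresholding and class-selection logic described above.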

Figure 1: An illustrative diagram of the stages of R-CNN

Table 1:
Samples of extracted frames from videos at different distances (between persons and camera)
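The study extracted frames with the VLC media player. As a hypothetical alternative for selecting a fixed number of training frames per distance (e.g. an even split of the 240 frames across the four distances), evenly spaced frame indices could be computed like this; the clip length and per-distance count below are assumptions, not values stated in the paper:

```python
def sample_frame_indices(total_frames: int, n_samples: int) -> list[int]:
    """Pick n_samples evenly spaced frame indices from a clip of
    total_frames frames (all indices if the clip is shorter)."""
    if n_samples >= total_frames:
        return list(range(total_frames))
    step = total_frames / n_samples
    return [int(i * step) for i in range(n_samples)]

# Hypothetical clip: 30 s at 30 fps = 900 frames, sampled down to
# 60 training frames for one camera-to-person distance.
indices = sample_frame_indices(900, 60)
print(indices[:4])  # [0, 15, 30, 45]
```

Even sampling of this kind spreads the selected frames across the whole clip instead of taking a consecutive burst, which gives the training set more pose variety per distance.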