Traditional video surveillance systems [6] are designed mainly for offline analysis of recorded video streams and depend heavily on human operators for processing. Hence, they cannot support uninterrupted real-time surveillance tasks. Recently proposed smart surveillance systems, however, minimize human intervention in video stream processing, object detection, and abnormal behavior analysis by using deep learning and machine learning algorithms. For example, the system in [20] automatically processes the collected video frames in a cloud to detect and report unusual events. A general framework of a smart surveillance system involves tasks at three levels:
• Level 1: the low level performs information extraction, such as feature detection and object tracking;
• Level 2: the intermediate level is in charge of pattern recognition, such as action recognition and behavior understanding; and
• Level 3: the high level handles decision making, such as abnormal event detection.
As the first step of any video surveillance application, object detection and classification are essential for subsequent object tracking. Compared with general moving-object detection, distinguishing human beings in video frames is more challenging because of occlusion and variable appearance, illumination, and background. Although much work has been done in the surveillance area, only a few studies focus on human object detection and tracking.
The Haar-like rectangle feature, which encodes the intensity contrast between neighboring regions, is suitable for face detection [19]. However, the intensity contrast between regions of a human body depends on clothing, which varies unpredictably, so the Haar-like feature is not discriminative for human body detection [15]. The scale-invariant feature transform (SIFT) offers an alternative for human detection by extracting distinctive invariant features from images that can be used for reliable matching between different views of an object or scene [14]. Grids of Histograms of Oriented Gradients (HOG) descriptors significantly outperform existing feature sets for human detection [7], and the HOG+SVM algorithm [14] performs better in human detection. Compared with prevalent machine learning classifiers such as the Convolutional Neural Network (CNN) [12], SVM has lower classification complexity because it uses only a small subset of the training data, called support vectors, to construct the optimal hyperplane. The high generalization ability of support vector networks allows SVM to outperform AdaBoost [16] by using non-linear kernels that fit the data better. In this paper, we use HOG as the feature descriptor and SVM as the object classifier for human body detection.
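To make the HOG and SVM pairing more concrete, the following is a minimal sketch of the core idea: a magnitude-weighted histogram of gradient orientations for a single cell, followed by a linear decision function. It is a simplification under stated assumptions, not the implementation used in this paper. It omits the bin interpolation and overlapping block normalization of full HOG, uses a linear kernel only, and the function names, the 9-bin setting, and any weights passed in are hypothetical.

```python
import math

def cell_histogram(patch, bins=9):
    """Orientation histogram of one cell, weighted by gradient magnitude.

    `patch` is a list of rows of grayscale intensities. Unsigned
    orientations (0 to 180 degrees) are assumed here.
    """
    height, width = len(patch), len(patch[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Centered differences approximate the image gradient.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Hard assignment to one bin (full HOG interpolates between bins).
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

def l2_normalize(vec, eps=1e-6):
    """Scale the descriptor to unit length to reduce illumination effects."""
    norm = math.sqrt(sum(v * v for v in vec) + eps ** 2)
    return [v / norm for v in vec]

def linear_svm_score(features, weights, bias):
    """Signed score w.x + b; a positive value would mean 'human'."""
    return sum(f * w for f, w in zip(features, weights)) + bias
```

In a full detector, the normalized histograms of all cells in a detection window would be concatenated into one feature vector and scored by a trained SVM; the weights and bias would come from training, which is not shown here.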
After human beings are distinguished from the background, an object tracking algorithm is applied to generate the trajectory of each target object over time by computing its location in each frame of the video stream. The most popular tracking schemes today include Point Tracking, Kernel Tracking, and Silhouette Tracking [21]. The Tracking-Learning-Detection (TLD) framework, which explicitly decomposes the long-term tracking task into tracking, learning, and detection, is widely applied in state-of-the-art object tracking [13]. Boosting [8] and multiple instance learning (MIL) [2] support online training, which makes the classifier adaptive while tracking the object; however, updating consumes a lot of resources. Kernelized Correlation Filters (KCF) require fewer resources while achieving a high tracking success rate [9], which makes KCF a preferable online tracking method for real-time surveillance systems.
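The notion of a trajectory as the target's location in each frame can be illustrated with a short sketch. Note that this is not KCF, TLD, Boosting, or MIL: it substitutes a much simpler greedy overlap (IoU) association between the previous box and the current frame's detections, purely to show how per-frame locations accumulate into a trajectory. The function names, the (x, y, w, h) box format, and the 0.3 overlap threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center(box):
    """Center point of an (x, y, w, h) box."""
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def build_trajectory(initial_box, detections_per_frame, min_iou=0.3):
    """Follow one target by picking the best-overlapping detection per frame.

    Returns one entry per frame (plus the initial position); an entry is
    None when no detection overlaps the last known box enough.
    """
    box = initial_box
    trajectory = [center(box)]
    for detections in detections_per_frame:
        best = max(detections, key=lambda d: iou(box, d), default=None)
        if best is not None and iou(box, best) >= min_iou:
            box = best
            trajectory.append(center(box))
        else:
            trajectory.append(None)  # target not found in this frame
    return trajectory
```

A correlation-filter tracker such as KCF replaces the per-frame detection and association step with a learned appearance model that is updated online, but its output has the same shape: one location per frame.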
In most surveillance systems, the collected video streams are processed and analyzed in cloud centers, where abundant computing resources are available. Consequently, all video records must be transmitted to remote servers regardless of whether they are processed online or offline. The resulting delays are a concern because they are not tolerable in many delay-sensitive, mission-critical tasks [17]. Recently, an online, uninterrupted target tracking system has been proposed that leverages the fog computing paradigm to meet the requirements of real-time video processing and instant decision making [3], [4], [5]. Raw video streams generated by drones are merged on near-site fog computing devices, such as a tablet or laptop. However, image transmission between the drones and the fog computing nodes still consumes a lot of bandwidth, which increases the likelihood of latency and errors and poses challenges to the fast-response requirement.
Unlike the existing approaches, we use edge computing to detect and track human targets. All generated raw video streams are first processed locally by edge devices rather than being sent to remote fog nodes or a cloud center. Only the extracted features and tracking information are transmitted to the fog or cloud for higher-level processing, such as pattern recognition and anomaly detection. This hierarchical approach enables quasi-real-time video frame processing by increasing parallelism, and it decreases network delay by reducing bandwidth utilization and communication workload.
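A rough back-of-the-envelope comparison suggests why sending tracking information instead of raw frames reduces bandwidth. All numbers below are hypothetical and are not measurements from this work: an uncompressed 640x480 RGB stream at 30 frames per second, and five tracked targets described by an assumed 20-byte record each. Real deployments compress video and may also send feature vectors, so the actual ratio would differ.

```python
def raw_stream_bytes_per_second(width, height, channels, fps):
    """Uncompressed video rate, assuming one byte per channel per pixel."""
    return width * height * channels * fps

def tracking_stream_bytes_per_second(targets, bytes_per_record, fps):
    """Rate when only one fixed-size record per target per frame is sent."""
    return targets * bytes_per_record * fps

# Hypothetical settings chosen only for illustration.
raw = raw_stream_bytes_per_second(width=640, height=480, channels=3, fps=30)
meta = tracking_stream_bytes_per_second(targets=5, bytes_per_record=20, fps=30)
reduction = raw / meta  # how many times smaller the tracking stream is
```

Under these assumptions the raw stream is several orders of magnitude larger than the tracking records, which is the gap the edge-first design aims to exploit.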





