Moving from Video Analytics to Computer Vision
The security industry pioneered video analytics, aiming to increase the efficiency of people watching video monitors. These efforts led, however, to a disparate collection of often closed and expensive systems that are unreliable in the real world. Below we explore the limitations of conventional video analytics and describe how Sighthound Video is using computer vision to offer homes and businesses more intelligent video security.
Video analytics today
Consumer‐grade solutions
Most IP cameras come with software that spares the camera from recording and storing video 24/7. The cameras use motion detection to determine whether moving objects are present in the video. Motion detection measures how many pixels have changed between frames. When enough pixels change, video is recorded and/or the customer is notified of a “motion event.” Unfortunately, pixels also change when clouds pass overhead, leaves sway in the wind, or lights flicker. More advanced systems let customers ignore selected regions of the video or set sensitivity levels. Nevertheless, motion detection often generates an unacceptable number of false alerts, flooding customers with notifications and recordings of empty scenes.
The typical consumer story goes something like this: There’s a theft from a house in the neighborhood. The homeowner installs video cameras for peace of mind. The software that comes with the system is dauntingly complicated to set up and use. If the consumer does manage to set up the product, the feature that sends alerts to a smartphone triggers hundreds of false alarms. By day three the homeowner has turned off the alerts.
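The pixel-counting form of motion detection described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the function names, the grayscale list-of-lists frame format, and both threshold values are assumptions chosen for the example.

```python
from typing import List

# Assumed frame format: a grayscale image as rows of ints in the range 0-255.
Frame = List[List[int]]


def changed_fraction(prev: Frame, curr: Frame, pixel_threshold: int = 25) -> float:
    """Return the fraction of pixels whose brightness changed by more than pixel_threshold."""
    total = 0
    changed = 0
    for prev_row, curr_row in zip(prev, curr):
        for p, c in zip(prev_row, curr_row):
            total += 1
            if abs(p - c) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def is_motion_event(prev: Frame, curr: Frame,
                    pixel_threshold: int = 25,
                    area_threshold: float = 0.02) -> bool:
    """Flag a motion event when enough of the frame has changed (hypothetical thresholds)."""
    return changed_fraction(prev, curr, pixel_threshold) >= area_threshold


# A static 10x10 scene, and the same scene after a light flickers brighter.
background = [[100] * 10 for _ in range(10)]
flicker = [[140] * 10 for _ in range(10)]

print(is_motion_event(background, background))  # False: nothing changed
print(is_motion_event(background, flicker))     # True: a false alert, since nothing moved
```

The second call shows why this approach floods customers with alerts: a change in lighting alters every pixel, so the detector fires even though no object entered the scene.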
Professional‐grade solutions
Commercial analytics systems use a two‐step process to recognize people in videos. First, pixel changes are analyzed to isolate moving objects from the background. Boxes are drawn around these objects. Second, “object recognition” algorithms are applied to classify the moving objects as people, vehicles or other objects.
Conventional object recognition starts with the assumption that vertical boxes are humans and horizontal boxes are not. Rules about the boxes are then added to improve accuracy, starting with size: boxes that are much shorter or taller than humans are filtered out. This requires a technician to calibrate the 3D scene to the 2D video image by entering the camera height, the lengths of objects at various places in the frame, etc. In addition, the motion of the box may be analyzed to assess speed, consistency of movement, etc. Some systems attempt deeper analysis of the visual content inside the box. This may sound straightforward until one considers the tremendous variation of shapes in the real world: people carrying umbrellas that cover their heads, pushing strollers, carrying large items, or appearing “split” when walking behind objects. To a rules‐based recognition system, these cases represent radical departures from the canonical form of a person. Unfortunately, expanding the rules to accommodate this variation also increases the number of non‐human forms that meet the criteria.
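The box rules described above can be illustrated with a small sketch. Everything here is hypothetical: the `Box` type, the `classify_box` function, and the height limits are assumptions made for the example, and the box dimensions are assumed to be already converted to meters by the calibration step.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of a moving object, in estimated real-world meters (assumed calibrated)."""
    width_m: float
    height_m: float


def classify_box(box: Box, min_height_m: float = 1.0, max_height_m: float = 2.2) -> str:
    """Apply two rules: the box must be vertical, and roughly human-sized."""
    if box.width_m <= 0 or box.height_m <= 0:
        return "other"
    if box.height_m <= box.width_m:
        return "other"  # horizontal (or square) boxes are assumed not to be people
    if not (min_height_m <= box.height_m <= max_height_m):
        return "other"  # much shorter or taller than a human
    return "person"


print(classify_box(Box(0.5, 1.8)))  # person: an upright adult
print(classify_box(Box(1.2, 1.1)))  # other: a crouching person under a wide umbrella is missed
print(classify_box(Box(0.4, 1.9)))  # person: a swaying sapling of the same shape passes too
```

The last two calls show both failure modes: a real person whose outline departs from the canonical shape is rejected, while a non-human object with the right proportions is accepted. Loosening the limits to catch the umbrella case would admit even more non-human shapes.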
Sighthound Video: a new approach
Rather than relying on expensive equipment and expert human tuning, Sighthound Video uses smarter algorithms that work with less expensive equipment, out of the box. The software was trained with thousands of video clips of humans, dogs, cats, cars, clouds and other objects, and the system was taught what each clip represented. It can then recognize objects it has not seen before by comparing them with the trained data sets.
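The idea of recognizing a new object by comparison with labeled examples can be sketched as follows. The text above does not say which learning method Sighthound Video uses, so a simple nearest-neighbor lookup is swapped in here purely for illustration; the two-number feature vectors, the labels, and the `nearest_label` function are all hypothetical.

```python
import math
from typing import List, Tuple

# Assumed representation: each training example is (feature vector, label).
Example = Tuple[List[float], str]


def nearest_label(training: List[Example], features: List[float]) -> str:
    """Return the label of the training example closest to the given features.

    Raises ValueError if the training list is empty.
    """
    closest = min(training, key=lambda example: math.dist(example[0], features))
    return closest[1]


# Toy training set; the features stand in for whatever a real system would extract.
training_set: List[Example] = [
    ([0.4, 1.8], "person"),
    ([1.8, 1.5], "car"),
    ([0.6, 0.4], "cat"),
]

print(nearest_label(training_set, [0.5, 1.7]))  # person
```

Unlike a hand-written rule, the decision here comes from the labeled examples, so coverage improves by adding more training data rather than more rules.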
The screenshot on the left below is from a scene where Sighthound Video achieved 92% accuracy over the course of an afternoon. The screenshot on the right was taken from a state‐of‐the‐art professional system that, in a side‐by‐side comparison, achieved 75% accuracy over the same time period. It shows an actual failure case seen relatively frequently on the professional system in this experiment: people who are within roughly 10 to 20 feet of the camera.
The failure probably occurs because only part of the body is visible. Raw accuracy data is not meaningful out of context, however, since accuracy varies dramatically between and within scenes. The 75% accuracy figure can be increased considerably by mounting an array of cameras so that people are detected in the distance before approaching, and the entire body can be seen. This is likely what a reseller installing a professional system would recommend. On the other hand, the accuracy would drop significantly if a single camera were mounted to detect people outside a door. This is an important use case for the small business and residential customers that Sighthound Video targets.
Solving the right problem, in the most elegant way
Even the best raw technology must target an appropriate problem. Video analytics has been marketed as failsafe protection of critical assets in large deployments. Unfortunately, not only did conventional systems fail to live up to their accuracy claims, but the problems they targeted were also inappropriate in the first place. Even with a 99% accurate system, false positives can be crippling in the real world. At 99% accuracy, 2000 objects appearing in a day would trigger 20 false alarms.
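The arithmetic behind that figure is simple enough to check directly. The sketch below follows the text's simplification that every misclassified object produces a false alarm; the function name is an assumption for the example.

```python
def expected_false_alarms(objects_per_day: int, accuracy: float) -> float:
    """Expected number of misclassified objects per day, assuming each one raises an alarm."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return objects_per_day * (1.0 - accuracy)


print(round(expected_false_alarms(2000, 0.99)))   # 20
print(round(expected_false_alarms(2000, 0.999)))  # 2
```

Even a tenfold reduction in the error rate still leaves a couple of false alarms per day for each camera watching a busy scene.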
Sighthound Video, on the other hand, is optimized for rapidly refining searches and reviewing video. The goal is to distill an unwieldy number of video clips into a manageable number, and to quickly refine searches when looking for a specific event. This usage pattern is more closely aligned with the reality of accuracy that is below 100%.
In this context, the user interface plays a role as critical as accuracy. Sighthound Video was designed to refine searches in order to significantly reduce the amount of video that must be reviewed.
Sighthound sees this approach as a template for solving information overload problems. The key to dealing with too much information is to find the relevant information with intelligent computing, and to provide a streamlined interface for finding what you want quickly. It becomes another big data problem, but one that demands more CPU processing power and storage than either text or speech.
Summary
Sighthound Video leverages intelligent technology to powerful effect. Intelligent computing, combined with thoughtful interface design, will create simple solutions to complex problems.