Automated video cutting

Machine Learning
Computer Vision
Artificial Intelligence

Over the past decade, as people have moved from desktops to mobile devices, their consumption patterns have changed significantly. One such change is the surge in video consumption on smartphones, which gave rise to mobile video advertising. Today, advertisers can achieve strong performance by creating videos tailored to mobile devices and the behavior of mobile users. At the same time, while videos are growing in popularity, they are also getting shorter: the average video length was 13.14 minutes in 2016, 6.07 minutes in 2017, and 4.07 minutes in 2018. So how can you cut a video while preserving its meaning and adapting it to the viewer’s preferences? The answer seems obvious: make your videos shorter and more content-intensive. But the reality is not that straightforward: each platform has its own requirements for format, length, and size, so you end up producing many versions of the same video.



The objective of the project was to develop a solution that automatically cuts and reshapes a video according to the standards of the chosen social platform. The source videos were usually 30-second TV commercials that had to be uploaded to multiple social media platforms, including Instagram, YouTube, and Facebook. The ultimate goal was to produce a shorter and, therefore, more content-intensive version of the same video.


We worked on two subtasks: first, we needed to make the videos shorter; second, we needed to resize them to meet the requirements of specific social media platforms.


Subtask 1. Video Cutting

The source videos were approximately 30 seconds long, while the target length varied from 6 to 10 seconds. Therefore, we needed to select only the most relevant parts of the video and join them into a new, more content-intensive piece. To complete this subtask, we took the following steps:

Step 1. The crucial part of cutting a video is finding relatively stationary shots, each of which presents similar information throughout. To do so, we calculated optical flow: the pattern of apparent motion of objects, edges, and surfaces in a scene caused by the relative motion between an observer and the scene. Peaks in optical flow signal that the scene is changing and, therefore, that a new shot begins. We used the peaks as scene-change markers and treated the interval between adjacent peaks as a single shot.
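
The peak-based segmentation described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes the per-frame motion values are already computed (for example, as the mean optical-flow magnitude between consecutive frames, which OpenCV's `cv2.calcOpticalFlowFarneback` could supply), and the function names and the threshold factor `k` are hypothetical.

```python
from statistics import mean, pstdev


def find_shot_boundaries(motion, k=1.0):
    """Return indices where motion spikes, treated as shot changes.

    motion[i] is assumed to be the mean optical-flow magnitude between
    frame i and frame i + 1. A value counts as a peak when it is a local
    maximum and exceeds mean + k * standard deviation (k is a hypothetical
    tuning parameter).
    """
    if len(motion) < 3:
        return []
    threshold = mean(motion) + k * pstdev(motion)
    peaks = []
    for i in range(1, len(motion) - 1):
        is_local_max = motion[i] > motion[i - 1] and motion[i] >= motion[i + 1]
        if is_local_max and motion[i] > threshold:
            peaks.append(i)
    return peaks


def split_into_shots(num_frames, boundaries):
    """Turn boundary indices into (start, end) frame ranges, end exclusive."""
    # A peak at index b sits between frames b and b + 1,
    # so the next shot starts at frame b + 1.
    edges = [0] + [b + 1 for b in boundaries] + [num_frames]
    return [(s, e) for s, e in zip(edges, edges[1:]) if e > s]


# Toy motion series with two clear spikes (11 frames, 10 transitions).
motion = [0.1, 0.2, 0.1, 5.0, 0.2, 0.1, 0.15, 6.0, 0.1, 0.2]
boundaries = find_shot_boundaries(motion)
shots = split_into_shots(len(motion) + 1, boundaries)
```

A fixed threshold relative to the mean is only one way to pick peaks; real footage with fast camera motion inside a shot would likely need a more careful criterion.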


Step 2. After finding the separate shots, we could analyze their relevance to the video. We used object detection to determine the number of objects present in each shot, and we estimated the shot's memorability and aesthetics scores on its first, middle, and last frames. To assess memorability and aesthetics, we extracted feature vectors with a TensorFlow Hub model from the MobileNet V2 family and classified them.
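
Sampling the first, middle, and last frames of a shot could look like the sketch below. The `score_frame` callable is a hypothetical stand-in for the feature extractor and classifier, and averaging the three per-frame scores into one shot-level score is an assumption; the text does not say how the frame scores were combined.

```python
def key_frame_indices(start, end):
    """First, middle, and last frame indices of a shot spanning [start, end)."""
    if end <= start:
        raise ValueError("a shot must contain at least one frame")
    last = end - 1
    middle = start + (last - start) // 2
    # dict.fromkeys drops duplicates for very short shots and keeps order.
    return list(dict.fromkeys([start, middle, last]))


def score_shot(start, end, score_frame):
    """Average a per-frame score over the key frames of a shot.

    score_frame is a hypothetical callable: frame index -> score.
    Averaging is an assumption made for this sketch.
    """
    indices = key_frame_indices(start, end)
    return sum(score_frame(i) for i in indices) / len(indices)
```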


Step 3. We calculated a total score for each shot based on the number of detected objects, the shot duration, and the memorability and aesthetics scores. We then selected the most relevant shots and compiled them into a 6 to 10 second video.
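
One way to combine these signals is a weighted sum followed by a greedy selection under a length budget. The sketch below is illustrative only: the weights, the `Shot` fields, and the greedy strategy are assumptions, since the text does not give the actual scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    start: float          # seconds
    end: float            # seconds
    num_objects: int      # objects found by the detector
    memorability: float   # assumed to lie in [0, 1]
    aesthetics: float     # assumed to lie in [0, 1]

    @property
    def duration(self):
        return self.end - self.start


def total_score(shot, w_objects=0.2, w_duration=0.1, w_mem=0.4, w_aes=0.3):
    """Weighted sum of the four signals; the weights are hypothetical."""
    return (w_objects * shot.num_objects
            + w_duration * shot.duration
            + w_mem * shot.memorability
            + w_aes * shot.aesthetics)


def select_shots(shots, max_len=10.0):
    """Greedily keep the best-scoring shots that fit into max_len seconds."""
    chosen, used = [], 0.0
    for shot in sorted(shots, key=total_score, reverse=True):
        if used + shot.duration <= max_len:
            chosen.append(shot)
            used += shot.duration
    # Restore chronological order so the cut follows the original story.
    return sorted(chosen, key=lambda s: s.start)


shots = [
    Shot(0, 4, 1, 0.5, 0.5),
    Shot(4, 9, 3, 0.9, 0.8),
    Shot(9, 12, 0, 0.2, 0.3),
    Shot(12, 16, 2, 0.7, 0.6),
]
picked = select_shots(shots)
```

Greedy selection is simple but not optimal; it also does not enforce the 6-second lower bound, which a fuller version would need to check.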


Step 4. We cut the audio track to fit the new video length by taking the portion of audio closest to the boundary between shots.


Subtask 2. Video Resizing

Depending on the platform, the video may need to be square or vertical, which is uncommon for standard TV commercials. To make sure that the relevant information remained in the video, we needed to account for all objects and text segments present in it:

Step 1. We performed object detection on the first, middle, and last frames of each shot to separate the objects from the background. With objects located throughout the whole shot, we could move the crop window to keep them in the frame, mimicking camera movement. In this way, we could cut off the background while preserving the more important information.
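
The cropping idea can be illustrated with a small geometric sketch. It assumes the detector returns bounding boxes as `(x1, y1, x2, y2)` pixel tuples, and both function names are hypothetical: the first computes a crop window of the target aspect ratio centred on the detected objects, and the second moves the window linearly between two key frames to imitate a camera pan.

```python
def crop_window(boxes, frame_w, frame_h, target_aspect):
    """Largest crop of the given aspect ratio, centred on the detected objects.

    boxes: list of (x1, y1, x2, y2) tuples in pixels (assumed detector output).
    target_aspect: width / height, e.g. 1.0 for a square video.
    Returns (x, y, w, h) of the crop in pixels.
    """
    # Largest window of the requested shape that still fits in the frame.
    crop_w = min(frame_w, round(frame_h * target_aspect))
    crop_h = min(frame_h, round(crop_w / target_aspect))
    if boxes:
        # Centre of the bounding box that encloses every detected object.
        cx = (min(b[0] for b in boxes) + max(b[2] for b in boxes)) / 2
        cy = (min(b[1] for b in boxes) + max(b[3] for b in boxes)) / 2
    else:
        cx, cy = frame_w / 2, frame_h / 2
    # Clamp so the window never leaves the frame.
    x = min(max(round(cx - crop_w / 2), 0), frame_w - crop_w)
    y = min(max(round(cy - crop_h / 2), 0), frame_h - crop_h)
    return x, y, crop_w, crop_h


def pan_positions(start_xy, end_xy, num_frames):
    """Linearly interpolate the crop origin between two key frames."""
    if num_frames == 1:
        return [start_xy]
    (x0, y0), (x1, y1) = start_xy, end_xy
    return [
        (round(x0 + (x1 - x0) * t / (num_frames - 1)),
         round(y0 + (y1 - y0) * t / (num_frames - 1)))
        for t in range(num_frames)
    ]


# Square crop of a 1920x1080 frame; the object moves from left to right.
first = crop_window([(100, 200, 500, 900)], 1920, 1080, 1.0)
last = crop_window([(1200, 300, 1600, 800)], 1920, 1080, 1.0)
path = pan_positions(first[:2], last[:2], 5)
```

With three key frames per shot, the same interpolation could be applied twice: from the first frame to the middle one, and from the middle frame to the last.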


Step 2. We performed text detection and editing with the help of OpenCV. We detected text on the first, middle, and last frames of each shot and saved the results to a .csv file. We also added three text-processing options: deletion, copying, and writing. Once the required operations were complete, the new video overwrote the old one.
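
Storing the detection results in a .csv file might look like the sketch below. The column names and the record layout are hypothetical; the text only states that the results were saved as .csv. The example writes to an in-memory buffer so that it runs without touching the file system.

```python
import csv
import io

# Hypothetical column layout: one row per detected text region.
FIELDS = ["shot", "frame", "text", "x", "y", "w", "h"]


def detections_to_csv(detections, out):
    """Write a list of detection dicts to an open text stream as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(detections)


def load_detections(src):
    """Read the detections back; every value comes back as a string."""
    return list(csv.DictReader(src))


detections = [
    {"shot": 0, "frame": "first", "text": "SALE",
     "x": 10, "y": 20, "w": 100, "h": 30},
    {"shot": 0, "frame": "last", "text": "SALE",
     "x": 14, "y": 20, "w": 100, "h": 30},
]
buffer = io.StringIO()
detections_to_csv(detections, buffer)
buffer.seek(0)
rows = load_detections(buffer)
```

Keeping the detections in a plain file makes it possible to review or edit the regions by hand before applying a deletion, copy, or write operation.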



We delivered a solution that allowed our clients to automatically shorten and resize video commercials to fit the requirements of the selected social media platform. With the process automated, they only needed to create one video, saving considerable time and effort on trimming.
