At this year’s Super Bowl, Mountain Dew presented a seemingly simple challenge: count the number of Mountain Dew bottles in its 30-second Super Bowl commercial. Take a second and try to do this yourself here:
Give up yet? It’s harder than it sounds. To solve complex visual challenges at scale you need tools, you need AI, and you need a platform. Sixgill’s AI Platform, Sense, allows businesses to extract meaningful insight from image and video data to help solve business challenges. Our Head of Innovation deemed the Mtn Dew Super Bowl Challenge a critical business challenge, so we got straight to work to use Sense to solve it. In this post, we’ll walk through our approach to this challenge as well as demonstrate how our unified AI Platform can make short work of it.
Step 1: Create the Dataset
Assembling the dataset is typically the first task in an AI process. A typical real-world use case would have several hours of video data to pull from, but for this challenge the entire corpus was the 30-second commercial. The platform can split video into frames at a predefined frame rate, measured in frames per second (FPS). We chose 24 FPS, the native frame rate of the video: 24 FPS times 30 seconds yields roughly 720 frames (or images). Because the dataset was small, we were able to label the entire set.
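To make the frame arithmetic concrete, here is a minimal Python sketch. It is not how Sense splits video internally; the function name `frame_timestamps` is our own illustration, and it assumes a constant frame rate and a clip that is exactly 30 seconds long.

```python
def frame_timestamps(duration_s: float, fps: float) -> list[float]:
    """Return the timestamp (in seconds) of every frame sampled at `fps`.

    Assumes a constant frame rate; real video files may vary slightly.
    """
    n_frames = round(duration_s * fps)
    return [i / fps for i in range(n_frames)]


timestamps = frame_timestamps(duration_s=30, fps=24)
print(len(timestamps))  # 720
```

Sampling at a lower rate would shrink the labeling workload, at the cost of possibly missing bottles that are only visible for a few frames.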
Step 2: Annotate the Data
Next, we defined the labeling procedure as per the official instructions (with examples). The contest instructions for “qualifying” bottles were pretty detailed. If you’re interested in reading them, they’re here.
Data annotation is a critical (and typically time-consuming) step in creating an accurate machine learning model for this type of computer vision project. The annotations are how the model “learns” what is important in the video. The Sense Platform has built-in AI-accelerated labeling that automates common repetitive annotation tasks. These features can decrease labeling time by up to 20X and increase the quality of annotations.
We used our smart polygon selection tool, SmartPoly, shown below. Instead of requiring us to painstakingly draw an outline around every bottle, the tool snaps to the outline of each bottle. We also used our AI-powered Track Forward feature, which takes one annotation and tracks it through the frames in the video. This meant we labeled bottles that were flying through the scene just once, and Track Forward took care of the rest. Had the dataset been larger, we would have used our Sense Labeling Services to make the annotations for us.
Once all our data was labeled, we trained a model to identify all those bottles.
Step 3: Train the Model
Sense SmartML (currently in Beta for our enterprise customers, with general availability coming soon) can train models directly from labels in Sense without any code. Although we had already labeled the entire corpus of data for the challenge, a model still serves two purposes: 1) integration into the ML pipeline, and 2) a “second opinion” on labels to catch any bottles that might have been missed during labeling. Since the dataset was relatively small, training took only about an hour.
Training in progress in the Sense Platform
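The “second opinion” idea can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the Sense implementation: boxes are plain `(x1, y1, x2, y2)` tuples, the helper names `iou` and `unlabeled_detections` are hypothetical, and the 0.5 overlap threshold is an arbitrary choice. Any model detection that does not overlap a human label is flagged for review.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def unlabeled_detections(
    labels: list[Box], detections: list[Box], iou_threshold: float = 0.5
) -> list[Box]:
    """Return model detections that overlap no human label.

    These are candidates for bottles missed during labeling
    (or for false positives); a person still has to decide which.
    """
    return [
        det for det in detections
        if all(iou(det, label) < iou_threshold for label in labels)
    ]


labels = [(0, 0, 10, 10)]
detections = [(1, 1, 11, 11), (40, 40, 50, 50)]
print(unlabeled_detections(labels, detections))  # [(40, 40, 50, 50)]
```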
Step 4: Assemble the ML Pipeline
Now that we had a machine learning model, how were we going to get a count? Just running the model on each frame wasn’t going to cut it. We needed to understand object identities from frame to frame so that we didn’t double count any bottles. There were also abrupt scene changes. How were we going to handle those?
Object tracking is the process of tracking an object across frames in a video. Object tracking takes detections as input and outputs object identities. Sense’s ML Pipeline provides robust object tracking out of the box that incorporates object motion, object visual similarity, and occlusion tolerance.
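To illustrate “detections in, identities out,” here is a deliberately simple tracker sketch in Python. It is not the Sense tracker: it matches boxes between consecutive frames using overlap (IoU) only, with no motion model, no visual similarity, and no occlusion tolerance. The class name `IoUTracker` and the 0.3 threshold are our own assumptions.

```python
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class IoUTracker:
    """Greedy frame-to-frame tracker based on box overlap alone."""

    def __init__(self, iou_threshold: float = 0.3):
        self.iou_threshold = iou_threshold
        self.tracks: dict[int, Box] = {}  # track id -> box in previous frame
        self._ids = count(1)

    def update(self, detections: list[Box]) -> list[int]:
        """Return one track id per detection in the current frame."""
        unmatched = dict(self.tracks)
        new_tracks: dict[int, Box] = {}
        assigned: list[int] = []
        for det in detections:  # greedy: earlier detections pick first
            best_id, best_iou = None, self.iou_threshold
            for track_id, box in unmatched.items():
                score = iou(det, box)
                if score >= best_iou:
                    best_id, best_iou = track_id, score
            if best_id is None:
                best_id = next(self._ids)  # no overlap: a new object
            else:
                del unmatched[best_id]  # each track matches at most once
            new_tracks[best_id] = det
            assigned.append(best_id)
        # Tracks with no match are dropped immediately (no occlusion handling).
        self.tracks = new_tracks
        return assigned


tracker = IoUTracker()
print(tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)]))  # [1, 2]
print(tracker.update([(1, 1, 11, 11), (51, 50, 61, 60)]))  # [1, 2]
print(tracker.update([(200, 200, 210, 210)]))              # [3]
```

A tracker this naive would assign a fresh identity to a bottle that disappears behind something for a single frame, which is exactly why occlusion tolerance matters for counting.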
Object tracking assumes a fixed scene, but commercials rarely stay in the same scene throughout. How do we determine when to reset our object tracker? Scene changes can be detected using a variety of image similarity metrics. For example, the chart below shows a scene change detected by measuring mean pixel intensity alone. By grouping frames into scenes, the object trackers can be reset at the correct time.
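A rough Python sketch of scene detection by mean pixel intensity follows. It assumes grayscale frames given as nested lists of 0–255 values; the function names and the jump threshold of 30 are our own illustrative choices, not Sense settings. A new scene is declared whenever the mean intensity jumps sharply between consecutive frames.

```python
from statistics import fmean

Frame = list[list[int]]  # grayscale pixel rows, values 0-255


def mean_intensity(frame: Frame) -> float:
    """Average pixel value of a single frame."""
    return fmean(pixel for row in frame for pixel in row)


def scene_boundaries(frames: list[Frame], threshold: float = 30.0) -> list[int]:
    """Return the indices of frames that start a new scene."""
    means = [mean_intensity(frame) for frame in frames]
    return [
        i for i in range(1, len(means))
        if abs(means[i] - means[i - 1]) > threshold
    ]


def split_into_scenes(frames: list[Frame], threshold: float = 30.0) -> list[range]:
    """Group frame indices into scenes; a tracker would be reset per scene."""
    cuts = [0, *scene_boundaries(frames, threshold), len(frames)]
    return [range(start, end) for start, end in zip(cuts, cuts[1:])]


dark = [[10, 10], [10, 10]]
bright = [[200, 200], [200, 200]]
frames = [dark, dark, bright, bright, bright]
print(split_into_scenes(frames))  # [range(0, 2), range(2, 5)]
```

Mean intensity alone misses cuts between two scenes of similar brightness, so a production pipeline would likely combine it with other similarity metrics.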
Step 5: Count the Bottles!
The last step is adding up the unique object identities from all the scenes. While this count gets us pretty close, one rule in particular is problematic.
- Correct Bottle Count will NOT include repeat images of a qualifying bottle. Bottles depicted in the same place/same location throughout the video will count only once. In the instance of a roller coaster scene with multiple cars, the cars portrayed are on a continuous loop throughout the Commercial, and should only be counted once each.
Solving this correctly requires a few more techniques: notions of object permanence, abstract reasoning, and possibly reading the minds of the challenge designers. Luckily, the Sixgill Team can apply these rules on top of the legwork done for us by the Sense Platform. That gets our number to 223.
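The automated part of this step, before any manual rule adjustments, amounts to summing the distinct identities found in each scene. Here is a minimal Python sketch under our own assumptions: the function name and input layout are hypothetical, and the sample data is made up, so it does not reproduce the count above.

```python
def count_unique_objects(track_ids_by_scene: dict[int, list[list[int]]]) -> int:
    """Sum the number of distinct track ids observed in each scene.

    `track_ids_by_scene` maps a scene index to per-frame lists of track ids.
    Ids are only meaningful within a scene, because the tracker is reset
    at every scene change.
    """
    total = 0
    for frames in track_ids_by_scene.values():
        unique_ids = {track_id for frame_ids in frames for track_id in frame_ids}
        total += len(unique_ids)
    return total


# Made-up example: 3 distinct bottles in scene 0, 2 in scene 1.
example = {
    0: [[1, 2], [1, 2, 3]],
    1: [[1], [1, 2]],
}
print(count_unique_objects(example))  # 5
```

This simple sum counts a bottle again if it reappears in a later scene, which is the situation the contest rule above forbids and the reason a human pass is still needed.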
In the end, it turned out that this challenge wasn’t very different from the problems our customers face and solve every day using our Sense Platform. By leveraging our AI Platform and our deep experience in building and operationalizing AI for enterprise, Sixgill made short work of this challenge. While time will tell if we counted the right number of bottles (considering there was some interpretation of the rules)**, we hope we’ve shown how the right toolset can make seemingly difficult problems totally dew-able.
**With real-world use cases, we would normally use a larger dataset to train a model. Despite that, we were able to achieve 91.7% accuracy, an impressive feat for a model trained on just one video!