A Quantitative Study of Video Duplicate Levels in YouTube

Yao Liu, Sam Blasiak, Weijun Xiao, Zhenhua Li, and Songqing Chen
Proceedings of the 16th Passive and Active Measurement Conference (PAM 2015)
New York City, NY, March 19-20, 2015


The popularity of video sharing services has increased exponentially in recent years, but this popularity is accompanied by challenges associated with the tremendous scale of user bases and massive amounts of video data. A known inefficiency of video sharing services with user-uploaded content is widespread video duplication. These duplicate videos are often of different aspect ratios, can contain overlays or additional borders, or can be excerpted from a longer, original video, and thus can be difficult to detect. The proliferation of duplicate videos can have an impact at many levels, and accurate assessment of duplicate levels is a critical step toward mitigating their effects on both video sharing services and network infrastructure.

In this work, we combine video sampling methods, automated video comparison techniques, and manual validation to estimate duplicate levels within large collections of videos. The combined strategies yield an estimated video duplicate ratio of 31.7% across all YouTube videos, with 24.0% of storage occupied by duplicates. These high duplicate ratios motivate further examination of the systems-level tradeoffs between video deduplication and storing a large number of duplicates.
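
To make the two reported quantities concrete, the sketch below shows one simple way a count-based duplicate ratio and a storage-based duplicate ratio could be computed from a labeled sample. It is an illustration only, not the estimator used in the paper: the SampledVideo record, its fields, and the toy values are all assumptions, and the is_duplicate flag stands in for the outcome of automated comparison followed by manual validation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SampledVideo:
    # All fields are hypothetical; the released dataset may be organized differently.
    video_id: str
    size_bytes: int      # assumed storage footprint of the video
    is_duplicate: bool   # assumed label from comparison plus manual validation


def duplicate_ratio(sample: Sequence[SampledVideo]) -> float:
    """Fraction of sampled videos labeled as duplicates."""
    if not sample:
        raise ValueError("sample is empty")
    return sum(v.is_duplicate for v in sample) / len(sample)


def duplicate_storage_ratio(sample: Sequence[SampledVideo]) -> float:
    """Fraction of total sampled bytes occupied by videos labeled as duplicates."""
    total = sum(v.size_bytes for v in sample)
    if total == 0:
        raise ValueError("total size is zero")
    return sum(v.size_bytes for v in sample if v.is_duplicate) / total


# Toy sample (made-up values, not data from the paper).
sample = [
    SampledVideo("a", 100, False),
    SampledVideo("b", 50, True),
    SampledVideo("c", 200, False),
    SampledVideo("d", 50, True),
]

count_ratio = duplicate_ratio(sample)
storage_ratio = duplicate_storage_ratio(sample)
```

In this toy sample the storage-based ratio is lower than the count-based ratio because the videos labeled as duplicates are smaller than the others; the two ratios reported above differ in the same direction.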


The dataset used in our paper is made available here for use by the research community. This dataset includes the IDs of the videos we sampled from YouTube, the IDs of the candidate videos we collected from YouTube search results, and the metadata for these videos, which we crawled using the YouTube public API. Please refer to Section 4 of our paper for our data collection and measurement methodology.

If you use our dataset in your research, please email Yao Liu at "yaoliu AT binghamton DOT edu", and include a reference to our paper (pdf) in your work.