The rapid expansion of Generative AI has outpaced ethical standards for training datasets, often resulting in opaque data collection and significant legal risks such as copyright infringement. To address this gap, we introduce the Compliance Rating Scheme (CRS), a framework for evaluating dataset compliance against principles of transparency, accountability, and security. In tandem, we open-source a Python library called DatasetSentinel, which operationalizes these principles and integrates seamlessly into existing pipelines, allowing dataset authors to verify compliance during scraping and practitioners to evaluate existing datasets before use.
Generative AI has grown exponentially, powered largely by massive datasets built through opaque data collection practices. While most attention focuses on model development, the ethical and legal considerations surrounding dataset creation are frequently overlooked. Our work addresses this gap, guided by the following insights:
Given these insights, we devise two artifacts that bridge data provenance with AI dataset scraping and evaluation, emphasizing the principles of transparency, accountability, and security. We open-source both artifacts for the research community and industry practitioners:
Using CRS During Dataset Creation. The first application of CRS is during dataset creation. The DatasetSentinel library integrates into data scraping pipelines to screen individual datapoints as they are collected, checking each datapoint's provenance metadata against the compliance criteria before inclusion. This proactive approach ensures that only ethically and legally compliant data enters the dataset from the outset, preventing costly legal and ethical issues downstream.
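The screening step can be pictured as a simple predicate over each datapoint's provenance metadata. The sketch below is purely illustrative: the metadata fields (`source_url`, `license`, `consent_obtained`), the permissive-license set, and the function names are assumptions for the sake of the example, not DatasetSentinel's actual API.

```python
# Illustrative sketch only: field names, license set, and function
# names are assumptions, not DatasetSentinel's actual API.

PERMISSIVE_LICENSES = {"CC0", "CC-BY", "MIT"}

def passes_compliance_checks(metadata: dict) -> bool:
    """Return True only if a datapoint's provenance metadata
    satisfies every compliance check."""
    checks = [
        metadata.get("source_url") is not None,          # provenance recorded
        metadata.get("license") in PERMISSIVE_LICENSES,  # reuse permitted
        metadata.get("consent_obtained", False),         # consent on file
    ]
    return all(checks)

def collect(scraped_items: list[dict]) -> list[dict]:
    """Keep only datapoints that pass all checks at collection time."""
    return [item for item in scraped_items
            if passes_compliance_checks(item["metadata"])]
```

Filtering at collection time, rather than after the fact, means a non-compliant datapoint never enters the dataset in the first place.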
Using CRS to Assess Existing Datasets. The second application targets AI practitioners who need to vet datasets before training on them. Rather than relying solely on license descriptions or author claims, practitioners can use DatasetSentinel to verify a dataset's compliance independently: the tool analyzes the entire dataset against six compliance criteria and produces a CRS letter grade (A-G), along with detailed reasoning for the assessment, replacing trust with concrete evidence of compliance, or the lack thereof.
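One way to model the letter grades is as a function of how many of the six criteria a dataset fails. The mapping below (zero failures gives an A, each additional failure drops one grade, six failures give a G) is an assumption that is consistent with the example results reported for CRS, but it is not necessarily the framework's exact scoring rule.

```python
# Sketch: mapping six boolean criteria (C1-C6) to a CRS letter grade.
# Assumption: each failed criterion lowers the grade one step from A
# toward G; consistent with the reported examples, but not confirmed
# as CRS's exact scoring rule.

def crs_grade(criteria: list[bool]) -> str:
    """Return a letter grade A-G from six pass/fail criteria."""
    if len(criteria) != 6:
        raise ValueError("CRS expects exactly six criteria (C1-C6)")
    failures = criteria.count(False)
    return chr(ord("A") + failures)
```

Under this assumed mapping, a dataset passing four of six criteria receives a C, and one passing a single criterion receives an F.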
We applied the CRS framework to four publicly available datasets from different distribution platforms to demonstrate its practical utility. The results reveal significant compliance gaps across widely-used datasets in the AI community.
| Dataset | Source | Modality | C1 | C2 | C3 | C4 | C5 | C6 | CRS Score |
|---|---|---|---|---|---|---|---|---|---|
| SOD4SB | GitHub | Images | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | C |
| MS COCO | Custom website | Images | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | F |
| RANDOM People | Hugging Face | Videos | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | B |
| TikTok Dataset | Kaggle | Videos | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | G |
The results demonstrate that even popular, widely used datasets often fail to meet basic compliance standards. Only one dataset (RANDOM People) achieved a "B" rating, while the widely used MS COCO dataset received an "F" rating, highlighting significant ethical and legal concerns in current dataset practices.
To facilitate widespread adoption, we envision CRS scores being prominently displayed on dataset distribution platforms. Here are mockups showing how the CRS framework could be integrated into popular platforms like GitHub, Hugging Face, and academic lab websites, making compliance information immediately visible to practitioners evaluating datasets.
@inproceedings{bohacek2025compliance,
title = {Compliance Rating Scheme: A Data Provenance Framework for Generative AI Datasets},
author = {Bohacek, Matyas and Vilanova Echavarri, Ignacio},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {12150--12159},
year = {2025}
}