Harnessing Weak Supervision for Reliable and Scalable Data Annotation

Published: 2025-11-26

Title: Harnessing Weak Supervision for Reliable and Scalable Data Annotation

Speaker: Naiqing Guan

Time: November 26, 2025, 2:00 PM

Venue: Lecture Hall 106, Wangxuan Institute of Computer Technology, Peking University

Abstract: In the era of machine learning, the need for large-scale labeled datasets has become a critical bottleneck in model development and deployment. Programmatic weak supervision (PWS) offers a promising solution by enabling automatic data labeling through labeling functions (LFs), but it faces significant challenges in practical applications. In this talk, I will present a line of work that enhances the reliability and scalability of PWS. First, I will introduce DataSculpt, a novel framework that leverages large language models to automate LF creation, reducing labeling costs from hundreds of dollars to mere cents while maintaining high accuracy. Second, I will introduce WeShap, a Shapley value-based metric for efficiently evaluating LF quality, which enables systematic improvement of weak supervision systems through principled assessment of individual LF contributions. Third, I will present ActiveDP, a hybrid framework that combines active learning with PWS to optimize both label quality and quantity, demonstrating robust performance across varying LF set sizes. Together, these innovations form a comprehensive suite of tools that significantly advance the practical applicability of PWS, enabling more reliable and scalable data annotation for real-world machine learning applications.

Bio: Naiqing Guan is a PhD candidate in the Data Systems Group at the University of Toronto, supervised by Professor Nick Koudas. His research focuses on data curation for machine learning pipelines, leveraging techniques such as LLM agents, weak supervision, and active learning to enhance the efficiency and reliability of data annotation. He previously interned at the PKUMOD Lab under the supervision of Professor Lei Zou and earned his bachelor’s degree from Peking University in 2020.

