Tencent AI Lab Researchers Introduce Persona Hub: Scaling Synthetic Data Creation with 1,000,000,000 Personas
Large language models (LLMs) increasingly rely on synthetic data for training. Synthetic data is generated to mimic real-world information, so researchers can train and evaluate LLMs without collecting every example by hand. It offers several advantages: protecting privacy, reducing data collection effort, and producing rich, diverse datasets that boost LLM performance across a range of applications.
However, generating synthetic data that is both large-scale and diverse remains a challenge. Traditional methods fall into two groups: instance-driven approaches, which build on a seed corpus, and key-point-driven approaches, which build on a curated list of key points. Both struggle to combine diversity with scalability, because the output can only be as varied as the seed corpus or key-point list it starts from. The resulting datasets are often limited in scope and fail to represent a broad range of scenarios.
Researchers at Tencent AI Lab have proposed a novel solution: Persona Hub. This method utilizes a massive collection of one billion diverse personas, automatically generated from web data. Each persona has unique characteristics like knowledge, experiences, interests, and professions.
Persona Hub allows LLMs to generate data from various perspectives, overcoming the limitations of previous methods. By prompting LLMs with specific personas, the system steers them towards creating distinct and varied datasets.
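To make the idea concrete, here is a minimal sketch of persona-driven prompting in Python. It only builds the prompt strings and leaves the LLM call out. The function name `build_persona_prompt`, the prompt wording, and the three sample personas are illustrative assumptions, not taken from the Persona Hub release.

```python
from typing import List


def build_persona_prompt(persona: str, task: str) -> str:
    """Attach a persona description to a data-synthesis request.

    The template wording is a hypothetical example, not the paper's prompt.
    """
    return f"{task} with the following persona:\n{persona}"


# Hypothetical personas; the real hub holds about one billion of these.
personas: List[str] = [
    "a moving company driver",
    "a chemical kinetics researcher",
    "a high-school music teacher",
]

# The same task paired with different personas yields different prompts,
# which is what pushes the LLM toward varied outputs.
task = "Create a challenging math problem"
prompts = [build_persona_prompt(p, task) for p in personas]
```

Each string in `prompts` would then be sent to an LLM of your choice. Swapping `task` for another request (an instruction, a game character, a tool description) reuses the same personas for a different kind of data.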
At one billion personas, Persona Hub is equal in size to roughly 13% of the world's population. Each persona acts as a carrier of world knowledge, guiding LLMs to produce rich and contextually relevant synthetic data. To build the collection, the researchers developed two scalable techniques that work from web data: text-to-persona and persona-to-persona.
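The following Python sketch shows one way these two techniques could fit together, taking their names at face value: text-to-persona infers a persona from a piece of web text, and persona-to-persona derives further personas from ones already found. The prompt wording, the `build_hub` helper, the exact-string deduplication, and the `stub_llm` stand-in are all assumptions made for illustration. A real pipeline would call an actual model and would likely need fuzzier deduplication.

```python
from typing import Callable, Iterable, List, Set

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def text_to_persona(text: str, llm: LLM) -> str:
    """Ask the model which kind of person is associated with a text."""
    prompt = (
        "Who is likely to read, write, or like the following text? "
        "Answer with a one-sentence persona.\n\n" + text
    )
    return llm(prompt).strip()


def persona_to_persona(persona: str, llm: LLM) -> List[str]:
    """Ask the model for personas related to an existing persona."""
    prompt = (
        "Who is in a close relationship with the following persona? "
        "List one persona per line.\n\n" + persona
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def build_hub(texts: Iterable[str], llm: LLM, rounds: int = 1) -> Set[str]:
    """Seed personas from texts, then expand them for a number of rounds."""
    hub = {text_to_persona(t, llm) for t in texts}
    frontier = set(hub)
    for _ in range(rounds):
        discovered: Set[str] = set()
        for persona in frontier:
            for related in persona_to_persona(persona, llm):
                if related not in hub:  # exact-match dedup (a simplification)
                    discovered.add(related)
        hub |= discovered
        frontier = discovered  # only expand newly found personas next round
    return hub


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, used only for the demo."""
    if prompt.startswith("Who is likely"):
        return "a pediatric nurse"
    return "a child patient\na hospital pharmacist\n"


hub = build_hub(["An article on childhood vaccination schedules."], stub_llm)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client library.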
The persona-driven approach has yielded promising results. The researchers generated many types of synthetic data, including math problems, logical reasoning problems, instructions, knowledge-rich texts, game characters, and tools. Evaluations showed significant improvements in LLM performance. For instance, a model trained on the synthetic math problems achieved higher accuracy than existing open-source LLMs.
The research highlights the potential of persona-driven data synthesis for LLM training and development. Because Persona Hub can generate synthetic data that is both diverse and scalable, the method could become standard practice in the field, unlocking new capabilities for LLMs and broadening their real-world applications. By addressing the core challenges of synthetic data generation, this work could significantly advance artificial intelligence and machine learning.