Cosmopedia v0.1 (Incomplete)

Size:  21.6 GB   |    Registered:  5 months 26 days   |    Downloaded:  18 times

Seeder not seen: 5 months 26 days -> msgraves

 

msgraves ®

Registered: 5 months 26 days

Posts: 12

Post 10-Jun-2024 14:16 | #1 · Author


Cosmopedia v0.1
Cosmopedia is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.
It covers a variety of topics; we tried to map world knowledge present in Web datasets like RefinedWeb and RedPajama, and generate synthetic content that covers them. This is the v0.1 of Cosmopedia, with ample room for improvement and topics to be more comprehensively covered. We hope this dataset will help the community's research efforts in the increasingly intriguing domain of synthetic data. You can find a clickable map by Nomic at https://atlas.nomic.ai/map/cosmopedia.
This work is inspired by the great work of Phi1.5. You can find more details about the dataset in our blog post: https://huggingface.co/blog/cosmopedia
TL;DR
This is a synthetic dataset of 30M samples generated by Mixtral-8x7B-Instruct-v0.1. It contains 8 splits, one per source of the seed samples used in the prompts; the model is asked to generate content related to those seeds. The splits range from web samples to educational resources like Stanford, OpenStax and KhanAcademy. We also use some instruction-tuning datasets as seed samples for stories.
Here's how you can load a dataset split:
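from datasets import load_dataset

# Load the "stories" split; num_proc parallelizes download and preparation.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", num_proc=12)
ds[0]  # inspect the first sample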

If you want a smaller subset of the dataset, check Cosmopedia-100k. We also trained a 1.8B model, Cosmo-1B, on Cosmopedia.
Dataset splits
The prompts are all based on the concept of using a seed sample (for example an extract from a web page) and asking the model to generate new content (textbook, story, blogpost, etc.) related to that seed sample.
The dataset consists of 8 splits depending on the source of the seed data used in the split. Some seed samples may appear more than once when we ask for a different style (e.g. academic textbook vs. blogpost) or audience (e.g. young children vs. college students). For example, each sample in stanford was used with 4 different prompt styles and audiences; check the format and audience columns for more details. We observed that tailoring the audience and prompt style significantly enhances diversity: the proportion of duplicates eliminated via MinHash was under 1%.
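As a rough illustration of how one seed sample can yield several prompts, here is a minimal sketch. The format and audience lists and the prompt wording are hypothetical placeholders, not the templates actually used for Cosmopedia.

```python
from itertools import product

# Hypothetical styles and audiences, chosen only for illustration.
FORMATS = ["academic textbook", "blogpost"]
AUDIENCES = ["young children", "college students"]


def build_prompts(seed_extract: str) -> list[dict]:
    """Return one prompt per (format, audience) pair for a single seed sample."""
    prompts = []
    for fmt, audience in product(FORMATS, AUDIENCES):
        # Hypothetical wording; the real prompts are more detailed.
        prompt = (
            f"Here is an extract from a course unit:\n{seed_extract}\n\n"
            f"Write a {fmt} section related to this extract, aimed at {audience}."
        )
        prompts.append({"format": fmt, "audience": audience, "prompt": prompt})
    return prompts


variants = build_prompts("Unit 3: Introduction to probability")
print(len(variants))
```

Each returned dict keeps the format and audience next to the prompt, mirroring the idea of the format and audience columns.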
The graph below shows the distribution of seed datasets, generations formats and audiences in Cosmopedia:
https://cdn-uploads.huggingface.co/production/uploa...fLO5TxKPUXs4.png
Below are the 8 splits:
web_samples_v1: this and web_samples_v2 are the largest splits (they make up ~75% of the dataset), where we use samples from an internal web dataset similar to RefinedWeb. These samples were selected based on their topic, using a clustering method explained in the section below.
web_samples_v2: similar to web_samples_v1, using different samples. We call it v2 because we refined the prompts for this split (e.g. asking for more depth over breadth in the concept explanations and requesting the model not to generate a title and introductory sentences, which might be redundant across samples).
stanford: we scraped course outlines from stanford.edu, and each time we prompt the model with one of the course units.
stories: we generated stories to add some commonsense and day-to-day knowledge to the dataset. For this split we use samples from UltraChat (only the "questions about the world" subset) and OpenHermes2.5. These are synthetic instruction-tuning datasets that are already curated and cover a wide range of topics.
wikihow: in this split, we asked the model to generate WikiHow articles from WikiHow titles that we scraped; the list is available here. Note that you can find more WikiHow articles in the other splits by looking for them in the format column.
openstax: we scraped course outlines with unit introductions from OpenStax, a resource suggested by AFAIK team.
khanacademy: we scraped the outlines for the courses on KhanAcademy, and asked the model to generate a textbook for each.
auto_math_text: to improve the science knowledge of the model, we use samples from the AutoMathText dataset as seed samples. The dataset covers more than just math. See this clustering plot we made.
Dataset features
The dataset has the following features:
prompt: the prompt we used to generate the content with Mixtral-8x7B-Instruct-v0.1.
text: the synthetic generated content.
seed_data: the prompts include some text from another dataset or an external source; seed_data is the name of that dataset (e.g. web, Stanford courses...)
token_length: the number of tokens in text, computed using Mistral-7B's tokenizer
format: the style of text, for example a textbook, a blogpost or a story. It can also be inferred from the prompt.
audience: the target audience defined in the prompt
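To show how these columns can be used together, here is a small sketch that filters rows by format, audience and token_length. It runs on toy rows with invented values; the exact strings stored in seed_data, format and audience in the real dataset may differ, and select_rows is a hypothetical helper.

```python
# Toy rows that mimic the documented columns; the values are invented for illustration.
rows = [
    {"prompt": "Write a story about...", "text": "Mia packed her bag...",
     "seed_data": "ultrachat", "token_length": 412,
     "format": "story", "audience": "general"},
    {"prompt": "Write a textbook unit on...", "text": "Limits describe...",
     "seed_data": "khanacademy", "token_length": 1530,
     "format": "textbook", "audience": "college_students"},
    {"prompt": "Write a blogpost on...", "text": "Sourdough starts with...",
     "seed_data": "web_samples_v1", "token_length": 690,
     "format": "blogpost", "audience": "general"},
]


def select_rows(rows, fmt=None, audience=None, max_tokens=None):
    """Keep rows that match every filter that is not None."""
    kept = []
    for row in rows:
        if fmt is not None and row["format"] != fmt:
            continue
        if audience is not None and row["audience"] != audience:
            continue
        if max_tokens is not None and row["token_length"] > max_tokens:
            continue
        kept.append(row)
    return kept


short_general = select_rows(rows, audience="general", max_tokens=500)
total_tokens = sum(row["token_length"] for row in rows)
print(len(short_general), total_tokens)
```

The same predicate could be passed to a dataset filter call on the loaded split; the plain-Python version above just keeps the example independent of any download.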
Dataset creation
The "Dataset splits" section already provides an overview of the data creation pipeline. In this section, we will explain the topic clustering method for web samples and our iterative process for refining the prompts, in addition to decontamination.
Topic clustering
Our goal was to generate a vast quantity of synthetic data covering a wide range of topics (essentially, anything useful found on the web) in a cleaner format like textbooks. A natural strategy was to begin with web samples, using them as seeds for the generation. This approach, employed by Li et al. in Phi-1.5, appears to be the most scalable method for synthetic data generation, given the availability of web datasets with trillions of tokens.
The prompted model will use an extract from these seed samples as a reference for generation, so the topic might matter more than the actual content of the file. To filter out less relevant topics and to provide the model with context for generating content, we first clustered millions of files from a web dataset. Then we prompted Mixtral 8x7B with extracts from 10 random samples in each cluster and asked it to find the topic they have in common and to provide an educational score for that topic. The clusters and topics can be inspected in this demo, and the code is available in text-clustering. The educational score seems to work for "very uneducational" topics like adult content and "highly educational" topics like College Mathematics, but isn't very relevant in between. So we manually inspected the 145 clusters we found and discarded 35 of them. The final list of topics is available here.
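The cluster labelling step described above can be sketched as follows. This is only an illustration: the function name, the extract length, the prompt wording and the 1 to 10 scoring scale are assumptions, and the actual code lives in text-clustering.

```python
import random


def build_topic_prompt(cluster_texts, n_samples=10, extract_chars=300, seed=0):
    """Sample up to n_samples texts from one cluster and build a labelling prompt."""
    rng = random.Random(seed)
    k = min(n_samples, len(cluster_texts))
    picked = rng.sample(cluster_texts, k)
    # Only a short extract of each sample is shown to the model (length is an assumption).
    lines = [
        f"Extract {i}: {text[:extract_chars]}"
        for i, text in enumerate(picked, start=1)
    ]
    # Hypothetical instruction; the scoring scale is an assumption.
    instruction = (
        "What topic do the texts above have in common? "
        "Also rate how educational this topic is on a scale from 1 to 10."
    )
    return "\n".join(lines + ["", instruction])


docs = [f"document number {i} about baking bread" for i in range(25)]
prompt = build_topic_prompt(docs)
print(prompt)
```

Fixing the seed makes the sampled extracts reproducible, which helps when comparing topic labels across prompt revisions.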
We don't do any further filtering inside the clusters. We include the topic of the sample in the prompt 100% of the time for web_samples_v1, but only 50% of the time for web_samples_v2, where we tried to refine the prompts, in case the topic isn't accurate or the topic list isn't comprehensive. Below are the clusters found in Cosmopedia:
https://cdn-uploads.huggingface.co/production/uploa...EfH3j8iZYXVN.png
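Including the topic only part of the time can be sketched with a simple random draw. The function name, the prompt wording and the use of Python's random module are assumptions made for illustration.

```python
import random


def build_web_prompt(extract, topic, include_topic_prob, rng):
    """Build a prompt that mentions the cluster topic with the given probability."""
    parts = []
    # rng.random() is in [0, 1), so 1.0 always includes the topic and 0.0 never does.
    if rng.random() < include_topic_prob:
        parts.append(f"Topic: {topic}")
    parts.append(f"Extract: {extract}")
    # Hypothetical instruction text.
    parts.append("Write an educational piece related to the extract above.")
    return "\n".join(parts)


rng = random.Random(0)
v1_prompt = build_web_prompt("Yeast ferments sugar...", "Baking", 1.0, rng)  # web_samples_v1 setting
v2_prompt = build_web_prompt("Yeast ferments sugar...", "Baking", 0.5, rng)  # web_samples_v2 setting
print(v1_prompt)
```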
Diversity
We find that when using the same seed sample multiple times, changing the generation style and/or the target audience results in different generations, covering the same topic from different angles. For example, when asking the model for a children's textbook, we needed to remind it that it can't use complex concepts and that the tone should be adapted to children. The same goes when asking for textbooks for college students vs. researchers: we had to emphasize the level of depth we wanted for each, and how academic the textbooks should be.
By carefully iterating on the prompts using HuggingChat and then generating a few hundred samples, we managed to reduce the redundancy. For example, we noticed that the model always started the stories with "Once upon a time" and the forum posts with "A few years back". Asking it to explicitly avoid these sentences when starting the generation results in more diverse beginnings (don't worry, "Once upon a time" still appears in stories!). The same goes for blogposts and textbooks, where the introductory sentences were initially repetitive.
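One possible way to spot repetitive beginnings in a batch of generations is to count their opening words. This is a hypothetical check, not necessarily how the redundancy was noticed, and the example generations are invented.

```python
import string
from collections import Counter


def opening(text, n_words=4):
    """Return the first n_words of a text, lowercased and stripped of punctuation."""
    words = [w.strip(string.punctuation).lower() for w in text.split()[:n_words]]
    return " ".join(words)


def opening_counts(texts, n_words=4):
    """Count how often each opening phrase occurs across generations."""
    return Counter(opening(t, n_words) for t in texts if t.strip())


# Invented generations for illustration.
generations = [
    "Once upon a time there was a fox.",
    "Once upon a time, a girl found a map.",
    "A few years back I moved to Oslo.",
    "The rain stopped early that morning.",
]
counts = opening_counts(generations)
print(counts.most_common(1))
```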
Running MinHash deduplication on the splits detects less than 1% of the files as duplicates.
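For readers unfamiliar with MinHash, here is a minimal standard-library sketch of the idea: hash word shingles with several salted hash functions, keep the minimum per function, and estimate Jaccard similarity from the share of matching minima. The shingle size and number of hash functions are assumptions, and this is not the deduplication code used for Cosmopedia.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams in a text (the whole text if it is shorter than n)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64, n=3):
    """Compute a MinHash signature using num_perm salted BLAKE2b hash functions."""
    shingle_set = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk"
doc_c = "gradient descent updates model weights using the derivative of a loss function"
sim_near = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
sim_far = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_c))
print(sim_near, sim_far)
```

In a real pipeline, pairs whose estimated similarity exceeds a chosen threshold would be flagged as duplicates, usually with locality-sensitive hashing to avoid comparing every pair.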
Decontamination
Given how we generate synthetic content, there is a possibility that the seed samples or the model's training data are contaminated with benchmark data. Therefore, we run a decontamination pipeline to make sure we don't have any samples from the test benchmarks in our dataset.
We use a 10-gram overlap to retrieve potentially contaminated samples, similarly to Phi-1. After retrieving the candidates, we run a diff between the dataset sample and the benchmark sample using difflib.SequenceMatcher and discard the sample if len(matched_substrings)/len(benchmark_sample) > 0.5. We run decontamination against all the benchmarks we evaluated the Cosmo-1B model on: MMLU, HellaSwag, PIQA, SIQA, Winogrande, OpenBookQA, ARC-easy, ARC-challenge.
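The two-stage check described above can be sketched as follows. Whitespace tokenization, lowercasing and comparing at the token level (rather than the character level) are assumptions; the actual pipeline is in the Cosmopedia GitHub repository.

```python
from difflib import SequenceMatcher


def ngrams(tokens, n=10):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample, benchmark, n=10, threshold=0.5):
    """Flag a sample that shares an n-gram with a benchmark item and matches most of it."""
    # Assumption: simple lowercase whitespace tokenization.
    sample_tokens = sample.lower().split()
    bench_tokens = benchmark.lower().split()
    # Stage 1: cheap candidate retrieval via n-gram overlap.
    if not ngrams(sample_tokens, n) & ngrams(bench_tokens, n):
        return False
    # Stage 2: diff the two token sequences and measure how much of the benchmark is matched.
    matcher = SequenceMatcher(None, sample_tokens, bench_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(bench_tokens) > threshold


benchmark_item = "which of the following best explains why the sky appears blue during the day"
leaky = "Quiz time. " + benchmark_item + " Think about light scattering."
clean = "Bread rises because yeast ferments sugar and releases carbon dioxide into the dough over time"
print(is_contaminated(leaky, benchmark_item), is_contaminated(clean, benchmark_item))
```

The n-gram stage keeps the expensive diff off most sample and benchmark pairs, and the ratio threshold avoids discarding samples that only share a short common phrase with a long benchmark item.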
We report the number of contaminated samples removed from each dataset split, as well as the number of unique benchmark samples that they correspond to (in brackets):
Dataset group | ARC Easy | ARC Challenge | BoolQ | HellaSwag | MMLU | OpenBookQA | PIQA | WinoGrande
web_samples_v1 + web_samples_v2 + stanford + openstax | 30 (13) | 19 (3) | 386 (41) | 6 (5) | 1 (1) | 0 (0) | 5 (3) | 0 (0)
auto_math_text + khanacademy | 4 (4) | 13 (2) | 34 (7) | 1 (1) | 0 (0) | 0 (0) | 0 (0) | 0 (0)
stories | 33 (20) | 20 (12) | 27 (21) | 3 (3) | 1 (1) | 2 (2) | 6 (4) | 3 (2)
Code
The code for topic clustering of the web samples, building the prompts, content generation and data deduplication & decontamination can be found in the Cosmopedia GitHub repository.
Citation
@software{benallal2024cosmopedia,
author = {Ben Allal, Loubna and Lozhkov, Anton and Penedo, Guilherme and Wolf, Thomas and von Werra, Leandro},
title = {Cosmopedia},
month = February,
year = 2024,
url = {https://huggingface.co/datasets/HuggingFaceTB/cosmopedia}
}
Cosmopedia v0.1 (Incomplete) [aitracker.art-31].torrent  
Torrent: Registered [ 2024-06-10 14:16 ]

info_hash: ACEC0E1D3ED78FB63EBEA649F091F99EBE89550F
info_hash v2: 29CB7773AC6CB68AABC817C848850D2A3BE78E05FCF14841AE041B8C5F29D2A6


Status: * not checked

leaf

Registered: 5 months 27 days

Posts: 35

Location: Earth

Post 10-Jun-2024 14:20 | #2 (after 4 minutes)

Why does this show as 20gb? I have this dataset and it’s about 80gb in total.

msgraves ®

Registered: 5 months 26 days

Posts: 12

Post 10-Jun-2024 14:20 | #3 · Author (after 43 seconds)

THIS APPEARS TO BE INCOMPLETE. I can't delete it for some reason, so i'll leave it up for now, but IT'S NOT COMPLETE!!

Dreamertist

Registered: 6 months

Posts: 13

Post 10-Jun-2024 15:49 | #4 (after 1 hour 28 minutes)

Can you try editing it and uploading a new version if you have a full torrent?