OpenHermes 2.5 Dataset: A Deep Dive into Conversational Data for LLMs

Tools & Engineering

The Engineer

2 Feb 2024 · 3 min read

OpenHermes 2.5 offers developers and researchers a trove of conversational data in the efficient Parquet format, enhancing LLM training and evaluation with its rich, structured dataset.

The Hugging Face team has recently released the OpenHermes 2.5 dataset, a valuable resource for developers and researchers working with large language models (LLMs). This dataset is particularly interesting because it provides a wealth of conversational data that can be used to fine-tune or evaluate LLMs. Let's dive into what makes this dataset unique and why it matters to practitioners.

What Changed Technically?

The OpenHermes 2.5 dataset has been auto-converted to the Parquet format, which is a columnar storage format optimized for efficient data processing. This conversion is significant because:

Performance Improvements: Parquet files are designed to be read and written more efficiently than traditional formats like CSV or JSON. This means faster load times and better performance when querying large datasets.
Compatibility: The dataset can now be easily integrated into a variety of data processing frameworks, including Apache Spark, Dask, and Pandas, making it accessible to a broader range of users.

Dataset Overview

The OpenHermes 2.5 dataset contains 1 million rows, all of which are part of the train split. Here's a breakdown of the key fields:

id: A unique identifier for each conversation.
title: The title or topic of the conversation.
topic: A more detailed description of the conversation's subject matter.
language: The language in which the conversation is conducted (44 different languages).
model: The name of the model used to generate the responses (1 unique value, likely a placeholder).
conversations: A list of message objects, each containing:
- from: The sender of the message (e.g., "human" or "gpt").
- value: The content of the message.
- weight: An optional weight for the message (often null).
avatarUrl: URL to an avatar image associated with the conversation.
custom_instruction: Custom instructions provided for the conversation (1 unique value, likely a placeholder).
system_prompt: A system prompt used to guide the conversation (6 different values).
views: The number of views the conversation has received (ranges from 0 to 9.13k).
category: The category of the conversation (28 different categories).
idx: An index value for the conversation.
model_name: The name of the model used (1 unique value, likely a placeholder).
source: The source of the data (14 different values).
skip_prompt_formatting: A boolean indicating whether to skip prompt formatting (2 different values: true or false).
hash: A hash value for the conversation.

Example Conversations

Here are a few examples of conversations from the dataset:

Example 1: Mathematical Problem

{
  "from": "human",
  "value": "Every day, a tree drops 7 leaves. How many leaves would it drop in a month of February in a non-leap year? Include your logic.",
  "weight": null
},
{
  "from": "gpt",
  "value": "Here's the logic behind this:\\n\\n1. We know that February has 28 days in a non-leap year.\\n2. If the tree drops 7 leaves every day, then over 28 days it would drop: 7 * 28 = 196 leaves.\\n3. Therefore, the tree would drop 196 leaves in February of a non-leap year.",
  "weight": null
}

Example 2: Analytical Chemistry

{
  "from": "human",
  "value": "In analytical chemistry, what is the principle behind the use of an internal standard in quantitative analysis?\\nA. It compensates for variations in sample preparation and instrumental response.\\nB. It enhances the sensitivity of the analytical method.\\nC. It reduces the detection limit.\\nD. It increases the precision of the measurements.",
  "weight": null
},
{
  "from": "gpt",
  "value": "The principle behind using an internal standard in