
This article explores how RLHF-style reward models can be fine-tuned to predict Hacker News upvotes efficiently, using minimal GPU resources (just $4.80 worth), for content creators seeking visibility.
In the world of natural language processing (NLP), reinforcement learning from human feedback (RLHF) has emerged as a powerful technique for enhancing model performance, especially in generating high-quality content. In this article, we’ll dive into how to build a reward model that predicts upvote counts for Hacker News (HN) stories using RLHF, all for just $4.80 of GPU time.
Hacker News is a community-driven platform where users submit and vote on tech-related news articles. The front page is highly competitive, with only the most engaging and relevant stories making it to the top. However, not every strong story gets the attention it deserves. This is where our fine-tuned model comes in. Using the reward model, we can identify stories that are likely to do well on HN, even if they haven’t received any upvotes yet.
Reinforcement learning (RL) is a subset of machine learning where an agent learns to make decisions by performing actions in an environment and receiving rewards or penalties based on those actions. Over time, the model adjusts its behavior to maximize rewards and minimize penalties.
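To make the reward loop concrete, here is a minimal sketch of an epsilon-greedy agent on a toy three-armed bandit. It is not part of the HN project: the function name `run_bandit`, the reward probabilities, and the step count are all illustrative assumptions.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly pick the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was taken
    values = [0.0] * len(reward_probs)  # running average reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            arm = rng.randrange(len(reward_probs))
        else:
            # Exploit: take the action with the highest estimated reward.
            arm = max(range(len(values)), key=values.__getitem__)
        # The environment returns a reward of 1 or 0 (toy assumption).
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the reward estimate for this action.
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts


values, counts = run_bandit([0.2, 0.5, 0.8])
best = max(range(len(values)), key=values.__getitem__)
print("estimated rewards:", values, "preferred action:", best)
```

The agent never sees the true probabilities; it only adjusts its estimates from the rewards it receives, which is the behavior the paragraph above describes.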
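Reinforcement learning from human feedback (RLHF) takes this concept further by incorporating human preferences into the training process. In its usual form, human judgments of outputs are collected, a reward model is trained to predict those judgments, and the reward model then supplies the reward signal for RL. This article covers the reward model step, with HN upvotes standing in for the human feedback.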
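To build our reward model, we need to collect a dataset of HN stories, fine-tune a model to predict upvote counts, and evaluate it on stories it has not seen.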
We collected a dataset of HN stories, including those that didn’t make it to the front page or receive any upvotes. This dataset is crucial because it helps us understand what makes a good story even before it gets any traction.
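As a rough sketch of the data preparation, the snippet below turns one raw story record into a training example. It assumes records shaped like items from the public Hacker News API (fields such as `type`, `title`, `url`, `score`); the helper name `to_example` and the choice of title plus domain as the input text are our assumptions, not details from the dataset described above.

```python
from typing import Optional
from urllib.parse import urlparse


def to_example(item: dict) -> Optional[dict]:
    """Convert one HN item dict into {"text", "score"}, or None if unusable."""
    # Keep only live stories; comments, jobs, and removed items are skipped.
    if item.get("type") != "story" or item.get("deleted") or item.get("dead"):
        return None
    title = (item.get("title") or "").strip()
    if not title:
        return None
    # Text-only posts (e.g. Ask HN) have no URL, so the domain may be empty.
    domain = urlparse(item.get("url") or "").netloc
    text = f"{title} ({domain})" if domain else title
    # Low-scoring stories are kept on purpose: the model needs to see them too.
    return {"text": text, "score": int(item.get("score", 0))}


sample = {"type": "story", "title": "Show HN: A tiny tokenizer",
          "url": "https://example.com/post", "score": 1}
print(to_example(sample))
```

Keeping the low-scoring stories is the point of the design: filtering them out would leave the model with no picture of what an ignored submission looks like.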

We used a transformer-based model (e.g., BERT) to predict upvote counts. The model was fine-tuned on our dataset using the Hugging Face Trainer API. Key hyperparameters included:
The training process took approximately 10 minutes on a single GPU, costing around $4.80.
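The fine-tuning setup could look roughly like the sketch below, which wires a BERT-style encoder with a single regression output into the Hugging Face `Trainer`. Several things here are assumptions rather than details from our run: the `bert-base-uncased` checkpoint, the log-transformed target, the helper names, and every hyperparameter value, which are placeholders only.

```python
import math


def upvotes_to_target(score: int) -> float:
    """Compress heavy-tailed upvote counts with log1p (our assumption)."""
    return math.log1p(max(score, 0))


def target_to_upvotes(pred: float) -> float:
    """Invert upvotes_to_target to get back to an upvote count."""
    return math.expm1(pred)


def build_trainer(train_texts, train_scores, eval_texts, eval_scores,
                  model_name="bert-base-uncased", output_dir="hn-reward-model"):
    # Heavy dependencies are imported lazily so the helpers above
    # stay usable without torch/transformers installed.
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # num_labels=1 gives a single regression output trained with MSE loss.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=1)

    class StoryDataset(torch.utils.data.Dataset):
        def __init__(self, texts, scores):
            # Pad to a fixed length so the default collator can stack batches.
            self.enc = tokenizer(list(texts), truncation=True,
                                 padding="max_length", max_length=128)
            self.labels = [upvotes_to_target(s) for s in scores]

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
            return item

    # Placeholder hyperparameters, not the values used in the article.
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=1,
                             per_device_train_batch_size=16,
                             learning_rate=2e-5)
    return Trainer(model=model, args=args,
                   train_dataset=StoryDataset(train_texts, train_scores),
                   eval_dataset=StoryDataset(eval_texts, eval_scores))
```

Calling `build_trainer(...).train()` would start the run. If a log target is used as sketched, predictions need to go through `target_to_upvotes` before being compared with raw upvote counts.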
To evaluate our model, we used a holdout set of HN stories that were not part of the training data. The model’s performance was measured using mean absolute error (MAE) and root mean squared error (RMSE).
These metrics indicate that our model can predict upvote counts with reasonable accuracy.
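For reference, both metrics are simple to compute by hand. The sketch below uses made-up numbers purely to show the arithmetic; they are not results from our model.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: like MAE, but penalizes large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Illustrative values only.
actual = [1, 2, 3]
predicted = [2, 2, 5]
print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # sqrt(5/3), about 1.29
```

Because RMSE squares each error, a single badly mispredicted viral story moves it far more than it moves MAE, so reporting both gives a fuller picture for heavy-tailed targets like upvotes.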
Let’s look at some examples of stories identified by our model as likely to do well on HN:
None of these stories made it to the front page or received any upvotes, but they were all identified by our model as high-potential stories. Subjectively, I agree with the model; these stories deserve more attention.
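Surfacing overlooked stories like these amounts to scoring every low-traction story and sorting by prediction. Here is a minimal sketch, assuming each story is a dict with `text` and `score` keys and that `predict` is any callable wrapping the trained model; the name `rank_overlooked` and the `max_score=1` cutoff (treating a score of 1 as "no upvotes") are our assumptions.

```python
from typing import Callable, Iterable, List


def rank_overlooked(stories: Iterable[dict], predict: Callable[[str], float],
                    max_score: int = 1, top_k: int = 10) -> List[dict]:
    """Return the top_k low-scoring stories ordered by predicted score."""
    # Keep only stories that got little or no traction.
    overlooked = [s for s in stories if s["score"] <= max_score]
    # Attach the model's prediction to each remaining story.
    scored = [{**s, "predicted": predict(s["text"])} for s in overlooked]
    # Highest predicted score first.
    return sorted(scored, key=lambda s: s["predicted"], reverse=True)[:top_k]


# Stand-in predictor for demonstration only: longer text scores higher.
demo = [{"text": "aaaa", "score": 1},
        {"text": "aa", "score": 1},
        {"text": "aaaaaaaa", "score": 50}]
top = rank_overlooked(demo, predict=len)
print([s["text"] for s in top])
```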
In future posts, we’ll explore how to use this reward model in conjunction with RL to create a model that can write high-value HN stories. The goal is to generate content that not only gets upvotes but also
About the author
Kai built ML infrastructure at a Bay Area startup before developing an obsession with transformer architectures and inference optimisation that eventually pulled him out of product work entirely. A stint at a compute research lab sharpened his instinct for what actually matters in a model release versus what is marketing. He writes from the inside — from the perspective of someone who has debugged the systems he is describing at three in the morning. He is allergic to hype and instinctively drawn to the unglamorous plumbing questions that everyone else skips over.
4 November 2024