ChatGPT is to leverage Reddit data for its Large Language Model. Reddit is the most useful resource available on the Internet today outside of YouTube, but unlike YouTube it has not yet reached the inflection point where monetisation comes at the cost of its users. YouTube is playing a very close game here and is at the point where a competitor could take away much of its advantage – one can hope, anyway!
Below is the Reddit post discussing the new agreement:
https://www.redditinc.com/blog/reddit-and-oai-partner
Opinion?
So now that Reddit data is to be used to train ChatGPT under this new deal with OpenAI, what is going to change? I use Reddit almost every day and am an active contributor. The platform provides a genuine mechanism for questions and answers on almost any topic, as well as a home for those who passively consume content rather than actively participate. To the passive consumers: one day, try posting or providing a response – it is a rewarding experience, so long as you do not take a polarising view without an iron mental state and a flak jacket!
You see, Reddit posts go one of two ways: the hive mind either “supports” or “dishes on” a poster or commenter. There can be huge amounts of hate loaded on, and sometimes it is not clear why opinion swings the way it does.
Reddit Data into an LLM?
Including data from Reddit is going to bump up the “intelligence” of ChatGPT’s Large Language Model, but Oh My Gosh it could get seriously opinionated and bitchy extremely quickly. How will this be handled by ChatGPT, considering LLMs do what they do without our understanding of why they produce the answers they do?
An LLM is almost a black box (see LLMs here for more information). The data used to train a model cannot be completely known when the model is large, because there are not enough people, or enough time, to pre-filter everything used to train it. All that is known are the weightings placed on the data being ingested; the output itself is not understood. Basically, the heavier the weighting towards a particular aspect, the more likely a certain output will be.
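To make the weighting point concrete, here is a minimal sketch – not how any real LLM is implemented, and the token scores are invented purely for illustration – of how raw weights become output probabilities via a softmax: the heaviest-weighted token is the one most likely to be produced.

```python
import math

def softmax(scores):
    """Turn raw scores (weightings) into a probability distribution."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical next-token weightings after some prompt
scores = {"useful": 2.5, "toxic": 1.0, "boring": 0.2}
probs = softmax(scores)

# The heaviest-weighted token ends up the most probable output
best = max(probs, key=probs.get)
```

The point of the sketch is simply that nothing here says *why* “useful” is weighted highest – only that, once it is, the model will tend to emit it.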
A model cannot be trained to “not” provide certain answers: racism, sexism, prejudice and so on are always in the training data, because humans provided both the data and the weightings. What can be done is to filter answers based on the words they contain, but this is unreliable. As has happened many times, “prompt hacking” is used to get around the LLM and have it answer questions the creator of the LLM does not want answered, and currently there are very limited methods available to prevent this.
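To illustrate why word-based filtering is so unreliable, here is a toy sketch – the blocklist and the example strings are hypothetical, not any real moderation system. A naive keyword filter catches the exact banned word but is trivially evaded by the kind of obfuscation prompt hackers use.

```python
import re

BLOCKLIST = {"badword"}  # hypothetical banned term for illustration

def keyword_filter(text):
    """Naive output filter: pass the answer only if no banned word appears."""
    words = re.findall(r"[a-z]+", text.lower())
    return not any(w in BLOCKLIST for w in words)

keyword_filter("this contains badword")    # blocked -> False
keyword_filter("this contains b4dword")    # leetspeak slips through -> True
keyword_filter("spell it: b-a-d-w-o-r-d")  # hyphenation slips through -> True
```

Every obfuscation needs its own rule, and attackers only need one the filter missed – which is why word filtering alone cannot make an LLM safe.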
Reddit user-submitted content covers every possible aspect of human life, even delving into somewhat illegal activities. For ChatGPT to use this data for good is going to be an interesting experiment, and one that will absolutely “smarten up” LLMs – but are we ready for the outputs it will generate?
Overall a good thing
I for one want to see this pan out and to have the collective intelligence of Reddit contributors available to OpenAI. Today, many websites “pretending” to provide information are bot-generated dross, designed to bend a search engine into recommending them at the top of the list; Reddit posts tend to be the most useful search results today.
The above is not even the first 5 results from a Google search on “Endpoint Protection Solutions”, because the top results were sponsored. The results above are 4 or 5 pieces of “review site” generated rubbish and vendor pages; not one of them would help me make an informed decision, because of “monetisation” – a company paid for their product or review site to appear there. Unfortunately we still use Google for searches; hell, even Bing Search is beginning to give better results today.
Recently there was a podcast about this “Dead Internet Theory”, and it is worth a listen.
This is where integrating Reddit results into LLMs is going to become most powerful. Search engines from both Microsoft and Google are starting to use AI results in their searches, and having real content from Reddit is going to improve the search experience (with a bit of bias thrown in).
Summary
Having Reddit data used for training is a good sign for Large Language Model development. There are big risks with the content Reddit contains, and surely there will need to be “safe for work” filters and the like, but being user generated, the data should be more valuable than 90% of the web content out there in 2024.
I am looking forward to AI and Large Language Models pushing further ahead, becoming better at predicting the answers I want and providing better responses – roll on 2025.