12 May 2025 / News

What do you rely on Chatbots for?

Jaywing

A really interesting piece of research by YouGov was published this week, showing the results of a survey asking the UK public whether they thought AI chatbots were good or bad at a range of different tasks: https://yougov.co.uk/technology/articles/52062-what-do-britons-think-ai-chatbots-are-good-at

This is a pertinent question for all of us: an increasing number of people are using chatbots for both business and personal reasons, yet how many have had any formal training in how to use these tools and get the best out of them? And how many have systematically tested the responses and outputs of the various chatbots to measure how good they are?

Thankfully, the technical community does just that. The Chatbot Arena leaderboard, hosted on Hugging Face, ranks over 200 large language models (LLMs) based on nearly three million crowdsourced votes on head-to-head comparisons (https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard). The models are also tested against a number of benchmarks to see how they perform at various categories of task (https://arxiv.org/pdf/2403.04132).
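For those curious how a leaderboard turns millions of pairwise votes into a ranking: the Chatbot Arena paper describes fitting a statistical (Bradley-Terry) model to the votes, but the intuition is captured by a simpler Elo-style update, sketched below in Python. The model names, votes and constants here are all invented for illustration.

```python
# Minimal Elo-style rating update from pairwise chatbot votes.
# Illustrative only: Chatbot Arena fits a Bradley-Terry model;
# this simpler online update captures the same idea.

K = 32  # step size: how much a single vote moves a rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_wins: bool) -> tuple[float, float]:
    """Move both ratings towards the observed vote outcome."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

# Hypothetical votes: (model_a, model_b, did_a_win)
votes = [("model-x", "model-y", True), ("model-y", "model-x", True),
         ("model-x", "model-y", True)]
ratings = {"model-x": 1000.0, "model-y": 1000.0}
for a, b, a_wins in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_wins)
print(ratings)  # model-x edges ahead after winning 2 of 3 votes
```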

This is very useful information. Combining it with an understanding of the data these models are trained on, how they work, and a careful consideration of the ethics of using them for certain purposes, we drew some more informed conclusions about the best and worst use cases for chatbots, and quantified them on a scale from -100 (a very poor or inappropriate use case) to +100 (an ethical use case where the chatbot performs very highly). The results are shown in Figure 1:

The strongest uses at the top are chatbot staples and deliver high-quality, accurate responses. It is worth noting that nothing scores +100: even for core tasks, the human needs to stay in control and do some checking and assessment of the output. As the scores drop below the +75 of the “doing simple sums” use case, it’s important to recognise that chatbots can make mistakes if a question is worded poorly or the requested calculation is ambiguous. For more complicated maths, performance against benchmark maths tests is good, but the models certainly don’t get every answer right. A score of +35 implies much more human input to make sure problems are correctly solved, perhaps by stepping through methods and answers with the chatbot.

Suggestions for TV programmes or gifts score positively because, whilst the quality of answers can’t be guaranteed, suggestions which can be taken or ignored are a sound way to use the tool: you gain ideas but still exercise human judgement about which are best. This extends to solving business problems and even advice on child behavioural issues, where suggestions might be useful but the response may not be authoritative or appropriate in the specific circumstances in question. It’s also worth considering whether you want to use the learned knowledge base of the model, or whether you want to direct the chatbot at a particular authoritative source of your choice. For example, for a question about GDPR it’s a good idea to direct the chatbot to answer based on information from the Information Commissioner's website rather than leaving it to choose where it gets the information from, as sketched below.
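As a concrete sketch of that last point, here is one way to direct a chatbot at a source of your choice using the OpenAI Python client. The model name, the one-line excerpt and the exact instructions are all assumptions for illustration; the pattern itself (supply the source text, instruct the model to answer only from it) works with any chat API.

```python
# Sketch: ground a chatbot answer in a source you choose,
# rather than its learned knowledge base. Assumes the OpenAI
# Python client; the model name and excerpt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In practice this would be text fetched from ico.org.uk,
# the Information Commissioner's Office website.
ico_excerpt = """Personal data must be processed lawfully,
fairly and in a transparent manner in relation to individuals."""

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "Answer using ONLY the source text provided. "
                    "If the source does not cover the question, say so "
                    "rather than answering from general knowledge."},
        {"role": "user",
         "content": f"Source text:\n{ico_excerpt}\n\n"
                    "Question: What does GDPR say about transparency?"},
    ],
)
print(response.choices[0].message.content)
```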

At the bottom of the list in Figure 1 are the use cases which start to become quite problematic. Giving legal advice that is relied on is not recommended, although getting a general idea of the law on a subject as part of an information-gathering exercise, and treating it as a guideline, can be useful. The bottom four use cases are fraught with ethical, medical and wellbeing issues; they are definitely use cases to take great care with, never favouring the output of a chatbot over specific professional or medical advice.

To compare public perceptions against these values, net approval ratings were calculated from the YouGov survey for each use case by subtracting the percentage of people who thought a use case was very or fairly bad from the percentage who thought it was very or fairly good (the usual method for arriving at net approval ratings for public figures). A difference was then calculated by subtracting the public net approval score from the expert view, to assess how consistent the public view recorded by YouGov is with a more expert view, and where it is most misaligned. A high positive value suggests the public is under-estimating the value of the use case, whereas a large negative value suggests the public is over-estimating it.
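To make the calculation concrete, here is the arithmetic as a short Python sketch, using invented figures rather than the survey’s actual numbers:

```python
# Net approval and expert-gap calculation, with invented figures.
def net_approval(pct_good: float, pct_bad: float) -> float:
    """% saying very/fairly good minus % saying very/fairly bad."""
    return pct_good - pct_bad

# Hypothetical use case: 55% say good, 20% say bad, expert score +60.
public = net_approval(55, 20)  # +35
expert = 60
gap = expert - public          # +25: public under-estimates this use case
print(public, gap)
```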

An important point to note up front about the YouGov data is that only 47% of respondents said they had ever used a chatbot, and only 18% used one regularly (i.e. weekly or more often). Many of the answers were therefore submitted as a perception and not based on direct experience. The net approval ratings for the full respondent base show that non-users significantly under-estimate the value of using chatbots for certain purposes.

Hands-on experience of the technology would help to encourage chatbot use for the items at the top of this list. Progressing further down the list to +28, use cases like helping someone budget and improve their finances are robust, but they require some level of training or self-teaching to get the best results; the task demands some skill at prompting, interpreting responses, and challenging and refining them to ensure a good outcome (see the example below). This lack of prompting knowledge and skill is a major barrier to getting value from chatbots in these more nuanced use cases, and it’s why training on chatbot use is so essential to encourage safe, correct usage and to realise the efficiency gains available to businesses. Lazy, untrained chatbot use has real pitfalls, including the potential for data breaches or other reputational damage to the brand.
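To illustrate what that skill looks like in practice, compare a lazy prompt with a more structured one for the budgeting use case. Both prompts, and the follow-up challenge, are invented examples rather than a prescription:

```python
# Two invented prompts for the budgeting use case, showing how
# much more a structured prompt gives the model to work with.
lazy_prompt = "Help me budget."

refined_prompt = """Act as a cautious personal-finance assistant.
My take-home pay is £2,100/month; fixed costs are £1,350.
Draft a monthly budget with savings targets, list every
assumption you make, and flag anything I should verify
with a qualified financial adviser."""

# Challenging the first response is part of the same skill:
challenge = """You assumed a 5% savings interest rate; justify
that figure or revise the plan with a rate you can support."""
```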

The YouGov data also tells us that younger generations are much more likely to use chatbots regularly, with 53% of 18-24 year olds using them at least weekly. Across all age groups only about 18% of the population uses chatbots at this frequency, but their opinions about good chatbot use are obviously significant, since they are the people making most use of the technology. Comparing their net approval ratings by use case against the expert view gives the following differences:

Frequent users have correctly identified the best use cases and very accurately reflect the expert view. But they do appear to systematically over-estimate the ability of chatbots to perform reliably and ethically across a wide range of other use cases. Only using a chatbot as a romantic partner receives a significantly negative net approval rating from this group (-48), whilst some of the most problematic use cases, such as getting medical advice or using one as a mental health therapist, receive fairly modest negative scores (-16 and -12 respectively). This suggests a sizeable proportion of the group sees a chatbot as offering good responses for these purposes. The accessibility of a chatbot for answering personal questions privately and discreetly is a real problem when those asking might be much better advised to consult medical professionals, mental health support or trained wellbeing services.

The answers also point to over-confidence in, and over-reliance on, the responses in many use cases where more careful consideration and checking is warranted. It is a real danger of chatbots that they answer every question confidently and plausibly; it’s easy to be lulled into concluding that they are always right and have answered a question as well as it can be answered. Our experience is that for many of these use cases a chatbot does add value, but you need to work with it: ask additional questions, challenge it and collaborate with it to get the best outputs. Lazy prompting, and assuming an answer is correct, can lead to poor outcomes.

The results of the YouGov survey are both interesting and important. They suggest a real need, in industry and in the public domain alike, for people to be trained in how to use chatbots and get the most out of them. There are plenty of training materials publicly available, but the technology is so easily accessible that many people just open a browser, navigate to the website and dive in. Whilst learning through experience is positive, it’s important to remain curious, challenging, healthily sceptical and switched on to what these tools actually are, and not to be taken in by the sheer plausibility of their responses by assuming a reliability and accuracy across the range of use cases above which is unwarranted.

The power of this technology is already huge and the models are only going to get better, probably at quite a rapid rate. But we need to work with them as partners, directing them to perform roles and tasks, not just delegating tasks to them and taking whatever output they offer from an initial prompt. Ask a lazy question and you’ll probably get a lazy answer; AI is only as sharp as the prompt which wields it.