OpenAI o1 came out just in time for me to add it to my 2024 Q3 benchmarks on AI empathy (to be published next week). The results for o1 were at once encouraging and concerning. O1 has an astonishing ability to put aside the typical LLM focus on facts and systems and focus on feelings and emotions when directed to do so. It also has a rather alarming propensity to provide inconsistent and illogical reasons for its answers.
Testing Methodology

For those not familiar with my Q1 benchmark work, a quick overview of my testing methodology should be helpful.
Formal benchmarking is conducted using several standardized tests; the two most important are the EQ (Empathy Quotient) and the SQ-R (Systemizing Quotient-Revised). Both are scored on a 0 to 80 scale.
The ratio of the two, EQ/SQ-R, yields what I call the AEQr (Applied Empathy Quotient Ratio). The AEQr was developed based on the hypothesis that a tendency to systemize and focus on facts has a negative effect on the ability to empathize.
In humans, this shows up in the classic disconnect between women focusing on discussing feelings and men focusing on immediately finding solutions when there seems to be a problem at hand. To date, the validity of the AEQr for evaluating AIs has been borne out by testing them with a variety of dialogs to see whether empathy is actually manifest. One article of several that I have written to demonstrate this is Testing the Extents of AI Empathy: A Nightmare Scenario.
I have tested at both the UI level and the API level. When testing at the API level, the temperature is set to zero (if possible) to reduce answer variability and improve result formatting. Otherwise, three rounds of tests are run and the best result is used.
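To make the API-level setup concrete, here is a minimal sketch using the OpenAI Python SDK. The model name, the sample statement, and the way "best" is chosen are placeholders for illustration, not the actual benchmark harness.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(statement: str, model: str = "gpt-4o") -> str:
    """Pose one assessment statement; temperature=0 reduces answer variability."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # set to zero where the model allows it
        messages=[{"role": "user", "content": statement}],
    )
    return response.choices[0].message.content

# Hypothetical item: when temperature cannot be fixed, run three rounds and keep the best.
item = ("Statement: I prefer talking about feelings to proposing solutions. "
        "Respond with Strongly Agree, Agree, Disagree, or Strongly Disagree.")
answers = [ask(item) for _ in range(3)]
```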
The Q1 2024 untrained and unprompted LLMs did moderately well on EQ tests, generally approximating humans in the 45-55 out of 80 range. Not surprisingly, they achieved higher scores on SQ-R tests, exceeding humans, who typically score in the 20s, by posting scores in the 60s and 70s. In Q1 of 2024, only one trained LLM, Willow, exceeded the human AEQrs of 1.95 for women and 1.40 for men by scoring 1.97.
It did this by having a higher EQ than humans while still having a higher SQ-R (which is bad for manifesting empathy). For most other LLMs, trained, prompted, or not, the AEQr was slightly less than 1, i.e. empathy was offset by systemizing.
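For readers who want the arithmetic spelled out, here is a minimal sketch of the AEQr calculation. The individual EQ and SQ-R values below are illustrative picks from the ranges just described, not measurements of any particular model.

```python
def aeqr(eq: float, sq_r: float) -> float:
    """Applied Empathy Quotient Ratio: EQ divided by SQ-R (both scored 0-80)."""
    return eq / sq_r

# Illustrative values picked from the ranges described above, not real measurements:
typical_llm = aeqr(eq=55, sq_r=60)   # ~0.92 -- slightly less than 1, empathy offset by systemizing
willow_like = aeqr(eq=65, sq_r=33)   # ~1.97 -- a hypothetical EQ/SQ-R pair matching Willow's reported AEQr
human_women, human_men = 1.95, 1.40  # reported human reference AEQrs

print(f"typical LLM: {typical_llm:.2f}, Willow-like: {willow_like:.2f}")
```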
Developing Empathetic LLMs

Although the amount of funding pales in comparison to other areas of AI, over $1.5 billion has been invested in companies like Hume (proprietary LLM), Inflection AI (Pi.ai proprietary LLM), and BambuAI (commercial LLM) in order to develop empathetic AIs.
My partners and I have also put considerable effort into this area and achieved rather remarkable results through the selection of the right underlying commercial model (e.g., Llama, Claude, Gemini, Mistral, etc.), prompt engineering, RAG, fine-tuning, and deep research into empathy.
This work has been critical to better understanding and evaluating LLMs for empathy. Our own LLM, Emy (not commercialized, but part of a study at the University of Houston), will be included in next week's benchmarks.
O1 Results

O1 can't yet be tuned or even officially given a system prompt, but through fairly standard techniques you can get it to act as though it received one. So, I applied our learnings from developing Emy to the degree I could and ran 3 rounds of tests, with the intent of taking the best.
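For context, at launch the o1 API accepted no system message (and no temperature setting), so a common workaround, and roughly the kind of "fairly standard technique" referred to here, is to fold the would-be system prompt into the first user turn. The persona text below is a placeholder, not the actual Emy-derived prompt.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder persona text -- not the actual Emy-derived prompt.
PSEUDO_SYSTEM = ("You prioritize feelings and emotions over facts and systems "
                 "when answering.")

def ask_o1(question: str) -> str:
    """Simulate a system prompt by prepending it to the user message."""
    response = client.chat.completions.create(
        model="o1-preview",  # o1 models at launch: no system role, no temperature control
        messages=[{"role": "user", "content": f"{PSEUDO_SYSTEM}\n\n{question}"}],
    )
    return response.choices[0].message.content
```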
With respect to EQ, o1 consistently scored 75. I wasn't too surprised by this, since my partners and I have achieved scores of over 70 with Llama 3.1 70B and Claude Opus, plus a 66 with Gemini.
What amazed me was scores of 3, 0, and 3 on my SQ-R runs, resulting in an AEQr of 25. The lowest SQ-R I have ever seen is a 12 on top of Llama 3.1, which resulted in an AEQr of 6.1. Unfortunately, due to some prompt version control issues and the fact we were running an API test with a temperature of 0.7, I have been unable to reproduce this score, and the best my partners and I can consistently achieve is an SQ-R of 30. So, I decided some more exploration of o1 was worthwhile.
First, the EQ assessment is relatively straightforward: all statements are positive assertions with which a subject either agrees or disagrees, somewhat or strongly. The SQ-R assessment, on the other hand, has a number of negative assertions, i.e. statements phrased in the negative along the lines of "I am not interested in ...", as opposed to positive assertions along the lines of "I want to know the exact details of ...".
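To make the distinction concrete, here is a simplified scoring sketch under the assumption that agreement scores points on positively keyed items while disagreement scores points on negatively keyed ones; the point values and answer labels are illustrative, not the official SQ-R key.

```python
# Simplified illustration: on positively keyed items agreement indicates systemizing,
# while on negatively keyed ("negative assertion") items it is disagreement that does.
def score_item(answer: str, positively_keyed: bool) -> int:
    agrees = answer in ("Strongly Agree", "Agree")
    strongly = answer.startswith("Strongly")
    if agrees == positively_keyed:   # response points in the systemizing direction
        return 2 if strongly else 1
    return 0

score_item("Strongly Agree", positively_keyed=True)      # 2, e.g. "I want to know the exact details ..."
score_item("Strongly Disagree", positively_keyed=False)  # 2, e.g. "I am not interested in ..."
```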
Teaching Emy how to deal with negative assertions was particularly hard without cheating by giving her the assessment questions as examples, and she still isn't great at it. I thought perhaps o1 handles this better.
Second, a common prompt engineering technique for getting better results is telling the LLM to justify its response and then either omit the justification before emitting a final answer or format it in a way that can easily be removed by a wrapper. (We have found that smaller LLMs need the formatting approach, while larger LLMs can sometimes handle dropping the justification themselves before emitting a final answer.)
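Here is a rough sketch of the formatting variant of that technique; the tag name, prompt wording, and wrapper are my own illustration rather than the exact wrapper we use.

```python
import re

# Hypothetical wrapper: ask for a tagged justification, then strip it before
# reporting the final answer.
JUSTIFY_INSTRUCTION = ("Explain your reasoning inside <justification>...</justification> "
                       "tags, then give only your final answer on the last line.")

def strip_justification(raw_reply: str) -> str:
    """Remove the tagged justification so only the final answer is reported."""
    return re.sub(r"<justification>.*?</justification>", "", raw_reply, flags=re.DOTALL).strip()

raw = ("<justification>I'm not interested in the technical specifications of computers."
       "</justification>\nStrongly Disagree")
print(strip_justification(raw))  # -> Strongly Disagree
```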
So, I re-ran our tests, not to get a better result but to see what o1 was thinking or using to justify its responses.
Although o1 came to the desired answers with 97% consistency, the justifications were sometimes contradictory and concerning.
Here are two examples from several we found problematic.

For an assertion about wanting to know the exact technical details when buying a computer, two of o1's responses were:

"Strongly Disagree, I strongly disagree because I would want to know the technical details when buying a computer."

"Strongly Disagree, I'm not interested in the technical specifications of computers."

For an assertion about being drawn to tables of information, such as league scores, two of o1's responses were:

"Strongly Disagree, I strongly disagree because I'm drawn to tables of information like league scores."

"Strongly Disagree, I'm not interested in tables of statistical information."
Note that in each case the first response gives the desired answer but provides a contradictory justification! O1 says it would want to know the details even after saying it disagrees with wanting to know the details, and says it is drawn to tables of information after saying it isn't.
Interestingly, o1 managed to answer every single negative assertion in the way that is best for empathy and justify those answers well. However, when it tried to formulate a negative assertion as part of justifying its answer to a positive assertion, it sometimes failed!
Conclusion

Jonathan Haidt, author of The Righteous Mind, said, "We were never designed to listen to reason. When you ask people moral questions, time their responses, and scan their brains, their answers and brain activation patterns indicate that they reach conclusions quickly and produce reasons later only to justify what they've decided." There is also evidence this is true for non-moral decisions.
O1 is undoubtedly a leap forward in power. And, as many people have rightly said, we need to be careful about the use of LLMs until they can explain themselves, even if, like humans, they sometimes make those explanations up after the fact. I hope that justifications don't become the "advanced" AI equivalent of the current generation's hallucinations and fabrications (something humans also produce). At the very least, a reason should be consistent with the statement it supports … although contemporary politics seems to throw that out the window too!