Recent research by Apple engineers highlights the weaknesses of generative AI models such as ChatGPT. Although these systems are often perceived as expert problem solvers, the tests show that they fail to demonstrate genuine logical reasoning on simple mathematical problems, which raises questions about their reliability and about how much they actually understand.
Apple’s research on the limits of generative AIs
A team of six Apple engineers examined the mathematical capabilities of large language models. The researchers tested the models on ordinary math problems of a kind that should, in principle, pose no difficulty for an AI. Although the models perform well in some situations, the results revealed a major concern: they cannot handle contextual details that fall outside the patterns they have learned.
Systematic testing and unexpected results
In the tests, the models initially performed well, correctly answering questions such as “Olivier picks 44 kiwis on Friday, 58 kiwis on Saturday, and on Sunday he picks twice as many as on Friday. How many kiwis did he collect?” However, when seemingly insignificant details were added to the statement, such as “5 of the kiwis were a bit smaller,” the models quickly faltered and made calculation errors.
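The arithmetic behind the kiwi example is short enough to spell out. The sketch below computes the expected answer and then shows one way the extra sentence can derail the calculation. The subtraction is an assumed failure mode chosen for illustration, not a transcript of any model's output, and the variable names are ours.

```python
# Quantities taken from the problem statement quoted above.
friday = 44
saturday = 58
sunday = 2 * friday  # "twice as many as on Friday"

# The size of a kiwi does not change how many were picked,
# so the "5 were a bit smaller" detail plays no role in the total.
correct_total = friday + saturday + sunday

# Assumed failure mode (illustrative only): treating the irrelevant
# detail as an instruction to subtract.
smaller_kiwis = 5
distracted_total = correct_total - smaller_kiwis

print(correct_total)     # 190
print(distracted_total)  # 185
```

The correct answer is the same with or without the added sentence; only a reader who treats every number as an operand arrives at a different total.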
A lack of logical understanding
What is particularly concerning about these results is that the models tend to treat the added information as a cue for a mathematical operation, even when it has no bearing on the question asked. They “read” the statement as a sequence of operations instead of understanding the context and the logical relationships between the elements of the problem. This illustrates how fragile reasoning is in generative AI, which appears to rely more on memorization than on genuine understanding.
The implications of these results
The research highlights a critical flaw in the very architecture of language models, which rest primarily on statistical learning rather than deep cognition. On these tests, the engineers observed accuracy drops of up to 17.5% for the best models and 65.7% for the weakest ones. Simple changes to the parameters of a problem, such as replacing a first name, also reduced the success rate, which calls into question the models' ability to adapt to varied situations.
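One way to picture this kind of perturbation testing is to turn a problem into a template, vary the name and the numbers, and optionally append an irrelevant sentence. The sketch below is a minimal illustration under our own assumptions, not the researchers' code: `TEMPLATE`, `NAMES`, `make_variant`, `accuracy`, and `distracted_solver` are all hypothetical names, and `distracted_solver` is a deliberately naive stand-in for a model, so the scores it prints say nothing about any real system.

```python
import random
import re

# Hypothetical template based on the kiwi problem quoted earlier.
TEMPLATE = (
    "{name} picks {fri} kiwis on Friday, {sat} kiwis on Saturday, "
    "and on Sunday he picks twice as many as on Friday.{distractor} "
    "How many kiwis did he collect?"
)
NAMES = ["Olivier", "Liam", "Noah", "Marco"]  # illustrative names only
DISTRACTOR = " 5 of the kiwis were a bit smaller."


def make_variant(rng, with_distractor=False):
    """Return (question, expected_answer) for one random variant."""
    fri = rng.randint(10, 99)
    sat = rng.randint(10, 99)
    question = TEMPLATE.format(
        name=rng.choice(NAMES),
        fri=fri,
        sat=sat,
        distractor=DISTRACTOR if with_distractor else "",
    )
    # The irrelevant sentence never changes the expected answer.
    return question, fri + sat + 2 * fri


def accuracy(solver, variants):
    """Fraction of variants for which solver(question) is correct."""
    correct = sum(1 for question, answer in variants if solver(question) == answer)
    return correct / len(variants)


def distracted_solver(question):
    """Toy stand-in for a model: subtracts any extra number it sees."""
    fri, sat, *extra = [int(n) for n in re.findall(r"\d+", question)]
    return fri + sat + 2 * fri - sum(extra)


rng = random.Random(0)  # fixed seed so the variants are reproducible
clean = [make_variant(rng) for _ in range(20)]
noisy = [make_variant(rng, with_distractor=True) for _ in range(20)]

print(accuracy(distracted_solver, clean))  # 1.0
print(accuracy(distracted_solver, noisy))  # 0.0
```

Comparing the score on the clean set with the score on the perturbed set gives the kind of accuracy drop the article reports. With a real model, `solver` would wrap a call to that model and parse its final answer.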
Conclusion of the study and perspectives
The results of this study challenge the generally optimistic view of artificial intelligence. The authors conclude that minor perturbations in mathematical problems expose a fundamental limitation in the models' ability to recognize which information is relevant and to evaluate it critically. These conclusions invite reflection on the future of artificial intelligence, particularly its integration into systems where logical reasoning and decision-making are essential.







