WATER TODAY - News, Reports, Interviews, Boil Water Advisory Maps



July 4, 2025


THE COST OF REASONING

The more accurate the AI model, the bigger its carbon footprint, according to new research out of Munich University of Applied Sciences

“The environmental impact of questioning trained LLMs is strongly determined by their reasoning approach, with explicit reasoning processes significantly driving up energy consumption and carbon emissions. We found that reasoning-enabled models produced up to 50 times more CO₂ emissions than concise response models.” -- first author Maximilian Dauner, a researcher at Hochschule München University of Applied Sciences

No matter which questions we ask an AI, the model will come up with an answer. To produce this information, regardless of whether that answer is correct or not, the model uses tokens. Tokens are words or parts of words that are converted into a string of numbers that can be processed by the LLM.

This conversion, as well as other computing processes, produces CO₂ emissions. Many users, however, are unaware of the substantial carbon footprint associated with these technologies. The researchers in Germany measured and compared the CO₂ emissions of different, already trained LLMs using a set of standardized questions.

WATERTODAY learned more about “thinking AI” from Maximilian Dauner

By Suzanne Forcese

THE RESEARCH

In our latest paper, “Energy costs of communicating with AI” (Frontiers in Communication, June 2025) Professor Gudrun Socher and I investigate the energy consumption of large language models (LLMs) during inference—specifically, when a user interacts with an already trained model. Our goal: to better understand the trade-offs between model performance and environmental sustainability.

We tested 14 LLMs (ranging from 7 to 72 billion parameters) on 500 multiple-choice questions from the MMLU benchmark.
All experiments were carried out on a local NVIDIA A100 GPU (80 GB), which enabled precise measurement of energy use, memory consumption, and response time during evaluation.
Emissions were converted to CO₂ equivalents using a standard factor of 480 g CO₂/kWh.
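
As a rough illustration of the conversion described above, the sketch below multiplies measured energy by the article's stated factor of 480 g CO₂/kWh. The function name and the example energy value are hypothetical and are not taken from the study; only the emission factor comes from the text.

```python
# Emission factor stated in the article (grams of CO2 equivalent per kWh).
EMISSION_FACTOR_G_PER_KWH = 480.0


def energy_to_co2e_grams(energy_kwh: float,
                         factor_g_per_kwh: float = EMISSION_FACTOR_G_PER_KWH) -> float:
    """Convert measured energy use (kWh) to grams of CO2 equivalent."""
    if energy_kwh < 0:
        raise ValueError("energy_kwh must be non-negative")
    return energy_kwh * factor_g_per_kwh


# Hypothetical measurement, chosen only to show the arithmetic.
example_energy_kwh = 1.5
example_emissions_g = energy_to_co2e_grams(example_energy_kwh)
print(f"{example_energy_kwh} kWh -> {example_emissions_g} g CO2e")
```

Because the factor depends on the local energy grid mix, a different region would pass its own value as `factor_g_per_kwh`.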

KEY FINDINGS

Reasoning models, on average, created 543.5 ‘thinking’ tokens per question, whereas concise models required just 37.7 tokens per question. Thinking tokens are additional tokens that reasoning LLMs generate before producing an answer. A higher token footprint always means higher CO₂ emissions. It doesn’t, however, necessarily mean the resulting answers are more correct, as elaborate detail is not always essential for correctness.

The most accurate model was the reasoning-enabled Cogito model with 70 billion parameters, reaching 84.9% accuracy. The model produced three times more CO₂ emissions than similarly sized models that generated concise answers.

“Currently, we see a clear accuracy-sustainability trade-off inherent in LLM technologies,” said Dauner. “None of the models that kept emissions below 500 grams of CO₂ equivalent achieved higher than 80% accuracy on answering the 1,000 questions correctly.” (CO₂ equivalent is the unit used to measure the climate impact of various greenhouse gases.)

Subject matter also resulted in significantly different levels of CO₂ emissions. Questions that required lengthy reasoning processes, for example abstract algebra or philosophy, led to up to six times higher emissions than more straightforward subjects, like high school history.

PRACTISING THOUGHTFUL USE

“Users can significantly reduce emissions by prompting AI to generate concise answers or limiting the use of high-capacity models to tasks that genuinely require that power,” Dauner pointed out.

Choice of model, for instance, can make a significant difference in CO₂ emissions. For example, having DeepSeek R1 (70 billion parameters) answer 600,000 questions would create CO₂ emissions equal to a round-trip flight from London to New York. Meanwhile, Qwen 2.5 (72 billion parameters) can answer more than three times as many questions (about 1.9 million) with similar accuracy rates while generating the same emissions.
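
The comparison above can be checked with simple arithmetic. The sketch below assumes, as the article states, that both models produce the same total emissions for their respective question counts; the per-question ratio then follows from the counts alone, without needing the flight's emissions figure. The function and variable names are hypothetical.

```python
# Question counts from the article for the same total emissions budget.
DEEPSEEK_R1_QUESTIONS = 600_000
QWEN_2_5_QUESTIONS = 1_900_000  # "about 1.9 million"


def per_question_emission_ratio(questions_a: int, questions_b: int) -> float:
    """Return how many times more CO2 model A emits per question than model B,
    assuming both emit the same total while answering their question counts."""
    if questions_a <= 0 or questions_b <= 0:
        raise ValueError("question counts must be positive")
    # Same total E: (E / questions_a) / (E / questions_b) = questions_b / questions_a
    return questions_b / questions_a


ratio = per_question_emission_ratio(DEEPSEEK_R1_QUESTIONS, QWEN_2_5_QUESTIONS)
print(f"DeepSeek R1 emits about {ratio:.2f}x more CO2 per question than Qwen 2.5")
```

The result is consistent with the "more than three times as many questions" figure quoted in the text.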

The researchers said that their results may be impacted by the choice of hardware used in the study, an emission factor that may vary regionally depending on local energy grid mixes, and the examined models. These factors may limit the generalizability of the results.

“If users know the exact CO₂ cost of their AI-generated outputs, such as casually turning themselves into an action figure, they might be more selective and thoughtful about when and how they use these technologies,” Dauner concluded.

