Hopefully spurious temperature / throttling warnings on a ThinkPad W550s




Hi,

My system (kernel / mcelog) frequently (numerous times a day) spits out
warnings such as these:

Feb 20 21:10:07 lila kernel: [ 7808.093821] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.093832] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1760)
Feb 20 21:10:07 lila kernel: [ 7808.093837] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093838] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093840] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093841] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2292)
Feb 20 21:10:07 lila kernel: [ 7808.093852] mce: [Hardware Error]: Machine check events logged
Feb 20 21:10:07 lila kernel: [ 7808.101828] CPU1: Core temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101829] CPU0: Core temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101830] CPU3: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101831] CPU2: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101832] CPU0: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101834] CPU1: Package temperature/speed normal
Feb 20 21:10:07 lila kernel: [ 7808.101834] mce: [Hardware Error]: Machine check events logged
Feb 20 21:10:07 lila mcelog: Processor 0 heated above trip temperature. Throttling enabled.
Feb 20 21:10:07 lila mcelog: Please check your system cooling. Performance will be impacted
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Processor 1 heated above trip temperature. Throttling enabled.
Feb 20 21:10:07 lila mcelog: Please check your system cooling. Performance will be impacted
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: warning: 16 bytes ignored in each record
Feb 20 21:10:07 lila mcelog: consider an update
Feb 20 21:10:07 lila mcelog: CPU 0 on socket 0 received unknown error
Feb 20 21:10:07 lila mcelog: Location: CPU 0 on socket 0
Feb 20 21:10:07 lila mcelog: CPU 1 on socket 0 received unknown error
Feb 20 21:10:07 lila mcelog: Location: CPU 1 on socket 0
Feb 20 21:10:07 lila mcelog: Processor 0 below trip temperature. Throttling disabled
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Too many trigger children running already
Feb 20 21:10:07 lila mcelog: Processor 1 below trip temperature. Throttling disabled
Feb 20 21:10:07 lila mcelog: Running trigger `unknown-error-trigger'
Feb 20 21:10:07 lila mcelog: Too many trigger children running already
Feb 20 21:10:07 lila mcelog: warning: 16 bytes ignored in each record
Feb 20 21:10:07 lila mcelog: consider an update

Sometimes I see just the kernel warnings, without the MCE stuff:

Feb 20 21:25:07 lila kernel: [ 8708.134901] CPU1: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134903] CPU2: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134904] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.134906] CPU3: Package temperature above threshold, cpu clock throttled (total events = 2556)
Feb 20 21:25:07 lila kernel: [ 8708.141929] CPU1: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141931] CPU2: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141932] CPU0: Package temperature/speed normal
Feb 20 21:25:07 lila kernel: [ 8708.141933] CPU3: Package temperature/speed normal

I suspect that these warnings are spurious, possibly a kernel bug. They
do not seem to correlate with actual load: they often occur when the
system is under no particular stress, and conversely, I can stress the
system without a whimper.

[E.g., "sysbench --num-threads=1 --test=cpu --cpu-max-prime=35000 run",
which revs the cpu frequencies to their maximum of 3 GHz and raises
their temperatures to ~71 C, or "sysbench --num-threads=4 --test=cpu
--cpu-max-prime=50000 run" which runs at only ~2.65 GHz, but raises
their temperatures to 80 C.]
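
One way to check the (lack of) correlation more systematically would be to sample the kernel's throttle counters alongside the temperatures and the load average. The following is only a rough sketch: it assumes the usual sysfs layout (`thermal_throttle/*_throttle_count` under each CPU directory, and `thermal_zone*/temp` in millidegrees Celsius), which may differ by kernel version and hardware. The function names are made up for illustration.

```python
import glob
import os


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Return {(cpu, counter_name): count} for every throttle counter found.

    Assumes files such as cpu0/thermal_throttle/core_throttle_count exist;
    on systems without them the result is simply empty.
    """
    counts = {}
    pattern = os.path.join(
        cpu_root, "cpu[0-9]*", "thermal_throttle", "*_throttle_count"
    )
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]       # e.g. "cpu0"
        name = os.path.basename(path)      # e.g. "core_throttle_count"
        with open(path) as f:
            counts[(cpu, name)] = int(f.read().strip())
    return counts


def read_zone_temps(thermal_root="/sys/class/thermal"):
    """Return {zone: degrees_celsius}; the sysfs values are in millidegrees."""
    temps = {}
    pattern = os.path.join(thermal_root, "thermal_zone*", "temp")
    for path in sorted(glob.glob(pattern)):
        zone = path.split(os.sep)[-2]      # e.g. "thermal_zone0"
        try:
            with open(path) as f:
                temps[zone] = int(f.read().strip()) / 1000.0
        except (OSError, ValueError):
            continue                       # some zones refuse to be read
    return temps


def snapshot(cpu_root="/sys/devices/system/cpu",
             thermal_root="/sys/class/thermal"):
    """Collect one sample of load average, throttle counters and temperatures."""
    return {
        "load1": os.getloadavg()[0],
        "throttle": read_throttle_counts(cpu_root),
        "temps_c": read_zone_temps(thermal_root),
    }
```

Calling `snapshot()` every few seconds and noting when the counters increase, together with the load and temperatures at that moment, should show whether the events really happen on an idle, cool system.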

Moreover, the warnings always seem to come in pairs, with the
temperatures / speeds reported as returning to normal immediately (see
log timestamps).
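
To put a number on "immediately", the kernel timestamps can be paired up per CPU. Below is a minimal sketch that assumes the syslog line format shown above. The regular expression and the function name are my own, and the sample lines are copied from the first log excerpt.

```python
import re

# Matches the kernel lines quoted above, e.g.
# "... kernel: [ 7808.093821] CPU1: Core temperature above threshold, ..."
LINE_RE = re.compile(
    r"kernel: \[\s*(?P<ts>\d+\.\d+)\] CPU(?P<cpu>\d+): "
    r"(?P<kind>Core|Package) temperature"
    r"(?P<state> above threshold|/speed normal)"
)


def throttle_durations(lines):
    """Return (cpu, kind, seconds) for each 'above threshold' / 'normal' pair."""
    started = {}      # (cpu, kind) -> timestamp of the pending throttle event
    durations = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # mcelog lines and anything else are ignored
        key = (int(m.group("cpu")), m.group("kind"))
        ts = float(m.group("ts"))
        if m.group("state") == " above threshold":
            started[key] = ts
        elif key in started:
            durations.append((key[0], key[1], ts - started.pop(key)))
    return durations


SAMPLE = [
    "Feb 20 21:10:07 lila kernel: [ 7808.093821] CPU1: Core temperature above threshold, cpu clock throttled (total events = 1760)",
    "Feb 20 21:10:07 lila kernel: [ 7808.093837] CPU0: Package temperature above threshold, cpu clock throttled (total events = 2292)",
    "Feb 20 21:10:07 lila kernel: [ 7808.093852] mce: [Hardware Error]: Machine check events logged",
    "Feb 20 21:10:07 lila kernel: [ 7808.101828] CPU1: Core temperature/speed normal",
    "Feb 20 21:10:07 lila kernel: [ 7808.101832] CPU0: Package temperature/speed normal",
]

for cpu, kind, seconds in throttle_durations(SAMPLE):
    print(f"CPU{cpu} {kind}: throttled for {seconds * 1000:.1f} ms")
```

Feeding it the full kernel log instead of `SAMPLE` would show whether every event clears within a few milliseconds, as the excerpts above suggest.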

This is a ThinkPad W550s with a dual-core, hyperthreaded i7-5500U (base
speed 2.4 GHz, Turbo Boost to 3 GHz). It's a fairly new
(manufacturer refurbished) machine. I'm running mostly stable, with
some backports and the occasional bit from unstable, when installable
without ripping out half the basic stable installation. Recent kernels
have been self-built from vanilla sources, in the 4.7.x-4.9.10 range.

Searching the internet, reactions to similar problems fall into two
categories:

1) You're frying your system! Your fan or your thermal interface needs
to be cleaned / replaced immediately!

2) These are just spurious artifacts.

Some discussions - the best seems to be the first one:

https://bugzilla.redhat.com/show_bug.cgi?id=924570
https://bbs.archlinux.org/viewtopic.php?id=191347
https://www.centos.org/forums/viewtopic.php?t=24420

Any thoughts?

Celejar