OS/Comm Failures

ejensen
Posts: 6
Joined: 17 Aug 2022, 23:31
Answers: 0

Post by ejensen »

I've got dozens of RevPis deployed, and I'm having issues with devices becoming unreachable on the network. This has happened a handful of times in the past couple of years, but is happening with increased frequency in the last couple of weeks. We did recently run 'apt update' and 'apt dist-upgrade' on some of these devices, but the issue has also been seen on devices that haven't been updated recently. When this occurs, I'm unable to ping the devices on either ethernet interface. The LEDs remain in the normal state. Cycling power resolves the issue. Looking through log files, it appears the OS stops functioning (e.g. cron jobs don't run as scheduled). In one case, I found an entry in kern.log (pasted below), but in most cases there is nothing of note. It has happened to several devices installed across multiple networks and locations. I have Cores, Connects, and Compacts deployed, but this issue is occurring on Compacts running Buster. Any advice or recommended troubleshooting steps are appreciated.

Aug 12 09:35:25 revpi kernel: [244060.084807] ------------[ cut here ]------------
Aug 12 09:35:25 revpi kernel: [244060.084819] WARNING: CPU: 0 PID: 0 at kernel/sched/core.c:2498 set_task_cpu+0x23c/0x31c
Aug 12 09:35:25 revpi kernel: [244060.084843] Modules linked in: sha256_generic cfg80211 rfkill 8021q garp stp llc raspberrypi_hwmon snd_bcm2835(C) snd_pcm snd_timer snd bcm2835_isp(C) bcm2835_codec(C) v4l2_mem2mem bcm2835_v4l2(C) videobuf2_dma_contig bcm2835_mmal_vchiq(C) vc_sm_cma(C) videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc uio_pdrv_genirq uio spidev ks8851_spi ks8851_common eeprom_93cx6 piControl(O) ad5446 ti_dac082s085 mcp320x iio_mux mux_gpio mux_core fixed gpio_74x164 spi_bcm2835aux spi_bcm2835 gpio_max3191x crc8 industrialio i2c_dev ip_tables x_tables ipv6
Aug 12 09:35:25 revpi kernel: [244060.084975] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G         C O      5.10.103-rt62-v7 #1
Aug 12 09:35:25 revpi kernel: [244060.084982] Hardware name: BCM2835
Aug 12 09:35:25 revpi kernel: [244060.084986] Backtrace:
Aug 12 09:35:25 revpi kernel: [244060.084991] [<80a55220>] (dump_backtrace) from [<80a555ac>] (show_stack+0x20/0x24)
Aug 12 09:35:25 revpi kernel: [244060.085008]  r7:80feace4 r6:00000000 r5:60000193 r4:80feace4
Aug 12 09:35:25 revpi kernel: [244060.085011] [<80a5558c>] (show_stack) from [<80a59918>] (dump_stack+0xbc/0xe8)
Aug 12 09:35:25 revpi kernel: [244060.085024] [<80a5985c>] (dump_stack) from [<80120010>] (__warn+0xfc/0x114)
Aug 12 09:35:25 revpi kernel: [244060.085041]  r9:00000009 r8:801562d0 r7:000009c2 r6:00000009 r5:801562d0 r4:80d0f298
Aug 12 09:35:25 revpi kernel: [244060.085044] [<8011ff14>] (__warn) from [<80a55c0c>] (warn_slowpath_fmt+0x70/0xc0)
Aug 12 09:35:25 revpi kernel: [244060.085057]  r7:000009c2 r6:80d0f298 r5:80f07808 r4:00000000
Aug 12 09:35:25 revpi kernel: [244060.085060] [<80a55ba0>] (warn_slowpath_fmt) from [<801562d0>] (set_task_cpu+0x23c/0x31c)
Aug 12 09:35:25 revpi kernel: [244060.085074]  r9:00000000 r8:b6b57180 r7:80f01d2c r6:80f07858 r5:00000001 r4:8178e000
Aug 12 09:35:25 revpi kernel: [244060.085077] [<80156094>] (set_task_cpu) from [<8016da90>]
Aug 12 18:47:19 revpi kernel: [    0.000000] Booting Linux on physical CPU 0x0
Aug 12 18:47:19 revpi kernel: [    0.000000] Linux version 5.10.103-rt62-v7 (support@kunbus.com) (arm-linux-gnueabihf-gcc (Debian 8.3.0-2) 8.3.0, GNU ld (GNU Binutils for Debian) 2.31.1) #1 SMP PREEMPT_RT Tue, 24 May 2022 11:41:05 +0000
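The excerpt shows the last logged kernel message at 09:35:25 and the next boot at 18:47:19, so the device was silent for roughly nine hours before the power cycle. A minimal Python sketch for pulling such silent gaps and WARNING lines out of a kern.log across many devices might look like this. The function name `scan_kern_log`, the `min_gap` threshold, and the `year` argument are illustrative assumptions (classic syslog timestamps carry no year), and the pattern only targets the line format seen in the excerpt.

```python
import re
from datetime import datetime, timedelta

# Month abbreviations as they appear in classic syslog timestamps.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches lines such as:
# "Aug 12 09:35:25 revpi kernel: [244060.084807] message"
LINE_RE = re.compile(
    r"^(?P<mon>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*\d+\.\d+\]\s?(?P<msg>.*)$"
)


def scan_kern_log(lines, year, min_gap=timedelta(minutes=5)):
    """Return (warnings, gaps) found in an iterable of kern.log lines.

    warnings: list of (timestamp, message) for kernel WARNING lines.
    gaps: list of (last_entry_before_boot, boot_time, gap) where the log
          was silent for at least min_gap before a "Booting Linux" line.
    Assumption: all lines belong to the given year (no rollover handling).
    """
    warnings, gaps = [], []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None or match["mon"] not in MONTHS:
            continue  # not a kernel line in the expected format
        when = datetime(year, MONTHS[match["mon"]], int(match["day"]),
                        int(match["h"]), int(match["m"]), int(match["s"]))
        msg = match["msg"]
        if msg.startswith("WARNING:"):
            warnings.append((when, msg))
        if msg.startswith("Booting Linux") and previous is not None:
            gap = when - previous
            if gap >= min_gap:
                gaps.append((previous, when, gap))
        previous = when
    return warnings, gaps
```

For example, `scan_kern_log(open("/var/log/kern.log"), year=2022)` would return the warnings and silent periods for one device; a gap tells you when logging stopped, not why.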
nicolaiB
KUNBUS
Posts: 871
Joined: 21 Jun 2018, 10:33
Answers: 7
Location: Berlin

Re: OS/Comm Failures

Post by nicolaiB »

Hi,

I'm sorry to hear that you have trouble with your systems. In order to investigate the cause, please add the following details / answer the questions:
  • What workload / programs do you run on the compacts?
  • Which image release do you use (cat /etc/revpi/image-release)?
  • Did you try to connect a monitor and keyboard? Is the system still responsive after network is unavailable? Are there any messages visible?
  • Could you test with a serial console (over RS485)?
  • Are all compacts affected or only some?
  • Is the dwc2 driver used instead of dwc_otg (ps axww | grep -i dwc)?
  • Could you please share a SOS Report?
  • Is it possible to test our latest Buster image on one device?
Also we just published an updated kernel a few days ago (5.10.120), which includes a patch regarding the power-down detection of the network controller (announcement post will follow shortly). Could you please check if the error still occurs with the latest kernel?
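With many devices in the field, the checklist items that can be answered remotely (image release, running kernel, USB driver) could be gathered by a small script. The following is only a sketch: the helper names are hypothetical, the image-release path is the one quoted above, and it assumes the driver name shows up in the `ps axww` output, which is what the suggested grep relies on.

```python
import platform
import subprocess
from pathlib import Path

# Path quoted in the checklist above.
IMAGE_RELEASE = Path("/etc/revpi/image-release")


def detect_usb_driver(ps_output):
    """Guess the USB controller driver from `ps axww` output.

    Assumption: the driver name appears in a kernel thread name.
    Returns "dwc_otg", "dwc2", or None if neither is found.
    """
    text = ps_output.lower()
    if "dwc_otg" in text:
        return "dwc_otg"
    if "dwc2" in text:
        return "dwc2"
    return None


def read_image_release(path=IMAGE_RELEASE):
    """Return the image release string, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def collect():
    """Collect the remotely answerable checklist items into a dict."""
    try:
        ps_output = subprocess.run(
            ["ps", "axww"], capture_output=True, text=True, check=False
        ).stdout
    except OSError:
        ps_output = ""  # ps not available on this system
    return {
        "image_release": read_image_release(),
        "kernel": platform.release(),
        "usb_driver": detect_usb_driver(ps_output),
    }
```

Calling `collect()` on each device (for instance from an existing deployment script) gives one dict per device that can be compared between failing and healthy units.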

-Nicolai
ejensen
Posts: 6
Joined: 17 Aug 2022, 23:31
Answers: 0

Re: OS/Comm Failures

Post by ejensen »

  • Most of the deployed devices, and the ones I'm seeing issues with, are only running 2 Python applications and typically sit below 20% average CPU usage.
  • 2021-07-01-revpi-buster.img
  • I have not been able to do this because I don't have physical access to the devices.
  • See the answer above.
  • I'm not seeing the same devices fail repeatedly, and several have not failed at all so far. It's difficult to say at this point. I'm using deployment scripts to update and deploy our application, so the applications and dependencies should be the same across all devices.
  • irq/66-dwc2_hso
  • I generated a SOS report. Is there a way for me to send it directly to you?
  • Due to the issue mentioned above, it is difficult to reimage the devices. If the issue persists, and I can't solve it remotely, I will keep this in mind.
  • This could be done, but not easily. The devices are being used in a critical application, and I would have to make arrangements and schedule downtime.
nicolaiB
KUNBUS
Posts: 871
Joined: 21 Jun 2018, 10:33
Answers: 7
Location: Berlin

Re: OS/Comm Failures

Post by nicolaiB »

Please send the SOS report to support@kunbus.com with the subject "SUP-5193: SOS report".

Thanks!

- Nicolai
nicolaiB
KUNBUS
Posts: 871
Joined: 21 Jun 2018, 10:33
Answers: 7
Location: Berlin

Re: OS/Comm Failures

Post by nicolaiB »

We have received the SOS report and will have a look. In the meantime it would be great if you find a way to test with our latest image.

Thanks,

Nicolai
ejensen
Posts: 6
Joined: 17 Aug 2022, 23:31
Answers: 0

Re: OS/Comm Failures

Post by ejensen »

We had another device fail, and I emailed the SOS report from that device as well. I'm expecting to install the latest kernel in the next few days. Updating the image isn't really reasonable at this time due to the remote location and the number of devices.
zhan
Posts: 52
Joined: 16 Apr 2019, 13:31
Answers: 0

Re: OS/Comm Failures

Post by zhan »

Hi ejensen,

I tried kernel "1:9.20220524-5.10.103+revpi1" (the same one as on your device, judging by your kernel log) on my Compact, but could not reproduce the problem.
However, I put almost no load on it and performed few operations. I would like to ask:
1. Are any devices connected to the Compact through USB or other interfaces?
2. What operations have you performed on the device, and in particular what might have happened around the time the kernel oops occurred?

From the SOS report, I see the following anomalies:
1. The kern.log only starts on 14 August, but the kernel failure message was recorded on 12 August. Normally I would expect the kern.log to include the entries from before the failure.
2. The host name in the oops is "revpi", but in kern.log it is different, and the hosts file contains a repeated entry for 127.0.1.1, which looks abnormal to me.
I am only sharing these observations in case they help you identify possible triggers of the kernel failure.

BR
Simon
ejensen
Posts: 6
Joined: 17 Aug 2022, 23:31
Answers: 0

Re: OS/Comm Failures

Post by ejensen »

zhan:
Thank you for your response. We are running more than 100 of these devices, and have only seen the failure on 8-10, so reproducing is a challenge. Some of the failures have generated a kernel log and others haven't.
There are no USB devices or even IO connected to these Compacts.
The devices just run our python application, and no abnormal operations were performed near the time of failure.
I'm not sure if I understand you correctly with regard to the dates, but if you're saying the kern.log in the SOS doesn't match my initial post, I've sent 2 SOS reports, and I pulled them from the most recently failed devices hoping the relevant info would still be in the logs.
The hostnames are expected. Please see your PMs for more info.
ejensen
Posts: 6
Joined: 17 Aug 2022, 23:31
Answers: 0

Re: OS/Comm Failures

Post by ejensen »

We are still experiencing issues with devices hanging. The only new information I have is that we aren't seeing repeating failures. So far, I haven't seen the same controller fail more than once, but I'm seeing about 1 failure every 3 days. We have some performance monitoring of CPU/memory/temperature included in our python application, and we aren't seeing rises in any of these metrics prior to failure. All devices are now running the 5.10.120 kernel. As I mentioned, reimaging will cost a great deal of time and effort. Can you provide any specifics about the benefit of using the latest image, if we're already running the latest kernel?
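For readers wondering what this kind of in-application monitoring could look like, here is a minimal sketch, not our actual code. It samples load, available memory, and temperature from the standard Linux procfs/sysfs files and appends a timestamped record that is flushed to disk, so the last record before a hang survives the power cycle. The function names and the JSON-lines format are illustrative assumptions, and the thermal zone path is assumed to exist on the device.

```python
import json
import os
import time
from pathlib import Path


def read_metrics(loadavg="/proc/loadavg", meminfo="/proc/meminfo",
                 temp="/sys/class/thermal/thermal_zone0/temp"):
    """Sample 1-minute load, available memory (kB), and temperature (deg C)."""
    load1 = float(Path(loadavg).read_text().split()[0])
    mem_available_kb = None
    for line in Path(meminfo).read_text().splitlines():
        if line.startswith("MemAvailable:"):
            mem_available_kb = int(line.split()[1])
            break
    # The kernel reports the temperature in millidegrees Celsius.
    temp_c = int(Path(temp).read_text().strip()) / 1000.0
    return {"load1": load1, "mem_available_kb": mem_available_kb,
            "temp_c": temp_c}


def append_heartbeat(log_path, metrics, now=None):
    """Append one JSON record and force it to disk before returning."""
    record = {"ts": time.time() if now is None else now, **metrics}
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # make the record survive a hard power cycle
    return record


def last_heartbeat(log_path):
    """Return the most recent record in the heartbeat log, or None if empty."""
    last = None
    with open(log_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last
```

After a power cycle, `last_heartbeat(...)` gives the time and metrics of the final sample before the hang, which can be lined up with the last entries in kern.log and syslog. Frequent fsync calls add write load to the storage, so the sampling interval is a trade-off.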
Last edited by ejensen on 07 Sep 2022, 22:22, edited 1 time in total.
nicolaiB
KUNBUS
Posts: 871
Joined: 21 Jun 2018, 10:33
Answers: 7
Location: Berlin

Re: OS/Comm Failures

Post by nicolaiB »

I'm sorry to hear that the problem still exists, even with the latest kernel. Our problem is that we're not able to reproduce this kind of error in our lab. Would it be possible for you to send us one of the affected devices so we can investigate further?

-Nicolai