Add Self-Monitoring to your Raspberry Pi


Description

The Raspberry Pi is awesome, but every once in a while mine seems to lose its Wi-Fi connection and stops performing the way I need it to. And once it loses its connection, you can’t just SSH or VNC into it to reboot it because, well, there’s no network connection. So if you’re running headless (without a monitor, keyboard, and mouse), you just have to unplug it and plug it back in to reboot it. Obviously that’s not the safest solution and can result in a corrupted SD card. So I have added some simple self-monitoring to my Raspberry Pi servers to help eliminate this problem.

In this project, we’ll create a simple Python program that reads a file (ip.txt) to get the current counters, pings the router, and writes the results back to the file for future use. If too many consecutive ping failures occur, the program will reboot the Pi. If too many consecutive reboots happen without restoring the network connection, it will shut the Pi down.

Then we’ll modify the healthcheck route in our server and our health.html file to show us this extra data.

Finally, we’ll set up a cron job to run this program on a schedule.

This project builds on the Basic Raspberry Pi Web Server project.

Parameters

This is some information about this project and the conditions under which it was done. If you try to replicate it in the future and it doesn’t work, you can evaluate these parameters to see if any changes between these and your configuration might have impacted your results. An example might be changes to future versions of Python or Flask.

  • Date: July 22, 2021
  • Skill: Beginner
  • Raspberry Pi Model(s): Zero-W, 3B+, 4B
  • OS: Raspberry Pi OS version 10 (Buster)
  • Python Version: 3.7.3
  • Flask Version: 1.0.2

Steps

  1. Let’s start by creating a new file in Thonny, Geany, or your favorite Python editor.
  1. Enter the following lines of code:
#!/usr/bin/env python3

import os
import time
from datetime import datetime

IP_FILE = '/home/pi/ip.txt'
REBOOT_FILE = '/home/pi/reboot.txt'

failCnt = 0
rebootCnt = 0
curDate = datetime.now().strftime('%m/%d/%y %I:%M:%S %p')

if not os.path.isfile(IP_FILE):
    failCnt = 0
else:
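    try:
        # Read the saved counters; "with" closes the file even on error
        with open(IP_FILE, 'r') as f:
            fdata = f.readlines()
        failCnt = int(fdata[0].strip())
        rebootCnt = int(fdata[1].strip())
    except (OSError, IndexError, ValueError):
        # File unreadable or malformed: start over with zeroed counters
        failCnt = 0
        rebootCnt = 0
        print('Could not read ip.txt. Resetting counters...')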

print('%s: Starting failCnt = %d / Starting rebootCnt = %d\n' % (curDate, failCnt, rebootCnt))

ret = os.system('ping -c 1 -W 10 192.168.2.1')
print('ret = ' + str(ret))
if (ret != 0):
    failCnt +=1
else:
    failCnt = 0
    rebootCnt = 0

print('Ending failCnt = %d' % failCnt)

if (failCnt >3):
    rebootCnt += 1
    f = open(IP_FILE, 'w+')
    f.write('%d\n%d\n%s' % (failCnt, rebootCnt, curDate))
    f.close()
    if (rebootCnt < 4):
        print('Rebooting...')

        f = open(REBOOT_FILE, 'a+')
        f.write('%s: Rebooting.  Fail Cnt: %d\n' % (curDate, failCnt))
        f.close()

        print('Unable to restore connection after ' + str(failCnt) + ' tries.\nRebooting in 30 seconds...')
        time.sleep(30)
        os.system('sudo reboot')
    else:
        print('Rebooted too many consecutive times (%d).  Shutting down...' % rebootCnt)
        f = open(REBOOT_FILE, 'a+')
        f.write('%s: Shutting down.  Fail Cnt: %d\n' % (curDate, failCnt))
        f.close()

        print('Unable to restore connection after ' + str(rebootCnt) + ' reboots.\nShutting down in 30 seconds...')
        time.sleep(30)
        os.system('sudo shutdown -h now')
else:
    f = open(IP_FILE, 'w+')
    f.write('%d\n%d\n%s' % (failCnt, rebootCnt, curDate))
    f.close()

Note that you will need to replace the IP address in the ping command above (192.168.2.1) with the IP address of your router. If you’re not sure what it is, you can enter the following command in the Terminal and it should give you what you need:

traceroute google.com

Normally your router will be the first IP address listed.
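
Another way to find the router is to read the default route. The sketch below shows one possible approach from Python. It assumes the `ip` tool (part of iproute2) is installed on your Pi, and the function names are made up for illustration; they are not part of this project.

```python
import re
import subprocess


def parse_default_gateway(route_output):
    """Return the gateway IP found in `ip route` output, or None if there is none."""
    for line in route_output.splitlines():
        # A default route line looks like: "default via 192.168.2.1 dev wlan0 ..."
        match = re.match(r'default via (\d{1,3}(?:\.\d{1,3}){3})\b', line.strip())
        if match:
            return match.group(1)
    return None


def find_default_gateway():
    """Run `ip route` and parse its output (assumes the `ip` command exists)."""
    result = subprocess.run(['ip', 'route'], stdout=subprocess.PIPE,
                            universal_newlines=True)
    return parse_default_gateway(result.stdout)
```

On the Pi, calling find_default_gateway() should return the address to use in the ping command, as long as a default route is configured.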

For this project, we will save the file in the same directory as your web server (/home/pi/sample), but you can save it anywhere you choose. Let’s name it ping_check.py.

  1. Manually run the program to make sure it works. Open a Terminal and enter the following commands:
cd /home/pi/sample
sudo python3 ping_check.py

The first time you run it, it will create an ip.txt file in the /home/pi directory. The file has three lines. The first is the failCnt value, which is how many consecutive ping failures have occurred. The second is the rebootCnt value, which is how many consecutive reboots there have been without a successful ping. The third is the date of execution, which is only there for debugging, to confirm the program is running as expected. In our version, we allow 3 ping failures before rebooting and 3 reboots before shutting down.

When a reboot or shutdown occurs, we write a new line to the reboot.txt file, which is also written to the /home/pi directory. It serves as a history log file.
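
If you want to reuse the two file layouts described above in other scripts, they can be wrapped in a few small helpers. This is only a minimal sketch: the helper names (read_state, write_state, append_history) are hypothetical and do not appear in ping_check.py, and the fallback to zeroed counters is an assumption about how you would want a missing or damaged file handled.

```python
def read_state(path):
    """Return (fail_cnt, reboot_cnt) from the state file, or (0, 0) if missing or malformed."""
    try:
        with open(path, 'r') as f:
            lines = f.readlines()
        return int(lines[0].strip()), int(lines[1].strip())
    except (OSError, IndexError, ValueError):
        return 0, 0


def write_state(path, fail_cnt, reboot_cnt, stamp):
    """Write the three-line layout: fail count, reboot count, timestamp."""
    with open(path, 'w') as f:
        f.write('%d\n%d\n%s' % (fail_cnt, reboot_cnt, stamp))


def append_history(path, stamp, message):
    """Append one line to the history log, creating the file if needed."""
    with open(path, 'a') as f:
        f.write('%s: %s\n' % (stamp, message))
```

For example, read_state('/home/pi/ip.txt') returns the two counters as integers, and returns (0, 0) before the first run.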

  1. Now let’s set up a cron job to run our ping_check.py program on a schedule. Open up a Terminal window and type in the following:
sudo crontab -e

The first time you edit crontab it will ask which editor you want to use to modify the crontab file. I recommend nano, but you can pick another if you have a preference. Arrow down to the bottom of the file and enter this:

*/5 * * * * sudo python3 /home/pi/sample/ping_check.py

Press Ctrl+X to exit, press Y to save the changes, and press Enter to confirm the file name.

This program will now run every 5 minutes to check for a network connection and will update ip.txt.
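
To get a feel for how the thresholds play out on this schedule, here is a small simulation of the counter logic. It is a sketch, not the program itself: next_step and simulate_outage are names made up for illustration, and it assumes the network never comes back and that each reboot finishes before the next scheduled check.

```python
MAX_FAILS = 3      # consecutive ping failures tolerated before a reboot
MAX_REBOOTS = 3    # consecutive reboots tolerated before a shutdown


def next_step(fail_cnt, reboot_cnt, ping_ok):
    """Return (fail_cnt, reboot_cnt, action) after one scheduled check.

    action is 'none', 'reboot', or 'shutdown'.
    """
    if ping_ok:
        # A successful ping clears both counters
        return 0, 0, 'none'
    fail_cnt += 1
    if fail_cnt <= MAX_FAILS:
        return fail_cnt, reboot_cnt, 'none'
    reboot_cnt += 1
    action = 'reboot' if reboot_cnt <= MAX_REBOOTS else 'shutdown'
    return fail_cnt, reboot_cnt, action


def simulate_outage(checks):
    """Replay up to `checks` consecutive failed pings and list the action at each one."""
    fail_cnt, reboot_cnt, actions = 0, 0, []
    for _ in range(checks):
        fail_cnt, reboot_cnt, action = next_step(fail_cnt, reboot_cnt, False)
        actions.append(action)
        if action == 'shutdown':
            break
    return actions


print(simulate_outage(10))
# → ['none', 'none', 'none', 'reboot', 'reboot', 'reboot', 'shutdown']
```

Because failCnt is not reset by a reboot, every failed check after the first reboot triggers another one. With checks 5 minutes apart, the first reboot comes on the fourth failed check and the shutdown on the seventh, roughly 30 minutes after the first failure.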

  1. Now let’s load this information in our healthcheck route in our server so we can pass it to health.html. Open sample.py and make the following changes:
#!/usr/bin/env python3

from flask import Flask, render_template
from datetime import datetime
import io
import os

START_DATE = datetime.now().strftime('%m/%d/%y %I:%M:%S %p')

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/healthcheck')
def healthcheck():
    curDateTime = datetime.now().strftime('%m/%d/%y %-I:%M:%S %p')

    # Get ip.txt contents ("with" closes the file even if reading fails)
    try:
        with open('/home/pi/ip.txt', 'r') as f:
            fdata = f.readlines()
        fcnt = 'Fail Cnt: %s' % (fdata[0].strip())
        rcnt = 'Reboot Cnt: %s' % (fdata[1].strip())
    except (OSError, IndexError):
        fcnt = 'Fail Cnt: No Data'
        rcnt = 'Reboot Cnt: No Data'

    # Get reboot.txt contents (the full history log as one string)
    try:
        with open('/home/pi/reboot.txt', 'r') as f:
            rhist = f.read()
    except OSError:
        rhist = 'None'

    # Report available disk space
    stat = os.statvfs('/home/pi')
    gbFree = '%0.2f GB' % (stat.f_bfree*stat.f_bsize/1024/1024/1024)

    # Report CPU Temp (the file holds the temperature in thousandths of a degree C)
    try:
        with open('/sys/class/thermal/thermal_zone0/temp') as tFile:
            temp = float(tFile.read())
        tempC = '%0.1f C' % (temp/1000)
        tempF = '%0.1f F' % ((temp/1000) * 1.8 + 32)
    except (OSError, ValueError):
        tempC = 'ERR'
        tempF = 'ERR'

    return render_template('health.html', curDate=curDateTime, startDate=START_DATE, gbFree=gbFree, tempC=tempC, tempF=tempF, fcnt=fcnt, rcnt=rcnt, rhist=rhist)

@app.route('/reboot')
def reboot():
    ret = os.system('sudo reboot')
    return 'Rebooting'

@app.route('/shutdown')
def shutdown():
    ret = os.system('sudo shutdown -h now')
    return 'Shutting down'

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)

Notice that we are also passing three new parameters (fcnt, rcnt, and rhist) to our health.html file.

  1. Now let’s modify our health.html file to display these new parameters. Open health.html and make the following changes:
<!DOCTYPE html>
<html lang='en'>
<head>
  <meta charset='utf-8' />
  <meta name='viewport' content='width=device-width, initial-scale=1'>
  <title>Healthcheck</title>

  <style>
    body {
      font-size: 28px;
      background-color: lightblue;
    }
    table, th, td {
      border: 1px solid black;
      border-collapse: collapse;
      padding: 15px;
    }
    th {
      text-align: right;
      background-color: skyblue;
    }
    td {
      background-color: white;
    }
  </style>

</head>

<body>
  <h2>Health Check</h2>
  
  <table>
    <tr>
      <th>Health Status as of:</th>
      <td><b>{{curDate}}</b></td>
    </tr>
    <tr>
      <th>Running Since:</th>
      <td><b>{{startDate}}</b></td>
    </tr>
    <tr>
      <th>Available Disk Space:</th>
      <td><b>{{gbFree}}</b></td>
    </tr>
    <tr>
      <th>CPU Temperature:</th>
      <td><b>{{tempF}} ({{tempC}})</b></td>
    </tr>
  </table>
  <p>{{fcnt}}</p>
  <p>{{rcnt}}</p>
  <p>Reboot History:</p>
  <pre>{{rhist}}</pre>
  <p><a href='/'>Index</a></p>
</body>

</html>

Save the changes and test from your browser. You should see the new data appear in your healthcheck. This will allow you to keep an eye on how often your server is having to reboot.

Summary

We now have a server that is a lot more reliable, and we can keep an eye on how it is doing with the help of our healthcheck route. In a future project I’ll show you how to add the ability to email, which you can use to proactively send a notification after each reboot, once a network connection has been reestablished. That way you don’t even have to look at healthcheck to know a reboot has occurred (as long as the Pi can connect to the network before it shuts down).

Learn More

Raspberry Pi / Raspberry Pi OS

Python / HTML / CSS

Flask
