Additional monitoring for ETH2 staking nodes

EDIT: I don’t use this monitoring anymore. It falls down if the node crashes completely, e.g. due to a power failure. Instead, I use the Amazon CloudWatch agent per instructions here, plus some manually created CloudWatch alarms that push messages to PagerDuty. Altogether this costs $1–2 a month but is very reliable.

If you’re like me, you followed one of SomerEsat’s excellent guides for setting up a staking node on ETH2.0. I added a few extra things to my node that I thought would be useful to share.

Adding logs to the Grafana dashboard

Grafana dashboard with logs for every service

It’s annoying to have to ssh into your box and check the logs of each service separately. I wanted these service logs inline in the Grafana dashboard. To get them there, you need to set up a log ingestion stack. The one that integrates with Grafana is loki + promtail: promtail ships the logs and loki stores them.

First, download promtail and loki for your platform from the latest release on GitHub. Unzip them and move them to /usr/bin. I recommend changing the owner and permissions so that only the loki and promtail users can execute these binaries. Those users are created in the next step, so change the owner once they exist.

~> ls -la /usr/bin/promtail
-rwxr--r-- 1 promtail promtail 67968168 Oct 26 15:59 /usr/bin/promtail*
~> ls -la /usr/bin/loki
-rwxr--r-- 1 loki loki 55742464 Oct 26 15:56 /usr/bin/loki*

Then create a user for promtail and loki.

sudo useradd --no-create-home --shell /bin/false promtail
sudo useradd --no-create-home --shell /bin/false loki

The promtail user needs to be able to read from the system journal. To allow this, add it to the systemd-journal group:

sudo usermod -a -G systemd-journal promtail
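To double check that the group change took, you can look at the user’s groups. Here is a minimal sketch; the helper name user_in_group is something I made up for illustration, not part of any tool.

```shell
# Hypothetical helper: succeeds if user $1 belongs to group $2.
user_in_group() {
  local user="$1" group="$2"
  # id -nG prints the user's group names separated by spaces;
  # put one per line and look for an exact, literal match.
  id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$group"
}

# Example, commented out so the snippet does nothing on its own:
# user_in_group promtail systemd-journal && echo "promtail can read the journal"
```

If the check fails, rerun the usermod command above and check again.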

Create directories for their configuration.

sudo mkdir /etc/loki
sudo mkdir /etc/promtail
sudo touch /etc/loki/config.yml
sudo touch /etc/promtail/config.yml
sudo chown -R loki:loki /etc/loki
sudo chown -R promtail:promtail /etc/promtail

Edit the promtail configuration at /etc/promtail/config.yml

server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://localhost:3100/loki/api/v1/push
scrape_configs:
  - job_name: journal
    journal:
      json: false
      max_age: 12h
      path: /var/log/journal
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
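The relabel_configs entry is what produces the unit label that the dashboard queries later. As far as I understand promtail, each journal field is exposed as an internal label by lowercasing its name and prefixing it with __journal_, so _SYSTEMD_UNIT becomes __journal__systemd_unit. The small sketch below only mimics that naming rule (the function name is hypothetical) so you can work out the source label for other journal fields.

```shell
# Hypothetical helper that mimics promtail's naming rule for journal fields:
# lowercase the field name and add the __journal_ prefix.
journal_label_for() {
  printf '__journal_%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

# To see which fields a unit's journal entries carry, you could run:
#   journalctl -u ssh.service -o json -n 1
```

For example, journal_label_for _HOSTNAME prints __journal__hostname, which you could map to a host label with another relabel rule.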

And the configuration at /etc/loki/config.yml

auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 1h # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576 # Loki will attempt to build chunks up to 1 MiB (1048576 bytes), flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0 # Chunk transfers disabled
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks
compactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystem
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
ruler:
  storage:
    type: local
    local:
      directory: /tmp/loki/rules
  rule_path: /tmp/loki/rules-temp
  alertmanager_url: http://localhost:9093
  ring:
    kvstore:
      store: inmemory
  enable_api: true

Then create the service files, /etc/systemd/system/promtail.service

[Unit]
Description=Promtail service
After=network.target
[Service]
Type=simple
User=promtail
Group=promtail
Restart=always
RestartSec=5
ExecStart=/usr/bin/promtail -config.file /etc/promtail/config.yml
[Install]
WantedBy=multi-user.target

And /etc/systemd/system/loki.service

[Unit]
Description=Loki service
After=network.target
[Service]
Type=simple
User=loki
Group=loki
Restart=always
RestartSec=5
ExecStart=/usr/bin/loki -config.file /etc/loki/config.yml
[Install]
WantedBy=multi-user.target

Reload systemd so it picks up the new unit files, then start and enable both services.

sudo systemctl daemon-reload
sudo systemctl start promtail
sudo systemctl enable promtail
sudo systemctl start loki
sudo systemctl enable loki
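Both services expose a /ready endpoint over HTTP on the ports configured above (3100 for loki, 9080 for promtail), so a quick check that they came up could look like this. It is only a sketch: the function name is mine, and loki may report not ready for a short while after starting.

```shell
# Hypothetical helper: succeeds if the service at base URL $1 answers
# its /ready endpoint with a successful HTTP status.
service_ready() {
  # -f makes curl fail on HTTP errors; -o /dev/null discards the body.
  curl -fsS --max-time 5 -o /dev/null "${1%/}/ready" 2>/dev/null
}

# Example checks, commented out so the snippet does nothing on its own:
# service_ready http://localhost:3100 && echo "loki is ready"
# service_ready http://localhost:9080 && echo "promtail is ready"
# journalctl -u loki.service -n 20   # look here if a check keeps failing
```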

Adding Loki to Grafana is as simple as adding a datasource.

Adding a data source in Grafana

The default URL http://localhost:3100 is correct and all you need to enter.

Once you set up the data source, here is a dashboard you can import to get you started. First click the + sign in the menu and then import:

Then import the following JSON. Note that the service names on my node may differ from yours, and each panel can be configured to point to different systemd logs, so not all the logs panels will work for you out of the box.

{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"description": "All the logs to care about",
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 11,
"links": [],
"panels": [
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 8,
"w": 24,
"x": 0,
"y": 0
},
"id": 3,
"options": {
"showLabels": false,
"showTime": false,
"sortOrder": "Descending",
"wrapLogMessage": false
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"geth.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "ETH1 geth logs",
"type": "logs"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 8
},
"id": 8,
"panels": [],
"title": "ETH2 logs",
"type": "row"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 9,
"w": 11,
"x": 0,
"y": 9
},
"id": 2,
"options": {
"showLabels": false,
"showTime": false,
"sortOrder": "Descending",
"wrapLogMessage": false
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"beacon.service\"} != \" DEBG \"",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Beacon node logs",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 9,
"w": 13,
"x": 11,
"y": 9
},
"id": 6,
"options": {
"showLabels": false,
"showTime": false,
"sortOrder": "Descending",
"wrapLogMessage": false
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"validator.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "Validator logs",
"type": "logs"
},
{
"collapsed": false,
"datasource": null,
"gridPos": {
"h": 1,
"w": 24,
"x": 0,
"y": 18
},
"id": 10,
"panels": [],
"title": "Others",
"type": "row"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 7,
"x": 0,
"y": 19
},
"id": 5,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"cron.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "cron",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 7,
"y": 19
},
"id": 12,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"systemd-timesyncd.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "timesyncd",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 9,
"x": 15,
"y": 19
},
"id": 14,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"wpa_supplicant.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "wpa_supplicant",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 7,
"x": 0,
"y": 26
},
"id": 17,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"ssh.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "ssh",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 7,
"y": 26
},
"id": 15,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"systemd-networkd.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "networkd",
"type": "logs"
},
{
"datasource": "Loki",
"fieldConfig": {
"defaults": {
"custom": {}
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 9,
"x": 15,
"y": 26
},
"id": 16,
"options": {
"showLabels": false,
"showTime": true,
"sortOrder": "Descending",
"wrapLogMessage": true
},
"pluginVersion": "7.3.4",
"targets": [
{
"expr": "{unit=\"prometheus.service\"}",
"legendFormat": "",
"refId": "A"
}
],
"timeFrom": null,
"timeShift": null,
"title": "prometheus",
"type": "logs"
}
],
"refresh": "1d",
"schemaVersion": 26,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {},
"timezone": "",
"title": "Logs",
"uid": "-HF1TKAGk",
"version": 15
}
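Since the service names on your node may differ from mine, it can help to ask loki which unit values it has actually seen, and to try a panel query by hand before editing the dashboard. The sketch below uses loki’s HTTP API with curl. The LOKI_URL variable and the function names are my own choices for illustration.

```shell
# Base URL of the local loki instance (assumed to match the config above).
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build the same kind of LogQL selector the panels use.
# $1 = systemd unit name, $2 = optional substring to filter out.
logql_for_unit() {
  local query="{unit=\"$1\"}"
  if [ -n "${2:-}" ]; then
    query="$query != \"$2\""
  fi
  printf '%s\n' "$query"
}

# List every value of the unit label that loki knows about.
list_units() {
  curl -fsS "$LOKI_URL/loki/api/v1/label/unit/values"
}

# Fetch recent log lines for a query. $1 = LogQL query, $2 = max lines.
# Without start/end parameters loki looks at roughly the last hour.
query_loki() {
  curl -fsS -G "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-10}"
}
```

For example, list_units shows the unit names to use in the panels, and query_loki "$(logql_for_unit beacon.service ' DEBG ')" 5 runs the same filter as the beacon node panel.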

Email alerts

Example email alert from Grafana

You will want to know when your node goes down, the CPU usage is excessive, the temperature is too high, the blockchain is not synced, etc. These alerts are easy to configure and test directly in the Grafana dashboard. However, you need to configure an alert target to get emails.

For this I recommend Zapier. Zapier can receive a webhook from Grafana containing alert information, and trigger any other action with information from the webhook.

The zap configuration in Zapier

When you’re creating your zap, you will get a unique URL in the webhook step. Enter this URL in a new notification channel in Grafana. Click test, and Zapier should pick up all the fields in the webhook and make them available in the next step.

Grafana dashboard alert notification channel configuration

Then once Zapier has received the test hook, you can insert the fields that it found in the test hook into your email however you prefer.

Zapier send email configuration

Configure some alerts in your dashboard and set this channel as the recipient, and you should start getting email alerts.
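If you want to poke the zap without going through Grafana, you can post a hand-made payload to the webhook URL. This is a rough sketch: the field names (title, ruleName, state, message) are modeled on what I believe Grafana’s webhook notifier sends, so treat them as assumptions and compare with what Zapier shows from a real test. The function names are made up for illustration.

```shell
# Build a small alert-like JSON payload.
# $1 = rule name, $2 = message. No JSON escaping is done, so keep inputs simple.
build_test_alert() {
  printf '{"title":"[Alerting] %s","ruleName":"%s","state":"alerting","message":"%s"}\n' \
    "$1" "$1" "$2"
}

# Post the payload to the webhook. $1 = the unique URL from the Zapier webhook step.
send_test_alert() {
  build_test_alert "Test alert" "Sent manually with curl" |
    curl -fsS -X POST -H "Content-Type: application/json" --data-binary @- "$1"
}
```

Calling send_test_alert with your webhook URL should make a new request show up in Zapier.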

Bootleg dynamic DNS

When you’re on the road and your node goes down, you have to connect to it and make sure everything starts back up cleanly. The first step is finding your node’s IP on the public internet.

Many ISPs do not assign a static IP address to your home. You can end up locked out of your home network just because you don’t know the IP address. I set up a subdomain of one of my Cloudflare domains to point at my home network, and a cron job to update it every 30 minutes.

A lot of routers support dynamic DNS directly, so first check that your router can do this. It will be more reliable in cases where your node goes down but your internet connection is up.

If you use Cloudflare as your registrar, you can use the following script, replacing the empty variables with the values you get from Cloudflare.

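#!/usr/bin/env bash
# Look up this network's public IP via OpenDNS.
IP=$(dig +short myip.opendns.com @resolver1.opendns.com)
# Fill these in with the values from your Cloudflare account.
ZONE_ID=""
RECORD_ID=""
API_TOKEN=""
# Point the existing DNS record at the current IP.
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_TOKEN" \
  -d "{\"content\":\"$IP\"}"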
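One weakness of the script as written is that it sends whatever dig printed, which can be empty or an error message if the lookup fails. A possible hardening, sketched below, is to validate the address and skip the API call when it has not changed. The helper names and the cache file path are hypothetical.

```shell
# Succeeds only if $1 looks like a dotted-quad IPv4 address.
is_ipv4() {
  local addr="$1" octet
  case "$addr" in
    ''|*[!0-9.]*|.*|*.|*..*) return 1 ;;  # empty, stray characters, or misplaced dots
  esac
  local IFS=.
  set -- $addr                            # split into octets on the dots
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    case "$octet" in ????*) return 1 ;; esac  # more than three digits
    [ "$octet" -le 255 ] || return 1
  done
}

# Succeeds if $1 differs from the address saved in file $2 (or no file exists yet).
ip_changed() {
  [ ! -f "$2" ] || [ "$(cat "$2")" != "$1" ]
}

# How this could wrap the update, commented out so nothing runs here:
# if is_ipv4 "$IP" && ip_changed "$IP" "$HOME/.last_public_ip"; then
#   (run the curl PATCH call from the script above)
#   printf '%s\n' "$IP" > "$HOME/.last_public_ip"
# fi
```

Skipping unchanged addresses also means the cron job makes far fewer API calls.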

Save the script, make it executable with chmod +x, then set up the scheduled job with crontab.

crontab -e

Add the following line to your cron file. Note the output is ignored so cron does not try to email it to you, which requires additional setup.

*/30 * * * * /home/username/update_cloudflare_dns.sh >/dev/null 2>&1

Sr. Software Engineer at Uniswap