TLDR

⏱️

Using RFC5424-based formatter in syslog output from ESXi 8.0 build 22380479 U2 IA, the output seems to include an extra NILVALUE after the MSGID, breaking the STRUCTURED-DATA (SD) parts (being read as MSG instead). The PARAM-VALUE is also missing double quote characters, making it harder for parser to read (e.g. Promtail in Loki.

Background

I’m currently (ok, always) evaluating alternative (sys)log solutions. I’ve been slowly leaning towards Loki lately, in addition to existing Elastic and VMware vRealize Log Insight aka vRLI (guess it’s called now Aria-something, but I usually ignore marketing people, and focus on the technical bits). Since ESXi 8.x, VMware introduced some new goodies for us “log fanatics” (maybe just me), including improved standards, both for locally stored logs (scratch) and by forwarding logs using syslog (vmsyslogd). This was a great time for me to (re)evaluate using the improved RFC5424 (which obsoletes RFC3164). In turn, this led me down a little “troubleshooting session”, where main project (still) being researching ESXi 8 syslog to Loki best practices (which I might write more about later). Let’s solve some issues first.

Summary

Troubleshooting notes (RAW)

Inspect: syslog packets sent from ESXi using tcpdump-uw

Using tcpdump-uw to view (unencrypted, for now..) syslog traffic to Loki (port 1514, tcp) with some “structured data” included

  • filter (grep) on 6876 which is the IANA assigned enterpriseId number for VMware
  • example output from the command tcpdump-uw port 1514 -An|grep 6876, run on esx-13

Example syslog message

<166>1 2023-09-27T14:47:53.655Z esx-13.esod.local vsansystem 1050126 - - [vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation

Finding: extra NILVALUE after the MSGID, causing STRUCTURED-DATA to be NILVALUE

Possibly extra NILVALUE after the MSGID (also NILVALUE), where I guess the STRUCTURED-DATA is suppose to be(?)

Unexpected structure

vsansystem   1050126    -      -     [vSAN@6876 sub=Libs]
   ^            ^       ^      ^           ^
app-name      procid  msgid   ?NIL?   structured data

Expected structure

vsansystem   1050126    -       [vSAN@6876 sub=Libs]
   ^            ^       ^             ^
app-name      procid  msgid     structured data

Example: on VALID STRUCTURED-DATA from RFC5424

Based on RFC5424, 6.5 Examples > Example 3 - with STRUCTURED-DATA; could it be an extra NILVALUE (hyphen) issue in log entries from VMware ESXi syslog?

Example from https://www.rfc-editor.org/rfc/rfc5424#section-6.5 (see Example 3 - with STRUCTURED-DATA)
<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"] BOMAn applicationevent log entry...

Testing: syslog receiver with modified data to verify STRUCTURED-DATA

To verify issues, I’m using a simple echo and netcat (nc) to send the original and modified syslog to the (very same) server

echo '<166>1 2023-09-27T14:47:53.655Z esx-dummy.esod.local vsansystem 1050126 - - [vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation' | nc -v -w 0 loki-testing.esod.local 1514
Connection to loki-testing.esod.local port 1514 [tcp/fujitsu-dtcns] succeeded!

Results: ❌ Structured data is parsed as MSG (FAIL)

[
    {
        "line": "[vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation",
        "timestamp": "1695843921656941908",
        "fields": {
            "Line": "[vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation",
            "host": "esx-dummy.esod.local",
            "job": "syslog",
            "syslog_app_name": "vsansystem",
            "syslog_connection_hostname": "archer.esod.local.",
            "syslog_connection_ip_address": "192.168.1.99",
            "syslog_facility": "local4",
            "syslog_procid": "1050126",
            "level": "informational"
        }
    }
]

Testing: to remove the extra NILVALUE (hyphen)

echo '<166>1 2023-09-27T14:47:53.655Z esx-dummy.esod.local vsansystem 1050126 - [vSAN@6876 sub=Libs] VsanInfoImpl: Refresh config generation' | nc -v -w 0 loki-testing.esod.local 1514
Connection to loki-testing.esod.local port 1514 [tcp/fujitsu-dtcns] succeeded!

Results: 🔍 Seems to work better, but additional (new) error detected

Notice we now get this (new) output from the promtail log (syslogtarget.go)

level=warn ts=2023-09-27T19:49:25.88006848Z caller=syslogtarget.go:105 msg="error parsing syslog stream" err="expecting a structured data parameter (`key=\"value\"`, both part from 1 to max 32 US-ASCII characters; key cannot contain `=`, ` `, `]`, and `\"`, while value cannot contain `]`, backslash, and `\"` unless escaped) [col 89]"

So the error is “expecting a structured data parameter”, and the expected key=\"value\"

I guess the above error is thrown due to missing double quote characters around the PARAM-VALUE?

Testing: to use double quote characters around the PARAM-VALUE

e.g. sub="Libs" vs previous sub=Libs

echo '<166>1 2023-09-27T14:47:53.655Z esx-dummy.esod.local vsansystem 1050126 - [vSAN@6876 sub="Libs"] VsanInfoImpl: Refresh config generation' | nc -v -w 0 loki-testing.esod.local 1514
Connection to loki-testing.esod.local port 1514 [tcp/fujitsu-dtcns] succeeded!

Results: ✅ STRUCTURED-DATA now parsed successfully

Notice that the parser now successfully reads the STRUCTURED-DATA

  • After removing the extra NILVALUE
  • After surrounding PARAM-VALUE in double quote characters
  • After correcting the source_labels for STRUCTURED-DATA to case-sensitive (promtail-part)

Verify: troubleshooting and theory

echo '<166>1 2023-09-27T14:47:53.655Z esx-dummy.esod.local vsansystem 1050126 - [vSAN@6876 sub="Libs" test="esod"] VsanInfoImpl: Refresh config generation' | nc -v -w 0 loki-testing.esod.local 1514
Connection to loki-testing.esod.local port 1514 [tcp/fujitsu-dtcns] succeeded!

The additional PARAM-NAME ( test) is just for testing

The STRUCTURED-DATA is now being parsed successfully.

{
	"line": "VsanInfoImpl: Refresh config generation",
	"timestamp": "1695845695416159983",
	"fields": {
		"Line": "VsanInfoImpl: Refresh config generation",
		"host": "esx-dummy.esod.local",
		"job": "syslog",
		"syslog_app_name": "vsansystem",
		"syslog_connection_hostname": "archer.esod.local.",
		"syslog_connection_ip_address": "192.168.1.99",
		"syslog_facility": "local4",
		"syslog_procid": "1050126",
		"syslog_sd_vsan_6876_sub": "Libs",
		"syslog_sd_vsan_6876_test": "esod",
		"level": "informational"
	}
}

For reference, this is my working scrape_configs for syslog (using Promtail directly), including relevant relabel_configs (in YAML) for STRUCTURED-DATA (which is not working if parsed as NIL)

Note that the STRUCTURED-DATA examples are based on output directly from ESXi syslog, as detected when viewing (unencrypted) traffic using tcpdump-uw - directly from ESXi sending the syslog.

scrape_configs:
  - job_name: syslog
    syslog:
      listen_address: 0.0.0.0:1514
      listen_protocol: tcp
      idle_timeout: 60s
      label_structured_data: yes
      labels:
        job: "syslog"
    relabel_configs:
      - source_labels: [__syslog_message_hostname]
        target_label: host
      - source_labels: [__syslog_message_severity]
        target_label: level
      - source_labels: [__syslog_message_app_name]
        target_label: syslog_app_name
      - source_labels: [__syslog_message_facility]
        target_label: syslog_facility
      - source_labels: [__syslog_message_proc_id]
        target_label: syslog_procid
      - source_labels: [__syslog_message_msg_id]
        target_label: syslog_msgid
      - source_labels: [__syslog_connection_ip_address]
        target_label: syslog_connection_ip_address
      - source_labels: [__syslog_connection_hostname]
        target_label: syslog_connection_hostname
      - source_labels: [__syslog_message_sd_Originator_6876_sub]
        target_label: syslog_sd_originator_6876_sub
      - source_labels: [__syslog_message_sd_vSAN_6876_sub]
        target_label: syslog_sd_vsan_6876_sub

And while testing PARAM values, I created this name (temporarily)

- source_labels: [__syslog_message_sd_vSAN_6876_test]
target_label: syslog_sd_vsan_6876_test