title | excerpt | date | updated | tags | |
---|---|---|---|---|---|
Configure Splunk Universal Forwarder to ingest JSON files | Parse single-line JSON into separate events | 2023-06-17 | 2023-12-05 |
|
The recommended logging format according to Splunk best practice looks like this:
{ "datetime": 1672531212123456, "event_id": 1, "key1": "value1", "key2": "value2", "key3": "value3" }
{ "datetime": 1672531213789012, "event_id": 2, "key1": "value1", "key2": "value2", "key3": "value3" }
{ "datetime": 1672531214345678, "event_id": 3, "key1": "value1", "key2": "value2", "key3": "value3" }
- Each event is a JSON object, but the file as a whole is not.
- This also means the log file is not a valid JSON file.
- Events are separated by a newline.
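The difference between per-line JSON and a JSON file can be checked with a minimal Python sketch. It uses shortened copies of the sample events, and the variable names are illustrative only: every line parses on its own, while the file as a whole does not.

```python
import json

# Shortened copy of the sample log: one JSON object per line.
log_text = (
    '{"datetime": 1672531212123456, "event_id": 1, "key1": "value1"}\n'
    '{"datetime": 1672531213789012, "event_id": 2, "key1": "value1"}\n'
    '{"datetime": 1672531214345678, "event_id": 3, "key1": "value1"}\n'
)

# Each line is a valid JSON document on its own.
events = [json.loads(line) for line in log_text.splitlines()]

# The file as a whole is not: the parser rejects the data after the first object.
try:
    json.loads(log_text)
    whole_file_is_json = True
except json.JSONDecodeError:
    whole_file_is_json = False
```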
This format can be achieved by exporting each live event as JSON and appending it to a log file. However, I encountered a situation where the log file could only be generated in batches. Exporting the equivalent of the previous "example.log" as JSON without string manipulation looks like this:
[{"datetime": 1672531212123456, "event_id": 1, "key1": "value1", "key2": "value2", "key3": "value3"}, {"datetime": 1672531213789012, "event_id": 2, "key1": "value1", "key2": "value2", "key3": "value3"}, {"datetime": 1672531214345678, "event_id": 3, "key1": "value1", "key2": "value2", "key3": "value3"}]
I will detail the required configurations in this post, so that Splunk is able to parse it correctly even though "example.json" does not follow the recommended one-event-per-line format.
UF inputs.conf
[monitor:///var/log/app_a]
disabled = 0
index = index_name
sourcetype = app_a_event
The monitor directive is made up of two parts: monitor:// and the path, e.g. /var/log/app_a. Unlike most Splunk configs, this directive doesn't require the backslash (used in Windows paths) to be escaped, e.g. monitor://C:\foo\bar.
A path can be a file or a folder. When the wildcard (*) is used to match multiple folders, another wildcard must be specified to match the files in those folders, because a wildcard matches a single path segment only. For example, to match all the following files, use monitor:///var/log/app_*/*. Splunk also supports "..." for recursive matching.
/var/log/
├── app_a
│   ├── 1.log
│   ├── 2.log
│   └── 3.log
├── app_b
│   ├── 1.log
│   ├── 2.log
│   └── 3.log
└── app_c
    ├── 1.log
    ├── 2.log
    └── 3.log
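The single-segment behaviour of the wildcard can be illustrated with a rough Python sketch. It assumes that "*" behaves like the regex [^/]* and "..." like .*, and the helper name monitor_path_to_regex is hypothetical; it approximates the matching rules described above rather than reproducing Splunk's implementation.

```python
import re

def monitor_path_to_regex(path: str) -> re.Pattern:
    """Translate a monitor path into an anchored regex (approximation)."""
    parts = re.split(r"(\.\.\.|\*)", path)
    out = []
    for part in parts:
        if part == "...":
            out.append(".*")       # recursive: may cross "/" boundaries
        elif part == "*":
            out.append("[^/]*")    # stays within a single path segment
        else:
            out.append(re.escape(part))
    return re.compile("^" + "".join(out) + "$")

# The nine files from the directory tree above.
files = [f"/var/log/app_{app}/{n}.log" for app in "abc" for n in (1, 2, 3)]

one_wildcard = monitor_path_to_regex("/var/log/app_*")
two_wildcards = monitor_path_to_regex("/var/log/app_*/*")
recursive = monitor_path_to_regex("/var/log/...")

matched_one = [f for f in files if one_wildcard.match(f)]
matched_two = [f for f in files if two_wildcards.match(f)]
matched_recursive = [f for f in files if recursive.match(f)]
```

In this approximation a single wildcard matches only the folders, so none of the files are picked up, while the second wildcard or the recursive form reaches all nine.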
Specify an appropriate value for the sourcetype setting; it becomes the value of the sourcetype field in events ingested under this "monitor" directive. Take note of the value you configure, as it is used in the rest of the configuration.
Forwarder props.conf
[app_a_event]
description = App A logs
INDEXED_EXTRACTIONS = JSON
# remove bracket at the start and end of each line
SEDCMD-remove_prefix = s/^\[//g
SEDCMD-remove-suffix = s/\]$//g
# separate each object into a line
LINE_BREAKER = }(,){\"datetime\"
# a line represents an event
SHOULD_LINEMERGE = 0
TIMESTAMP_FIELDS = datetime
## default is 2000
# MAX_DAYS_AGO = 3560
# TIME_FORMAT = %s
The stanza name should be the sourcetype value specified in inputs.conf.
- SEDCMD: A sed script. SEDCMD-<name> can be specified multiple times to run different scripts, each with a different name.
  - s/^\[//g removes "[" at the start of each line.
  - s/\]$//g removes "]" at the end of each line.
- LINE_BREAKER: Searches for text that matches the regex and replaces only the capturing group with a newline (\n). This puts each event on its own line.
  - }(,){\"datetime\" searches for },{"datetime" and replaces "," with "\n".
  - If the objects are separated by a comma followed by a space, as in the sample above, the space must be part of the capturing group, e.g. }(, ){\"datetime\".
- SHOULD_LINEMERGE: Only needed for events that span multiple lines. This case is the reverse: the log file has all events on one line.
- TIMESTAMP_FIELDS: Refers to the datetime key in example.json.
- MAX_DAYS_AGO (optional): Specify a value if there are events older than 2,000 days.
- TIME_FORMAT: Optional if Unix time is used. When Unix time is used, it is not necessary to specify %s%3N even when the value includes subseconds.
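The combined effect of LINE_BREAKER and the two SEDCMD scripts can be traced with a small Python simulation. This is only a sketch: the helper line_break is hypothetical, the sample is shortened, and the capturing group is written as (,\s*) so that it also matches the comma-plus-space separator in the sample. The sed scripts are applied to each event after line breaking here; for this input, applying them first gives the same result.

```python
import json
import re

# Shortened copy of the batch export: a single line holding a JSON array.
raw = (
    '[{"datetime": 1672531212123456, "event_id": 1, "key1": "value1"}, '
    '{"datetime": 1672531213789012, "event_id": 2, "key1": "value1"}, '
    '{"datetime": 1672531214345678, "event_id": 3, "key1": "value1"}]'
)

def line_break(text, pattern):
    """Split text at each match, discarding only the first capturing group."""
    events, start = [], 0
    for m in re.finditer(pattern, text):
        events.append(text[start:m.start(1)])
        start = m.end(1)
    events.append(text[start:])
    return events

# Stand-in for LINE_BREAKER, with optional whitespace after the comma.
events = line_break(raw, r'\}(,\s*)\{"datetime"')

# Stand-in for the two SEDCMD scripts, applied to every event.
events = [re.sub(r"\]$", "", re.sub(r"^\[", "", e)) for e in events]

# Each event is now a standalone JSON object.
parsed = [json.loads(e) for e in events]
```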
The location of "props.conf" depends on whether the universal forwarder is centrally managed by a deployment server.
Path A: $SPLUNK_HOME/etc/deployment-apps/foo/local/props.conf
Path B: $SPLUNK_HOME/etc/apps/foo/local/props.conf
If there is a deployment server, the config file should be in path A on the server, which automatically deploys it to path B on the UF. If the UF is not centrally managed, the file should go straight into path B.
Search head props.conf
[app_a_event]
description = App A logs
KV_MODE = none
AUTO_KV_JSON = 0
SHOULD_LINEMERGE = 0
# MAX_DAYS_AGO = 3560
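KV_MODE = none and AUTO_KV_JSON = 0 turn off search-time JSON extraction; otherwise, fields already extracted at index time by INDEXED_EXTRACTIONS would be extracted a second time and show up with duplicate values.
In Splunk Enterprise, the above file can be saved in a custom app, e.g. "$SPLUNK_HOME/etc/apps/custom-app/default/props.conf".
For a Splunk Cloud deployment, the above configuration can be added through a custom app or Splunk Web: Settings > Source types.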
Ingesting API response
It is important to note that SEDCMD runs after INDEXED_EXTRACTIONS. I noticed this behaviour when I tried to ingest the API response of LibreNMS.
{"status": "ok", "devices": [{"device_id": 1, "key1": "value1", "key2": "value2"}, {"device_id": 2, "key1": "value1", "key2": "value2"}, {"device_id": 3, "key1": "value1", "key2": "value2"}], "count": 3}
In this scenario, I only wanted to ingest the "devices" array, with each item as an event. The previous approach not only failed to split the array, but the "status" and "count" fields still existed in each event despite the use of SEDCMD to remove them.
The solution is not to use INDEXED_EXTRACTIONS (index-time field extraction), but to use KV_MODE (search-time field extraction) instead.
# forwarder
[api_a_response]
description = API A response
# remove the wrapper before and after the "devices" array
SEDCMD-remove_prefix = s/^\{"status": "ok", "devices": \[//g
SEDCMD-remove_suffix = s/\], "count": [0-9]+\}$//g
# separate each object into a line
LINE_BREAKER = }(, ){\"device_id\"
# a line represents an event
SHOULD_LINEMERGE = 0
# search head
[api_a_response]
description = API A response
KV_MODE = json
AUTO_KV_JSON = 1
SHOULD_LINEMERGE = 0
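The forwarder settings above can be traced with a short Python simulation of the API response. In this sketch (plain Python with hypothetical variable names, and the sed scripts applied per event after line breaking as an assumption), the wrapper is stripped from the first and last events, and each remaining event parses as JSON on its own, which is what search-time extraction with KV_MODE = json relies on.

```python
import json
import re

# The sample API response, split across literals for readability.
response = (
    '{"status": "ok", "devices": ['
    '{"device_id": 1, "key1": "value1", "key2": "value2"}, '
    '{"device_id": 2, "key1": "value1", "key2": "value2"}, '
    '{"device_id": 3, "key1": "value1", "key2": "value2"}'
    '], "count": 3}'
)

# Stand-in for LINE_BREAKER: break between objects and drop the captured ", ".
events, start = [], 0
for m in re.finditer(r'\}(, )\{"device_id"', response):
    events.append(response[start:m.start(1)])
    start = m.end(1)
events.append(response[start:])

# Stand-in for SEDCMD: strip the wrapper around the "devices" array.
prefix = r'^\{"status": "ok", "devices": \['
suffix = r'\], "count": [0-9]+\}$'
events = [re.sub(suffix, "", re.sub(prefix, "", e)) for e in events]

# Stand-in for search-time extraction: each event parses as its own JSON object.
devices = [json.loads(e) for e in events]
```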