From ce73b39d1ebdf45fc74e968e2752606ba217271e Mon Sep 17 00:00:00 2001
From: Ming Di Leom <2809763-curben@users.noreply.gitlab.com>
Date: Sat, 25 Dec 2021 07:26:09 +0000
Subject: [PATCH] post: Parsing NGINX log in Splunk

---
 source/_posts/nginx-splunk-field-extractor.md | 208 ++++++++++++++++++
 1 file changed, 208 insertions(+)
 create mode 100644 source/_posts/nginx-splunk-field-extractor.md

diff --git a/source/_posts/nginx-splunk-field-extractor.md b/source/_posts/nginx-splunk-field-extractor.md
new file mode 100644
index 0000000..820c04d
--- /dev/null
+++ b/source/_posts/nginx-splunk-field-extractor.md
@@ -0,0 +1,208 @@
---
title: Parsing NGINX log in Splunk
excerpt: Configure regex in field extractor to create relevant fields
date: 2021-12-25
tags:
- splunk
- nginx
---

For web server access logs, Splunk has built-in support for Apache only. Splunk has a feature called field extractor. It is powered by delimiters and regex, and enables users to add new [_fields_](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/Aboutfields) to be used in a search query. This post only covers the regex patterns to parse nginx logs; for instructions on the field extractor, I recommend perusing the [official documentation](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/ExtractfieldsinteractivelywithIFX).

To illustrate, say we have a log format like this:

```
{id} "{http.request.host}" "{http.request.header.user-agent}"
```

An example log is:

```
123 "example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"
```

While you could search for a specific keyword, e.g. attempts of the {% post_link log4shell-log4j-unbound-dns 'Log4shell exploit' %}, since there are no fields, you cannot run any statistics like [`table`](https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Table) or [`stats`](https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/stats) on the search results.
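
To see what field extraction buys us, here is a rough illustration outside of Splunk, using Python's `re` module on the example log above. Splunk names a capture group as `(?<name>...)`, whereas Python spells it `(?P<name>...)`; the field names `id`, `host` and `user_agent` are my own choices for this sketch, not ones prescribed by Splunk.

```python
import re

# Each named capture group becomes a "field" we can query,
# analogous to what Splunk's field extractor produces.
LOG_PATTERN = re.compile(
    r'(?P<id>\d+)\s"(?P<host>[^"]+)"\s"(?P<user_agent>[^"]+)"'
)

line = ('123 "example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; '
        'rv:95.0) Gecko/20100101 Firefox/95.0"')

match = LOG_PATTERN.match(line)
fields = match.groupdict() if match else {}
print(fields["host"])  # → example.com
```

Once every event is parsed into a dict like this, per-field aggregation (counting requests per `host`, for instance) becomes trivial; that is what `table` and `stats` do over extracted fields.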

Splunk is able to understand the Apache log format because its field extractor already includes the necessary regex patterns to parse the relevant fields of each line in a log. Choosing a source type is equivalent to choosing a log format. If a format is not listed in [the default list](https://docs.splunk.com/Documentation/Splunk/8.2.3/Data/Listofpretrainedsourcetypes), we can either use an add-on or create new fields using the field extractor. There is a Splunk [add-on](https://docs.splunk.com/Documentation/AddOns/latest/NGINX) for nginx and I suggest trying it before resorting to the field extractor.

I created five patterns which cover most of the nginx events I encountered during my work. Refer to the documentation for the [supported syntax](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/AboutSplunkregularexpressions).

A field is extracted through a "capturing group":

```
(?<field_name>capture pattern)
```

For example, `(?<month>\w+)` searches for one or more (`+`) alphanumeric characters (`\w`) and names the field `month`. I opted for lazier matching, mostly using the unbounded quantifier `+` instead of a stricter range of occurrences `{M,N}`, despite knowing the exact pattern of a field. I found some fields may stray slightly from the expected pattern, so lazier matching tends to match more events without matching unwanted ones.

## Web request

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?