blog/source/_posts/nginx-splunk-field-extracto...

11 KiB

title excerpt date tags
Parsing NGINX log in Splunk Configure regex in field extractor to create relevant fields 2021-12-25
splunk
nginx

For web server's access log, Splunk has built-in support for Apache only. Splunk has a feature called field extractor. It is powered by delimiter and regex, and enables user to add new fields to be used in a search query. This post will only covers the regex patterns to parse nginx log, for instruction on field extractor, I recommend perusing the official documentation.

To illustrate, say we have a log format like this:

{id} "{http.request.host}" "{http.request.header.user-agent}"

An example log is:

123 "example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"

While you could search for a specific keyword, e.g. attempts of {% post_link log4shell-log4j-unbound-dns 'Log4shell exploit' %}, since there are no fields, you cannot run any statistics like table or stats on the search results.

Splunk is able to understand Apache log format because its field extractor already includes the necessary regex patterns to parse the relevant fields of each line in a log. Choosing a source type is equivalent of choosing a log format. If a format is not listed in the default list, we can either use an add-on or create new fields using field extractor. There is a Splunk add-on for nginx and I suggest to try it before resorting to field extractor.

I create five patterns which cover most of the nginx events I encountered during my work. Refer to the documentation for supported syntax.

A field is extracted through "capturing group".

(?<field_name>capture pattern)

For example, (?<month>\w+) searches for one or more (+) alphanumeric characters (\w) and names the field as month. I opted for lazier matching, mostly using unbounded quantifier + instead of a stricter range of occurrences {M,N} despite knowing the exact pattern of a field. I found some fields may stray off slightly from the expected pattern, so a lazier matching tends match more events without matching unwanted's.

Web request

Regex

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<remote_ip>[\d\.]+)(?:\s\d+\s\S+\s\S+\s)\[(?<time_local>\S+)\s(?<timezone>\+\d{4})\]\s"(?<http_method>\w+)\s(?<http_path>.+)\s(?<http_version>HTTP/\d\.\d)"\s(?<http_status>\d{3})\s(?:\d+)\s"(?<request_url>.[^"]*)"\s"(?<http_user_agent>.[^"]*)"\s(?<server_ip>[\d\.]+)\:(?<server_port>\d+)(?:\s\d+\s\d+\s)(?<ssl_version>\S+)\s(?<ssl_cipher>\S+)\s(?<http_cookie>\S+)

Event

Dec 24 01:23:45 192.168.0.2 nginx: 1.2.3.4 55763 - - [24/Dec/2021:01:23:45 +0000] "GET /page.html HTTP/2.0" 200 494 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0" 192.168.1.2:8080 123 4 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 abcdef .

Fields

Field Value Regex Explanation
month Dec (?<month>\w+) One or more alphanumeric
day 24 (?<day>\d+) One or more digit
time 01:23:45 (?<time>[\d\:]+) One or more digit or semicolon
proxy_ip 192.168.0.2 (?<proxy_ip>[\d\.]+) One or more digit or dot
remote_ip 1.2.3.4 (?<remote_ip>[\d\.]+)
time_local 24/Dec/2021:01:23:45 (?<time_local>\S+) One or more non-whitespace characters
timezone +0000 (?<timezone>[\+\-]\d{4}) Four digits with plus or minus prefix
http_method GET (?<http_method>\w+)
http_path /page.html (?<http_path>.+) One or more of any character
http_version HTTP/2.0 (?<http_version>HTTP/\d\.\d) "HTTP", a digit, dot and digit
http_status 200 (?<http_status>\d{3}) Three digits
request_url https://www.example.com (?<request_url>.[^"]*) Zero or more of any character except double quote
http_user_agent Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0 (?<http_user_agent>.[^"]*)
server_ip 192.168.1.2 (?<server_ip>[\d\.]+)
server_port 8080 (?<server_port>\d+)
ssl_version TLSv1.2 (?<ssl_version>\S+)
ssl_cipher ECDHE-RSA-AES128-GCM-SHA256 (?<ssl_cipher>\S+)
http_cookie abcdef (?<http_cookie>\S+)

nginx is configured as a reverse proxy, proxy_ip is its ip whereas server_ip is the upstream's.

Proxy request

Regex

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\sclient\s)(?<remote_ip>[\d\.]+)\:(?<remote_port>\d+)(?:\sconnected\sto\s)(?<server_ip>[\d\.]+)\:(?<server_port>\d+)

Event

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776:*114333142 client 1.2.3.4:19802 connected to 192.168.1.2:8080

Fields

Field Value Regex Explanation
month Dec (?<month>\w+)
day 24 (?<day>\d+)
time 01:23:45 (?<time>[\d\:]+)
proxy_ip 192.168.0.2 (?<proxy_ip>[\d\.]+)
year 2021 (?<year>\d{4})
nmonth 12 (?<nmonth>\d{2})
log_level info (?<log_level>\w+)
remote_ip 1.2.3.4 (?<remote_ip>[\d\.]+)
remote_port 19802 (?<remote_port>\d+)
server_ip 192.168.1.2 (?<server_ip>[\d\.]+)
server_port 8080 (?<server_port>\d+)

Upstream error response

Regex

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>.[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)",\shost\:\s"(?<upstream_host>.[^"]*)

Event

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [error] 1776#1776:*71197740 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: example.com, request: "POST /api/path HTTP/2.0",upstream: "http://192.168.1.2:8080/api/path", host:"example.com"

Fields

Field Value Regex Explanation
month Dec (?<month>\w+)
day 24 (?<day>\d+)
time 01:23:45 (?<time>[\d\:]+)
proxy_ip 192.168.0.2 (?<remote_ip>[\d\.]+)
year 2021 (?<year>\d{4})
nmonth 12 (?<nmonth>\d{2})
log_level error (?<log_level>\w+)
upstream_error upstream timed out (110: Connection timed out) while reading response header from upstream (?<upstream_error>.[^,]*) Zero or more of any character except comma
remote_ip 1.2.3.4 (?<remote_ip>[\d\.]+)
server_host example.com (?<server_host>.[^,]*)
http_method POST (?<http_method>\w+)
http_path /api/path (?<http_path>\S+)
http_version HTTP/2.0 (?<http_version>HTTP/\d\.\d)
upstream_url http://192.168.1.2:8080/api/path (?<upstream_url>.[^"]*)
upstream_host example.com (?<upstream_host>.[^"]*)

Upstream epoll error

Regex

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)

Event

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 13199#13199: *81574833 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream:"http://192.168.1.2/page.html", host: "example.com"

Fields

Field Value Regex Explanation
month Dec (?<month>\w+)
day 24 (?<day>\d+)
time 01:23:45 (?<time>[\d\:]+)
proxy_ip 192.168.0.2 (?<remote_ip>[\d\.]+)
year 2021 (?<year>\d{4})
nmonth 12 (?<nmonth>\d{2})
log_level info (?<log_level>\w+)
upstream_error epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream (?<upstream_error>.[^,]*)
remote_ip 1.2.3.4 (?<remote_ip>[\d\.]+)
server_host example.com (?<server_host>.[^,]*)
http_method GET (?<http_method>\w+)
http_path /page.html (?<http_path>\S+)
http_version HTTP/1.1 (?<http_version>HTTP/\d\.\d)
upstream_url http://192.168.1.2/page.html (?<upstream_url>.[^"]*)
upstream_host example.com (?<upstream_host>.[^"]*)

Upstream epoll error with referrer

Regex

(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)(?:",\sreferrer\:\s")(?<referrer>.[^"]*)

Event

Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776:*71220252 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream: "http://192.168.1.2:8080/page.html", host: "example.com", referrer: "https://example.com"

Fields

Field Value Regex Explanation
month Dec (?<month>\w+)
day 24 (?<day>\d+)
time 01:23:45 (?<time>[\d\:]+)
proxy_ip 192.168.0.2 (?<remote_ip>[\d\.]+)
year 2021 (?<year>\d{4})
nmonth 12 (?<nmonth>\d{2})
log_level info (?<log_level>\w+)
upstream_error epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream (?<upstream_error>.[^,]*)
remote_ip 1.2.3.4 (?<remote_ip>[\d\.]+)
server_host example.com (?<server_host>.[^,]*)
http_method GET (?<http_method>\w+)
http_path /page.html (?<http_path>\S+)
http_version HTTP/1.1 (?<http_version>HTTP/\d\.\d)
upstream_url http://192.168.1.2:8080/page.html (?<upstream_url>.[^"]*)
upstream_host example.com (?<upstream_host>.[^"]*)
referrer https://example.com (?<referrer>.[^"]*)