mirror of https://gitlab.com/curben/blog
---
title: Parsing NGINX log in Splunk
excerpt: Configure regex in field extractor to create relevant fields
date: 2021-12-25
tags:
- splunk
- nginx
---

For web server access logs, Splunk has built-in support for Apache only. Splunk has a feature called field extractor, which is powered by delimiters and regex and enables users to add new [_fields_](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/Aboutfields) to be used in a search query. This post only covers the regex patterns to parse nginx logs; for instructions on field extractor, I recommend perusing the [official documentation](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/ExtractfieldsinteractivelywithIFX).

To illustrate, say we have a log format like this:

```
{id} "{http.request.host}" "{http.request.header.user-agent}"
```

An example log entry is:

```
123 "example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0"
```

You could still search for a specific keyword, e.g. attempts of {% post_link log4shell-log4j-unbound-dns 'Log4shell exploit' %}, but since there are no fields, you cannot run field-based commands like [`table`](https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Table) or [`stats`](https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/stats) on the search results.

Splunk is able to understand the Apache log format because its field extractor already includes the necessary regex patterns to parse the relevant fields of each line in a log. Choosing a source type is equivalent to choosing a log format. If a format is not listed in [the default list](https://docs.splunk.com/Documentation/Splunk/8.2.3/Data/Listofpretrainedsourcetypes), we can either use an add-on or create new fields using field extractor. There is a Splunk [add-on](https://docs.splunk.com/Documentation/AddOns/latest/NGINX) for nginx and I suggest trying it before resorting to field extractor.

I created five patterns that cover most of the nginx events I encountered in my work. Refer to the documentation for [supported syntax](https://docs.splunk.com/Documentation/Splunk/8.2.3/Knowledge/AboutSplunkregularexpressions).

A field is extracted through a named "capturing group".

```
(?<field_name>capture pattern)
```

For example, `(?<month>\w+)` searches for one or more (`+`) word characters (`\w`, i.e. alphanumeric or underscore) and names the field `month`. I opted for looser matching, mostly using the unbounded quantifier `+` instead of a stricter range of occurrences `{M,N}`, despite knowing the exact pattern of a field. I found that some fields stray slightly from the expected pattern, so looser matching tends to match more events without matching unwanted ones.

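
To see the trade-off between `+` and `{M,N}` outside Splunk, here is a minimal Python sketch. It assumes the syslog-style timestamp prefix used by the events in this post, where a single-digit day is padded with an extra space. The `strict` pattern and both sample strings are hypothetical and exist only for comparison. Note that Python spells a named group `(?P<name>...)` rather than `(?<name>...)`.

```python
import re

# Looser pattern, as used throughout this post (ported to Python syntax).
loose = re.compile(r"(?P<month>\w+)\s+(?P<day>\d+)\s(?P<time>[\d\:]+)")
# Hypothetical stricter variant: exactly one space, exactly two digits.
strict = re.compile(r"(?P<month>\w+)\s(?P<day>\d{2})\s(?P<time>[\d\:]+)")

padded = "Dec 24 01:23:45"
single = "Dec  1 01:23:45"  # single-digit day, padded with a space

loose_fields = loose.match(single).groupdict()
strict_on_padded = strict.match(padded)
strict_on_single = strict.match(single)

print(loose_fields)              # fields from the looser pattern
print(strict_on_single is None)  # the stricter pattern misses this event
```

The stricter pattern still matches `Dec 24`, but it silently drops events from the first nine days of a month, which is the kind of stray event the looser quantifiers are meant to keep.
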
## Web request

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<remote_ip>[\d\.]+)(?:\s\d+\s\S+\s\S+\s)\[(?<time_local>\S+)\s(?<timezone>[\+\-]\d{4})\]\s"(?<http_method>\w+)\s(?<http_path>.+)\s(?<http_version>HTTP/\d\.\d)"\s(?<http_status>\d{3})\s(?:\d+)\s"(?<request_url>.[^"]*)"\s"(?<http_user_agent>.[^"]*)"\s(?<server_ip>[\d\.]+)\:(?<server_port>\d+)(?:\s\d+\s\d+\s)(?<ssl_version>\S+)\s(?<ssl_cipher>\S+)\s(?<http_cookie>\S+)
```

### Event

```
Dec 24 01:23:45 192.168.0.2 nginx: 1.2.3.4 55763 - - [24/Dec/2021:01:23:45 +0000] "GET /page.html HTTP/2.0" 200 494 "https://www.example.com" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0" 192.168.1.2:8080 123 4 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 abcdef .
```

### Fields

Field | Value | Regex | Explanation
--- | --- | --- | ---
month | Dec | `(?<month>\w+)` | One or more word characters (alphanumeric or underscore)
day | 24 | `(?<day>\d+)` | One or more digits
time | 01:23:45 | `(?<time>[\d\:]+)` | One or more digits or colons
proxy_ip | 192.168.0.2 | `(?<proxy_ip>[\d\.]+)` | One or more digits or dots
remote_ip | 1.2.3.4 | `(?<remote_ip>[\d\.]+)` |
time_local | 24/Dec/2021:01:23:45 | `(?<time_local>\S+)` | One or more non-whitespace characters
timezone | +0000 | `(?<timezone>[\+\-]\d{4})` | Four digits with plus or minus prefix
http_method | GET | `(?<http_method>\w+)` |
http_path | /page.html | `(?<http_path>.+)` | One or more of any character
http_version | HTTP/2.0 | `(?<http_version>HTTP/\d\.\d)` | "HTTP/", a digit, a dot and a digit
http_status | 200 | `(?<http_status>\d{3})` | Three digits
request_url | https://www.example.com | `(?<request_url>.[^"]*)` | Any character followed by zero or more characters other than double quote
http_user_agent | Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0 | `(?<http_user_agent>.[^"]*)` |
server_ip | 192.168.1.2 | `(?<server_ip>[\d\.]+)` |
server_port | 8080 | `(?<server_port>\d+)` |
ssl_version | TLSv1.2 | `(?<ssl_version>\S+)` |
ssl_cipher | ECDHE-RSA-AES128-GCM-SHA256 | `(?<ssl_cipher>\S+)` |
http_cookie | abcdef | `(?<http_cookie>\S+)` |

Here nginx is configured as a reverse proxy: `proxy_ip` is its IP address, whereas `server_ip` is the upstream's.

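
Before pasting the pattern into field extractor, it can help to check it against a sample event locally. The sketch below is a Python port of the web request regex (using `(?P<name>...)` syntax), followed by a count of events by status, loosely similar to what `stats count by http_status` gives once the fields exist in Splunk. The variable names are illustrative, and the timezone group follows the `[\+\-]` form listed in the fields table.

```python
import re
from collections import Counter

# Python port of the web request pattern; only the group syntax differs.
WEB_REQUEST = re.compile(
    r'(?P<month>\w+)\s+(?P<day>\d+)\s(?P<time>[\d\:]+)\s(?P<proxy_ip>[\d\.]+)'
    r'(?:\snginx\:\s)(?P<remote_ip>[\d\.]+)(?:\s\d+\s\S+\s\S+\s)'
    r'\[(?P<time_local>\S+)\s(?P<timezone>[\+\-]\d{4})\]\s'
    r'"(?P<http_method>\w+)\s(?P<http_path>.+)\s(?P<http_version>HTTP/\d\.\d)"\s'
    r'(?P<http_status>\d{3})\s(?:\d+)\s"(?P<request_url>.[^"]*)"\s'
    r'"(?P<http_user_agent>.[^"]*)"\s(?P<server_ip>[\d\.]+)\:(?P<server_port>\d+)'
    r'(?:\s\d+\s\d+\s)(?P<ssl_version>\S+)\s(?P<ssl_cipher>\S+)\s(?P<http_cookie>\S+)'
)

# The sample event from this section.
EVENT = (
    'Dec 24 01:23:45 192.168.0.2 nginx: 1.2.3.4 55763 - - '
    '[24/Dec/2021:01:23:45 +0000] "GET /page.html HTTP/2.0" 200 494 '
    '"https://www.example.com" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0" '
    '192.168.1.2:8080 123 4 TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 abcdef .'
)

fields = WEB_REQUEST.search(EVENT).groupdict()
for name, value in fields.items():
    print(f"{name}: {value}")

# Count matching events per status code; non-matching lines are skipped.
events = [EVENT]
status_counts = Counter(
    m["http_status"] for m in map(WEB_REQUEST.search, events) if m
)
```

An event that yields no match here is a good hint that the same line would end up without fields in Splunk.
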
## Proxy request

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\sclient\s)(?<remote_ip>[\d\.]+)\:(?<remote_port>\d+)(?:\sconnected\sto\s)(?<server_ip>[\d\.]+)\:(?<server_port>\d+)
```

### Event

```
Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776: *114333142 client 1.2.3.4:19802 connected to 192.168.1.2:8080
```

### Fields

Field | Value | Regex | Explanation
--- | --- | --- | ---
month | Dec | `(?<month>\w+)` |
day | 24 | `(?<day>\d+)` |
time | 01:23:45 | `(?<time>[\d\:]+)` |
proxy_ip | 192.168.0.2 | `(?<proxy_ip>[\d\.]+)` |
year | 2021 | `(?<year>\d{4})` |
nmonth | 12 | `(?<nmonth>\d{2})` |
log_level | info | `(?<log_level>\w+)` |
remote_ip | 1.2.3.4 | `(?<remote_ip>[\d\.]+)` |
remote_port | 19802 | `(?<remote_port>\d+)` |
server_ip | 192.168.1.2 | `(?<server_ip>[\d\.]+)` |
server_port | 8080 | `(?<server_port>\d+)` |

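
The leading syslog timestamp carries no year, which is why `year` and the numeric month `nmonth` are taken from nginx's own timestamp. As a rough sketch of how those fields fit together, the Python port below rebuilds a full timestamp from `year`, `nmonth`, `day` and `time`. It assumes nginx's usual `pid#tid: *connection` spacing in the event, and the variable names are illustrative.

```python
import re
from datetime import datetime

# Python port of the proxy request pattern (named groups use (?P<name>...)).
PROXY_REQUEST = re.compile(
    r'(?P<month>\w+)\s+(?P<day>\d+)\s(?P<time>[\d\:]+)\s(?P<proxy_ip>[\d\.]+)'
    r'(?:\snginx\:\s)(?P<year>\d{4})\/(?P<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)'
    r'\[(?P<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\sclient\s)'
    r'(?P<remote_ip>[\d\.]+)\:(?P<remote_port>\d+)(?:\sconnected\sto\s)'
    r'(?P<server_ip>[\d\.]+)\:(?P<server_port>\d+)'
)

EVENT = (
    'Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] '
    '1776#1776: *114333142 client 1.2.3.4:19802 connected to 192.168.1.2:8080'
)

fields = PROXY_REQUEST.search(EVENT).groupdict()

# Combine the extracted pieces into one timestamp.
hour, minute, second = (int(part) for part in fields["time"].split(":"))
timestamp = datetime(
    int(fields["year"]), int(fields["nmonth"]), int(fields["day"]),
    hour, minute, second,
)
print(timestamp.isoformat())
print(fields["remote_ip"], fields["remote_port"], "->",
      fields["server_ip"], fields["server_port"])
```
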
## Upstream error response

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>.[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)",\shost\:\s"(?<upstream_host>.[^"]*)
```

### Event

```
Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [error] 1776#1776: *71197740 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 1.2.3.4, server: example.com, request: "POST /api/path HTTP/2.0", upstream: "http://192.168.1.2:8080/api/path", host: "example.com"
```

### Fields

Field | Value | Regex | Explanation
--- | --- | --- | ---
month | Dec | `(?<month>\w+)` |
day | 24 | `(?<day>\d+)` |
time | 01:23:45 | `(?<time>[\d\:]+)` |
proxy_ip | 192.168.0.2 | `(?<proxy_ip>[\d\.]+)` |
year | 2021 | `(?<year>\d{4})` |
nmonth | 12 | `(?<nmonth>\d{2})` |
log_level | error | `(?<log_level>\w+)` |
upstream_error | upstream timed out (110: Connection timed out) while reading response header from upstream | `(?<upstream_error>.[^,]*)` | Any character followed by zero or more characters other than comma
remote_ip | 1.2.3.4 | `(?<remote_ip>[\d\.]+)` |
server_host | example.com | `(?<server_host>.[^,]*)` |
http_method | POST | `(?<http_method>\w+)` |
http_path | /api/path | `(?<http_path>\S+)` |
http_version | HTTP/2.0 | `(?<http_version>HTTP/\d\.\d)` |
upstream_url | http://192.168.1.2:8080/api/path | `(?<upstream_url>.[^"]*)` |
upstream_host | example.com | `(?<upstream_host>.[^"]*)` |

## Upstream epoll error

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)
```

### Event

```
Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 13199#13199: *81574833 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream: "http://192.168.1.2/page.html", host: "example.com"
```

### Fields

Field | Value | Regex | Explanation
--- | --- | --- | ---
month | Dec | `(?<month>\w+)` |
day | 24 | `(?<day>\d+)` |
time | 01:23:45 | `(?<time>[\d\:]+)` |
proxy_ip | 192.168.0.2 | `(?<proxy_ip>[\d\.]+)` |
year | 2021 | `(?<year>\d{4})` |
nmonth | 12 | `(?<nmonth>\d{2})` |
log_level | info | `(?<log_level>\w+)` |
upstream_error | epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while connecting to upstream | `(?<upstream_error>[^,]*,[^,]*)` | Two comma-free segments joined by the comma inside the message
remote_ip | 1.2.3.4 | `(?<remote_ip>[\d\.]+)` |
server_host | example.com | `(?<server_host>.[^,]*)` |
http_method | GET | `(?<http_method>\w+)` |
http_path | /page.html | `(?<http_path>\S+)` |
http_version | HTTP/1.1 | `(?<http_version>HTTP/\d\.\d)` |
upstream_url | http://192.168.1.2/page.html | `(?<upstream_url>.[^"]*)` |
upstream_host | example.com | `(?<upstream_host>.[^"]*)` |

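
The epoll message itself contains a comma, which is why its `upstream_error` group is written as `[^,]*,[^,]*` instead of `.[^,]*`. The Python sketch below isolates just that part of the two patterns to show the difference. `SINGLE` and `DOUBLE` are illustrative names, the patterns are shortened to the span from the connection number to `remote_ip`, and the sample strings are trimmed copies of the events in this post.

```python
import re

# upstream_error as in "Upstream error response": stops at the first comma.
SINGLE = re.compile(
    r'\*\d+\s(?P<upstream_error>.[^,]*)(?:,\sclient\:\s)(?P<remote_ip>[\d\.]+)'
)
# upstream_error as in "Upstream epoll error": allows exactly one comma inside.
DOUBLE = re.compile(
    r'\*\d+\s(?P<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?P<remote_ip>[\d\.]+)'
)

timeout = (
    '*71197740 upstream timed out (110: Connection timed out) '
    'while reading response header from upstream, client: 1.2.3.4, '
    'server: example.com'
)
epoll = (
    '*81574833 epoll_wait() reported that client prematurely closed connection, '
    'so upstream connection is closed too while connecting to upstream, '
    'client: 1.2.3.4, server: example.com'
)

for label, pattern in (("single", SINGLE), ("double", DOUBLE)):
    for name, event in (("timeout", timeout), ("epoll", epoll)):
        match = pattern.search(event)
        print(label, name, match["upstream_error"] if match else None)
```

Each variant only matches its own kind of message, so the two event types need separate extractions.
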
## Upstream epoll error with referrer

### Regex

```
(?<month>\w+)\s+(?<day>\d+)\s(?<time>[\d\:]+)\s(?<proxy_ip>[\d\.]+)(?:\snginx\:\s)(?<year>\d{4})\/(?<nmonth>\d{2})(?:\/\d{2}\s[\d\:]+\s)\[(?<log_level>\w+)\](?:\s\d+#\d+\:\s\*\d+\s)(?<upstream_error>[^,]*,[^,]*)(?:,\sclient\:\s)(?<remote_ip>[\d\.]+)(?:,\sserver\:\s)(?<server_host>.[^,]*)(?:,\srequest\:\s")(?<http_method>\w+)\s(?<http_path>\S+)\s(?<http_version>HTTP/\d\.\d)(?:",\supstream\:\s")(?<upstream_url>.[^"]*)(?:",\shost\:\s")(?<upstream_host>.[^"]*)(?:",\sreferrer\:\s")(?<referrer>.[^"]*)
```

### Event

```
Dec 24 01:23:45 192.168.0.2 nginx: 2021/12/24 01:23:45 [info] 1776#1776: *71220252 epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream, client: 1.2.3.4, server: example.com, request: "GET /page.html HTTP/1.1", upstream: "http://192.168.1.2:8080/page.html", host: "example.com", referrer: "https://example.com"
```

### Fields

Field | Value | Regex | Explanation
--- | --- | --- | ---
month | Dec | `(?<month>\w+)` |
day | 24 | `(?<day>\d+)` |
time | 01:23:45 | `(?<time>[\d\:]+)` |
proxy_ip | 192.168.0.2 | `(?<proxy_ip>[\d\.]+)` |
year | 2021 | `(?<year>\d{4})` |
nmonth | 12 | `(?<nmonth>\d{2})` |
log_level | info | `(?<log_level>\w+)` |
upstream_error | epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while sending request to upstream | `(?<upstream_error>[^,]*,[^,]*)` |
remote_ip | 1.2.3.4 | `(?<remote_ip>[\d\.]+)` |
server_host | example.com | `(?<server_host>.[^,]*)` |
http_method | GET | `(?<http_method>\w+)` |
http_path | /page.html | `(?<http_path>\S+)` |
http_version | HTTP/1.1 | `(?<http_version>HTTP/\d\.\d)` |
upstream_url | http://192.168.1.2:8080/page.html | `(?<upstream_url>.[^"]*)` |
upstream_host | example.com | `(?<upstream_host>.[^"]*)` |
referrer | https://example.com | `(?<referrer>.[^"]*)` |

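
The last two patterns differ only in the trailing `referrer` part. One possible variation, which is not what the patterns above do, is to wrap that part in an optional group so a single pattern covers both events. The Python sketch below tries the idea on just the tail of the events. `TAIL` and the two sample strings are hypothetical, and I have not verified this form in field extractor.

```python
import re

# Tail of the epoll patterns with the referrer part made optional via (?:...)?
TAIL = re.compile(
    r'host\:\s"(?P<upstream_host>.[^"]*)'
    r'(?:",\sreferrer\:\s"(?P<referrer>.[^"]*))?'
)

with_referrer = 'host: "example.com", referrer: "https://example.com"'
without_referrer = 'host: "example.com"'

with_fields = TAIL.search(with_referrer).groupdict()
without_fields = TAIL.search(without_referrer).groupdict()

print(with_fields)
print(without_fields)  # referrer is None when the event has no referrer
```
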