curben 2018-10-10 16:55:29 +10:30
parent 3dcc053d0c
commit 4d98b65acb
5 changed files with 1000026 additions and 6 deletions


@ -34,7 +34,10 @@ deploy:
- cd build
# Give execute permission to scripts
- chmod 700 script.sh commit.sh
- chmod 700 umbrella-top-1m.sh script.sh commit.sh
# Download Umbrella Popularity List
- ./umbrella-top-1m.sh
# Download database dump and process it
- ./script.sh


@ -15,13 +15,13 @@ https://gitlab.com/curben/urlhaus/raw/master/urlhaus-filter.txt
Following URL categories are removed from the database dump:
- Offline URL
- Well-known host or false positives (see [exclude.txt](exclude.txt))
- Well-known host ([top-1m.txt](top-1m.txt)) or false positives ([exclude.txt](exclude.txt))
Database dump is saved as [URLhaus.csv](URLhaus.csv), processed by [script.sh](script.sh) and output as [urlhaus-filter.txt](urlhaus-filter.txt).
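To illustrate the processing step, the host extraction stages used in script.sh (`grep '"online"'` followed by three `cut` calls) can be traced on a single row. This is only a sketch: the row below is made up, and its column layout (id, date, URL, status, ...) is an assumption rather than something taken from the actual dump.

```shell
#!/bin/sh
# Hypothetical CSV row; the real dump's columns may differ.
row='"1","2018-10-10 06:00:00","http://malware.example:8080/payload.exe","online","malware_download"'

host=$(printf '%s\n' "$row" |
  grep '"online"' |    # keep rows whose status is online
  cut -f 6 -d '"' |    # 6th quote-delimited field is the URL
  cut -f 3 -d '/' |    # 3rd slash-delimited field is host[:port]
  cut -f 1 -d ':')     # drop the port, if any

echo "$host"   # prints malware.example
```

Splitting on `"` instead of `,` avoids breaking on commas inside quoted fields, which is presumably why the script takes that approach.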
## Note
Please report any false positive, especially if the domain is one of the Alexa 10M.
Please report any false positive.
This filter **only** accepts malware URLs from [URLhaus](https://urlhaus.abuse.ch/).
@ -34,7 +34,10 @@ This repo is not endorsed by Abuse.ch.
- Can you add this *very-bad-url.com* to the filter?
+ No, please report to the [upstream](https://urlhaus.abuse.ch/api/#submit).
- Why do you need to clone the repo again in your CI?
- Why don't you use the URLhaus "Plain-Text URL List"?
+ It doesn't show the status (online/offline) of a URL.
- Why do you need to clone the repo again in your CI? I thought CI already fetches the repo by default?
+ GitLab Runner clones/fetches the repo using the HTTPS method by default ([log](https://gitlab.com/curben/urlhaus/-/jobs/105979394)). This method requires a deploy *token*, which is *read-only* (cannot push).
+ Deploy *key* has write access but cannot be used with the HTTPS method, hence, the workaround to clone using SSH.
+ See issue [#20567](https://gitlab.com/gitlab-org/gitlab-ce/issues/20567) and [#20845](https://gitlab.com/gitlab-org/gitlab-ce/issues/20845).


@ -13,7 +13,7 @@ COMMENT="$FIRST_LINE\n$SECOND_LINE\n$THIRD_LINE\n$FOURTH_LINE\n$FIFTH_LINE"
# Download the database dump
wget https://urlhaus.abuse.ch/downloads/csv/ -O URLhaus.csv
# Parse domain name and IP address only
# Parse domains and IP addresses only
cat URLhaus.csv | \
grep '"online"' | \
cut -f 6 -d '"' | \
@ -21,6 +21,8 @@ cut -f 3 -d '/' | \
cut -f 1 -d ':' | \
# Sort and remove duplicates
sort -u | \
# Exclude Umbrella Top 1M (-F: fixed strings, -x: whole-line match only)
grep -Fxvf top-1m.txt | \
# Exclude false positive
grep -vf exclude.txt | \
# Append header comment to the filter list
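
How `grep -v -f` treats its pattern file matters for the exclusion steps above. Without `-F` and `-x`, every line of the pattern file is a regular expression matched anywhere in the input line, so a short entry can remove unrelated hosts. The sketch below contrasts the two behaviours. The file names and domains are made up, and it is an assumption (not verified against the real list) that a popularity list may contain short entries such as a bare `com`.

```shell
#!/bin/sh
# Sketch: substring/regex exclusion vs. exact-line exclusion.
# All file names and domains here are hypothetical.
tmpdir=$(mktemp -d)
printf '%s\n' 'com' 'example.com' > "$tmpdir/top.txt"
printf '%s\n' 'bad-site.com' 'example.com' 'evil.test' > "$tmpdir/hosts.txt"

# Patterns are regexes matched anywhere in the line:
# the entry "com" also removes bad-site.com.
loose=$(grep -vf "$tmpdir/top.txt" "$tmpdir/hosts.txt")

# -F treats patterns as fixed strings, -x requires a whole-line match:
# only example.com is removed.
exact=$(grep -Fxvf "$tmpdir/top.txt" "$tmpdir/hosts.txt")

echo "loose: $loose"
echo "exact: $exact"
rm -rf "$tmpdir"
```

Fixed-string matching also tends to be much faster than regex matching when the pattern file has on the order of a million lines.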

1000000
top-1m.txt Normal file

File diff suppressed because it is too large

12
umbrella-top-1m.sh Normal file

@ -0,0 +1,12 @@
#!/bin/sh
# Download the Cisco Umbrella 1 Million
# More info:
# https://s3-us-west-1.amazonaws.com/umbrella-static/index.html
# Download the list
wget -O- https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip | \
# Unzip
funzip | \
# Parse domains only
cut -f 2 -d ',' > top-1m.txt
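
The parsing stage of the script above can be tried without downloading anything. This is a minimal sketch on inline sample data: the `rank,domain` rows are made up, and the `tr -d '\r'` stage is a defensive addition that assumes the CSV might use CRLF line endings, which has not been checked against the real file.

```shell
#!/bin/sh
# Sketch: extract the domain column from rank,domain rows.
# Sample rows are hypothetical; CRLF endings are an assumption.
domains=$(printf '1,example.com\r\n2,example.net\r\n' |
  tr -d '\r' |          # strip carriage returns, if present
  cut -f 2 -d ',')      # keep the second comma-separated field

echo "$domains"
```

A stray carriage return at the end of each domain would silently stop the entries from matching host names in a later `grep -f` step, so stripping it is cheap insurance.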