Merge branch 'full-url' into 'master'

feat: include full URL for popular domains

See merge request curben/urlhaus-filter!2
curben 2019-05-11 10:21:41 +00:00
commit 8a0a5e73f9
11 changed files with 128 additions and 73 deletions

.gitignore vendored Normal file

@@ -0,0 +1 @@
tmp/*

.gitlab-ci.yml

@@ -1,4 +1,4 @@
-image: alpine:latest # Use latest version of Alpine Linux docker image
+image: alpine:latest # Use the latest version of Alpine Linux docker image
before_script:
# Install dependencies
@@ -33,21 +33,10 @@ deploy:
# Change to the downloaded repo directory
- cd build/
-# Give execute permission to scripts
-- cd utils/
-- chmod 700 umbrella-top-1m.sh script.sh commit.sh
-# Download Umbrella Popularity List
-- ./umbrella-top-1m.sh
-# Download database dump and process it
-- ./script.sh
-# Commit the changes
-- ./commit.sh
+# Execute script.sh
+- sh utils/script.sh
# Push the commit
-- cd ../
- git push
only:
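With the helpers consolidated, the deploy job reduces to three shell steps. A local dry run of the same sequence might look like this (a sketch; it assumes the repo is already checked out into build/, and note that it downloads data and creates a real commit):

```sh
# Sketch of the simplified deploy steps, run outside CI.
cd build/              # enter the checked-out repo
sh utils/script.sh     # download dumps, rebuild the filter, commit
git push               # publish the generated commit
```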

README.md

@@ -16,16 +16,6 @@ Mirrors:
- https://glcdn.githack.com/curben/urlhaus-filter/raw/master/urlhaus-filter.txt
- https://cdn.staticaly.com/gl/curben/urlhaus-filter/raw/master/urlhaus-filter.txt
## Description
The following URL categories are removed from the database dump:
- Offline URLs
- Well-known domains from the [Umbrella Popularity List](https://s3-us-west-1.amazonaws.com/umbrella-static/index.html).
- False positives ([exclude.txt](src/exclude.txt))
The database dump is saved as [URLhaus.csv](src/URLhaus.csv), gets processed by [script.sh](utils/script.sh), and is output as [urlhaus-filter.txt](urlhaus-filter.txt).
## Compatibility
This filter is only tested with uBO. [FilterLists](https://filterlists.com/) shows it is compatible with the following software:
@@ -40,11 +30,13 @@ This filter is only tested with uBO. [FilterLists](https://filterlists.com/) shows
- [Samsung Knox](https://www.samsungknox.com/)
- [uMatrix](https://github.com/gorhill/uMatrix)
Note that some of the software above is host-based only, meaning it cannot block malware URLs hosted on well-known domains (e.g. amazonaws.com, docs.google.com, dropbox.com). For best compatibility, use uBO or its fork NanoAdblocker.
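As a rough illustration of that limitation (the entries below are invented, not taken from the filter): a host-based consumer can only act on whole hostnames, while uBO also accepts full-URL patterns — which is exactly what this MR adds for popular domains.

```sh
# Invented example entries -- not real filter output.
cat <<'EOF'
# Host-based software can only block the entire domain:
docs.google.com
# uBO can block a single malicious path on an otherwise-benign domain:
docs.google.com/uc?export=download&id=EXAMPLE
EOF
```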
## Issues
-Report any false positive by creating an [issue](https://gitlab.com/curben/urlhaus-filter/issues).
+Report any false positive by creating an [issue](https://gitlab.com/curben/urlhaus-filter/issues) or [merge request](https://gitlab.com/curben/urlhaus-filter/merge_requests)
-This filter **only** accepts malware URLs from the [URLhaus](https://urlhaus.abuse.ch/).
+This filter **only** accepts malware URLs from [URLhaus](https://urlhaus.abuse.ch/).
Please report new malware URLs to the upstream maintainer through https://urlhaus.abuse.ch/api/#submit.
@@ -54,7 +46,7 @@ This repo is not endorsed by abuse.ch.
Since the filter is updated frequently, cloning the repo becomes slower over time as the number of revisions grows.
-Use shallow clone to get the recent revisions only. Getting the last five revisions is sufficient for a valid MR.
+Use shallow clone to get the recent revisions only. Getting the last five revisions should be sufficient for a valid MR.
`git clone --depth 5 https://gitlab.com/curben/urlhaus-filter.git`

utils/commit.sh Executable file → Normal file

@@ -1,6 +1,9 @@
#!/bin/sh
-# Commit the filter update
+## Commit the filter update
+## GitLab CI does not permit shell variables in .gitlab-ci.yml.
+## This file is a workaround for that.
CURRENT_TIME="$(date -R -u)"
git commit -a -m "Filter updated: $CURRENT_TIME"
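A quick way to preview the message format without touching git (the timestamp below is illustrative):

```sh
# Print the commit message commit.sh would use; makes no commit.
echo "Filter updated: $(date -R -u)"
# -> Filter updated: Sat, 11 May 2019 10:21:41 +0000 (for example)
```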

utils/malware-domains.sh Normal file

@@ -0,0 +1,22 @@
#!/bin/sh
## Parse domains from URLhaus excluding popular domains
cat URLhaus.csv | \
# Convert DOS to Unix line ending
dos2unix | \
# Parse online URLs only
grep '"online"' | \
# Parse domains and IP addresses only
cut -f 6 -d '"' | \
cut -f 3 -d '/' | \
cut -f 1 -d ':' | \
# Remove www
# Only matches domains that start with www
# Not examplewww.com
sed -e 's/^www\.//g' | \
# Sort and remove duplicates
sort -u | \
# Exclude Umbrella Top 1M and well-known domains
# grep inverse match whole line
grep -Fx -vf top-1m-well-known.txt > malware-domains.txt
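To make the field arithmetic concrete, here is how a single made-up row (mimicking the dump's quoted-CSV layout; not a real URLhaus record) flows through the chain: splitting on double quotes leaves the URL in field 6, splitting on / isolates the host, and splitting on : drops any port.

```sh
# One invented row, shaped like the URLhaus CSV.
row='"123","2019-05-11 10:21:41","http://www.example.com:8080/bad.exe","online","malware_download","exe","https://urlhaus.abuse.ch/url/123/"'
printf '%s\n' "$row" |
  cut -f 6 -d '"' |       # -> http://www.example.com:8080/bad.exe
  cut -f 3 -d '/' |       # -> www.example.com:8080
  cut -f 1 -d ':' |       # -> www.example.com
  sed -e 's/^www\.//g'    # -> example.com
```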

utils/malware-url-top-domains.sh Normal file

@@ -0,0 +1,21 @@
#!/bin/sh
## Parse malware URLs from popular URLhaus domains
cat URLhaus.csv | \
# Convert DOS to Unix line ending
dos2unix | \
# Parse online URLs only
grep '"online"' | \
# Parse URLs
cut -f 6 -d '"' | \
cut -f 3- -d '/' | \
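# Keep the remainder intact, port included (':' fields 1- span the whole line)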
cut -f 1- -d ':' | \
# Remove www
# Only matches domains that start with www
# Not examplewww.com
sed -e 's/^www\.//g' | \
# Sort and remove duplicates
sort -u | \
# Include URLs from popular domains
grep -F -f urlhaus-top-domains.txt > malware-url-top-domains.txt
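Contrast with malware-domains.sh: here `cut -f 3-` keeps the path (and port), so popular domains contribute full URLs rather than blanket domain blocks. With the same invented row as above:

```sh
# Same invented row; the path survives the cuts this time.
row='"123","2019-05-11 10:21:41","http://www.example.com:8080/bad.exe","online","malware_download","exe","https://urlhaus.abuse.ch/url/123/"'
printf '%s\n' "$row" |
  cut -f 6 -d '"' |       # -> http://www.example.com:8080/bad.exe
  cut -f 3- -d '/' |      # -> www.example.com:8080/bad.exe
  cut -f 1- -d ':' |      # no-op: fields 1- select the whole line
  sed -e 's/^www\.//g'    # -> example.com:8080/bad.exe
```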

utils/prerequisites.sh Normal file

@@ -0,0 +1,10 @@
#!/bin/sh
# Download URLhaus database
wget https://urlhaus.abuse.ch/downloads/csv/ -O ../src/URLhaus.csv
# Download Cisco Umbrella 1 Million
wget https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip -O top-1m.csv.zip
cp ../src/URLhaus.csv .
cp ../src/exclude.txt .

utils/script.sh Executable file → Normal file

@@ -1,40 +1,15 @@
#!/bin/sh
# Download the URLhaus database dump and process it to be uBO-compatible
+mkdir tmp/
+cd tmp/
-CURRENT_TIME="$(date -R -u)"
-FIRST_LINE="! Title: abuse.ch URLhaus Malicious URL Blocklist"
-SECOND_LINE="! Updated: $CURRENT_TIME"
-THIRD_LINE="! Expires: 1 day (update frequency)"
-FOURTH_LINE="! Repo: https://gitlab.com/curben/urlhaus-filter"
-FIFTH_LINE="! License: https://creativecommons.org/publicdomain/zero/1.0/"
-SIXTH_LINE="! Source: https://urlhaus.abuse.ch/api/"
-COMMENT="$FIRST_LINE\n$SECOND_LINE\n$THIRD_LINE\n$FOURTH_LINE\n$FIFTH_LINE\n$SIXTH_LINE"
+sh ../utils/prerequisites.sh
+sh ../utils/umbrella-top-1m.sh
+sh ../utils/malware-domains.sh
+sh ../utils/urlhaus-top-domains.sh
+sh ../utils/malware-url-top-domains.sh
+sh ../utils/urlhaus-filter.sh
+sh ../utils/commit.sh
-# Download the database dump
-wget https://urlhaus.abuse.ch/downloads/csv/ -O ../src/URLhaus.csv
-cat ../src/URLhaus.csv | \
-# Convert DOS to Unix line ending
-dos2unix | \
-# Parse online URLs only
-grep '"online"' | \
-# Parse domains and IP address only
-cut -f 6 -d '"' | \
-cut -f 3 -d '/' | \
-cut -f 1 -d ':' | \
-# Remove www
-# Only matches domains that start with www
-# Not examplewww.com
-sed -e 's/^www\.//g' | \
-# Sort and remove duplicates
-sort -u | \
-# Exclude Umbrella Top 1M. grep inverse match whole line
-grep -Fx -vf ../src/top-1m.txt | \
-# Exclude false positive
-grep -Fx -vf ../src/exclude.txt | \
-# Append header comment to the filter list
-sed '1 i\'"$COMMENT"'' > ../urlhaus-filter.txt
-# Remove downloaded dataset
-rm ../src/top-1m.txt
+cd ../
+rm -r tmp/
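script.sh is now a thin orchestrator: it stages everything in tmp/ and calls the single-purpose helpers in order. Assuming the relative paths above hold, a local run would be started from the repository root — and, like CI, it downloads the dumps and creates a real commit:

```sh
# Rebuild urlhaus-filter.txt locally, from the repo root.
sh utils/script.sh
```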

utils/umbrella-top-1m.sh Executable file → Normal file

@@ -1,11 +1,8 @@
#!/bin/sh
-# Download the Cisco Umbrella 1 Million
-# More info:
-# https://s3-us-west-1.amazonaws.com/umbrella-static/index.html
-# Download the list
-wget https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip -O top-1m.csv.zip
+## Parse the Cisco Umbrella 1 Million
+## More info:
+## https://s3-us-west-1.amazonaws.com/umbrella-static/index.html
# Decompress the zip and write output to stdout
unzip -p top-1m.csv.zip | \
@@ -13,12 +10,15 @@ unzip -p top-1m.csv.zip | \
dos2unix | \
# Parse domains only
cut -f 2 -d ',' | \
+# Domain must have at least a 'dot'
+grep -F '.' | \
# Remove www
# Only matches domains that start with www
# Not examplewww.com
sed -e 's/^www\.//g' | \
# Remove duplicates
-sort -u > ../src/top-1m.txt
+sort -u > top-1m.txt
# Remove downloaded zip file
rm top-1m.csv.zip
+# Merge Umbrella and self-maintained top domains
+cat top-1m.txt exclude.txt | \
+sort -u > top-1m-well-known.txt
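The closing merge is a plain whole-line union; on toy input (contents invented) it behaves like this:

```sh
# Toy illustration of the union built above.
printf 'a.example\nb.example\n' > top-1m.txt
printf 'b.example\nc.example\n' > exclude.txt
cat top-1m.txt exclude.txt | sort -u > top-1m-well-known.txt
cat top-1m-well-known.txt   # -> a.example, b.example, c.example
```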

utils/urlhaus-filter.sh Normal file

@@ -0,0 +1,20 @@
#!/bin/sh
## Merge malware-domains.txt with malware-url-top-domains.txt,
## and append a header to instruct uBO to grab the filter daily.
CURRENT_TIME="$(date -R -u)"
FIRST_LINE="! Title: abuse.ch URLhaus Malicious URL Blocklist"
SECOND_LINE="! Updated: $CURRENT_TIME"
THIRD_LINE="! Expires: 1 day (update frequency)"
FOURTH_LINE="! Repo: https://gitlab.com/curben/urlhaus-filter"
FIFTH_LINE="! License: https://creativecommons.org/publicdomain/zero/1.0/"
SIXTH_LINE="! Source: https://urlhaus.abuse.ch/api/"
COMMENT="$FIRST_LINE\n$SECOND_LINE\n$THIRD_LINE\n$FOURTH_LINE\n$FIFTH_LINE\n$SIXTH_LINE"
cat malware-domains.txt malware-url-top-domains.txt | \
# Sort alphabetically
sort | \
# Append header comment to the filter list
sed '1 i\'"$COMMENT"'' > ../urlhaus-filter.txt
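Assuming the BusyBox/GNU sed shipped in the Alpine CI image (where `i\` accepts `\n` as a line break), the first six lines of the output should match the COMMENT built above; a quick check:

```sh
# Inspect the generated header (timestamp will differ per run):
head -n 6 ../urlhaus-filter.txt
# ! Title: abuse.ch URLhaus Malicious URL Blocklist
# ! Updated: Sat, 11 May 2019 10:21:41 +0000
# ! Expires: 1 day (update frequency)
# ...
```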

utils/urlhaus-top-domains.sh Normal file

@@ -0,0 +1,22 @@
#!/bin/sh
## Parse popular domains from URLhaus
cat URLhaus.csv | \
# Convert DOS to Unix line ending
dos2unix | \
# Parse online URLs only
grep '"online"' | \
# Parse domains and IP addresses only
cut -f 6 -d '"' | \
cut -f 3 -d '/' | \
cut -f 1 -d ':' | \
# Remove www
# Only matches domains that start with www
# Not examplewww.com
sed -e 's/^www\.//g' | \
# Sort and remove duplicates
sort -u | \
# Keep only domains that are in the Umbrella Top 1M / well-known list
# grep match whole line
grep -Fx -f top-1m-well-known.txt > urlhaus-top-domains.txt
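The only difference from malware-domains.sh is the dropped -v: `grep -Fx -f` keeps exact whole-line matches instead of discarding them. A toy contrast (file contents invented):

```sh
# -F fixed strings, -x whole-line match; -v inverts the selection.
printf 'a.example\nb.example\n' > domains.txt
printf 'a.example\n' > popular.txt
grep -Fx -f popular.txt domains.txt    # -> a.example (popular: keep full URLs)
grep -Fx -vf popular.txt domains.txt   # -> b.example (unpopular: block whole domain)
```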