feat: add OpenPhish data

- https://openphish.com/
This commit is contained in:
MDLeom 2020-07-12 08:16:27 +01:00
parent ea057e5b9e
commit e38bc68ada
No known key found for this signature in database
GPG Key ID: 5D9DB57A25D34EE3
2 changed files with 35 additions and 24 deletions

View File

@ -1,6 +1,6 @@
# Phishing URL Blocklist
A blocklist of phishing websites, based on the [PhishTank](https://www.phishtank.com/) list. Blocklist is updated twice a day.
A blocklist of phishing websites, based on the [PhishTank](https://www.phishtank.com/) and [OpenPhish](https://openphish.com/) lists. Blocklist is updated twice a day.
There are multiple formats available, refer to the appropriate section according to the program used:
@ -195,9 +195,9 @@ This blocklist operates by blocking the **whole** website, instead of specific w
If you wish to exclude certain website(s) that you believe is sufficiently well-known, please create an [issue](https://gitlab.com/curben/phishing-filter/issues) or [merge request](https://gitlab.com/curben/phishing-filter/merge_requests).
This blocklist **only** accepts new phishing URLs from [PhishTank](https://www.phishtank.com/).
This blocklist **only** accepts new phishing URLs from [PhishTank](https://www.phishtank.com/) and [OpenPhish](https://openphish.com/).
Please report new phishing URL to the upstream maintainer through https://www.phishtank.com/add_web_phish.php.
Please report new phishing URL to [PhishTank](https://www.phishtank.com/add_web_phish.php) or [OpenPhish](https://openphish.com/faq.html).
## Cloning
@ -211,14 +211,16 @@ Use shallow clone to get the recent revisions only. Getting the last five revisi
[Creative Commons Zero v1.0 Universal](LICENSE.md)
[csvquote](https://github.com/dbro/csvquote): [MIT License](https://choosealicense.com/licenses/mit/)
[PhishTank](https://www.phishtank.com/): [CC BY-SA 2.5](https://creativecommons.org/licenses/by-sa/2.5/)
_PhishTank is either trademark or registered trademark of OpenDNS, LLC._
[OpenPhish](https://openphish.com/): Available free of charge by OpenPhish
[Tranco List](https://tranco-list.eu/): MIT License
[Umbrella Popularity List](https://s3-us-west-1.amazonaws.com/umbrella-static/index.html): Available free of charge by Cisco Umbrella
[PhishTank](https://www.phishtank.com/): [CC BY-SA 2.5](https://creativecommons.org/licenses/by-sa/2.5/)
[csvquote](https://github.com/dbro/csvquote): [MIT License](https://choosealicense.com/licenses/mit/)
PhishTank is either trademark or registered trademark of OpenDNS, LLC.
This repository is not endorsed by PhishTank/OpenDNS.
This repository is not endorsed by PhishTank/OpenDNS and OpenPhish.

View File

@ -18,9 +18,9 @@ fi
mkdir -p "tmp/"
cd "tmp/"
## Prepare datasets
curl -L "https://data.phishtank.com/data/$PHISHTANK_API/online-valid.csv.bz2" -o "phishtank.bz2"
curl -L "https://openphish.com/feed.txt" -o "openphish-raw.txt"
curl -L "https://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip" -o "top-1m-umbrella.zip"
curl -L "https://tranco-list.eu/top-1m.csv.zip" -o "top-1m-tranco.zip"
@ -37,14 +37,23 @@ sed 's/"//g' | \
cut -f 3- -d "/" | \
# Domain must have at least a 'dot'
grep -F "." | \
sed "s/^www\.//g" | \
sort -u > "phishtank.txt"
sed "s/^www\.//g" > "phishtank.txt"
cat "openphish-raw.txt" | \
dos2unix | \
cut -f 3- -d "/" | \
grep -F "." | \
sed "s/^www\.//g" > "openphish.txt"
## Combine PhishTank and OpenPhish
cat "phishtank.txt" "openphish.txt" | \
sort -u > "phishing.txt"
## Parse domain and IP address only
cat "phishtank.txt" | \
cat "phishing.txt" | \
cut -f 1 -d "/" | \
cut -f 1 -d ":" | \
sort -u > "phishtank-domains.txt"
sort -u > "phishing-domains.txt"
cp "../src/exclude.txt" "."
@ -74,19 +83,19 @@ cat "top-1m-umbrella.txt" "top-1m-tranco.txt" "exclude.txt" | \
sort -u > "top-1m-well-known.txt"
## Parse popular domains from PhishTank
cat "phishtank-domains.txt" | \
## Parse popular domains
cat "phishing-domains.txt" | \
# grep match whole line
grep -Fx -f "top-1m-well-known.txt" > "phishtank-top-domains.txt"
grep -Fx -f "top-1m-well-known.txt" > "phishing-top-domains.txt"
## Parse domains from PhishTank excluding popular domains
cat "phishtank-domains.txt" | \
grep -F -vf "phishtank-top-domains.txt" > "phishing-domains.txt"
## Exclude popular domains
cat "phishing-domains.txt" | \
grep -F -vf "phishing-top-domains.txt" > "phishing-notop-domains.txt"
## Parse phishing URLs from popular domains
cat "phishtank.txt" | \
grep -F -f "phishtank-top-domains.txt" | \
cat "phishing.txt" | \
grep -F -f "phishing-top-domains.txt" | \
sed "s/^/||/g" | \
sed "s/$/\$all/g" > "phishing-url-top-domains.txt"
@ -103,7 +112,7 @@ COMMENT_UBO="$FIRST_LINE\n$SECOND_LINE\n$THIRD_LINE\n$FOURTH_LINE\n$FIFTH_LINE\n
# Compatibility with Adguard Home
# https://gitlab.com/curben/urlhaus-filter/-/issues/19
cat "phishing-domains.txt" | \
cat "phishing-notop-domains.txt" | \
sed "s/^/||/g" | \
sed "s/$/^/g" > "phishing-domains-adguard.txt"
@ -116,7 +125,7 @@ sed '1 i\'"$COMMENT_UBO"'' > "../dist/phishing-filter.txt"
# awk + head is a workaround for sed prepend
COMMENT=$(printf "$COMMENT_UBO" | sed "s/^!/#/g" | sed "1s/URL/Domains/" | awk '{printf "%s\\n", $0}' | head -c -2)
cat "phishing-domains.txt" | \
cat "phishing-notop-domains.txt" | \
sort | \
sed '1 i\'"$COMMENT"'' > "../dist/phishing-filter-domains.txt"
@ -161,7 +170,7 @@ sed "1s/Blocklist/Unbound Blocklist/" > "../dist/phishing-filter-unbound.conf"
## Clean up artifacts
rm "phishtank.csv" "top-1m-umbrella.zip" "top-1m-umbrella.txt" "top-1m-tranco.txt"
rm "phishtank.csv" "top-1m-umbrella.zip" "top-1m-umbrella.txt" "top-1m-tranco.txt" "openphish-raw.txt"
cd ../