Note: This codebase is very old. I wrote it back in 2016, so it may no longer run. The sources I extracted the data from may also no longer exist. You can use the idea or template to build your own implementation and extract data from a different source.
Malicious Web sites are a cornerstone of Internet criminal activities. These sites host a variety of unwanted content, such as spam-advertised products, phishing pages, and dangerous "drive-by" exploits that infect a visitor's system with malware. The most influential approaches to the malicious URL problem are manually constructed lists (blacklists) in which the URLs of all known malicious web pages are listed, and client-side systems that analyze the content or behavior of a Web site as it is visited.
The disadvantage of the blacklisting approach is that every URL has to be searched for in the list, which can be very large given the number of web sites on the Internet. The list also cannot be kept up to date, because new web links keep appearing every hour.
This system uses machine-learning techniques to classify a URL as either Safe or Unsafe in real time, without needing to download the web page.
The three main algorithms used in this system are:
The system currently works only on lexical features (simple text features of a URL), which include:
- Length of URL
- Domain Length
- Presence of IP Address in Host Name
- Presence of Security Sensitive Words in URL
and many more (around 22 in total). Host-based features, such as the country code in which the site is hosted, its creation date, and its update date, are yet to be added to the system. They should increase the accuracy of the classifier, but they also increase the latency of classifying a URL, because WHOIS servers have to be queried to obtain them. The PyWhois module has been used for these queries.
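The lexical features listed above can be sketched with the standard library alone. This is a minimal illustration and not the project's actual feature extractor: the function names, the dictionary keys, and the `SENSITIVE_WORDS` list are assumptions, and only a handful of the roughly 22 features are shown.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical word list; the list used by the original project is not shown here.
SENSITIVE_WORDS = ("login", "signin", "secure", "account", "update", "banking", "confirm")


def host_is_ip(host):
    """Return True if the host name is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute a few lexical features from the URL text, without fetching the page."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "host_is_ip": int(host_is_ip(host)),
        "sensitive_word_count": sum(word in lowered for word in SENSITIVE_WORDS),
        "dot_count": url.count("."),
    }
```

Calling `lexical_features("http://192.168.0.1/login.php")` flags the host as an IP address and counts one sensitive word. Because everything is derived from the URL string, no network request is needed, which is what keeps classification fast.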
This system collects its data from two sources:
Benign URLs of different types are collected from the DMOZ Open Directory.
Malicious URLs are collected from PhishTank.
This Python script extracts the list of URLs from a given DMOZ Open Directory page for a given category. Enter the URL of the DMOZ page and it will extract the listed links and write them to the corresponding CSV file.
This Python script iteratively extracts the list of phishing URLs from Phishtank.com and writes those links to the corresponding CSV file.
This file reads a set amount of data from the malicious dataset file and a set amount from the benign dataset file, then uses random shuffling to create the training dataset file.
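The read-and-shuffle step could look roughly like the sketch below. It assumes each dataset file holds one `url,label` row per line; the function names, parameters, and file paths are hypothetical and are not taken from the original script.

```python
import csv
import random


def read_rows(path, limit):
    """Read up to `limit` non-empty rows from a CSV file of (url, label) pairs."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]
    return rows[:limit]


def build_training_set(benign_path, malicious_path, out_path,
                       n_benign, n_malicious, seed=None):
    """Combine rows from both files, shuffle them, and write the result as CSV."""
    rows = read_rows(benign_path, n_benign) + read_rows(malicious_path, n_malicious)
    # A dedicated Random instance keeps the shuffle reproducible when a seed is given.
    random.Random(seed).shuffle(rows)
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

Shuffling matters because the two source files are each sorted by class; without it, the first part of the training file would contain only one label.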
This file contains the list of benign (i.e. non-malicious) URLs in a comma-separated file, each with label 0 marking it as non-spam. This data was collected from the DMOZ Open Directory.
This file contains the list of malicious URLs in a comma-separated file, each with label 1 marking it as spam. This data was collected from Phishtank.com.
File constructed by randomly shuffling URLs from both the malicious and the benign URL files.
Binary file containing the feature values computed on the training dataset URLs.
Python script that generates the following plots of the training dataset, to give insight into which features can be exploited to get better results from the algorithm.
The image shows the URL length distributions of both malicious and benign URLs.

The image shows the number-of-dots distributions of both malicious and benign URLs.
The image shows a scatter plot of total dots vs. total delimiters in the file name of a given URL.
The image shows the domain length distributions of both malicious and benign URLs.
The image shows the creation age distributions of both malicious and benign URLs.
Python script that extracts feature values from a given URL and returns them as a list.
Python script that produces the training dataset by running feature extraction and storing the result in a binary file named Training_Data.pkl, as described above.
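As a rough sketch of this step, the feature vectors can be serialized with the standard `pickle` module. The layout of the pickled object (a dictionary with `features` and `labels` keys) and the function names are assumptions made for illustration; the actual contents of Training_Data.pkl may be organized differently.

```python
import pickle


def save_training_data(rows, extract, out_path="Training_Data.pkl"):
    """Extract features for each (url, label) row and pickle them to `out_path`.

    `extract` is any function mapping a URL string to a list of feature values.
    """
    features = [extract(url) for url, _ in rows]
    labels = [int(label) for _, label in rows]
    with open(out_path, "wb") as handle:
        pickle.dump({"features": features, "labels": labels}, handle)


def load_training_data(path="Training_Data.pkl"):
    """Load the feature vectors and labels written by save_training_data."""
    with open(path, "rb") as handle:
        data = pickle.load(handle)
    return data["features"], data["labels"]
```

Storing the extracted features means the classifier scripts do not have to recompute them on every run. Note that pickle files should only be loaded from sources you trust.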
Python script that takes a URL as input and classifies it as Safe or Unsafe after training the algorithm on the training dataset values.
Python script that takes a URL as input and classifies it as Safe or Unsafe after training on the training dataset values, using a neural network classifier with one hidden layer and one output unit.
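To illustrate the shape of such a network, here is a minimal forward pass for one hidden layer and a single output unit in plain Python. It is a sketch only: the sigmoid activations, the 0.5 decision threshold, and all names are assumptions, training is not shown, and the original script may rely on a library instead.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_probability(features, hidden_weights, hidden_biases,
                        output_weights, output_bias):
    """Forward pass: features -> hidden layer -> one output unit in [0, 1]."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def classify(features, *params, threshold=0.5):
    """Map the output probability to the labels used in this README."""
    return "Unsafe" if predict_probability(features, *params) >= threshold else "Safe"
```

Here `hidden_weights` holds one weight list per hidden unit, each as long as the feature vector. In practice the features would usually be scaled first, since raw values such as URL length can saturate the sigmoid.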



