
MELANOMA SKIN LESION CLASSIFICATION

Industry oriented mini project(CS705PC)


Submitted
in partial fulfilment of the requirements for
the award of the degree of
Bachelor of Technology
In
Computer Science and Engineering
by
Ms. Pogula RajyaLakshmi Sobha Pavani
H. T. No: 16261A0542
Under the Guidance of
Ms.D.S.Bhavani
(Assistant Professor)

Department of Computer Science and Engineering


MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Affiliated to JNTU,Hyderabad; Accredited by NBA, AICTE-New Delhi)
Kokapet(vill), Rajendra Nagar (Mandal), Ranga Reddy(Dist.),
Chaitanya Bharathi P.O., Hyderabad-500 075, India.

NOVEMBER 2019
MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Established in 1997 by Chaitanya Bharathi Educational Society)
(Affiliated to JNTU, Hyderabad; Accredited by NBA, AICTE-New Delhi)
Kokapet (village & gram panchayat), Rajendra Nagar (Mandal), Ranga Reddy(Dist.),
Chaitanya Bharathi P.O., Hyderabad-500 075.

CERTIFICATE
This is to certify that the industry oriented mini project(CS705PC) entitled “Melanoma
Skin Lesion Classification”, being submitted by Ms. Pogula RajyaLakshmi Sobha Pavani
bearing Roll No: 16261A0542 in partial fulfilment for the award of B. Tech in Computer
Science and Engineering to Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad is
a record of bonafide work carried out by her under our guidance and supervision.

The results embodied in this project have not been submitted to any other University or
Institute for the award of any degree or diploma.

Project Guide: Ms. D. S. Bhavani, Assistant Professor, Dept. of CSE, MGIT, Hyderabad

Head of the Department: Dr. C. R. K. Reddy, Professor, Dept. of CSE, MGIT, Hyderabad

EXTERNAL EXAMINER

DECLARATION

I hereby declare that the work presented in this project, titled “MELANOMA SKIN LESION
CLASSIFICATION”, which is being submitted by me in partial fulfilment for the award of
B. Tech in the Department of Computer Science and Engineering to Mahatma Gandhi Institute
of Technology, Gandipet, Hyderabad (T.S) - 500075, is a result of investigations carried out
by me under the guidance of Ms.D.S.Bhavani, Assistant Professor of Computer Science and
Engineering, MGIT, Gandipet, Hyderabad.

The work is original and has not been submitted for any Degree/Diploma of this or any other
university.

Place: Hyderabad
Date: 08-11-2019

Name of the Candidate: P.RajyaLakshmi Sobha Pavani
Roll No: 16261A0542
ACKNOWLEDGEMENT

I would like to express my sincere thanks to Dr. K. Jaya Sankar, Principal, MGIT, for
providing the working facilities in college.

I wish to express my sincere thanks and gratitude to Dr. C. R. K. Reddy, Professor and HOD,
Department of CSE, MGIT, for all the timely support and valuable suggestions during the
period of project.

I am extremely thankful to Dr. C. R. K. Reddy, Professor, and Ms. M. Sreevani, Associate
Professor, Department of CSE, MGIT, mini project coordinators, for their encouragement and
support throughout the project.

I am extremely thankful and indebted to my internal guide Ms. D. S. Bhavani, Assistant
Professor, Department of CSE, for her constant guidance, encouragement and moral support
throughout the project.

Finally, I would also like to thank all the faculty and staff of CSE Department who helped me
directly or indirectly, for completing this project.

P.RajyaLakshmi Sobha Pavani

H.T.No: 16261A0542

TABLE OF CONTENTS
Certificate
Declaration
Acknowledgement
List of Figures
List of Tables
Abstract

1. Introduction
1.1 Problem Statement
1.2 Existing System
1.3 Proposed System
1.4 System Requirements
1.4.1 Hardware Requirements
1.4.2 Software Requirements

2. Literature Survey

3. Design Methodology
3.1 System Architecture
3.2 Modules
3.2.1 Data Acquisition
3.2.2 Pre-process Data
3.2.3 Data Analysis Module
3.2.4 Performance Analysis Module
3.2.5 Data Visualizations
3.3 UML Diagrams
3.3.1 Use Case Diagram
3.3.2 Sequence Diagram
3.3.3 Activity Diagram

4. Testing and Results
4.1 Testing
4.2 Model Accuracy
4.3 Test Cases and Results

5. Conclusion and Future Scope
5.1 Conclusion
5.2 Future Scope

Bibliography

Appendix
LIST OF FIGURES

Fig.3.1 System architecture for classifying Melanoma
Fig.3.2 Sample of Dataset
Fig.3.3 Working of Random Forest Algorithm
Fig.3.4 Random forest Algorithm
Fig.3.5 First 5 fields in Dataset
Fig.3.6 Plot on the localisation of disease
Fig.3.7 Sample images of disease
Fig.3.8 Sample images of Melanoma
Fig.3.9 Use Case diagram of Developer
Fig.3.10 Sequence Diagram for Melanoma classification
Fig.3.11 Activity diagram for melanoma classification
Fig.4.1 Black Box Testing
Fig.4.2 Black Box Testing for Machine Learning Algorithms
Fig.4.3 Specifications of Random forest classifier
Fig.4.4 Accuracy of Random forest classifier
Fig.4.5 Confusion Matrix
Fig.4.6 Accuracy using confusion matrix
Fig.4.7 Actinic keratoses and intraepithelial carcinoma
Fig.4.8 Melanoma
Fig.4.9 Benign Keratosis
Fig.4.10 Melanocytic Nevus
LIST OF TABLES

Table.2.1 Literature Survey
Table.4.1 Black Box testing
Table.4.2 Positive Test Cases
Table.4.3 Negative Test Cases
ABSTRACT

Machine learning has been successfully applied to real-world problems, although its use in
this setting is relatively new; it is intended for the identification and extraction of new
and potentially valuable knowledge from data. Disease identification and diagnosis of
ailments is at the forefront of ML research in medicine.

The aim of this project is to develop a model which can predict melanoma cancer. Melanoma
is a type of skin cancer which is not very common but can be fatal if not detected in its
early stages. In this project a classifier is built which classifies whether a given skin
lesion is malignant or benign using supervised algorithms (such as Random Forest), and the
performance of the learning tools is evaluated based on their predictive accuracy.

Index terms: Skin Cancer, Melanoma, Random forest, Benign

1. INTRODUCTION

Melanoma is the most dangerous type of skin cancer. Melanoma, also known as malignant
melanoma, is a type of cancer that develops from the pigment-containing cells known as
melanocytes. Melanoma can occur on any skin surface. In men, it is often found on the skin on
the head, on the neck, or between the shoulders and the hips. In women, it is often found on the
skin on the lower legs or between the shoulders and the hips. Melanoma is rare in people with
dark skin. When it does develop in people with dark skin, it is usually found under the
fingernails, under the toenails, on the palms of the hands, or on the soles of the feet. Ultraviolet
(UV) rays are clearly a major cause of melanoma. UV rays can damage the DNA in skin cells.
Sometimes this damage affects certain genes that control how skin cells grow and divide. If
these genes no longer work properly, the affected cells may form a cancer. Sunlight is the main
source of UV rays. The nature of the UV exposure may play a vital role in melanoma development.

Skin cancer is the most commonly diagnosed cancer in the United States, yet most cases are
preventable. Every year in the United States, nearly 5 million people are treated for skin cancer,
at an estimated cost of $8.1 billion. Melanoma, the deadliest form of skin cancer, causes more
than 9,000 deaths each year. Unlike many other cancers, skin cancer rates have continued to
rise in recent years. Here are the American Cancer Society’s estimates for melanoma in the
United States for 2015:

1. About 73,870 new melanomas will be diagnosed (about 42,670 in men and 31,200
in women).
2. About 9,940 people are expected to die of melanoma (about 6,640 men and 3,300
women).

Although survival rate is increasing, death rate from malignant melanoma is exponentially
increasing. Early diagnosis is crucial for the treatment, because malignant melanoma is very
invasive when it affects melanocyte.

1.1 Problem Statement

There is a need for automated systems for recognising human skin cancer so that improper
diagnosis and treatment can be minimized. Therefore, this work develops a model for human
skin cancer recognition which is consistent, efficient and cost effective by exploring machine
learning techniques. The ultimate goal is to ease the doctor's role in the recognition of skin
cancer by providing better and more reliable results, so that more patients can be diagnosed.

1.2 Existing System


Several online melanoma risk calculators have been developed as web-based versions of the
tools described in published studies, including those from the US National Cancer Institute
(NCI) and the Victorian Melanoma Service. Others have been developed using established
methods for risk prediction and have published their underlying effect estimates online, while
the provenance of other online tools is unknown. All of these systems were used to estimate
the risk of melanoma in upcoming years from the data of previous years. The systems were
trained with data from patients who had been diagnosed with melanoma in a
particular area. The trained systems predict the probability of occurrence of
melanoma in the upcoming 5 years only for patients of the same age, name, gender and symptoms.
They are useful for predicting the risk only at the final stage of melanoma.

Disadvantages

1. For new data the predictive values are not accurate.
2. Can be used for prediction of melanoma risk only for the population of a particular locality.
3. Cannot predict melanoma at early stages.

1.3 Proposed System

In this project the system is trained on image data of melanoma collected at different
stages. The trained system predicts melanoma with good accuracy. It helps in early
diagnosis of the disease and classifies the given images as malignant or benign.

Advantages

1. Can be used for predicting melanoma at any stage.
2. Prediction is accurate.
1.4 System Requirements

1.4.1 Hardware Requirements

1. System: Pentium Dual core
2. Hard Disk: 120GB
3. RAM: 1GB
4. Input Devices: Keyboard, Mouse

1.4.2 Software Requirements

1. Operating System: Windows 7 or 10
2. Tool: Anaconda (Jupyter)
3. Software: Python 3.5
2. LITERATURE SURVEY

F. Nachbar, W. Stolz, T. Merkle, A. B Cognetta, T. Vogt, M. Landthaler, P. Bilek, O. B.-Falco,
and G. Plewig, “The abcd rule of dermatoscopy: high prospective value in the
diagnosis of doubtful melanocytic skin lesions” (1994)

In this study, 194 skin lesions were collected and evaluated according to the ABCD rule:
A (asymmetry), B (abrupt cut-off of the pigment at the border), C (different colors) and
D (structural components) were assessed to yield a semiquantitative score. This method is very
expensive and is used only for selected lesion samples.

E. Zagrouba and W. Barhoumi, “A preliminary approach for the automated recognition
of malignant melanoma” (2004)

The authors proposed an efficient system for prescreening pigmented skin lesions for
malignancy using general-purpose digital cameras. The images can be captured by a
smartphone or a digital camera. This could be beneficial in different applications, such as
computer-aided diagnosis and telemedicine. It could assist dermatologists, or
smartphone users, in evaluating the risk of suspicious moles. The proposed method enhances borders
and extracts a broad set of dermatologically important features. These discriminative features
allow classification of lesions into two groups, melanoma and benign. The algorithm used to
classify the images is ANN (Artificial Neural Networks).

M. Silveira, “Comparison of segmentation methods for melanoma diagnosis in
dermoscopy images” (2009)

In this paper, the authors proposed and evaluated six methods for the segmentation of skin
lesions in dermoscopic images. This set includes some state-of-the-art techniques which have
been successfully used in many medical imaging problems [gradient vector flow (GVF) and the
level set method of Chan et al. (C-LS)]. It also includes a set of methods developed by the
authors which were tailored to this particular application [adaptive thresholding (AT), adaptive
snake (AS), EM level set (EM-LS), and fuzzy-based split-and-merge algorithm (FBSM)]. The
segmentation methods were applied to 100 dermoscopic images and evaluated with four
different metrics, using the segmentation result obtained by an experienced dermatologist as the
ground truth. The best results were obtained by the AS and EM-LS methods.
T. Wadhawan, N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis, “SkinScan: A portable
library for melanoma detection on handheld Devices”(2011)

In this study the authors explored the feasibility of running a sophisticated application for
automated skin cancer detection on an Apple iPhone 4. Their results demonstrate that the
proposed library, with its advanced image processing and analysis algorithms, performs well on
handheld and desktop computers. Therefore, deployment of smartphones as screening devices
for skin cancer and other skin diseases can have a significant impact on health care delivery in
underserved and remote areas. This method consists of image segmentation, feature extraction
and image classification.

P. G. Cavalcant, J. Scharcanski and G.V. Baranoski, “A two-stage approach for
discriminating melanocytic skin lesions using standard cameras” (2013)

In the first stage, the collected images are pre-processed and evaluated using the ABCD rule,
and a classification algorithm (KNN) is used to classify each lesion as benign or malignant.
In the second stage the classification sensitivity is enhanced: all the images that have been
classified as benign are re-classified using melanin variation features extracted from inside
the lesion. This method is not suitable for early stage detection.

I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman and N. Petkov, “MED-NODE: A
computer-assisted melanoma diagnosis system using non-dermoscopic images” (2015)

The authors proposed a decision support system called MED-NODE, which makes use of
non-dermatoscopic digital images from which it extracts lesion regions based on color and
texture. The classification is based on majority voting. Prediction is difficult if the
image contains noise.

Arati P. Chavan, D. K. Kamat, Dr. P. M. Patil, “Classification Of Skin Cancers Using
Image Processing” (2016)

In this paper, the authors presented a computer-aided method for the detection of melanoma
skin cancer using image processing tools. The input to the system is the skin lesion image;
by applying novel image processing techniques, the system analyses it to conclude whether
skin cancer is present. The lesion image analysis tools check for the various melanoma
parameters, such as Asymmetry, Border, Colour and Diameter (ABCD), by texture, size and shape
analysis in the image segmentation and feature stages. The extracted feature parameters are used
to classify the image as normal skin or melanoma cancer lesion.

S. Amutha, Mrs. R. Manjula Devi, “Early detection of malignant melanoma using
random forest algorithm” (2016)

This paper introduces a skin lesion analysis system for the early detection of melanoma. The
system focuses on an automated image analysis module which
comprises the steps of image acquisition, hair detection and exclusion, lesion segmentation,
feature extraction and classification. The random forest algorithm is proposed for the lesion
classification process, which enhances the efficiency of the classification results. The random
forest package optionally produces information such as a measure of the importance of the
predictor variables and a measure of the internal structure of the data. The experimental
results show that the proposed system is efficient, achieving classification of the benign,
atypical and melanoma images with the accuracy

Poornima MS, Dr. Shailaja K, “Detection of Skin Cancer Using SVM” (2017)

In this paper, the authors present a method for the detection of melanoma skin cancer using
image processing tools. The input to the system is the skin lesion image; by applying image
processing techniques, the system analyses it to conclude whether skin cancer is present. The
lesion image analysis tools check for the various melanoma parameters, such as color, area,
perimeter and diameter, by texture, size and shape analysis in the image segmentation and
feature stages. The extracted feature parameters are used to classify the image as non-melanoma
or melanoma cancer lesion using SVM (Support Vector Machines). The kernel function used is the
Radial Basis Function (RBF).
Table.2.1: Literature Survey

| S.No | Title | Authors | Algorithms | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | “The abcd rule of dermatoscopy: high prospective value in the diagnosis of doubtful melanocytic skin lesions” | F. Nachbar, W. Stolz, T. Merkle, A. B Cognetta, T. Vogt, M. Landthaler, P. Bilek, O. B.-Falco, and G. Plewig [1] | ABCD rule: A-Asymmetry, B-abrupt cut-off of pigment at border, C-color, D-structural components | 1. The lesions are classified based on abcd results. 2. The sensitivity accuracy of detection is around 70% | 1. The method is expensive 2. This method is applicable only for selected skin lesion samples 3. The classification done is not accurate |
| 2 | “A two-stage approach for discriminating melanocytic skin lesions using standard cameras” | P. G. Cavalcant, J. Scharcanski and G.V. Baranoski [5] | ABCD rule and KNN classification algorithm | 1. It could classify skin lesion to malignant and benign 2. Classification sensitivity is enhanced | 1. This is not suitable for early stage detection of melanoma. 2. This method requires normalisation of pixels of image which is cost inefficient. |
| 3 | “MED-NODE: A computer-assisted melanoma diagnosis system using non-dermoscopic images” | I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman and N. Petkov [6] | Classifies lesion based on color and texture | 1. It uses decision support system 2. The accuracy obtained is 81% | 1. The final classification is based on previous results. 2. The prediction is not accurate if the images contain noise. |
| 4 | “Worlds first svm-based image analysis iphone app for melanoma risk assessment, melapp” | I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman and N. Petkov [10] | SVM | 1. It also uses decision support system 2. The app created has an accuracy of 80% | 1. This is also based on previous predictions 2. The app is designed only for iphone. |
| 5 | “A preliminary approach for the automated recognition of malignant melanoma” | E. Zagrouba and W. Barhoumi [2] | Artificial Neural Networks | 1. This is used to predict the risk of melanoma 2. The accuracy of preliminary training set is 79.1% | 1. The prediction is not accurate for new data values. 2. The predictions are useful for a particular locality. 3. Cannot predict at early stages. |
| 6 | “SkinScan: A portable library for melanoma detection on handheld Devices” | T. Wadhawan, N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis [4] | Image Processing | 1. Predicts melanoma 2. The maintenance cost is low | 1. The embedded processors have limited speed, memory. 2. Works only in Apple iphone 4 |
| 7 | “Comparison of segmentation methods for melanoma diagnosis in dermoscopy images” | M. Silveira [3] | Segmentation Methods (Adaptive Thresholding, Gradient Vector Flow) | 1. The predicted ratio was good | 1. The prediction was done based on the images of a particular hospital data. 2. The error rate is around 20% |
| 8 | “Classification Of Skin Cancers Using Image Processing” | Arati P. Chavan, D. K. Kamat, Dr. P. M. Patil [7] | Image Processing and ABCD rules | 1. Detects different types of skin diseases | 1. Cannot detect the disease at early stage 2. The system is not accurate |
| 9 | Detection of Skin Cancer Using SVM | Poornima MS, Dr. Shailaja K [9] | SVM | 1. It predicts the melanoma using images | 1. It cannot predict the melanoma at early stage |
| 10 | Early detection of malignant melanoma using random forest algorithm | S. Amutha, Mrs. R. Manjula Devi [8] | Random Forest Algorithm | 1. It detects the risk of melanoma at early stage | 1. The system is not efficient for some images 2. Cannot predict melanoma accurately at early stage |
3. DESIGN METHODOLOGY

3.1 System Architecture

The model takes images and their associated features as input. It first pre-processes the
input data to remove inconsistent values and noisy data. The pre-processed data is then split
into train and test sets. A random forest classifier is built and trained on the train set.
The trained model is evaluated on the test set and its accuracy is measured. When a new
image is given as input to the trained model, it classifies the image as either melanoma or
non-melanoma.

Fig.3.1:System architecture for classifying melanoma

Fig.3.1 shows the architecture of the model that is built to classify skin lesions as
melanoma or non-melanoma.

3.2 Modules
The model implementation consists of five modules:

1. Data Acquisition
2. Pre-processing Data
3. Data Analysis Module
4. Performance Analysis Module
5. Data Visualization
3.2.1 Data Acquisition
Data acquisition is the process of importing raw data sets into your analytical platform. It can
be acquired from traditional databases (SQL and query browsers), remote data (web services),
text files (scripting languages), NoSQL storage (web services, programming interfaces), etc.
Data acquisition involves the identification of data sets, retrieval of data, query of data from
the dataset.

The melanoma dataset is extracted from “The HAM10000 dataset” hosted by Harvard
Dataverse. The dataset consists of dermatoscopic images collected from different populations,
acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic
images which can serve as a training set for academic machine learning purposes.

Fig.3.2:Sample of Dataset(metadata.csv)

Fig.3.2 represents the features present in the dataset. The features present are:

1. lesion_id: Specifies the sample number
2. image_id: It consists of the ISIC image number of a particular skin lesion
3. dx: It specifies the type of disease
4. dx_type: It specifies whether the diagnosis is confirmed through
histopathology (histo), follow-up examination (follow_up), expert consensus
(consensus), or in-vivo confocal microscopy (confocal)
5. age: It specifies the age of the affected person
6. sex: It specifies the gender of the affected person
7. localization: It specifies the body site where the disease occurs
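
As a minimal sketch of this acquisition step, the snippet below parses a metadata file with the seven columns listed above. It uses only the Python standard library so that it runs anywhere; the inline rows are made-up placeholders that mirror the column layout and are not taken from the real dataset, and the name load_metadata is hypothetical.

```python
import csv
import io
from collections import Counter

# Illustrative rows only: these values are invented to mirror the column
# layout described above; they are NOT real HAM10000 records.
SAMPLE_CSV = """lesion_id,image_id,dx,dx_type,age,sex,localization
HAM_0000001,ISIC_0000001,bkl,histo,80,male,scalp
HAM_0000001,ISIC_0000002,bkl,histo,80,male,scalp
HAM_0000002,ISIC_0000003,mel,histo,60,female,back
HAM_0000003,ISIC_0000004,nv,follow_up,45,female,trunk
"""

def load_metadata(text):
    """Parse CSV text into a list of dicts, one per image."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_metadata(SAMPLE_CSV)
columns = list(rows[0].keys())                   # feature names from the header
dx_counts = Counter(row["dx"] for row in rows)   # images per diagnosis code

print(columns)
print(dx_counts)
```

With a real file, the same function could be fed the text read from metadata.csv; the path and encoding would depend on where the dataset was downloaded.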

3.2.2 Pre-Process Data


Pre-processing of data involves two steps:

Cleaning Data: Data cleaning involves the removal of inconsistent values, duplicate records,
missing values, invalid data and outliers.

Data Munging / Data Wrangling: Data wrangling techniques involve scaling,
transformation, feature selection, dimensionality reduction and data manipulation. Scaling is
performed over the dataset to prevent features with large values from dominating
the results. Transformation reduces the noise and variability present in the
dataset. Features are handpicked so that redundant or irrelevant features are removed from
the dataset. Dimensionality reduction helps to eliminate irrelevant features and
makes analysis simpler.
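
The cleaning and scaling steps can be illustrated with a small standard-library sketch. The records are hypothetical, and the specific choices (keeping the first of each duplicate, filling a missing age with the median, min-max scaling to [0, 1]) are assumptions made for illustration; the report does not state which strategies were actually used.

```python
from statistics import median

# Hypothetical records; None marks a missing age.
records = [
    {"image_id": "img_1", "age": 80.0, "sex": "male"},
    {"image_id": "img_1", "age": 80.0, "sex": "male"},    # duplicate record
    {"image_id": "img_2", "age": None, "sex": "female"},  # missing value
    {"image_id": "img_3", "age": 40.0, "sex": "female"},
    {"image_id": "img_4", "age": 60.0, "sex": "male"},
]

def drop_duplicates(rows, key):
    """Keep the first row seen for each value of `key`."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique

def fill_missing(rows, field):
    """Replace missing values of `field` with the median of the known ones."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill if r[field] is None else r[field]}) for r in rows]

def min_max_scale(rows, field):
    """Rescale `field` to [0, 1] so large values do not dominate."""
    values = [r[field] for r in rows]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # guard against a constant column
    return [dict(r, **{field: (r[field] - lo) / span}) for r in rows]

clean = drop_duplicates(records, "image_id")
clean = fill_missing(clean, "age")
clean = min_max_scale(clean, "age")
print([r["age"] for r in clean])
```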

3.2.3 Data Analysis Module

The data analysis module comprises feature selection, model selection, creation of insights
and analysis of results. A random forest algorithm has been utilized to analyse the dataset in
order to predict melanoma.

Random forest is a supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning in which different types of algorithms, or the same
algorithm multiple times, are combined to form a more powerful prediction model. The random
forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees,
resulting in a forest of trees, hence the name "Random Forest". The random forest algorithm
can be used for both regression and classification tasks.

How does the algorithm work?

Fig.3.3: Working of Random Forest Algorithm

Fig.3.3 gives the overview of how the random forest works. It works in four steps:
Input: Dataset, a new test sample.

Algorithm:

1. Select random samples from a given dataset. In the below Fig.3.4, the dataset is
divided into 3 random samples.

2. Construct a decision tree for each sample and get a prediction result from each
decision tree.

3. Perform a vote for each predicted result. A new sample is passed to each decision tree
and the result is stored.

4. Select the prediction result with the most votes as the final prediction.

Output: The predicted class of the test sample.

Pseudocode for Random forest

Fig.3.4:Random Forest Algorithm

Fig.3.4 represents the splitting of dataset into random sets for finding the best root node for
splitting.

The pseudocode for the random forest algorithm can be split into two stages.

1. Random forest creation pseudocode.
2. Pseudocode to perform prediction from the created random forest classifier.

First, let’s begin with random forest creation pseudocode

1. Random Forest pseudocode:

1. Randomly select “k” features from total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat 1 to 3 steps until “l” number of nodes has been reached.
5. Build forest by repeating steps 1 to 4 for “n” number times to create “n” number of
trees.

The random forest algorithm starts by randomly selecting “k” features out of the
total “m” features. In Fig.3.4, you can observe that features and observations are taken
at random.

In the next stage, the randomly selected “k” features are used to find the root node by
using the best split approach.

In the stage after that, the daughter nodes are calculated using the same best split approach.
Finally, stages 1 to 4 are repeated to create “n” randomly created trees. These randomly created
trees form the random forest.
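
The creation steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, Gini impurity is assumed as the "best split" criterion, a max_depth limit stands in for the "l nodes" stopping rule, each tree is grown on a bootstrap sample of the rows, and the toy data is made up.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    """Most common label, used as the prediction stored in a leaf."""
    return Counter(labels).most_common(1)[0][0]

def best_split(X, y, features):
    """Return the (feature, threshold) pair with the lowest weighted Gini."""
    best, best_score = None, float("inf")
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples to both daughters
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, k, rng, depth=0, max_depth=3):
    """Grow one tree; every node looks at k randomly selected features."""
    if len(set(y)) == 1 or depth == max_depth:
        return majority(y)
    features = rng.sample(range(len(X[0])), k)   # step 1: pick k of m features
    split = best_split(X, y, features)           # step 2: best split point
    if split is None:
        return majority(y)
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]

    def grow(idx):                               # step 3: daughter nodes
        return build_tree([X[i] for i in idx], [y[i] for i in idx],
                          k, rng, depth + 1, max_depth)

    return (f, t, grow(left), grow(right))

def build_forest(X, y, n_trees, k, seed=0):
    """Step 5: repeat tree construction n_trees times on bootstrap samples."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], k, rng))
    return forest

# Hypothetical toy data: three numeric features per lesion.
X = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.1], [0.8, 0.2, 0.7],
     [0.9, 0.1, 0.9], [0.15, 0.7, 0.2], [0.85, 0.3, 0.8]]
y = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]
forest = build_forest(X, y, n_trees=5, k=2)
```

Each tree is either a leaf label or a nested tuple (feature, threshold, left, right). In practice a library implementation would normally be preferred over a hand-written one.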

2. Random forest prediction pseudocode:

The trained random forest algorithm uses the pseudocode below to perform prediction.

1. Take the test features, use the rules of each randomly created decision tree to
predict the outcome, and store the predicted outcome (target).
2. Calculate the votes for each predicted target.
3. Consider the highest-voted predicted target as the final prediction from the random forest
algorithm.

To perform prediction using the trained random forest, we need to pass the test
features through the rules of each randomly created tree. Suppose we formed 100
random decision trees to form the random forest.

Each tree may predict a different target (outcome) for the same test features. The votes are
then calculated for each predicted target. Suppose the 100 random decision
trees predict 3 unique targets x, y and z; then the votes for x are simply the number of
trees, out of 100, whose prediction is x.

The same applies to the other 2 targets (y and z). If x gets the most votes, say 60 of the 100
random decision trees predict x, then the random forest returns
x as the final predicted target.

This concept of voting is known as majority voting.
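
The voting step can be sketched in a few lines. The 60 votes for x follow the example in the text; the 25/15 split between y and z is an assumption added only so that the votes total 100, and majority_vote is a hypothetical helper name.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequently predicted target and its vote count."""
    target, votes = Counter(predictions).most_common(1)[0]
    return target, votes

# One prediction per tree: 60 trees say x (from the text); the y/z split
# is an illustrative assumption.
tree_predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15

winner, votes = majority_vote(tree_predictions)
print(winner, votes)
```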

3.2.4 Performance Analysis Module

The performance is analysed by dividing the data into train and test sets. The model is trained
using the train data. The trained model is then tested against the test data, and the obtained
predictions are evaluated based on their accuracy. The accuracy is calculated using the confusion
matrix.
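
A minimal sketch of such a split is shown below. The 25% test fraction and the fixed seed are assumptions chosen for illustration, since the report does not state the ratio used, and train_test_split here is a small hand-written helper rather than a library function.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=42):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(20))  # stand-in for 20 labelled lesion records
train, test = train_test_split(samples)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made up only of the last records in the file, which matters when the file is ordered by diagnosis.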

Confusion Matrix

A confusion matrix is a matrix (table) that can be used to measure the performance of a
machine learning algorithm, usually a supervised learning one. Each row of the confusion
matrix represents the instances of an actual class and each column represents the instances of
a predicted class. For example, suppose the algorithm should have predicted a sample as ci
because its actual class is ci, but it predicted cj instead. For this mislabelled sample, the
element cm[i,j] is incremented by one when the confusion matrix is constructed.

For a multi-class confusion matrix, the accuracy is measured as the sum of the diagonal
elements divided by the total number of predictions performed:

\[ \mathrm{Accuracy} = \frac{\sum_{i=0}^{n-1} cm[i,i]}{\sum_{i=0}^{n-1} \sum_{j=0}^{n-1} cm[i,j]} \tag{3.1} \]

Where,

n is the number of classes

∑cm[i,i] indicates the diagonal elements of the confusion matrix

∑cm[i,j] indicates all elements in the confusion matrix
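
Equation (3.1) can be checked on a small worked example. The class codes and the actual and predicted labels below are hypothetical; rows index the actual class and columns the predicted class, as described above.

```python
def confusion_matrix(actual, predicted, classes):
    """cm[i][j] counts samples of actual class i predicted as class j."""
    index = {c: i for i, c in enumerate(classes)}
    cm = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        cm[index[a]][index[p]] += 1
    return cm

def accuracy(cm):
    """Sum of the diagonal divided by the sum of all elements."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical labels for eight test samples.
classes = ["mel", "nv", "bkl"]
actual = ["mel", "mel", "nv", "nv", "nv", "bkl", "bkl", "mel"]
predicted = ["mel", "nv", "nv", "nv", "bkl", "bkl", "bkl", "mel"]

cm = confusion_matrix(actual, predicted, classes)
print(cm)
print(accuracy(cm))
```

Here six of the eight samples fall on the diagonal, so the accuracy is 6/8 = 0.75.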

3.2.5 Data Visualizations

The dataset after dropping the lesion_id column is as follows:

Fig.3.5:First 5 fields in Dataset

Fig.3.5 represents the first 5 rows of data in the dataset after dropping lesion_id. The
lesion_id is dropped as there are some duplicate values that are not useful for classification.

Count Plot

A count plot can be thought of as a histogram across a categorical, instead of quantitative,
variable. The basic API and options are identical to those for bar plot, so you can
compare counts across nested variables.
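
The counting that underlies a count plot can be sketched without a plotting library. The localization values below are hypothetical, and the text bars are only a stand-in for the plotted figure.

```python
from collections import Counter

# Hypothetical localization values, one per image.
localizations = ["back", "back", "trunk", "back", "face", "trunk", "scalp"]

counts = Counter(localizations)  # occurrences of each category

# One bar per category, longest first.
for site, n in counts.most_common():
    print("{:<6} {} {}".format(site, "#" * n, n))
```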

Fig.3.6:Plot on the localisation of disease

Fig.3.6 plots the number of people affected against the body location of the disease.
The X-axis represents the count and the Y-axis represents the location. From the plot
we can see that the disease occurs most often on the back and least often on the
genital area.

Fig.3.7:Sample images of disease

Fig.3.7 represents the sample images of various types of skin diseases which may include
vascular, benign, basal cell carcinoma, etc.

Fig.3.8:Sample images of Melanoma

Fig.3.8 represents the sample images of Melanoma (malignant skin cancer).

3.3 UML DIAGRAMS

Unified Modelling Language is a standard language for writing software blueprints. It can be
used to visualize, specify, construct, and document the artifacts of a software-intensive system.
Modeling is a proven and well-accepted engineering technique.

3.3.1 Use Case Diagram

A use case diagram is a behavioral diagram that shows a set of use cases and actors and their
relationships. Use case diagrams address the static view of a system. A use case is a
description of a set of sequences of actions that a system performs to yield an observable
result of value to an actor.

Fig.3.9:Use-case diagram of Developer

The above Fig.3.9 illustrates the use case diagram of the developer. The developer collects
the datasets and performs pre-processing operations. The developer then splits the dataset
into training and testing sets, trains the model using the training set and estimates the
accuracy obtained by the model. The model takes an image from the user as input and predicts
the output.

3.3.2 Sequence Diagram

A sequence diagram emphasizes the time ordering of messages. It shows a set of objects and
the messages sent and received by those objects. The objects are typically named or
anonymous instances of classes. Sequence diagrams are used to model the dynamic aspects of a
system.

Fig.3.10:Sequence Diagram for Melanoma classification

The above Fig.3.10 represents the sequence diagram for melanoma detection. The sequence
followed is as follows:

1. The kernel is initiated by starting the server.

2. An instruction is sent to the browser to open up a Jupyter notebook.

3. The browser requests libraries from the file system via the kernel (e.g. Pandas, NumPy, etc.).

4. The kernel interacts directly with the file system and requests the required libraries.

5. The file system responds to Kernel’s request by fetching the libraries that were requested.

6. The Browser then requests for the dataset associated with the problem statement.

7. The Kernel receives the request from Browser and passes it on to the file system.

8. The file system satisfies the request by fetching the required dataset.

9. The browser requests cleaning steps on the data set.

10. Various cleaning operations are performed over the data frame to get higher accuracy.

11. The target data is visualized to understand its behaviour.

12. The kernel responds with the kind of visualizations that were requested by the browser.

13. Data transformation techniques are performed to normalize the attribute values in the data set.

14. Different transformation techniques are applied over the data frame by the Kernel.

15. A model is trained over the data frame.

16. A tuned model is fit over the data set to provide a highly accurate model.

17. The associated predictions over the test data are visualized to analyze their accuracy in
comparison to the actual melanoma images.

18. Requested visualizations are displayed over the browser to better understand the accuracy of
the trained model.

3.3.3 Activity Diagram

An activity diagram is a UML diagram that focuses on the execution and flow of the behavior
of a system rather than on its implementation. It is also called an object-oriented flowchart.
Activity diagrams consist of activities that are made up of actions, and they are used in
behavioral modeling.

[Activity diagram: Dataset retrieved -> Data is preprocessed -> Resizing data -> Divide data
into training and testing set -> Train model with data -> Test model against test data ->
Accuracy estimation -> Predicted output]

Fig.3.11:Activity diagram for melanoma classification

The above Fig.3.11 represents the activity diagram for melanoma classification. The
execution flow shown in the activity diagram is as follows:

1) The data set is retrieved.

2) The data is pre-processed.

3) The images are resized.

4) The data is divided into training and testing sets.

5) The model is trained with the training data.

6) The trained model is tested against the test data.

7) The accuracy is calculated.

8) The output is displayed.
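
The flow above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the images are synthetic and already of equal size (so the resizing step is omitted), the function name run_pipeline is hypothetical, and a simple nearest-centroid classifier stands in for the Random Forest used in this project so that the example needs no extra libraries.

```python
import numpy as np

def run_pipeline(images, labels, test_fraction=0.2, seed=0):
    """Flatten -> split -> train -> test -> accuracy (steps 2 to 7 above)."""
    rng = np.random.default_rng(seed)
    # Pre-process: flatten each (H, W, C) image into one feature row in [0, 1].
    X = images.reshape(len(images), -1).astype(float) / 255.0
    y = np.asarray(labels)
    # Split: shuffle the indices and hold out a fraction for testing.
    order = rng.permutation(len(X))
    n_test = max(1, int(round(test_fraction * len(X))))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Train (stand-in model): one mean feature vector (centroid) per class.
    classes = np.unique(y[train_idx])
    centroids = np.stack(
        [X[train_idx][y[train_idx] == c].mean(axis=0) for c in classes]
    )
    # Test: assign each held-out row to the class of the nearest centroid.
    dists = np.linalg.norm(X[test_idx][:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    # Accuracy: fraction of held-out rows predicted correctly.
    accuracy = float((preds == y[test_idx]).mean())
    return preds, accuracy

# Synthetic data: 20 dark and 20 bright 4x4 RGB "images" with made-up labels.
data_rng = np.random.default_rng(1)
dark = data_rng.integers(0, 50, size=(20, 4, 4, 3))
bright = data_rng.integers(200, 256, size=(20, 4, 4, 3))
images = np.concatenate([dark, bright])
labels = ["nv"] * 20 + ["mel"] * 20

preds, accuracy = run_pipeline(images, labels)
print(preds, accuracy)
```

The structure of the function mirrors the activity diagram; swapping the centroid step for a Random Forest changes only the train and test lines.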

4.TESTING AND RESULTS

4.1 Testing

Testing is the process of executing a program with the aim of finding errors. For the
software to perform well, it should be as free of errors as possible. Successful testing
exposes errors so that they can be removed from the software.

4.1.1 Types of Testing

1. White Box Testing
2. Black Box Testing
3. Unit Testing
4. Integration Testing
5. Alpha Testing
6. Beta Testing
7. Performance Testing and so on

4.1.1.1 White Box Testing

A testing technique based on knowledge of the internal logic of an application's code. It
includes tests such as coverage of code statements, branches, paths and conditions. It is
performed by software developers.

4.1.1.2 Black Box Testing

A method of software testing that verifies the functionality of an application without having
specific knowledge of the application's code/internal structure. Tests are based on requirements
and functionality.

4.1.1.3 Unit Testing

Software verification and validation method in which a programmer tests if individual units of
source code are fit for use. It is usually conducted by the development team.

4.1.1.4 Integration Testing

The phase in software testing in which individual software modules are combined and tested
as a group. It is usually conducted by testing teams.

4.1.1.5 Alpha Testing

Testing of a software product or system conducted at the developer's site. Usually it is
performed by the end users.

4.1.1.6 Beta Testing

Final testing before releasing the application for commercial purposes. It is typically done
by end-users or others.

4.1.1.7 Performance Testing

Testing conducted to evaluate the compliance of a system or component with specified
performance requirements. It is usually conducted by the performance engineer.

4.1.2 Black Box Testing

Black box testing is testing the functionality of an application without knowing the details
of its implementation, including internal program structure, data structures etc. Test cases
for black box testing are created based on the requirement specifications. Therefore, it is
also called specification-based testing. Fig.4.1 represents the black box testing:

Fig.4.1:Black Box Testing

When applied to machine learning models, black box testing means testing the model without
knowing its internal details, such as the features of the model or the algorithm used to
create it. The challenge, however, is to verify the test outcome against the expected values
that are known beforehand.

Fig.4.2:Black Box Testing for Machine Learning algorithms

The above Fig.4.2 represents the black box testing procedure for machine learning
algorithms.

Table.4.1:Black Box testing

Input           Original Output    Obtained Output
ISIC_0033256    df                 df
ISIC_0033477    mel                mel

The model gives the correct output for the different inputs listed in Table 4.1. Therefore,
the program is said to execute as expected.
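
A black box check of this kind can be sketched as a small harness that only compares inputs with expected outputs. The names run_black_box_tests and stub_predict are hypothetical; stub_predict is a fixed lookup table standing in for the trained classifier, so the sketch says nothing about how the real model behaves.

```python
def run_black_box_tests(predict, cases):
    """Call predict on each input and compare the result with the expected label."""
    results = []
    for image_id, expected in cases:
        obtained = predict(image_id)
        results.append((image_id, expected, obtained, obtained == expected))
    return results

def stub_predict(image_id):
    """Hypothetical stand-in for the trained model: a fixed lookup table."""
    known = {"ISIC_0033256": "df", "ISIC_0033477": "mel"}
    return known.get(image_id, "nv")

# The two inputs and expected labels listed in Table 4.1.
cases = [("ISIC_0033256", "df"), ("ISIC_0033477", "mel")]

report = run_black_box_tests(stub_predict, cases)
for image_id, expected, obtained, passed in report:
    print(image_id, expected, obtained, "pass" if passed else "fail")
```

The harness never inspects the model itself, which is the defining property of black box testing; any callable that maps an image id to a label can be passed as predict.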

4.2 Model Accuracy

The random forest classifier is tested against several test datasets. The model is trained to get
a good accuracy. The model trained has the following specifications:

Fig.4.3:Specifications of Random forest classifier

Fig.4.3 represents the specifications of the random forest classifier that is used for
classification.

The test cases and results of the model are given below

Fig.4.4:Accuracy of Random forest classifier

Fig.4.4 shows the accuracy obtained by using the random forest classifier. An accuracy of
69.56% was obtained on the test set. The accuracy is obtained by comparing the number of
correctly predicted values (Xp) with the total number of values classified (Yv). Therefore,
the accuracy is calculated as follows:

Accuracy = Correctly predicted values / Total values classified = Xp/Yv

The accuracy_score function compares each predicted label with the actual label. The same
value can be derived from a confusion matrix, which represents the number of images
classified correctly or incorrectly. The detailed calculation of accuracy using the
confusion matrix is explained below.

Fig.4.5:Confusion Matrix

Fig.4.5 illustrates the confusion matrix built for classification of skin diseases. The confusion
matrix built represents the number of images that are classified correctly and incorrectly. The
columns and rows of conf_mat are akiec, bcc, bkl, df, mel, nv, vasc. The conf_mat[0][0]
represents the count of akiec images that are correctly classified as akiec, conf_mat[0][1]
represents the count of akiec images that are incorrectly classified as bcc, and so on. The
diagonal elements in the confusion matrix give the number of images that are classified
correctly.
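
The row and column convention described above can be illustrated with a short sketch that builds a confusion matrix by hand. The function name build_confusion_matrix and the five actual/predicted labels are hypothetical; the class order is the alphabetical order listed above.

```python
import numpy as np

CLASSES = ["akiec", "bcc", "bkl", "df", "mel", "nv", "vasc"]

def build_confusion_matrix(actual, predicted, classes=CLASSES):
    """Count (actual, predicted) pairs: rows are actual classes, columns predicted."""
    index = {name: i for i, name in enumerate(classes)}
    cm = np.zeros((len(classes), len(classes)), dtype=int)
    for a, p in zip(actual, predicted):
        cm[index[a], index[p]] += 1
    return cm

# Hypothetical labels for five images.
actual = ["akiec", "akiec", "mel", "nv", "nv"]
predicted = ["akiec", "bcc", "mel", "nv", "mel"]

toy_cm = build_confusion_matrix(actual, predicted)
print(toy_cm[0, 0], toy_cm[0, 1])  # 1 akiec classified correctly, 1 akiec classified as bcc
```

Each image adds one to exactly one cell, so the matrix sums to the number of images and its diagonal counts the correct predictions.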

Fig.4.6:Accuracy using confusion matrix

In Fig.4.6, the accuracy of the model is calculated using the confusion matrix, as defined
in Eq. (3.1).

4.3 Test Cases and Results

Table.4.2:Positive Test Cases

Test Case id   Input (image id)    Expected Output   Actual Output   Remarks
1              ISIC_0030586.jpg    [‘akiec’]         [‘akiec’]       Success
2              ISIC_0027043.jpg    [‘mel’]           [‘mel’]         Success
3              ISIC_0026637        [‘bkl’]           [‘bkl’]         Success

Table.4.2 shows the positive test cases, i.e., the test cases for which the model produced
the expected output, whereas Table.4.3 shows the test cases that resulted in a wrong output.

Table.4.3:Negative Test Cases

Test Case id   Input (image id)    Expected Output   Actual Output   Remarks
1              ISIC_0024375.jpg    [‘vasc’]          [‘nv’]          Failure
2              test2img.jpg        [‘vasc’]          [‘akiec’]       Failure
3              testimg.jpg         [‘bcc’]           [‘nv’]          Failure

The predictions on a few samples of the test data are given below:

1)

Fig.4.7: Actinic keratoses and intraepithelial carcinoma

The code and image in Fig.4.7 show the input image id given to the classifier, the output
image and the type of cancer obtained.

2)

Fig.4.8:Melanoma

The above Fig.4.8 shows the input image id given to the classifier and the output
obtained.

3)

Fig.4.9:Benign Keratosis

The above Fig.4.9 shows the input image id of the skin lesion and the predicted kind of
disease.

4)

Fig.4.10: Melanocytic Nevus

The above Fig.4.10 shows the input image id of the skin lesion and the predicted kind of
disease.

5.CONCLUSION AND FUTURE SCOPE

5.1 Conclusion

The Melanoma Skin Lesion Classification project is completed successfully and its goal is
achieved. A Random Forest Classifier is developed and fitted to training data of 8782 image
samples. The trained model predicts melanoma images with an accuracy of up to 69.5%.

In addition to benign lesions and melanoma, the model also predicts other types of skin
disease, such as Actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec),
basal cell carcinoma (bcc), dermatofibroma (df), melanocytic nevi (nv) and vascular lesions
such as angiomas, angiokeratomas, pyogenic granulomas and haemorrhage (vasc).

5.2 Future Scope

A comparison between various machine learning algorithms such as CNN (Convolutional
Neural Network), AdaBoost, Bagging, etc. can be performed to get even better results. The
model needs to be updated at regular intervals to be more accurate.

BIBLIOGRAPHY

[1] F. Nachbar, W. Stolz, T. Merkle, A. B. Cognetta, T. Vogt, M. Landthaler, P. Bilek,
O. B.-Falco, and G. Plewig, “The abcd rule of dermatoscopy: high prospective value in the
diagnosis of doubtful melanocytic skin lesions,” Journal of the American Academy of
Dermatology, vol. 30, no. 4, pp. 551–559, 1994.

[2] E. Zagrouba and W. Barhoumi, “A preliminary approach for the automated recognition of
malignant melanoma,” Image Analysis & Stereology, vol. 23, no. 2, pp. 121-135, 2004.

[3] P. G. Cavalcanti, J. Scharcanski and G. V. Baranoski, “A two-stage approach for
discriminating melanocytic skin lesions using standard cameras,” Expert Systems with
Applications, Elsevier, vol. 40, no. 10, pp. 4054-4064, 2013.

[4] I. Giotis, N. Molders, S. Land, M. Biehl, M. F. Jonkman and N. Petkov, “MED-NODE: A
computer-assisted melanoma diagnosis system using non-dermoscopic images,” Expert
Systems with Applications, Elsevier, vol. 42, no. 19, pp. 6578-6585, 2015.

[5] S. Amutha, Mrs. R. Manjula Devi, “Early detection of malignant melanoma using random
forest algorithm,” International Journal For Trends In Engineering & Technology, Volume 13
Issue 1 – May 2016 - ISSN: 2349 – 9303.

[6] Poornima M S, Dr. Shailaja K, “Detection of Skin Cancer Using SVM,” International
Research Journal of Engineering and Technology (IRJET), Volume: 04 Issue: 07 | July -2017.

[7] Health Discovery Corporation, “Worlds first svm-based image analysis iphone app for
melanoma risk assessment, melapp,” launched by health discovery corporation, 2011.

[8] M. Silveira et al., “Comparison of segmentation methods for melanoma diagnosis in
dermoscopy images,” IEEE J. Sel. Topics Signal Process., vol. 3, no. 1, pp. 35-45, Feb. 2009.

[9] T. Wadhawan, N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis, “SkinScan: A portable
library for melanoma detection on handheld devices,” in Proc. IEEE Int. Symp. Biomed. Imag.,
Nano Macro, Mar./Apr. 2011, pp. 133-136.

[10] Arati P. Chavan, D. K. Kamat, Dr. P. M. Patil, “Classification Of Skin Cancers Using Image
Processing”, International Journal of Advance Research in Electronics, Electrical Computer
Science Applications of Engineering Technology, Volume 2, Issue 3, PP 378-384, June 2014.

APPENDIX

#importing libraries

# importing the libraries necessary for extracting the data and performing operations on it

import numpy as np

import pandas as pd

import os

import seaborn as sns

import matplotlib.pyplot as plt

#extracting dataset

data=pd.read_csv('HAM10000_metadata.csv')

#importing the zip file to extract images

import zipfile

print('Extracting Images into single directory')

#creating a directory to store the images

#create a new directory named images to store all the extracted images

os.makedirs('images', exist_ok=True)

#extracting images to a single folder

'''

Open the image zipfiles and extract all the images present in them to a new folder named images.
Later close the zipfiles and unlink (delete) them

'''

# the with-statement closes each zip file automatically after extraction
with zipfile.ZipFile('HAM10000_images_part_1.zip', 'r') as zip_ref:
    zip_ref.extractall('images/')

os.unlink('HAM10000_images_part_1.zip')

with zipfile.ZipFile('HAM10000_images_part_2.zip', 'r') as zip_ref:
    zip_ref.extractall('images/')

os.unlink('HAM10000_images_part_2.zip')

#listing the directories present in folder

os.listdir()

#Exploratory Data Analysis

'''

Perform exploratory data analysis on the dataset HAM10000_metadata.csv and find if any null
values are present in the dataset

'''

data=data.drop(columns=['lesion_id'])

data.head()

data.isnull().sum()

#plotting a countplot

'''

Plot a count plot on the localization attribute to analyse the areas that are most affected by
the disease. Plot a count plot on the type of disease to get the number of people affected by
that particular type of disease

'''

data['dx'].value_counts()

plt.figure(figsize=(10,8))

#pass the column as x explicitly; newer seaborn versions require keyword arguments
sns.countplot(x=data['localization'])

plt.xticks(rotation=90)

sns.countplot(x=data['dx_type'])

sns.countplot(x=data['dx'])

#calculating mean of age values to fill in the missing fields

#Find the mean of the age field to fill the missing values in it with mean value

print(data['age'].mean())

print(data['age'].median())

data['age'] = data['age'].fillna(data['age'].mean())

#import glob to add image_path attribute in desired pattern

#Adding a new field image_path to the dataset

from glob import glob

image_path = {os.path.splitext(os.path.basename(x))[0]: x for x in glob(os.path.join('', '*', '*.jpg'))}

data['path'] = data['image_id'].map(image_path.get)

data.head()

#importing PIL to perform operations on images

'''

Using PIL package open the image and convert it into array and display the image shape and
image. Calculate the count of people affected with disease based on sex

'''

from PIL import Image as pil_image

image_example = np.asarray(pil_image.open(data['path'][0]))

image_example.shape

plt.imshow(image_example)

data['sex'].value_counts()

#Resizing the images

data['image'] = data['path'].map(lambda x: np.asarray(pil_image.open(x).resize((120,90))))

#Samples of 5 images in each class

fig,axes = plt.subplots(7,5,figsize=(20,21))

#sorting based on dx and displaying some images

#Group the images based on the disease type and display 5 sample images for each type

for nth_axis,(cell_type_name,cell_type_row) in zip(axes,data.sort_values(['dx']).groupby('dx')):
    nth_axis[0].set_title(cell_type_name)
    for column_axis,(_,column_row) in zip(nth_axis,cell_type_row.sample(5).iterrows()):
        column_axis.imshow(column_row['image'])
        column_axis.axis('off')

plt.imshow(data['image'][0])

data['image'][0].shape

#Data Pre-processing and Data Modelling

'''

The data is divided into features and target, where the features are all fields other than dx
whereas the target contains the dx field

'''

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler

features = data.drop(['dx'],axis=1)

target = data['dx']

#splitting the data into train and testing

'''

Split the features and target into training and testing set by taking test size as 0.02. Convert the
training and testing images to array and print their shape

'''

X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(features,target,test_size=0.02)

x_train = np.asarray(X_TRAIN['image'].tolist())

x_test = np.asarray(X_TEST['image'].tolist())

print(x_train.shape)

print(x_test.shape)

print(Y_TRAIN.shape)

print(Y_TEST.shape)

#splitting data to training and validation sets

train_img,val_img,train_labels,val_labels=train_test_split(x_train,Y_TRAIN,test_size=0.1,random_state=42)

#displaying the shape of train, test and validation set

train_img.shape,val_img.shape,x_test.shape

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

#fitting the training labels using label encoder

le.fit(train_labels)

train_labels_enc = le.transform(train_labels)

val_labels_enc = le.transform(val_labels)

print(train_labels[:10], train_labels_enc[:10])

#Random Forest

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

#reshaping the images to fit into random forest classifier

'''

Reshape the training image size by converting them from 4-dimensional to 2-dimensional to fit it
into the random forest classifier

'''

train_img=train_img.reshape(train_img.shape[0],train_img.shape[1]*train_img.shape[2]*train_img.shape[3])

#fitting the data into classifier

clf.fit(train_img,train_labels)

'''

Reshaping the validation data from 4-dimensional to 2-dimensional to fit it into random forest
classifier

'''

val_img=val_img.reshape(val_img.shape[0],val_img.shape[1]*val_img.shape[2]*val_img.shape[3])

from sklearn.metrics import accuracy_score

preds = clf.predict(train_img)

print("Accuracy:", accuracy_score(train_labels,preds))

#Accuracy score of Random Forest

'''

Predict the type of disease for given validation image using the trained model. Calculate the
accuracy of the predicted image by comparing the obtained output with the actual output

'''

from sklearn.metrics import accuracy_score

preds = clf.predict(val_img)

print("Accuracy:", accuracy_score(val_labels,preds))

#Confusion Matrix

'''

Display the confusion matrix to analyse the number of images that are classified correctly and the
number of images that are classified as other type of disease.

'''

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(val_labels, preds)

print(conf_mat)

#Accuracy using Confusion Matrix

'''
The accuracy using confusion matrix is calculated by comparing the sum of diagonal elements
of confusion matrix with the sum of total elements present in the matrix

'''
#np.trace gives the sum of the diagonal elements conf_mat[0,0] ... conf_mat[6,6]
np.trace(conf_mat)/np.sum(conf_mat)

#Classification report

#The classification report gives the precision, recall,f1-score and support

from sklearn.metrics import classification_report

print(classification_report(val_labels, preds))

#Prediction of New images

#We test the trained model against new images


#TestCase1

#os.path.join builds the path portably instead of relying on a backslash in the string
test_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0030586.jpg')))

print('Original Shape of image is : ',test_image.shape)

plt.imshow(test_image)

resized_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0030586.jpg')).resize((120,90)))

image_array = np.asarray(resized_image.tolist())

test_image = image_array.reshape(1,90,120,3)

test_image.shape

test_image=test_image.reshape(test_image.shape[0],test_image.shape[1]*test_image.shape[2]*test_image.shape[3])

prediction_class = clf.predict(test_image)

print(prediction_class)

#TestCase2

#os.path.join builds the path portably instead of relying on a backslash in the string
test_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0027043.jpg')))

print('Original Shape of image is : ',test_image.shape)

plt.imshow(test_image)

resized_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0027043.jpg')).resize((120,90)))

image_array = np.asarray(resized_image.tolist())

test_image = image_array.reshape(1,90,120,3)

test_image=test_image.reshape(test_image.shape[0],test_image.shape[1]*test_image.shape[2]*test_image.shape[3])

prediction_class = clf.predict(test_image)

print(prediction_class)

#TestCase3

#os.path.join builds the path portably instead of relying on a backslash in the string
test_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0024375.jpg')))

print('Original Shape of image is : ',test_image.shape)

plt.imshow(test_image)

resized_image = np.asarray(pil_image.open(os.path.join('images','ISIC_0024375.jpg')).resize((120,90)))

image_array = np.asarray(resized_image.tolist())

test_image = image_array.reshape(1,90,120,3)

test_image=test_image.reshape(test_image.shape[0],test_image.shape[1]*test_image.shape[2]*test_image.shape[3])

prediction_class = clf.predict(test_image)

print(prediction_class)
