NOVEMBER 2019
MAHATMA GANDHI INSTITUTE OF TECHNOLOGY
(Established in 1997 by Chaitanya Bharathi Educational Society)
(Affiliated to JNTU, Hyderabad; Accredited by NBA, AICTE-New Delhi)
Kokapet (village & gram panchayat), Rajendra Nagar (Mandal), Ranga Reddy(Dist.),
Chaitanya Bharathi P.O., Hyderabad-500 075.
CERTIFICATE
This is to certify that the industry-oriented mini project (CS705PC) entitled “Melanoma
Skin Lesion Classification”, being submitted by Ms. Pogula RajyaLakshmi Sobha Pavani,
bearing Roll No. 16261A0542, in partial fulfilment for the award of B. Tech in Computer
Science and Engineering to Mahatma Gandhi Institute of Technology, Gandipet, Hyderabad, is
a record of bonafide work carried out by her under our guidance and supervision.
The results embodied in this project have not been submitted to any other University or
Institute for the award of any degree or diploma.
Ms. D. S. Bhavani (Internal Guide)          Dr. C. R. K. Reddy (Head of Department)
EXTERNAL EXAMINER
DECLARATION
I hereby declare that the work presented in this project, titled “MELANOMA SKIN LESION
CLASSIFICATION”, which is being submitted by me in partial fulfilment for the award of
B. Tech in the Department of Computer Science and Engineering to Mahatma Gandhi Institute
of Technology, Gandipet, Hyderabad (T.S) - 500075, is a result of investigations carried out
by me under the guidance of Ms. D. S. Bhavani, Assistant Professor of Computer Science and
Engineering, MGIT, Gandipet, Hyderabad.
The work is original and has not been submitted for any Degree/Diploma of this or any other
university.
ACKNOWLEDGEMENT
I would like to express my sincere thanks to Dr. K. Jaya Sankar, Principal, MGIT, for
providing the working facilities in the college.
I wish to express my sincere thanks and gratitude to Dr. C. R. K. Reddy, Professor and HOD,
Department of CSE, MGIT, for all the timely support and valuable suggestions during the
period of the project.
Finally, I would also like to thank all the faculty and staff of the CSE Department who helped me,
directly or indirectly, in completing this project.
H.T.No: 16261A0542
TABLE OF CONTENTS
Certificate i
Declaration ii
Acknowledgement iii
List of Figures vi
Abstract viii
1. Introduction 1
1.1 Problem Definition 1
1.2 Existing System 2
1.3 Proposed System 2
1.4 System Requirements 3
1.4.1 Hardware Requirements 3
1.4.2 Software Requirements 3
2. Literature Survey 4
3. Design Methodology 9
3.1 System Architecture 9
3.2 Modules 9
3.2.1 Data Acquisition 10
3.2.2 Pre-process Data 10
3.2.3 Data Analysis Module 11
3.2.4 Performance Analysis Module 14
3.2.5 Data Visualizations 15
3.3 UML Diagrams 16
3.3.1 Use Case Diagram 17
3.3.2 Sequence Diagram 18
3.3.3 Activity Diagram 19
4. Testing and Results 22
4.1 Testing 22
4.2 Model Accuracy 25
4.3 Test Cases and Results 26
Bibliography 31
Appendix 32
LIST OF FIGURES
LIST OF TABLES
ABSTRACT
Machine Learning has been successfully applied to real-world problems, and its use in medicine is
relatively new; it is intended for the identification and extraction of new and potentially valuable
knowledge from data. Disease identification and diagnosis of ailments are at the forefront of ML
research in medicine.
The aim of this project is to develop a model that can predict melanoma cancer. Melanoma is a type
of skin cancer that is not very common but can be fatal if not detected in its early stages. In this
project, a classifier is built that classifies whether a given skin lesion is malignant or benign using
supervised algorithms (such as Random Forest), and the performance of the learning tools is
evaluated based on their predictive accuracy.
1.INTRODUCTION
Melanoma is the most dangerous type of skin cancer. Melanoma, also known as malignant
melanoma, is a type of cancer that develops from the pigment-containing cells known as
melanocytes. Melanoma can occur on any skin surface. In men, it's often found on the skin of the
head, the neck, or between the shoulders and the hips. In women, it's often found on the skin of the
lower legs or between the shoulders and the hips. Melanoma is rare in people with dark skin. When
it does develop in people with dark skin, it's usually found under the fingernails, under the toenails,
on the palms of the hands, or on the soles of the feet. Ultraviolet (UV) rays are clearly a major cause:
UV rays can damage the DNA in skin cells. Sometimes this damage affects certain genes that control
how skin cells grow and divide. If these genes no longer work properly, the affected cells may form
a cancer. Sunlight is the main source of UV rays. The nature of the UV exposure may play a vital
role in melanoma development.
Skin cancer is the most commonly diagnosed cancer in the United States, yet most cases are
preventable. Every year in the United States, nearly 5 million people are treated for skin cancer,
at an estimated cost of $8.1 billion. Melanoma, the deadliest form of skin cancer, causes more
than 9,000 deaths each year. Unlike many other cancers, skin cancer rates have continued to
rise in recent years. Here are the American Cancer Society’s estimates for melanoma in the
United States for 2015:
1. About 73,870 new melanomas will be diagnosed (about 42,670 in men and 31,200
in women).
2. About 9,940 people are expected to die of melanoma (about 6,640 men and 3,300
women).
Although the survival rate is increasing, the death rate from malignant melanoma is increasing
exponentially. Early diagnosis is crucial for treatment, because malignant melanoma is very
invasive once it affects the melanocytes.
learning techniques. The ultimate goal is to ease the doctor's role in the recognition of skin
cancer by providing better and more reliable results, so that more patients can be diagnosed.
Disadvantages

1.3 Proposed System
In this project, the system is trained by collecting image data of melanoma at different stages. The
trained system predicts melanoma cancer with good accuracy. It helps in early diagnosis of the
disease and classifies the given images as malignant or benign.
Advantages
1.4 System Requirements
2.LITERATURE SURVEY
In this study, 194 skin lesions were collected and evaluated according to the ABCD rule (A: asymmetry,
B: abrupt cut-off of the pigment at the border, C: different colors, D: structural components), which
was assessed to yield a semiquantitative score. This method is very expensive and is used only for
selected lesion samples.
In this work, they proposed an efficient system for prescreening of pigmented skin lesions for
malignancy using general-purpose digital cameras. The images can be captured by a smartphone or
a digital camera. This could be beneficial in different applications, such as computer-aided diagnosis
and telemedicine. It could assist dermatologists, or smartphone users, in evaluating the risk of
suspicious moles. The proposed method enhances borders and extracts a broad set of dermatologically
important features. These discriminative features allow classification of lesions into two groups:
melanoma and benign. The algorithm used to classify the images is an ANN (Artificial Neural
Network).
In this paper, they proposed and evaluated six methods for the segmentation of skin lesions in
dermoscopic images. This set includes some state-of-the-art techniques which have been successfully
used in many medical imaging problems: gradient vector flow (GVF) and the level set method of
Chan et al. (C-LS). It also includes a set of methods developed by the authors which were tailored to
this particular application: adaptive thresholding (AT), adaptive snake (AS), EM level set (EM-LS),
and a fuzzy-based split-and-merge algorithm (FBSM). The segmentation methods were applied to
100 dermoscopic images and evaluated with four different metrics, using the segmentation result
obtained by an experienced dermatologist as the ground truth. The best results were obtained by the
AS and EM-LS methods.
T. Wadhawan, N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis, “SkinScan: A portable
library for melanoma detection on handheld Devices” (2011)
In this study, the authors explored the feasibility of running a sophisticated application for automated
skin cancer detection on an Apple iPhone 4. Their results demonstrate that the proposed library, with
its advanced image processing and analysis algorithms, has excellent performance on handheld and
desktop computers. Therefore, the deployment of smartphones as screening devices for skin cancer
and other skin diseases can have a significant impact on health care delivery in underserved and
remote areas. This method consists of image segmentation, feature extraction and image classification.
In this approach, the collected images are first pre-processed and evaluated using the ABCD rule. In
addition, a classification algorithm (KNN) is used to classify lesions as benign or malignant. In the
second stage, the classification sensitivity is enhanced: all the images that have been classified as
benign are re-classified using melanin variation features extracted from inside the lesion. This method
is not suitable for early-stage detection.
They proposed a decision support system called MED-NODE, which makes use of non-dermatoscopic
digital images from which it extracts lesion regions based on color and texture. In this system, the
classification is based on majority voting. Prediction is difficult if the image contains noise.
This paper introduces a skin lesion analysis system for the early detection of melanoma. The skin
lesion analysis system focuses on the automated image analysis module, which contains the steps of
image acquisition, hair detection and exclusion, lesion segmentation, feature extraction and
classification. The random forest algorithm is proposed for the lesion classification process, which
enhances the efficiency of the classification results. The random forest package optionally produces
information such as a measure of the importance of the predictor variables and a measure of the
internal structure of the data. The experimental results show that the proposed system is efficient,
achieving classification of the benign, atypical and melanoma images with good accuracy.
In this paper, the authors present a method for the detection of melanoma skin cancer using image
processing tools. The input to the system is a skin lesion image, which is analysed by applying image
processing techniques to conclude whether skin cancer is present. The lesion image analysis tool
checks the various melanoma parameters (color, area, perimeter, diameter, etc.) through texture, size
and shape analysis in the image segmentation and feature extraction stages. The extracted feature
parameters are used to classify the image as a non-melanoma or melanoma cancer lesion using an
SVM (Support Vector Machine). The kernel function used is the Radial Basis Function (RBF).
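As a minimal illustration of the classifier named above, the sketch below trains an SVM with an RBF kernel on synthetic stand-in features using scikit-learn; it is not the cited paper's implementation, and the data and feature count are assumptions.
import numpy as np
from sklearn.svm import SVC

# synthetic stand-ins for extracted lesion features (e.g. colour, area, perimeter, diameter)
rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = rng.integers(0, 2, 50)   # 0 = non-melanoma, 1 = melanoma

# SVM classifier with a Radial Basis Function kernel, as named in the survey entry above
clf = SVC(kernel='rbf').fit(X, y)
print(clf.predict(X[:3]))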
Table.2.1: Literature Survey
6. “SkinScan: A portable library for melanoma detection on handheld Devices”, T. Wadhawan,
N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis [4]
Technique: Image Processing
Advantages: 1. Predicts melanoma. 2. The maintenance cost is low.
Disadvantages: 1. The embedded processors have limited speed and memory. 2. Works only on the
Apple iPhone 4.
3.DESIGN METHODOLOGY
3.1 System Architecture
The model takes images and their features as input. It then pre-processes the input data to remove
inconsistent values and noisy data. The pre-processed data is split into training and test sets. A
random forest classifier is built and trained on the training dataset. The trained model is tested
against the test dataset and its accuracy is estimated. When a new image is given as input to the
trained model, it classifies the given input image as either melanoma or non-melanoma.
Fig.3.1: System Architecture
The above Fig.3.1 explains the architecture of the model that is built to classify skin lesions
as melanoma or non-melanoma.
3.2 Modules
The model implementation consists of five modules:
1. Data Acquisition
2. Pre-processing Data
3. Data Analysis Module
4. Performance Analysis Module
5. Data Visualization
3.2.1 Data Acquisition
Data acquisition is the process of importing raw data sets into your analytical platform. It can
be acquired from traditional databases (SQL and query browsers), remote data (web services),
text files (scripting languages), NoSQL storage (web services, programming interfaces), etc.
Data acquisition involves the identification of data sets, the retrieval of data and the querying of data
from the dataset.
The melanoma dataset is extracted from “The HAM10000 dataset” hosted by Harvard
Dataverse. The dataset consists of dermatoscopic images collected from different populations,
acquired and stored by different modalities. The final dataset consists of 10015 dermatoscopic
images which can serve as a training set for academic machine learning purposes.
Fig.3.2:Sample of Dataset(metadata.csv)
Fig.3.2 represents the features present in the dataset. The features present are lesion_id, image_id,
dx (diagnosis), dx_type, age, sex and localization.
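A minimal sketch of this acquisition step, assuming the metadata file is named HAM10000_metadata.csv as in the appendix:
import pandas as pd

# load the HAM10000 metadata; each row describes one dermatoscopic image
data = pd.read_csv('HAM10000_metadata.csv')

# inspect the available features and the distribution of the diagnosis labels (dx)
print(data.columns.tolist())
print(data['dx'].value_counts())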
3.2.2 Pre-process Data
Cleaning Data: Data cleaning involves the removal of inconsistent values, duplicate records and
missing values from the dataset.
Data Munging / Data Wrangling: Data wrangling techniques involve scaling, transformation,
feature selection, dimensionality reduction and data manipulation. Scaling is performed over the
dataset to avoid having certain features with large values dominate the results. The transformation
technique reduces the noise and variability present in the dataset. Features are hand-picked so that
redundant or irrelevant features are removed from the dataset. Dimensionality reduction helps in
eliminating irrelevant features and makes the analysis simpler.
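A short sketch of these cleaning and wrangling steps on the metadata; dropping lesion_id and filling the missing ages follow the appendix, while the scaling line is illustrative only:
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv('HAM10000_metadata.csv')
data = data.drop(columns=['lesion_id'])               # remove the redundant identifier
data['age'].fillna(data['age'].mean(), inplace=True)  # fill missing ages with the mean
# scaling keeps a large-valued feature from dominating downstream analysis (illustrative)
data['age_scaled'] = StandardScaler().fit_transform(data[['age']]).ravel()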
3.2.3 Data Analysis Module
The data analysis module comprises feature selection, model selection, creation of insights and
analysis of results. A Random Forest algorithm has been utilized to analyse the dataset in order to
predict melanoma.
Random forest is a type of supervised machine learning algorithm based on ensemble learning.
Ensemble learning is a type of learning where you join different types of algorithms, or the same
algorithm multiple times, to form a more powerful prediction model. The random forest algorithm
combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of
trees, hence the name "Random Forest". The random forest algorithm can be used for both regression
and classification tasks.
Fig.3.3 gives the overview of how the random forest works. It works in four steps:
Input: Dataset, a new test sample.
Algorithm:
1. Select random samples from a given dataset. In the below Fig.3.4, the dataset is
divided into 3 random samples.
2. Construct a decision tree for each sample and get a prediction result from each
decision tree.
3. Perform a vote for each predicted result. A new sample is passed to each decision tree
and the result is stored.
4. Select the prediction result with the most votes as the final prediction.
Fig.3.4 represents the splitting of dataset into random sets for finding the best root node for
splitting.
The pseudocode for the random forest algorithm can be split into two stages: building the forest and
making predictions. The forest-building stage is as follows:
1. Randomly select “k” features from total “m” features, where k << m.
2. Among the “k” features, calculate the node “d” using the best split point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until “l” nodes have been reached.
5. Build the forest by repeating steps 1 to 4 “n” times to create “n” trees.
The random forest algorithm starts by randomly selecting “k” features out of the total “m” features.
In Fig.3.4, you can observe that features and observations are taken randomly.
In the next stage, the randomly selected “k” features are used to find the root node by using the best
split approach.
In the following stage, the daughter nodes are calculated using the same best split approach.
Finally, stages 1 to 4 are repeated to create “n” randomly built trees. These randomly built trees form
the random forest.
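As a sketch of how these steps map onto a library implementation, scikit-learn's RandomForestClassifier exposes the number of trees “n” as n_estimators and the number of randomly selected features “k” as max_features; the toy data and parameter values below are illustrative, not the project's settings:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# toy data standing in for flattened image features (100 samples, 20 features)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = rng.integers(0, 2, 100)

# n_estimators = number of trees "n"; max_features = "k" features tried at each split (k << m)
clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
clf.fit(X, y)
print(clf.predict(X[:5]))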
To perform prediction, the trained random forest algorithm uses the pseudocode below.
1. Take the test features and use the rules of each randomly created decision tree to
predict the outcome, and store the predicted outcome (target).
2. Calculate the votes for each predicted target.
3. Consider the highest voted predicted target as the final prediction from the random forest
algorithm.
To perform the prediction using the trained random forest algorithm, we need to pass the test
features through the rules of each randomly created tree. Suppose, for example, that we formed 100
random decision trees to form the random forest.
Each decision tree will predict a possibly different target (outcome) for the same test feature. Then,
by considering each predicted target, the votes are calculated. Suppose the 100 random decision
trees predict 3 unique targets x, y and z; then the number of votes for x is simply the number of
trees, out of the 100 random decision trees, whose prediction is x.
Likewise for the other 2 targets (y, z). If x gets the highest number of votes, say 60 out of the 100
random decision trees predict the target x, then the random forest returns x as the predicted target.
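A minimal sketch of this voting mechanism, built from individual decision trees trained on bootstrap samples; it illustrates the idea only and is not the project's implementation:
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = rng.integers(0, 3, 200)   # three possible targets, standing in for x, y and z

# build a small "forest": each tree is trained on a bootstrap sample of the data
trees = []
for _ in range(100):
    idx = rng.integers(0, len(X), len(X))
    trees.append(DecisionTreeClassifier(max_features='sqrt').fit(X[idx], y[idx]))

# every tree votes on the same test sample; the most voted target is the final prediction
test_sample = X[:1]
votes = Counter(int(tree.predict(test_sample)[0]) for tree in trees)
print(votes.most_common(1)[0][0])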
3.2.4 Performance Analysis Module
The performance is analysed by dividing the data into test and training sets. The model is trained
using the training data. The trained model is tested against the test data and the obtained predictions
are evaluated based on the accuracy obtained. The accuracy is calculated using the confusion matrix.
Confusion Matrix
A confusion matrix is a matrix (table) that can be used to measure the performance of a machine
learning algorithm, usually a supervised learning one. Each row of the confusion matrix represents
the instances of an actual class and each column represents the instances of a predicted class. For
example, the algorithm should have predicted a sample as ci because the actual class is ci, but the
algorithm came out with cj. In this case of mislabelling, the element cm[i, j] is incremented by one
when the confusion matrix is constructed.
For a multi-class confusion matrix, the accuracy is measured as the sum of the diagonal elements
divided by the total number of predictions performed:

Accuracy = (Σ_{i=0}^{n-1} cm[i, i]) / (Σ_{i=0}^{n-1} Σ_{j=0}^{n-1} cm[i, j])   (3.1)

where n is the number of classes and cm[i, j] is the count of class-i samples predicted as class j.
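A small worked example of eq. (3.1) for an assumed 3-class confusion matrix:
import numpy as np

# rows = actual class, columns = predicted class (illustrative counts)
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 5,  1, 39]])

accuracy = np.trace(cm) / np.sum(cm)   # sum of diagonal elements / total predictions
print(accuracy)                        # 129 / 150 = 0.86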
3.2.5 Data Visualizations
Fig.3.5 represents the first 5 rows of data in the dataset after dropping lesion_id. The
lesion_id is dropped as there are some duplicate values that are not useful for classification.
Count Plot
Fig.3.6 represents the plot of the occurrence of the disease at particular body locations against the
count of people. In this plot, the X-axis represents the count and the Y-axis represents the location.
From the plot we can see that the majority of disease occurrences are observed on the back, while
the minimum is observed at the genital area.
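A minimal sketch of this count plot, following the appendix code; the orientation (location on the Y-axis) is chosen here to match the description above:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('HAM10000_metadata.csv')
plt.figure(figsize=(10, 8))
sns.countplot(y=data['localization'])   # one bar per body site, bar length = number of images
plt.show()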
Fig.3.7 represents the sample images of various types of skin diseases which may include
vascular, benign, basal cell carcinoma, etc.
3.3 UML Diagrams
The Unified Modelling Language (UML) is a standard language for writing software blueprints. It
can be used to visualize, specify, construct, and document the artifacts of a software-intensive system.
Modeling is a proven and well-accepted engineering technique.
3.3.1 Use Case Diagram
A use case diagram is a behavioral diagram that shows a set of use cases and actors and their
relationships. Use case diagrams address the static use case view of a system. A use case is a
description of a set of sequences of actions that a system performs to yield an observable result of
value to an actor.
Fig.3.9:Use-case diagram of Developer
The above Fig.3.9 illustrates the use case diagram for the developer. In this, the developer collects
the datasets and performs pre-processing operations. Later, the developer splits the dataset into
training and testing sets. The developer trains the model using the training dataset and estimates the
accuracy the model obtained. The model takes an image as input from the user and predicts the
output.
3.3.2 Sequence Diagram
A sequence diagram emphasizes the time ordering of messages. A sequence diagram shows a set of
objects and the messages sent and received by those objects. The objects are typically named or
anonymous instances of classes. Sequence diagrams are used to model the dynamic aspects of a
system.
The above Fig.3.10 represents the sequence diagram for melanoma detection. The sequence
followed is as follows:
3. The browser requests the libraries from the file system via the kernel (e.g. Pandas, NumPy, etc.).
4. The kernel interacts directly with the file system and requests the required libraries.
5. The file system responds to the kernel's request by fetching the libraries that were requested.
6. The browser then requests the dataset associated with the problem statement.
7. The kernel receives the request from the browser and passes it on to the file system.
8. The file system satisfies the request by fetching the required dataset.
9. Cleaning steps are requested by the browser over the data set.
10. Various cleaning operations are performed over the data frame to get higher accuracy.
12. The kernel responds with the kind of visualizations that were requested by the browser.
13. Data transformation techniques are performed to normalize the attribute values in the data set.
14. Different transformation techniques are applied over the data frame by the kernel.
16. A tuned model is fit over the data set to provide a highly accurate model.
17. The associated predictions over the test data are visualized to analyse their accuracy in
comparison to the actual melanoma images.
18. The requested visualizations are displayed in the browser to better understand the accuracy of
the trained model.
3.3.3 Activity Diagram
An activity diagram is a UML diagram that focuses on the execution and flow of the behavior of a
system instead of its implementation. It is also called an object-oriented flowchart. Activity diagrams
consist of activities that are made up of actions, which apply to behavioral modeling technology.
Fig.3.11: Activity diagram (Dataset retrieved → Data is pre-processed → Resizing data → Train
model with data → Test model against test data → Accuracy estimation → Predicted output)
The above Fig.3.11 represents the activity diagram for melanoma classification. The execution flow
determined using the activity diagram is:
4) Data is divided into training and testing sets.
8) Output is displayed.
4.TESTING AND RESULTS
4.1 Testing
Testing is the process of executing a program with the aim of finding errors. To make our software
perform well, it should be error free. If testing is done successfully, it will remove all the errors
from the software.
4.1.1.1 White Box Testing
White box testing is a testing technique based on knowledge of the internal logic of an application's
code and includes tests like coverage of code statements, branches, paths and conditions. It is
performed by software developers.
4.1.1.2 Black Box Testing
Black box testing is a method of software testing that verifies the functionality of an application
without having specific knowledge of the application's code/internal structure. Tests are based on
requirements and functionality.
4.1.1.3 Unit Testing
Unit testing is a software verification and validation method in which a programmer tests if individual
units of source code are fit for use. It is usually conducted by the development team.
4.1.1.4 Integration Testing
Integration testing is the phase in software testing in which individual software modules are combined
and tested as a group. It is usually conducted by testing teams.
4.1.1.5 Alpha Testing
Alpha testing is a type of testing of a software product or system conducted at the developer's site.
Usually it is performed by the end users.
4.1.1.6 Beta Testing
Beta testing is the final testing before releasing the application for commercial purposes. It is
typically done by end users or others.
Black box testing is testing the functionality of an application without knowing the details of its
implementation, including the internal program structure, data structures, etc. Test cases for black
box testing are created based on the requirement specifications. Therefore, it is also called
specification-based testing. Fig.4.1 represents black box testing:
When applied to machine learning models, black box testing means testing machine learning models
without knowing their internal details, such as the features of the machine learning model or the
algorithm used to create the model. The challenge, however, is to verify the test outcome against the
expected values that are known beforehand.
The above Fig.4.2 represents the black box testing procedure for machine learning
algorithms.
Table.4.1: Test inputs and outputs (Image ID ISIC_0033256, expected class: df, predicted class: df)
The model gives the correct output for the different inputs mentioned in Table 4.1. Therefore, the
program is said to execute as expected.
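A hedged sketch of such a check, assuming a trained classifier clf and the 120x90 resizing convention used in the appendix; the image id and expected label df come from Table 4.1, and predict_image is a hypothetical helper:
import numpy as np
from PIL import Image

def predict_image(clf, path):
    """Resize the image to 120x90, flatten it and return the model's predicted class."""
    img = np.asarray(Image.open(path).resize((120, 90)))
    return clf.predict(img.reshape(1, -1))[0]

# black-box check against the expected output listed in Table 4.1
# assert predict_image(clf, 'images/ISIC_0033256.jpg') == 'df'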
4.2 Model Accuracy
The random forest classifier is tested against several test datasets. The model is trained to get
a good accuracy. The model trained has the following specifications:
Fig.4.3 represents the specifications of the random forest classifier that is used for
classification.
The test cases and results of the model are given below.
Fig.4.4 shows the accuracy obtained by using the random forest classifier. An accuracy of 69.56%
was obtained on the test set. The accuracy is obtained by comparing the number of correctly predicted
values (Xp) with the total number of values classified (Yv). Therefore, the accuracy is calculated as
Accuracy = Xp / Yv.
The accuracy_score is equivalent to building a confusion matrix, which represents the number of
images classified correctly or incorrectly, and dividing the correctly classified count by the total.
The detailed calculation of the accuracy using the confusion matrix is explained below.
Fig.4.5:Confusion Matrix
Fig.4.5 illustrates the confusion matrix built for classification of skin diseases. The confusion
matrix built represents the number of images that are classified correctly and incorrectly. The
columns and rows of conf_mat are akiec, bcc, bkl, df, mel, nv, vasc. The conf_mat[0][0]
represents the count of akiec images that are correctly classified as akiec, conf_mat[0][1]
represents the count of akiec images that are incorrectly classified as bcc and so on. The
diagonal elements of the confusion matrix give the number of images that are classified
correctly.
In Fig.4.6, the accuracy of the model is calculated using the confusion matrix as defined in
eq. (3.1).
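A short sketch showing that the reported accuracy equals the confusion-matrix calculation of eq. (3.1); the label lists here are illustrative:
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ['mel', 'nv', 'nv', 'bkl', 'mel', 'nv']
y_pred = ['mel', 'nv', 'bkl', 'bkl', 'nv', 'nv']

cm = confusion_matrix(y_true, y_pred)
print(accuracy_score(y_true, y_pred))   # 4 correct out of 6
print(np.trace(cm) / np.sum(cm))        # same value: diagonal sum / total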
Table.4.2 shows the positive test cases, i.e. the test cases that executed correctly, whereas
Table.4.3 shows the test cases that resulted in wrong output.
1)
The code and image in Fig.4.7 represent the input image id given to the classifier, and the
output image and type of cancer obtained.
2)
Fig.4.8:Melanoma
The above Fig.4.8 represents the input image id given to the classifier and the output
obtained.
3)
Fig.4.9:Benign Keratosis
The above Fig.4.9 represents the input image id of a skin lesion and the kind of disease output
for it.
4)
The above Fig.4.10 represents the input image id of a skin lesion and the kind of disease output
for it.
5.CONCLUSION AND FUTURE SCOPE
5.1 Conclusion
The Melanoma Skin Lesion Classification project has been completed successfully and the goal of
the project has been achieved. A Random Forest classifier was successfully developed and fitted to
a training set of 8782 image samples. The trained model now predicts melanoma images with an
accuracy of up to 69.5%.
The model also predicts different types of skin disease in addition to benign and melanoma, such as
actinic keratoses and intraepithelial carcinoma / Bowen's disease (akiec), basal cell carcinoma (bcc),
dermatofibroma (df), melanocytic nevi (nv) and vascular lesions (angiomas, angiokeratomas,
pyogenic granulomas and haemorrhage) (vasc).
BIBLIOGRAPHY
[2] E. Zagrouba and W. Barhoumi, “A preliminary approach for the automated recognition of
malignant melanoma,” Image Analysis & Stereology, vol. 23, no. 2, pp. 121-135, 2004.
[4] T. Wadhawan, N. Situ, K. Lancaster, X. Yuan, and G. Zouridakis, “SkinScan: A portable library
for melanoma detection on handheld devices,” 2011.
[5] S. Amutha and R. Manjula Devi, “Early detection of malignant melanoma using random forest
algorithm,” International Journal for Trends in Engineering & Technology, vol. 13, no. 1, May 2016,
ISSN: 2349-9303.
[7] Health Discovery Corporation, “World's first SVM-based image analysis iPhone app for
melanoma risk assessment, MelApp,” launched by Health Discovery Corporation, 2011.
[10] Arati P. Chavan, D. K. Kamat and P. M. Patil, “Classification of skin cancers using image
processing,” International Journal of Advance Research in Electronics, Electrical & Computer
Science Applications of Engineering & Technology, vol. 2, no. 3, pp. 378-384, June 2014.
APPENDIX
#importing libraries
# importing the libraries necessary for extracting the data and performing operations on it
# (the imports below for plotting, PIL and scikit-learn were missing from the extracted listing)
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image as pil_image
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
#extracting dataset
data=pd.read_csv('HAM10000_metadata.csv')
#create a new directory named images to store all the extracted images
os.mkdir('images')
'''
Open the image zip files and extract all the images present in them to the new folder named images.
Later close the zip files and unlink them.
'''
zip_ref = zipfile.ZipFile('HAM10000_images_part_1.zip')
zip_ref.extractall('images/')
zip_ref.close()
os.unlink('HAM10000_images_part_1.zip')
zip_ref = zipfile.ZipFile('HAM10000_images_part_2.zip')
zip_ref.extractall('images/')
zip_ref.close()
os.unlink('HAM10000_images_part_2.zip')
os.listdir()
'''
Perform exploratory data analysis on the dataset HAM10000_metadata.csv and find if any null
values are present in the dataset
'''
data=data.drop(columns=['lesion_id'])
data.head()
data.isnull().sum()
#plotting a countplot
'''
Plot a count plot on localization attribute to analyse the areas that are more effected with the
disease. Plot a count plot on the type of disease to get the number of people affected with that
particular type of disease
'''
data['dx'].value_counts()
plt.figure(figsize=(10,8))
sns.countplot(data['localization'])
plt.xticks(rotation=90)
sns.countplot(data['dx_type'])
sns.countplot(data['dx'])
#Find the mean of the age field to fill the missing values in it with mean value
print(data['age'].mean())
print(data['age'].median())
data['age'].fillna(data['age'].mean(),inplace=True)
# (reconstructed helper) map each image_id to its file path in the extracted images folder;
# the original definition of image_path was missing from the extracted listing
image_path = {os.path.splitext(f)[0]: os.path.join('images', f) for f in os.listdir('images')}
data['path'] = data['image_id'].map(image_path.get)
data.head()
'''
Using PIL package open the image and convert it into array and display the image shape and
image. Calculate the count of people affected with disease based on sex
'''
image_example = np.asarray(pil_image.open(data['path'][0]))
image_example.shape
plt.imshow(image_example)
data['sex'].value_counts()
#Samples of 5 images in each class
# (reconstructed step, assumed resize dimensions taken from the test cases later in this appendix)
# load each image from its path, resize it to 120x90 and store the pixel array in the dataframe
data['image'] = data['path'].map(lambda x: np.asarray(pil_image.open(x).resize((120, 90))))
fig, axes = plt.subplots(7, 5, figsize=(20, 21))
#Group the images based on the disease type and display 5 sample images for each type
for nth_axis, (cell_type_name, cell_type_rows) in zip(axes, data.groupby('dx')):
    nth_axis[0].set_title(cell_type_name)
    for column_axis, (_, column_row) in zip(nth_axis, cell_type_rows.head(5).iterrows()):
        column_axis.imshow(column_row['image'])
        column_axis.axis('off')
plt.imshow(data['image'][0])
data['image'][0].shape
'''
The data is divided into features and target where features all fields other than dx whereas the
target contains dx field
'''
features = data.drop(['dx'],axis=1)
target = data['dx']
'''
Split the features and target into training and testing sets by taking the test size as 0.02. Convert the
training and testing images to arrays and print their shapes.
'''
# (reconstructed step) the split itself was missing from the extracted listing;
# random_state=42 is an assumed value matching the later validation split
X_TRAIN, X_TEST, Y_TRAIN, Y_TEST = train_test_split(features, target, test_size=0.02, random_state=42)
x_train = np.asarray(X_TRAIN['image'].tolist())
x_test = np.asarray(X_TEST['image'].tolist())
print(x_train.shape)
print(x_test.shape)
print(Y_TRAIN.shape)
print(Y_TEST.shape)
train_img, val_img, train_labels, val_labels = train_test_split(x_train, Y_TRAIN, test_size=0.1, random_state=42)
train_img.shape, val_img.shape, x_test.shape
le = LabelEncoder()
le.fit(train_labels)
train_labels_enc = le.transform(train_labels)
val_labels_enc = le.transform(val_labels)
print(train_labels[:10], train_labels_enc[:10])
#Random Forest
clf = RandomForestClassifier()
#reshaping the images to fit into random forest classifier
'''
Reshape the training image size by converting them from 4-dimensional to 2-dimensional to fit it
into the random forest classifier
'''
train_img = train_img.reshape(train_img.shape[0], train_img.shape[1]*train_img.shape[2]*train_img.shape[3])
'''
Reshaping the validation data from 4-dimensional to 2-dimensional to fit it into the random forest
classifier
'''
val_img = val_img.reshape(val_img.shape[0], val_img.shape[1]*val_img.shape[2]*val_img.shape[3])
clf.fit(train_img, train_labels)
preds = clf.predict(train_img)
print("Accuracy:", accuracy_score(train_labels, preds))
'''
Predict the type of disease for given validation image using the trained model. Calculate the
accuracy of the predicted image by comparing the obtained output with the actual output
'''
preds = clf.predict(val_img)
print("Accuracy:", accuracy_score(val_labels,preds))
#Confusion Matrix
'''
Display the confusion matrix to analyse the number of images that are classified correctly and the
number of images that are classified as another type of disease.
'''
conf_mat = confusion_matrix(val_labels, preds)
print(conf_mat)
'''
The accuracy using confusion matrix is calculated by comparing the sum of diagonal elements
of confusion matrix with the sum of total elements present in the matrix
'''
(conf_mat[0,0]+conf_mat[1,1]+conf_mat[2,2]+conf_mat[3,3]+conf_mat[4,4]+conf_mat[5,5]+conf_mat[6,6])/np.sum(conf_mat)
#Classification report
print(classification_report(val_labels, preds))
test_image = np.asarray(pil_image.open('images\ISIC_0030586.jpg'))
print('Original Shape of image is : ',test_image.shape)
plt.imshow(test_image)
resized_image = np.asarray(pil_image.open('images\ISIC_0030586.jpg').resize((120,90)))
image_array = np.asarray(resized_image.tolist())
test_image = image_array.reshape(1,90,120,3)
test_image.shape
test_image = test_image.reshape(test_image.shape[0], test_image.shape[1]*test_image.shape[2]*test_image.shape[3])
prediction_class = clf.predict(test_image)
print(prediction_class)
#TestCase2
test_image = np.asarray(pil_image.open('images\ISIC_0027043.jpg'))
plt.imshow(test_image)
resized_image = np.asarray(pil_image.open('images\ISIC_0027043.jpg').resize((120,90)))
image_array = np.asarray(resized_image.tolist())
test_image = image_array.reshape(1,90,120,3)
test_image = test_image.reshape(test_image.shape[0], test_image.shape[1]*test_image.shape[2]*test_image.shape[3])
prediction_class = clf.predict(test_image)
print(prediction_class)
#TestCase3
test_image = np.asarray(pil_image.open('images\ISIC_0024375.jpg'))
plt.imshow(test_image)
resized_image = np.asarray(pil_image.open('images\ISIC_0024375.jpg').resize((120,90)))
image_array = np.asarray(resized_image.tolist())
test_image = image_array.reshape(1,90,120,3)
test_image = test_image.reshape(test_image.shape[0], test_image.shape[1]*test_image.shape[2]*test_image.shape[3])
prediction_class = clf.predict(test_image)
print(prediction_class)