© 2019 IJRAR May 2019, Volume 6, Issue 2
www.ijrar.org (E-ISSN 2348-1269, P- ISSN 2349-5138)
MACHINE LEARNING FOR DISEASE
PREDICTION BY USING NEURAL NETWORKS
P. Sunanda
Asst. Professor
Department of Computer Science & Engineering,
G. Pulla Reddy Engineering College (Autonomous), GPREC, Kurnool, India
Abstract: With the rapid growth of data in the biomedical and health care communities, accurate analysis of medical data
benefits early disease detection, patient care, and community services. However, analysis accuracy is reduced when the
medical data is incomplete. Moreover, different regions exhibit unique characteristics of certain regional diseases, which may weaken
the prediction of disease outbreaks. Machine learning algorithms are used for effective prediction of chronic diseases such as Cardiac
Arrhythmia, a group of conditions in which the electrical activity of the heart is irregular, or is faster or slower than normal. It is
the leading cause of death for both men and women in the world. For optimization, the Stochastic Gradient Descent algorithm is used.
Index Terms - Machine Learning, Neural Networks, Stochastic Gradient Descent.
I. INTRODUCTION
1.1 Machine Learning
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and
improve automatically from experience without being explicitly programmed. Machine learning focuses on the development of computer programs
that can access data and use it to learn for themselves. The process of learning begins with observations of data, such as direct experience
or instruction, in order to look for patterns in the data and make better decisions in the future based on the examples provided. The
primary aim is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Fig 1.1 Machine Learning Classification
Machine learning algorithms can be divided into 2 broad categories.
1. Supervised Learning
2. Unsupervised Learning
1. Supervised learning is useful in cases where a property (label) is available for a certain dataset (training set), but is missing and
needs to be predicted for other instances. Some of the supervised learning algorithms include Decision trees, Naive Bayes classifier,
Nearest Neighbor algorithm, Support Vector Machines etc.
1) Decision tree learning uses a decision tree (as a predictive model) to go from observations about an item (represented in the
branches) to conclusions about the item's target value (represented in the leaves).
2) Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive)
independence assumptions between the features.
3) K-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input
consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or
regression:
In k-NN classification, the output is a class membership.
In k-NN regression, the output is the property value for the object.
2. Unsupervised learning is useful in cases where the challenge is to discover implicit relationships in a given unlabeled dataset.
Some of the unsupervised learning algorithms include k-means clustering, hierarchical clustering, hidden Markov models, etc.
1) k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis
in data mining. It aims to partition n observations into k clusters in which each observation belongs to the
cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi
cells.
2) In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster
analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
IJRAR19K1534
International Journal of Research and Analytical Reviews (IJRAR)www.ijrar.org
Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as
one moves up the hierarchy.
Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy. In general, the merges and splits are determined in a greedy manner. The results of hierarchical
clustering are usually presented in a dendrogram.
3) Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov
process with unobserved (i.e. hidden) states.
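As a rough illustration of the k-NN description earlier in this section, the following Python sketch predicts a class (majority vote) and a property value (mean) from the k closest training examples. The function names (k_nearest, knn_classify, knn_regress), the toy data, and the choice of Euclidean distance are assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """Return the k training pairs (features, target) closest to query."""
    # Euclidean distance is an assumption; other metrics are possible.
    return sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(train, query, k=3):
    """Classification: the output is a class membership (majority vote)."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Regression: the output is a property value (mean of the neighbors)."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

# Hypothetical toy data, not taken from the paper.
labelled = [((1.0, 1.0), "normal"), ((1.2, 0.8), "normal"),
            ((4.0, 4.2), "arrhythmia"), ((4.1, 3.9), "arrhythmia"),
            ((3.8, 4.0), "arrhythmia")]
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_class = knn_classify(labelled, (4.0, 4.0), k=3)
predicted_value = knn_regress(valued, (2.1,), k=3)
```

The same neighbor search serves both tasks; only the way the neighbors' targets are combined differs.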
1.1.1 Why Do Machine Learning On Different Data?
Traditional analytics tools are not well suited to capturing the full value of data. The volume of data is too large for
comprehensive analysis, and the range of potential correlations and relationships between disparate data sources (from back-end
customer databases to live web-based click streams) is too great for any analyst to test all hypotheses and derive all the value
buried in the data.
Basic analytical methods used in business intelligence and enterprise reporting tools reduce to reporting sums, counts, and simple
averages, and to running SQL queries. Online analytical processing is merely a systematized extension of these basic analytics that still
relies on a human to direct activities and specify what should be calculated. Unlike traditional analysis, machine learning thrives on
growing datasets. The more data is fed into a machine learning system, the more it can learn and the higher the quality of the
insights it produces.
1.2 Neural Networks
Neural networks are an example of machine learning, where the output of the program can change as it learns. A neural
network can be trained and improves with each example, but the larger the neural network, the more examples it needs to perform
well - often needing millions or billions of examples in the case of deep learning.
A network starts with an input, somewhat like a sensory organ. Information then flows through layers of neurons, where each
neuron is connected to many other neurons. If a particular neuron receives enough stimuli, it sends a message through its axon to any
other neuron it is connected to. Similarly, an artificial neural network has an input layer of data, one or more hidden layers of
classifiers, and an output layer. Each node in each hidden layer is connected to a node in the next layer. When a node receives
information, it sends some amount of it along to the nodes it is connected to. The amount is determined by a mathematical function
called an activation function, such as sigmoid or tanh.
Artificial neural networks work in a very similar manner. A network takes several inputs, processes them through multiple neurons in
multiple hidden layers, and returns the result through an output layer. This result estimation process is technically known as Forward Propagation.
Next, the result is compared with the actual output. The task is to make the output of the neural network as close as possible to the
actual (desired) output. Each neuron contributes some error to the final output. To reduce the error, the weights of the neurons
that contribute more to the error are reduced; this happens while traveling back through the neurons of the network and finding where
the error lies. This process is known as Backward Propagation.
To reduce the number of iterations needed to minimize the error, neural networks use a common algorithm known as
Gradient Descent, which helps to optimize the task quickly and efficiently.
1.2.1 Components of Neural Networks
Weighting Factors: A neuron usually receives many simultaneous inputs. Each input has its own relative weight which gives the
input the impact that it needs on the processing element's summation function. These weights perform the same type of function
as do the varying synaptic strengths of biological neurons. Weights are adaptive coefficients within the network that determine
the intensity of the input signal as registered by the artificial neuron. They are a measure of an input's connection strength. These
strengths can be modified in response to various training sets and according to a network's specific topology or through its
learning rules.
Summation Function: The first step in a processing element's operation is to compute the weighted sum of all of the inputs.
Mathematically, the inputs and the corresponding weights are vectors which can be represented as (i1, i2, ..., in) and (w1, w2, ...,
wn). The total input signal is the dot, or inner, product of these two vectors. This simple summation function is found by
multiplying each component of the i vector by the corresponding component of the w vector and then adding up all the products:
input1 = i1 * w1, input2 = i2 * w2, etc., are added as input1 + input2 + ... + inputn. The result is a single number, not a
multi-element vector. Geometrically, the inner product of two vectors can be considered a measure of their similarity. If the vectors
point in the same direction, the inner product is maximum; if the vectors point in opposite directions (180 degrees out of phase),
their inner product is minimum.
Transfer Function: The result of the summation function, almost always the weighted sum, is transformed to a working output
through an algorithmic process known as the transfer function. In the transfer function the summation total can be compared with
some threshold to determine the neural output. If the sum is greater than the threshold value, the processing element generates a
signal. If the sum of the input and weight products is less than the threshold, no signal (or some inhibitory signal) is generated.
Both types of response are significant.
Output Function (Competition): Each processing element is allowed one output signal which it may output to hundreds of other
neurons. This is just like the biological neuron, where there are many inputs and only one output action. Normally, the output is
directly equivalent to the transfer function's result.
Error Function and Back-Propagated Value: In most learning networks the difference between the current output and the
desired output is calculated. This raw error is then transformed by the error function to match particular network architecture. The
most basic architectures use this error directly, but some square the error while retaining its sign, some cube the error, while the
other paradigms modify the raw error to fit their specific purposes. The artificial neuron's error is then typically propagated into
the learning function of another processing element. This error term is sometimes called the current error.
The current error is typically propagated backwards to a previous layer. Yet, this back-propagated value can be either the current
error, the current error scaled in some manner (often by the derivative of the transfer function), or some other desired output
depending on the network type. Normally, this back-propagated value, after being scaled by the learning function, is multiplied
against each of the incoming connection weights to modify them before the next learning cycle.
Learning Function: The purpose of the learning function is to modify the variable connection weights on the inputs of each
processing element according to some neural based algorithm. This process of changing the weights of the input connections to
achieve some desired result can also be called the adaption function, as well as the learning mode.
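The summation and transfer functions described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the input and weight values, and the threshold of 0.05 are hypothetical and do not come from the paper.

```python
def weighted_sum(inputs, weights):
    """Summation function: inner product of the input and weight vectors."""
    return sum(i * w for i, w in zip(inputs, weights))

def threshold_transfer(total, threshold=0.0):
    """Transfer function: emit a signal (1) only if the sum exceeds the threshold."""
    return 1 if total > threshold else 0

inputs = [0.5, -1.0, 2.0]    # hypothetical input signals i1..i3
weights = [0.4, 0.3, 0.1]    # hypothetical connection strengths w1..w3

total = weighted_sum(inputs, weights)            # 0.2 - 0.3 + 0.2
output = threshold_transfer(total, threshold=0.05)

# Inner product as a similarity measure: same direction vs. opposite direction.
same = weighted_sum([1.0, 2.0], [1.0, 2.0])
opposite = weighted_sum([1.0, 2.0], [-1.0, -2.0])
```

For vectors of fixed length, the inner product is largest when they point the same way (here 5.0) and smallest when they point in opposite directions (here -5.0), which matches the geometric reading given in the text.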
1.2.2 Multi Layer Perceptron and its basics
Perceptron
Just as atoms are the basic building blocks of any material on earth, the basic building unit of a neural network is the perceptron.
A perceptron can be understood as anything that takes multiple inputs and produces one output.
There are three ways of creating input-output relationships:
1. By directly combining the input and computing the output based on a threshold value. For example: Take x1=0, x2=1, x3=1 and
setting a threshold =0. So, if x1+x2+x3>0, the output is 1 otherwise 0. You can see that in this case, the perceptron calculates the
output as 1.
2. Next, add weights to the inputs. Weights give importance to an input. For example, you assign w1=2, w2=3 and w3=4 to x1, x2 and
x3 respectively. To compute the output, we will multiply input with respective weights and compare with threshold value as w1*x1 +
w2*x2 + w3*x3 > threshold. These weights assign more importance to x3 in comparison to x1 and x2.
3. Next, add a bias: each perceptron also has a bias, which can be thought of as a measure of how flexible the perceptron is. It is
similar to the constant b of a linear function y = ax + b. It allows us to move the line up and down to fit the prediction to the data
better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. For example, a perceptron with
two inputs requires three weights: one for each input and one for the bias. For the three-input example above, the linear representation
of the input becomes w1*x1 + w2*x2 + w3*x3 + 1*b.
However, all of this is still linear, which is what perceptrons used to be. So, the perceptron was evolved into what is now called an
artificial neuron. A neuron applies a non-linear transformation (an activation function) to the inputs and biases.
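The three stages above (plain sum, weighted sum, weighted sum plus bias) can be sketched as follows. The inputs and weights are the ones used in the text; the threshold of 5 and the bias of -3 in the last two calls are hypothetical values chosen only to make the effect of the bias visible.

```python
def perceptron(inputs, weights=None, bias=0.0, threshold=0.0):
    """Return 1 if the (weighted) sum of inputs plus bias exceeds threshold, else 0."""
    if weights is None:
        weights = [1.0] * len(inputs)   # stage 1: inputs combined directly
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > threshold else 0

x = [0, 1, 1]   # x1, x2, x3 from the text

# Stage 1: 0 + 1 + 1 = 2 > 0
out_plain = perceptron(x)
# Stage 2: 2*0 + 3*1 + 4*1 = 7 > 5 (hypothetical threshold)
out_weighted = perceptron(x, weights=[2, 3, 4], threshold=5)
# Stage 3: 7 + (-3) = 4, which no longer exceeds 5 (hypothetical bias)
out_biased = perceptron(x, weights=[2, 3, 4], bias=-3.0, threshold=5)
```

Changing only the bias flips the decision from 1 to 0, which is the "moving the line up and down" effect described in step 3.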
What is an activation function?
An activation function takes the weighted sum of the inputs (w1*x1 + w2*x2 + w3*x3 + 1*b) as its argument and returns the output
of the neuron. Writing the constant 1 as x0 and the bias b as w0, the neuron's output a is:
$$a = f\left(\sum_{i=0}^{n} w_i x_i\right)$$
The activation function is mostly used to apply a non-linear transformation, which allows the network to fit non-linear hypotheses or to
estimate complex functions. There are multiple activation functions, such as sigmoid, tanh, and many others.
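A minimal sketch of a single neuron with the bias folded into the sum as w0 * x0 (with x0 = 1) is shown below. The helper name neuron and the bias value of -7 are assumptions for illustration; the weights and inputs reuse the earlier example.

```python
import math

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute a = f(sum_i w_i * x_i) with x0 = 1 and w0 = bias."""
    xs = [1.0] + list(inputs)    # prepend the constant input x0 = 1
    ws = [bias] + list(weights)  # prepend the bias as weight w0
    z = sum(w * x for w, x in zip(ws, xs))
    return activation(z)

# Hypothetical bias chosen so that the weighted sum is exactly zero:
# -7 + 2*0 + 3*1 + 4*1 = 0
a_sigmoid = neuron([0, 1, 1], [2, 3, 4], bias=-7.0, activation=sigmoid)
a_tanh = neuron([0, 1, 1], [2, 3, 4], bias=-7.0, activation=math.tanh)
```

At a weighted sum of zero the sigmoid returns 0.5 and tanh returns 0.0, the midpoints of their respective output ranges.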
Multi-layer perceptron
So far, the network considered has a single input layer consisting of 3 input nodes (x1, x2 and x3) and an output layer consisting of a
single neuron. For practical purposes, however, such a single-layer network can do only so much. An MLP consists of multiple layers, called
Hidden Layers, stacked in between the Input Layer and the Output Layer as shown below.
Fig 1.2 Multilayer perceptron
The image above shows just a single hidden layer (in green), but in practice an MLP can contain multiple hidden layers. Another point to
remember is that all the layers of an MLP are fully connected; that is, every node in a layer (except the input and the output
layer) is connected to every node in the previous layer and the following layer. The Gradient Descent algorithm is used to
minimize the error.
Full Batch and Stochastic Gradient Descent
Both variants of Gradient Descent update the weights of the MLP using the same update
rule; the difference lies in the number of training samples used to update the weights and biases. Full Batch Gradient Descent,
as the name implies, uses all the training data points for each single update of the weights, whereas Stochastic Gradient Descent uses one
or more samples, but never the entire training data, for each update. Consider a simple example: a dataset of 10 data points with
two weights w1 and w2.
Full Batch uses all 10 data points (the entire training data) to calculate the change in w1 (Δw1) and the change in w2 (Δw2), and then
updates w1 and w2. SGD, in contrast, uses the 1st data point to calculate Δw1 and Δw2 and updates w1 and w2. Next,
when the 2nd data point is used, the calculation works on the already updated weights.
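The 10-point, two-weight example above can be sketched as follows. The linear model, the squared-error loss, the generated dataset, and the learning rate of 0.1 are all assumptions made for illustration; the point is only to show that one pass over the data yields a single update in the full-batch case and ten updates in the stochastic case.

```python
import random

def gradient(w, batch):
    """Mean gradient of the squared error 0.5 * (pred - y)^2 over a batch."""
    g1 = g2 = 0.0
    for (x1, x2), y in batch:
        err = (w[0] * x1 + w[1] * x2) - y
        g1 += err * x1
        g2 += err * x2
    return g1 / len(batch), g2 / len(batch)

def step(w, batch, lr):
    """One weight update: w <- w - lr * gradient."""
    g1, g2 = gradient(w, batch)
    return [w[0] - lr * g1, w[1] - lr * g2]

# Hypothetical dataset: 10 points generated from y = 2*x1 - 1*x2.
rng = random.Random(0)
data = []
for _ in range(10):
    x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append(((x1, x2), 2.0 * x1 - 1.0 * x2))

def full_batch_epoch(w, lr=0.1):
    """Use all 10 points to compute one update of w1 and w2."""
    return step(w, data, lr), 1

def sgd_epoch(w, lr=0.1):
    """Update after every single point; each update starts from the previous one."""
    updates = 0
    for point in data:
        w = step(w, [point], lr)
        updates += 1
    return w, updates

w_batch, n_batch = full_batch_epoch([0.0, 0.0])
w_sgd, n_sgd = sgd_epoch([0.0, 0.0])
```

After one pass, n_batch is 1 and n_sgd is 10: the stochastic variant has already moved the weights ten times, each time starting from the result of the previous update.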
1.2.3 Steps involved in Neural Network methodology
The step by step building methodology of Neural Network (MLP with one hidden layer, similar to above-shown architecture) is as
follows. At the output layer, only one neuron is used to solve a binary classification problem (predict 0 or 1).
First, look at the broad steps. We take input and output:
• X as an input matrix
• y as an output matrix
1) Initialize weights and biases with random values (This is one time initiation. In the next iteration, we will use updated weights, and
biases). Let us define:
wh as weight matrix to the hidden layer
bh as bias matrix to the hidden layer
wout as weight matrix to the output layer
bout as bias matrix to the output layer
2) We take matrix dot product of input and weights assigned to edges between the input and hidden layer then add biases of the
hidden layer neurons to respective inputs, this is known as linear transformation:
hidden_layer_input= matrix_dot_product(X,wh) + bh
3) Perform non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)).
hiddenlayer_activations = sigmoid(hidden_layer_input)
4) Perform a linear transformation on hidden layer activation (take matrix dot product with weights and add a bias of the output layer
neuron) then apply an activation function (again used sigmoid, but you can use any other activation function depending upon your
task) to predict the output
output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
output = sigmoid(output_layer_input)
All of the above steps are known as "Forward Propagation".
5) Compare the prediction with the actual output and calculate the gradient of the error (Actual – Predicted).
The error is the squared loss ((y – output)^2)/2, whose negative gradient with respect to the output is:
E = y – output
6) Compute the slope/gradient of the hidden and output layer neurons (to compute the slope, we calculate the derivative of the non-linear
activation at each layer for each neuron). For an activation x that is already the output of a sigmoid, the gradient of the sigmoid is x * (1 – x).
slope_output_layer=derivatives_sigmoid(output)
slope_hidden_layer=derivatives_sigmoid(hiddenlayer_activations)
7) Compute change factor(delta) at output layer, dependent on the gradient of error multiplied by the slope of output layer activation
d_output = E * slope_output_layer
8) At this step, the error will propagate back into the network which means error at hidden layer. For this, we will take the dot product
of output layer delta with weight parameters of edges between the hidden and output layer (wout.T).
Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)
9) Compute change factor(delta) at hidden layer, multiply the error at hidden layer with slope of hidden layer activation
d_hiddenlayer=Error_at_hidden_layer * slope_hidden_layer
10) Update weights at the output and hidden layer: The weights in the network can be updated from the errors calculated for training
example(s).
wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output) * learning_rate
wh = wh + matrix_dot_product(X.Transpose, d_hiddenlayer) * learning_rate
(learning_rate: the amount by which the weights are updated is controlled by a configuration parameter called the learning rate.)
11) Update biases at the output and hidden layer: The biases in the network can be updated from the aggregated errors at that neuron.
bias at output_layer = bias at output_layer + sum of delta of output_layer at row-wise * learning_rate
bias at hidden_layer = bias at hidden_layer + sum of delta of hidden_layer at row-wise * learning_rate
bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
bout = bout + sum(d_output, axis=0) * learning_rate
Steps 5 to 11 are known as "Backward Propagation".
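Steps 1 to 11 above can be put together into a short runnable sketch. It assumes NumPy is available and uses a hypothetical toy dataset (4 samples, 3 features), 3 hidden neurons, a learning rate of 0.5, and 2000 iterations, none of which come from the paper. Each iteration uses all samples at once, so this particular sketch performs full-batch updates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def derivatives_sigmoid(a):
    # a is already a sigmoid output, so the slope is a * (1 - a).
    return a * (1.0 - a)

# Hypothetical toy data: 4 samples, 3 features, binary target.
X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]], dtype=float)
y = np.array([[1], [1], [0], [0]], dtype=float)

rng = np.random.default_rng(0)
hidden_neurons, learning_rate, epochs = 3, 0.5, 2000

# Step 1: initialize weights and biases with random values.
wh = rng.uniform(-1, 1, size=(X.shape[1], hidden_neurons))
bh = rng.uniform(-1, 1, size=(1, hidden_neurons))
wout = rng.uniform(-1, 1, size=(hidden_neurons, 1))
bout = rng.uniform(-1, 1, size=(1, 1))

losses = []
for _ in range(epochs):
    # Steps 2-4: forward propagation.
    hiddenlayer_activations = sigmoid(X @ wh + bh)
    output = sigmoid(hiddenlayer_activations @ wout + bout)

    # Step 5: error term (negative gradient of the squared loss).
    E = y - output
    losses.append(float(np.mean(E ** 2) / 2))

    # Steps 6-9: slopes and deltas at the output and hidden layers.
    d_output = E * derivatives_sigmoid(output)
    error_at_hidden_layer = d_output @ wout.T
    d_hiddenlayer = error_at_hidden_layer * derivatives_sigmoid(hiddenlayer_activations)

    # Steps 10-11: update weights and biases.
    wout += hiddenlayer_activations.T @ d_output * learning_rate
    wh += X.T @ d_hiddenlayer * learning_rate
    bout += np.sum(d_output, axis=0, keepdims=True) * learning_rate
    bh += np.sum(d_hiddenlayer, axis=0, keepdims=True) * learning_rate

# Threshold the final activations to obtain 0/1 predictions.
predictions = (output > 0.5).astype(int)
```

The list losses records the mean squared loss per iteration, so comparing losses[0] with losses[-1] gives a quick check that training reduced the error. To turn the loop into stochastic gradient descent, the updates would be computed from one sample (or a small subset) of X and y at a time instead of the whole matrix.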
II EXISTING SYSTEM
The limitations of the existing system are:
Updating the model so frequently is more computationally expensive than other configurations of gradient descent, taking
significantly longer to train models on large datasets.
The frequent updates can result in a noisy gradient signal, which may cause the model parameters and in turn the model error to
jump around (have a higher variance over training epochs).
The existing system implements supervised learning algorithms, which are not as efficient as neural networks.
III PROPOSED SYSTEM
The existing work uses supervised machine learning algorithms, whereas the proposed system implements Neural Networks.
For optimization, the stochastic gradient descent algorithm is used, which is computationally efficient. For
classification of data, a Multi Layer Perceptron (MLP) classifier is used.
3.1 Design and Implementation
Design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified
requirements.
3.1.1 Design
The System Design Document describes the system requirements, operating environment, system and subsystem
architecture, files and database design, input formats, output layouts, human-machine interfaces, detailed design, processing logic, and
external interfaces.
3.1.2 Data Flow Diagram
Fig 3.1 Data flow diagram
3.1.3 Activity Diagram
Fig 3.2 Activity diagram
3.2 Implementation
3.2.1 Implementation Models
Optimization
Classification of data using MLP Classifier
Prediction
Optimization
Stochastic Gradient Descent Algorithm
Stochastic gradient descent is a variation of the gradient descent algorithm that calculates the error and updates the model for each
example in the training dataset. The update of the model for each training example means that stochastic gradient descent is often
called an online machine learning algorithm.
Fig 3.3 Stochastic Gradient Descent Algorithm
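In symbols, the per-example update described above can be written as follows. The notation is an assumption introduced here for illustration and does not appear in the original: w is the vector of model weights, η is the learning rate, N is the number of training examples, and L_i is the loss on the i-th training example.

```latex
w \leftarrow w - \eta \, \nabla_{w} L_i(w), \qquad i = 1, \dots, N
```

Each training example triggers one such update, so a single pass over the training set changes the weights N times.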
3.2.2 Installation Of Python On Ubuntu
Ubuntu 16.04 ships with both Python 3 and Python 2 pre-installed. To make sure that our versions are up-to-date, let's update and
upgrade the system with apt-get:
$ sudo apt-get update
$ sudo apt-get -y upgrade
The -y flag will confirm that we are agreeing for all items to be installed, but depending on your version of Linux, you may need to
confirm additional prompts as your system updates and upgrades. Once the process is complete, we can check the version of Python 3
that is installed in the system by typing:
$ python3 -V
You will receive output in the terminal window that will let you know the version number. The version number may vary, but it will
look similar to this:
Python 3.5.2
To manage software packages for Python, let's install pip:
$ sudo apt-get install -y python3-pip
A tool for use with Python, pip installs and manages programming packages we may want to use in our development projects. You
can install Python packages by typing:
$ pip3 install package_name
Here, package_name can refer to any Python package or library, such as Django for web development or NumPy for scientific
computing. So if you would like to install NumPy, you can do so with the command pip3 install numpy.
There are a few more packages and development tools to install to ensure that we have a robust set-up for our programming
environment:
$ sudo apt-get install build-essential libssl-dev libffi-dev python-dev
KERAS
Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was
developed with a focus on enabling fast experimentation. Keras allows for easy and fast prototyping (through user friendliness,
modularity, and extensibility). It supports both convolutional networks and recurrent networks, as well as combinations of the two. It runs
seamlessly on CPU and GPU. Keras is compatible with Python 2.7-3.6.
Guiding principles
User friendliness: Keras is an API designed for human beings, not machines. It puts user experience front and center. Keras
follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions
required for common use cases, and it provides clear and actionable feedback upon user error.
Modularity: A model is understood as a sequence or a graph of standalone, fully-configurable modules that can be plugged
together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes,
activation functions, and regularization schemes are all standalone modules that can be combined to create new models.
Easy extensibility: New modules are simple to add (as new classes and functions), and existing modules provide ample
examples. To be able to easily create new modules allows for total expressiveness, making Keras suitable for advanced research.
Work with Python: No separate models configuration files in a declarative format. Models are described in Python code, which
is compact, easier to debug, and allows for ease of extensibility.
Installation
Before installing Keras, please install one of its backend engines: TensorFlow, Theano, or CNTK. We recommend the TensorFlow
backend.
You may also consider installing the following optional dependencies:
cuDNN (recommended if you plan on running Keras on GPU).
HDF5 and h5py (required if you plan on saving Keras models to disk).
graphviz and pydot (used by visualization utilities to plot model graphs).
Then, you can install Keras itself.
There are two ways to install Keras; the recommended one is to install it from PyPI:
sudo pip install keras
If you are using a virtualenv, you may want to avoid using sudo:
pip install keras
Using a different backend than TensorFlow
By default, Keras will use TensorFlow as its tensor manipulation library; a different backend can be selected through the Keras backend configuration.
TensorFlow installation through native pip
$ sudo apt-get install python-pip python-dev # for Python 2.7
$ pip install tensorflow # Python 2.7; CPU support (no GPU support)
$ pip install numpy
$ pip install scipy
$ pip install matplotlib
IV RESULT ANALYSIS
Test Results
All the test cases mentioned above passed successfully. No defects encountered.
V CONCLUSION & FUTURE ENHANCEMENT
Chronic diseases such as Cardiac Arrhythmia are predicted by machine learning using Neural Networks (NN), with the Stochastic
Gradient Descent algorithm used for optimization. In future work, different optimization algorithms can be explored for better accuracy.