Context

Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year and accounting for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease, or who are at high cardiovascular risk due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease, need early detection and management, and this is where a machine learning model can be of great help.

Attribute Information

1. Age: age of the patient [years]

2. Sex: sex of the patient [M: Male, F: Female]

3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

4. RestingBP: resting blood pressure [mm Hg]

5. Cholesterol: serum cholesterol [mg/dl]

6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]

9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

10. Oldpeak: ST depression induced by exercise relative to rest [numeric value]

11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

12. HeartDisease: output class [1: heart disease, 0: normal]

Loading Libraries
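
The analysis below assumes a fairly standard Python stack; a minimal sketch of the imports used throughout (the library choices are assumptions based on what the later sections reference):

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling: statsmodels for logistic regression, scikit-learn for trees
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
```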

Looking at the first and last five rows of the dataset, everything looks uniform and in order.
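
A quick way to run that check, assuming the Kaggle CSV is saved as heart.csv (adjust the path as needed):

```python
# Load the dataset; the file name is an assumption
df = pd.read_csv("heart.csv")

# First and last five rows
print(df.head())
print(df.tail())
```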

EDA

Univariate Analysis

Visualizing the categorical variables with bar graphs.
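
One way to produce those bar graphs with seaborn, using the categorical columns from the attribute list above:

```python
cat_cols = ["Sex", "ChestPainType", "FastingBS", "RestingECG",
            "ExerciseAngina", "ST_Slope", "HeartDisease"]

fig, axes = plt.subplots(4, 2, figsize=(12, 16))
for ax, col in zip(axes.flat, cat_cols):
    sns.countplot(data=df, x=col, ax=ax)
    ax.set_title(col)
axes.flat[-1].set_visible(False)  # hide the unused eighth subplot
plt.tight_layout()
plt.show()
```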

Bivariate Analysis

Above, we see no significant age gap between positive and negative cases of heart disease for either men or women.

According to the table, men and women with higher maximum heart rates are less likely to have heart disease.

Cholesterol does not seem to be a significant indicator of heart disease.

Looking above, we see that older patients are at a slightly higher risk of heart disease.

Above, we see no significant difference in RestingBP in relation to heart disease.

Above, we see no significant difference in Cholesterol in relation to heart disease.

Above, we see that patients with a higher MaxHR are at lower risk of heart disease.

Let's look at the outliers in every numerical column.
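
A sketch using seaborn boxplots over the numeric columns:

```python
num_cols = ["Age", "RestingBP", "Cholesterol", "MaxHR", "Oldpeak"]

fig, axes = plt.subplots(1, len(num_cols), figsize=(18, 4))
for ax, col in zip(axes, num_cols):
    sns.boxplot(data=df, y=col, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```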

Outlier Treatment

Treating the outliers
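
A common treatment, assumed here, is capping each value at the 1.5 * IQR whiskers rather than dropping rows:

```python
def treat_outliers(data, col):
    """Cap values outside the 1.5 * IQR whiskers (winsorization)."""
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    data[col] = data[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return data

for col in num_cols:
    df = treat_outliers(df, col)
```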

Outliers have been treated.

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting a patient will not get heart disease when in reality the patient does (a false negative) - loss of the opportunity to treat the patient early.
  2. Predicting a patient will get heart disease when in reality the patient does not (a false positive) - loss of resources spent on unnecessary tests and follow-up.

Which loss is greater?

A missed diagnosis can cost a life, while a false alarm only costs extra tests, so false negatives are the greater loss.

How do we reduce this loss, i.e., how do we reduce false negatives?

By choosing Recall as the primary evaluation metric: maximizing Recall minimizes the number of actual heart disease cases the model misses.

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
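
A sketch of such helpers built on scikit-learn; the function names are illustrative, not the notebook's originals:

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

def model_performance(y_true, y_pred):
    """Return the main classification metrics as a one-row DataFrame."""
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Recall": [recall_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred)],
        "F1": [f1_score(y_true, y_pred)],
    })

def plot_confusion(y_true, y_pred):
    """Plot the confusion matrix as an annotated heatmap."""
    cm = confusion_matrix(y_true, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
```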

Data Preparation

Logistic Regression (with statsmodels library)
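
A sketch of the preparation and fit, assuming one-hot encoding with drop_first=True (which leaves the ST_Slope_Flat and ST_Slope_Up dummies discussed below) and a stratified 70/30 split:

```python
# One-hot encode categorical features; drop_first avoids the dummy-variable trap
X = pd.get_dummies(df.drop("HeartDisease", axis=1), drop_first=True).astype(float)
X = sm.add_constant(X)  # statsmodels needs an explicit intercept
y = df["HeartDisease"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

logit = sm.Logit(y_train, X_train).fit()
print(logit.summary())
```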

Observations

Additional Information on VIF

Multicollinearity
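
Variance inflation factors can be computed with statsmodels; a sketch over the encoded training features:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# One VIF per column; large values flag multicollinear features
vif = pd.Series(
    [variance_inflation_factor(X_train.values, i)
     for i in range(X_train.shape[1])],
    index=X_train.columns,
)
print(vif.sort_values(ascending=False))
```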

Dropping ST_Slope_Flat

Dropping ST_Slope_Up

Observations

Coefficient interpretations

Checking model performance on the training set

ROC-AUC

Optimal threshold using AUC-ROC curve
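
One common recipe, sketched here, picks the threshold that maximizes Youden's J statistic (TPR minus FPR):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities on the training set
train_probs = logit.predict(X_train)

fpr, tpr, thresholds = roc_curve(y_train, train_probs)
print("AUC:", roc_auc_score(y_train, train_probs))

# Threshold that maximizes TPR - FPR
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
print("Optimal threshold:", optimal_threshold)
```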

Let's use the Precision-Recall curve and see if we can find a better threshold.
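
A sketch that plots precision and recall against the threshold and picks the point where the two curves are closest:

```python
from sklearn.metrics import precision_recall_curve

prec, rec, thresholds = precision_recall_curve(y_train, train_probs)

# Threshold where precision and recall are closest to each other
idx = np.argmin(np.abs(prec[:-1] - rec[:-1]))
pr_threshold = thresholds[idx]
print("Precision-Recall threshold:", pr_threshold)

plt.plot(thresholds, prec[:-1], label="Precision")
plt.plot(thresholds, rec[:-1], label="Recall")
plt.axvline(pr_threshold, linestyle="--", color="grey")
plt.xlabel("Threshold")
plt.legend()
plt.show()
```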

Checking model performance on the training set

Using model with default threshold

Model performance summary

Conclusion

Decision Trees

We reuse the metric and confusion matrix helper functions defined earlier, so we don't repeat that code for each model.

Build Decision Tree Model
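
A minimal sketch, reusing the helper functions defined earlier:

```python
# Fit an unconstrained tree on the training data
d_tree = DecisionTreeClassifier(random_state=1)
d_tree.fit(X_train, y_train)

# Predictions used in the performance check below
y_train_pred = d_tree.predict(X_train)
print(model_performance(y_train, y_train_pred))
plot_confusion(y_train, y_train_pred)
```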

Checking model performance on the training set

Visualizing the Decision Tree

Model Improvement
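
One way to improve the overfit default tree is hyperparameter tuning; a sketch using GridSearchCV with recall as the scoring metric (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": np.arange(2, 8),
    "min_samples_leaf": [5, 10, 20, 25],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X_train, y_train)

tuned_tree = grid.best_estimator_
print(grid.best_params_)
```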

Checking performance on the training set

Visualizing the Decision Tree

Observations

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
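
Following scikit-learn's cost-complexity pruning recipe, a sketch of that step:

```python
# Compute the effective alphas of the pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas = path.ccp_alphas

# Fit one tree per effective alpha
clfs = []
for alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

print("Nodes in the last tree:", clfs[-1].tree_.node_count)  # expect 1
```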

Recall vs alpha for training and testing sets

Let's check the performance on the test set

Using the decision tree with default parameters

Using the hyperparameter tuned decision tree

Comparing Decision Tree models

Conclusion

Business Recommendations