The approach taken by my team to predict diabetes

Women in Data Science (WiDS) is a global movement to support data scientists around the world. WiDS runs a datathon every year that encourages participants to enhance their skills by working on a social impact challenge. The best part of the competition is the local workshops and online resources that help beginners advance their data science skills. Participating in the competition let me interact with several experienced data scientists and learn from them. This article is a brief explanation of the approach taken by my team.
Problem Description
The objective of the competition is to build a model that determines whether a patient admitted to an ICU has been diagnosed with a particular type of diabetes, diabetes mellitus. AUC-ROC (area under the ROC curve) is the metric used to evaluate the performance of the model.
Data Analysis
The dataset for the competition can be accessed from Kaggle. The data comes from MIT’s GOSSIS (Global Open Source Severity of Illness Score) initiative and consists of 180 features. According to the data dictionary, the features fall into 7 categories. Because of the large number of features, we analysed the data category by category. A closer look at the data shows that it is imbalanced.
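A quick way to see the imbalance is to check the distribution of the target column. A minimal sketch, assuming the training file name and the diabetes_mellitus target column from the competition:

```python
import pandas as pd

# Assumed file name for the competition's training data.
train = pd.read_csv("TrainingWiDS2021.csv")

# Share of each class in the target: the positive class (diabetes_mellitus = 1)
# is the minority, which is what makes the dataset imbalanced.
print(train["diabetes_mellitus"].value_counts(normalize=True))
```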

There are a lot of missing values in the feature set. We also observed that some features are missing in pairs: for example, d1_glucose_max and d1_glucose_min have the same number of missing values. In addition, some APACHE and vital-sign columns share a lot of common values.
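The sketch below shows how these two observations can be checked. The d1_glucose_max / d1_glucose_min pair comes from the text; glucose_apache is an assumed example of a matching APACHE column.

```python
# Missing-value counts per column; paired min/max columns such as
# d1_glucose_max and d1_glucose_min show identical counts.
missing = train.isnull().sum().sort_values(ascending=False)
print(missing[["d1_glucose_max", "d1_glucose_min"]])

# How often a first-day (d1) reading matches the related APACHE column.
# glucose_apache is one assumed pairing; other vitals follow the same pattern.
both = train[["d1_glucose_max", "glucose_apache"]].dropna()
print((both["d1_glucose_max"] == both["glucose_apache"]).mean())
```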

Data Cleaning
The most important part of a data science project is cleaning the data properly. Some rows have ‘age’ = 0, yet an interesting finding is that these patients have the height and weight of an adult. As this appears to be an error, all of these rows are dropped. The missing values of ‘ethnicity’ are filled with ‘Other/Unknown’. As there are only a few rows with a missing gender, that column is filled with the most common value, ‘M’. We tried to calculate the ‘bmi’ field from the corresponding height and weight, but this approach does not work for this dataset because every missing ‘bmi’ value has a missing ‘height’ or ‘weight’ associated with it. So, the missing values in these features are imputed with the mean value of the feature for the corresponding gender. The missing values of ‘icu_admit_source’ are filled using the ‘hospital_admit_source’ column.
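A minimal sketch of these cleaning steps, assuming the column names from the data dictionary. Filling icu_admit_source directly from hospital_admit_source is a simplification; in practice the hospital categories may need to be mapped onto the ICU categories first.

```python
# Drop the rows that report age = 0 (they have adult height/weight, so the
# age looks like a data entry error).
train = train[train["age"] != 0]

# Fill the categorical gaps described above.
train["ethnicity"] = train["ethnicity"].fillna("Other/Unknown")
train["gender"] = train["gender"].fillna("M")

# Impute height, weight and bmi with the mean for the patient's gender.
for col in ["height", "weight", "bmi"]:
    train[col] = train[col].fillna(train.groupby("gender")[col].transform("mean"))

# Fill icu_admit_source from hospital_admit_source where it is missing.
train["icu_admit_source"] = train["icu_admit_source"].fillna(
    train["hospital_admit_source"])
```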
As discussed in the data analysis section, the d1 (first-day) and APACHE variables share a lot of common values, so the missing values of the d1 variables are filled with the corresponding APACHE values.
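Sketched below with an illustrative subset of pairings; only d1_glucose_max appears in the post, and the APACHE column names are assumptions, not the full mapping we used.

```python
# Map each first-day column to the APACHE column tracking the same measurement.
d1_apache_pairs = {
    "d1_glucose_max": "glucose_apache",
    "d1_creatinine_max": "creatinine_apache",
    "d1_heartrate_max": "heart_rate_apache",
}

# Use the APACHE value wherever the d1 value is missing.
for d1_col, apache_col in d1_apache_pairs.items():
    train[d1_col] = train[d1_col].fillna(train[apache_col])
```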
There were some min and max columns where the min value was greater than the max value; we swapped those values. We also dropped all columns holding exactly the same values as another column. The data dictionary descriptions helped us fill in the missing values of some features using other columns.
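A sketch of the min/max swap and the duplicate-column drop, assuming the dataset's _min/_max naming convention:

```python
# Swap values wherever a *_min column exceeds its matching *_max column.
min_cols = [c for c in train.columns if c.endswith("_min")]
for min_col in min_cols:
    max_col = min_col[: -len("_min")] + "_max"
    if max_col in train.columns:
        swapped = train[min_col] > train[max_col]
        train.loc[swapped, [min_col, max_col]] = (
            train.loc[swapped, [max_col, min_col]].values)

# Drop columns that hold exactly the same values as another column
# (transposing a wide frame is slow, but fine for a one-off cleaning step).
duplicate_cols = train.columns[train.T.duplicated()]
train = train.drop(columns=duplicate_cols)
```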
We dropped all of the highly correlated columns in the dataset, along with columns that had too many missing values or added no value. Finally, we created dummy variables for all the categorical variables.
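A common way to do this is shown below; the 0.9 correlation cut-off is an assumed threshold, not the exact one we used.

```python
import numpy as np

# Drop one column from every pair whose absolute correlation exceeds 0.9.
corr = train.corr(numeric_only=True).abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
train = train.drop(columns=to_drop)

# One-hot encode the remaining categorical (object-typed) columns.
cat_cols = train.select_dtypes(include="object").columns
train = pd.get_dummies(train, columns=cat_cols)
```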
Feature Engineering
Due to limited domain knowledge, our team did not do a lot of feature engineering. Since the dataset has a large number of features, we removed all the highly correlated ones. My teammate Lia did a wonderful job by creating a new feature of high importance: an estimated value of kidney function.
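The post does not spell out how kidney function was estimated. The sketch below is an assumed reconstruction using the MDRD eGFR equation from serum creatinine, age, and gender; d1_creatinine_max as the creatinine source is also an assumption, and this step would run before the categorical columns are one-hot encoded.

```python
import numpy as np

def estimated_gfr(creatinine, age, gender):
    # Simplified MDRD estimate of glomerular filtration rate; an assumed
    # stand-in for the engineered kidney-function feature, not our exact code.
    gfr = 175 * np.power(creatinine, -1.154) * np.power(age, -0.203)
    return np.where(gender == "F", gfr * 0.742, gfr)

train["egfr"] = estimated_gfr(
    train["d1_creatinine_max"], train["age"], train["gender"])
```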
Classification
We implemented some baseline models to compare the performance of different algorithms. As the data is imbalanced, we tried to improve the models using oversampling techniques like SMOTE and ADASYN. Unfortunately, this did not give the expected results, so we focused on tree-based models. Models like XGBoost and LightGBM generally work really well in Kaggle competitions, and they performed well on this dataset too. Our final model was a hyperparameter-tuned LightGBM model, which scored 0.86 on the final leaderboard, a reasonably good result for a beginner.
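The sketch below shows the general shape of that final model: a LightGBM classifier evaluated with cross-validated AUC. The hyperparameter values are placeholders, not the tuned ones, and identifier columns would also be excluded from the features.

```python
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

X = train.drop(columns=["diabetes_mellitus"])
y = train["diabetes_mellitus"]

# Placeholder hyperparameters -- the real values came from tuning.
model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.02,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
)

# Evaluate with the competition metric (AUC-ROC) via 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Mean CV AUC:", scores.mean())
```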


Conclusion
It was a wonderful experience to participate in this competition. Public notebooks from experienced Kagglers are a great resource for learning new concepts and techniques. I recommend that all beginners participate in datathons to improve their data science skills. We also realized the importance of domain knowledge in the field of data science.