
Machine Learning Datasets and Descriptions



Blood Transfusion Service Center Dataset - Classification


Description :


This dataset was taken from a blood transfusion service center in Taiwan. It contains information about each blood donor: the number of months since the last donation, the total number of donations, the total volume of blood donated, and the number of months since the first donation. The dataset consists of 748 instances and 5 attributes. We can use it to predict whether a donor donated blood in March 2007.




Cervical Cancer Risk Factor Dataset - Classification and Clustering


Description :


The dataset was collected at Hospital Universitario de Caracas in Caracas, Venezuela. It contains detailed information on the habits, demographics and medical history of 858 patients. Of these, 55 patients were diagnosed with cervical cancer and 803 were healthy. Cervical cancer is the leading gynecological malignancy worldwide and, according to the WHO report, it is most common among women in developing countries.



Divorce Predictor Dataset - Classification


Description :


This dataset contains 54 questions about married life, together with the answers given by 170 people, of whom 84 were divorced and 86 were married. Not every question has the same impact on the outcome. Answers are on a 5-point scale (0 = Never, 1 = Rarely, 2 = Average, 3 = Often, 4 = Always). We can use this dataset to predict whether a married couple will divorce.



Student Performance Dataset - Classification and Regression


Description :


This dataset contains information about student performance in secondary education at two Portuguese schools. The attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets are provided, covering performance in two subjects: Mathematics (mat) and Portuguese language (por). Both datasets were modeled under binary and five-level classification tasks and under regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the paper source for more details).
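To make the binary and five-level framing of G3 concrete, here is a minimal sketch in Python. The cut points are assumptions for illustration only: a 0-20 grade scale, a pass mark of 10, and five bands labeled A to F. The function names are hypothetical; check the paper source for the exact class definitions used there.

```python
# Hypothetical helpers: the thresholds below are assumed, not taken from this page.

def binary_label(g3, pass_mark=10):
    """Map a final grade to 'pass' or 'fail' (assumed pass mark of 10)."""
    return "pass" if g3 >= pass_mark else "fail"


def five_level_label(g3, bounds=((16, "A"), (14, "B"), (12, "C"), (10, "D"))):
    """Map a final grade to one of five bands (assumed lower bounds)."""
    for lower, label in bounds:
        if g3 >= lower:
            return label
    return "F"  # anything below the lowest bound


if __name__ == "__main__":
    for grade in (3, 10, 13, 15, 19):  # made-up example grades
        print(grade, binary_label(grade), five_level_label(grade))
```

The regression task simply predicts G3 as a number, so no binning is needed there.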


Forest Fires Dataset - Regression


Description :


This dataset contains information about forest fires and is used to predict the burned area of a fire from meteorological data. In [Cortez and Morais, 2007], the output 'area' was first transformed with a ln(x+1) function. Then, several data mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using 30 runs of 10-fold cross-validation. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and 95% confidence interval using a Student's t-distribution). The best RMSE was attained by the naive mean predictor. An analysis of the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model is better at predicting small fires, which are the majority.
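The ln(x+1) transform, its inverse, and the two metrics can be sketched in a few lines of Python. This is only an illustration, not the code used in the paper: the function names and the sample values are made up, and MAD is read here as the mean absolute deviation between actual and predicted values.

```python
import math


def to_log_area(area):
    """Forward transform: ln(x + 1)."""
    return math.log1p(area)


def from_log_area(value):
    """Inverse transform: exp(y) - 1, mapping predictions back to area units."""
    return math.expm1(value)


def mad(actual, predicted):
    """Mean absolute deviation between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


if __name__ == "__main__":
    actual_area = [0.0, 0.5, 2.0, 40.0]      # made-up burned areas
    predicted_log = [0.1, 0.3, 1.0, 2.5]     # made-up model outputs in log space
    predicted_area = [from_log_area(v) for v in predicted_log]
    print("MAD:", mad(actual_area, predicted_area))
    print("RMSE:", rmse(actual_area, predicted_area))
```

Because RMSE squares each error, a single large fire that is badly underestimated dominates it, while MAD weights all errors linearly. This is one reason the two metrics can favor different models.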



Heart Disease Dataset - Classification


Description :


The heart disease dataset is available on Kaggle and the UCI Machine Learning Repository. According to UCI, "This dataset contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date." We can use this dataset for classification, predicting whether a patient has heart disease from a set of patient features.



Wholesale Dataset - Classification and Clustering


Description :


This dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories. The wholesale distributor, operating in different regions of Portugal, has information on annual spending on several items across different regions and channels. The dataset consists of the annual spending of 440 clients on 6 product categories, in 3 regions (Lisbon, Oporto, Other) and across 2 sales channels (Horeca, i.e. hotel/restaurant/café, and Retail).



Online Retail Dataset - Classification, Clustering and Regression


Description :


This Online Retail II dataset contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. We can use it for regression, clustering and classification, for example to predict the sales of items, or to predict which previously purchased products a user is most likely to buy again in their next order.



Room Occupancy Detection Dataset - Classification


Description :


This dataset provides information about a room's environmental factors, such as temperature, humidity, light, CO2 and humidity ratio, together with its occupancy. We can use it to predict occupancy in an office room. Three datasets are available: one for training and two for testing the models, with the office door open and closed during occupancy. The target variable, occupancy, is binary (0 = not occupied, 1 = occupied).




Census Income Dataset - Classification


Description :


The data was extracted by Barry Becker from the 1994 census database. The dataset contains 14 attributes, 8 categorical and 6 continuous, with information about age, education, nationality, marital status, relationship status, occupation, work classification, gender, race, working hours per week, capital loss and capital gain. The target variable is the income level: whether a person earns more than 50 thousand dollars per year, to be predicted from the given set of attributes.



If you need an implementation for any of the datasets mentioned above, or assignment help on any related variant, feel free to contact us.



