This project analyzes a dataset describing the chemical composition of coffee. The
goal is to build a model that predicts the Preference_Score of a coffee from the
quantities of the chemicals present.
The provided dataset includes the following information:
• Caffeine: Amount of caffeine in the coffee in mg
• Tannin: Amount of Tannin in the coffee in mg
• Thiamin: Amount of Thiamin in the coffee in mg
• Xanthine: Amount of Xanthine in the coffee in mg
• Spermidine: Amount of Spermidine in the coffee in mg
• Gualacol: Amount of Guaiacol in the coffee in mg
• Chlorogenic_acid: Amount of Chlorogenic acid in the coffee in mg
• Preference_Score: Preference score of the drink given by the consumer.
• Drink_Name: Name of the drink.
• Cus_opinion: Customer’s opinion on whether they liked the service.
• Rating: Rating given to the employees.
To build the desired predictive model, complete the following tasks and answer
the following questions.
Questions and Tasks
1. Load and explore the dataset
(a) How many numerical features are there? How many categorical features?
(b) Check whether there are missing values in the dataset and handle them.
(c) Justify the choices you make for handling the missing values.
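The exploration steps in task 1 can be sketched roughly as follows. This is only a minimal illustration: the small DataFrame is a hypothetical stand-in for the real file (its values are made up, and only a few of the listed columns are included), and median/mode imputation is one possible strategy, not the required answer.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: column names follow the list above,
# values are invented purely for illustration.
df = pd.DataFrame({
    "Caffeine": [95.0, 80.0, np.nan, 120.0],
    "Tannin": [4.1, 3.8, 4.5, np.nan],
    "Preference_Score": [7.0, 6.5, 8.0, 5.5],
    "Drink_Name": ["Latte", "Mocha", None, "Latte"],
    "Cus_opinion": ["Yes", "No", "Yes", "Yes"],
})

# (a) Split the columns by dtype to count numerical and categorical features.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(len(numeric_cols), "numerical,", len(categorical_cols), "categorical")

# (b) Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible handling strategy (an assumption, to be justified in (c)):
# median for numerical columns, most frequent value for categorical ones.
clean = df.copy()
for col in numeric_cols:
    clean[col] = clean[col].fillna(clean[col].median())
for col in categorical_cols:
    clean[col] = clean[col].fillna(clean[col].mode().iloc[0])
```

The median is less sensitive to extreme values than the mean, which is one argument that could be used in the justification; dropping rows is an alternative when very few values are missing.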
2. Prepare the dataset for a Linear Regression task.
(a) Examine the distribution of values of the numerical variables.
(b) Is feature transformation necessary for the numerical variables? Take into
account that the dataset is being prepared for a Linear Regression task,
with the goal of building a "Preference_Score" predictive model. If
transformation is necessary, justify your choices and then apply it.
(c) Check for outliers and handle them if necessary. Justify your
choices.
(d) Is encoding necessary for the categorical variables? If yes, which kind of
encoding? Specify your choices, justify them and perform categorical data
encoding, if necessary.
(e) Increase the dimensionality of the dataset by introducing polynomial features
of degree 3 for the continuous variables.
(f) Include any other transformation that might be necessary or appropriate,
and justify your choices.
3. Feature Selection
(a) Perform a one-way ANOVA to test the relationship between the variable Drink_Name
and Preference_Score. Based on the result, consider removing the feature. Justify your choice.
(b) Perform feature selection and visualize the selected features. Choose one
appropriate feature selection method and justify your choice.
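As a rough illustration of task 3, the sketch below runs a one-way ANOVA with `scipy.stats.f_oneway` and a univariate filter with scikit-learn's `SelectKBest`. The data is synthetic and the target is deliberately constructed to depend on Caffeine and Tannin only; `SelectKBest` with `f_regression` and `k=2` is an assumed choice for demonstration, and other methods (for example recursive feature elimination or Lasso-based selection) would be equally valid.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (hypothetical values, real column names).
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "Drink_Name": rng.choice(["Latte", "Mocha", "Espresso"], size=n),
    "Caffeine": rng.normal(size=n),
    "Tannin": rng.normal(size=n),
    "Thiamin": rng.normal(size=n),
})
# The synthetic target depends on Caffeine and Tannin only.
df["Preference_Score"] = (
    3.0 * df["Caffeine"] - 2.0 * df["Tannin"] + rng.normal(scale=0.5, size=n)
)

# (a) One-way ANOVA: does the mean Preference_Score differ across drinks?
groups = [g["Preference_Score"].to_numpy() for _, g in df.groupby("Drink_Name")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")

# (b) Univariate filter: keep the k features with the highest F-score
# against the target.
feature_cols = ["Caffeine", "Tannin", "Thiamin"]
X, y = df[feature_cols], df["Preference_Score"]
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
scores = pd.Series(selector.scores_, index=feature_cols).sort_values(ascending=False)
print(scores)
```

A large p-value in (a) means the test gives no evidence that the group means differ, which supports dropping Drink_Name; the `scores` series can be passed to a bar plot to visualize the selection in (b).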
4. Linear Regression
(a) Train a Multiple Linear Regression model, using the Sklearn implementation
of Linear Regression to find the best θ vector. Use all the transformed features, excluding the derived polynomial features. Evaluate the model with the learned θ on the test set.
(b) Use all the transformed features, excluding the derived polynomial features, to
find the best values of θ by means of a Batch Gradient Descent procedure.
Identify the best value of the learning rate η (starting from an initial value of η = 0.1).
Evaluate the model with the trained θ on the test set. Plot the train and test error
for an increasing number of iterations of the Gradient Descent procedure (with
the best value of η). Comment on the plot.
(c) Use the complete set of features, including the derived polynomial features.
Train a Multiple Linear Regression model, using the Sklearn implementation
of Linear Regression to find the best θ vector. Evaluate the model with the
learned θ on the test set. Plot the train and test error for an increasing
size of the training set. Comment on the plot.
(d) Use the complete set of features, including the derived polynomial features.
Train a Ridge Regression model, identifying the value of the regularization
strength α that allows the model to achieve the best generalization performance.
Evaluate the model.
(e) Use the complete set of features, including the derived polynomial features.
Train a Linear Regression model with Lasso regularization. Comment on the
importance of each feature based on its learned coefficient. Also, check how
many features are selected (coefficient θ different from zero) for different
values of α.
(f) Use the subset of features selected in the Feature Selection task (question 3b).
Train a Multiple Linear Regression model using the Sklearn implementation
of Linear Regression to find the best θ vector. Evaluate the model.
(g) Create a table with the evaluation results obtained from all the models above
on both the train and test sets.
(h) Compare and discuss the results obtained above.
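The Batch Gradient Descent procedure of question 4b can be sketched as below and checked against the Sklearn closed-form solution of question 4a. This is a minimal sketch under stated assumptions: the data is synthetic, the function name `batch_gradient_descent` and its parameters are hypothetical, the cost is the mean squared error, and the features are assumed to be standardized (otherwise η = 0.1 may diverge).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def batch_gradient_descent(X, y, X_eval, y_eval, eta=0.1, n_iterations=1000):
    """Minimize the MSE with full-batch gradient descent.

    Returns theta (intercept first) and the per-iteration MSE on the
    training data and on the evaluation data.
    """
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]                         # add the bias column
    Xb_eval = np.c_[np.ones(X_eval.shape[0]), X_eval]
    theta = np.zeros(Xb.shape[1])
    train_mse, eval_mse = [], []
    for _ in range(n_iterations):
        # Gradient of the MSE computed on the whole training set.
        gradient = (2.0 / m) * Xb.T @ (Xb @ theta - y)
        theta = theta - eta * gradient
        train_mse.append(np.mean((Xb @ theta - y) ** 2))
        eval_mse.append(np.mean((Xb_eval @ theta - y_eval) ** 2))
    return theta, train_mse, eval_mse


# Synthetic, already standardized features (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

theta, train_mse, test_mse = batch_gradient_descent(
    X_train, y_train, X_test, y_test, eta=0.1, n_iterations=1000
)

# Reference solution from the Sklearn implementation (question 4a).
reference = LinearRegression().fit(X_train, y_train)
print("GD intercept:", theta[0], "sklearn intercept:", reference.intercept_)
print("GD coefficients:", theta[1:], "sklearn coefficients:", reference.coef_)
```

The two lists `train_mse` and `test_mse` are what the requested error plot would show against the iteration number. To search for the best η, the same function can be called for several candidate values and the final errors compared; a value that is too large makes the error grow instead of shrink.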