RapidMiner

Required Software: RapidMiner Studio + Tableau Desktop 8.3

Task 1 consists of the following subtasks:

The sinking of the Titanic is a famous event. Researching the facts surrounding the sinking may inform your understanding of the problem and your interpretation of the data analysis of the factors that determined passenger survival. Use the data mining tool RapidMiner to conduct an exploratory analysis of the [url removed, login to view] data set, which is provided in the Assignment 2 folder on the course study desk, and then build a simple predictive model of survival on the Titanic using a Decision Tree.
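As a rough illustration of what a Decision Tree does at its root node, the sketch below picks the attribute whose split leaves the purest groups, measured by Gini impurity. This is plain Python rather than RapidMiner, Gini impurity is only one of several split criteria a tree learner may use, and the rows are made-up toy values, not records from the assignment data set. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy rows for illustration only; NOT real passenger records.
data = [
    {"sex": "female", "pclass": 1, "survived": "Yes"},
    {"sex": "female", "pclass": 3, "survived": "Yes"},
    {"sex": "female", "pclass": 3, "survived": "No"},
    {"sex": "male", "pclass": 1, "survived": "Yes"},
    {"sex": "male", "pclass": 2, "survived": "No"},
    {"sex": "male", "pclass": 3, "survived": "No"},
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(rows, attribute, target="survived"):
    """Average impurity of the groups produced by splitting on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    return sum(len(g) / len(rows) * gini(g) for g in groups.values())


def best_split(rows, attributes):
    """Return the attribute whose split gives the lowest weighted impurity."""
    return min(attributes, key=lambda a: weighted_gini(rows, a))


root = best_split(data, ["sex", "pclass"])
print("Root split on toy data:", root)
```

A full tree repeats this choice inside each group until the groups are pure or a stopping rule applies. On real data the chosen attribute may well differ from the one picked for these toy rows.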

Sub Task A) Identify the five key variables that contribute most to determining the survival of passengers on the Titanic's ill-fated maiden voyage. Also refer to the data dictionary provided with the [url removed, login to view] file, which describes each variable and its range of values. (Hint: base the exploratory analysis on summary statistics, histograms, crosstab tables and scatterplots of individual variables, and on the relationship between each variable and the target variable survived. Which variables are correlated with the target variable survived, and which are correlated with each other?)
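The crosstab and correlation checks mentioned in the hint can be sketched outside RapidMiner as follows. This is a minimal Python illustration under stated assumptions: the rows are invented toy values in the order (pclass, sex, fare, survived), and the helper names are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy rows (pclass, sex, fare, survived); NOT from the real data set.
rows = [
    (1, "female", 80.0, 1),
    (1, "male", 60.0, 0),
    (2, "female", 20.0, 1),
    (2, "male", 15.0, 0),
    (3, "female", 8.0, 1),
    (3, "male", 7.0, 0),
    (3, "male", 7.5, 0),
    (1, "male", 90.0, 1),
]


def survival_rate_by(records, key_index):
    """Crosstab-style summary: share of survivors for each value of one variable."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for record in records:
        totals[record[key_index]] += 1
        survivors[record[key_index]] += record[3]  # survived is 0 or 1
    return {value: survivors[value] / totals[value] for value in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rate_by_sex = survival_rate_by(rows, 1)
rate_by_class = survival_rate_by(rows, 0)
fare_corr = pearson([r[2] for r in rows], [r[3] for r in rows])

print("Survival rate by sex:", rate_by_sex)
print("Survival rate by class:", rate_by_class)
print("Correlation of fare with survived:", round(fare_corr, 3))
```

Variables whose groups show very different survival rates, or whose correlation with survived is far from zero, are candidates for the top five predictors.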

You might also need to reformat some of the variables to facilitate the next stage of analysis of the [url removed, login to view] and [url removed, login to view] data sets using a Decision Tree.

(Hint: you will need to convert the survived variable to a nominal variable with the values Yes = 1, No = 0 in [url removed, login to view].) See Data Mining for the Masses, Chapters 3 and 4, for guidance on exploratory data analysis using RapidMiner.
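The recoding described in the hint amounts to mapping the numeric codes to nominal labels. A minimal Python sketch of that mapping follows; it is not the RapidMiner procedure itself, the in-memory sample is a made-up stand-in for the training file, and `recode_survived` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the training file.
sample = "name,survived\nPassenger A,1\nPassenger B,0\nPassenger C,1\n"

# Mapping from the numeric codes in the data dictionary to nominal labels.
LABELS = {"0": "No", "1": "Yes"}


def recode_survived(text):
    """Read CSV text and replace the 0/1 survived codes with No/Yes labels."""
    reader = csv.DictReader(io.StringIO(text))
    recoded_rows = []
    for record in reader:
        record["survived"] = LABELS[record["survived"]]
        recoded_rows.append(record)
    return recoded_rows


recoded = recode_survived(sample)
for record in recoded:
    print(record["name"], record["survived"])
```

An unexpected code raises a `KeyError` here, which is a deliberate choice in this sketch so that unusual values surface instead of being silently relabelled.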

Discuss each of your top five predictor variables and the general results of your exploratory data analysis in RapidMiner, including how you dealt with missing and unusual data, informed by relevant supporting literature on the survival of passengers on the Titanic. Your discussion should include appropriate statistical results, such as graphs and results tables from the exploratory data analysis in RapidMiner, together with some supporting references on building and interpreting predictive models using Decision Trees in data mining (about 600 words).
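One possible way to approach the missing-data part of the discussion is to count the missing values per variable and then decide on a treatment. The Python sketch below shows the idea with median imputation for age. The records are invented toy values, median imputation is only one option among several, and the helper names are hypothetical.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value. NOT real passenger data.
passengers = [
    {"age": 22.0, "fare": 7.25, "cabin": None},
    {"age": None, "fare": 71.28, "cabin": "C85"},
    {"age": 38.0, "fare": 8.05, "cabin": None},
    {"age": None, "fare": 53.10, "cabin": "C123"},
    {"age": 30.0, "fare": 8.46, "cabin": None},
]


def missing_counts(records):
    """Count how many records have no value for each field."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def impute_median(records, field):
    """Return copies of the records with missing values of one field set to its median."""
    observed = [r[field] for r in records if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in records]


counts = missing_counts(passengers)
filled = impute_median(passengers, "age")

print("Missing values per field:", counts)
print("Ages after imputation:", [r["age"] for r in filled])
```

A variable with mostly missing values may be better excluded than imputed; whichever choice is made should be justified in the written discussion.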

The following table lists the data dictionary for the data set titanic_train.csv. (Note: [url removed, login to view] is the same as [url removed, login to view] but does not contain any values for the target variable survived, which is referred to as a label variable in RapidMiner.)

Variable    Description
pclass      Passenger Class (1 = 1st class; 2 = 2nd class; 3 = 3rd class)
survived    Survived (0 = No; 1 = Yes)
name        Name
Sex         Sex
Age         Age
sibsp       Number of Siblings/Spouses Aboard
parch       Number of Parents/Children Aboard
ticket      Ticket Number
fare        Passenger Fare
cabin       Cabin
embarked    Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat        Lifeboat
body        Body Identification Number
[url removed, login to view]    Home/Destination

Refer to the attached file to understand the end-to-end requirements.

