Skip to main content

Step by step data analysis using spreadsheet application (e.g. Microsoft Excel)

Background

A spreadsheet application is an invaluable tool to gather insights from data when the data is structured and the quantity of data is less.
A good rule of thumb is when the data is stored as a CSV (Comma Separated values) or in the XLS/XLSX format and the size of the data is less than 25 MB then use spreadsheet applications for analyzing the data to get insights. Also, it needs to be noted that Microsoft Excel (https://products.office.com/en-in/excel) is not the only spread sheet software, Open Office Calc (https://www.openoffice.org/product/calc.html) and LibreOffice Calc (https://www.libreoffice.org/discover/calc/) are good and FREE alternatives.


Dataset

As part of this post we shall work through the steps that are involved to perform analysis on the Titanic dataset. This dataset contains the details of the passengers of the RMS Titanic which was involved in an unfortunate accident with an ice berg that led to the death of many people. The back ground for the dataset can be found here and the details of the columns of the dataset are here. The CSV file itself can be downloaded using this link. Note that you would need to create an account in Kaggle and take part in the Titanic competition before you can download the dataset.

Our objective

We shall perform exploratory analysis on this dataset, which we would typically do for any dataset. The main objectives of the exploratory analysis will be as follows:
1. How many passenger details are there in the file?
2. What is the total amount of fare paid?
3. What is the average age and median age of the passengers?
4. Which class is the most frequent class for travel?
5. Does the name of the passenger have some special title (other than Mr./ Mrs.)?
6. What is the average number of siblings/spouses and parent children aboard the ship?
7. For the male passengers what are their class of travel sorted by the age of the passenger?
8. Exercise: For the female passengers who unfortunately died, what was their class of travel sorted by the age of the passenger?

Note: The below solutions are done in Microsoft Excel which is part of Office 365.


Realizing our objectives

First we need to open the CSV file that you downloaded from the above link. So, right click on the downloaded file, hover over "Open with" and then click Excel


1. How many passenger details are there in the file?

>> To solve this we can check the number of rows in the sheet that have data. We need to scroll down till we see data in the spreadsheet. The row number to the left will show the number of records, which in this case is 891 (1 row subtracted for the headers).


2. What is the total amount of fare paid?


>> To get the total of a column there are 2 ways to do it.

a. First way is to click the 'J' symbol in the column to select the column and then observe the Sum at the bar near the bottom.


b. The next way is to use the sum function in the spreadsheet application. You can Google the details of the sum function and check this link from official Microsoft Office support:https://support.office.com/en-us/article/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89

All functions in spreadsheet applications needs to be typed in the "formula" bar towards the top of the window. A function starts with an equals symbol and then followed by the name of the function. A parenthesis is then required after which the parameters of the function need to be provided, which depends on the function that is used. In the case of SUM the parameter takes a range of values.
A cell in Excel is labelled by a combination of the column name and row number. Columns are named alphabetically and rows are numbered starting with 1. So the first cell in any spreadsheet application has the label A1. The remaining cells can be named based on the column and row of the cell.
For SUM we first give the cell label of the starting cell and then a colon separator and then the cell label of the ending cell. Since here the values are in the same column, the two cells will share the column name, which is J in this case. Hence, the formula will look like below. The text values will be ignored for the SUM and the formula can be written after selecting any cell which in my case is done after selecting cell M3. The sum formula is SUM(J1:J892). Note the capitalization of the alphabet j in this case will not matter. Type enter to see the result of the formula show up in the cell.



 Also, note that the value is same as the earlier case showing the formula is correct.
Congratulations, you just wrote your first formula for analyzing data using Excel!! Give yourself a pat on your back!!


3. What is the average age and median age of the passengers?

>> This task is similar to the earlier task, try to type in the formula yourself using the steps from the last task. The formulas and results are mentioned for reference.

Average Age: type in into cell M4:  '=AVERAGE(F1:F892)' Expected age: 29.699
Median age: type in into cell M5: '=MEDIAN(F1:F892)' Expected age: 28

The resulting sheet is shown below:

4. Which class is the most frequent class for travel?


>> The most frequent value in a distribution of values is called the Mode of a distribution. To get the mode of the passenger class type in the formula 'MODE.SNGL(C2:C892)' and expect '3' as the result.

5. Does the name of the passenger have some special title (other than Mr./ Mrs.)?

>> This is slightly complex. We would need to use the IF formula to solve this task. Please GOOGLE and check the links for specific details on the IF formula. On a high level the IF formula takes in 3 parameters. The first parameter is a condition that needs to be evaluated. In case the condition evaluates to TRUE then the second parameter is executed and if it evaluates to FALSE then the third parameter is executed. The third parameter to the IF function is optional.

For this case, we will create some columns. The first column called 'title' will hold the title of the person extracted from the name, the second column called 'Is special title' will hold whether the title that was extracted is a special title, i.e. not Mr./Mrs./Miss/Master. These titles are present in a separate table so that more titles can be added or removed as required.

To extract the title we need to observe that the titles are mentioned in the names after a comma and space. So we need to find the location of the first comma with space in the name and start extracting the characters till we encounter another space. The FIND function in Excel is useful to find the location of smaller texts inside a larger text. The MID function helps to take a portion of the text from the larger text and the TRIM function removes any preceding and trailing spaces in the string. The formula for extracting the title is the below:
TRIM(MID(D2, FIND(", ", D2) + 2,  FIND(". ", D2) - FIND(", ", D2) - 2 ))

Let us break down the formula since it is quite a handful.
1. The TRIM function takes in a text and removes the spaces at the front and back. https://support.office.com/en-ie/article/trim-function-410388fa-c5df-49c6-b16c-9e5630b479f9

2. The MID function takes in 3 parameters, the text from which to extract a smaller text, the starting portion from where to start extracting characters and the number of characters to extract. Here we specify the D2 cell as the cell containing the first name, the starting location as the location of the comma with a space added by 2 to account for these 2 characters. The number of characters is computed as the number of characters between the full stop and a space that marks the end of the title and the comma and space that we explored earlier.
Further details: https://support.office.com/en-us/article/mid-midb-functions-d5f9e25c-d7d6-472e-b568-4ecb12433028

3. The FIND function finds the location of the text pattern mentioned as the first parameter in the second parameter.
Details here: https://support.office.com/en-ie/article/find-findb-functions-c7912941-af2a-4bdf-a553-d0d89b0a0628

So in plain English the formula can be read as: trim the extra spaces to the left and right of the text obtained as a result of extracting the text from the comma and space till the fullstop that marks the end of the title of the name.

This formula can be applied to all the rows by dragging the small square to the bottom of the cell where the formula was written as shown below.



Now that we got the titles from the names, we shall use the VLOOKUP function to check whether these titles are part of the list of titles which we consider as not special. We shall write down these titles in rows R7 through R10: Mr, Mrs, Master and Miss. The formula looks like the below

IF(ISERROR(VLOOKUP(N2,$R$7:$R$10,1,FALSE)),TRUE,FALSE)

The VLOOKUP function takes in 4 parameters and details are described in the Office support link  above. The first parameter is the value we want to lookup which in this case is our title which we just extracted. The next parameter is the table where we search for our value which is the table where we have saved the 4 values we are tracking for now. Note that the first column in this table must have the same values we are looking up (as part of our first parameter). The third parameter is the column from where we fetch the result in case of a match, which in this case we set to the only column value 1. The 4th parameter is to specify whether to perform approximate match or not.
The VLOOKUP function returns an error when the value we lookup (1st parameter) does not exist in the table (2nd parameter). We check for this error using the ISERROR function. We wrap everything in an IF function where we say that if there was an error while looking up the title value in our list then we have a special title else it is not special. Do note since the table where we lookup our values is static, in order to avoid the cells from changing as we drag the small square to apply the formula to all the rows, we need to add the dollar ($) sign to prevent these values from changing even when we drag the formula.


Now we finally have the folks with special titles, the below highlights show a few examples to show that our approach and formulas work as expected.

6. What is the average number of siblings/spouses and parent children aboard the ship?

>> Try to solve using the AVERAGE formulas.

7. For the male passengers what are their class of travel sorted by the age of the passenger?

>> Here we need to sort the records by the age of the passenger after filtering to only male passengers. We then would be interested in their class of travel.

The below steps need to be performed:

a. Filter the data so that we only have the male passengers. Click on E which is the 'Sex' column, then navigate to the Data tab at the top and then click on Filter. A small selection box will appear to the side of the column name.

Click on the box and select 'male'.

b. Sort the male records by the age, by clicking on the Age column F and then click the sort button on the data tab. Since there is data beside the column to be sorted we need to click on "Expand the selection" on the pop up that opens and then click on sort. Then select the age column and specify the order as shown below:
c. Hide the other columns by click the column name and then "Hide" to focus only on the required columns to get the below:


Here we have the male passengers, sorted by their age in descending order with the required information shown.


8. Exercise: For the female passengers who unfortunately died, what was their class of travel sorted by the age of the passenger?




Comments