How are Titanic passengers segmented by Vorteris Big Data?

To demonstrate how Vorteris works, I selected a well-known dataset with information about the passengers who embarked on the Titanic. Despite the tragic event, this dataset is fairly rich in detail and has been widely used in Machine Learning communities, since it allows the application of several Big Data techniques.

In this case, I am going to apply Vorteris, a Big Data tool focused on automatic segmentation plus other important decision-making indicators. This technique is called clustering; there is more information about it in the post "How can big data clustering strategy help business". In the conclusion section, I give some ideas on how this innovative approach can help businesses.

Titanic Dataset summary

According to Encyclopedia Titanica, “On 10 April 1912, the new liner sailed from Southampton, England with 2,208 passengers and crew, but four days later she collided with an iceberg and sank: 1496 people died and 712 survived”. The data we had access to for this analysis gave the following figures:

  • 1309 people on board, of which 500 (38%) survived and 809 (62%) died.
  • An average age of 29.88 years (estimated).
  • 466 women, of which 127 died and 339 survived.
  • 843 men, of which 682 died and 161 survived.
  • Tickets cost on average £53.65 per woman and £76.60 per man.
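
The survival rates implied by these counts can be checked with a few lines of arithmetic. This is just a small sketch using the figures from the list above; the variable and function names are mine and are not part of any tool.

```python
# Counts copied from the dataset summary above.
women = {"survived": 339, "died": 127}
men = {"survived": 161, "died": 682}


def survival_rate(group):
    """Share of a group that survived (0.0 to 1.0)."""
    total = group["survived"] + group["died"]
    return group["survived"] / total


total_people = sum(women.values()) + sum(men.values())
overall = (women["survived"] + men["survived"]) / total_people

print(f"people on board: {total_people}")
print(f"overall survival: {overall:.1%}")
print(f"women survival:   {survival_rate(women):.1%}")
print(f"men survival:     {survival_rate(men):.1%}")
```

The totals add up to the 1309 people listed, and the gap between the two rates shows how uneven the outcome was between women and men.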

For more details on the complete dataset, search Google for "Titanic Dataset".

Factors under analysis

Unfortunately, 267 passengers (20.39%) had to be excluded from the analysis due to missing age values. Furthermore, out of the 15 factors present in the original file, I selected the numerical ones with the strongest weights calculated by Vorteris. Usually, we classify factors, variables or data attributes into the following 3 categories:

  • Protagonist – Factors with strong positive influence to generate a valuable pattern with clarity.
  • Antagonist – Factors with noise or unclear patterns and negative influence that play against the protagonist.
  • Supporting – Factors that do not play a significant role in changing the path of the analysis, but can enrich the results.

According to the influence power, the protagonists chosen for this analysis were:

  • Age of the passenger = 87.85%
  • How much each passenger paid to embark = 72.69%
  • Number of parents on the ship = 71.69%
  • Number of siblings or spouses on the ship = 72.42%

During the calculation, gender (whether the passenger was male or female) tended to play an antagonist role: it showed no pattern that helped form the groups and dropped the dataset sharpness to 7%. Therefore, it was removed.
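
As a rough sketch of the preparation described in this section, the snippet below drops records with a missing age and keeps only the four protagonist factors, leaving gender out. The field names (`age`, `fare`, `parch`, `sibsp`, `sex`) and the four sample rows are assumptions made up for illustration; they are not taken from the real file and this is not how Vorteris itself ingests data.

```python
# Illustrative rows only; values and field names are assumptions.
passengers = [
    {"age": 29.0, "fare": 211.34, "parch": 0, "sibsp": 0, "sex": "female"},
    {"age": None, "fare": 7.75, "parch": 0, "sibsp": 0, "sex": "male"},
    {"age": 2.0, "fare": 151.55, "parch": 2, "sibsp": 1, "sex": "female"},
    {"age": None, "fare": 8.05, "parch": 0, "sibsp": 0, "sex": "male"},
]

# The four numerical protagonists; "sex" is deliberately not included.
FACTORS = ("age", "fare", "parch", "sibsp")


def prepare(records, factors=FACTORS):
    """Drop records without an age and keep only the chosen factors."""
    complete = [r for r in records if r["age"] is not None]
    return [tuple(float(r[f]) for f in factors) for r in complete]


rows = prepare(passengers)
excluded = len(passengers) - len(rows)
print(f"kept {len(rows)} rows, excluded {excluded} with missing age")
```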

Vorteris Results and group characteristics

After processing, Vorteris produced the following indicators. Most of them are not offered by other algorithms, so I give a brief explanation of each:

  • Dataset Sharpness = 33.64%. It shows how clear or confident the machine is about the discovered grouping patterns. According to our dataset quality scale, sharpness above 20% is already useful for decision making.
  • Automatic discovery of segments (groups) = 8. This function makes the whole process a lot easier for the data analyst. Unlike k-means and other algorithms, Vorteris finds the right (ideal) number of groups by itself, dramatically reducing the segmentation errors that typically happen.
  • Clustering Distinctness = how different the elements of each group are in relation to the overall population, which is what makes them a group. The most distinctive group is number 5 with 51.48% (darker color) and the least distinctive is group 1 with 8.58%. This means that elements of group 5 tend to be more homogeneous than those of the other groups.
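
How Vorteris finds the number of groups is not described in this post, so the snippet below is only a generic sketch of the problem it addresses. Plain k-means needs the number of groups up front; one common workaround, swapped in here purely for illustration, is to try several values and keep the one with the best silhouette score. The sample points (age, fare), the helper names and the deterministic starting centroids are all my assumptions.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain k-means: the analyst has to supply k."""
    # Deterministic start: k points evenly spaced through the sorted data.
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty group's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels


def silhouette(points, labels):
    """Mean silhouette score: higher means tighter, better separated groups."""
    groups = set(labels)
    if len(groups) < 2:
        return -1.0
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # a single-member group scores 0 by convention
            continue
        a = sum(own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in groups if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def best_k(points, candidates=range(2, 6)):
    """Try each candidate k and return the one with the best silhouette."""
    scored = {k: silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scored, key=scored.get), scored


# Made-up (age, fare) pairs forming three obvious groups.
points = [(4, 30), (6, 32), (5, 28),
          (30, 8), (32, 7), (28, 9),
          (50, 90), (52, 95), (48, 85)]

chosen_k, scores = best_k(points)
print(f"number of groups with the best silhouette: {chosen_k}")
```

Even on a toy example like this, the analyst has to run the algorithm once per candidate value and compare the results, which is the manual step an automatic discovery of segments removes.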

Vorteris screenshot

By checking the groups against who survived the trip, I arrived at the survival rate of each group plus its average ticket fare. If you had the characteristics of group 5 or 7, you would have had a better chance of surviving.

Survival Rate

New indicators based on groups. Blue bar = group size / Green = Higher survival rate and Ticket Price.
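
Once every passenger carries a group label, indicators like these are simple aggregations. The sketch below shows the idea with a handful of made-up records in the form (group, survived flag, fare); the numbers are illustrative only and are not the Vorteris output.

```python
from collections import defaultdict

# (group, survived: 1 or 0, fare paid) -- illustrative records only.
records = [
    (5, 1, 120.0), (5, 1, 80.0), (5, 0, 100.0),
    (1, 0, 8.0), (1, 0, 7.0), (1, 1, 9.0), (1, 0, 8.0),
]


def summarize(records):
    """Group size, survival rate and average fare for each group."""
    totals = defaultdict(lambda: {"size": 0, "survived": 0, "fare": 0.0})
    for group, survived, fare in records:
        t = totals[group]
        t["size"] += 1
        t["survived"] += survived
        t["fare"] += fare
    return {
        group: {
            "size": t["size"],
            "survival_rate": t["survived"] / t["size"],
            "avg_fare": t["fare"] / t["size"],
        }
        for group, t in totals.items()
    }


summary = summarize(records)
for group in sorted(summary):
    s = summary[group]
    print(f"group {group}: size={s['size']}, "
          f"survival={s['survival_rate']:.0%}, avg fare=£{s['avg_fare']:.2f}")
```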

Naming the groups

To operationalise a management strategy in any sector, you need to study the characteristics of each group and name them. So, looking at the key predominant characteristics of each group (or persona), let’s visually compare just 4 groups according to the factor “AGE”. The higher the value goes, the greater the number of passengers with that characteristic. These factors can easily be studied interactively on the Vorteris DataScope.


Screenshots of the Vorteris DataScope

Another option is to look straight at the grouped data. In this case, I took a screenshot of the classified data of group number 5, which has the most distinct passengers on the whole ship, probably rich young people traveling with the whole family.

Classified data

Conclusions and Recommendations

The most typical passenger is a young person, aged 21 on average, who paid £26.35 on average. The least typical passenger is the only member of group 8: a 38-year-old who paid £7.775 and was traveling with both parents plus 4 siblings.

A case with just over a thousand records is not the best showcase for finding these profiles. However, if you have millions of transactions, clients or patients, the tool could be key to optimizing your operation, reducing costs and better targeting your public. So:

  • Who is the most typical client you have?
  • What are the characteristics of each group?
  • What is the total cost or revenue per group?
  • What groups represent 80% of your cost or revenues?
  • Which groups do you want your strategy to address, and which ones not?
  • What are the Protagonist, Antagonist and Supporting factors that most affect your strategy?
  • Does the persona created by Vorteris match the persona you have today? Benchmark it!
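
As an example of answering one of these questions, here is a minimal sketch for finding which groups account for 80% of revenue. The group names and revenue figures are hypothetical, and the helper `groups_covering` is mine rather than a Vorteris feature.

```python
def groups_covering(revenue_by_group, share=0.80):
    """Smallest set of top groups whose revenue reaches the given share."""
    total = sum(revenue_by_group.values())
    # Rank groups from highest to lowest revenue.
    ranked = sorted(revenue_by_group.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for group, revenue in ranked:
        if running >= share * total:
            break
        selected.append(group)
        running += revenue
    return selected, running / total


# Hypothetical revenue per group.
revenue = {
    "group 1": 500.0,
    "group 2": 250.0,
    "group 3": 120.0,
    "group 4": 80.0,
    "group 5": 50.0,
}

top_groups, coverage = groups_covering(revenue)
print(f"{top_groups} account for {coverage:.0%} of revenue")
```

The same function works for cost instead of revenue: pass the cost per group and read the result as the groups where most of the spending is concentrated.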

That is it for now. I hope this was interesting and useful for planning your decisions ahead. In case you need a little help, let us know.



What is Aquarela Advanced Analytics?

Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in the industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.
