How does Vorteris Big Data organise the world?

Hello everyone,

The objective of this post is to show what happens when we give a machine (Vorteris Big Data) a large set of numbers and let it work out by itself how countries should be sorted into different boxes. This technique is called clustering! The questions we will answer in this post are:

How are countries segmented based on the world’s indexes?
What are the characteristics of each group?
Which factors are the most influential for the separation?

Here we go!

Data First – What comes in?

I have gathered 65 indexes for 188 countries of the world. The main sources are:

UNDESA 2015,
UNESCO Institute for Statistics 2015,
United Nations Statistics Division 2015,
World Bank 2015,
IMF 2015.

Selected variables for the analysis were:

Human Development Index HDI-2014
Gini coefficient 2005-2013
Adolescent birth rate 15-19 per 100k 2010-2015
Birth registration under age 5 2005-2013
Carbon dioxide emissions Average annual growth
Carbon dioxide emissions per capita 2011 Tonnes
Change forest percentile 1900 to 2012
Change mobile usage 2009 2014
Consumer price index 2013
Domestic credit provided by financial sector 2013
Domestic food price level 2009 2014 index
Domestic food price level 2009-2014 volatility index
Electrification rate of population
Expected years of schooling – Years
Exports and imports percentage GDP 2013
Female Suicide Rate 100k people
Foreign direct investment net inflows percentage GDP 2013
Forest area percentage of total land area 2012
Fossil fuels percentage of total 2012
Freshwater withdrawals 2005
Gender Inequality Index 2014
General government final consumption expenditure – Annual growth 2005 2013
General government final consumption expenditure – Percentage of GDP 2005-2013
Gross domestic product GDP 2013
Gross domestic product GDP per capita
Gross fixed capital formation of GDP 2005-2013
Gross national income GNI per capita – 2011 Dollars
Homeless people due to natural disaster 2005 2014 per million people
Homicide rate per 100k people 2008-2012
Infant Mortality 2013 per thousands
International inbound tourists thousands 2013
International student mobility of total tertiary enrolment 2013
Internet users percentage of population 2014
Intimate or no intimate partner violence ever experienced 2001-2011
Life expectancy at birth- years
Male Suicide Rate 100k people
Maternal mortality ratio deaths per 100 live births 2013
Mean years of schooling – Years
Mobile phone subscriptions per 100 people 2014
Natural resource depletion
Net migration rate per 1k people 2010-2015
Physicians per 10k people
Population affected by natural disasters average annual per million people 2005-2014
Population living on degraded land Percentage 2010
Population with at least some secondary education percent 2005-2013
Pre-primary 2008-2014
Primary-2008-2014
Primary school dropout rate 2008-2014
Prison population per 100k people
Private capital flows percentage GDP 2013
Public expenditure on education Percentage GDP
Public health expenditure percentage of GDP 2013
Pupil-teacher ratio primary school pupils per teacher 2008-2014
Refugees by country of origin
Remittances inflows GDP 2013
Renewable sources percentage of total 2012
Research and development expenditure 2005-2012
Secondary 2008-2014
Share of seats in parliament percentage held by women 2014
Stock of immigrants percentage of population 2013
Taxes on income profit and capital gain 2005-2013
Tertiary -2008-2014
Total tax revenue of GDP 2005-2013
Tuberculosis rate per thousands 2012
Under-five Mortality 2013 per thousands
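Because the indexes above sit on very different scales (years, percentages, rates per 100k), a clustering run usually starts by putting every column on a comparable scale. The sketch below is only an illustration of the general idea, not the Vorteris algorithm, whose internals are not described in this post. It standardises each column and then runs a plain k-means, a technique swapped in here for demonstration. The six rows and their values are hypothetical, as are all function names.

```python
from statistics import mean, pstdev

def standardise(rows):
    """Z-score each column so indexes on different scales are comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns: divide by 1 instead of 0
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from all chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=50):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a group ends up empty.
            new_centroids.append([mean(c) for c in zip(*members)] if members else centroids[j])
        if new_centroids == centroids:  # assignments are stable, stop early
            break
        centroids = new_centroids
    return labels

# Hypothetical toy table: six made-up countries, three indexes each
# (HDI, life expectancy in years, under-five mortality per thousand).
rows = [
    [0.90, 81.0, 4.0],
    [0.92, 82.0, 3.5],
    [0.88, 80.0, 5.0],
    [0.45, 58.0, 90.0],
    [0.50, 60.0, 80.0],
    [0.42, 56.0, 100.0],
]

labels = kmeans(standardise(rows), k=2)
print(labels)
```

On the real table the same steps would run over 188 rows and 65 columns, and the number of groups would have to be chosen or discovered rather than fixed at two.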

What comes out?

Let’s start by looking at the map to see where these groups are; then we move on to the Vorteris visualization to better understand the DNA (the composition of factors) of each group.

Click on the picture to play around with the map inside Google Maps.

OK, I see the clusters, but now I want to know which combination of characteristics unites or separates them. The picture below shows the Vorteris visualization considering all groups and all factors.

On the left side are the groups and their proportions. Segmentation sharpness measures how different the groups are from each other across all factors. On the right side is the total composition of variables, which we can call the world’s DNA.

In the next figures, you will see how the picture changes when we select individual groups.

The most typical situation for a country, representing 51.60% of the globe. We call these the average countries.

The second most common type representing 26.46% of the globe.

This is the cluster that contains the so-called first world countries, whose results are above average, representing 14.89% of the globe. The United States does not belong to this group, but Canada, Australia, New Zealand and Israel do.

The US is numerically so different from the rest of the world that Vorteris placed it alone in its own group, which had the highest distinctiveness (38.93%).

Some other countries also had no similar countries to share a group with; this is the case of the United Arab Emirates.
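The group proportions quoted above, and the single-country groups, both fall out of simply counting how many countries carry each cluster label. A minimal sketch, assuming hypothetical group names and a made-up list of twelve labels (not the real assignment of the 188 countries):

```python
from collections import Counter

def group_shares(labels):
    """Fraction of all countries that falls into each group."""
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}

def singleton_groups(labels):
    """Groups that contain exactly one country."""
    return [group for group, n in Counter(labels).items() if n == 1]

# Hypothetical labels for twelve made-up countries.
labels = (["average"] * 5 + ["second"] * 3 + ["first_world"] * 2
          + ["solo_a", "solo_b"])

shares = group_shares(labels)
for group, share in shares.items():
    print(f"{group}: {share:.2%}")
print("single-country groups:", singleton_groups(labels))
```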

Before we finish, here are the 5 most and the 5 least influential factors that Vorteris identified as key to creating the groups.

Top 5

Maternal mortality ratio deaths per 100 live births 2013 – 91% influence
Under-five Mortality 2013 per thousands – 90%
Human Development Index HDI-2014 – 90%
Infant Mortality 2013 per thousands – 90%
Life expectancy at birth- years – 90%

Bottom 5

Renewable sources percentage of total 2012 – 70% influence
Total tax revenue of GDP 2005-2013 – 72%
Public health expenditure percentage of GDP 2013 – 73%
General government final consumption expenditure – Percentage of GDP 2005-2013 – 73%
General government final consumption expenditure – Annual growth 2005 2013 – 75%
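How Vorteris computes its influence percentages is not described in this post. As a rough stand-in for the idea, the sketch below uses eta squared: the share of a variable's total variance that lies between the groups rather than within them. A value near 1 means the variable separates the groups sharply, and a value near 0 means it barely differs across them. The function name, the labels and the numbers are all hypothetical.

```python
from statistics import mean

def eta_squared(values, labels):
    """Between-group sum of squares divided by total sum of squares for one variable."""
    grand = mean(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant variable cannot separate anything
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    between_ss = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return between_ss / total_ss

# Hypothetical cluster labels for six made-up countries.
labels = [0, 0, 0, 1, 1, 1]

# Made-up values: one variable that differs strongly between the groups,
# and one that looks almost the same in both.
under_five_mortality = [4.0, 3.5, 5.0, 90.0, 80.0, 100.0]
tax_revenue_share = [20.0, 30.0, 25.0, 22.0, 28.0, 26.0]

strong = eta_squared(under_five_mortality, labels)
weak = eta_squared(tax_revenue_share, labels)
print(f"under-five mortality: {strong:.3f}")
print(f"tax revenue: {weak:.3f}")
```

Ranking all 65 variables by a score of this kind would give a top and a bottom list in the same spirit as the ones above, although the actual percentages would differ from the Vorteris figures.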

Conclusions

According to Vorteris, if you plan to live in another country or sell your product abroad, it would be wise to check which group that country belongs to. If it belongs to the same group as the country you live in, then you know what to expect.

Could other factors be added to or removed from the analysis? Yes, absolutely. However, it is not always easy to get the information you need at the time you need it. Big Data analyses usually have several constraints and typically rely on the type of questions posed to the data and to the algorithm, which in turn rely on the creativity of the data scientist.

The clustering approach is becoming more and more common in industry due to its strategic role in organizing and simplifying the decision-making chaos. After all, how could a manager look at 12,220 cells (188 countries × 65 indexes) to define a regional strategy?

Any questions or doubts? Did anything catch your attention? Please leave a comment!

For those who wish to see the platform operating in practice, here is a video using data from Switzerland. Enjoy!

What is Aquarela Advanced Analytics?

Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in the industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.

Stay tuned by following Aquarela’s LinkedIn!

Marcos Santos

Founder and CEO of Aquarela Analytics. Master’s degree in Engineering and Knowledge Management, author of the DCM (Data Culture Methodology), with over 30 years of experience in entrepreneurship and systems development. He has successfully structured and led dozens of highly complex strategic projects in Artificial Intelligence and Industry 4.0, focusing on Revenue Management, Predictive Maintenance, Competitive Intelligence, Customer Acquisition, Logistics, and Inventory for major companies such as Mercedes-Benz, Vivo/Telefônica, Coca-Cola, Scania, Randon, Votorantim, and Embraer.

Joni Hoppen

Founder – Commercial Director, MSc in Business Information Technology from the University of Twente – The Netherlands. Lecturer in the areas of Data Science, Data Governance and business development for Industry 4.0. Responsible for large projects with key industry players in Brazil in the areas of Energy, Telecom, Logistics and Food.
