Hello everyone,
The objective of this post is to show what happens when we feed a set of country indicators to a machine (Vorteris Big Data) and let it work out by itself how the countries should be organized into different boxes. This technique is called clustering! The questions we will answer in this post are:
- How are countries segmented based on the world’s indexes?
- What are the characteristics of each group?
- Which factors are the most influential for the separation?
Here we go!
Data First – What comes in?
I have gathered 65 indexes for 188 countries, mainly from the following sources:
- UNDESA 2015,
- UNESCO Institute for Statistics 2015,
- United Nations Statistics Division 2015,
- World Bank 2015,
- IMF 2015.
Selected variables for the analysis were:
- Human Development Index HDI-2014
- Gini coefficient 2005-2013
- Adolescent birth rate 15-19 per 100k 2010-2015
- Birth registration under age 5 2005-2013
- Carbon dioxide emissions Average annual growth
- Carbon dioxide emissions per capita 2011 tonnes
- Change forest percentile 1900 to 2012
- Change mobile usage 2009 2014
- Consumer price index 2013
- Domestic credit provided by financial sector 2013
- Domestic food price level 2009 2014 index
- Domestic food price level 2009-2014 volatility index
- Electrification rate of population
- Expected years of schooling – Years
- Exports and imports percentage GDP 2013
- Female Suicide Rate 100k people
- Foreign direct investment net inflows percentage GDP 2013
- Forest area percentage of total land area 2012
- Fossil fuels percentage of total 2012
- Freshwater withdrawals 2005
- Gender Inequality Index 2014
- General government final consumption expenditure – Annual growth 2005 2013
- General government final consumption expenditure – Percentage of GDP 2005-2013
- Gross domestic product GDP 2013
- Gross domestic product GDP per capita
- Gross fixed capital formation of GDP 2005-2013
- Gross national income GNI per capita – 2011 Dollars
- Homeless people due to natural disaster 2005 2014 per million people
- Homicide rate per 100k people 2008-2012
- Infant Mortality 2013 per thousands
- International inbound tourists thousands 2013
- International student mobility of total tertiary enrolment 2013
- Internet users percentage of population 2014
- Intimate or no intimate partner violence ever experienced 2001-2011
- Life expectancy at birth- years
- Male Suicide Rate 100k people
- Maternal mortality ratio deaths per 100 live births 2013
- Mean years of schooling – Years
- Mobile phone subscriptions per 100 people 2014
- Natural resource depletion
- Net migration rate per 1k people 2010-2015
- Physicians per 10k people
- Population affected by natural disasters average annual per million people 2005-2014
- Population living on degraded land Percentage 2010
- Population with at least some secondary education percent 2005-2013
- Pre-primary 2008-2014
- Primary-2008-2014
- Primary school dropout rate 2008-2014
- Prison population per 100k people
- Private capital flows percentage GDP 2013
- Public expenditure on education Percentage GDP
- Public health expenditure percentage of GDP 2013
- Pupil-teacher ratio primary school pupils per teacher 2008-2014
- Refugees by country of origin
- Remittances inflows GDP 2013
- Renewable sources percentage of total 2012
- Research and development expenditure 2005-2012
- Secondary 2008-2014
- Share of seats in parliament percentage held by women 2014
- Stock of immigrants percentage of population 2013
- Taxes on income profit and capital gain 2005-2013
- Tertiary -2008-2014
- Total tax revenue of GDP 2005-2013
- Tuberculosis rate per thousands 2012
- Under-five Mortality 2013 per thousands
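To give an idea of what "the machine finds the boxes by itself" means in practice, here is a minimal sketch. It is not the Vorteris algorithm, which this post does not describe. It assumes a plain k-means on standardized values as a stand-in, and the country names, the two indicators and all numbers are hypothetical toy data.

```python
import random
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so indexes on different scales become comparable."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means (a stand-in technique); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = [mean(c) for c in zip(*members)]
    return labels


# Hypothetical toy data: [life expectancy (years), under-five mortality (per 1,000)]
countries = ["A", "B", "C", "D", "E", "F"]
data = [[82.0, 4.0], [81.0, 5.0], [80.0, 6.0],
        [55.0, 90.0], [57.0, 85.0], [53.0, 95.0]]

labels = kmeans(standardize(data), k=2)
groups = dict(zip(countries, labels))
print(groups)
```

Standardizing first matters because the 65 indexes are on very different scales (years, percentages, rates per 100k), and a distance-based method would otherwise be dominated by the largest numbers.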
What comes out?
Let’s start by looking at the map to see where these groups are, then move on to the Vorteris visualization to better understand the DNA (the composition of factors) of each group.
OK, I see the clusters, but now I want to know which combination of characteristics unites or separates them. The picture below is the Vorteris visualization considering all groups and all factors.
In the next figures, you will see how different it becomes when we select individual groups.
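The "DNA" of a group can also be approximated with numbers. The sketch below is only an illustration under assumptions, not the Vorteris visualization: it computes the mean z-score of each factor inside each group, so a positive value means the group sits above the world average on that factor. The factor names, values and group labels are hypothetical.

```python
from statistics import mean, pstdev


def zscores(column):
    """Standardize one factor across all countries."""
    mu, sd = mean(column), pstdev(column) or 1.0
    return [(v - mu) / sd for v in column]


def cluster_profiles(table, labels):
    """Return {cluster: {factor: mean z-score}} for a dict of factor -> values."""
    z = {name: zscores(vals) for name, vals in table.items()}
    profiles = {}
    for c in sorted(set(labels)):
        idx = [i for i, l in enumerate(labels) if l == c]
        profiles[c] = {name: mean(col[i] for i in idx) for name, col in z.items()}
    return profiles


# Hypothetical values for six countries and their (already computed) group labels
table = {
    "life_expectancy": [82.0, 81.0, 80.0, 55.0, 57.0, 53.0],
    "under_five_mortality": [4.0, 5.0, 6.0, 90.0, 85.0, 95.0],
}
labels = [0, 0, 0, 1, 1, 1]

profiles = cluster_profiles(table, labels)
for cluster, profile in profiles.items():
    print(cluster, {name: round(value, 2) for name, value in profile.items()})
```

Reading one row of the result tells you how a group differs from the average country on every factor at once.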
Before we finish, here are the 5 most and the 5 least influential factors that Vorteris identified as key to forming the groups.
Top 5
- Maternal mortality ratio deaths per 100 live births 2013 – 91% influence
- Under-five Mortality 2013 per thousands – 90%
- Human Development Index HDI-2014 – 90%
- Infant Mortality 2013 per thousands – 90%
- Life expectancy at birth- years – 90%
Bottom 5
- Renewable sources percentage of total 2012 – 70% influence
- Total tax revenue of GDP 2005-2013 – 72%
- Public health expenditure percentage of GDP 2013 – 73%
- General government final consumption expenditure – Percentage of GDP 2005-2013 – 73%
- General government final consumption expenditure – Annual growth 2005 2013 – 75%
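How Vorteris computes these influence percentages is not described in this post. As a rough illustration only, one common way to score how strongly a factor separates groups is eta squared: the share of the factor's variance that is explained by group membership. The sketch below assumes that measure as a stand-in, and the factor names, values and labels are hypothetical.

```python
from statistics import mean


def eta_squared(values, labels):
    """Share of a factor's variance explained by group membership (0 to 1)."""
    grand = mean(values)
    total = sum((v - grand) ** 2 for v in values)
    if total == 0:
        return 0.0  # a constant factor cannot separate anything
    between = 0.0
    for c in set(labels):
        members = [v for v, l in zip(values, labels) if l == c]
        between += len(members) * (mean(members) - grand) ** 2
    return between / total


# Hypothetical group labels and factor values for six countries
labels = [0, 0, 0, 1, 1, 1]
factors = {
    "under_five_mortality": [4.0, 5.0, 6.0, 90.0, 85.0, 95.0],
    "renewable_share": [10.0, 40.0, 25.0, 30.0, 12.0, 35.0],
}

scores = {name: eta_squared(vals, labels) for name, vals in factors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.1%}")
```

In this toy example the mortality factor differs sharply between the two groups while the renewable share overlaps, so the first ranks far above the second.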
Conclusions
According to Vorteris, if you plan to live in another country or sell your product abroad, it would be wise to check which group that country belongs to. If it belongs to the same group as the country you live in, then you know what to expect.
Could other factors be added to or removed from the analysis? Yes, absolutely. However, it is not always easy to get the information you need when you need it. Big Data analyses usually have several constraints and typically rely on the type of questions posed to the data and to the algorithm, which in turn rely on the creativity of the Data Scientist.
The clustering approach is becoming more and more common in the industry due to its strategic role in organizing and simplifying the decision-making chaos. After all, how could a manager look at 12,220 cells (65 indexes for 188 countries) to define a regional strategy?
Any questions or doubts? Did anything catch your attention? Please leave a comment!
For those who wish to see the platform operating in practice, here is a video using data from Switzerland. Enjoy!
What is Aquarela Advanced Analytics?
Aquarela Analytics is the winner of the CNI Innovation Award in Brazil and a national reference in the application of corporate Artificial Intelligence in industry and large companies. Through the Vorteris platform and the DCM methodology, it serves important clients such as Embraer (aerospace), Scania, Mercedes-Benz, Randon Group (automotive), SolarBR Coca-Cola (food retail), Hospital das Clínicas (healthcare), NTS-Brasil (oil and gas), Auren, SPIC Brasil (energy), Telefônica Vivo (telecommunications), among others.
Stay tuned by following Aquarela’s LinkedIn!
Founder of Aquarela, CEO and architect of the Vorteris platform. Master in Engineering and Knowledge Management, enthusiast of new technologies, with expertise in the Scala functional language and in Machine Learning and AI algorithms.
Founder – Commercial Director, MSc in Business Information Technology at the University of Twente – The Netherlands. Lecturer in the areas of Data Science, Data Governance and business development for Industry 4.0. Responsible for large projects with key industry players in Brazil in the areas of Energy, Telecom, Logistics and Food.