Decision tree – a bit of Data Mining

As this blog slowly grows, I get another chance to explain how relatively simple approaches can solve complex problems. This time, I invite you to read a short explanation of how to make decisions based on a non-linear dataset.

The idea for this article came up in parallel with the article describing the application that supports buying football players in FIFA 2016 (link here: Lewandowski, Messi, Ronaldo, Mbappé – who’s worth buying in FIFA 2016). The aim was to separate the application and the interpretation of its results from a detailed description of the rules behind it.

So now you know: the idea presented in “Decision tree – a bit of Data Mining” is described using an already finished solution. Let me remind you, in general terms, what steps were taken.

Dataset

In every data mining tool, data is a must. Imagine that you have data representing every available player in FIFA 16 with basic information like name, birth date, height, weight, and player ID, regardless of where the players come from or which country they play in. All basic info is stored in the “Player” table. The fun thing is that I couldn’t find a single duplicated name: all names are unique.
On the other side, we have a set of data with a finite number of skills (33 skill values in total), each with a value per footballer. Let’s call the second table “Player_attributes”. Some of the attributes are:

  • crossing,
  • heading accuracy,
  • short passing,
  • volleys,
  • dribbling,
  • ball control,
  • acceleration,
  • shot power,
  • etc.

Along with the skill scores, “Player_attributes” also holds the player’s overall rating and preferred foot. These will be needed in a minute. Both tables are linked by player ID, which is a perfect fit for identification.

Prerequisite

Before we start the data mining itself, let’s do some basic preparation. By this, I mean linking the data from the two tables presented above. As mentioned, this is done by joining on the player ID values (“player_api_id”). See diagram:

Database diagram of “Player” and “Player_Attributes” tables relation. “player_api_id” is the joining value.
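The join can be sketched in a few lines of Python with the built-in sqlite3 module. This is only a minimal illustration and not the application's actual code: the in-memory database, the sample rows, the helper attributes_by_name, and the column names player_name, overall_rating and preferred_foot are my assumptions. Only “player_api_id” and the two table names come from the diagram.

```python
import sqlite3

# In-memory database with two tiny tables mirroring the diagram.
# Column names other than player_api_id are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Player (
    player_api_id INTEGER PRIMARY KEY,
    player_name   TEXT
);
CREATE TABLE Player_Attributes (
    player_api_id  INTEGER,
    overall_rating INTEGER,
    preferred_foot TEXT
);
INSERT INTO Player VALUES (1, 'Player A'), (2, 'Player B');
INSERT INTO Player_Attributes VALUES (1, 88, 'left'), (2, 75, 'right');
""")


def attributes_by_name(conn, name):
    """Return (name, overall rating, preferred foot) for a player, or None."""
    return conn.execute(
        """
        SELECT p.player_name, a.overall_rating, a.preferred_foot
        FROM Player AS p
        JOIN Player_Attributes AS a
          ON a.player_api_id = p.player_api_id
        WHERE p.player_name = ?
        """,
        (name,),
    ).fetchone()


print(attributes_by_name(conn, "Player A"))
```

Looking a player up by name works here only because, as noted earlier, the names are unique. Otherwise the player ID would have to be used for the lookup as well.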

The reason we did this is simple. We want to let users choose players from the “Player” table by name and get their attributes from the “Player_Attributes” table. The diagram below shows a simplified sequence of the steps the application performs before the decision tree comes into action:

The simplified sequence of steps which are done when a user starts the application before the decision tree tries to assign output.

Structure explanation of the Decision tree

I bet many of you imagine a decision tree as an ordinary tree, and you are right. It is nothing more than a root, some branches, and leaves. Only the naming differs a little: the root stays a root, branches keep their name, and leaves are called nodes (leaf nodes).

The first thing to know is that the root is a special node. It’s special because it comes first. It’s worth emphasizing that each node (the root included) represents a question on which a decision has to be made. Branches are the paths that correspond to the available decisions.

Node A contains a decision to be made. By choosing either decision Y or Z we approach Node B or Node C respectively.
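To make the structure concrete, here is a minimal sketch of such a tree in Python. The class Node, its fields, and the method follow are hypothetical names invented for this illustration. They only mirror the picture with Node A, Node B and Node C.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A node holds a question; each branch maps a decision to the next node."""
    question: str
    branches: Dict[str, "Node"] = field(default_factory=dict)

    def follow(self, decision: str) -> Optional["Node"]:
        """Walk along the branch for the given decision, if there is one."""
        return self.branches.get(decision)


# Node A is the root; decisions Y and Z lead to Node B and Node C.
node_b = Node("Node B")
node_c = Node("Node C")
node_a = Node("Node A", {"Y": node_b, "Z": node_c})

print(node_a.follow("Y").question)
```

A node without branches is a leaf: there are no further questions to ask there.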

Decision tree

The main dish of this article. What is so special about a decision tree? Dear reader, you have used one more than once. So far we have two joined tables from which we can get the attributes of the desired footballer. Let me describe in a few words what the decision tree is all about in this project. It’s really simple.

Each time we want to narrow down the output data, we ask a question. It’s exactly like Santa on Christmas Eve. Santa may ask “Who is a kid?” and then “Have you been good this year?”. Do you see? The output set is narrowing fast!

Getting back to the topic. What is the structure of the questions and possible answers in the project being described? I’ll try to explain it with a diagram:

General outline of the decision tree used in the project.

The main part of the project, the decision tree itself, is not as complicated as it may look. To find a role model for the chosen player in the whole dataset of players, the algorithm follows a few simple steps. I’ll write them down in a condensed way to show you how easy it is:

If you choose player X, look for all players who:

  • have the same “preferred foot” AND
  • hold the same “position” AND
  • whose max “attribute” is the same.

The use of the conjunction is not a coincidence. The AND operator gives us a chance to reorder the questions, and sometimes reordering a sentence makes it easier to read. Because logical conjunction is commutative and associative, changing the order of conditions joined by AND doesn’t change the meaning of the sentence (just like multiplication, but in mathematical logic :)).
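A quick way to convince yourself is to ask the three questions in every possible order and compare the results. The sketch below does exactly that on a handful of made-up players. The dictionary keys, the sample data, and the helpers same and narrow are assumptions for illustration only, not the project's real schema.

```python
from itertools import permutations

# Made-up players; keys and values are illustrative assumptions.
players = [
    {"name": "A", "preferred_foot": "left", "position": "striker", "max_attribute": "finishing"},
    {"name": "B", "preferred_foot": "left", "position": "striker", "max_attribute": "finishing"},
    {"name": "C", "preferred_foot": "right", "position": "striker", "max_attribute": "finishing"},
    {"name": "D", "preferred_foot": "left", "position": "midfielder", "max_attribute": "finishing"},
    {"name": "E", "preferred_foot": "left", "position": "striker", "max_attribute": "dribbling"},
]
chosen = players[0]


def same(key):
    """Build a question: does the player share this property with the chosen one?"""
    return lambda player: player[key] == chosen[key]


questions = [same("preferred_foot"), same("position"), same("max_attribute")]


def narrow(candidates, ordered_questions):
    """Ask the questions one by one; each answer narrows the candidate set."""
    for question in ordered_questions:
        candidates = [p for p in candidates if question(p)]
    return candidates


# Try all 3! = 6 orderings of the questions.
results = [
    [p["name"] for p in narrow(players, order)]
    for order in permutations(questions)
]
all_equal = all(result == results[0] for result in results)

print(results[0], all_equal)
```

The order does not change which players survive, but it can change how much work is done: asking the most selective question first leaves fewer candidates for the remaining ones.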

All that’s left is to pick the player with the highest overall score from the output set, just to check what we might expect from the player chosen by the user (the initial one whose statistics we’re checking).
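Put together, the whole procedure fits in one small function. Treat the following as a sketch under my own assumptions rather than the application's real code: the field names, the sample data, and the helpers max_attribute and best_match are hypothetical. Leaving the chosen player out of the output set is also my choice for this example.

```python
def max_attribute(player):
    """Name of the skill in which the player has the highest value."""
    skills = player["skills"]
    return max(skills, key=skills.get)


def best_match(chosen, players):
    """Highest-rated player sharing foot, position and max attribute with chosen."""
    matches = [
        p for p in players
        if p is not chosen  # assumption: the chosen player is not a candidate
        and p["preferred_foot"] == chosen["preferred_foot"]
        and p["position"] == chosen["position"]
        and max_attribute(p) == max_attribute(chosen)
    ]
    # default=None covers the case where the output set is empty.
    return max(matches, key=lambda p: p["overall_rating"], default=None)


# Made-up sample data for illustration.
players = [
    {"name": "Chosen", "preferred_foot": "right", "position": "striker",
     "overall_rating": 70, "skills": {"dribbling": 60, "shot_power": 80}},
    {"name": "Match 1", "preferred_foot": "right", "position": "striker",
     "overall_rating": 85, "skills": {"dribbling": 70, "shot_power": 90}},
    {"name": "Match 2", "preferred_foot": "right", "position": "striker",
     "overall_rating": 78, "skills": {"dribbling": 65, "shot_power": 75}},
    {"name": "Other foot", "preferred_foot": "left", "position": "striker",
     "overall_rating": 92, "skills": {"dribbling": 70, "shot_power": 90}},
    {"name": "Other skill", "preferred_foot": "right", "position": "striker",
     "overall_rating": 90, "skills": {"dribbling": 95, "shot_power": 80}},
]

best = best_match(players[0], players)
print(best["name"] if best else "no match")
```

Note that higher-rated players are skipped when they fail any one of the three questions, which is the whole point of narrowing the set before comparing overall scores.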

Conclusions

In the end, a decision tree is nothing more than asking successive questions that narrow the data down to the best matches. While doing that, we have to be careful, because one wrong question will give us an empty set! It’s like asking: “I’d like to have a superfast and elegant and exotic and cheap car”. Unfortunately, that gives an empty set 🙁
