It’s often said that in data science, most of the time goes into cleaning or preprocessing data. Sometimes as much as 70% of a project’s time may go to reorganizing the data before you can start doing any predictive analytics with it!
Therefore, knowing how to combine your data to prepare it for analysis becomes one of the key abilities to have under your data scientist belt.
That said, it’s important to be comfortable with the different types of joins. Yes, there are other tools and terms, such as appending, merging, and concatenating data. However, joins are the key concept to grasp. Once you understand the join types and learn when and how to apply them, everything else will fall into place.
That’s why in this article, we’ll go through the two most fundamental types of joins:
- the INNER JOIN;
- and the LEFT JOIN.
What about the RIGHT JOIN? Not all data-science-related software products support the RIGHT JOIN, but later in the article we’ll explain a simple workaround that produces the same result in practice!
How to Apply JOINs: Through JOIN Statements
Joins can be applied with multiple software tools for data science, such as SQL or Python. With R, you need to use a substitute (‘merge’) to create a join, but you can still arrive at an output that conceptually corresponds to a join.
So, we’ve presented joins as a concept for data interaction. But how are joins technically applied?
By using join statements.
To illustrate this point, we’ll show you how to create an INNER, LEFT, and RIGHT JOIN in MySQL. Why SQL? Because it’s one of the most popular query languages that allow you to work with relational database management systems (RDBMS).
How Does Relational Algebra Help Us Understand JOINs?
Bringing a relational algebra tool into the picture will make your life much easier when you need to use joins!
The tool in question is the Venn diagram, which shows the logical relations between a finite collection of sets.
In SQL, relational algebra translates into relational schemas, the RDBMS tool that helps you plan how to link tables. Also, keep in mind that with relational schemas, the sets of the Venn diagram are in fact separate data tables that act as our data sources.
So, let’s see how we can create a join in MySQL.
How to set up the database?
We’ll use the “employees” database which you can download here.
Let’s focus on the “employees” and the “dept_emp” tables.
How do JOINs and relational schemas work?
Technically, SQL JOINs return result sets containing fields derived from two or more tables. Therefore, we can use a join to relate the “employees” and “dept_emp” tables. From the “employees” table, we can extract information about a certain group of individuals, such as employee number and first and last name. From the “dept_emp” table, we can extract the department number and the start date of the labor contract.
But how can we actually relate the two tables? By designating a common key, also known as a common field, matching column or related column. It’s a column you can find in both the “employees” and “dept_emp” tables. In our example, this is the employee number column, “emp_no”.
One thing to keep in mind – when joining our data sources, it is crucial to correctly specify the matching columns on which we are going to combine our data!
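As a rough illustration of joining on a common key, the sketch below builds tiny stand-ins for the “employees” and “dept_emp” tables and relates them through “emp_no”. It makes several assumptions: an in-memory SQLite database (via Python’s built-in `sqlite3` module) stands in for MySQL, the rows are made up, and the column names other than `emp_no` and `dept_no` are guesses for illustration rather than the actual schema.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_no INTEGER, first_name TEXT, last_name TEXT);
    CREATE TABLE dept_emp (emp_no INTEGER, dept_no TEXT, from_date TEXT);
    INSERT INTO employees VALUES (1, 'Ana', 'Lopez'), (2, 'Ben', 'Okoro');
    INSERT INTO dept_emp VALUES (1, 'd005', '2015-03-01'), (2, 'd007', '2018-09-15');
""")

# emp_no exists in both tables, so it serves as the matching column.
query = """
    SELECT e.emp_no, e.first_name, e.last_name, de.dept_no, de.from_date
    FROM employees e
    JOIN dept_emp de ON e.emp_no = de.emp_no
    ORDER BY e.emp_no
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each output row combines fields from both tables for the same employee number. Joining on a different column would pair unrelated records.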
How does the INNER JOIN work?
Let’s refer to the following Venn diagram.
The “dept_manager_dup” table is represented as the set on the left, while the “departments_dup” table is represented by the set, or the circle, on the right.
So, which will be the related column here? We know that we can join tables on columns of the same type and meaning. Therefore, the related column of these two tables will be “dept_no”.
The shared area between the two circles, filled with red, represents all records that belong to both the “dept_manager_dup” and the “departments_dup” tables. We also call this area a result set.
And the INNER JOIN in SQL can help us extract this result set.
By the way, the values that do not match will not appear in our final output. Logically, they are called non-matching values, or non-matching records.
The SQL statement we need to obtain the desired result set is a SELECT statement.
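A minimal sketch of such a SELECT statement might look like the following. It is an illustration under stated assumptions, not necessarily the exact query used with the real database: in-memory SQLite stands in for MySQL, the two tables are reduced to a couple of columns, and the rows (including `dept_name`) are made up.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; simplified, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_manager_dup (emp_no INTEGER, dept_no TEXT);
    CREATE TABLE departments_dup (dept_no TEXT, dept_name TEXT);
    INSERT INTO dept_manager_dup VALUES
        (110022, 'd001'), (110085, 'd002'), (110183, 'd003');
    INSERT INTO departments_dup VALUES
        ('d001', 'Marketing'), ('d002', 'Finance'), ('d010', 'Research');
""")

# Only rows whose dept_no appears in BOTH tables reach the result set.
inner_query = """
    SELECT m.dept_no, m.emp_no, d.dept_name
    FROM dept_manager_dup m
    INNER JOIN departments_dup d ON m.dept_no = d.dept_no
    ORDER BY m.dept_no
"""
inner_rows = conn.execute(inner_query).fetchall()
for row in inner_rows:
    print(row)
```

In this toy data, the manager of “d003” and the department “d010” have no counterpart in the other table, so neither appears in the output.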
Note that with the INNER JOIN, only non-null matching values are in play! Null values will not be matched, because in SQL a comparison with a null value never evaluates to true.
What if such matching values did not exist? Simply, the result set will be empty. There will be no link between the two tables.
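Both points can be checked with a small experiment. The sketch below again assumes an in-memory SQLite database in place of MySQL and uses made-up rows: each table holds one row with a null “dept_no”, and the only matching department is deleted before the query runs a second time.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; simplified, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_manager_dup (emp_no INTEGER, dept_no TEXT);
    CREATE TABLE departments_dup (dept_no TEXT, dept_name TEXT);
    INSERT INTO dept_manager_dup VALUES (110022, 'd001'), (999904, NULL);
    INSERT INTO departments_dup VALUES ('d001', 'Marketing'), (NULL, 'Public Relations');
""")

query = """
    SELECT m.dept_no, m.emp_no, d.dept_name
    FROM dept_manager_dup m
    INNER JOIN departments_dup d ON m.dept_no = d.dept_no
"""

# The two NULL dept_no values are not paired with each other.
with_nulls = conn.execute(query).fetchall()
print(with_nulls)

# Remove the only matching department: the join now finds nothing.
conn.execute("DELETE FROM departments_dup WHERE dept_no = 'd001'")
no_matches = conn.execute(query).fetchall()
print(no_matches)
```

The first result contains only the “d001” row, and the second is an empty list.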
To sum things up, an INNER JOIN of the “dept_manager_dup” and “departments_dup” tables returns a new set combining the records from both tables whose values in the matching column are identical.
How Does the LEFT JOIN Work?
As with the INNER JOIN, we use a SELECT statement. With a LEFT JOIN clause, it returns all records of the table designated to be on the left side of the join, together with the matching records from the table designated to be on the right side.
However, when the right table has no record that matches a record from the left table on the common field, null values will be displayed in the right table’s columns!
Here’s how it looks.
Compared to the INNER JOIN, the result set, colored in red, will include not only the common area but also the rest of the area of the left table.
Here’s the SELECT statement that will deliver this output to us.
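For illustration, a LEFT JOIN query of this kind could be sketched as follows. The same assumptions apply as elsewhere in such sketches: in-memory SQLite in place of MySQL, simplified tables, and made-up rows.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; simplified, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_manager_dup (emp_no INTEGER, dept_no TEXT);
    CREATE TABLE departments_dup (dept_no TEXT, dept_name TEXT);
    INSERT INTO dept_manager_dup VALUES
        (110022, 'd001'), (110085, 'd002'), (110183, 'd003');
    INSERT INTO departments_dup VALUES
        ('d001', 'Marketing'), ('d002', 'Finance'), ('d010', 'Research');
""")

# Every row of the left table is kept; unmatched rows get NULL
# (None in Python) in the columns that come from the right table.
left_query = """
    SELECT m.dept_no, m.emp_no, d.dept_name
    FROM dept_manager_dup m
    LEFT JOIN departments_dup d ON m.dept_no = d.dept_no
    ORDER BY m.dept_no
"""
left_rows = conn.execute(left_query).fetchall()
for row in left_rows:
    print(row)
```

Here the manager of “d003” is kept even though no such department exists in the right table; its `dept_name` comes back as null.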
There are two important takeaways about the LEFT JOIN.
First, it is not limited to the inner part, the overlap of the two sets of data; it also returns the left table’s records that fall outside it. That is why it is also called a LEFT OUTER JOIN.
Second, the order of the tables matters! Whether you set “dept_manager_dup” to be on the left and “departments_dup” to be on the right or vice versa can lead to quite different results. So, pay attention to the table order when writing your LEFT JOIN to make sure you obtain the result you’re aiming for.
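The effect of table order can be seen by running the same LEFT JOIN twice with the tables swapped. This is a sketch under the usual assumptions: in-memory SQLite instead of MySQL, and small made-up tables.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; simplified, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_manager_dup (emp_no INTEGER, dept_no TEXT);
    CREATE TABLE departments_dup (dept_no TEXT, dept_name TEXT);
    INSERT INTO dept_manager_dup VALUES
        (110022, 'd001'), (110085, 'd002'), (110183, 'd003');
    INSERT INTO departments_dup VALUES
        ('d001', 'Marketing'), ('d002', 'Finance'), ('d010', 'Research');
""")

# Managers on the left: every manager is kept.
managers_left = conn.execute("""
    SELECT m.dept_no, m.emp_no, d.dept_name
    FROM dept_manager_dup m
    LEFT JOIN departments_dup d ON m.dept_no = d.dept_no
    ORDER BY m.dept_no
""").fetchall()

# Departments on the left: every department is kept.
departments_left = conn.execute("""
    SELECT d.dept_no, m.emp_no, d.dept_name
    FROM departments_dup d
    LEFT JOIN dept_manager_dup m ON m.dept_no = d.dept_no
    ORDER BY d.dept_no
""").fetchall()

print(managers_left)
print(departments_left)
```

The first result keeps the unmatched manager of “d003”, while the second keeps the unmatched department “d010” instead.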
How does the RIGHT JOIN work?
The functionality of a RIGHT JOIN, or RIGHT OUTER JOIN, mirrors that of a LEFT JOIN. A RIGHT JOIN will return all records of the table designated to be on the right side of the join, together with the matching records from the table designated to be on the left side. Where the left table has no match, null values will be displayed in its columns.
Here’s an example.
Now let’s discuss the workaround we promised at the beginning of the article.
Can we obtain the same output from a right-join query and from a left-join query with the tables inverted? Definitely yes! And that’s exactly what you need to do if you’re using a software tool that doesn’t support RIGHT JOINs!
Therefore, these two queries will lead to the same result set!
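The equivalence can be sketched as below, assuming in-memory SQLite in place of MySQL and made-up rows. SQLite happens to be a handy example of the situation described: it only gained RIGHT JOIN support in version 3.39.0, so the sketch always runs the inverted LEFT JOIN and runs the RIGHT JOIN only when the installed version supports it.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for MySQL; simplified, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_manager_dup (emp_no INTEGER, dept_no TEXT);
    CREATE TABLE departments_dup (dept_no TEXT, dept_name TEXT);
    INSERT INTO dept_manager_dup VALUES
        (110022, 'd001'), (110085, 'd002'), (110183, 'd003');
    INSERT INTO departments_dup VALUES
        ('d001', 'Marketing'), ('d002', 'Finance'), ('d010', 'Research');
""")

columns = "m.dept_no, m.emp_no, d.dept_no, d.dept_name"

# Workaround: LEFT JOIN with the tables inverted.
left_rows = conn.execute(f"""
    SELECT {columns}
    FROM departments_dup d
    LEFT JOIN dept_manager_dup m ON m.dept_no = d.dept_no
    ORDER BY d.dept_no
""").fetchall()
print(left_rows)

# RIGHT JOIN, run only where the SQLite library supports it (3.39.0+).
right_rows = None
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_rows = conn.execute(f"""
        SELECT {columns}
        FROM dept_manager_dup m
        RIGHT JOIN departments_dup d ON m.dept_no = d.dept_no
        ORDER BY d.dept_no
    """).fetchall()
    print(right_rows == left_rows)
```

Both queries keep every department, including “d010”, which has no manager in this toy data.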
Some Final Words on JOINs…
If you want to dig deeper into the field of data science, it’s essential that you understand INNER, LEFT, and RIGHT JOINs well. In fact, all other types of joins, such as FULL OUTER or CROSS JOINs, originate from here. And that makes knowing how joins work even more fundamental.
You can find more details about the INNER, LEFT, RIGHT, and other join types in the SQL tutorials in our blog, so make sure you read them, too.
Source: 365 Data Science Blog