JOINING TABLES
One of SQL's most powerful features is the capability to
join tables on-the-fly within data retrieval queries. Joins are
one of the most important operations you can perform using SQL
SELECT, and a good understanding of joins and join syntax is an
extremely important part of learning SQL.
Before you can effectively use joins, you must understand
relational tables and the basics of relational database design.
What follows is by no means a complete coverage of the subject,
but it should be enough to get you up and running.
Understanding Joins
One of SQL's most powerful features is the capability to
join tables on-the-fly within data retrieval queries. Joins are
one of the most important operations you can perform using SQL
SELECT, and a good understanding of joins and join syntax is an
extremely important part of learning SQL.
Before you can effectively use joins, you must understand
relational tables and the basics of relational database design.
What follows is by no means a complete coverage of the subject,
but it should be enough to get you up and running.
Understanding Relational Tables
The best way to understand relational tables is to look at a
real world example.
Suppose you had a database table containing a product
catalog, with each catalog item in its own row. The kind of
information you would store with each item would include a
product description and price, along with vendor information
about the company that creates the product.
Now suppose you had multiple catalog items created by the
same vendor. Where would you store the vendor information
(things such as vendor name, address, and contact information)?
You wouldn't want to store that data along with the products
for several reasons:
- Because the vendor information is the same for each
product that vendor produces, repeating the information for
each product is a waste of time and storage space.
- If vendor information changes (for example, if the
vendor moves or his area code changes), you would need to
update every occurrence of the vendor information.
- When data is repeated (that is, the vendor information
is used with each product), there is a high likelihood that
the data will not be entered exactly the same way each
time. Inconsistent data is extremely difficult to use in
reporting.
The key here is that having multiple occurrences of the same
data is never a good thing, and that principle is the basis for
relational database design. Relational tables are designed so
information is split into multiple tables, one for each data
type. The tables are related to each other through common
values (and thus the relational in relational design).
In our example, you can create two tables, one for vendor
information and one for product information. The vendors table
contains all the vendor information, one table row per vendor,
along with a unique identifier for each vendor. This value,
called a primary key, can be a vendor ID, or any other unique
value. (Primary keys were first mentioned in Tutorial 1,
"Understanding SQL").
The products table stores only product information, and no
vendor specific information other than the vendor ID (the
vendors table's primary key). This key, called a foreign key,
relates the vendors table to the products table, and using this
vendor ID enables you to use the vendors table to find the
details about the appropriate vendor.
New Term
Foreign Key A column in one table that contains the primary
key values from another table, thus defining the relationships
between tables.
What does this do for you? Well, consider the following:
- Vendor information is never repeated, and so time and
space are not wasted.
- If vendor information changes, you can update a single
record in the vendors table. Data in related tables does
not change.
- As no data is repeated, the data used is obviously
consistent, making data reporting and manipulation much
simpler.
The bottom line is that relational data can be stored
efficiently and manipulated easily. Because of this, relational
databases scale far better than non-relational databases.
New Term
Scale Able to handle an increasing load without failing. A
well-designed database or application is said to scale
well.
Why Use Joins?
As just explained, breaking data into multiple tables
enables more efficient storage, easier manipulation, and
greater scalability. But these benefits come with a price.
If data is stored in multiple tables, how can you retrieve
that data with a single SELECT statement?
The answer is to use a join. Simply put, a join is a
mechanism used to associate tables within a SELECT statement
(and thus the name join). Using a special syntax, multiple
tables can be joined so a single set of output is returned, and
the join associates the correct rows in each table
on-the-fly.
Note
Maintaining Referential Integrity It is important to
understand that a join is not a physical entity in other words,
it does not exist in the actual database tables. A join is
created by MySQL as needed, and it persists for the duration of
the query execution.
When using relational tables, it is important that only
valid data is inserted into relational columns. Going back to
the example, if products were stored in the products table with
an invalid vendor ID (one not present in the vendors table),
those products would be inaccessible because they would not be
related to any vendor.
To prevent this from occurring, MySQL can be instructed to
only allow valid values (ones present in the vendors table) in
the vendor ID column in the products table. This is known as
maintaining referential integrity, and is achieved by
specifying the primary and foreign keys as part of the table
definitions (as will be explained in Tutorial 26, "Creating and
Manipulating Tables").
Creating a Join
Creating a join is very simple. You must specify all the
tables to be included and how they are related to each other.
Look at the following example:
• Input
SELECT vend_name, prod_name, prod_price
FROM vendors, products
WHERE vendors.vend_id = products.vend_id
ORDER BY vend_name, prod_name;

• Output
+-------------+----------------+------------+
| vend_name | prod_name | prod_price |
+-------------+----------------+------------+
| ACME | Bird seed | 10.00 |
| ACME | Carrots | 2.50 |
| ACME | Detonator | 13.00 |
| ACME | Safe | 50.00 |
| ACME | Sling | 4.49 |
| ACME | TNT (1 stick) | 2.50 |
| ACME | TNT (5 sticks) | 10.00 |
| Anvils R Us | .5 ton anvil | 5.99 |
| Anvils R Us | 1 ton anvil | 9.99 |
| Anvils R Us | 2 ton anvil | 14.99 |
| Jet Set | JetPack 1000 | 35.00 |
| Jet Set | JetPack 2000 | 55.00 |
| LT Supplies | Fuses | 3.42 |
| LT Supplies | Oil can | 8.99 |
+-------------+----------------+------------+
• Analysis
Take a look at the preceding code. The SELECT statement
starts in the same way as all the statements you've looked at
thus far, by specifying the columns to be retrieved. The big
difference here is that two of the specified columns (prod_name
and prod_price) are in one table, whereas the other (vend_name)
is in another table.
Now look at the FROM clause. Unlike all the prior SELECT
statements, this one has two tables listed in the FROM clause,
vendors and products. These are the names of the two tables
that are being joined in this SELECT statement. The tables are
correctly joined with a WHERE clause that instructs MySQL to
match vend_id in the vendors table with vend_id in the products
table.
You'll notice that the columns are specified as
vendors.vend_id and products.vend_id. This fully qualified
column name is required here because if you just specified
vend_id, MySQL cannot tell which vend_id columns you are
referring to (as there are two of them, one in each table).
Caution
Fully Qualifying Column Names
You must use the fully qualified column name (table and column
separated by a period) whenever there is a possible ambiguity
about to which column you are referring. MySQL returns an error
message if you refer to an ambiguous column name without fully
qualifying it with a table name.
The Importance of the WHERE Clause
It might seem strange to use a WHERE clause to set the join
relationship, but actually, there is a very good reason for
this. Remember, when tables are joined in a SELECT statement,
that relationship is constructed on-the-fly. Nothing in the
database table definitions can instruct MySQL how to join the
tables. You have to do that yourself. When you join two tables,
what you are actually doing is pairing every row in the first
table with every row in the second table. The WHERE clause acts
as a filter to only include rows that match the specified
filter conditionthe join condition, in this case. Without the
WHERE clause, every row in the first table is paired with every
row in the second table, regardless of if they logically go
together or not.
New Term
Cartesian Product The results returned by a table
relationship without a join condition. The number of rows
retrieved is the number of rows in the first table multiplied
by the number of rows in the second table.
To understand this, look at the following SELECT statement
and output:
• Input
SELECT vend_name, prod_name, prod_price
FROM vendors, products
ORDER BY vend_name, prod_name;


• Output
+----------------+----------------+------------+
| vend_name | prod_name | prod_price |
+----------------+----------------+------------+
| ACME | .5 ton anvil | 5.99 |
| ACME | 1 ton anvil | 9.99 |
| ACME | 2 ton anvil | 14.99 |
| ACME | Bird seed | 10.00 |
| ACME | Carrots | 2.50 |
| ACME | Detonator | 13.00 |
| ACME | Fuses | 3.42 |
| ACME | JetPack 1000 | 35.00 |
| ACME | JetPack 2000 | 55.00 |
| ACME | Oil can | 8.99 |
| ACME | Safe | 50.00 |
| ACME | Sling | 4.49 |
| ACME | TNT (1 stick) | 2.50 |
| ACME | TNT (5 sticks) | 10.00 |
| Anvils R Us | .5 ton anvil | 5.99 |
| Anvils R Us | 1 ton anvil | 9.99 |
| Anvils R Us | 2 ton anvil | 14.99 |
| Anvils R Us | Bird seed | 10.00 |
| Anvils R Us | Carrots | 2.50 |
| Anvils R Us | Detonator | 13.00 |
| Anvils R Us | Fuses | 3.42 |
| Anvils R Us | JetPack 1000 | 35.00 |
| Anvils R Us | JetPack 2000 | 55.00 |
| Anvils R Us | Oil can | 8.99 |
| Anvils R Us | Safe | 50.00 |
| Anvils R Us | Sling | 4.49 |
| Anvils R Us | TNT (1 stick) | 2.50 |
| Anvils R Us | TNT (5 sticks) | 10.00 |
| Furball Inc. | .5 ton anvil | 5.99 |
| Furball Inc. | 1 ton anvil | 9.99 |
| Furball Inc. | 2 ton anvil | 14.99 |
| Furball Inc. | Bird seed | 10.00 |
| Furball Inc. | Carrots | 2.50 |
| Furball Inc. | Detonator | 13.00 |
| Furball Inc. | Fuses | 3.42 |
| Furball Inc. | JetPack 1000 | 35.00 |
| Furball Inc. | JetPack 2000 | 55.00 |
| Furball Inc. | Oil can | 8.99 |
| Furball Inc. | Safe | 50.00 |
| Furball Inc. | Sling | 4.49 |
| Furball Inc. | TNT (1 stick) | 2.50 |
| Furball Inc. | TNT (5 sticks) | 10.00 |
| Jet Set | .5 ton anvil | 5.99 |
| Jet Set | 1 ton anvil | 9.99 |
| Jet Set | 2 ton anvil | 14.99 |
| Jet Set | Bird seed | 10.00 |
| Jet Set | Carrots | 2.50 |
| Jet Set | Detonator | 13.00 |
| Jet Set | Fuses | 3.42 |
| Jet Set | JetPack 1000 | 35.00 |
| Jet Set | JetPack 2000 | 55.00 |
| Jet Set | Oil can | 8.99 |
| Jet Set | Safe | 50.00 |
| Jet Set | Sling | 4.49 |
| Jet Set | TNT (1 stick) | 2.50 |
| Jet Set | TNT (5 sticks) | 10.00 |
| Jouets Et Ours | .5 ton anvil | 5.99 |
| Jouets Et Ours | 1 ton anvil | 9.99 |
| Jouets Et Ours | 2 ton anvil | 14.99 |
| Jouets Et Ours | Bird seed | 10.00 |
| Jouets Et Ours | Carrots | 2.50 |
| Jouets Et Ours | Detonator | 13.00 |
| Jouets Et Ours | Fuses | 3.42 |
| Jouets Et Ours | JetPack 1000 | 35.00 |
| Jouets Et Ours | JetPack 2000 | 55.00 |
| Jouets Et Ours | Oil can | 8.99 |
| Jouets Et Ours | Safe | 50.00 |
| Jouets Et Ours | Sling | 4.49 |
| Jouets Et Ours | TNT (1 stick) | 2.50 |
| Jouets Et Ours | TNT (5 sticks) | 10.00 |
| LT Supplies | .5 ton anvil | 5.99 |
| LT Supplies | 1 ton anvil | 9.99 |
| LT Supplies | 2 ton anvil | 14.99 |
| LT Supplies | Bird seed | 10.00 |
| LT Supplies | Carrots | 2.50 |
| LT Supplies | Detonator | 13.00 |
| LT Supplies | Fuses | 3.42 |
| LT Supplies | JetPack 1000 | 35.00 |
| LT Supplies | JetPack 2000 | 55.00 |
| LT Supplies | Oil can | 8.99 |
| LT Supplies | Safe | 50.00 |
| LT Supplies | Sling | 4.49 |
| LT Supplies | TNT (1 stick) | 2.50 |
| LT Supplies | TNT (5 sticks) | 10.00 |
+----------------+----------------+------------+
• Analysis
As you can see in the preceding output, the Cartesian
product is seldom what you want. The data returned here has
matched every product with every vendor, including products
with the incorrect vendor (and even vendors with no products at
all).
Caution
Don't Forget the WHERE Clause
Make sure all your joins have WHERE clauses, or MySQL returns
far more data than you want. Similarly, make sure your WHERE
clauses are correct. An incorrect filter condition causes MySQL
to return incorrect data.
Tip
Cross Joins Sometimes you'll
hear the type of join that returns a Cartesian Product referred
to as a cross join.
Inner Joins
The join you have been using so far is called an equijoina
join based on the testing of equality between two tables. This
kind of join is also called an inner join. In fact, you may use
a slightly different syntax for these joins, specifying the
type of join explicitly. The following SELECT statement returns
the exact same data as the preceding example:
• Input
SELECT vend_name, prod_name, prod_price
FROM vendors INNER JOIN products
ON vendors.vend_id = products.vend_id;

• Analysis
The SELECT in the statement is the same as the preceding
SELECT statement, but the FROM clause is different. Here the
relationship between the two tables is part of the FROM clause
specified as INNER JOIN. When using this syntax the join
condition is specified using the special ON clause instead of a
WHERE clause. The actual condition passed to ON is the same as
would be passed to WHERE.
Note
Which Syntax to Use? Per the ANSI SQL specification, use
of the INNER JOIN syntax is preferable. Furthermore, although
using the WHERE clause to define joins is indeed simpler, using
explicit join syntax ensures that you will never forget the
join condition, and it can affect performance, too (in some
cases).
Joining Multiple Tables
SQL imposes no limit to the number of tables that may be
joined in a SELECT statement. The basic rules for creating the
join remain the same. First list all the tables, and then
define the relationship between each. Here is an example:
• Input
SELECT prod_name, vend_name, prod_price, quantity
FROM orderitems, products, vendors
WHERE products.vend_id = vendors.vend_id
AND orderitems.prod_id = products.prod_id
AND order_num = 20005;

• Output
+----------------+-------------+------------+----------+
| prod_name | vend_name | prod_price | quantity |
+----------------+-------------+------------+----------+
| .5 ton anvil | Anvils R Us | 5.99 | 10 |
| 1 ton anvil | Anvils R Us | 9.99 | 3 |
| TNT (5 sticks) | ACME | 10.00 | 5 |
| Bird seed | ACME | 10.00 | 1 |
+----------------+-------------+------------+----------+
• Analysis
This example displays the items in order number 20005. Order
items are stored in the orderitems table. Each product is
stored by its product ID, which refers to a product in the
products table. The products are linked to the appropriate
vendor in the vendors table by the vendor ID, which is stored
with each product record. The FROM clause here lists the three
tables, and the WHERE clause defines both of those join
conditions. An additional WHERE condition is then used to
filter just the items for order 20005.
Caution
Performance Considerations
MySQL processes joins at run-time, relating each table as
specified. This process can become very resource intensive, so
be careful not to join tables unnecessarily. The more tables
you join, the more performance degrades.
Now would be a good time to revisit the following example
from Tutorial 19, "Working with Sub-queries." As you will
recall, this SELECT statement returns a list of customers who
ordered product TNT2:
• Input
SELECT cust_name, cust_contact
FROM customers
WHERE cust_id IN (SELECT cust_id
FROM orders
WHERE order_num IN (SELECT order_num
FROM orderitems
WHERE prod_id = 'TNT2'));

As mentioned in Tutorial 19, sub-queries might not always the
most efficient way to perform complex SELECT operations, and so
as promised, here is the same query using joins:
• Input
SELECT cust_name, cust_contact
FROM customers, orders, orderitems
WHERE customers.cust_id = orders.cust_id
AND orderitems.order_num = orders.order_num
AND prod_id = 'TNT2';

• Output
+----------------+--------------+
| cust_name | cust_contact |
+----------------+--------------+
| Coyote Inc. | Y Lee |
| Yosemite Place | Y Sam |
+----------------+--------------+
• Analysis
As explained in Tutorial 19, returning the data needed in
this query requires the use of three tables. But instead of
using them within nested sub-queries, here two joins are used
to connect the tables. There are three WHERE clause conditions
here. The first two connect the tables in the join, and the
last one filters the data for product TNT2.
Tip
It Pays to Experiment As you
can see, there is often more than one way to perform any given
SQL operation. And there is rarely a definitive right or wrong
way. Performance can be affected by the type of operation, the
amount of data in the tables, whether indexes and keys are
present, and a whole slew of other criteria. Therefore, it is
often worth experimenting with different selection mechanisms
to find the one that works best for you.
Summary
Joins are one of the most important and powerful features in
SQL, and using them effectively requires a basic understanding
of relational database design. In this tutorial, you learned
some of the basics of relational database design as an
introduction to learning about joins. You also learned how to
create an equi-join (also known as an inner join), which is the
most commonly used form of join. In the next tutorial you'll
learn how to create other types of joins.
[Next]
|