Query Processing
The database server does the following to carry out a query:
- Parsing and translation
- Optimization
- Evaluation
What this actually means:
- Tokenize the query, identify the keywords, validate against the SQL syntax, and turn the result into a relational algebra expression (rooted in set theory)
- Build a query graph, which we use to devise an execution strategy
- Execute the strategy
- Complex queries are turned into query blocks first
- A query block has a single SELECT-FROM expression, along with any related GROUP BY and HAVING clauses
Some Definitions
- Selection: choosing which tuples (rows) are relevant to a query
- Projection: discarding columns we do not need
To apply a condition, we can either select first and then project the remaining fields, or do it the other way around; the result is the same.
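A toy sketch of this equivalence (the rows and helper functions are invented for illustration, not DBMS internals). The selection predicate only uses a column that the projection keeps, so the two orders agree:

```python
# Selection and projection can be applied in either order when the
# selection predicate only refers to columns the projection keeps.
rows = [
    {"id": 1, "name": "ada", "dept": "cs"},
    {"id": 2, "name": "bob", "dept": "math"},
]

def select(rows, pred):
    # Selection: keep only rows matching the predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):
    # Projection: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

pred = lambda r: r["dept"] == "cs"

a = project(select(rows, pred), ["id", "dept"])   # select first, then project
b = select(project(rows, ["id", "dept"]), pred)   # project first, then select
assert a == b == [{"id": 1, "dept": "cs"}]
```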
Query Execution Plan
Each step of the plan needs annotations to specify how to evaluate the operation.
- This includes what algorithm or index to use
- An algebraic operation with the associated annotations about how to do it is called an evaluation primitive
- The sequence of these primitives forms the plan
- If there are multiple ways to carry out the plan, the system will need to assess which plan is best
- It is not expected that users will write optimal queries
Optimization
- The database server should choose the best approach via query optimization
- We use some sort of heuristics to do so
- Basically, we estimate how long each segment of a query will take and combine these to estimate how long each plan will take (at a high level)
- The limiting factor is usually loading data from disk
The number of block transfers and the number of seeks are the important measures here. To compute how long we think an operation will take:

cost = b * t_t + S * t_s

- Here:
- b is the number of block transfers
- t_t is the time per block transfer
- S is the number of seeks
- t_s is the time per seek

We want to compare plans and assume the worst-case scenario, as we do for Big O notation. Estimates of work include:
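As a sketch, the transfer-and-seek cost measure can be computed directly. The default timings below are illustrative placeholders, not measurements of any particular disk:

```python
def io_cost(b, S, t_t=0.0001, t_s=0.004):
    # Estimated I/O time: b block transfers plus S seeks.
    # t_t and t_s are placeholder timings in seconds, chosen for illustration.
    return b * t_t + S * t_s

# e.g. a scan of 1000 blocks with a single seek:
cost = io_cost(b=1000, S=1)   # 0.104 seconds under these placeholder timings
```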
- How busy a system is
- What is in the buffer: Is stuff we care about in a more convenient location than on disk
- Data layout: how is the data actually stored? Is it packed well?
The query optimizer focuses first on join relations since that is where the largest gains can be made.
- A good way to optimize joins is to apply selection and projection first to produce a smaller intermediate result, and THEN join, so there is less intermediate data and less join work to do
- We want to use common sub-expressions to reduce the amount of space used to represent the expressions during evaluation
- For the database to know how expensive an operation will be (e.g., the number of rows, so we know to avoid large joins), it can check metadata or guess
- Things that might be in metadata:
- n_r: the number of tuples in relation r
- b_r: the number of blocks containing relation r
- s_r: the size in bytes of relation r
- V(A, r): the number of distinct values in r of attribute A
- HT_i: the height of an index i defined on relation r
- The metadata takes time to maintain; maybe we only update it periodically, during quiet periods?
- We just need to be better than not optimizing (not perfect)
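To illustrate how such statistics feed an estimate, here is the textbook equijoin size formula n_r * n_s / max(V(A, r), V(A, s)) as a small sketch. The field names, table names, and numbers are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class RelStats:
    # Catalog-style statistics; illustrative field names, not a real DBMS schema.
    n: int           # n_r: number of tuples
    b: int           # b_r: number of blocks
    distinct: dict   # attribute -> V(A, r), number of distinct values

def estimate_join_size(r: RelStats, s: RelStats, attr: str) -> int:
    # Textbook equijoin size estimate: n_r * n_s / max(V(A, r), V(A, s))
    v = max(r.distinct[attr], s.distinct[attr])
    return (r.n * s.n) // v

emp = RelStats(n=10_000, b=500, distinct={"dept_id": 50})
dept = RelStats(n=50, b=2, distinct={"dept_id": 50})
est = estimate_join_size(emp, dept, "dept_id")   # 10_000 * 50 / 50 = 10_000
```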
Areas Where Query Costs Can Accumulate
- Disk I/O
- Additional disk storage (e.g., for intermediate results)
- Computation
- Memory
- Communication
Disk is the largest and most common one.
Join Elimination
Can a join be skipped? The optimizer can only skip a join when it is certain that skipping it will not affect the outcome. Foreign key and NOT NULL constraints enable elimination.
Plan Selection
The difficulty here is devising all possible evaluation plans, calculating their costs, and choosing the plan with the lowest cost. We want to make sure we don’t spend more time on analysis than on the plan’s execution time itself.
Join Focused
If joins are the most expensive thing, a good plan could be to minimize the amount of joins being done.
- The order of joins in a statement is something the optimizer can choose. Here, there are 3 relations and 12 orderings (permutations). We calculate this using (2(n - 1))! / (n - 1)!.
- A shortcut that helps here is memoization. If we have already computed the best order for a subset of the relations, we can reuse that ordering instead of recomputing it. The saved result can be used repeatedly.
- This might not be globally optimal but this is an estimating process
- The sort order in which tuples are generated is important if the result is used in another join. However, which sort order is useful is specific to each problem.
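The ordering count can be checked directly with a small helper (not part of any DBMS; `join_orderings` is a name invented here):

```python
from math import factorial

def join_orderings(n: int) -> int:
    # Number of different join orders for n relations: (2(n-1))! / (n-1)!
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orderings(3))   # 12, matching the 3-relation example
print(join_orderings(7))   # grows very fast, hence the need for shortcuts
```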
Generating Alternatives
- Efficient storage of expressions
- Duplicate detection
- Find things that are the same as something we have already seen
- Remember optimal plans
- So we don’t recompute things we already know
- Early-terminating algorithm
- If we are worse than the cheapest plan we have found, stop
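These ideas, remembering optimal sub-plans and cutting off anything worse than the cheapest plan found so far, can be sketched as a toy search over relation subsets. The cost model is deliberately crude (sum of intermediate result sizes under a cross-product size assumption) and the relation names and sizes are invented:

```python
from math import prod

rel_sizes = {"r1": 1000, "r2": 10, "r3": 100}

def best_plan(rels, memo=None):
    """Return (cost, order) for joining the given frozenset of relations.

    Cost is the sum of intermediate result sizes; this is a sketch of the
    search structure, not a realistic optimizer cost model.
    """
    if memo is None:
        memo = {}
    if rels in memo:                 # reuse an optimal plan we already know
        return memo[rels]
    if len(rels) == 1:
        (r,) = rels
        memo[rels] = (0, [r])
        return memo[rels]
    best = (float("inf"), [])
    result_size = prod(rel_sizes[r] for r in rels)   # size after this join
    for r in rels:
        sub_cost, order = best_plan(rels - {r}, memo)
        cost = sub_cost + result_size
        if cost >= best[0]:
            continue                 # early termination: worse than best so far
        best = (cost, order + [r])
    memo[rels] = best
    return best

cost, order = best_plan(frozenset(rel_sizes))
# Joining the small relations first minimizes the intermediate results.
```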
Nested Subqueries
- Can we turn nested queries into an equivalent join?
- We could run the subquery once and use that temporary relation in a join
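As a concrete illustration (using SQLite) that an IN-subquery and an equivalent join return the same rows; the tables and data are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp(id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'cs'), (2, 'math');
    INSERT INTO emp VALUES (1, 'ada', 1), (2, 'bob', 2), (3, 'cyd', 1);
""")

# Nested form: subquery picks the matching departments
nested = con.execute("""
    SELECT name FROM emp
    WHERE dept_id IN (SELECT id FROM dept WHERE name = 'cs')
    ORDER BY name
""").fetchall()

# Equivalent join form
joined = con.execute("""
    SELECT DISTINCT emp.name FROM emp
    JOIN dept ON emp.dept_id = dept.id
    WHERE dept.name = 'cs'
    ORDER BY emp.name
""").fetchall()

assert nested == joined == [('ada',), ('cyd',)]
```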
Guidelines for Heuristics
- Perform selection early. The sooner we select, the fewer tuples flow through the rest of the plan, but we can’t throw away tuples we still need
- Perform projection early to toss away columns we don’t need. Once again, don’t discard anything we still need
- Set limits. Choose our plan in a reasonable amount of time.
- For small queries, don’t spend much time at all; we can afford more time on expensive queries. Analysis time should scale with the complexity of the query.
- Plan caching. We can reuse the plan of similar queries in the future. Maybe take an average of the last n executions to estimate how long a query will take.
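A sketch of such a plan cache, keyed by query text and averaging the last n execution times; the class and its structure are hypothetical, not any real DBMS component:

```python
from collections import defaultdict, deque

class PlanCache:
    # Hypothetical plan cache: stores a plan per query and predicts the
    # query's cost from the average of its last n recorded execution times.
    def __init__(self, n=5):
        self.plans = {}                               # query text -> cached plan
        self.times = defaultdict(lambda: deque(maxlen=n))

    def record(self, query, plan, elapsed):
        self.plans[query] = plan
        self.times[query].append(elapsed)             # oldest sample falls out

    def expected_time(self, query):
        t = self.times[query]
        return sum(t) / len(t) if t else None

cache = PlanCache(n=3)
for elapsed in (0.10, 0.20, 0.30, 0.40):              # 0.10 drops out of the window
    cache.record("SELECT name FROM emp", plan="index scan", elapsed=elapsed)
# Average of the last 3 executions: (0.20 + 0.30 + 0.40) / 3 = 0.30
```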