Query Processing
The database server does the following to carry out a query:
- Parsing and translation
- Optimization
- Evaluation
What this actually means:
- Tokenize the query, identify the keywords, validate against the SQL syntax, and turn the result into a relational algebra expression (rooted in set theory)
- Build a query graph, which we use to devise an execution strategy
- Execute the strategy
- Complex queries are turned into query blocks first
- A query block has a single SELECT-FROM expression, along with any related GROUP BY and HAVING clauses
Some Definitions
- Selection: choosing which tuples (rows) are relevant to a query
- Projection: discarding columns we do not need
To apply a condition, we can either select first and then project the remaining fields, or do it the other way around; the result is the same.
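A toy sketch of this equivalence (the rows and helper functions are invented for illustration, not DBMS internals). The selection predicate only uses a column that the projection keeps, so the two orders agree:

```python
# Selection and projection can be applied in either order when the
# selection predicate only refers to columns the projection keeps.
rows = [
    {"id": 1, "name": "ada", "dept": "cs"},
    {"id": 2, "name": "bob", "dept": "math"},
]

def select(rows, pred):
    # Selection: keep only rows matching the predicate
    return [r for r in rows if pred(r)]

def project(rows, cols):
    # Projection: keep only the named columns
    return [{c: r[c] for c in cols} for r in rows]

pred = lambda r: r["dept"] == "cs"

a = project(select(rows, pred), ["id", "dept"])   # select first, then project
b = select(project(rows, ["id", "dept"]), pred)   # project first, then select
assert a == b == [{"id": 1, "dept": "cs"}]
```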
Query Execution Plan
Each step of the plan needs annotations to specify how to evaluate the operation.
- This includes what algorithm or index to use
- An algebraic operation with the associated annotations about how to do it is called an evaluation primitive
- The sequence of these primitives forms the plan
- If there are multiple ways to carry out the plan, the system will need to assess which plan is best
- It is not expected that users will write optimal queries
Optimization
- The database server should choose the best approach via query optimization
- We use some sort of heuristics to do so
- Basically, we estimate how long each segment of a query will take and combine these to estimate how long each plan will take (at a high level)
- The limiting factor is usually loading data from disk
The number of block transfers and the number of seeks are the important measures here. To compute how long we think an operation will take:

cost = b * t_t + S * t_s

- Here:
- b is the number of block transfers
- t_t is the time per block transfer
- S is the number of seeks
- t_s is the time per seek

We want to compare plans and assume the worst-case scenario, as we do for Big O notation. Estimates of work include:
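As a sketch, the transfer-and-seek cost measure can be computed directly. The default timings below are illustrative placeholders, not measurements of any particular disk:

```python
def io_cost(b, S, t_t=0.0001, t_s=0.004):
    # Estimated I/O time: b block transfers plus S seeks.
    # t_t and t_s are placeholder timings in seconds, chosen for illustration.
    return b * t_t + S * t_s

# e.g. a scan of 1000 blocks with a single seek:
cost = io_cost(b=1000, S=1)   # 0.104 seconds under these placeholder timings
```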
- How busy a system is
- What is in the buffer: Is stuff we care about in a more convenient location than on disk
- Data layout: how is the data actually stored? Is it packed well?
The query optimizer focuses first on join relations since that is where the largest gains can be made.
- A good way to optimize joins is to apply selection and projection first to produce a smaller intermediate result, and THEN join, so there is less intermediate data and less join work to do
- We want to use common sub-expressions to reduce the amount of space used to represent the expressions during evaluation
- For the database to know how expensive an operation will be (e.g., the number of rows, so we know to avoid large joins), it can check metadata or guess
- Things that might be in metadata:
- n_r: the number of tuples in relation r
- b_r: the number of blocks containing relation r
- s_r: the size in bytes of relation r
- V(A, r): the number of distinct values in r of attribute A
- HT_i: the height of an index i defined on relation r
- The metadata takes time to maintain; maybe we only update it periodically, during quiet periods?
- We just need to be better than not optimizing (not perfect)
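To illustrate how such statistics feed an estimate, here is the textbook equijoin size formula n_r * n_s / max(V(A, r), V(A, s)) as a small sketch. The field names, table names, and numbers are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class RelStats:
    # Catalog-style statistics; illustrative field names, not a real DBMS schema.
    n: int           # n_r: number of tuples
    b: int           # b_r: number of blocks
    distinct: dict   # attribute -> V(A, r), number of distinct values

def estimate_join_size(r: RelStats, s: RelStats, attr: str) -> int:
    # Textbook equijoin size estimate: n_r * n_s / max(V(A, r), V(A, s))
    v = max(r.distinct[attr], s.distinct[attr])
    return (r.n * s.n) // v

emp = RelStats(n=10_000, b=500, distinct={"dept_id": 50})
dept = RelStats(n=50, b=2, distinct={"dept_id": 50})
est = estimate_join_size(emp, dept, "dept_id")   # 10_000 * 50 / 50 = 10_000
```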
Areas Where Query Costs Can Accumulate
- Disk I/O
- Additional disk storage (e.g., for intermediate results)
- Computation
- Memory
- Communication
Disk is the largest and most common one.
Join Elimination
Can a join be skipped? The optimizer can only skip a join when it is certain that skipping it will not affect the outcome. Foreign key and NOT NULL constraints enable elimination.
Plan Selection
The difficulty here is devising all possible evaluation plans, calculating their costs, and choosing the plan with the lowest cost. We want to make sure we don’t spend more time on analysis than on the plan’s execution time itself.
Join Focused
If joins are the most expensive thing, a good plan could be to minimize the amount of joins being done.
- The order of joins in a statement is something the optimizer can choose. Here, there are 3 relations and 12 orderings (permutations). We calculate this using (2(n - 1))! / (n - 1)!.
- A shortcut that helps here is memoization. If we have already computed the best order for a subset of the relations, we can reuse that ordering instead of recomputing it. The saved result can be used repeatedly.
- This might not be globally optimal but this is an estimating process
- The sort order in which tuples are generated is important if the result is used in another join. However, which sort order is useful is specific to each problem.
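The ordering count can be checked directly with a small helper (not part of any DBMS; `join_orderings` is a name invented here):

```python
from math import factorial

def join_orderings(n: int) -> int:
    # Number of different join orders for n relations: (2(n-1))! / (n-1)!
    return factorial(2 * (n - 1)) // factorial(n - 1)

print(join_orderings(3))   # 12, matching the 3-relation example
print(join_orderings(7))   # grows very fast, hence the need for shortcuts
```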
Generating Alternatives
- Efficient storage of expressions
- Duplicate detection
- Find things that are the same as something we have already seen
- Remember optimal plans
- So we don’t recompute things we already know
- Early-terminating algorithm
- If we are worse than the cheapest plan we have found, stop
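These ideas, remembering optimal sub-plans and cutting off anything worse than the cheapest plan found so far, can be sketched as a toy search over relation subsets. The cost model is deliberately crude (sum of intermediate result sizes under a cross-product size assumption) and the relation names and sizes are invented:

```python
from math import prod

rel_sizes = {"r1": 1000, "r2": 10, "r3": 100}

def best_plan(rels, memo=None):
    """Return (cost, order) for joining the given frozenset of relations.

    Cost is the sum of intermediate result sizes; this is a sketch of the
    search structure, not a realistic optimizer cost model.
    """
    if memo is None:
        memo = {}
    if rels in memo:                 # reuse an optimal plan we already know
        return memo[rels]
    if len(rels) == 1:
        (r,) = rels
        memo[rels] = (0, [r])
        return memo[rels]
    best = (float("inf"), [])
    result_size = prod(rel_sizes[r] for r in rels)   # size after this join
    for r in rels:
        sub_cost, order = best_plan(rels - {r}, memo)
        cost = sub_cost + result_size
        if cost >= best[0]:
            continue                 # early termination: worse than best so far
        best = (cost, order + [r])
    memo[rels] = best
    return best

cost, order = best_plan(frozenset(rel_sizes))
# Joining the small relations first minimizes the intermediate results.
```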
Nested Subqueries
- Can we turn nested queries into an equivalent join?
- We could run the subquery once and use that temporary relation in a join
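As a concrete illustration (using SQLite) that an IN-subquery and an equivalent join return the same rows; the tables and data are invented for the example:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dept(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp(id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
    INSERT INTO dept VALUES (1, 'cs'), (2, 'math');
    INSERT INTO emp VALUES (1, 'ada', 1), (2, 'bob', 2), (3, 'cyd', 1);
""")

# Nested form: subquery picks the matching departments
nested = con.execute("""
    SELECT name FROM emp
    WHERE dept_id IN (SELECT id FROM dept WHERE name = 'cs')
    ORDER BY name
""").fetchall()

# Equivalent join form
joined = con.execute("""
    SELECT DISTINCT emp.name FROM emp
    JOIN dept ON emp.dept_id = dept.id
    WHERE dept.name = 'cs'
    ORDER BY emp.name
""").fetchall()

assert nested == joined == [('ada',), ('cyd',)]
```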
Guidelines for Heuristics
- Perform selection early. The sooner we select, the fewer tuples flow through the rest of the plan, but we can’t throw away tuples we still need
- Perform projection early to toss away columns we don’t need. Once again, don’t discard anything we still need
- Set limits. Choose our plan in a reasonable amount of time.
- For small queries, don’t spend much time at all; we can afford more time on expensive queries. Analysis time should scale with the complexity of the query.
- Plan caching. We can reuse the plan of similar queries in the future. Maybe take an average of the last n executions to estimate how long a query will take.
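A sketch of such a plan cache, keyed by query text and averaging the last n execution times; the class and its structure are hypothetical, not any real DBMS component:

```python
from collections import defaultdict, deque

class PlanCache:
    # Hypothetical plan cache: stores a plan per query and predicts the
    # query's cost from the average of its last n recorded execution times.
    def __init__(self, n=5):
        self.plans = {}                               # query text -> cached plan
        self.times = defaultdict(lambda: deque(maxlen=n))

    def record(self, query, plan, elapsed):
        self.plans[query] = plan
        self.times[query].append(elapsed)             # oldest sample falls out

    def expected_time(self, query):
        t = self.times[query]
        return sum(t) / len(t) if t else None

cache = PlanCache(n=3)
for elapsed in (0.10, 0.20, 0.30, 0.40):              # 0.10 drops out of the window
    cache.record("SELECT name FROM emp", plan="index scan", elapsed=elapsed)
# Average of the last 3 executions: (0.20 + 0.30 + 0.40) / 3 = 0.30
```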