|
GreatSQL 的优化器主要用 mm tree 也就是 min-max tree 来确定条件的范围,然后根据不同索引的范围值来计算 cost,选取 cost 最小的索引来执行SQL。
下面用一个简单的例子来说明 mm tree 是什么。
greatsql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT,date1 DATETIME);
greatsql> INSERT INTO t1 VALUES (1,10,'2021-03-25 16:44:00.123456'),(2,1,'2022-03-26 16:44:00.123456'),(3,4,'2023-03-27 16:44:00.123456'),(5,5,'2024-03-25 16:44:00.123456'),(7,null,'2020-03-25 16:44:00.123456'),(8,10,'2020-10-25 16:44:00.123456'),(11,16,'2023-03-25 16:44:00.123456');
greatsql> CREATE TABLE t2 (cc1 INT PRIMARY KEY, cc2 INT);
greatsql> INSERT INTO t2 VALUES (1,3),(2,1),(3,2),(4,3),(5,15);
greatsql> CREATE TABLE t3 (ccc1 INT, ccc2 varchar(100));
greatsql> INSERT INTO t3 VALUES (1,'aa1'),(2,'bb1'),(3,'cc1'),(4,'dd1'),(null,'ee');
greatsql> CREATE INDEX idx1 ON t1(c2);
greatsql> CREATE INDEX idx2 ON t1(c2,date1);
greatsql> CREATE INDEX idx2_1 ON t2(cc2);
greatsql> CREATE INDEX idx3_1 ON t3(ccc1);
greatsql> EXPLAIN SELECT * FROM t1 WHERE (c1=1 AND c2<10) OR (c2<6 AND date1 < '2023-03-27 16:44:00.123456') OR (c2>10 and c2<15 AND c1>0);
"analyzing_range_alternatives": {
"range_scan_alternatives": [
{
"index": "idx1",
"ranges": [ # 这里面就是一个mm tree二叉树结果
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 < c2 < 15"
],
"index_dives_for_eq_ranges": true,
"rowid_ordered": false,
"using_mrr": false,
"index_only": false,
"in_memory": 1,
"rows": 5,
"cost": 8.51,
"chosen": false,
"cause": "cost"
},
{
"index": "idx2",
"ranges": [ # 这里面就是一个mm tree二叉树结果
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 < c2 < 15"
],
"index_dives_for_eq_ranges": true, # 用范围扫描来估计cost,这个涉及到index dives,下一期讲
"rowid_ordered": false,
"using_mrr": false,
"index_only": true,
"in_memory": 1,
"rows": 5, # 这里的意思是在 c2<15 范围内有5条记录
"cost": 0.761828,
"chosen": true
}
],
"analyzing_roworder_intersect": {
"usable": false,
"cause": "too_few_roworder_scans"
}
},
"chosen_range_access_summary": {
"range_access_plan": {
"type": "range_scan",
"index": "idx2",
"rows": 5,
"ranges": [
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 < c2 < 15"
]
},
"rows_for_plan": 5,
"cost_for_plan": 0.761828,
"chosen": true
}
}
}
]
},
mm tree 实现主要在 get_mm_tree 函数执行的。具体代码流程如下:
# 代码流程:make_join_plan --> estimate_rowcount --> get_quick_record_count --> test_quick_select
int test_quick_select() {
RANGE_OPT_PARAM param;
# 找出所有sql条件涉及的索引
if (setup_range_optimizer_param(thd, return_mem_root, temp_mem_root,
keys_to_use, table, query_block, ¶m)) {
return 0;
}
# 有condition条件的话,执行以下函数
get_mm_tree();
if(tree) {
# 用下面的函数来计算索引和范围对应的cost,这个下一期讲
get_key_scans_params();
}
}
SEL_TREE *get_mm_tree(THD *thd, RANGE_OPT_PARAM *param, table_map prev_tables,
table_map read_tables, table_map current_table,
bool remove_jump_scans, Item *cond) {
# 1、如果是Item::COND_ITEM的话,遍历所有参数
for (Item &item : *down_cast<Item_cond *>(cond)->argument_list()) {
get_mm_tree();
如果是and条件,tree = tree_and()
如果是OR条件,tree = tree_or()
}
# 2、如果非条件Item,见下表一
switch (cond_func->functype()) {
case Item_func::BETWEEN:
case Item_func::IN_FUNC:
case Item_func::MULT_EQUAL_FUNC:
default:
}
}
最后生成SEL_TREE,根据索引数量包含对应的SEL_ROOT,SEL_ROOT包含最小单元SEL_ARG,一个SEL_ARG就是一段范围,用key_range_flags来计算范围,见表二 SEL_ARG一组对象合成一个SEL_ROOT的图结构,内部通过SEL_ARG::next/prev来关联同一个索引列条件"OR",通过next_key_part来关联不同索引列条件的"AND" tree_or操作涉及的比较函数见表三,表四。
原则就是把2个不同的范围进行比较按照结果进行合并操作,涉及的范围拼接因为太多不展开细讲,具体看函数tree.cc tree_or会对两个不同条件组的共同列做key_or处理,生成这个交集列的范围二叉树。即对不同or条件出现次数最多的列做查找范围操作。注意,如果最后某个索引的范围是全覆盖的话是不会生成这个索引的mm tree的。
SEL_ARG的红黑二叉树结构:
parent 黑
/ \
当前SEL_ARG 当前SEL_ARG 红
/ \ / \
left right left right 黑
相关注释怎么看,解释一下:
Range: [--------] 这是一个条件范围
^ ^
start stop
Two overlapping ranges:
[-----] [----] [--]
[---] or [---] or [-------] 三个条件用OR连接起来
Ambiguity: *** 这个***代表模糊
The range starts or stops somewhere in the "***" range.
Example: a starts before b and may end before/the same place/after b
a: [----***] a可能在b的最大值前面/相等/后面
b: [---]
Adjacent ranges:
Ranges that meet but do not overlap. Example: a = "x < 3", b = "x >= 3"
a: ----]
b: [---- b的最小值和a的最大值重合
表一,Item_func对应的TREE操作
Item type | op |
---|---|
Item_func::COND_AND_FUNC | tree_and |
Item_func::COND_OR_FUNC | tree_or |
Item_func::BETWEEN | not between : tree_or between : tree_and |
Item_func::IN_FUNC | not in : tree_or in : tree_and |
Item_func::MULT_EQUAL_FUNC | tree_and |
Item_func::NE_FUNC | tree_or |
表二,SEL_ARG的key_range_flags
type | |
---|---|
NO_MIN_RANGE | from -inf |
NO_MAX_RANGE | to +inf |
NEAR_MIN | X < key |
NEAR_MAX | X > key |
UNIQUE_RANGE | AND(keypart_i = const_i),唯一索引,所有const_i非NULL |
EQ_RANGE | AND(keypart_i = const_i),所有const_i非空 |
NULL_RANGE | AND(keypart_i = const_i),唯一索引,至少一个const_i为NULL |
GEOM_FLAG | rtree index,使用ha_rkey_function::HA_READ_MBR_XXX |
SKIP_RANGE | 只用在NDB引擎 |
SKIP_RECORDS_IN_RANGE | 可以跳过index dives |
DESC_FLAG | 索引是DESC倒序的 |
表三,key_or的cmp_max_to_min的比较结果
Item type | 举例 | cmp结果 | 操作 |
---|---|---|---|
cur_key2 | (10 <= x ... ) | ||
cur_key1 | ( ... x < 10) | -2 | 拼接 |
cur_key1 | ( ... x < 9) | -1 | |
cur_key1 | ( ... x <= 10) | 0 | |
cur_key1 | ( ... x <= 12) | 1 | |
cur_key1 | (10 <= x ... ) | 2 |
表四,key_or的cmp_min_to_max的比较结果
Item type | 举例 | cmp结果 | 操作 |
---|---|---|---|
cur_key2 | (... x <= 10 ) | ||
cur_key1 | ( ... x < 10) | -2 | |
cur_key1 | (9 < x ... ) | -1 | |
cur_key1 | (10 < =x ... ) | 0 | |
cur_key1 | (12 < x ... ) | 1 | |
cur_key1 | (10 < x ... ) | 2 | 拼接 |
表五,SEL_TREE::type
类型 | 说明 |
---|---|
IMPOSSIBLE | 如果任何一个SEL_ROOT::type为IMPOSSIBLE,那么结果就是IMPOSSIBLE |
ALWAYS | 如果只有一个SEL_ROOT并且不是MAYBE_KEY,而且min_flag=NO_MIN_RANGE,max_flag=NO_MAX_RANGE |
KEY | 可以用于至少一个索引上的范围估计 |
注:多个 SEL_ROOT 组成一个 SEL_TREE
表六,SEL_ROOT::type
类型 | 说明 |
---|---|
IMPOSSIBLE | 不可用 |
MAYBE_KEY | 这个范围估计用于另一张表,在决定了表连接顺序之后会重新使用这个范围 |
KEY_RANGE | 可以用于索引上的范围估计 |
注:多个 SEL_ARG 组成一个 SEL_ROOT,一个 SEL_ARG 就是一个范围
接下来看几个例子来说明上面的代码:
greatsql> SELECT * FROM t1 WHERE (c1=1 AND c2<10) OR (c2<6 AND date1 < '2023-03-27 16:44:00.123456') OR (c2>10 and c2<15 AND c1>0);
# 最后mm tree结果如下:
"ranges": [
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 < c2 < 15"
]
通过setup_range_optimizer_param()
获得的param->key_parts
数组,这里每个 key 都包含了主键信息。其中 key 对应param->key[keys]
key_parts | key | part | store_length | field | index |
---|---|---|---|---|---|
0 | 0 | 0 | 4 | c1 | primary key(c1) |
1 | 1 | 0 | 5 | c2 | idx1(c2) |
2 | 1 | 1 | 4 | c1 | primary key(c1) |
3 | 2 | 0 | 5 | c2 | idx2(c2,date1) |
4 | 2 | 1 | 6 | date1 | idx2(c2,date1) |
5 | 2 | 2 | 4 | c1 | primary key(c1) |
根据条件生成的 SEL_ARG
condition | flag | c1 | c2 | date1 |
---|---|---|---|---|
c2<10 | max_flag | NEAR_MAX | ||
min_flag | NEAR_MIN | |||
c1=1 | max_flag | 0 | ||
min_flag | 0 | |||
c2<6 | max_flag | NEAR_MAX | ||
min_flag | NEAR_MIN | |||
date1 < '2023-03-27 16:44:00.123456' | max_flag | 0 | ||
min_flag | NEAR_MIN | |||
c2>10 | max_flag | NO_MAX_RANGE | ||
min_flag | NEAR_MIN | |||
c2<15 | max_flag | NEAR_MAX | ||
min_flag | NEAR_MIN | |||
c1>0 | max_flag | NO_MAX_RANGE | ||
min_flag | NEAR_MIN |
SEL_ARG 经过 key_and 和 key_or 操作后,红黑二叉树结果如下,注意这个二叉树每个索引都有一个,最后通过 cost 值选择 cost 最小的 index。
next_key_part
6=<c2<10黑 -----------------> c1=1
/ \
c2<6红 10<c2<15红 -----------------> c1>0
next_key_part
现在换一个条件里面c1和c2均等的sql,看看最后的结果,看到每个索引用的都是自己相关列的范围进行cost计算。
greatsql> SELECT * FROM t1 WHERE (c1<7 AND c2<10) OR (c2<6 AND c1<10) OR (c2<15 AND c1<8);
"analyzing_range_alternatives": {
"range_scan_alternatives": [
{
"index": "PRIMARY", # 主键索引选了c1
"ranges": [
"c1 < 10"
],
"index_dives_for_eq_ranges": true,
"rowid_ordered": true,
"using_mrr": false,
"index_only": false,
"in_memory": 1,
"rows": 6,
"cost": 0.861465,
"chosen": true
},
{
"index": "idx1", # 索引idx1选了c2
"ranges": [
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 <= c2 < 15"
],
"index_dives_for_eq_ranges": true,
"rowid_ordered": false,
"using_mrr": false,
"index_only": false,
"in_memory": 1,
"rows": 6,
"cost": 2.86,
"chosen": false,
"cause": "cost"
},
{
"index": "idx2", # 索引idx2选了c2
"ranges": [
"NULL < c2 < 6",
"6 <= c2 < 10",
"10 <= c2 < 15"
],
"index_dives_for_eq_ranges": true,
"rowid_ordered": false,
"using_mrr": false,
"index_only": true,
"in_memory": 1,
"rows": 6,
"cost": 0.862285,
"chosen": false,
"cause": "cost"
}
],
"analyzing_roworder_intersect": {
"usable": false,
"cause": "too_few_roworder_scans"
}
},
"chosen_range_access_summary": {
"range_access_plan": {
"type": "range_scan",
"index": "PRIMARY", # 最后选了主键索引来估计cost
"rows": 6,
"ranges": [
"c1 < 10"
]
},
"rows_for_plan": 6,
"cost_for_plan": 0.861465,
"chosen": true
}
}
}
看看每个OR条件都只有一个单独列的情况,因为都无法生成对应索引的tree,所以最后的结果没有走范围估计。
greatsql> SELECT * FROM t1 WHERE c1<7 OR c2<5 OR date1 < '2022-03-25 16:44:00';
"rows_estimation": [
{
"table": "`t1`",
"range_analysis": {
"table_scan": {
"rows": 7,
"cost": 3.05
},
"potential_range_indexes": [
{
"index": "PRIMARY",
"usable": true,
"key_parts": [
"c1"
]
},
{
"index": "idx1",
"usable": true,
"key_parts": [
"c2",
"c1"
]
},
{
"index": "idx2",
"usable": true,
"key_parts": [
"c2",
"date1",
"c1"
]
}
],
"best_covering_index_scan": {
"index": "idx2",
"cost": 0.952742,
"chosen": true
},
"setup_range_conditions": [
],
"range_scan_possible": false, # 这里表示最后没有生成mm tree
"cause": "condition_always_true", # 因为tree->type == SEL_TREE::ALWAYS所以最后无法生成mm tree
"group_index_range": {
"chosen": false,
"cause": "not_group_by_or_distinct"
},
"skip_scan_range": {
"chosen": false,
"cause": "disjuntive_predicate_present"
}
}
}
]
},
从上面优化器最早的步骤我们知道了在rows_estimation
的时候可以通过 mm tree 来确定是否可以用范围扫描来查询,并且接着分别用不同索引的范围来计算 cost,最后选出 cost 最小的索引和范围。如果单个索引计算出来发现范围全覆盖,就不会生成对应的 mm tree 了,也就不会对这个索引用范围扫描了。其中涉及mm tree 的 cost 计算下期介绍。
合作电话:010-64087828
社区邮箱:greatsql@greatdb.com