最近遇到的一个库&表字符集大坑

2022-04-19 16:53:50 浏览数 (1)

我们生产环境的建表语句,之前一直要求研发提交时候不要带上字符集和排序集,这样就可以follow数据库默认的配置。但是最近发现掉坑里了。

至于是什么坑,为什么掉坑。可以看下面的例子

代码语言:javascript复制
环境:
CentOS7 
GreatSQL8.0.25,可以理解为Percona的加强版,增加了一些挺不错的feature,例如并行查询之类特性。 
这个案例只要是MySQL8的任一版本就可以。


[test]> s
--------------
/usr/local/mysql/bin/mysql  Ver 8.0.25-15 for Linux on x86_64 (GreatSQL, Release 15, Revision c7feae175e0)

Connection id:    27
Current database: test
Current user:   root@localhost
SSL:      Not in use
Current pager:    less
Using outfile:    ''
Using delimiter:  ;
Server version:   8.0.25-15 GreatSQL, Release 15, Revision c7feae175e0
Protocol version: 10
Connection:   Localhost via UNIX socket
Server characterset:  utf8mb4
Db     characterset:  utf8mb4
Client characterset:  utf8mb4
Conn.  characterset:  utf8mb4
UNIX socket:    /data/GreatSQL/mysql.sock
Binary data as:   Hexadecimal
Uptime:     15 hours 44 min 8 sec

Threads: 4  Questions: 272  Slow queries: 25  Opens: 431  Flush tables: 3  Open tables: 347  Queries per second avg: 0.004
--------------

-- 字符集和字符排序集
[test]> show global variables like '%charac%';
 -------------------------- -------------------------------------------- 
| Variable_name            | Value                                      |
 -------------------------- -------------------------------------------- 
| character_set_client     | utf8mb4                                    |
| character_set_connection | utf8mb4                                    |
| character_set_database   | utf8mb4                                    |
| character_set_filesystem | binary                                     |
| character_set_results    | utf8mb4                                    |
| character_set_server     | utf8mb4                                    |
| character_set_system     | utf8mb3                                    |
| character_sets_dir       | /usr/local/GreatSQL-8.0.25/share/charsets/ |
 -------------------------- -------------------------------------------- 


[test]> show global variables like '%collation%';
 ------------------------------- -------------------- 
| Variable_name                 | Value              |
 ------------------------------- -------------------- 
| collation_connection          | utf8mb4_0900_ai_ci |
| collation_database            | utf8mb4_0900_ai_ci |
| collation_server              | utf8mb4_0900_ai_ci |
| default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci |
 ------------------------------- -------------------- 
4 rows in set (0.00 sec)


-- 开始创建库表
-- db1模拟的是mysql5.7升级到8.0之前,业务侧提交的建库建表语句
CREATE DATABASE db1 DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

use db1;

create table t ( `a` int DEFAULT NULL,
  `b` varchar(100) DEFAULT NULL
) ENGINE=InnoDB;


-- db2模拟的是升级到8.0之后,业务侧提交的建库建表语句
CREATE DATABASE db2 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

use db2;

create table t ( `a` int DEFAULT NULL,
  `b` varchar(100) DEFAULT NULL
) ENGINE=InnoDB;


-- 看下字符集的情况
[db2]> select 
`TABLE_SCHEMA`,
`TABLE_NAME`,
`TABLE_COLLATION`,
`CREATE_TIME` 
from information_schema.`TABLES` 
where 
`TABLE_SCHEMA` IN ('db1','db2') ;

 -------------- ------------ -------------------- --------------------- 
| TABLE_SCHEMA | TABLE_NAME | TABLE_COLLATION    | CREATE_TIME         |
 -------------- ------------ -------------------- --------------------- 
| db1          | t          | utf8_general_ci    | 2022-04-18 13:14:54 |
| db2          | t          | utf8mb4_0900_ai_ci | 2022-04-18 13:14:57 |
 -------------- ------------ -------------------- --------------------- 
2 rows in set (0.00 sec)

上面这种在生产上这回出现下面这种情况:

刚开始运行的时候,我们用的是mysql5.7,建库默认用字符集utf8 字符排序集utf8_general_ci。这里一点问题也没有。

后来,数据库版本升级到了8.0了,并且改了默认字符集为utf8mb4 字符排序集为utf8mb4_0900_ai_ci,这里也没任何问题。

新创建的数据库,不显式指定的话,也follow数据库层面的字符集和字符排序集,也就是utf8mb4 和utf8mb4_0900_ai_ci 。

但是, 如果我们在老的数据库里面创建新表的时候,如果不显式指定的话,会follow所在数据库的字符集和字符排序集的(也就是 虽然升级到8.x了,但是创建的表还是用的utf8 和utf8_general_ci),这就操蛋了啊。。

代码语言:javascript复制
-- 按时间排序,看下最近有哪些表的字符集存在异常

SELECT 
`TABLE_SCHEMA`,
`TABLE_NAME`,
`TABLE_COLLATION`,
`CREATE_TIME` 
FROM information_schema.`TABLES` 
WHERE 
`TABLE_SCHEMA` NOT IN ('sys','mysql','information_schema','performance_schema') 
AND  
TABLE_COLLATION='utf8_general_ci'
ORDER BY CREATE_TIME DESC 
LIMIT 10

这一堆的表,咋改进呢,挺费事的,如果你直接执行 alter database xxx DEFAULT CHARACTER SET utf8mb4; 这样风险很高的,基本上停服搞了,如果库下面的表有事务还没提交,这个alter database一直是pending的。

那么, 只能从其它方面下手了:

1、严格控制DDL语句,新的库和表必须显式定义 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

2、对于已经是utf8编码的的表 ,暂时不去动它

3、找个时间窗口,把表全部转为 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci (工作量不小。。)

tips,字符集问题带来的索引不生效的案例:

代码语言:javascript复制
[test]> alter table db1.t add index idx_b(b);
[test]> alter table db2.t add index idx_b(b);

[test]> select * from db1.t;
 ------ ------ 
| a    | b    |
 ------ ------ 
|    1 | abc  |
 ------ ------ 
1 row in set (0.00 sec)

[test]> select * from db2.t;
 ------ ------ 
| a    | b    |
 ------ ------ 
|    1 | abc  |
 ------ ------ 
1 row in set (0.00 sec)


-- 示例1, utf8mb3 join utf8mb4 
[test]> select aa.*,bb.* from db1.t as aa inner join db2.t bb on aa.b=bb.b ;
 ------ ------ ------ ------ 
| a    | b    | a    | b    |
 ------ ------ ------ ------ 
|    1 | abc  |    1 | abc  |
 ------ ------ ------ ------ 
1 row in set (0.00 sec)


[test]> desc select aa.*,bb.* from db1.t as aa inner join db2.t bb on aa.b=bb.b G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: aa
   partitions: NULL
         type: ALL
possible_keys: idx_b
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: bb
   partitions: NULL
         type: ref
possible_keys: idx_b
          key: idx_b
      key_len: 403
          ref: db1.aa.b
         rows: 1
     filtered: 100.00
        Extra: Using index condition
2 rows in set, 2 warnings (0.00 sec)


[test]> show warnings G
*************************** 1. row ***************************
  Level: Warning
   Code: 1739
Message: Cannot use ref access on index 'idx_b' due to type or collation conversion on field 'b'
*************************** 2. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `db1`.`aa`.`a` AS `a`,`db1`.`aa`.`b` AS `b`,`db2`.`bb`.`a` AS `a`,`db2`.`bb`.`b` AS `b` from `db1`.`t` `aa` join `db2`.`t` `bb` where (`db1`.`aa`.`b` = `db2`.`bb`.`b`)
2 rows in set (0.00 sec)


-- 示例2, utf8mb4 join utf8mb3
[test]> select aa.*,bb.* from db2.t as aa inner join db1.t bb on aa.b=bb.b ;
 ------ ------ ------ ------ 
| a    | b    | a    | b    |
 ------ ------ ------ ------ 
|    1 | abc  |    1 | abc  |
 ------ ------ ------ ------ 
1 row in set (0.00 sec)
[test]> desc select aa.*,bb.* from db2.t as aa inner join db1.t bb on aa.b=bb.b G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: aa
   partitions: NULL
         type: ALL
possible_keys: idx_b
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: bb
   partitions: NULL
         type: ALL
possible_keys: idx_b
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where; Using join buffer (hash join)
2 rows in set, 3 warnings (0.00 sec)


[test]> show warnings G
*************************** 1. row ***************************
  Level: Warning
   Code: 1739
Message: Cannot use ref access on index 'idx_b' due to type or collation conversion on field 'b'
*************************** 2. row ***************************
  Level: Warning
   Code: 1739
Message: Cannot use range access on index 'idx_b' due to type or collation conversion on field 'b'
*************************** 3. row ***************************
  Level: Note
   Code: 1003
Message: /* select#1 */ select `db2`.`aa`.`a` AS `a`,`db2`.`aa`.`b` AS `b`,`db1`.`bb`.`a` AS `a`,`db1`.`bb`.`b` AS `b` from `db2`.`t` `aa` join `db1`.`t` `bb` where (`db2`.`aa`.`b` = `db1`.`bb`.`b`)
3 rows in set (0.00 sec)

0 人点赞