Commit Graph

1 Commits

Author SHA1 Message Date
Roberto Dip
9896d591c4
ensure duplicates are removed before enforcing collations (#10814)
Related to #10787, this tries to find in the tables with High likelihood
described in the issue.

This successfully accounts for unique keys that contain leading/trailing
whitespace and are using a collation with a pad attribute set to `NO
PAD` (considers whitespace as any other character instead of ignoring
it)

I haven't found a way to successfully detect the same scenario for
special unicode characters, for example:

```
mysql> SELECT TABLE_NAME, TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'software';
+------------+--------------------+
| TABLE_NAME | TABLE_COLLATION    |
+------------+--------------------+
| software   | utf8mb4_general_ci |
+------------+--------------------+
1 row in set (0.01 sec)

mysql> select vendor COLLATE utf8mb4_unicode_ci from software where name = 'zchunk-libs' GROUP BY vendor COLLATE utf8mb4_unicode_ci;
+-----------------------------------+
| vendor COLLATE utf8mb4_unicode_ci |
+-----------------------------------+
| vendor                            |
| vendor?                           |
+-----------------------------------+
2 rows in set (0.01 sec)

mysql> ALTER TABLE `software` CONVERT TO CHARACTER SET `utf8mb4` COLLATE `utf8mb4_unicode_ci`;
ERROR 1062 (23000): Duplicate entry 'zchunk-libs-1.2.1-rpm_packages--vendor\2007-x86_64' for key 'unq_name'
```
> **Note** that `?`  in "vendor?" is an unicode character
2023-03-29 13:31:24 -03:00