9

I have a MySQL database containing data that is input by our customers.

Customers are told their data should only be encoded as UTF-8, however this is currently not being enforced and some people are using characters not encoded in UTF-8.

We intend to add some checking, however how should we detect data that isn't encoded using UTF-8? Is there a SQL or MySQL query (or admin command) that will show me the data that is NOT encoded as UTF-8?

Hannah Vernon
  • 66,859
  • 22
  • 158
  • 297
Bret
  • 93
  • 1
  • 1
  • 3

1 Answers1

12

You can check for the existence of (non-)UTF-8 data by comparing byte length to character length on a column, e.g.:

SELECT * FROM MyTable
WHERE LENGTH(MyColumn) <> CHAR_LENGTH(MyColumn)

Multibyte characters will have a greater LENGTH (bytes), so you'll need to look for where that condition isn't met.

Note that MySQL's utf8 character set isn't true Unicode UTF-8 as it only supports a maximum of 3 bytes per character. If your MySQL is later than 5.5.3 you can use utf8mb4 to get 4 bytes per character.

dartonw
  • 1,297
  • 9
  • 11
  • So in my case using SELECT * FROM TABLE where length(COLUMN) > char_length(COLUMN) worked a treat and returned the entries that had none utf8. – Bret Sep 21 '14 at 19:33
  • 10
    I don't think this is right. UTF-8 can encode characters in 1 to 4 bytes, so how does comparing character length to byte length reveal non-UTF-8 data? One byte per character only contains (roughly) ASCII characters, so the question that you're answering here is how to find non ASCII chars, not non-UTF-8 chars. Am I missing something? – goron Sep 25 '16 at 22:35
  • This is incorrect -- it will flag any text with non-ASCII characters. Valid UTF8, but not ASCII -- just as @goron warned in 2016. – Mikhail T. Jan 12 '22 at 02:53
  • so, to expound upon this answer being incorrect, how would `LENGTH('this is a test') <> CHAR_LENGTH('this is a test')` be discovered as *not* being encoded in UTF-8? That string has the same `LENGTH()` and `CHAR_LENGTH()`. `CHAR_LENGTH()` returns the *number of characters in argument*, and `LENGTH()` returns *the length of a string in bytes*, which is the same for a string consisting of *only* ASCII-encoded characters. – Hannah Vernon Jan 12 '22 at 19:12
  • 1
    @HannahVernon you are right but that's not the question. The OP wants to find if there are characters that are not in UTF-8. Plain ASCII characters are encoded similarly in, say, Latin-1 and UTF-8, but extended ASCII chars are encoded differently. E.g. á in Latin-1 is 0xE1, but in UTF-8 it's 0xC3A1 -- different length. – mustaccio Jan 12 '22 at 19:39
  • @mustaccio but the question isn't about "extended ASCII chars", just if it's encoded using UTF-8 – Lamak Jan 12 '22 at 19:49
  • @mustaccio - yes, except the answer will only return rows where the length and char_length are *different*, but the OP wants to return rows that are encoded in ASCII. `this` is certainly encoded in ASCII, and has the same `length` and `char_length`, unless I'm misunderstanding. – Hannah Vernon Jan 12 '22 at 19:49
  • 1
    For plain ASCII any encoding equals that of UTF-8, for anything else the length will be different. If a column value contains only plain ASCII chars that's the same as UTF-8. `this` is represented in Latin-1 and UTF-8 with the same byte sequence. `thát` is not, which is exactly the problem the OP wants to solve. "not encoded in UTF-8" is not the same question as "encoded in ASCII". – mustaccio Jan 12 '22 at 19:54