ERROR 1253 (42000): COLLATION ‘utf8mb4_unicode_ci’ is not valid for CHARACTER SET ‘utf8’

If you come across this error in MySQL, don’t be a dumb ass like me.

The error was being thrown from the following statement:

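The problem is that the character set should be utf8mb4, not utf8. I was under the impression that the utf8mb4_unicode_ci collation was part of the utf8 character set. Clearly not: a collation only belongs to the character set named in its prefix, so utf8mb4_unicode_ci is only valid for utf8mb4.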

So the correct statement would be:

Note that you’ll need MySQL version 5.5.3 or later to use the utf8mb4 character set.

Notes on bits and bytes

Synopsis

A bit (binary digit) is the smallest unit of data in a computer. A bit has a single binary value, either 0 or 1.

A byte is a collection of bits. In most computer systems, there are eight bits in a byte.

When data is stored in more than one byte the additional byte is stacked to the left of the first byte.

The structure of an 8-bit Byte

Therefore a single 8-bit byte can hold eight 0s or 1s:

A Byte
1 2 3 4 5 6 7 8
0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1 0 or 1

So each of the eight boxes (bits) in a byte can hold one of two values: 0 or 1. Therefore each 8-bit byte has 256 unique combinations (2^8 = 256).

Representing Data in Binary format.

Unsigned integers

(We’ll start with shorts for simplicity, although exactly the same principles apply to regular and long unsigned integers.)

Usually a short unsigned integer is 16 bits wide (2 bytes). Each bit represents a decimal value:

Byte 2 Byte 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
32768’s column 16384’s column 8192’s column 4096’s column 2048’s column 1024’s column 512’s column 256’s column 128’s column 64’s column 32’s column 16’s column 8’s column 4’s column 2’s column 1’s column

So to represent the number 164 you would put a 1 in the 128’s column, a 1 in the 32’s column and a 1 in the 4’s column. Zero in the others.

short unsigned int foo = 164;
Byte 2 Byte 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
32768’s column 16384’s column 8192’s column 4096’s column 2048’s column 1024’s column 512’s column 256’s column 128’s column 64’s column 32’s column 16’s column 8’s column 4’s column 2’s column 1’s column
0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0

A byte is made of 8 bits, so the number of combinations for a byte is 2^8 (256).

However, the maximum (highest) number a byte can represent is (2^8)-1 (255), because the first number is 0, not 1.

Storing larger numbers

Exactly the same principle applies to regular integers as it does to shorts. Regular integers are usually 32 bits wide (4 bytes), which gives them the ability to store larger numbers. Here are the decimal values of each of the 32 columns, and what 30,287 looks like in binary format:

unsigned int foo = 30287;
Byte 4 Byte 3 Byte 2 Byte 1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
2147483648’s column 1073741824’s column 536870912’s column 268435456’s column 134217728’s column 67108864’s column 33554432’s column 16777216’s column 8388608’s column 4194304’s column 2097152’s column 1048576’s column 524288’s column 262144’s column 131072’s column 65536’s column 32768’s column 16384’s column 8192’s column 4096’s column 2048’s column 1024’s column 512’s column 256’s column 128’s column 64’s column 32’s column 16’s column 8’s column 4’s column 2’s column 1’s column
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1

Signed integers

In modern computing, signed integers are almost always stored using two’s complement (although there are other methods). This is because two’s complement does away with negative zero: there is only one bit pattern for 0.

Floating point numbers

TODO

Chars

The ASCII Example

The extended American Standard Code for Information Interchange (ASCII) is a character-encoding scheme originally based on the English alphabet that encodes 256 unique characters*

The computer uses a single (8-bit) byte to store each character in memory. So for example here is how the characters “A” and “$” are stored:

| Binary Number | Symbol |
| --- | --- |
| 01000001 | A |
| 00100100 | $ |

This, in essence, is how computers store char values. Each value is mapped to a binary number.

Like any number, a symbol’s value can be written in base 10, 8 and 16 as well as in base 2.

| Symbol | Decimal | Octal | Hexadecimal | Binary |
| --- | --- | --- | --- | --- |
| $ | 36 | 44 | 24 | 100100 |
| j | 106 | 152 | 6A | 1101010 |

So, you can represent the $ symbol in any of the following ways:

char charSymbol = '$';
printf("Value of charSymbol is: %c \n", charSymbol); // outputs $

char charDec = 36;
printf("Value of charDec is: %c \n", charDec); // outputs $

char charOct = 044;
printf("Value of charOct is: %c \n", charOct); // outputs $

char charHex = 0x24;
printf("Value of charHex is: %c \n", charHex); // outputs $

char charBin = 0b100100;
printf("Value of charBin is: %c \n", charBin); // outputs $

*Some of these are rarely or no longer used, and 33 are non-printing control characters (many now obsolete). Although there are 256 characters in the set, usually only the first 128 (codes 0 to 127) are used. The full table can be found here: [ascii-code.com](http://www.ascii-code.com)

Notes on the printf specifiers in C

| code | type | format |
| --- | --- | --- |
| d | int | decimal (base ten) number |
| o | int | octal number (no leading ‘0’ supplied in printf) |
| x or X | int | hexadecimal number (no leading ‘0x’ supplied in printf; accepted if present in scanf) (for printf, ‘X’ makes it use upper case for the digits ABCDEF) |
| ld | long | decimal number (‘l’ can also be applied to any of the above to change the type from ‘int’ to ‘long’) |
| u | unsigned | decimal number |
| lu | unsigned long | decimal number |
| c | char [footnote] | single character |
| s | char pointer | string |
| f | float [footnote] | number with six digits of precision |
| g | float [footnote] | number with up to six digits of precision |
| e | float [footnote] | number with up to six digits of precision, scientific notation |
| lf | double [footnote] | number with six digits of precision |
| lg | double [footnote] | number with up to six digits of precision |
| le | double [footnote] | number with up to six digits of precision, scientific notation |

Footnote: In printf(), the rvalue type promotions are expected. Thus %c
actually corresponds to a parameter of type int and %f and %g actually
correspond to parameters of type double. Thus in printf() there is no
difference between %f and %lf, or between %g and %lg.
However, in scanf() what is passed is a pointer to the variable so no
rvalue type promotions occur or are expected.
Thus %f and %lf are quite different in scanf, but the same in printf.

From http://www.cdf.toronto.edu/~ajr/209/notes/printf.html

Notes on character strings in C

Synopsis

To initialise a string of characters, use double quotation marks.

char myString[] = "Hello!";

This declaration is the equivalent of the statement

char myString[] = {'H', 'e', 'l', 'l', 'o', '!', '\0'};

\0 is the null character in C (not to be confused with the NULL pointer), and it is used to mark the end of a string stored in an array of chars.

To display a character string with the printf function, use the %s format specifier:

printf("Value of myString is: %s \n", myString);

Why double quotation marks only?

In C (and in C++) single quotes identify a single character (char), while double quotes create a string literal. ‘a’ is a single character constant, while “a” is a string literal containing an ‘a’ and a null terminator (effectively a 2 char array).

An example of string concatenation in C:

#include <stdio.h>

void concat(char result[],
            char string1[],
            char string2[])
{
    // Add string 1.
    char string1Char = string1[0];
    int i = 0;
    
    while (string1Char != '\0') {
        
        result[i] = string1[i];
        
        i++;
        string1Char = string1[i];
    }
    
    // Add string 2.
    char string2Char = string2[0];
    int u = 0;
    int j = i;
    
    while (string2Char != '\0') {
        
        result[j] = string2[u];
        
        u++;
        j++;
        string2Char = string2[u];
    }
    
    // Add NULL char.
    result[j] = '\0';
}

int main(int argc, const char * argv[])
{
    char firstName[] = "Jon";
    char lastName[] = "Matthews";
    char result[12];
    
    concat(result, firstName, lastName);
    
    printf("%s\n", result);
    
    return 0;
}

Notes on the const qualifier in C

Synopsis

const <variable declaration>

Variables that are defined as constant are expected not to have their value modified. If you try to assign to a const variable directly, the compiler must issue a diagnostic (in practice, an error). Modifying one indirectly, for example through a pointer that has had its const cast away, is undefined behaviour. One of the motivations for defining a variable as constant (in addition to added readability) is that it allows the compiler to place it in read-only memory.

Examples:

#include <stdio.h>

const int theNumberFive = 5;
const char myName[] = "Jon";
const double theNumberSix = 6.;

int main(int argc, const char * argv[])
{
    printf("A number is: %i\n", theNumberFive);
    printf("My name is: %s\n", myName);
    printf("Another number is: %f\n", theNumberSix);
    return 0;
}

Notes on macros in C

Synopsis

#define NAME expression

It’s standard convention in C that #define statements are defined with UPPERCASE names, although this is not required.

The preprocessor essentially does a “find and replace” with all #define statements, substituting each name with its expression.

Therefore

#define MY_NAME "Jon"
printf(MY_NAME);

Is seen by the compiler as

#define MY_NAME "Jon"
printf("Jon");

This is why #define statements do not have terminating semicolons. See an example of why below:

#define MY_NAME "Jon";
printf(MY_NAME);

Would be interpreted by the compiler as

#define MY_NAME "Jon";
printf("Jon";); 

It’s common for #define statements to be defined at the start of a program, like so:

#include <stdio.h>

#define MY_NAME "Jon"
#define NUMBER_THREE 3

int main(int argc, const char * argv[])
{
    //...
}

However, this is not required – they can be defined anywhere within a program, so long as they’re defined before they’re referenced.

Scope

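#define statements are not scoped to functions. A macro is visible from the line where it is defined to the end of the file (or until it is removed with #undef), regardless of whether it was defined inside or outside of a function.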

Notes on the array data type in C

Synopsis

[data type] arrayName[array size] = {default value}

Initialisation

To create an array that can hold 10 integer values (with the keys, or indexes, 0 through 9):

int myNumbers[10];

Assigning values:

myNumbers[0] = 5;
myNumbers[1] = -3;
...
myNumbers[9] = 7;

Or you can use the following braces syntax for the declaration and value assignment on a single line.

int myNumbers[10] = {5, -3, 2, 7, -6, 1, 6, -3, 2, 7};

If you wish to set the value of specific keys only using braces, you can do so like this:

int myNumbers[10] = { [3] = 7, [8] = 2, [9] = 7 };

This sets the values of entries 3, 8 and 9.

In C, the elements of a local array that is declared without an initialiser are not set to zero; they hold whatever junk happens to be in that memory location. For example:

int otherNumbers[10];

otherNumbers[0] = 1;
otherNumbers[1] = 2;

does not necessarily mean that keys 2 through 9 have a value of zero (0). So it’s good practice when creating an array to explicitly set the values you want. C has no general fill helper like C++’s std::fill, but if you give an initialiser that lists only some of the elements, such as int otherNumbers[10] = {1, 2};, the remaining elements are set to zero. That means = {0} zeroes an entire array.

In C you can omit the size of the array (also known as the array dimension). E.g:

int numbers[] = {1, 2, 3, 4, 5};

This approach is fine so long as you initialise every element in the array at the point that the array is defined. If this is not the case, you must explicitly define the dimension like so:

int numbers[5];

Below are examples of floating point arrays:

float realNums[3];
realNums[0] = 0.2;
realNums[1] = -4.;
realNums[2] = 6.7219;

double realNums2[3];
realNums2[0] = 0.2;
realNums2[1] = -4.;
realNums2[2] = 6.7219;

Below is an example of a char array. Single char arrays must use single quotes within the value assignments:

char myName[12] = {'J', 'o', 'n', ' ', 'M', 'a', 't', 't', 'h', 'e', 'w', 's'};
    
for (int i = 0; i < 12; i++) {
    printf("%c", myName[i]);
}

Iteration

short unsigned int i;
for (i=0; i<10; i++) {
    printf("%i = %i\n", i, myNumbers[i]);
}

Arrays in functions

When you pass an *entire* array to a function, any modifications to that array modify the original array. The array parameter isn't a copy of the original; it is a pointer to the original's first element, which is also why the array size has to be passed separately. See example below:

#include <stdio.h>

void doubleScores(float scoreArray[], unsigned short int arraySize)
{
    unsigned short int i;
    for (i = 0; i < arraySize; i++) {
        scoreArray[i] = (scoreArray[i] * 2);
    }
}

int main(int argc, const char * argv[])
{
    
    float scores[3] = {0.f};
    
    scores[0] = 7.f;
    scores[1] = 6.33;
    scores[2] = 1.27;
    
    unsigned short int i;
    
    for (i=0; i<3; i++) {
        printf("score %i is %f\n", i, scores[i]);
    }
    
    printf("----------------\n");
    doubleScores(scores, 3);
    
    for (i=0; i<3; i++) {
        printf("Doubled score %i is %f\n", i, scores[i]);
    }
    
    return 0;
}

Output:

score 0 is 7.000000
score 1 is 6.330000
score 2 is 1.270000
----------------
Doubled score 0 is 14.000000
Doubled score 1 is 12.660000
Doubled score 2 is 2.540000

However, individual array elements passed as arguments (arr[x]) are passed as copies, not references (just like normal C primitives int, float, char etc).

#include <stdio.h>

void doubleScore(float singleScore)
{
    singleScore = (singleScore * 2);
}

int main(int argc, const char * argv[])
{
    
    float scores[3] = {0.f};
    
    scores[0] = 7.f;
    scores[1] = 6.33;
    scores[2] = 1.27;
    
    printf("Score 2 is %f\n", scores[2]);
    
    doubleScore(scores[2]);
    
    printf("Score 2 is %f\n", scores[2]);
    
    return 0;
}

Output:

Score 2 is 1.270000
Score 2 is 1.270000

Multidimensional Arrays

int array2D[number_of_rows][number_of_columns];

Here is the initialisation of a multidimensional array with 2 rows and 3 columns:

int array2D[2][3] =
{
{ 2, 7, 1 }, // row 1 (3 columns in each)
{ 9, 4, 2 }  // row 2 (3 columns in each)
};

This array can be depicted like so:

|  | Column 1 | Column 2 | Column 3 |
| --- | --- | --- | --- |
| Row 1 | 2 | 7 | 1 |
| Row 2 | 9 | 4 | 2 |

Items are accessed like so:

|  | Column 1 | Column 2 | Column 3 |
| --- | --- | --- | --- |
| Row 1 | array2D[0][0] | array2D[0][1] | array2D[0][2] |
| Row 2 | array2D[1][0] | array2D[1][1] | array2D[1][2] |

Example of a three dimensional array:

int array3D[2][3][2] =
    {
        { { 2, 6 }, { 7, 3 }, { 1, 5 } },
        { { 9, 8 }, { 4, 4 }, { 2, 1 } }
    };

Notes on the char data type in C

Synopsis

Single character only, like ‘a’ or ‘0’ with the exception of escape characters such as ‘\n’, ‘\r’ or ‘\t’. Single character constants (chars) must use single quotation marks for their declaration.

char

char myChar     = 'a';
char myChar2    = ';';
char myChar3    = '0';
char myChar4    = '\n';
printf("Value of myChar is: %c \n", myChar);
printf("Value of myChar2 is: %c \n", myChar2);
printf("Value of myChar3 is: %c \n", myChar3);
printf("Value of myChar4 is: %c \n", myChar4);

Technically chars are integers. Just with a narrower range.

A char is always exactly 1 byte (sizeof(char) is 1 by definition), and on almost every machine that byte is 8 bits wide. 1 byte of 8 bits allows for 2^8 unique combinations (256), and there are 256 extended ASCII characters (http://www.ascii-code.com). Therefore each character can fit into a single byte, and there isn’t a need for an extra byte.

You can test the range like so:

printf("On this machine, char is stored in %zu bytes. (%zu bits wide).\n",
           sizeof(char),
           ((sizeof(char)) * 8));

You can also declare chars using their decimal, hexadecimal, octal or binary* equivalents.

The code example below represents the tilde constant (see the ASCII table at http://www.ascii-code.com)

char charSymbol = '~';
printf("Value of charSymbol is: %c \n", charSymbol); // outputs ~

char charDec = 126;
printf("Value of charDec is: %c \n", charDec); // outputs ~

char charHex = 0x7E;
printf("Value of charHex is: %c \n", charHex); // outputs ~

char charOct = 0176;
printf("Value of charOct is: %c \n", charOct); // outputs ~

char charBin = 0b01111110;
printf("Value of charBin is: %c \n", charBin); // outputs ~

All of which are identical.

*Binary constants like 0b01111110 only became part of standard C in C23; in earlier standards they are a compiler extension (supported by gcc, among others).

You can also test a variable’s value against the decimal, hex, oct or binary values.

char c;
    
printf("Enter a letter:\n");
scanf("%c", &c);
    
char tilde = 0x7E;
    
if (c == tilde){
    printf("you entered a tilde!\n");
}

Why single quotation marks only?

In C (and in C++) single quotes identify a single character (char), while double quotes create a [string literal](http://joncarlmatthews.com/c/notes/data%20type/2014/08/08/c-notes-data-types-string.html). ‘a’ is a single character constant, while “a” is a string literal containing an ‘a’ and a null terminator (effectively a 2 char array).