I'm attempting to fine-tune Google BERT to classify text into a single integer label (multiclass classification). The model is up and running, but the predicted labels are identical across my test data and the accuracy is very low. The label is 'cpv', which I've encoded using sklearn's LabelEncoder:
The Data - 10,000 rows
        title_description                                  cpv       cpv_enc
95140   software package and information systems Softw...  48000000  23
88628   Secure Building Resource Labour Secure Buildin...  45000000  22
147728  Consultancy Barrow Hill Railway Bridge Consult...  71000000  33
102112  Gas appliance maintenance services Statutory G...  50000000  24
83953   Finlock Gutter Removal and Replacement with PV...  45000000  22
Encode Labels
I'm using SparseCategoricalCrossentropy with a softmax activation, as I have 45 labels in total and they are mutually exclusive. In case the error is in the encoding, the code is below:
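from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# Keep the raw codes in 'cpv' and store the integer ids in 'cpv_enc' (the column used in the splits below)
df['cpv_enc'] = label_encoder.fit_transform(df['cpv'])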
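As a rough illustration of what the encoding step produces, here is a pure-Python stand-in for LabelEncoder (not the sklearn implementation itself). It assumes the same behavior of mapping the sorted unique values to 0..n-1, and uses only the five sample cpv values from the table above, so the resulting ids differ from the real cpv_enc column, which was fitted on all 45 codes.

```python
# Pure-Python stand-in for LabelEncoder: sorted unique values map to 0..n-1.
# The five codes below are the sample rows from the table, not the full dataset.
cpv_codes = [48000000, 45000000, 71000000, 50000000, 45000000]

classes = sorted(set(cpv_codes))                      # analogous to label_encoder.classes_
to_index = {code: i for i, code in enumerate(classes)}

encoded = [to_index[code] for code in cpv_codes]      # analogous to fit_transform
decoded = [classes[i] for i in encoded]               # analogous to inverse_transform

print(encoded)
print(decoded == cpv_codes)
```

For a sparse categorical loss the important property is that the ids are contiguous integers starting at 0, so that the largest id is one less than the number of output units.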
Data Splits
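from sklearn.model_selection import train_test_split
import numpy as np
# Hold out 10% for testing, then 30% of the remainder for validation (both stratified by label)
train, test = train_test_split(df, test_size=0.10, shuffle=True, stratify=df['cpv_enc'])
train, val = train_test_split(train, test_size=0.30, shuffle=True, stratify=train['cpv_enc'])
# Integer label arrays for the sparse loss
y_train = np.asarray(train['cpv_enc'])
y_test = np.asarray(test['cpv_enc'])
y_val = np.asarray(val['cpv_enc'])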
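A quick arithmetic check of the split sizes, assuming exactly 10,000 rows and the fractions used above (both products are whole numbers, so rounding does not matter here). The variable names are illustrative only.

```python
import math

n_rows = 10_000

# First split: 10% held out for testing.
n_test = n_rows * 10 // 100
n_train_val = n_rows - n_test

# Second split: 30% of the remainder held out for validation.
n_val = n_train_val * 30 // 100
n_train = n_train_val - n_val

# Batches per epoch for a batch size of 32 (the last batch is partial).
steps_per_epoch = math.ceil(n_train / 32)

print(n_train, n_val, n_test, steps_per_epoch)
```

Under these assumptions the training set has 6,300 rows, which gives 197 batches per epoch at a batch size of 32 and matches the step count in the first training log. The 2,100 steps per epoch in Update 1 would be consistent with a batch size of 3, if the training set was unchanged.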
I'm using Google BERT as the pre-trained model, and the bert-for-tf2 library to incorporate it as a layer in my model. I won't post the details of how I've created the inputs, but I've adapted the code from this guide. Essentially, it provides a function create_input_array, which returns a list of three 2-dimensional NumPy arrays containing the tokenized word ids, mask ids and segment ids.
x_train = create_input_array(train['title_description'], tokenizer=tokenizer, max_seq_len=128)
x_test = create_input_array(test['title_description'], tokenizer=tokenizer, max_seq_len=128)
x_val = create_input_array(val['title_description'], tokenizer=tokenizer, max_seq_len=128)
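The input format described above can be sketched roughly as follows. This is not the guide's code: create_input_array_sketch, the whitespace tokenizer and the toy vocabulary are all hypothetical, and nested lists stand in for the NumPy arrays. It only illustrates the shape and meaning of the three inputs.

```python
def create_input_array_sketch(texts, tokenize, vocab, max_seq_len=128):
    """Hypothetical stand-in: returns [word_ids, mask_ids, segment_ids] as 2-D lists."""
    word_ids, mask_ids, segment_ids = [], [], []
    for text in texts:
        # Reserve two positions for the [CLS] and [SEP] markers.
        tokens = ["[CLS]"] + tokenize(text)[: max_seq_len - 2] + ["[SEP]"]
        ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
        pad = max_seq_len - len(ids)
        word_ids.append(ids + [vocab["[PAD]"]] * pad)   # token ids, padded on the right
        mask_ids.append([1] * len(ids) + [0] * pad)     # 1 for real tokens, 0 for padding
        segment_ids.append([0] * max_seq_len)           # single-sentence input: all zeros
    return [word_ids, mask_ids, segment_ids]


# Toy vocabulary and tokenizer, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "gas": 4, "appliance": 5}
x_demo = create_input_array_sketch(
    ["Gas appliance maintenance"],
    tokenize=lambda s: s.lower().split(),
    vocab=toy_vocab,
    max_seq_len=8,
)
print(x_demo[0][0])
print(x_demo[1][0])
```

Each of the three lists has one row per text and max_seq_len columns, which is what the three Input layers in the model expect.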
Keras Model
import tensorflow as tf
MAX_SEQ_LEN = 128
input_word_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)
input_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)
segment_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
x = tf.keras.layers.Dropout(0.2)(x)
out = tf.keras.layers.Dense(45, activation="softmax", name="dense_output")(x)
model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=1, batch_size=32, shuffle=True)
197/197 [==============================] - ETA: 0s - loss: 3.8014 - accuracy: 0.0413
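One thing that may be worth checking in the model above is that the output layer applies a softmax while the loss is constructed with from_logits=True. The pure-Python sketch below shows what happens if a loss treats already-normalized probabilities as logits, that is, if softmax is effectively applied twice. Whether Keras actually does this for the model above may depend on the TensorFlow version, so this is an illustration of the arithmetic rather than a diagnosis.

```python
import math


def softmax(values):
    # Subtract the maximum for numerical stability.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    return -math.log(probs[label])


# 45 classes, with the raw scores strongly favouring the correct class 0.
logits = [8.0] + [0.0] * 44
probs = softmax(logits)

loss_single = cross_entropy(probs, 0)            # softmax applied once
loss_double = cross_entropy(softmax(probs), 0)   # probabilities treated as logits

# Lowest loss reachable when the "logits" are confined to [0, 1].
loss_floor = cross_entropy(softmax([1.0] + [0.0] * 44), 0)

print(loss_single, loss_double, loss_floor)
```

In this sketch the double-softmax loss stays high even for a confident, correct prediction, because inputs confined to [0, 1] can only produce a nearly flat distribution over 45 classes.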
Incorporating BERT means I should only need to train for a few epochs; however, after fitting, the accuracy is extremely low and all the predictions come out as the same label. I've been hacking at this for a few days now and don't seem to be getting anywhere. I've also tried using to_categorical to produce one-hot encoded labels and pairing these with a CategoricalCrossentropy loss, but this doesn't help.
I have no idea where I'm going wrong, so any help would be greatly appreciated. The model structure itself was also adapted from a guide, so I'm open to completely rebuilding it!
Update 1
I've changed the accuracy metric to metrics.sparse_categorical_accuracy, and the loss over 4 epochs is as follows:
Epoch 1/4
2100/2100 [==============================] - 170s 81ms/step - loss: 3.7983 - sparse_categorical_accuracy: 0.0459 - val_loss: 3.8056 - val_sparse_categorical_accuracy: 0.0385
Epoch 2/4
2100/2100 [==============================] - 168s 80ms/step - loss: 3.8016 - sparse_categorical_accuracy: 0.0425 - val_loss: 3.7979 - val_sparse_categorical_accuracy: 0.0463
Epoch 3/4
2100/2100 [==============================] - 168s 80ms/step - loss: 3.7977 - sparse_categorical_accuracy: 0.0465 - val_loss: 3.7979 - val_sparse_categorical_accuracy: 0.0463
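For reference, the losses above can be compared with the cross-entropy of a model that assigns equal probability to every class. This minimal check assumes 45 classes, as stated earlier.

```python
import math

num_classes = 45

# Cross-entropy of a uniform prediction: -ln(1/45) = ln(45).
uniform_loss = math.log(num_classes)

# Losses reported in the training log above.
reported_losses = [3.7983, 3.8016, 3.7977]
largest_gap = max(abs(loss - uniform_loss) for loss in reported_losses)

print(round(uniform_loss, 4), round(largest_gap, 4))
```

The reported losses sit within about 0.01 of ln(45), which suggests the network's output is close to uniform over the 45 classes and is not moving away from that during training.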