
I'm attempting to fine-tune Google BERT to classify text into a single integer label (multiclass classification). I have the model up and running; however, the predicted labels are all the same across my test data, and the accuracy is very low. The label is 'cpv', and I've encoded it using sklearn's LabelEncoder (see below).

The Data - 10,000 rows

                     title_description                         cpv    cpv_enc
95140   software package and information systems Softw...   48000000    23
88628   Secure Building Resource Labour Secure Buildin...   45000000    22
147728  Consultancy Barrow Hill Railway Bridge Consult...   71000000    33
102112  Gas appliance maintenance services Statutory G...   50000000    24
83953   Finlock Gutter Removal and Replacement with PV...   45000000    22

Encode Labels

I'm using SparseCategoricalCrossentropy with a softmax activation, as I have 45 labels in total and they are mutually exclusive. In case this is the source of the error, the encoding code is below:

from sklearn.preprocessing import LabelEncoder

# Map the 45 raw CPV codes to integer labels 0-44
label_encoder = LabelEncoder()
df['cpv_enc'] = label_encoder.fit_transform(df['cpv'])
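
As a sanity check on the encoding (a quick sketch using the same label_encoder; the expected values follow from the data table above):

# 45 distinct CPV codes should map to integer labels 0-44
print(len(label_encoder.classes_))                # 45
print(label_encoder.inverse_transform([22, 23]))  # e.g. [45000000 48000000]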

Data Splits

import numpy as np
from sklearn.model_selection import train_test_split

# 90/10 train/test split, then 70/30 train/val, both stratified on the label
train, test = train_test_split(df, test_size=0.10, shuffle=True, stratify=df['cpv_enc'])
train, val = train_test_split(train, test_size=0.30, shuffle=True, stratify=train['cpv_enc'])

y_train = np.asarray(train['cpv_enc'])
y_test = np.asarray(test['cpv_enc'])
y_val = np.asarray(val['cpv_enc'])
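
To confirm the stratification keeps the class balance across the three splits, a quick check (sketch):

# The label proportions should be roughly identical in all three splits
print(train['cpv_enc'].value_counts(normalize=True).head())
print(val['cpv_enc'].value_counts(normalize=True).head())
print(test['cpv_enc'].value_counts(normalize=True).head())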

I'm using Google BERT as a pre-trained model, which I incorporate as a layer into my model via the bert-for-tf2 library. I won't post the details of how I've created the inputs, but I've adapted the code from this guide. Essentially it creates a function create_input_array which returns a list of three 2-dimensional numpy arrays containing the token ids, mask ids, and segment ids.

x_train = create_input_array(train['title_description'], tokenizer=tokenizer, max_seq_len=128)
x_test = create_input_array(test['title_description'], tokenizer=tokenizer, max_seq_len=128)
x_val = create_input_array(val['title_description'], tokenizer=tokenizer, max_seq_len=128)
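
For context, a simplified sketch of what my create_input_array does (my actual version follows the guide; this is just the shape of it, assuming a bert-for-tf2 FullTokenizer):

import numpy as np

def create_input_array(texts, tokenizer, max_seq_len=128):
    word_ids, masks, segments = [], [], []
    for text in texts:
        # Tokenize, truncate to leave room for [CLS]/[SEP], then pad to max_seq_len
        tokens = ["[CLS]"] + tokenizer.tokenize(text)[:max_seq_len - 2] + ["[SEP]"]
        ids = tokenizer.convert_tokens_to_ids(tokens)
        pad_len = max_seq_len - len(ids)
        word_ids.append(ids + [0] * pad_len)
        masks.append([1] * len(ids) + [0] * pad_len)
        segments.append([0] * max_seq_len)  # single sentence, so all segment ids are 0
    return [np.asarray(word_ids), np.asarray(masks), np.asarray(segments)]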

Keras Model

import tensorflow as tf

MAX_SEQ_LEN = 128

# Three inputs matching the arrays produced by create_input_array
input_word_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)
input_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)
segment_ids = tf.keras.layers.Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32)

# bert_layer is the pre-trained BERT layer built with bert-for-tf2
pooled_output, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])

# Average the token embeddings, then classify into the 45 classes
x = tf.keras.layers.GlobalAveragePooling1D()(sequence_output)
x = tf.keras.layers.Dropout(0.2)(x)
out = tf.keras.layers.Dense(45, activation="softmax", name="dense_output")(x)

model = tf.keras.models.Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=1, batch_size=32, shuffle=True)

197/197 [==============================] - ETA: 0s - loss: 3.8014 - accuracy: 0.0413
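
For reference, this is roughly how I'm generating the predictions mentioned above (a minimal sketch):

# Inspect test-set predictions; for me this prints a single label
probs = model.predict(x_test)      # shape (n_test, 45), softmax outputs
preds = np.argmax(probs, axis=1)   # predicted class per example
print(np.unique(preds))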

Incorporating BERT means I should only need to train for a few epochs; however, after fitting, the accuracy is extremely low and every prediction comes out as the same label. I've been hacking at this for a few days now and don't seem to be getting anywhere. I've also tried using to_categorical to produce one-hot encoded labels and training with a CategoricalCrossentropy loss (sketched below), but this doesn't help.
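
Roughly what that one-hot variant looked like (a sketch; everything else kept the same):

from tensorflow.keras.utils import to_categorical

# Same model, but with one-hot labels and the non-sparse loss/metric
y_train_1h = to_categorical(y_train, num_classes=45)
y_val_1h = to_categorical(y_val, num_classes=45)

model.compile(optimizer=optimizer,
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.CategoricalAccuracy('accuracy')])
model.fit(x_train, y_train_1h, validation_data=(x_val, y_val_1h), epochs=1, batch_size=32)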

I have no idea where I'm going wrong; any help would be greatly appreciated. The model structure itself was also adapted from a guide, so I'm open to completely rebuilding it!

Update 1

I've changed the accuracy metric to metrics.sparse_categorical_accuracy, recompiling roughly like this:
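
model.compile(optimizer=optimizer, loss=loss,
              metrics=[tf.keras.metrics.sparse_categorical_accuracy])

The loss over 4 epochs is as follows: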

Epoch 1/4
2100/2100 [==============================] - 170s 81ms/step - loss: 3.7983 - sparse_categorical_accuracy: 0.0459 - val_loss: 3.8056 - val_sparse_categorical_accuracy: 0.0385
Epoch 2/4
2100/2100 [==============================] - 168s 80ms/step - loss: 3.8016 - sparse_categorical_accuracy: 0.0425 - val_loss: 3.7979 - val_sparse_categorical_accuracy: 0.0463
Epoch 3/4
2100/2100 [==============================] - 168s 80ms/step - loss: 3.7977 - sparse_categorical_accuracy: 0.0465 - val_loss: 3.7979 - val_sparse_categorical_accuracy: 0.0463