Help vectorizing text from tfds.load in TensorFlow

draglord · Feb 6, 2024

Hello

I am trying to vectorize text (Wikipedia) loaded with tfds.load. I am trying to do something like [this tutorial](https://www.tensorflow.org/text/tutorials/text_classification_rnn).

That NLP example uses the IMDB reviews data, and I was able to follow it successfully. But I am not able to do the same for the Wikipedia dataset. Apparently there is some inherent difference between the two types of dataset.

I have tried the following:


import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers.experimental.preprocessing  import TextVectorization
# Load Wikipedia dataset from tfds
dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True, split=tfds.Split.TRAIN)

print(type(dataset))
for i in dataset:
    print(i['text'].numpy().decode('utf-8'))

# Create a TextVectorization layer to convert text to vectors
vectorize_layer = TextVectorization(
    max_tokens=100,
    output_mode='int',
    output_sequence_length=50
)

# Adapt the vectorization layer to the dataset
vectorize_layer.adapt(dataset.map(lambda x,y: x['text']))


model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

This much runs without a problem. But when I fit the model:


model.fit(dataset, epochs=5)

I get this error:


>TypeError: Expected string passed to parameter 'input' of op 'StringLower', got {'text': <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string>, 'title': <tf.Tensor 'IteratorGetNext:1' shape=() dtype=string>} of type 'dict' instead. Error: Expected string, got <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string> of type 'Tensor' instead.

What can I do?

Thanks
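
The difference between the two datasets can be seen by inspecting `element_spec`. A minimal sketch, using tiny in-memory stand-ins rather than the real downloads (the sample strings and the names `tuple_ds` and `dict_ds` are made up for illustration): a split loaded with `as_supervised=True`, as in the IMDB tutorial, yields `(text, label)` tuples, while the error message above shows the Wikipedia split yields dicts with `text` and `title` keys.

```python
import tensorflow as tf

# Stand-in for a supervised split: each element is a (text, label) tuple.
tuple_ds = tf.data.Dataset.from_tensor_slices((["good film", "bad film"], [1, 0]))

# Stand-in for an unlabeled split: each element is a dict keyed by feature name.
dict_ds = tf.data.Dataset.from_tensor_slices(
    {"text": ["Some article body."], "title": ["Some title"]}
)

# element_spec describes the structure of one element without iterating the data.
print(tuple_ds.element_spec)
print(dict_ds.element_spec)
```

A layer that expects a string tensor can take the first component of the tuple dataset directly, but from the dict dataset the `text` entry has to be picked out first.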

vishalrao · Feb 6, 2024

Just a wild guess from looking at the error message, and from how you are printing your dataset in your code.

Instead of:

model.fit(dataset, epochs=5)

try this:

model.fit(dataset.map(lambda x,y: x['text']), epochs=5)

But instead of running the map multiple times, save the result to a variable for re-use:

dataset_text = dataset.map(lambda x,y: x['text'])

Did some reading of the TF docs: maybe avoid calling dataset.map() in the TextVectorization.adapt() call? Just pass in dataset as-is to that and also to the model.fit() call?
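
A sketch of the mapping idea, under a few assumptions: the dataset elements are dicts (as the error message shows), so the mapping function receives one argument rather than an `(x, y)` pair; `fit()` would also need a label alongside each text, so all-zero dummy labels are attached here purely for illustration. The tiny in-memory dataset stands in for the Wikipedia split, and the layer is imported from `tf.keras.layers`, which assumes a reasonably recent TensorFlow.

```python
import tensorflow as tf

# Stand-in for the tfds Wikipedia split: each element is a dict of strings.
dataset = tf.data.Dataset.from_tensor_slices({
    "text": ["the cat sat on the mat", "dogs chase cats", "birds sing at dawn"],
    "title": ["Cat", "Dog", "Bird"],
})

# Dict elements arrive as a single argument, so the lambda takes one parameter.
text_ds = dataset.map(lambda x: x["text"])

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100, output_mode="int", output_sequence_length=5
)
vectorize_layer.adapt(text_ds.batch(2))

# Build (features, labels) batches; the zero labels are dummies, not real targets.
train_ds = dataset.batch(2).map(
    lambda x: (vectorize_layer(x["text"]), tf.zeros(tf.shape(x["text"])))
)

features, labels = next(iter(train_ds))
print(features.shape, labels.shape)
```

Vectorizing inside the `tf.data` pipeline keeps the data streamed from the dataset, which may matter for Wikipedia dumps too large to hold as a Python list.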

draglord · Feb 7, 2024

Hey

Thanks for the response

I was able to solve it using this:

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers import TextVectorization

dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True)

# Prepare the text data: decode every article into a Python string
texts = [example['text'].numpy().decode('utf-8') for example in dataset['train']]

# Dummy labels for illustration purposes
labels = [0] * len(texts)

# Create a TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=50000,
    output_mode='tf-idf',
)

# Adapt the layer to the text data
vectorize_layer.adapt(texts)

# Vectorize the text data
vectorized_texts = vectorize_layer(texts)
labels = tf.convert_to_tensor(labels, dtype=tf.float32)

I had to vectorize the texts variable (a plain list of strings) instead of the dataset variable. It's working now, thank you.
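
To see what this kind of vectorization returns, here is a small sketch on three made-up sentences (the sample texts are hypothetical, and the mode is written `"tf_idf"`, the spelling used in the current Keras documentation). Each text becomes one row, with one column per vocabulary entry, rather than a sequence of token ids as in `output_mode='int'`.

```python
import tensorflow as tf

# Hypothetical sample texts standing in for the decoded Wikipedia articles.
texts = tf.constant(["the cat sat", "the dog ran", "a bird sang"])

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50, output_mode="tf_idf"
)
vectorize_layer.adapt(texts)

# One row per text, one TF-IDF weight per vocabulary term.
vectorized_texts = vectorize_layer(texts)
vocab = vectorize_layer.get_vocabulary()
print(vectorized_texts.shape, len(vocab))
```

Because the output is a dense matrix of width equal to the vocabulary size, it would feed a Dense layer directly, without the Embedding layer used in the original model.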
