Help vectorizing text from tfds load in TensorFlow

Hello

I am trying to vectorize text (Wikipedia) loaded with tfds load. I am trying to do something like [this](https://www.tensorflow.org/text/tutorials/text_classification_rnn)

That NLP example uses the IMDB reviews data, and I was able to follow it successfully. But I cannot get it to work for the Wikipedia dataset. Apparently there is some inherent difference between the two types of dataset.

I have tried the following


import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

# Load Wikipedia dataset from tfds
dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True, split=tfds.Split.TRAIN)
print(type(dataset))
for i in dataset:
    print(i['text'].numpy().decode('utf-8'))

# Create a TextVectorization layer to convert text to vectors
vectorize_layer = TextVectorization(
    max_tokens=100,
    output_mode='int',
    output_sequence_length=50
)

# Adapt the vectorization layer to the dataset
vectorize_layer.adapt(dataset.map(lambda x,y: x['text']))

model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(input_dim=len(vectorize_layer.get_vocabulary()), output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

This much runs without a problem. But when I fit the model:


model.fit(dataset, epochs=5)


Then I get the error:

>TypeError: Expected string passed to parameter 'input' of op 'StringLower', got {'text': <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string>, 'title': <tf.Tensor 'IteratorGetNext:1' shape=() dtype=string>} of type 'dict' instead. Error: Expected string, got <tf.Tensor 'IteratorGetNext:0' shape=() dtype=string> of type 'Tensor' instead.

What can I do?

Thanks
 
Just a wild guess from the error message, and going by how your code already pulls out the text field when it prints the dataset.

Instead of:
Python:
model.fit(dataset, epochs=5)

Try this:
Python:
model.fit(dataset.map(lambda x: x['text']), epochs=5)

But rather than running the map multiple times, save the result to a variable for re-use:

Python:
dataset_text = dataset.map(lambda x: x['text'])

Did some reading of the TF docs. Maybe avoid calling `dataset.map()` in the `TextVectorization.adapt()` call? Just pass in `dataset` as-is to that and also to the `model.fit()` call?
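For what it's worth, here is a minimal sketch of how the pieces might fit together. It assumes, going by the error message, that every element of the dataset is a dict with `text` and `title` string tensors, so the function passed to `map` takes one argument. It also assumes a TensorFlow version that exposes the layer as `tf.keras.layers.TextVectorization`. The in-memory toy dataset and the all-zero labels are stand-ins I made up so the snippet runs without downloading anything; `model.fit` needs `(features, label)` pairs, and Wikipedia text carries no labels of its own.

```python
import tensorflow as tf

# Toy stand-in for the tfds Wikipedia split (assumption: each element is a
# dict with 'text' and 'title' string tensors, as the error message suggests).
dataset = tf.data.Dataset.from_tensor_slices({
    "text": ["the cat sat on the mat", "dogs chase cats", "birds fly south"],
    "title": ["Cat", "Dog", "Bird"],
})

# Shows the dict structure of a single element.
print(dataset.element_spec)

# map() hands the dict over as a single argument, so the lambda takes one.
text_ds = dataset.map(lambda x: x["text"]).batch(2)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=100, output_mode="int", output_sequence_length=50
)
vectorize_layer.adapt(text_ds)

# model.fit() expects (features, label) pairs; the labels here are dummies.
train_ds = dataset.map(lambda x: (x["text"], tf.constant(0.0))).batch(2)

model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(
        input_dim=len(vectorize_layer.get_vocabulary()),
        output_dim=8,
        mask_zero=True,
    ),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
history = model.fit(train_ds, epochs=1, verbose=0)
```

With real data the dummy labels would have to be replaced by whatever target the model is actually meant to learn.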
 
Hey

Thanks for the response

I was able to solve it using this

dataset, info = tfds.load("wikipedia/20230601.ab", with_info=True)

# Prepare the text data
texts = [example['text'].numpy().decode('utf-8') for example in dataset['train']]

# Dummy labels for illustration purposes
labels = [0] * len(texts)

# Create a TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=50000,
    output_mode='tf-idf',
)

# Adapt the layer to the text data
vectorize_layer.adapt(texts)

# Vectorize the text data
vectorized_texts = vectorize_layer(texts)
labels = tf.convert_to_tensor(labels, dtype=tf.float32)

I had to vectorize the texts variable, instead of the dataset variable. It's working now, thank you
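If the full list of decoded strings ever gets too large for memory, one possible variation is to keep everything as a `tf.data` pipeline and apply the layer with `map`. This is only a sketch under assumptions: the toy dataset stands in for `dataset['train']`, the layer is taken from `tf.keras.layers.TextVectorization`, and the output mode is spelled `tf_idf`, which is the current name for that mode.

```python
import tensorflow as tf

# Toy stand-in for dataset['train'] (assumption: dict elements with a
# 'text' field, as in the Wikipedia dataset from tfds).
train = tf.data.Dataset.from_tensor_slices({
    "text": ["the cat sat on the mat", "dogs chase cats", "birds fly south"],
    "title": ["Cat", "Dog", "Bird"],
})

# Keep only the text and batch it; nothing is decoded into Python strings.
text_ds = train.map(lambda x: x["text"]).batch(2)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50000,
    output_mode="tf_idf",
)

# adapt() accepts a tf.data.Dataset of strings directly.
vectorize_layer.adapt(text_ds)

# Vectorize lazily, batch by batch, instead of all at once.
vectorized_ds = text_ds.map(vectorize_layer)

for batch in vectorized_ds.take(1):
    print(batch.shape)  # (batch size, vocabulary size)
```

The trade-off is that the vectors are produced on the fly each time the dataset is iterated, rather than stored once in `vectorized_texts`.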
 