About

A Casual Language Joke Modeling Bot

Trained and fine-tuned from the AlekseyKorshuk/gpt2-jokes model on Hugging Face by Magical Macaronis

  • Model: Jokes-GPT
  • Fine-Tuned On: AlekseyKorshuk/gpt2-jokes
  • Dev Time: 3 Weeks
  • Team: Magical Macaronis
  • Epochs: 10
  • Loss: 0.001
  • Repo: Aj-Cdr/jokes-gpt
  • Organization: AI-CAMP

Trained on the Fraser and Jester datasets of approximately 2 million Reddit jokes. The primary intention of this product is to evoke hilarity and lighten someone's mood, while fundamentally testing the proficiency of AI at producing an emotion so simple and yet healthy through something as multi-faceted and variable as comedy.
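
As a rough sketch of how the finished bot can be used, the fine-tuned model can be pulled straight from the Hugging Face Hub (the Aj-Cdr/jokes-gpt repo listed above) and run through a text-generation pipeline. The generation settings below are illustrative assumptions, not the project's final configuration.

```python
# Minimal usage sketch: load the fine-tuned Jokes-GPT model from the Hub
# and generate one joke from a short prompt. Sampling settings are assumptions.
from transformers import pipeline

joke_bot = pipeline("text-generation", model="Aj-Cdr/jokes-gpt")

prompt = "What do you call a"
result = joke_bot(
    prompt,
    max_length=50,           # keep the joke short
    do_sample=True,          # sample for variety instead of greedy decoding
    top_k=50,
    temperature=0.9,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```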

Project Tools & Assets

  • Python: 100%
  • HTML: 100%
  • CSS: 85%
  • JavaScript: 50%
  • Bootstrap: 90%
  • Gradio: 80%
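
Since Gradio is listed among the project tools, here is a minimal sketch of how the joke bot could be wrapped in a Gradio demo; the labels, title, and generation settings are illustrative assumptions rather than the project's actual app code.

```python
# Minimal Gradio demo sketch (assumptions: interface labels and settings).
import gradio as gr
from transformers import pipeline

joke_bot = pipeline("text-generation", model="Aj-Cdr/jokes-gpt")

def tell_joke(prompt):
    # Generate one short completion for the user's prompt.
    output = joke_bot(prompt, max_length=50, do_sample=True, top_k=50)
    return output[0]["generated_text"]

demo = gr.Interface(
    fn=tell_joke,
    inputs=gr.Textbox(label="Start of a joke"),
    outputs=gr.Textbox(label="Jokes-GPT says"),
    title="Jokes-GPT",
)

demo.launch()
```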

Challenges

Along the road to finding the right model, there were many challenges. The first was understanding the concepts behind NLP models and how to implement them; over the course of the first and second weeks, we learned and overcame that obstacle. Next was finding the right model, as discussed in the rest of the slides: everyone tried a different model, and eventually we landed on a decent one. Finally, the quality of the jokes is the last challenge. Filtering out explicitness (bad/inappropriate words) is being addressed by fine-tuning the model and using less explicit data. Time restrictions were another factor in this project; if we were to continue working on it, we would add multilingual support, interactive conversations, and integration with social media.
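
One simple way to attack the explicitness problem mentioned above is to filter the training data before fine-tuning. The sketch below uses a hypothetical blocklist (BAD_WORDS) and assumes the Fraser/short-jokes dataset with a guessed column name; it is an illustration, not the project's actual cleaning script.

```python
# Illustrative sketch: drop jokes containing blocklisted words before training.
# BAD_WORDS is a hypothetical placeholder, not the project's real filter list.
from datasets import load_dataset

BAD_WORDS = {"badword1", "badword2"}   # placeholder entries
TEXT_COLUMN = "text"                   # assumption: check the dataset's schema

def is_clean(example):
    words = example[TEXT_COLUMN].lower().split()
    return not any(word.strip(".,!?") in BAD_WORDS for word in words)

jokes = load_dataset("Fraser/short-jokes", split="train")
clean_jokes = jokes.filter(is_clean)
print(f"Kept {len(clean_jokes)} of {len(jokes)} jokes")
```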

Timeline

Distil-GPT

The pre-trained model that we used initially was DistilGPT-2. The data was split 99.5% test / 0.5% train; out of the 232,000 rows in the dataset, roughly 1,160 rows were used for training. Initially, the validation loss was decreasing, but then runtime errors started to pop up, which forced us to change parameters in the program. The validation loss then fluctuated slightly and started to increase after multiple epochs. Finally, after switching to an A100 GPU, the code processed faster.

  • Errors: memory errors; the model was unable to generate proper text, only gibberish.
  • Loss: 2.1765
  • Hyperparameters = {evaluation_strategy = "epoch", learning_rate = 1e-5, weight_decay = 0.01, push_to_hub = True, num_train_epochs = 3, per_device_train_batch_size = 1} (see the sketch after this list)
  • Input: Cat
    Output: Cat s e r i e s t h o w ? i e n o f l e s t e r y t h e 1 g S t i v e n g. W h u p
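
Below is a rough sketch of how this first DistilGPT-2 run could be reproduced with the Hugging Face Trainer, using the hyperparameters listed above. The dataset choice (Fraser/short-jokes), its column name, and the tokenization details are assumptions.

```python
# Sketch of the initial DistilGPT-2 experiment (hyperparameters from the
# list above; dataset and column names are assumptions).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# 0.5% of the ~232,000 rows for training (~1,160 rows), the rest held out.
dataset = load_dataset("Fraser/short-jokes", split="train")
split = dataset.train_test_split(test_size=0.995, seed=42)

def tokenize(batch):
    # "text" column is an assumption; check the dataset's actual schema.
    return tokenizer(batch["text"], truncation=True, max_length=64)

tokenized = split.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilgpt2-jokes",
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    push_to_hub=True,              # requires a Hugging Face login
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```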

gpt-neo-125m

  • Model: EleutherAI/gpt-neo-125m
  • Loss: 1.7751
  • Dataset: First 1000 items of Jiri Roznovjak’s “Question-Answer Jokes” (Kaggle); attempted as an alternative to Fraser & Jester datasets
  • Training time: ~1-2 minutes
  • Downsides: Outputs are at best only superficially coherent, at worst nonsense not related to jokes.
  • Hyperparameters = {evaluation_strategy = "epoch", learning_rate = 5e-05, train_batch_size = 8, eval_batch_size = 8, seed = 42, num_train_epochs = 1, lr_scheduler_type = linear, optimizer = Adam with betas=(0.9, 0.999) and epsilon=1e-08} (see the data-preparation sketch after this list)
  • Input: Knock Knock
    Output: Knock Knock-out and Knock-out-in-the-world?
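
For this experiment, the Kaggle question-answer jokes had to be turned into plain joke strings before fine-tuning. The sketch below shows one way to prepare the first 1000 items; the CSV filename and the "Question"/"Answer" column names are assumptions.

```python
# Sketch: prepare the first 1000 Question-Answer Jokes for fine-tuning.
# Filename and column names are assumptions about the Kaggle CSV.
import pandas as pd
from datasets import Dataset

df = pd.read_csv("jokes.csv").head(1000)   # first 1000 items only
df["text"] = df["Question"].str.strip() + " " + df["Answer"].str.strip()

qa_jokes = Dataset.from_pandas(df[["text"]])
print(qa_jokes[0]["text"])
```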

Crumb-GPT

  • Added fp16=True to speed up processing; precision was reduced, and the validation loss was higher. There was a lot of experimentation with batch size, learning rate, and other hyperparameters, and the model had no explicit documentation. Time taken to process: 11 hours.
  • Loss: 1.5231
  • Dataset: Fraser Short Jokes
  • Downsides: High Evaluation Loss
  • Hyperparameters = {evaluation_strategy = "epoch", learning_rate = 3e-5, weight_decay = 0.01, num_train_epochs = 15, per_device_train_batch_size = 30} (see the sketch after this list)
  • Input: What do you call a
    Output: What do you call a man with a bad speech impediment? A coffee-o-phile (Credit to John Oliver for this)
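
The distinctive part of this run was mixed precision. Below is a sketch of the training arguments with fp16 enabled, using the hyperparameter values listed above; the output directory and everything else is an assumption.

```python
# Sketch of the Crumb-GPT training arguments (values from the list above).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="crumb-gpt-jokes",      # assumed name
    evaluation_strategy="epoch",
    learning_rate=3e-5,
    weight_decay=0.01,
    num_train_epochs=15,
    per_device_train_batch_size=30,
    fp16=True,                         # mixed precision: faster steps,
                                       # at the cost of some numeric precision
)
```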

Jokes-GPT

  • Pre-trained model: AlekseyKorshuk/gpt2-jokes (Hugging Face)
  • Training data used:
    Hugging Face dataset: Fraser/short-jokes
    Kaggle dataset: Jester 1.7M jokes ratings dataset
  • Hyperparameters = {learning_rate = 5e-05, train_batch_size = 8, eval_batch_size = 8, seed = 42, optimizer = Adam with betas=(0.9, 0.999) and epsilon=1e-08, lr_scheduler_type = linear, num_epochs = 10} (see the sketch after this list)
  • Loss: 0.1577
  • Training time: ~10 minutes, thanks to the multi-GPU support of the pre-trained model and a T4 GPU on Colab.
  • Downsides: Explicitness & Slight Grammar Issues.
  • Input: Your momma's
    Output: Your momma's so fat I said, "Hey momma, we need t oget a big pizza to help her. We'll get a big pizza by noon, we're starving."
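
For reference, here is a sketch of the final fine-tuning setup on top of AlekseyKorshuk/gpt2-jokes with the hyperparameters listed above. The dataset preparation is omitted (it would follow the earlier sketches), and the output directory is an assumption.

```python
# Sketch of the final Jokes-GPT fine-tuning configuration (values from the
# list above; dataset preparation omitted, output_dir assumed).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, set_seed)

set_seed(42)

tokenizer = AutoTokenizer.from_pretrained("AlekseyKorshuk/gpt2-jokes")
model = AutoModelForCausalLM.from_pretrained("AlekseyKorshuk/gpt2-jokes")

args = TrainingArguments(
    output_dir="jokes-gpt",
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    # Adam with betas=(0.9, 0.999) and epsilon=1e-08 matches the Trainer's
    # default optimizer settings, so no override is needed here.
)

# The tokenized Fraser and Jester jokes would then be passed to a Trainer
# together with `model` and `args`, as in the DistilGPT-2 sketch above.
```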

Model