Around early 2023, rumours started flying in the comments of various Instagram posts: something was not quite right with the translate feature. Commenting "Oouga Bouga" made the translate button appear, and tapping it produced nonsensical variations. Some were racist, some were sexual innuendos, and some read like the ramblings of a sentient being.
This is a screen recording from that time -
None of the user comments show the behaviour anymore. Everyone expected this to be quickly patched, since it had the potential to be a PR disaster for Instagram. Yet by December 2023, it still did not seem to be patched.
The comments of this post have some examples that still work in different ways. Then, in March 2024, Matt Rose posted a compilation of all the different variations that users had discovered.
A comment by a Reddit user also points out the connection to Google Translate, where similar behaviour has been observed. You can see that the behaviour still exists.
At this point, while searching for other bugs between Somali and English, I came across this post on a Google support forum.
This issue has now been traced back to at least 2019, mainly on the same Somali-to-English translation pipeline.
Then I found r/translategate, which has examples going back to August 2018. If you go through some of the top posts, you will see that the examples span multiple languages, typically ones that were unlikely to have been extensively trained on. Some of them are still surprising: here is a 2025 screenshot of a Bulgarian-to-English translation that still works.
I was quite puzzled at this point as to how this was still happening after the 2023 Cambrian explosion of LLMs. That is when I came across this post by a Reddit user.
This user's theory was that user contributions had poisoned the well. But if you trace the issue from 2018 to 2025, that would mean the poisoning was remarkably effective and long-lived. So I started looking at other sources to see how Google had built its translation pipelines.
Around 2012, fresh out of the "AI winter" that started in 1987, neural networks were starting to look like a better approach to machine learning. This coincided with an explosion of GPU-based parallel processing capability, which made neural networks far more powerful in practice. AlexNet had won a hard-fought lead of roughly 10 percentage points over the runner-up in the ImageNet challenge. This was followed by breakthroughs from DeepMind, notably AlphaGo and later AlphaZero, which mastered Go and chess.
In 2014, three Google researchers, Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, wrote a paper demonstrating that a neural network could learn to map whole sequences of words to their translations, and that, curiously, reversing the order of the source sequence markedly improved the results.
This was a far leaner method of semantic linkage than what was then the bleeding edge of language translation: Statistical Machine Translation (SMT). SMT broke sentences down into words or phrases and statistically chose translations for those pieces from a large vocabulary dictionary.
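To make the SMT idea concrete, here is a toy sketch of phrase-based decoding. The phrase table and its probabilities are invented for illustration (real SMT systems like the pre-2016 Google Translate used enormous learned phrase tables plus a language model to rerank; this sketch only does greedy longest-match lookup):

```python
# Toy phrase-based SMT decoder. The phrase table is hypothetical:
# each source phrase maps to candidate translations with probabilities.
phrase_table = {
    ("le", "chat"): [("the cat", 0.7), ("the kitty", 0.2)],
    ("dort",): [("sleeps", 0.8), ("is sleeping", 0.15)],
}

def greedy_decode(tokens):
    """Match the longest known source phrase, emit its most probable translation."""
    out, i = [], 0
    while i < len(tokens):
        for span in range(len(tokens) - i, 0, -1):  # longest match first
            phrase = tuple(tokens[i:i + span])
            if phrase in phrase_table:
                best = max(phrase_table[phrase], key=lambda t: t[1])
                out.append(best[0])
                i += span
                break
        else:
            out.append(tokens[i])  # unknown word passes through untranslated
            i += 1
    return " ".join(out)

print(greedy_decode(["le", "chat", "dort"]))  # the cat sleeps
```

The word-or-phrase granularity is the point: nothing in this scheme sees the sentence as a whole, which is exactly what the sequence-to-sequence approach changed.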
What they proposed as an alternative was a "sequence-to-sequence" approach instead of a word-to-word approach. To do this, they used a type of recurrent neural network called the Long Short-Term Memory (LSTM) network, originally introduced by Hochreiter and Schmidhuber in 1997. To train the model, they used 12 million English-French sentence pairs from the WMT'14 dataset, consisting of 348M French words and 304M English words, with the English (source) sentences reversed during training. This let the model retain much more of the context between the words in a sequence. That context, however, was not very effective for very long sequences, and the model was prone to hallucinations.
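The reversal trick is pure data preparation, so it can be sketched without any neural-network code. The idea from the Sutskever et al. paper is that reversing the source puts the first source words right next to the first target words, shortening the longest dependencies the LSTM has to carry (the token names and `<eos>` marker below follow common seq2seq convention, not the paper's exact preprocessing scripts):

```python
# Sketch of the source-reversal trick from Sutskever et al. (2014):
# reverse only the source side of each training pair; the target keeps
# its natural order, terminated by an end-of-sequence marker.

def make_training_pair(src_sentence, tgt_sentence):
    src_tokens = src_sentence.split()
    tgt_tokens = tgt_sentence.split()
    return list(reversed(src_tokens)), tgt_tokens + ["<eos>"]

src, tgt = make_training_pair("the cat sleeps", "le chat dort")
print(src)  # ['sleeps', 'cat', 'the']
print(tgt)  # ['le', 'chat', 'dort', '<eos>']
```

After reversal, "the" (source) and "le" (target) are adjacent in the unrolled sequence, which is why the trick helped short-range credit assignment but did little for very long sentences.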
Google, in the years that followed, started to train a corpus of
NMT is born and christened GNMT (https://research.google/blog/a-neural-network-for-machine-translation-at-production-scale/)
Uses seq2seq LSTMs
Hallucinations in Neural Machine Translation
- https://arxiv.org/pdf/1609.08144
Oct 2016
"Three inherent weaknesses of Neural Machine Translation are responsible for this gap: its slower training and inference speed, ineffectiveness in dealing with rare words, and sometimes failure to translate all words in the source sentence"
- https://proceedings.mlr.press/v162/bansal22b/bansal22b.pdf
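GNMT's answer to the rare-word weakness was wordpiece segmentation: split rare words into frequent sub-units so the model never meets a truly unknown token. Below is a toy BPE-style sketch in that spirit (GNMT actually used a learned wordpiece model, not this exact merge procedure):

```python
from collections import Counter

def learn_merges(words, n_merges):
    """Learn the n most frequent adjacent-symbol merges from a word list."""
    vocab = [list(w) + ["</w>"] for w in words]  # start from characters
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in vocab:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in vocab:  # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

merges = learn_merges(["low", "lower", "lowest", "low"], n_merges=3)
print(merges)  # first merge is ('l', 'o')
```

A rare word like "lowest" then decomposes into known sub-units ("low" + "est"-like pieces) instead of becoming an out-of-vocabulary token, which is the mechanism the quote above is alluding to.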
Back translation for NMT is particularly sensitive to data quality
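The data-quality sensitivity of back-translation is easy to see in a sketch. Back-translation synthesizes extra training pairs by running monolingual target-language text through a reverse (target-to-source) model; `reverse_translate` below is a stand-in word table, where a real pipeline would use a trained NMT model:

```python
# Back-translation sketch: create synthetic (source, target) pairs from
# monolingual target-language text. `reverse_translate` is a hypothetical
# stand-in for a trained target->source model.

def reverse_translate(tgt_sentence):
    table = {"le": "the", "chat": "cat", "dort": "sleeps"}
    return " ".join(table.get(w, w) for w in tgt_sentence.split())

monolingual_target = ["le chat dort"]
synthetic_pairs = [(reverse_translate(t), t) for t in monolingual_target]
print(synthetic_pairs)  # [('the cat sleeps', 'le chat dort')]
```

If the reverse model is noisy, or the monolingual text is junk, every error flows straight into the training set as a plausible-looking sentence pair, which is why back-translation amplifies data-quality problems rather than averaging them away.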
Backdoor Attacks on Multilingual Machine Translation
Critical Dataset ethnography