Introducing BERT
What is BERT?
BERT is a method of pre-training language representations, meaning that we train a general-purpose "language understanding" model on a large text corpus (like Wikipedia), and then use that model for downstream NLP tasks that we care about (like question answering). BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP.
What makes BERT different?
BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (in this case, Wikipedia).
Why does this matter? Pre-trained representations can either be context-free or contextual, and contextual representations can further be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single word embedding representation for each word in the vocabulary. For example, the word “bank” would have the same context-free representation in “bank account” and “bank of the river.” Contextual models instead generate a representation of each word that is based on the other words in the sentence. For example, in the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the ... account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.
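The distinction can be made concrete with a toy sketch. It assumes a made-up two-dimensional embedding table and a hypothetical helper, `visible_context`, that lists the words a model may look at when representing one position. It illustrates the idea only and is not how BERT computes anything.

```python
from typing import Dict, List

# Toy context-free table: one fixed vector per vocabulary word (values are made up).
CONTEXT_FREE: Dict[str, List[float]] = {
    "bank": [0.2, 0.7],
    "account": [0.9, 0.1],
    "river": [0.1, 0.4],
}


def context_free(word: str, sentence: List[str]) -> List[float]:
    """Look the word up in the table; the surrounding sentence is ignored."""
    return CONTEXT_FREE[word]


def visible_context(sentence: List[str], index: int, bidirectional: bool) -> List[str]:
    """Return the words a model may use when representing sentence[index]."""
    left = sentence[:index]
    # A left-to-right model sees nothing after the current position.
    right = sentence[index + 1:] if bidirectional else []
    return left + right


sentence = ["I", "accessed", "the", "bank", "account"]
position = sentence.index("bank")

# Same vector for "bank" no matter which phrase it appears in.
print(context_free("bank", ["bank", "account"]) == context_free("bank", ["bank", "of", "the", "river"]))
print(visible_context(sentence, position, bidirectional=False))
print(visible_context(sentence, position, bidirectional=True))
```

The unidirectional call returns only "I accessed the", while the bidirectional call also includes "account", matching the example in the paragraph above.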
The Strength of Bidirectionality
If bidirectionality is so powerful, why hasn’t it been done before? To understand why, consider that unidirectional models are efficiently trained by predicting each word conditioned on the previous words in the sentence. However, it is not possible to train bidirectional models by simply conditioning each word on its previous and next words, since this would allow the word that’s being predicted to indirectly “see itself” in a multi-layer model.
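To solve this problem, we use the straightforward technique of masking out some of the words in the input and then conditioning on the remaining words, in both directions, to predict the masked ones. For example:
Input: The man went to the [MASK1] . He bought a [MASK2] of milk .
Labels: [MASK1] = store; [MASK2] = gallon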
While this idea has been around for a very long time, BERT is the first time it was successfully used to pre-train a deep neural network.
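A minimal sketch of the masking step, assuming whitespace-split tokens and a hypothetical helper named `mask_tokens`. It is simplified: the actual BERT pre-training masks about 15% of tokens and sometimes keeps the chosen token or swaps in a random one instead of always writing `[MASK]`.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"


def mask_tokens(
    tokens: List[str], mask_prob: float, rng: random.Random
) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random subset of tokens with [MASK] and record the originals."""
    if not tokens:
        return [], {}
    # Always mask at least one token so short inputs still produce a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels: Dict[int, str] = {}
    for pos in positions:
        labels[pos] = tokens[pos]  # what the model must predict at this position
        masked[pos] = MASK
    return masked, labels


tokens = "the man went to the store".split()
masked, labels = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
print(masked)
print(labels)
```

During training, the model sees `masked` as input and is scored only on the positions stored in `labels`.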
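BERT also learns to model relationships between sentences by pre-training on a very simple task that can be generated from any text corpus: given two sentences A and B, is B the actual next sentence that comes after A in the corpus, or just a random sentence? For example:
Sentence A: The man went to the store .
Sentence B: He bought a gallon of milk .
Label: IsNextSentence
Sentence A: The man went to the store .
Sentence B: Penguins are flightless .
Label: NotNextSentence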
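Generating such pairs from a corpus can be sketched as follows. The helper `make_nsp_pairs` and the four-sentence corpus are hypothetical, and the sketch draws the random sentence from the same list, whereas a real pipeline would usually sample it from a different document.

```python
import random
from typing import List, Tuple


def make_nsp_pairs(sentences: List[str], rng: random.Random) -> List[Tuple[str, str, bool]]:
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw a random negative")
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            # Positive example: B really follows A in the corpus.
            pairs.append((a, sentences[i + 1], True))
        else:
            # Negative example: any sentence other than A and its true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((a, rng.choice(candidates), False))
    return pairs


corpus = [
    "The man went to the store.",
    "He bought a gallon of milk.",
    "Then he walked home.",
    "Penguins are flightless.",
]
for a, b, is_next in make_nsp_pairs(corpus, random.Random(0)):
    print(a, "|", b, "|", "IsNextSentence" if is_next else "NotNextSentence")
```

The labels come for free from the ordering of the text, which is why no manual annotation is needed for this task.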
How I extended BERT for a chatbot
I took the already pre-trained BERT model and fine-tuned it on the SQuAD dataset.
The model was pre-trained for 40 epochs over a 3.3 billion word corpus, including BooksCorpus (800 million words) and English Wikipedia (2.5 billion words).
It was then fine-tuned on the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles.
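To give a feel for the fine-tuning data, here is a sketch of one flattened SQuAD-style record. In SQuAD, each answer carries its text plus an `answer_start` character offset into the context passage. The record below is invented for illustration and is not taken from the dataset, and the real JSON file nests records under articles and paragraphs.

```python
from typing import Dict, Tuple

context = "BERT is pre-trained on BooksCorpus and English Wikipedia."
answer_text = "BooksCorpus and English Wikipedia"

# Hypothetical record, flattened for readability.
record: Dict = {
    "context": context,
    "question": "What corpora is BERT pre-trained on?",
    "answers": [{"text": answer_text, "answer_start": context.index(answer_text)}],
}


def answer_span(rec: Dict) -> Tuple[int, int]:
    """Return (start, end) character offsets of the first answer, after checking them."""
    answer = rec["answers"][0]
    start = answer["answer_start"]
    end = start + len(answer["text"])
    if rec["context"][start:end] != answer["text"]:
        raise ValueError("answer_start does not match the answer text")
    return start, end


start, end = answer_span(record)
print(record["context"][start:end])
```

A question answering model fine-tuned on this kind of data learns to predict the start and end of the answer span inside the context.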
Deployment
I used Flask, a web framework for Python, to wrap the machine learning code in an API.
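A minimal sketch of what such a wrapper can look like. The route name `/answer`, the JSON fields, and the `answer_question` function are assumptions made for illustration, and the placeholder logic stands in for the call to the fine-tuned model. This is not the code I actually deployed.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def answer_question(question: str, context: str) -> str:
    """Hypothetical stand-in for the fine-tuned BERT model's prediction call."""
    # Placeholder logic only: return the first sentence of the context.
    return context.split(".")[0].strip()


@app.route("/answer", methods=["POST"])
def answer():
    # silent=True yields None instead of an error response on a malformed body.
    payload = request.get_json(silent=True) or {}
    question = payload.get("question")
    context = payload.get("context")
    if not question or not context:
        return jsonify({"error": "both 'question' and 'context' are required"}), 400
    return jsonify({"answer": answer_question(question, context)})
```

Calling `app.run()` starts Flask's built-in development server for local testing. A production deployment would normally sit behind a proper WSGI server and load the model once at startup instead of per request.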
Training and Maintenance
Google's BERT is released as a pre-trained model, so you do not have to pre-train it yourself.
You can still fine-tune it, as I did on the SQuAD dataset.
If you spend some time understanding the underlying code, you can customize it to better suit your domain and requirements, like we did.
Once the code is deployed, it needs to be monitored and evaluated continuously to identify where it can be improved.
No day-to-day training is required.
Infra spec
The pre-trained BERT model should run on the kind of infrastructure generally recommended for analytics use cases. The hardware Google recommends for fine-tuning, however, is on the higher end by non-Google standards.
(BERT is useful even without fine-tuning, but fine-tuning results in substantial accuracy improvements.)
As per Google:
Fine-tuning is inexpensive. All of the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU, starting from the exact same pre-trained model. SQuAD, for example, can be trained in around 30 minutes on a single Cloud TPU to achieve a Dev F1 score of 91.0%, which is the single system state-of-the-art.
All results on the paper were fine-tuned on a single Cloud TPU, which has 64GB of RAM. It is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB - 16GB of RAM, because the maximum batch size that can fit in memory is too small.
The fine-tuning examples which use BERT-Base should be able to run on a GPU that has at least 12GB of RAM using the hyperparameters given.
Most of the examples below assumes that you will be running training/evaluation on your local machine, using a GPU like a Titan X or GTX 1080.
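The Dev F1 figure quoted above is a token-overlap score between the predicted answer and the reference answer. The sketch below is a simplified version of that calculation, assuming the usual normalization (lowercasing, dropping punctuation and the articles "a", "an", "the"). It is not the official evaluation script, and the function names are my own.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall between two answers."""
    pred, gold = normalize(prediction), normalize(truth)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the store", "Store"))
print(token_f1("a gallon of milk", "gallon"))
```

The dataset-level score is the average of this value over all questions, taking the best match when a question has several reference answers.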
References I used for my learning and some content
https://www.slideshare.net/shauryauppal/nlp-state-of-the-art-bert
https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
https://marwandebbiche.com/posts/serving-ml-models-using-flask/
https://towardsdatascience.com/flask-and-heroku-for-online-machine-learning-deployment-425beb54a274