Around 23:00 Beijing time on October 31, the official BERT code was released:
https://github.com/google-research/bert
For a walkthrough of the original paper and of the PyTorch port, see: https://daiwk.github.io/posts/nlp-bert.html
See also the Synced (机器之心) article 《谷歌终于开源BERT代码:3亿参数量,机器之心全面解读》 (Google finally open-sources the BERT code: 300M parameters, a comprehensive analysis).
The code structure:
`-- bert
|-- CONTRIBUTING.md
|-- create_pretraining_data.py
|-- extract_features.py
|-- __init__.py
|-- LICENSE
|-- modeling.py
|-- modeling_test.py
|-- optimization.py
|-- optimization_test.py
|-- README.md
|-- run_classifier.py
|-- run_pretraining.py
|-- run_squad.py
|-- sample_text.txt
|-- tokenization.py
`-- tokenization_test.py
1 directory, 16 files
Several pretrained versions are available (the difference is whether case is preserved before WordPiece tokenization: cased keeps the original casing, uncased lowercases everything first).
Each zip contains three things: the TensorFlow checkpoint (the bert_model.ckpt.* files), the WordPiece vocabulary (vocab.txt), and the model config (bert_config.json). For example (a quick way to inspect these files is sketched right after the listing):
uncased_L-12_H-768_A-12
|-- bert_config.json
|-- bert_model.ckpt.data-00000-of-00001
|-- bert_model.ckpt.index
|-- bert_model.ckpt.meta
|-- checkpoint
`-- vocab.txt
0 directories, 6 files
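As a sanity check, the contents of such a directory can be inspected with a few lines of Python. This is a minimal sketch, assuming TensorFlow 1.x (as the official repo requires) and that the zip above has been extracted into the working directory:
import json
import tensorflow as tf  # TF 1.x, as the official repo requires

model_dir = "uncased_L-12_H-768_A-12"

# bert_config.json holds the model hyperparameters
with open(model_dir + "/bert_config.json") as f:
    config = json.load(f)
print(config["num_hidden_layers"], config["hidden_size"], config["num_attention_heads"])

# bert_model.ckpt.* together form one TensorFlow checkpoint; list its variables
for name, shape in tf.train.list_variables(model_dir + "/bert_model.ckpt"):
    print(name, shape)

# vocab.txt is the WordPiece vocabulary, one token per line
with open(model_dir + "/vocab.txt", encoding="utf-8") as f:
    print(sum(1 for _ in f), "WordPiece tokens")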
To download the GLUE data, use the script at https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e and run the command below (note: it must be run with Python 3!). From behind the GFW, though, the download seems to fail no matter what.
python download_glue_data.py --data_dir glue_data --tasks all
Official docs: https://github.com/nyu-mll/GLUE-baselines
If you are in mainland China, first clone https://github.com/wasiahmad/paraphrase_identification and then run:
python download_glue_data.py --data_dir glue_data --tasks all --path_to_mrpc=paraphrase_identification/dataset/msr-paraphrase-corpus
Note that if you want to use GloVe, the 840B zip downloaded from https://nlp.stanford.edu/projects/glove/ is over 2 GB and plain unzip cannot extract it. You can use:
7z x glove.840B.300d.zip
which handles it without any fuss:
7z x glove.840B.300d.zip
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US,Utf16=on,HugeFiles=on,64 bits,56 CPUs x64)
Scanning the drive for archives:
1 file, 2176768927 bytes (2076 MiB)
Extracting archive: glove.840B.300d.zip
--
Path = glove.840B.300d.zip
Type = zip
Physical Size = 2176768927
64-bit = +
Everything is Ok
Size: 5646236541
Compressed: 2176768927
Following https://github.com/nyu-mll/GLUE-baselines, install allennlp==0.7.0 and torch>=0.4.1 and you can run the baselines on the GLUE datasets:
py=/home/xxx/python-3-tf-cpu/bin/python3.6
alias superhead='/opt/compiler/gcc-4.8.2/lib/ld-linux-x86-64.so.2 --library-path /opt/compiler/gcc-4.8.2/lib:$LD_LIBRARY_PATH '
alias python='superhead $py'
python main.py \
--exp_dir EXP_DIR \
--run_dir RUN_DIR \
--train_tasks all \
--cove 0 \
--cuda -1 \
--eval_tasks all \
--glove 1 \
--word_embs_file ./emb_dir/glove.840B.300d.txt
To fine-tune BERT on MRPC with run_classifier.py:
export BERT_BASE_DIR=/path/to/bert/uncased_L-12_H-768_A-12
export GLUE_DIR=/path/to/glue
python run_classifier.py \
--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=$GLUE_DIR/MRPC \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=/tmp/mrpc_output/
Output:
***** Eval results *****
eval_accuracy = 0.845588
eval_loss = 0.505248
global_step = 343
loss = 0.505248
This means 84.55% accuracy on the dev set. For a small dataset like MRPC (one of the GLUE datasets), even when starting from the pretrained checkpoint, dev-set accuracy can vary a lot from run to run (repeated runs may land anywhere between 84% and 88%).
The data-generation code used for the paper was written in C++; the repo reimplements it in Python as create_pretraining_data.py, covering both masked LM and next sentence prediction.
Input file format: one sentence per line (this matters for next sentence prediction), with a blank line separating documents. For example, sample_text.txt:
Something glittered in the nearest red pool before him.
Gold, surely!
But, wonderful to relate, not an irregular, shapeless fragment of crude ore, fresh from Nature's crucible, but a bit of jeweler's handicraft in the form of a plain gold ring.
Looking at it more attentively, he saw that it bore the inscription, "May to Cass."
Like most of his fellow gold-seekers, Cass was superstitious.
The fountain of classic wisdom, Hypatia herself.
As the ancient sage--the name is unimportant to a monk--pumped water nightly that he might study by day, so I, the guardian of cloaks and parasols, at the sacred doors of her lecture-room, imbibe celestial knowledge.
From my youth I felt in me a soul above the matter-entangled herd.
She revealed to me the glorious fact, that I am a spark of Divinity itself.
The output is a series of tf.train.Example protos serialized into TFRecord files.
Note: this script keeps the whole input file in memory, so for a large corpus you may need to split it into shards, run the script once per shard to get a pile of tf_examples.tf_record* files, and then pass all of them as input to the next script, run_pretraining.py.
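A minimal sketch of such sharding (not part of the repo; the file names and shard size are assumptions): it splits the corpus only at blank-line document boundaries, so each document stays intact for next sentence prediction, and each shard can then be fed to create_pretraining_data.py separately.
# Hypothetical helper: split big_corpus.txt into corpus_shard_0000.txt, corpus_shard_0001.txt, ...
def shard_corpus(path, docs_per_shard=10000, prefix="corpus_shard"):
    shard_id, docs_in_shard = 0, 0
    out = open("%s_%04d.txt" % (prefix, shard_id), "w", encoding="utf-8")
    with open(path, encoding="utf-8") as f:
        for line in f:
            out.write(line)
            if not line.strip():               # blank line = end of a document
                docs_in_shard += 1
                if docs_in_shard >= docs_per_shard:
                    out.close()
                    shard_id, docs_in_shard = shard_id + 1, 0
                    out = open("%s_%04d.txt" % (prefix, shard_id), "w", encoding="utf-8")
    out.close()

shard_corpus("big_corpus.txt")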
Parameters: max_predictions_per_seq should be set to roughly max_seq_length * masked_lm_prob (the script does not derive this automatically; see the short calculation after the command).
python create_pretraining_data.py \
--input_file=./sample_text.txt \
--output_file=/tmp/tf_examples.tfrecord \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--do_lower_case=True \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--masked_lm_prob=0.15 \
--random_seed=12345 \
--dupe_factor=5
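For the flags above, the rule of thumb works out as follows (just the arithmetic, spelled out for clarity):
max_seq_length = 128
masked_lm_prob = 0.15
print(max_seq_length * masked_lm_prob)  # 19.2, so max_predictions_per_seq=20 is a sensible choice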
The output looks like this:
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] indeed , it was recorded in [MASK] star that a fortunate early [MASK] ##r had once picked up on the highway a solid chunk [MASK] gold quartz which the [MASK] had freed from its inc [MASK] ##ing soil , and washed into immediate and [MASK] popularity . [SEP] rainy season , [MASK] insult show habit of body , and seldom lifted their eyes to the rift ##ed [MASK] india - ink washed skies [MASK] them . " cass " beard [MASK] elliot early that morning , but not with a view to [MASK] . a leak in his [MASK] roof , - - quite [MASK] with his careless , imp ##rov ##ide ##nt habits , - - had rouse ##d him at 4 a [MASK] m [SEP]
INFO:tensorflow:input_ids: 101 5262 1010 2009 2001 2680 1999 103 2732 2008 1037 19590 2220 103 2099 2018 2320 3856 2039 2006 1996 3307 1037 5024 20000 103 2751 20971 2029 1996 103 2018 10650 2013 2049 4297 103 2075 5800 1010 1998 8871 2046 6234 1998 103 6217 1012 102 16373 2161 1010 103 15301 2265 10427 1997 2303 1010 1998 15839 4196 2037 2159 2000 1996 16931 2098 103 2634 1011 10710 8871 15717 103 2068 1012 1000 16220 1000 10154 103 11759 2220 2008 2851 1010 2021 2025 2007 1037 3193 2000 103 1012 1037 17271 1999 2010 103 4412 1010 1011 1011 3243 103 2007 2010 23358 1010 17727 12298 5178 3372 14243 1010 1011 1011 2018 27384 2094 2032 2012 1018 1037 103 1049 102
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
INFO:tensorflow:masked_lm_positions: 7 12 13 25 30 36 45 52 53 54 68 74 81 82 93 99 103 105 125 0
INFO:tensorflow:masked_lm_ids: 17162 2220 4125 1997 4542 29440 20332 4233 1037 16465 2030 2682 2018 13763 5456 6644 1011 8335 1012 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
INFO:tensorflow:next_sentence_labels: 0
INFO:tensorflow:*** Example ***
INFO:tensorflow:tokens: [CLS] and there burst on phil ##am ##mon ' s astonished eyes a vast semi ##ci ##rcle of blue sea [MASK] ring ##ed with palaces and towers [MASK] [SEP] like most of [MASK] fellow gold - seekers , cass was super ##sti [MASK] . [SEP]
INFO:tensorflow:input_ids: 101 1998 2045 6532 2006 6316 3286 8202 1005 1055 22741 2159 1037 6565 4100 6895 21769 1997 2630 2712 103 3614 2098 2007 22763 1998 7626 103 102 2066 2087 1997 103 3507 2751 1011 24071 1010 16220 2001 3565 16643 103 1012 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:input_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_positions: 10 20 23 27 32 39 42 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_ids: 22741 1010 2007 1012 2010 2001 20771 0 0 0 0 0 0 0 0 0 0 0 0 0
INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
INFO:tensorflow:next_sentence_labels: 1
INFO:tensorflow:Wrote 60 total instances
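To double-check what was written, the TFRecord can be read back with a few lines of Python. A sketch assuming TF 1.x; the feature names are the ones shown in the log above (input_ids, input_mask, segment_ids, masked_lm_positions, ...):
import tensorflow as tf  # TF 1.x

for serialized in tf.python_io.tf_record_iterator("/tmp/tf_examples.tfrecord"):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    for name, feature in example.features.feature.items():
        # each feature is either an int64 list or a float list
        values = feature.int64_list.value or feature.float_list.value
        print(name, list(values)[:10], "...")
    break  # only inspect the first record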
Pretraining on the generated TFRecords is then run with run_pretraining.py:
python run_pretraining.py \
--input_file=/tmp/tf_examples.tfrecord \
--output_dir=/tmp/pretraining_output \
--do_train=True \
--do_eval=True \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--train_batch_size=32 \
--max_seq_length=128 \
--max_predictions_per_seq=20 \
--num_train_steps=20 \
--num_warmup_steps=10 \
--learning_rate=2e-5
While running it makes heavy use of GPU memory (the details aren't entirely clear; with too little GPU memory it probably won't run at all). Since sample_text.txt is tiny, the model will overfit. The log is below (at the end an eval_results.txt file is written containing the ***** Eval results ***** section):
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [20/100]
INFO:tensorflow:Evaluation [30/100]
INFO:tensorflow:Evaluation [40/100]
INFO:tensorflow:Evaluation [50/100]
INFO:tensorflow:Evaluation [60/100]
INFO:tensorflow:Evaluation [70/100]
INFO:tensorflow:Evaluation [80/100]
INFO:tensorflow:Evaluation [90/100]
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2018-10-31-18:13:12
INFO:tensorflow:Saving dict for global step 20: global_step = 20, loss = 0.27842212, masked_lm_accuracy = 0.94665253, masked_lm_loss = 0.27976906, next_sentence_accuracy = 1.0, next_sentence_loss = 0.0002133457
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 20: ./pretraining_output/model.ckpt-20
INFO:tensorflow:***** Eval results *****
INFO:tensorflow: global_step = 20
INFO:tensorflow: loss = 0.27842212
INFO:tensorflow: masked_lm_accuracy = 0.94665253
INFO:tensorflow: masked_lm_loss = 0.27976906
INFO:tensorflow: next_sentence_accuracy = 1.0
INFO:tensorflow: next_sentence_loss = 0.0002133457
You can also look at the corresponding TensorBoard. It is fairly laggy, presumably because the model is large; screenshot below:
There is also a projector view, shown below:
On the left you can pick which layer of which model to visualize; then, in the plot in the middle, you can select a point, and the panel on the right shows its n nearest neighbors, with either cosine or Euclidean distance as the metric.
The input file input.txt has one example per line, either a sentence pair:
sentence A ||| sentence B
or a single sentence with no delimiter:
sentence A
Then run:
python extract_features.py \
--input_file=input.txt \
--output_file=/tmp/output.json \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--layers=-1,-2,-3,-4 \
--max_seq_length=128 \
--batch_size=8
For example, if the input is 『大家』 (two Chinese characters meaning "everyone"), the resulting output.json looks like the following, where "linex_index" is the index of the input line (a sketch of parsing this output follows the JSON):
{
  "linex_index": 0,
  "features": [{
    "token": "[CLS]",
    "layers": [{
      "index": -1,
      "values": [1.507966, -0.155272, 0.108119, ..., 0.111]
    }, {
      "index": -2,
      "values": [1.39443, 0.307064, 0.483496, ..., 0.332]
    }, {
      "index": -3,
      "values": [0.961682, 0.757408, 0.720898, ..., 0.332]
    }, {
      "index": -4,
      "values": [-0.275457, 0.632056, 1.063737, ..., 0.332]
    }]
  }, {
    "token": "大",
    "layers": [{
      "index": -1,
      "values": [0.326004, -0.313136, 0.233399, ..., 0.111]
    }, {
      "index": -2,
      "values": [0.795364, 0.361322, -0.116774, ..., 0.332]
    }, {
      "index": -3,
      "values": [0.807957, 0.206743, -0.359639, ..., 0.332]
    }, {
      "index": -4,
      "values": [-0.226106, -0.129655, -0.128466, ..., 0.332]
    }]
  }, {
    "token": "家",
    "layers": [{
      "index": -1,
      "values": [1.768678, -0.814265, 0.016321, ..., 0.111]
    }, {
      "index": -2,
      "values": [1.76887, -0.020193, 0.44832, 0.193271, ..., 0.332]
    }, {
      "index": -3,
      "values": [1.695086, 0.050979, 0.188321, -0.537057, ..., 0.332]
    }, {
      "index": -4,
      "values": [0.745073, -0.09894, 0.166217, -1.045382, ..., 0.332]
    }]
  }, {
    "token": "[SEP]",
    "layers": [{
      "index": -1,
      "values": [0.881939, -0.34753, 0.210375, ..., 0.111]
    }, {
      "index": -2,
      "values": [-0.047698, -0.030813, 0.041558, ..., 0.332]
    }, {
      "index": -3,
      "values": [-0.049113, -0.067705, 0.018293, ..., 0.332]
    }, {
      "index": -4,
      "values": [0.000215, -0.057331, -3.2e-05, ..., 0.332]
    }]
  }]
}
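output.json is written as JSON Lines: one JSON object per input line. A minimal sketch (not part of the repo) of turning it into one vector per token, here simply by summing the four extracted layers:
import json
import numpy as np

with open("/tmp/output.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)                      # one object per input line
        for feat in record["features"]:
            layers = [np.array(layer["values"]) for layer in feat["layers"]]
            vec = np.sum(layers, axis=0)               # sum of layers -1..-4
            print(record["linex_index"], feat["token"], vec.shape)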
Using the vocab from the pretrained Chinese model, shrink the network and train on a single CPU machine on a corpus of 1.9 million Chinese sentences (still with the default WordPiece tokenization). Each sentence is treated as its own document, with the sentence itself as sentence 2 and the sentence's tag as sentence 1; a sketch of this data preparation follows.
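A minimal sketch of the data preparation (the input file name and the tab-separated "tag<TAB>sentence" layout are assumptions): each pair is written as a two-line "document" in the format create_pretraining_data.py expects, tag first (sentence 1), then the sentence (sentence 2), then a blank line.
with open("tagged_corpus.txt", encoding="utf-8") as fin, \
     open("pretrain_corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        tag, sentence = line.rstrip("\n").split("\t", 1)
        fout.write(tag + "\n")       # sentence 1: the tag
        fout.write(sentence + "\n")  # sentence 2: the sentence itself
        fout.write("\n")             # blank line ends this one-sentence "document"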
The model config is as follows (a sketch that builds this model and counts its parameters comes right after the JSON):
{
"attention_probs_dropout_prob": 0.1,
"directionality": "bidi",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 64,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 8,
"num_hidden_layers": 2,
"pooler_fc_size": 64,
"pooler_num_attention_heads": 12,
"pooler_num_fc_layers": 3,
"pooler_size_per_head": 32,
"pooler_type": "first_token_transform",
"type_vocab_size": 2,
"vocab_size": 21128
}
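A minimal sketch (assuming TF 1.x and the repo's modeling.py on the PYTHONPATH; the config file name is made up) that builds this shrunk model and counts its trainable parameters:
import numpy as np
import tensorflow as tf
import modeling  # from the bert repo

config = modeling.BertConfig.from_json_file("small_bert_config.json")  # the JSON above
input_ids = tf.zeros([8, 128], dtype=tf.int32)  # dummy batch, just to build the graph
model = modeling.BertModel(config=config, is_training=False, input_ids=input_ids)

n_params = sum(int(np.prod(v.shape.as_list())) for v in tf.trainable_variables())
print("trainable parameters:", n_params)
print("pooled output shape:", model.get_pooled_output().shape)  # (8, hidden_size)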
The parameter settings are as follows:
## g_max_predictions_per_seq approx_to g_max_seq_length * g_masked_lm_prob
# online or offline
export train_mode=offline
export param_name=param1
export g_train_batch_size=128
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=20
export g_masked_lm_prob=0.15
export g_dupe_factor=3
sh -x scripts/run_train_bert.sh > log/$param_name.log &
# online or offline
export train_mode=offline
export param_name=param2
export g_train_batch_size=64
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=20
export g_masked_lm_prob=0.15
export g_dupe_factor=3
sh -x scripts/run_train_bert.sh > log/$param_name.log &
# online or offline
export train_mode=offline
export param_name=param3
export g_train_batch_size=128
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=8
export g_masked_lm_prob=0.05
export g_dupe_factor=5
sh -x scripts/run_train_bert.sh > log/$param_name.log &
# online or offline
export train_mode=offline
export param_name=param4
export g_train_batch_size=64
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=8
export g_masked_lm_prob=0.05
export g_dupe_factor=5
sh -x scripts/run_train_bert.sh > log/$param_name.log &
# online or offline
export train_mode=offline
export param_name=param5
export g_train_batch_size=32
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=20
export g_masked_lm_prob=0.15
export g_dupe_factor=3
sh -x scripts/run_train_bert.sh > log/$param_name.log &
# online or offline
export train_mode=offline
export param_name=param6
export g_train_batch_size=32
export g_num_train_steps=10000
export g_max_seq_length=128
export g_max_predictions_per_seq=8
export g_masked_lm_prob=0.05
export g_dupe_factor=5
sh -x scripts/run_train_bert.sh > log/$param_name.log &
wait
After 10k training steps the results are as follows (I forget which configuration the 20k-step run in the figure used…):
As you can see, for the same 10k steps, param1 took the longest to train but reached the lowest loss.
Examples per second:
Global steps per second:
Next-sentence accuracy at eval time:
Masked LM accuracy at eval time is, well, rather underwhelming:
We noticed that there is no tf.summary-related code anywhere in the repo, and yet TensorBoard still works… That is because TPUEstimator is used: "The TPUEstimator API does not support custom summaries for TensorBoard. However, basic summaries are automatically recorded to event files in the model directory."
https://cloud.google.com/tpu/docs/tutorials/migrating-to-tpuestimator-api