NLP4Science/tensorflow-nlp
	Code has been run on Google Colab; thanks to Google for providing the computational resources.

Contents

  • Natural Language Processing(自然语言处理)

    • Text Classification(文本分类)

      • IMDB(English Data)

         Abstract:
         
         1. we show that a classic ML model (TF-IDF + logistic regression) reaches 89.6% accuracy,

            which is decent given its simplicity, efficiency and low cost

         2. we show that a FastText model reaches 90% accuracy

         3. we show that a CNN-based model improves the accuracy to 91.7%

         4. we show that an RNN-based model improves the accuracy to 92.6%

         5. we show that a pretrained model (BERT) improves the accuracy to 94%

         6. we show that a pretrained model (ALBERT) improves the accuracy to 94.7%

         7. we use back-translation, label smoothing and a cyclical learning rate as training helpers
        
    • Text Matching(文本匹配)

      • SNLI(English Data)

         Abstract:
        
         1. we show that DAM (attention-based, with many interaction operations) reaches 85.3% accuracy

         2. we show that Pyramid (RNN + image-processing-style matching) improves the accuracy to 87.1%

         3. we show that ESIM (RNN + many interaction operations) improves the accuracy to 87.4%

         4. we show that RE2 (RNN + many interaction operations + residual connections) improves the accuracy to 88.3%

         5. we show that BERT (pretrained model) improves the accuracy to 90.4%

         6. we show that RoBERTa (pretrained model) improves the accuracy to 91.1%

         7. we use label smoothing and a cyclical learning rate as training helpers
        
      • WeBank Intelligent Customer Service(微众银行智能客服, Chinese Data)

         Abstract:
        
         1. we show that ESIM, Pyramid and RE2 reach 82.5% ~ 82.9% accuracy (very close to each other)

         2. we show that RE2 can be improved to 83.8% by using a cyclical learning rate and label smoothing

         3. we show that BERT (pretrained model) further improves the accuracy to 84.75%

         Reflection:

            we process text at the character level and do not use word-boundary information,

            because word segmentation can introduce segmentation errors (although it yields shorter sequences);

            BERT implicitly captures word-boundary information during pretraining,

            which is perhaps one reason for its empirical improvement
        
    • Spoken Language Understanding(对话理解)

      • ATIS(English Data)
    • Generative Dialog(生成式对话)

      • Qingyun Corpus(青云语料, Chinese Data)
    • Multi-turn Dialogue Rewriting(多轮对话改写)

      • 20k Tencent AI R&D Data(20k 腾讯 AI 研发数据, Chinese Data)

         Highlight:
         
         1. our implementation of an RNN-based pointer network reaches 60% exact match without BERT,

            which is higher than other implementations that use BERT,

            e.g. https://github.com/liu-nlper/dialogue-utterance-rewriter (57.5% exact match)

         2. we show how to deploy the model in a Java environment

         3. we show that this task can be decomposed into two stages (sequence tagging & generation),

            and we report the performance of the sequence tagging stage: 79.6% recall and 78.7% precision
        
    • Semantic Parsing(语义解析)

      • Facebook AI Research Data(English Data)

         Highlight:
         
         our implementation of a pointer-generator reaches 80% exact match,

         which is higher than all results in the original paper, including RNNG

         (https://aclweb.org/anthology/D18-1300)
        
    • Question Answering(问题回答)

      • bAbI(English Data)
    • Text Processing Tools(文本处理工具)

  • Knowledge Graph(知识图谱)

  • Recommender System(推荐系统)

    • Movielens 1M(English Data)

Text Classification

└── finch/tensorflow2/text_classification/imdb
	│
	├── data
	│   └── glove.840B.300d.txt          # pretrained embedding, download and put here
	│   └── make_data.ipynb              # step 1. make data and vocab: train.txt, test.txt, word.txt
	│   └── train.txt  		     # incomplete sample, format <label, text> separated by \t 
	│   └── test.txt   		     # incomplete sample, format <label, text> separated by \t
	│   └── train_bt_part1.txt  	     # (back-translated) incomplete sample, format <label, text> separated by \t
	│
	├── vocab
	│   └── word.txt                     # incomplete sample, list of words in vocabulary
	│	
	└── main              
		└── attention_linear.ipynb   # step 2: train and evaluate model
		└── attention_conv.ipynb     # step 2: train and evaluate model
		└── fasttext_unigram.ipynb   # step 2: train and evaluate model
		└── fasttext_bigram.ipynb    # step 2: train and evaluate model
		└── sliced_rnn.ipynb         # step 2: train and evaluate model
		└── sliced_rnn_bt.ipynb      # step 2: train and evaluate model
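
The contents section above reports an 89.6% baseline from TF-IDF + logistic regression on IMDB. A minimal sketch of such a baseline follows, assuming train.txt / test.txt use the <label, text> tab-separated format described in the tree; scikit-learn and the relative paths are used here purely for illustration and are not part of the repository's notebooks.

```python
# Minimal TF-IDF + logistic regression baseline (illustrative sketch, not the repo's notebook).
# Assumes each line of train.txt / test.txt is "<label>\t<text>" as described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def load(path):
    labels, texts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
    return labels, texts

y_train, x_train = load("../data/train.txt")
y_test, x_test = load("../data/test.txt")

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, sublinear_tf=True)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(x_train), y_train)

print("test accuracy:", accuracy_score(y_test, clf.predict(vectorizer.transform(x_test))))
```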

Text Matching

└── finch/tensorflow2/text_matching/snli
	│
	├── data
	│   └── glove.840B.300d.txt       # pretrained embedding, download and put here
	│   └── download_data.ipynb       # step 1. run this to download snli dataset
	│   └── make_data.ipynb           # step 2. run this to generate train.txt, test.txt, word.txt 
	│   └── train.txt  		  # incomplete sample, format <label, text1, text2> separated by \t 
	│   └── test.txt   		  # incomplete sample, format <label, text1, text2> separated by \t
	│
	├── vocab
	│   └── word.txt                  # incomplete sample, list of words in vocabulary
	│	
	└── main              
		└── dam.ipynb      	  # step 3. train and evaluate model
		└── esim.ipynb      	  # step 3. train and evaluate model
		└── ......
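
The SNLI abstract in the contents section lists label smoothing and a cyclical learning rate as training helpers. A minimal TensorFlow 2 sketch of both helpers is below; the smoothing factor and schedule parameters are illustrative assumptions, not the repository's settings.

```python
import tensorflow as tf

# Label smoothing comes built into the Keras cross-entropy loss.
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)

# A simple triangular cyclical learning rate (Smith-style); values are placeholders.
class CyclicalLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=1e-4, max_lr=8e-4, step_size=2000):
        super().__init__()
        self.base_lr, self.max_lr, self.step_size = base_lr, max_lr, step_size

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        cycle = tf.floor(1.0 + step / (2.0 * self.step_size))
        x = tf.abs(step / self.step_size - 2.0 * cycle + 1.0)
        return self.base_lr + (self.max_lr - self.base_lr) * tf.maximum(0.0, 1.0 - x)

optimizer = tf.keras.optimizers.Adam(learning_rate=CyclicalLR())
```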

└── finch/tensorflow2/text_matching/chinese
	│
	├── data
	│   └── make_data.ipynb           # step 1. run this to generate char.txt and char.npy
	│   └── train.csv  		  # incomplete sample, format <text1, text2, label> separated by comma 
	│   └── test.csv   		  # incomplete sample, format <text1, text2, label> separated by comma
	│
	├── vocab
	│   └── cc.zh.300.vec             # pretrained embedding, download and put here
	│   └── char.txt                  # incomplete sample, list of chinese characters
	│   └── char.npy                  # saved pretrained embedding matrix for this task
	│	
	└── main              
		└── pyramid.ipynb      	  # step 2. train and evaluate model
		└── esim.ipynb      	  # step 2. train and evaluate model
		└── ......
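
The tree comments say char.npy stores the pretrained embedding matrix for this task. A sketch of how such a matrix could be assembled from char.txt and the fastText cc.zh.300.vec file is below; the paths and the random fallback for unseen characters are assumptions for illustration.

```python
import numpy as np

# Character vocabulary: one character per line, as in vocab/char.txt.
with open("../vocab/char.txt", encoding="utf-8") as f:
    chars = [line.rstrip("\n") for line in f]

# fastText .vec format: first line is "<vocab_size> <dim>", then "<token> <300 floats>" per line.
vectors = {}
with open("../vocab/cc.zh.300.vec", encoding="utf-8") as f:
    next(f)  # skip the header line
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Build the matrix row by row; characters missing from fastText get a small random vector.
embedding = np.zeros((len(chars), 300), dtype=np.float32)
for i, ch in enumerate(chars):
    embedding[i] = vectors.get(ch, np.random.uniform(-0.1, 0.1, 300))

np.save("../vocab/char.npy", embedding)
```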

Spoken Language Understanding

└── finch/tensorflow2/spoken_language_understanding/atis
	│
	├── data
	│   └── glove.840B.300d.txt           # pretrained embedding, download and put here
	│   └── make_data.ipynb               # step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── atis.train.w-intent.iob       # incomplete sample, format <text, slot, intent>
	│   └── atis.test.w-intent.iob        # incomplete sample, format <text, slot, intent>
	│
	├── vocab
	│   └── word.txt                      # list of words in vocabulary
	│   └── intent.txt                    # list of intents in vocabulary
	│   └── slot.txt                      # list of slots in vocabulary
	│	
	└── main              
		└── bigru.ipynb               # step 2. train and evaluate model
		└── bigru_self_attn.ipynb     # step 2. train and evaluate model
		└── transformer.ipynb         # step 2. train and evaluate model
		└── transformer_elu.ipynb     # step 2. train and evaluate model
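
The bigru.ipynb family trains a joint model: one intent label per utterance plus one slot label per token, matching the <text, slot, intent> format above. A minimal TensorFlow 2 sketch of that joint structure follows; the vocabulary sizes and layer widths are placeholders rather than the repository's values, and padding/masking is omitted for brevity.

```python
import tensorflow as tf

VOCAB_SIZE, NUM_INTENTS, NUM_SLOTS = 750, 23, 122   # placeholder sizes, not the repo's values

words = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(VOCAB_SIZE, 300)(words)
x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(128, return_sequences=True))(x)

# One slot label per token, one intent label per utterance.
slot_logits = tf.keras.layers.Dense(NUM_SLOTS, name="slots")(x)
intent_logits = tf.keras.layers.Dense(NUM_INTENTS, name="intent")(
    tf.keras.layers.GlobalMaxPooling1D()(x))

model = tf.keras.Model(words, [slot_logits, intent_logits])
model.compile(
    optimizer="adam",
    loss={"slots": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          "intent": tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)})
model.summary()
```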

Generative Dialog

└── finch/tensorflow1/free_chat/chinese_qingyun
	│
	├── data
	│   └── raw_data.csv           		# raw data downloaded from external
	│   └── make_data.ipynb           	# step 1. run this to generate vocab {char.txt} and data {train.txt & test.txt}
	│   └── train.txt           		# processed text file generated by {make_data.ipynb}
	│
	├── vocab
	│   └── char.txt                	# list of chars in vocabulary for chinese
	│   └── cc.zh.300.vec			# fastText pretrained embedding downloaded from external
	│   └── char.npy			# chinese characters and their embedding values (300 dim)	
	│	
	└── main
		└── lstm_seq2seq_train.ipynb    # step 2. train and evaluate model
		└── lstm_seq2seq_export.ipynb   # step 3. export model
		└── lstm_seq2seq_infer.ipynb    # step 4. model inference
		└── transformer_train.ipynb     # step 2. train and evaluate model
		└── transformer_export.ipynb    # step 3. export model
		└── transformer_infer.ipynb     # step 4. model inference
  • Task: Qingyun Corpus(青云语料, Chinese Data)

      Training Data: 107687, Testing Data: 3350
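
The main folder above splits the workflow into training, export, and inference notebooks. The notebooks here are TensorFlow 1.x; purely to illustrate the export-then-reload pattern behind steps 3 and 4, a TensorFlow 2 tf.Module round trip looks roughly like this (the module and paths are placeholders, not the repository's export code).

```python
import tensorflow as tf

class ExportableModel(tf.Module):
    """Placeholder standing in for the trained seq2seq (illustration only)."""
    def __init__(self):
        super().__init__()
        self.embedding = tf.Variable(tf.random.normal([5000, 300]))

    @tf.function(input_signature=[tf.TensorSpec([None, None], tf.int32)])
    def infer(self, token_ids):
        # A real model would run the encoder and decoder here; we just embed and pool.
        return tf.reduce_mean(tf.nn.embedding_lookup(self.embedding, token_ids), axis=1)

tf.saved_model.save(ExportableModel(), "export/1")   # export step (*_export.ipynb plays this role)
reloaded = tf.saved_model.load("export/1")           # inference step (*_infer.ipynb plays this role)
print(reloaded.infer(tf.constant([[5, 17, 42]])).shape)
```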
    

Semantic Parsing

└── finch/tensorflow2/semantic_parsing/tree_slu
	│
	├── data
	│   └── glove.840B.300d.txt     	# pretrained embedding, download and put here
	│   └── make_data.ipynb           	# step 1. run this to generate vocab: word.txt, intent.txt, slot.txt 
	│   └── train.tsv   		  	# incomplete sample, format <text, tokenized_text, tree>
	│   └── test.tsv    		  	# incomplete sample, format <text, tokenized_text, tree>
	│
	├── vocab
	│   └── source.txt                	# list of words in vocabulary for source (of seq2seq)
	│   └── target.txt                	# list of words in vocabulary for target (of seq2seq)
	│	
	└── main
		└── lstm_seq2seq_tf_addons.ipynb           # step 2. train and evaluate model
		└── ......
		

Knowledge Graph Inference

└── finch/tensorflow2/knowledge_graph_completion/wn18
	│
	├── data
	│   └── download_data.ipynb       	# step 1. run this to download wn18 dataset
	│   └── make_data.ipynb           	# step 2. run this to generate vocabulary: entity.txt, relation.txt
	│   └── wn18  		          	# wn18 folder (will be auto created by download_data.ipynb)
	│   	└── train.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│   	└── valid.txt  		  	# incomplete sample, format <entity1, relation, entity2> separated by \t 
	│   	└── test.txt   		  	# incomplete sample, format <entity1, relation, entity2> separated by \t
	│
	├── vocab
	│   └── entity.txt                  	# incomplete sample, list of entities in vocabulary
	│   └── relation.txt                	# incomplete sample, list of relations in vocabulary
	│	
	└── main              
		└── distmult_1-N.ipynb    	# step 3. train and evaluate model
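
distmult_1-N.ipynb trains DistMult with 1-N scoring: each (head, relation) pair is scored against every entity at once. A minimal sketch of that scoring step is below; WN18-like sizes are used as placeholders.

```python
import tensorflow as tf

NUM_ENTITIES, NUM_RELATIONS, DIM = 40943, 18, 200   # WN18-like sizes, used as placeholders

entity_emb = tf.Variable(tf.random.normal([NUM_ENTITIES, DIM], stddev=0.05))
relation_emb = tf.Variable(tf.random.normal([NUM_RELATIONS, DIM], stddev=0.05))

def distmult_1_to_n(head_ids, relation_ids):
    """Score every entity as the tail of (head, relation): sum_d e_h[d] * w_r[d] * e_t[d]."""
    h = tf.nn.embedding_lookup(entity_emb, head_ids)        # (batch, dim)
    r = tf.nn.embedding_lookup(relation_emb, relation_ids)  # (batch, dim)
    return tf.matmul(h * r, entity_emb, transpose_b=True)   # (batch, num_entities)

# Training would typically apply a sigmoid cross-entropy over the multi-label tail targets.
scores = distmult_1_to_n(tf.constant([0, 1]), tf.constant([3, 5]))
print(scores.shape)  # (2, 40943)
```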

Knowledge Graph Tools


Knowledge Base Question Answering

  • Rule-based System(基于规则的系统)

    For example, we want to answer the following kinds of questions (a minimal pattern-matching sketch follows the list):

     	宝马是什么?  /  what is BMW?
     	我想了解一下宝马  /  I want to know about BMW
     	给我介绍一下宝马  /  please introduce BMW to me
     	宝马这个牌子的汽车怎么样?  /  how are the cars of the BMW brand?
     	宝马如何呢?  /  how is BMW?
     	宝马汽车好用吗?  /  is a BMW a good car to use?
     	宝马和奔驰比怎么样?  /  how does BMW compare to Benz?
     	宝马和奔驰比哪个好?  /  which one is better, BMW or Benz?
     	宝马和奔驰比哪个更好?  /  which one is better still, BMW or Benz?
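
Questions like these can be routed with a handful of surface patterns before any learned model is involved. A purely illustrative regex-based sketch follows; the patterns and intent names are assumptions, not the repository's actual rules.

```python
import re

# Illustrative rules only; a real rule-based system would be considerably richer.
RULES = [
    (re.compile(r"(.+?)是什么|介绍一下(.+)|我想了解一下(.+)"), "definition"),    # "what is X"
    (re.compile(r"(.+?)和(.+?)比(?:怎么样|哪个更?好)"), "comparison"),          # "X vs Y"
    (re.compile(r"(.+?)(?:怎么样|如何|好用吗)"), "opinion"),                    # "how is X"
]

def route(question: str):
    for pattern, intent in RULES:
        match = pattern.search(question)
        if match:
            entities = [g for g in match.groups() if g]
            return intent, entities
    return "unknown", []

print(route("宝马是什么?"))          # ('definition', ['宝马'])
print(route("宝马和奔驰比哪个好?"))  # ('comparison', ['宝马', '奔驰'])
print(route("宝马汽车好用吗?"))      # ('opinion', ['宝马汽车'])
```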
    

Question Answering

└── finch/tensorflow1/question_answering/babi
	│
	├── data
	│   └── make_data.ipynb           		# step 1. run this to generate vocabulary: word.txt 
	│   └── qa5_three-arg-relations_train.txt       # one complete example of babi dataset
	│   └── qa5_three-arg-relations_test.txt	# one complete example of babi dataset
	│
	├── vocab
	│   └── word.txt                  		# complete list of words in vocabulary
	│	
	└── main              
		└── dmn_train.ipynb
		└── dmn_serve.ipynb
		└── attn_gru_cell.py
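
attn_gru_cell.py implements the attention-based GRU used by the Dynamic Memory Network's episodic memory: the usual update gate is replaced by an attention gate supplied with each fact. Below is a TensorFlow 2-style sketch of that idea (not the repository's TF1 implementation); passing the gate as the last input feature is an illustrative convention, not the repo's interface.

```python
import tensorflow as tf

class AttnGRUCell(tf.keras.layers.Layer):
    """Attention GRU: h_t = g * h_tilde + (1 - g) * h_{t-1}, where g is a per-step
    attention gate carried in the last input feature (an illustrative convention)."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.state_size = units
        self.output_size = units

    def build(self, input_shape):
        dim = input_shape[-1] - 1                      # last column carries the gate
        self.w_r = self.add_weight(name="w_r", shape=(dim, self.units))
        self.u_r = self.add_weight(name="u_r", shape=(self.units, self.units))
        self.b_r = self.add_weight(name="b_r", shape=(self.units,), initializer="zeros")
        self.w_h = self.add_weight(name="w_h", shape=(dim, self.units))
        self.u_h = self.add_weight(name="u_h", shape=(self.units, self.units))
        self.b_h = self.add_weight(name="b_h", shape=(self.units,), initializer="zeros")

    def call(self, inputs, states):
        h_prev = states[0]
        x, g = inputs[:, :-1], inputs[:, -1:]          # split fact vector and attention gate
        r = tf.sigmoid(x @ self.w_r + h_prev @ self.u_r + self.b_r)
        h_tilde = tf.tanh(x @ self.w_h + (r * h_prev) @ self.u_h + self.b_h)
        h = g * h_tilde + (1.0 - g) * h_prev
        return h, [h]

# Episode computation over fact vectors whose last feature is the attention gate.
facts_with_gates = tf.keras.Input(shape=(None, 81))    # 80-dim facts + 1 gate (placeholder sizes)
episode = tf.keras.layers.RNN(AttnGRUCell(80))(facts_with_gates)
model = tf.keras.Model(facts_with_gates, episode)
```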

Text Processing Tools


Recommender System

└── finch/tensorflow1/recommender/movielens
	│
	├── data
	│   └── make_data.ipynb           		# run this to generate vocabulary
	│
	├── vocab
	│   └── user_job.txt
	│   └── user_id.txt
	│   └── user_gender.txt
	│   └── user_age.txt
	│   └── movie_types.txt
	│   └── movie_title.txt
	│   └── movie_id.txt
	│	
	└── main              
		└── dnn_softmax.ipynb
		└── ......
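
dnn_softmax.ipynb feeds the user and movie vocabularies listed above through embedding layers and a small DNN with a softmax output. A minimal TensorFlow 2 sketch of that kind of model is below; the vocabulary sizes, layer widths, and the rating-as-class target are placeholders (movie_title is omitted for brevity).

```python
import tensorflow as tf

# Placeholder vocabulary sizes corresponding to the vocab files listed above.
VOCABS = {"user_id": 6041, "user_gender": 2, "user_age": 7, "user_job": 21,
          "movie_id": 3953, "movie_types": 19}
EMBED_DIM, NUM_CLASSES = 32, 5   # e.g. predict the 1-5 rating with a softmax

inputs, embedded = {}, []
for name, size in VOCABS.items():
    inp = tf.keras.Input(shape=(1,), dtype=tf.int32, name=name)
    emb = tf.keras.layers.Embedding(size, EMBED_DIM)(inp)
    inputs[name] = inp
    embedded.append(tf.keras.layers.Flatten()(emb))

x = tf.keras.layers.Concatenate()(embedded)
x = tf.keras.layers.Dense(256, activation="relu")(x)
x = tf.keras.layers.Dense(128, activation="relu")(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```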

Multi-turn Dialogue Rewriting

└── finch/tensorflow1/multi_turn_rewrite/chinese/
	│
	├── data
	│   └── make_data.ipynb         # run this to generate vocab, split train & test data, make pretrained embedding
	│   └── corpus.txt		# original data downloaded from external
	│   └── train_pos.txt		# processed positive training data after {make_data.ipynb}
	│   └── train_neg.txt		# processed negative training data after {make_data.ipynb}
	│   └── test_pos.txt		# processed positive testing data after {make_data.ipynb}
	│   └── test_neg.txt		# processed negative testing data after {make_data.ipynb}
	│
	├── vocab
	│   └── cc.zh.300.vec		# fastText pretrained embedding downloaded from external
	│   └── char.npy		# chinese characters and their embedding values (300 dim)	
	│   └── char.txt		# list of chinese characters used in this project 
	│	
	└── main              
		└── baseline_lstm_train.ipynb
		└── baseline_lstm_export.ipynb
		└── baseline_lstm_predict.ipynb
  • Task: 20k Tencent AI R&D Data(20k 腾讯 AI 研发数据, Chinese Data)

     Data split: training data (positive): 18986, testing data (positive): 1008

     Training data = 2 * 18986 because of 1:1 negative sampling
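
The highlights in the contents section report exact match for this rewriting task. A minimal sketch of that metric, assuming predictions and references are plain rewritten utterances (the whitespace normalization is an assumption):

```python
def exact_match(predictions, references):
    """Fraction of predicted rewrites that equal the reference rewrite exactly."""
    normalize = lambda s: "".join(s.split())   # drop whitespace before comparing (an assumption)
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

print(exact_match(["你 喜欢 梅西 吗"], ["你喜欢梅西吗"]))  # 1.0
```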
    

About

Building blocks for NLP and Text Generation in TensorFlow 2.x / 1.x
