Introduction
In our previous article, we implemented an application that converts Khmer into Roman script by writing the logic from scratch based on a given paper, because we did not have enough data to apply deep learning to the problem. However, we noticed that Google Translate also romanizes Khmer. So we can take the list of Khmer words from our previous article and obtain the corresponding list of romanizations. We can then use this data to train our own model to convert Khmer into Roman script.
Plan of attack
There are many machine learning algorithms we could use to solve our problem. Since the task is to build a model that translates Khmer into Roman script, one architecture stands out: Seq2Seq. A Seq2Seq model takes an input sequence (words, letters, time series, etc.) and outputs another sequence. This kind of model has had a lot of success in tasks such as machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. We also used this model in our article on building a chatbot.
Implementation
For this experiment, we use Keras to develop our Seq2Seq model. Fortunately, Keras has a tutorial on building a model that translates English to French. We will modify that code to translate Khmer into Roman script instead. If you do not understand our code, you can check the original Keras example (linked under Resources) for further explanation.
First, we import the necessary packages:
```python
from __future__ import print_function  # must precede every other import

import numpy as np
import pandas as pd
from keras.models import Model
from keras.layers import Input, LSTM, Dense
```
Then we load the data into memory using pandas:
```python
data_kh = pd.read_csv("data/data_kh.csv", header=None)
data_rom = pd.read_csv("data/data_rom.csv", header=None)

batch_size = 32  # Batch size for training.
epochs = 100  # Number of epochs to train for.
latent_dim = 256  # Latent dimensionality of the encoding space.
num_samples = 7154  # Number of samples to train on.

# Vectorize the data.
input_texts = []
target_texts = []
input_characters = set()
target_characters = set()
```
Once the data is loaded, we need to clean it and split it into individual characters:
```python
for input_text in data_kh[0]:
    input_text = str(input_text).strip()
    input_texts.append(input_text)
    for char in input_text:
        if char not in input_characters:
            input_characters.add(char)

for target_text in data_rom[0]:
    # Tab marks the start of a target sequence and newline marks its end.
    target_text = '\t' + str(target_text).strip() + '\n'
    target_texts.append(target_text)
    for char in str(target_text):
        if char not in target_characters:
            target_characters.add(char)

num_encoder_tokens = len(input_characters)
num_decoder_tokens = len(target_characters)
max_encoder_seq_length = max([len(txt) for txt in input_texts])
max_decoder_seq_length = max([len(txt) for txt in target_texts])
```
Next, we build the character-to-index lookups and allocate zero-filled arrays for the input and output sequences, sized by the maximum length of the input and output samples.
```python
input_token_index = dict(
    [(char, i) for i, char in enumerate(input_characters)])
target_token_index = dict(
    [(char, i) for i, char in enumerate(target_characters)])

encoder_input_data = np.zeros(
    (len(input_texts), max_encoder_seq_length, num_encoder_tokens),
    dtype='float32')
decoder_input_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
decoder_target_data = np.zeros(
    (len(input_texts), max_decoder_seq_length, num_decoder_tokens),
    dtype='float32')
```
Then we one-hot encode the input and output sequences before passing them to our model.
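The encoding step described above can be sketched as follows. This is a minimal sketch modeled on the one-hot vectorization in the Keras lstm_seq2seq example, and it is not necessarily the exact code used in this experiment. The helper name `one_hot_encode` and the toy word pair are assumptions made for illustration. The decoder target is the decoder input shifted one timestep ahead, so the model learns to predict the next character from the previous ones.

```python
import numpy as np


def one_hot_encode(input_texts, target_texts):
    """One-hot encode character sequences for a seq2seq model.

    Hypothetical helper: target_texts are assumed to already carry
    the '\t' start marker and the '\n' end marker.
    """
    input_chars = sorted(set("".join(input_texts)))
    target_chars = sorted(set("".join(target_texts)))
    input_index = {ch: i for i, ch in enumerate(input_chars)}
    target_index = {ch: i for i, ch in enumerate(target_chars)}

    n = len(input_texts)
    max_enc = max(len(t) for t in input_texts)
    max_dec = max(len(t) for t in target_texts)

    encoder_input = np.zeros((n, max_enc, len(input_chars)), dtype="float32")
    decoder_input = np.zeros((n, max_dec, len(target_chars)), dtype="float32")
    decoder_target = np.zeros((n, max_dec, len(target_chars)), dtype="float32")

    for i, (src, tgt) in enumerate(zip(input_texts, target_texts)):
        for t, ch in enumerate(src):
            encoder_input[i, t, input_index[ch]] = 1.0
        for t, ch in enumerate(tgt):
            decoder_input[i, t, target_index[ch]] = 1.0
            if t > 0:
                # The target runs one timestep ahead of the decoder input
                # and therefore never contains the start marker.
                decoder_target[i, t - 1, target_index[ch]] = 1.0
    return encoder_input, decoder_input, decoder_target, input_index, target_index


# Toy pair for illustration only (not taken from the training data).
enc, dec_in, dec_out, in_idx, tgt_idx = one_hot_encode(["កខ"], ["\tkakha\n"])
print(enc.shape, dec_in.shape, dec_out.shape)
```

Sorting the character sets before indexing keeps the character-to-index mapping stable between runs, which matters if a saved model is reloaded in a later session.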
With Keras, we can build a seq2seq model easily:
```python
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Compile the model for training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
```
Then we can start training our model:
```python
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2)
```
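One detail worth keeping in mind: Keras takes the `validation_split` fraction from the last samples of the arrays, before any shuffling. If the word list happens to be ordered (alphabetically, for instance, which is an assumption here and not something stated about this dataset), the validation set would not be representative. A minimal sketch of shuffling the arrays with one shared permutation before training is shown below; the helper name `shuffle_in_unison` and the stand-in arrays are hypothetical.

```python
import numpy as np


def shuffle_in_unison(arrays, seed=0):
    """Apply one shared random permutation along the first axis of each array."""
    n = arrays[0].shape[0]
    if any(a.shape[0] != n for a in arrays):
        raise ValueError("all arrays must have the same number of samples")
    order = np.random.default_rng(seed).permutation(n)
    return [a[order] for a in arrays]


# Stand-in arrays; in the article these would be encoder_input_data,
# decoder_input_data and decoder_target_data.
x = np.arange(10).reshape(5, 2)
y = np.arange(5)
x_shuffled, y_shuffled = shuffle_in_unison([x, y])
# Row i of x_shuffled still corresponds to element i of y_shuffled.
print(x_shuffled[:, 0] // 2, y_shuffled)
```

If the arrays are shuffled this way, the text lists used for printing results would need the same permutation applied so that indices still line up.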
And don't forget to save the trained model so that you do not have to train it again:
```python
# Save model
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save('s2s.h5')
```
Testing
Once training is complete, we can test our model and inspect the results:
```python
encoder_model = Model(encoder_inputs, encoder_states)

decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

reverse_input_char_index = dict(
    (i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict(
    (i, char) for char, i in target_token_index.items())


def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)

    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict(
            [target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '\n' or
                len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        # Update states
        states_value = [h, c]

    return decoded_sentence


for seq_index in range(100):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    # strip() drops the start/end markers so each result prints on one line.
    print('KH: ' + input_texts[seq_index] +
          ', Roman: ' + target_texts[seq_index].strip() +
          ', predicted: ' + decoded_sentence.strip())
```
Let's run it.
Based on the results, it seems our model is not performing well yet. So now it is your turn to improve this model and make it better.
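To tell whether a change actually improves the model, it helps to put a number on the predictions instead of reading them by eye. The following is a minimal sketch of two common measures for transliteration output: exact-match accuracy and character error rate (edit distance divided by the total reference length). The function names and the sample strings are assumptions for illustration and are not part of the original experiment. Ideally the predictions would come from words held out of training, since the loop above decodes training samples.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def evaluate(references, predictions):
    """Return (exact-match accuracy, character error rate) over paired strings."""
    # strip() removes the tab/newline markers around each sequence.
    refs = [r.strip() for r in references]
    preds = [p.strip() for p in predictions]
    exact = sum(r == p for r, p in zip(refs, preds)) / len(refs)
    total_edits = sum(edit_distance(r, p) for r, p in zip(refs, preds))
    total_chars = sum(len(r) for r in refs)
    return exact, total_edits / total_chars


# Made-up strings for illustration; not real model output.
accuracy, cer = evaluate(["\tabc\n", "\tabcd\n"], ["abc\n", "abxd\n"])
print(accuracy, cer)
```

Character error rate gives partial credit for near misses, which is usually more informative than exact match alone when most predictions are off by a letter or two.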
Resources
- Source code
- https://keras.io/examples/lstm_seq2seq/
- https://github.com/udacity/deep-learning-v2-pytorch/blob/master/recurrent-neural-networks/char-rnn/Character_Level_RNN_Solution.ipynb
- https://karpathy.github.io/2015/05/21/rnn-effectiveness/
- https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263
- https://www.guru99.com/seq2seq-model.html
What's next?
In this article, we learned how to prepare our text data and built a model that uses the processed data to learn to convert Khmer into Roman script. We used an architecture called seq2seq, also known as encoder-decoder, which is well suited to sequence problems. In our case, the input sequence is a Khmer word and the output sequence is its romanization, and the two can differ in length. However, our model does not yet produce good predictions, so it is your turn to improve it and compete with Google.