Preamble
On a fine day, I sat down to write a tutorial on building a simple chatbot while saying no to chatbot frameworks like Rasa. I will code this bot on the NLU side only, i.e. no stories, no dialogue memory, no slot filling, no actions, and so on.
Language: Python
Operating system: Ubuntu 16.04
Techniques: Database, Intent Classification
Database
DB: Postgres
To make the project easy to set up and run, I created a shell script, run.sh, that runs the installation commands automatically:
```bash
#!/bin/bash

sudo apt-get update -y
sudo apt-get install -y software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt-get update -y
sudo apt-get install -y build-essential python3.6 python3.6-dev python3-pip libpq-dev postgresql postgresql-contrib python-virtualenv

virtualenv -p /usr/bin/python3.6 funnybot_env
source funnybot_env/bin/activate

python3.6 -m pip install pip --upgrade
python3.6 -m pip install wheel
pip3 install -r requirements.txt
```
The requirements.txt file lists the Python packages; for now it holds only the Postgres driver:
```
psycopg2>=2.7,<3.0
```
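(psycopg2 may be compiled from source during installation, which is why run.sh installs libpq-dev and build-essential beforehand.)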
To run the script, use the command sh run.sh.
After the installation completes, you need to create a database. Again, I wrote a shell script for convenience, create_db.sh:
```bash
#!/bin/bash
# Create the db
sudo -u postgres psql postgres -c "CREATE DATABASE funnybot WITH ENCODING 'UTF8';"
# Create the user and password
sudo -u postgres psql postgres -c "CREATE USER bot_owner WITH PASSWORD '123456';"
# Configure the role
sudo -u postgres psql postgres -c "ALTER ROLE bot_owner SET client_encoding TO 'utf8';"
sudo -u postgres psql postgres -c "ALTER ROLE bot_owner SET default_transaction_isolation TO 'read committed';"
sudo -u postgres psql postgres -c "ALTER ROLE bot_owner SET timezone TO 'UTC';"
# Grant privileges
sudo -u postgres psql postgres -c "GRANT ALL PRIVILEGES ON DATABASE funnybot TO bot_owner;"
```
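Run it the same way as before: sh create_db.sh.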
After creating the database, write a function that connects to funnybot:
```python
import psycopg2

def connect_db():
    conn = None
    try:
        # Connect to the db using the psycopg2 library
        conn = psycopg2.connect(host="localhost",
                                database="funnybot",
                                user="bot_owner",
                                password="123456")
        # Declare a cursor
        cur = conn.cursor()
        return conn, cur
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
```
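As a quick sanity check (my own addition, not part of the original tutorial), you can connect and print the Postgres version:

```python
# Sanity check: connect to funnybot and print the server version
conn, cur = connect_db()
cur.execute("SELECT version();")
print(cur.fetchone())
cur.close()
conn.close()
```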
After connecting to funnybot, create the tables:
```python
def create_tables():
    commands = [
        """
        CREATE TABLE bot (
            bot_id SERIAL PRIMARY KEY,
            bot_name VARCHAR(255) NOT NULL
        );
        """,
        """
        CREATE TABLE intents (
            intent_id SERIAL PRIMARY KEY,
            intent_name VARCHAR(255) NOT NULL
        );
        """,
        """
        CREATE TABLE intents_bot (
            intent_id int REFERENCES intents (intent_id) ON UPDATE CASCADE,
            bot_id int REFERENCES bot (bot_id) ON UPDATE CASCADE,
            CONSTRAINT intent_bot_pkey PRIMARY KEY (intent_id, bot_id)
        );
        """,
        """
        CREATE TABLE training_data (
            training_data_id SERIAL PRIMARY KEY,
            intent_id int NOT NULL REFERENCES intents (intent_id),
            content VARCHAR(255) NOT NULL
        );
        """,
        """
        CREATE TABLE response_data (
            response_data_id SERIAL PRIMARY KEY,
            intent_id int NOT NULL REFERENCES intents (intent_id),
            answer VARCHAR(255) NOT NULL
        );
        """
    ]
    conn, cur = connect_db()
    for command in commands:
        cur.execute(command)
    cur.close()
    conn.commit()

if __name__ == '__main__':
    create_tables()
```
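A quick note on the schema: intents_bot is a join table, modeling a many-to-many relationship so one bot can own many intents and an intent can be reused across bots.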
Add some data to the tables:
- Add the bot
```python
def insert_bot_data():
    sql = "INSERT INTO bot(bot_name) VALUES(%s)"
    conn, cur = connect_db()
    cur.execute(sql, ("funnybot",))
    conn.commit()
    cur.close()
```
- Add intents
```python
def insert_intents_data():
    sql = "INSERT INTO intents(intent_name) VALUES(%s)"
    conn, cur = connect_db()
    # Intents: 'chào hỏi' (greeting), 'tạm biệt' (goodbye),
    # 'xin lỗi' (apology), 'cảm ơn' (thanks)
    cur.executemany(sql, [('chào hỏi',), ('tạm biệt',), ('xin lỗi',), ('cảm ơn',)])
    conn.commit()
    cur.close()
```
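Because intent_id is a SERIAL column, on a fresh database these four intents get ids 1 through 4; the training data and answers below rely on exactly those ids.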
- Add training data for intents
```python
def insert_training_data():
    sql = "INSERT INTO training_data(intent_id, content) VALUES(%s, %s)"
    conn, cur = connect_db()
    data = [
        (1, 'chào bạn'), (1, 'chào anh'), (1, 'em chào anh'),
        (1, 'chào chị'), (1, 'chào đằng ấy'), (1, 'hello'),
        (2, 'tạm biệt'), (2, 'tạm biệt anh'), (2, 'hẹn gặp lại'),
        (2, 'lần sau gặp lại'), (2, 'goodbye'),
        (3, 'xin lỗi'), (3, 'xin lỗi bạn'), (3, 'xin lỗi anh'),
        (3, 'thật có lỗi'), (3, 'em rất xin lỗi'),
        (4, 'cảm ơn'), (4, 'cảm ơn bạn'), (4, 'cảm ơn anh'),
        (4, 'cảm ơn rất nhiều'), (4, 'em rất biết ơn anh'), (4, 'em cảm ơn anh')
    ]
    cur.executemany(sql, data)
    conn.commit()
    cur.close()
```
- Add the answers
```python
def insert_response_data():
    sql = "INSERT INTO response_data(intent_id, answer) VALUES(%s, %s)"
    conn, cur = connect_db()
    data = [(1, 'chào anh/chị'), (2, 'chúc anh chị một ngày tốt lành'),
            (3, 'xin lỗi anh/chị'), (4, 'cảm ơn anh/chị')]
    cur.executemany(sql, data)
    conn.commit()
    cur.close()
```
You may feel that this little data doesn't justify a database, but on a real project you will run into plenty of data problems: large volumes, confidentiality, storage capacity, structure, access patterns, and so on. It's not just a matter of importing a CSV file. Ok, that was a bit rambling. Now that we have data, we need to process it and feed it into a machine learning model to classify the user's intent.
Intent Classification
A heads-up: I won't walk through fetching the data back from Postgres, or this article would get long and thin on substance. In general, to retrieve the data you just replace the INSERT statements above with queries like SELECT * FROM table_name; and read the rows back through the cursor.
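For illustration, here is a minimal sketch of reading the training data back; fetch_training_data is a hypothetical helper of mine, built on the connect_db() function above:

```python
def fetch_training_data():
    # Read every (intent_id, content) row back from Postgres
    conn, cur = connect_db()
    cur.execute("SELECT intent_id, content FROM training_data;")
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows
```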
Import the needed packages. Here I use the MultinomialNB probabilistic model and scikit-learn's text-to-vector techniques (bag of words, TF-IDF):
```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
import numpy as np
```
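Roughly speaking, MultinomialNB scores each intent c as P(c) · ∏ P(w|c)^f(w), where f(w) is the (TF-IDF weighted) count of feature w in the message, and predicts the highest-scoring intent.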
Data preparation
```python
training_data = [
    (1, 'chào bạn'), (1, 'chào anh'), (1, 'em chào anh'),
    (1, 'chào chị'), (1, 'chào đằng ấy'), (1, 'hello'),
    (2, 'tạm biệt'), (2, 'tạm biệt anh'), (2, 'hẹn gặp lại'),
    (2, 'lần sau gặp lại'), (2, 'goodbye'),
    (3, 'xin lỗi'), (3, 'xin lỗi bạn'), (3, 'xin lỗi anh'),
    (3, 'thật có lỗi'), (3, 'em rất xin lỗi'),
    (4, 'cảm ơn'), (4, 'cảm ơn bạn'), (4, 'cảm ơn anh'),
    (4, 'cảm ơn rất nhiều'), (4, 'em rất biết ơn anh'), (4, 'em cảm ơn anh')
]

response_data = [
    (1, 'chào anh/chị'),
    (2, 'chúc anh chị một ngày tốt lành'),
    (3, 'xin lỗi anh/chị'),
    (4, 'cảm ơn anh/chị')
]
```
With the raw data prepared above, we split it and train:
```python
def convert_to_x_y(data):
    X = []
    y = []
    for d in data:
        X.append(d[1])   # message text
        y.append([d[0]]) # intent id
    X = np.array(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    return X_train, y_train
```
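Note that train_test_split holds out 25% of the samples by default; only the training split is returned and used here, so a few examples never reach the model.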
Here I use sklearn's Pipeline module to simplify preprocessing and model definition: the model only needs to fit the data, and the data flows through each step of the pipeline. After training and predicting, a few more logic steps map the predicted intent to an answer:
```python
def mapping_response(intent):
    # Look up the answer whose intent_id matches the predicted intent
    for r in response_data:
        if r[0] == intent[0]:
            return r[1]

def train_and_predict(msg):
    X_train, y_train = convert_to_x_y(training_data)
    clf = Pipeline([
        ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB())
    ])
    clf.fit(X_train, np.ravel(y_train))
    intent = clf.predict([msg])
    answer = mapping_response(intent)
    return answer
```
Result
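To see it in action, call train_and_predict directly. This is a hypothetical test run (my addition); the exact answers depend on the model trained above:

```python
# Hypothetical test run; the expected answers assume the model
# learned the toy data above correctly
if __name__ == '__main__':
    print(train_and_predict('chào anh'))      # expect: 'chào anh/chị'
    print(train_and_predict('hẹn gặp lại'))   # expect: 'chúc anh chị một ngày tốt lành'
```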
Conclusion
That's about it. I planned to do Entity Extraction as well, but I've been too busy and have put it aside for now; maybe I will get to it in the future.
Thank you for reading this article.