A graph convolutional network approach to extracting information from invoices

Tram Ho


Mobile Capture Receipts Optical Character Recognition (MC-OCR) is a receipt-photo contest with two tasks. My team participated in the second task: extracting basic information, including SELLER, SELLER_ADDRESS, TIMESTAMP, and TOTAL_COST (seller name, address, time, and total payment), from receipts captured with phone cameras.


The invoice images provided by the organizers contain a large background region (sometimes more than 50% of the image) and are tilted and rotated in many different directions. Therefore, to recognize the text as accurately as possible, we first need to remove the background and rotate the remaining receipt image to its correct orientation.

Segmentation and rotation

To segment the receipt out of the background we use a network called BASNet (Boundary-Aware Salient object detection Network). This is a salient-object-detection network: simply put, it only distinguishes foreground objects from the background, without knowing which class the object belongs to (the same author later released an upgraded version, U²-Net). We used the author's pretrained model as-is and did not perform any fine-tuning.
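The post does not include the segmentation code; below is a minimal sketch of the cropping step, assuming BASNet has already produced a per-pixel foreground-probability mask for the receipt (the model call itself is omitted). The function and threshold are illustrative, not the authors' exact implementation.

```python
import numpy as np

def crop_to_salient_region(image, saliency_mask, threshold=0.5):
    """Crop the image to the bounding box of the salient (receipt) region.

    `saliency_mask` is assumed to be the per-pixel foreground probability
    produced by a salient-object-detection model such as BASNet.
    """
    fg = saliency_mask >= threshold
    ys, xs = np.where(fg)
    if ys.size == 0:                       # no foreground found: return as-is
        return image
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1]
```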



Figure 1: Result from model segmentation

After segmentation, we compute the angle between the top-bottom axis of the receipt and the vertical axis of the image, and then rotate the receipt to the correct upright position.
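The angle computation can be sketched as follows, assuming we already have two corner points on the receipt's left edge (e.g. from the segmentation contour); the corner-extraction step itself is not shown and the function name is illustrative:

```python
import math

def tilt_angle(top_left, bottom_left):
    """Angle (in degrees) between the receipt's top-bottom axis and the
    image's vertical axis; positive means the receipt leans clockwise.
    Points are (x, y) with y growing downward, as in image coordinates."""
    dx = bottom_left[0] - top_left[0]
    dy = bottom_left[1] - top_left[1]
    return math.degrees(math.atan2(dx, dy))
```

The receipt is then rotated by the negative of this angle (e.g. with an OpenCV affine warp) so its top-bottom axis aligns with the image's vertical axis.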

Image orientation (determining the direction of the receipt)

We used a self-supervised network to determine the orientation of the receipt, following the idea in the paper Unsupervised Representation Learning by Predicting Image Rotations by Spyros Gidaris. Each receipt can be in one of four orientations, rotated by 0, 90, 180, or 270 degrees, as shown below.


Figure 2: Self-supervised for the rotation image problem

We used a ResNet backbone and reached almost 96% accuracy on our test set. (This was done after the contest to speed up inference; during the contest we ran OCR on all four orientations and picked the orientation that produced the most recognized words.)
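The contest-time heuristic described above can be sketched as a small helper; `rotate` and `ocr` are assumed to be provided by the pipeline (e.g. an OpenCV rotation and a VietOCR wrapper returning recognized words) and are not part of the original post:

```python
def best_orientation(image, rotate, ocr):
    """Contest-time heuristic: run OCR on all four rotations of the image
    and keep the orientation that yields the most recognized words.

    rotate(image, degrees) -> rotated image
    ocr(image) -> list of recognized word strings
    """
    counts = {deg: len(ocr(rotate(image, deg))) for deg in (0, 90, 180, 270)}
    return max(counts, key=counts.get)
```

This works because OCR recognizes far fewer valid words on sideways or upside-down text, but it is roughly four times slower than a single forward pass of a rotation classifier, which is why the classifier was added after the contest.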

Text detection and recognition

Like other teams, we use CRAFT for text detection and VietOCR for text recognition. Many articles already cover these models, so we will not go into detail here.

Graph convolution network

To solve the key information extraction problem there are several approaches, such as text classification or template matching, but we think the graph approach is the best. Each receipt is modeled as a graph $G(V, E)$, where $V$ (the vertices, or nodes) is the set of vertices corresponding to the bounding box of each text region (textbox), and $E$ is the set of edges representing relationships between the vertices. The problem then belongs to the node classification family, with many proposed models such as PICK (Processing Key Information Extraction from Documents Using Improved Graph Learning-Convolutional Networks), which combines visual and textual features; the point we do not like in that paper is the overly rigid combination of these two features. Our team prefers an approach based on text features and box positions, so we use the Residual Gated Graph ConvNets paper by Xavier Bresson, a well-known researcher with many graph papers. Xavier Bresson also created Benchmarking Graph Neural Networks, with many models implemented on the DGL library.


Figure 3: Graph architecture for node classification problem

Definition of node features and edge features

1. Node features

The node feature is synthesized from the textbox coordinates and the text recognized by the OCR model. Each textbox is a vector $L = (x_i, y_i \mid i \in [1, 4])$, where $(x_i, y_i)$ are the corner coordinates of the textbox. The text from OCR is embedded and fed into an LSTM network. The textbox and text representations are then combined by element-wise addition to form the node feature. We use character embeddings rather than pretrained models like word2vec or BERT for the following reasons: a) the invoices contain Vietnamese, English, and numbers; b) many Vietnamese words are recognized with incorrect diacritics; and c) there is not much contextual information.
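A minimal sketch of the combination step, assuming the text embedding has already been produced by a character-level LSTM (not shown). Projecting the 8 normalized coordinates into the embedding dimension by zero-padding is an assumption for illustration; the authors more likely use a learned linear layer:

```python
import numpy as np

def node_feature(box, text_embedding, img_w, img_h):
    """Combine box geometry with a text embedding by element-wise addition.

    `box` is the 4 corner points (x_i, y_i), i in [1, 4].
    `text_embedding` is assumed to come from a character-level LSTM.
    """
    coords = np.asarray(box, dtype=np.float32).reshape(-1)   # (8,)
    coords[0::2] /= img_w                                    # normalize x
    coords[1::2] /= img_h                                    # normalize y
    geom = np.zeros_like(text_embedding)
    geom[: coords.size] = coords       # zero-pad geometry to embedding dim
    return geom + text_embedding       # element-wise combination
```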

2. Edge features

An edge represents the association between a pair of nodes in the graph.
We first define the links between any two nodes. Assuming the text in the receipt is arranged left-to-right, top-to-bottom, two nodes $v_i$ and $v_j$ are said to be linked if:

$$d(v_i, v_j) = \left| v_{i,y} - v_{j,y} \right| < 3 \times h_{v_i}$$

where $h_{v_i}$ is the height of the current node. Simply put, two nodes are considered linked when the distance along the $y$ axis between them does not exceed 3 times the height of the current node. We define the edge feature of two linked nodes as the vector of distances along the $x$ and $y$ axes, given by:

$$\mathrm{distance}(v_i, v_j) = \left( \left| v_{i,x} - v_{j,x} \right|,\ \left| v_{i,y} - v_{j,y} \right| \right)$$

3. Network architecture (graph model architecture)

We use a graph model called Residual Gated Graph ConvNets. The edge and node features defined above are fed into RG-GCN (Residual Gated Graph ConvNets) layers.

$$\mathbf{h} = \mathbf{x} + \left( \mathbf{A}\mathbf{x} + \sum_{v_j \to v} \eta(e_j) \odot \mathbf{B}\mathbf{x}_j \right)^+,$$

where $\mathbf{x}$ is the residual (or skip connection, as in ResNet), $\mathbf{A}\mathbf{x}$ is the contribution of the current node, the sum $\sum$ is the contribution of the neighboring nodes, and $(\cdot)^+$ is the ReLU function. $\eta$ is the gate that weights each neighboring node's contribution to the current node, and is computed by:

$$\eta(e_j) = \sigma(e_j) \left( \sum_{v_k \to v} \sigma(e_k) \right)^{-1},$$

where $\sigma$ is the sigmoid function, and $e_j$ and $e_k$ are the features of the edges connecting the current node $v$ with its neighboring nodes $v_j$ and $v_k$. The $e_j$ are in turn computed from the following formulas:

$$e_j = \mathbf{C} e_j^{x} + \mathbf{D}\mathbf{x}_j + \mathbf{E}\mathbf{x},$$

$$e_j^{h} = e_j^{x} + (e_j)^+,$$

where $e_j^{x}$ and $e_j^{h}$ are the input and output of the hidden layer for the feature vector $e_j$ of the edge joining the current vertex $v$ with vertex $v_j$. $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}, \mathbf{E}$ are weight matrices learned during training.

The graph model can be defined as a class as follows:
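The original code listing was not preserved in this copy of the article (the authors' version is built on DGL). Below is a minimal NumPy sketch of a single RG-GCN layer implementing the equations above, for illustration only; class and variable names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ResGatedGCNLayer:
    """One Residual Gated Graph ConvNets layer (illustrative NumPy sketch).

    A, B, C, D, E are the learned weight matrices from the formulas above.
    """

    def __init__(self, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        self.A, self.B, self.C, self.D, self.E = (
            rng.standard_normal((dim, dim)) * 0.1 for _ in range(5))

    def forward(self, x, e, edges):
        """x: (N, dim) node features; edges: list of (j, i) meaning
        v_j -> v_i; e: dict {(j, i): (dim,) edge feature vector}."""
        N, _ = x.shape
        # edge update: e_j = C e_j^x + D x_j + E x
        e_new = {(j, i): self.C @ e[(j, i)] + self.D @ x[j] + self.E @ x[i]
                 for j, i in edges}
        h = x.copy()
        for i in range(N):
            inc = [(j, e_new[(j, i2)]) for j, i2 in edges if i2 == i]
            if not inc:                      # isolated node: no neighbor term
                h[i] = x[i] + np.maximum(self.A @ x[i], 0)
                continue
            gates = np.array([sigmoid(ej) for _, ej in inc])   # (k, dim)
            eta = gates / gates.sum(axis=0)                    # normalized gates
            agg = sum(eta[k] * (self.B @ x[j])
                      for k, (j, _) in enumerate(inc))
            h[i] = x[i] + np.maximum(self.A @ x[i] + agg, 0)   # residual + ReLU
        # edge residual: e_j^h = e_j^x + (e_j)^+
        e_out = {k: e[k] + np.maximum(v, 0) for k, v in e_new.items()}
        return h, e_out
```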

After stacking $L = 8$ RG-GCN layers, the node features are passed through a dense layer whose weights are shared across all nodes, and finally the cross-entropy loss function is used to classify the nodes.

Prepare dataset and training

Pseudo labels – adding approximate labels

To create training data for the graph model, we add the ground truth (including the 4 polygon vertices of each textbox, its text, and its class label) to the dataset created in the previous step (text detection), and remove a detected textbox if it overlaps one of the organizers' ground-truth textboxes with IoU > 0.2. The remaining boxes are labeled Other.
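The filtering step can be sketched as follows; boxes are assumed here to be axis-aligned `(x1, y1, x2, y2)` tuples (the actual annotations are 4-point polygons, so the authors' IoU is computed on polygons):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def pseudo_label(detected, ground_truth, thresh=0.2):
    """Keep a detected box only if it does not overlap any ground-truth
    box with IoU > thresh; the survivors get the label 'Other'."""
    return [(box, "Other") for box in detected
            if all(iou(box, gt) <= thresh for gt in ground_truth)]
```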

Data Augmentation

To increase the diversity of the dataset, we enriched it by replacing the SELLER and ADDRESS fields using a dictionary generated from the organizers' ground truth, and by randomizing the TIMESTAMP and TOTAL fields. Both text detectors, CTPN and CRAFT, were also used for data enrichment.
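A minimal sketch of the field-replacement augmentation, assuming a sample is a dict keyed by the contest field names; the timestamp and total formats below are illustrative, not the authors' exact generators:

```python
import random

def augment(sample, seller_dict, address_dict, rng=None):
    """Produce an augmented copy of a sample: SELLER and SELLER_ADDRESS are
    drawn from dictionaries built from the ground truth; TIMESTAMP and
    TOTAL_COST are randomized (illustrative formats)."""
    rng = rng or random.Random(0)
    out = dict(sample)
    out["SELLER"] = rng.choice(seller_dict)
    out["SELLER_ADDRESS"] = rng.choice(address_dict)
    out["TIMESTAMP"] = "{:02d}/{:02d}/2020 {:02d}:{:02d}".format(
        rng.randint(1, 28), rng.randint(1, 12),
        rng.randint(0, 23), rng.randint(0, 59))
    out["TOTAL_COST"] = "{:,}".format(rng.randrange(10, 2000) * 1000)
    return out
```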

Training and accuracy

After enrichment, the dataset contains about 10k samples, split 80:20 into train and test sets.
Training on a GTX 1080 Ti for 10 epochs produces the following curves:

The accuracy on each field is as follows:

Post processing

1. Spelling correction

We use the ground truth to correct wrongly recognized text. For example, the address and the name of a company/shop/market is usually a proper name, so we create a dictionary whose keys are the recognized texts and whose values are the organizers' ground truth. At inference time, if a recognized company/address matches a key in the dictionary, it is replaced with the corresponding value.
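A small sketch of this correction step. The post describes an exact dictionary lookup; the fuzzy fallback via `difflib` below is our assumption, added because OCR output rarely matches a key exactly:

```python
import difflib

def correct(text, known_values, cutoff=0.8):
    """Map an OCR output to the closest known ground-truth string.

    Exact matches return immediately; otherwise fall back to fuzzy
    matching (an assumption on top of the article's exact lookup).
    """
    if text in known_values:
        return text
    match = difflib.get_close_matches(text, known_values, n=1, cutoff=cutoff)
    return match[0] if match else text
```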

2. Regular expression

In some cases the date/timestamp label is incorrect because, when making pseudo labels, the ground-truth value and the OCR output do not match, so some dates are missed at inference time. A regular expression is used in these cases to correctly extract the datetime/timestamp alongside the graph model's prediction.
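A sketch of such a fallback extractor; the exact pattern the authors used is not given, so the one below (day/month/year or year-month-day, with an optional time) is an illustrative assumption:

```python
import re

# Illustrative pattern: 21/01/2021, 21-1-21, or 2021-01-21, optional hh:mm(:ss)
DATE_RE = re.compile(
    r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})"
    r"(\s+\d{1,2}:\d{2}(:\d{2})?)?")

def extract_timestamp(text):
    """Return the first date/timestamp found in the text, or None."""
    m = DATE_RE.search(text)
    return m.group(0) if m else None
```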


In the contest we achieved top 3 on the public test and top 4 on the private test in task 2. We also wrote a short paper about this work; the code and paper will be made public after the RIVF2021 conference ends. A big thank-you to my linguistic advisor and math advisor, Mr. Hung Nguyen, for helping me complete this article.



Source : Viblo