Use Tesseract to create an OCR server with Spring Boot
1. What is Tesseract?
Tesseract is one of the leading OCR (Optical Character Recognition) engines. It is distributed under the Apache 2.0 open source license. It recognizes text in image files and can output plain text, HTML (hOCR), PDF, TSV, or invisible-text-only PDF. Users can run it directly, and programmers can call its functions through the API.
Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley, Colorado, from 1985 to 1994. It then received a few minor changes, and development stopped after 1998. In 2005 HP released Tesseract as open source, and from 2006 its development was continued by Google.
At the time of writing, Tesseract is at version 3.0x and runs on three popular operating systems: Windows, Mac and Linux. It supports character recognition in over 100 languages, including Vietnamese. Tesseract can also be trained to recognize additional languages.
2. Installing and preparing the project (on a Linux environment)
a> Maven Dependency
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.2.1</version>
    </dependency>
b> Download the tessdata files from GitHub
https://github.com/tesseract-ocr/tessdata
c> Install Tesseract for Linux with the command:
sudo apt-get install tesseract-ocr
Check the installed version:
tesseract -v
3. Create project
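Step 1: Create a basic Spring Boot project.
Step 2: Rename the tessdata-master folder that you downloaded from GitHub to tessdata and copy it into the root of the project, so that it can be found at the relative path ./tessdata when the application runs.
Step 3: Add the Tess4J dependency to the project's pom.xml.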
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>3.2.1</version>
    </dependency>
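Tesseract loads its language models from files such as eng.traineddata inside the tessdata folder, so a misplaced folder is a common reason for OCR failing at runtime. The following is a minimal sketch of a check you could run from the project root before starting the server. The class name TessdataCheck and the method hasLanguage are hypothetical, and the sketch assumes the folder sits at ./tessdata and that the default language code is eng.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tess4J: checks that a trained-data file is present.
class TessdataCheck {

    /** Returns true if dataDir contains <language>.traineddata as a regular file. */
    static boolean hasLanguage(Path dataDir, String language) {
        return Files.isRegularFile(dataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Assumption: the folder was copied to ./tessdata, relative to the working directory.
        Path dataDir = Paths.get("./tessdata");
        // Assumption: "eng" is the language to check unless another code is passed in.
        String language = args.length > 0 ? args[0] : "eng";
        if (hasLanguage(dataDir, language)) {
            System.out.println("Found " + language + ".traineddata in " + dataDir.toAbsolutePath());
        } else {
            System.out.println("Missing " + language + ".traineddata in " + dataDir.toAbsolutePath());
        }
    }
}
```

Passing a different language code as the first argument (for example vie) checks for that language's file instead.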
Project structure
Class DemoOrcServerApplication
    import net.sourceforge.tess4j.Tesseract;
    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.context.annotation.Bean;

    @SpringBootApplication
    public class DemoOrcServerApplication {

        public static void main(String[] args) {
            SpringApplication.run(DemoOrcServerApplication.class, args);
        }

        @Bean
        Tesseract getTesseract(){
            Tesseract tesseract = new Tesseract();
            tesseract.setDatapath("./tessdata");
            return tesseract;
        }
    }
Class OcrController
    import com.example.demoorcserver.dto.OcrResult;
    import com.example.demoorcserver.services.OcrService;
    import net.sourceforge.tess4j.TesseractException;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;
    import org.springframework.web.multipart.MultipartFile;

    import java.io.IOException;

    @RestController
    @RequestMapping("/ocr")
    public class OcrController {

        @Autowired
        private OcrService ocrService;

        @PostMapping("/upload")
        public ResponseEntity<OcrResult> upload(@RequestParam("file") MultipartFile file) throws IOException, TesseractException {
            return ResponseEntity.ok(ocrService.ocr(file));
        }
    }
Class OcrResult
    import lombok.Data;

    @Data
    public class OcrResult {
        private String result;
    }
Class OcrService
    import com.example.demoorcserver.dto.OcrResult;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;
    import org.springframework.web.multipart.MultipartFile;

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    @Service
    public class OcrService {

        @Autowired
        private Tesseract tesseract;

        public OcrResult ocr(MultipartFile file) throws IOException, TesseractException {
            File convFile = convert(file);
            try {
                String text = tesseract.doOCR(convFile);
                OcrResult ocrResult = new OcrResult();
                ocrResult.setResult(text);
                return ocrResult;
            } finally {
                // Delete the temporary copy once OCR has finished.
                Files.deleteIfExists(convFile.toPath());
            }
        }

        // Copies the upload to a temporary file instead of using the
        // client-supplied file name as a path in the working directory.
        public static File convert(MultipartFile file) throws IOException {
            // Keep only the extension of the original name, since Tess4J
            // appears to choose its image reader from the file extension.
            String name = file.getOriginalFilename();
            int dot = (name == null) ? -1 : name.lastIndexOf('.');
            String suffix = (dot >= 0) ? name.substring(dot) : ".tmp";
            File convFile = File.createTempFile("ocr-", suffix);
            // Write the uploaded bytes to the temporary file.
            Files.write(convFile.toPath(), file.getBytes());
            return convFile;
        }
    }
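Because the uploaded file name comes from the client, it can be worth rejecting files that are clearly not images before handing them to Tesseract. The following is a minimal sketch of such a check using only the standard library. The class UploadNames, its methods, and the list of allowed extensions are all assumptions made for illustration, not part of Tess4J or Spring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for validating client-supplied upload names.
class UploadNames {

    // Assumed whitelist for illustration; adjust it to the formats your setup can read.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("png", "jpg", "jpeg", "tif", "tiff", "bmp", "gif"));

    /** Returns the lower-case extension without the dot, or "" if there is none. */
    static String extensionOf(String originalFilename) {
        if (originalFilename == null) {
            return "";
        }
        // Drop any directory part the client may have sent, for both / and \ separators.
        int slash = Math.max(originalFilename.lastIndexOf('/'), originalFilename.lastIndexOf('\\'));
        String name = originalFilename.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            return "";
        }
        return name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Returns true if the file name ends in one of the allowed image extensions. */
    static boolean isAllowed(String originalFilename) {
        return ALLOWED.contains(extensionOf(originalFilename));
    }
}
```

A service could call UploadNames.isAllowed(file.getOriginalFilename()) at the start of ocr and return an error response when it is false. Note that an extension check only filters obvious mistakes; it does not prove the content is a valid image.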
4. Check results
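Our input is an image containing text.
Use Postman to check: send a POST request to /ocr/upload with the image attached as the form-data field named file.
The response contains the recognized text in the result field, so the program works as expected.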
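The same check can be done from Java instead of Postman. The following is a minimal sketch using the java.net.http client (Java 11 or later) that builds a multipart/form-data body by hand and posts it to the /ocr/upload endpoint. It assumes the server runs on Spring Boot's default port 8080 on localhost; the class OcrClient, the method multipartBody, and the file sample.png are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical test client for the OCR endpoint.
class OcrClient {

    /** Builds a multipart/form-data body containing a single file part. */
    static byte[] multipartBody(String boundary, String field, String fileName, byte[] content) {
        String header = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + field + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n\r\n";
        byte[] head = header.getBytes(StandardCharsets.UTF_8);
        byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(content, 0, content.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: sample.png is a local image; pass another path as the first argument.
        Path image = Paths.get(args.length > 0 ? args[0] : "sample.png");
        String boundary = "----OcrDemoBoundary" + System.nanoTime();
        // "file" matches the @RequestParam name used by the controller.
        byte[] body = multipartBody(boundary, "file",
                image.getFileName().toString(), Files.readAllBytes(image));

        // Assumption: the server listens on localhost:8080.
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8080/ocr/upload"))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Print the HTTP status code followed by the response body.
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Building the body by hand keeps the sketch free of extra dependencies; in a real project an HTTP library with multipart support would usually be more convenient.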