
Using Tesseract-OCR for text recognition on PDFs and other image files

news 2026/5/19 14:32:11

Installation

1. Install Tesseract with Homebrew:

brew install tesseract

2. After tesseract is installed, the Chinese language data also has to be installed. This takes the following two steps. First, run:

brew info tesseract

The purpose of this command is to find the folder where Homebrew installed tesseract, for example:

/usr/local/Cellar/tesseract/3.05.02/share/tessdata/.

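Next, open the Tesseract language data page and click "chi_sim.traineddata"; the Simplified Chinese data file downloads automatically. Alternatively, clone one of the data repositories: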

git clone https://github.com/tesseract-ocr/tessdata_fast.git

git clone https://github.com/tesseract-ocr/tessdata_best.git  # higher-accuracy models

GitHub - tesseract-ocr/tessdata_best: Best (most accurate) trained LSTM models.

Finally, copy the Simplified Chinese data file chi_sim.traineddata into the tessdata folder of the tesseract installation.
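Once the file has been copied, it is worth confirming that it landed in the right folder. The following is a minimal sketch of such a check in Java, not part of the original article: the class name `TessdataCheck` and its methods are hypothetical, and the default path is just the example path quoted above, so replace it with the one reported by `brew info tesseract`.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TessdataCheck {
    /** Returns the path where the data file for the given language is expected. */
    public static Path traineddataPath(Path tessdataDir, String lang) {
        return tessdataDir.resolve(lang + ".traineddata");
    }

    /** Returns true if <tessdataDir>/<lang>.traineddata exists as a regular file. */
    public static boolean hasLanguage(Path tessdataDir, String lang) {
        return Files.isRegularFile(traineddataPath(tessdataDir, lang));
    }

    public static void main(String[] args) {
        // Assumption: the example folder from the text; pass your own folder as args[0].
        Path tessdata = Paths.get(args.length > 0
                ? args[0]
                : "/usr/local/Cellar/tesseract/3.05.02/share/tessdata");
        System.out.println("chi_sim installed: " + hasLanguage(tessdata, "chi_sim"));
    }
}
```

Running `tesseract --list-langs` from the command line is another way to see which languages the installation can find.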

Command-line usage

First, check that tesseract is installed correctly and verify its version:

$ tesseract --version
tesseract 4.1.0-rc1-56-g7fbd
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.1.2
 Found AVX2
 Found AVX
 Found SSE

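The basic usage for recognition is "imagename outputbase [options…]". In version 4.1 the basic help lists only "-l", which selects the language, among the options. For example: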

tesseract test.png test -l chi_sim

This runs OCR on test.png and saves the result to test.txt. The default output format is a text file, but it can also produce a PDF:

tesseract test.png test -l chi_sim pdf

Beyond this, there are hidden ("extra") options. These advanced features are only shown by the following command:

$ tesseract --help-extra
Usage:
  tesseract --help | --help-extra | --help-psm | --help-oem | --version
  tesseract --list-langs [--tessdata-dir PATH]
  tesseract --print-parameters [options...] [configfile...]
  tesseract imagename|imagelist|stdin outputbase|stdout [options...] [configfile...]

OCR options:
  --tessdata-dir PATH   Specify the location of tessdata path.
  --user-words PATH     Specify the location of user words file.
  --user-patterns PATH  Specify the location of user patterns file.
  --dpi VALUE           Specify DPI for input image.
  -l LANG[+LANG]        Specify language(s) used for OCR.
  -c VAR=VALUE          Set value for config variables.
                        Multiple -c arguments are allowed.
  --psm NUM             Specify page segmentation mode.
  --oem NUM             Specify OCR Engine mode.
NOTE: These options must occur before any configfile.
...

The detailed explanations of psm and oem are omitted from this output and are covered below.

Take psm as an example. Many older documents write:

tesseract test.png test -l chi_sim -psm 1

This fails in newer versions, which require --psm:

tesseract test.png test -l chi_sim --psm 1

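The --oem parameter selects the OCR engine: 0 is the legacy engine, 1 is the LSTM engine, 2 combines the two, and 3 lets the system choose.

The --psm parameter selects the page segmentation mode: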

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.

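The default is 3: fully automatic page segmentation, but without detecting orientation or script. (A script is not quite the same as a written language: Russian and Ukrainian use the same script, and the Chinese and Japanese scripts partly overlap.) To recognize a single line of text, specify 7. For the OSD algorithm, see here. In this case we already know that the text is Chinese and that it is laid out horizontally (left to right, then top to bottom; classical Chinese was written top to bottom, then right to left), so the default of 3 is sufficient.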
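When the command line is driven from a program, the options described above can be assembled programmatically. The sketch below is illustrative only: the class `TesseractCommand` and its `build` method are hypothetical names, and the value ranges are taken from the help listings quoted above. It places the options before the config file name, as the NOTE in the help text requires, and only launches the binary when started with the argument `run` (which assumes `tesseract` is on the PATH).

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractCommand {
    /**
     * Builds the argument list for:
     *   tesseract imagename outputbase [options...] [configfile...]
     * configFile may be null; "pdf" asks for outputBase.pdf instead of outputBase.txt.
     */
    public static List<String> build(String image, String outputBase, String lang,
                                     int psm, int oem, String configFile) {
        if (psm < 0 || psm > 13) {
            throw new IllegalArgumentException("psm must be between 0 and 13");
        }
        if (oem < 0 || oem > 3) {
            throw new IllegalArgumentException("oem must be between 0 and 3");
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add("--oem");
        cmd.add(Integer.toString(oem));
        if (configFile != null) {
            cmd.add(configFile); // config files come after all options
        }
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Single text line (psm 7) with the LSTM engine (oem 1), text output.
        List<String> cmd = build("test.png", "test", "chi_sim", 7, 1, null);
        System.out.println(String.join(" ", cmd));
        if (args.length > 0 && args[0].equals("run")) {
            // inheritIO forwards tesseract's own messages to this console.
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println("tesseract exited with " + exit);
        }
    }
}
```

Passing the arguments as a list to `ProcessBuilder` avoids shell quoting problems with file names that contain spaces.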

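Java interface

The Java interface uses javacpp-presets, a project well worth a Java programmer's attention. It lets Java developers call many popular C++ libraries, including OpenCV, FFmpeg, OpenBLAS, CPython, LLVM, CUDA, MXNet, TensorFlow, and others. It also covers Leptonica and Tesseract, which are used here.

Dependencies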

<dependency>
    <groupId>org.bytedeco.javacpp-presets</groupId>
    <artifactId>tesseract-platform</artifactId>
    <version>4.0.0-1.4.4</version>
</dependency>

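Only the basic C++ usage and the line-by-line output are implemented in Java here. The other examples can be ported the same way, by turning the C++ code into equivalent Java. Code written against javacpp-presets looks almost identical to the C++ original.

Basic example

The complete code is here.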

// Imports for the javacpp-presets 1.4.x package layout
import org.bytedeco.javacpp.*;
import static org.bytedeco.javacpp.lept.*;
import static org.bytedeco.javacpp.tesseract.*;

BytePointer outText;
TessBaseAPI api = new TessBaseAPI();
// Initialize tesseract-ocr with English, without specifying tessdata path
if (api.Init(null, "eng") != 0) {
    System.err.println("Could not initialize tesseract.");
    System.exit(1);
}

// Open input image with leptonica library
PIX image = pixRead(args.length > 0 ? args[0] : "testen-1.png");
api.SetImage(image);
// Get OCR result
outText = api.GetUTF8Text();
System.out.println("OCR output:\n" + outText.getString());

// Destroy used object and release memory
api.End();
api.close();
outText.deallocate();
pixDestroy(image);

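The code above looks almost the same as the C++ version. Because C++ has no GC, the cleanup calls at the end are needed to destroy the native objects. To recognize Chinese, change the second argument of Init: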

if (api.Init(null, "chi_sim") != 0) {

Running it directly, however, produces the following error:

Error opening data file /home/travis/build/javacpp-presets/tesseract/cppbuild/linux-x86_64/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

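In other words, by default it looks for the models under "/home/travis/build/…". That is a Travis CI path, which of course does not exist on our machine.

There are two ways to solve this. The first is to set an environment variable when running the program:

# Change this to your own path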
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
java -cp .....

The other is to pass the path when calling Init:

if (api.Init("/usr/share/tesseract-ocr/4.00/tessdata", "eng") != 0) {
    System.err.println("Could not initialize tesseract.");
    System.exit(1);
}
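The two approaches can also be combined, so that a program prefers a path given explicitly and otherwise falls back to the environment variable. The following is a minimal sketch under that assumption; `TessdataLocator` and `resolve` are hypothetical names, not part of javacpp-presets or Tesseract. A `null` result means no path is known, in which case the value passed to Init is `null`, as in the basic example.

```java
import java.util.Map;

public class TessdataLocator {
    /**
     * Chooses the tessdata path to hand to Init:
     * an explicit path wins, then TESSDATA_PREFIX, otherwise null.
     */
    public static String resolve(String explicitPath, Map<String, String> env) {
        if (explicitPath != null && !explicitPath.isEmpty()) {
            return explicitPath;
        }
        String fromEnv = env.get("TESSDATA_PREFIX");
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return null; // leave the lookup to Tesseract's own default
    }

    public static void main(String[] args) {
        String dataPath = resolve(args.length > 0 ? args[0] : null, System.getenv());
        // The result would be used as the first argument, e.g. api.Init(dataPath, "eng").
        System.out.println("tessdata path for Init: " + dataPath);
    }
}
```

Taking the environment as a `Map` parameter rather than reading `System.getenv()` inside the method keeps the lookup order easy to test.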

Line-by-line output

The complete code is here.

BOXA boxes = api.GetComponentImages(tesseract.RIL_TEXTLINE, true, (PointerPointer) null, null);
System.out.print(String.format("Found %d textline image components.\n", boxes.n()));
for (int i = 0; i < boxes.n(); i++) {
    BOX box = boxes.box(i);
    api.SetRectangle(box.x(), box.y(), box.w(), box.h());
    BytePointer text = api.GetUTF8Text();
    int conf = api.MeanTextConf();
    System.out.println(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, box.x(), box.y(), box.w(), box.h(), conf, text.getString()));
    text.deallocate();
}

Another option is to use Tess4J:

<dependency>
<groupId>net.sourceforge.tess4j</groupId>
<artifactId>tess4j</artifactId>
<version>4.4.1</version>

</dependency>

Original link: https://blog.csdn.net/qq_39522120/article/details/135503159
