How to extract text from PDF/Image files (OCR) - CarlZeng

2025/11/14 20:20:52 / Source: https://www.cnblogs.com/backuper/p/19223283

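This article explains how to self-host and use an OCR service to recognize text in images or PDFs (converting it to plain text for further processing), and sketches a scenario in which NetSuite calls an OCR API.
How to extract text from PDF (image) files. 2025-11-13: added a self-hosted OCR service.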

Deploying the OCR project with Docker

Supports offline use and an API: a self-hosted OCR service.

Docker Compose: create the file with vi docker-compose.yml

version: "3"
services:
  trwebocr:
    image: mmmz/trwebocr:latest
    container_name: trwebocr
    restart: unless-stopped
    ports:
      - "8103:8089"
    environment:
      - LANG=zh_CN.UTF-8
    volumes:
      - ./data:/app/tr_web/data  # persist OCR data

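Once the Docker container is running, the OCR service is available.

Access method 1: through the web page

Access method 2: the text recognition API

Description: endpoint for text detection and recognition

Address: https://ocr.carlzeng.com:3/api/tr-run/

Method: POST

Request parameters: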

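Parameter	Required	Type	Description
file	Either file or img	file	Field for sending the image as a file upload.
img	Either file or img	string	Base64 value of the image, without a prefix.
compress	No	int	If empty, the image's longest side is compressed to 1600px by default. If 0, the image is not compressed. If non-zero, the longest side is compressed to that value.
is_draw	No	int	If 0, no image is returned (data['img_detected'] is absent from the response).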
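
The request parameters above can be sketched in plain JavaScript as follows. This is only an illustration: the function names buildOcrFields and recognize are hypothetical, the recognize wrapper assumes Node 18+ (built-in fetch and FormData), and nothing here is taken from the TrWebOCR source.

```javascript
// Hypothetical helper: build the form fields for the tr-run endpoint.
// `img` must be the bare base64 string, so a data-URL prefix is stripped if present.
function buildOcrFields(imgBase64, options) {
  var opts = options || {};
  var fields = { img: imgBase64.replace(/^data:[^;]+;base64,/, "") };
  // compress: omitted => server default (longest side 1600px); 0 => no compression.
  if (opts.compress !== undefined) fields.compress = String(opts.compress);
  // is_draw: 0 => the response carries no data['img_detected'].
  if (opts.isDraw !== undefined) fields.is_draw = String(opts.isDraw);
  return fields;
}

// Assumed usage with Node 18+ globals (fetch, FormData); defined but not called here.
async function recognize(endpoint, imgBase64, options) {
  var form = new FormData();
  var fields = buildOcrFields(imgBase64, options);
  Object.keys(fields).forEach(function (key) { form.append(key, fields[key]); });
  var res = await fetch(endpoint, { method: "POST", body: form });
  return res.json();
}
```

Setting is_draw to 0 is probably the sensible default when only the text is needed, since it keeps the annotated image out of the response.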

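Response parameters:

Parameter	Required	Type	Description
code	Yes	int	Status code of the recognition result: 200 on success, 400 on error.
msg	Yes	string	Text message for the recognition result.
data	No	dict	Recognition result; absent if recognition failed.
data['img_detected']	Yes	string	Base64 value of the image with the text regions drawn on it.
data['raw_out']	Yes	list	Recognition output.
data['speed_time']	Yes	float	Time taken by recognition.

Example response: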

{"code": 200,"msg": "\u6210\u529f", "data": {"img_detected": "data:image/jpeg;base64,/9j/4AAQSkZJR5t...","raw_out": [[[11, 13, 402, 36], "\u753b\u51fa\u6587\u5b57\u533a\u57df\u7684\u56fe\u7247base64\u503c", 0.9999545514583588], [[11, 112, 215, 36], "\u8bc6\u522b\u7ed3\u679c\u7684\u8f93\u51fa", 0.999962397984096], [[11, 171, 158, 36], "\u8bc6\u522b\u7684\u8017\u65f6", 0.999971580505371]], "speed_time": 0.67}}
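
A minimal sketch of reading such a response, assuming each raw_out entry has the [box, text, confidence] shape seen in the example. The function name extractText and the confidence threshold are illustrative additions, not part of the API.

```javascript
// Collect the recognized text from a response shaped like the example above.
// Each raw_out entry is assumed to be [box, text, confidence].
function extractText(ocrResponse, minConfidence) {
  if (!ocrResponse || ocrResponse.code !== 200 || !ocrResponse.data) {
    throw new Error("OCR failed: " + (ocrResponse && ocrResponse.msg));
  }
  var threshold = minConfidence === undefined ? 0 : minConfidence;
  return ocrResponse.data.raw_out
    .filter(function (item) { return item[2] >= threshold; })
    .map(function (item) { return item[1]; });
}

// Trimmed copy of the example response (img_detected omitted).
var sample = {
  code: 200,
  msg: "\u6210\u529f",
  data: {
    raw_out: [
      [[11, 112, 215, 36], "\u8bc6\u522b\u7ed3\u679c\u7684\u8f93\u51fa", 0.999962397984096],
      [[11, 171, 158, 36], "\u8bc6\u522b\u7684\u8017\u65f6", 0.999971580505371]
    ],
    speed_time: 0.67
  }
};
console.log(extractText(sample, 0.9).join("\n"));
```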

Inspiration for this section:

https://github.com/alisen39/TrWebOCR

https://post.smzdm.com/p/agwev0l6/

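Calling OCR from the NetSuite API

Background: the example below uses SuiteScript 1.0 (SS1.0) because it came from a NetSuite email plugin; the same approach applies in SS2.0.

The code below uses a third-party service; replacing the address with the self-hosted platform above provides OCR recognition as well.

1. Register an API key through https://ocr.space/OCRAPI

The Free Plan has limitations.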

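2. Save the attachment to the File Cabinet, make it available online, and build its URL.

var importFile = attachments[indexAtt];
importFile.setIsOnline(true); // "Available Without Login", so the OCR service can fetch the file
var intFileId = nlapiSubmitFile(importFile);
// objInvoiceFileRec is not defined in the original excerpt; loading the saved file is assumed here.
var objInvoiceFileRec = nlapiLoadFile(intFileId);
var strInvFileUrl = "https://" + nlapiGetContext().getCompany() + ".app.netsuite.com" + objInvoiceFileRec.getURL();
strInvFileUrl = encodeURIComponent(strInvFileUrl);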

3. Send a request to https://api.ocr.space/parse/imageurl?apikey=abcAPIKEYabc&filetype=PDF&isTable=true&url= (append the encoded file URL, strInvFileUrl, after url=)

// strReqUrl: the URL from step 3 with strInvFileUrl appended; a: request headers (neither is shown in this excerpt)
var response = nlapiRequestURL(strReqUrl, null, a);

This API accepts a variety of parameters. In my case the invoice is formatted as a table, so I send isTable=true; this helps locate the expected cells and their values.
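
A small sketch of assembling that request URL from its parts, using only the endpoint and query parameters shown in step 3. The helper name buildOcrSpaceUrl is hypothetical. It expects the raw (not yet encoded) file URL and encodes every value once, so it should not be fed a URL that has already been passed through encodeURIComponent.

```javascript
// Hypothetical helper: assemble the OCR.space GET URL from its query parameters.
function buildOcrSpaceUrl(apiKey, fileUrl, options) {
  var opts = options || {};
  var params = {
    apikey: apiKey,
    filetype: opts.filetype || "PDF",
    isTable: opts.isTable ? "true" : "false",
    url: fileUrl // raw URL; encoded exactly once below
  };
  var query = Object.keys(params)
    .map(function (key) { return key + "=" + encodeURIComponent(params[key]); })
    .join("&");
  return "https://api.ocr.space/parse/imageurl?" + query;
}

// Placeholder key and file URL, for illustration only.
console.log(buildOcrSpaceUrl("abcAPIKEYabc", "https://example.com/invoice.pdf", { isTable: true }));
```

The helper uses only ES5 features, so it should also work inside a SuiteScript 1.0 script.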

4. Parse the response to get the text found in the PDF or image.

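// objOcrRes is not defined in the original excerpt; parsing the response body is assumed here.
var objOcrRes = JSON.parse(response.getBody());
var arrParsedLines = (objOcrRes['ParsedResults'] && objOcrRes['ParsedResults'][0]) ? objOcrRes['ParsedResults'][0]['TextOverlay']['Lines'] : null;
var objVndBillData = parseDataFromInvPdf(arrParsedLines);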
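
The helper parseDataFromInvPdf is not shown in the original. As one possible first step for such a helper, the sketch below groups overlay lines into table rows by their vertical position. It assumes each line carries a MinTop value and a Words array with WordText and Left, which is my understanding of the OCR.space overlay format; the function name groupLinesIntoRows, the tolerance value, and the sample data are all made up for illustration.

```javascript
// Hypothetical sketch: group overlay lines into table rows by vertical position.
// Assumes each line looks like { MinTop: number, Words: [{ WordText, Left }, ...] }.
function groupLinesIntoRows(lines, tolerance) {
  var tol = tolerance === undefined ? 5 : tolerance;
  var cells = (lines || []).map(function (line) {
    return {
      top: line.MinTop,
      left: line.Words.length ? line.Words[0].Left : 0,
      text: line.Words.map(function (w) { return w.WordText; }).join(" ")
    };
  });
  cells.sort(function (a, b) { return a.top - b.top; });

  var rows = [];
  cells.forEach(function (cell) {
    var last = rows[rows.length - 1];
    // A cell joins the current row when it sits within `tol` pixels of the row's top.
    if (last && Math.abs(cell.top - last.top) <= tol) {
      last.cells.push(cell);
    } else {
      rows.push({ top: cell.top, cells: [cell] });
    }
  });

  // Within each row, order the cells left to right and keep only their text.
  return rows.map(function (row) {
    return row.cells
      .sort(function (a, b) { return a.left - b.left; })
      .map(function (c) { return c.text; });
  });
}

// Made-up sample lines, not real OCR output.
var sampleLines = [
  { MinTop: 100, Words: [{ WordText: "Total", Left: 50 }] },
  { MinTop: 102, Words: [{ WordText: "123.45", Left: 300 }] },
  { MinTop: 60, Words: [{ WordText: "Invoice", Left: 50 }, { WordText: "No", Left: 90 }] },
  { MinTop: 61, Words: [{ WordText: "INV-1", Left: 300 }] }
];
console.log(JSON.stringify(groupLinesIntoRows(sampleLines)));
```

With the rows in hand, a label cell such as "Total" can be matched and the value read from the next cell in the same row.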
