
Nougat: combining optical neural networks to parse academic PDF documents intelligently and unlock the value of research-paper PDFs

news 2026/2/9 17:06:10


This is the official repository for Nougat, an academic document PDF parser that understands LaTeX math and tables.

Project page: https://facebookresearch.github.io/nougat/

1. Installation

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note, on Windows: If you want to utilize a GPU, make sure you first install the correct PyTorch version. Follow instructions here

There are extra dependencies if you want to call the model from an API or generate a dataset.
Install them via

pip install "nougat-ocr[api]" or pip install "nougat-ocr[dataset]"

1.2 Get predictions for a PDF

1.2.1 CLI

To get predictions for a PDF run

$ nougat path/to/file.pdf -o output_directory

A path to a directory, or to a file in which each line is a path to a PDF, can also be passed as a positional argument

$ nougat path/to/directory -o output_directory

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
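
The page syntax accepted by --pages ('1-4,7' for pages 1 through 4 and page 7) can be illustrated with a small sketch. This is not Nougat's own parser: the function name parse_page_spec and its error handling are assumptions made for illustration only.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page spec such as '1-4,7' into an explicit list of page numbers."""
    pages: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A range such as '1-4' includes both boundaries.
            start_text, stop_text = part.split("-", 1)
            start, stop = int(start_text), int(stop_text)
            if start > stop:
                raise ValueError(f"descending range: {part!r}")
            pages.extend(range(start, stop + 1))
        else:
            pages.append(int(part))
    return pages


print(parse_page_spec("1-4,7"))  # [1, 2, 3, 4, 7]
```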

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
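
When the CLI is driven from a script, the same invocation can be assembled as an argument list. The sketch below only uses flags listed in the help text above (-o, -m, -p, --no-skipping); the helper name build_nougat_command is hypothetical and not part of Nougat.

```python
import shlex
from typing import List, Optional


def build_nougat_command(
    pdf: str,
    out_dir: str,
    model: Optional[str] = None,
    pages: Optional[str] = None,
    no_skipping: bool = False,
) -> List[str]:
    """Assemble a nougat CLI call as an argument list (hypothetical helper)."""
    cmd = ["nougat", pdf, "-o", out_dir]
    if model:
        cmd += ["-m", model]  # e.g. 0.1.0-base
    if pages:
        cmd += ["-p", pages]  # e.g. '1-4,7', single PDFs only
    if no_skipping:
        cmd.append("--no-skipping")
    return cmd


cmd = build_nougat_command("path/to/file.pdf", "output_directory", model="0.1.0-base")
print(shlex.join(cmd))  # nougat path/to/file.pdf -o output_directory -m 0.1.0-base

# To actually run it (requires nougat-ocr to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell quoting problems when paths contain spaces.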

In the output directory every PDF will be saved as a .mmd file, a lightweight markup language that is mostly compatible with Mathpix Markdown (we make use of the LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67
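
To judge whether a run is affected, one option is to count the marker in the generated .mmd files. The following is a minimal sketch: it assumes the marker always begins with the text "[MISSING_PAGE" as quoted above, and the helper name count_missing_pages is hypothetical.

```python
from pathlib import Path
from typing import Dict

MARKER = "[MISSING_PAGE"  # assumed common prefix of the marker quoted above


def count_missing_pages(output_dir: str) -> Dict[str, int]:
    """Map each .mmd file name in output_dir to its number of marker occurrences."""
    counts: Dict[str, int] = {}
    for mmd_file in sorted(Path(output_dir).glob("*.mmd")):
        text = mmd_file.read_text(encoding="utf-8")
        counts[mmd_file.name] = text.count(MARKER)
    return counts
```

If most files show a high count, rerunning with --no-skipping as described in the note is the first thing to try.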

1.2.2 API

With the extra dependencies installed, you can use app.py to start an API. Call

$ nougat_api

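Send a POST request to http://127.0.0.1:8503/predict/ to get a prediction for a PDF file. The endpoint also accepts the parameters "start" and "stop" to limit the computation to selected page numbers (boundaries included).

The response is a string containing the markup text of the document.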

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
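
The same request can be sketched in Python. The URL, the "file" form field, and the start/stop parameters come from the curl example above. The use of the third-party requests package, the helper names, and the timeout value are assumptions. Because the curl call asks for application/json, the returned string may be JSON-quoted, in which case response.json() would decode it.

```python
import os
from typing import Optional
from urllib.parse import urlencode


def build_predict_url(
    base: str = "http://127.0.0.1:8503",
    start: Optional[int] = None,
    stop: Optional[int] = None,
) -> str:
    """Build the /predict/ URL, adding start/stop only when they are given."""
    params = {}
    if start is not None:
        params["start"] = start
    if stop is not None:
        params["stop"] = stop
    url = base.rstrip("/") + "/predict/"
    return url + ("?" + urlencode(params) if params else "")


def predict(pdf_path: str, start: Optional[int] = None, stop: Optional[int] = None) -> str:
    """POST a PDF to a running nougat_api server and return the raw response body."""
    import requests  # third-party (assumption); imported lazily so the URL helper has no dependency

    with open(pdf_path, "rb") as handle:
        files = {"file": (os.path.basename(pdf_path), handle, "application/pdf")}
        response = requests.post(
            build_predict_url(start=start, stop=stop),
            files=files,
            headers={"accept": "application/json"},
            timeout=600,  # arbitrary illustrative value
        )
    response.raise_for_status()
    return response.text


print(build_predict_url(start=1, stop=5))  # http://127.0.0.1:8503/predict/?start=1&stop=5
```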

2. Dataset

2.1 Generate a dataset

To generate a dataset you need

A directory containing the PDFs
A directory containing the .html files (processed .tex files by LaTeXML) with the same folder structure
A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"
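
Since the two directories must share the same folder structure, it can help to check that they line up before starting. The sketch below assumes that each PDF has an .html file with the same relative path and file stem; that naming rule and the helper name pair_pdfs_with_html are illustrative assumptions, not part of Nougat.

```python
from pathlib import Path
from typing import List, Tuple


def pair_pdfs_with_html(
    pdf_root: str, html_root: str
) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Return (pdf, html) pairs found in both trees, plus PDFs with no matching html."""
    pdf_base, html_base = Path(pdf_root), Path(html_root)
    paired: List[Tuple[Path, Path]] = []
    missing: List[Path] = []
    for pdf in sorted(pdf_base.rglob("*.pdf")):
        # Mirror the PDF's relative path under the html root (assumed convention).
        html = html_base / pdf.relative_to(pdf_base).with_suffix(".html")
        if html.is_file():
            paired.append((pdf, html))
        else:
            missing.append(pdf)
    return paired, missing
```

Calling pair_pdfs_with_html("path/pdf/root", "path/html/root") and inspecting the second return value shows which papers would have no HTML counterpart.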

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Argument	Description
`--recompute`	recompute all splits
`--markdown MARKDOWN`	Markdown output dir
`--workers WORKERS`	How many processes to use
`--dpi DPI`	What resolution the pages will be saved at
`--timeout TIMEOUT`	max time per paper in seconds
`--tesseract`	Tesseract OCR prediction for each page

Finally create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl
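
The idea behind a seek map can be shown with a short sketch: store the byte offset of every line once, then jump straight to a record instead of scanning the file. This illustrates the general technique only. The file format written by nougat.dataset.gen_seek is not described here, and both function names are hypothetical.

```python
import json
from typing import Any, List


def build_line_offsets(jsonl_path: str) -> List[int]:
    """Record the byte offset at which each non-empty line of a jsonl file starts."""
    offsets: List[int] = []
    with open(jsonl_path, "rb") as handle:
        while True:
            position = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.strip():
                offsets.append(position)
    return offsets


def read_record(jsonl_path: str, offsets: List[int], index: int) -> Any:
    """Load the record at the given index by seeking directly to its offset."""
    with open(jsonl_path, "rb") as handle:
        handle.seek(offsets[index])
        return json.loads(handle.readline().decode("utf-8"))
```

With the offsets in memory, random access to any sample costs one seek and one line read, which is what makes shuffled data loading from a large jsonl file practical.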

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in the path/paired/output (here images) are no longer required.
Removing them halves the number of files, which can be useful when pushing to an S3 bucket.

2.2 Training

To train or fine tune a Nougat model, run

python train.py --config config/train_nougat.yaml

2.3 Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

2.4 FAQ

Why am I only getting [MISSING_PAGE]?

Nougat was trained on scientific papers found on arXiv and PMC. Is the document you’re processing similar to that?
What language is the document in? Nougat works best with English papers, other Latin-based languages might work. Chinese, Russian, Japanese etc. will not work.
If these requirements are fulfilled it might be because of false positives in the failure detection, when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.
Where can I download the model checkpoint from?

They are uploaded here on GitHub in the release section. You can also download them during the first execution of the program. Choose the preferred model by passing --model 0.1.0-{base,small}

Reference:
https://github.com/facebookresearch/nougat
