
Common Lucene Field Types and How Lucene Scores Search Results

news 2026/3/30 8:12:04

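In Apache Lucene, the Field classes are the basis for storing data in a document. Different Field types hold different kinds of data (text, numbers, binary data, and so on). The commonly used Field types and their underlying storage structures are described below.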

TextField:
- Purpose: holds text that is tokenized and indexed.
- Underlying storage: an Analyzer splits the text into terms. Each term is recorded in the inverted index, which maps it to the documents that contain it.
- Example:
```
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;

Document doc = new Document();
// Store.YES also keeps the original text so it can be returned with search hits.
doc.add(new TextField("fieldName", "This is a sample text.", Store.YES));
```

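StringField:
- Purpose: holds string values that should not be tokenized, such as unique identifiers (IDs).
- Underlying storage: the whole string is indexed as a single term in the inverted index, with no tokenization, so it matches only on the exact value.
- Example:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;

Document doc = new Document();
doc.add(new StringField("fieldName", "unique_identifier", Store.YES));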
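
The difference between TextField and StringField can be illustrated without Lucene itself. The following is a minimal sketch, not Lucene's actual index format: `ToyInvertedIndex`, `addAnalyzed`, and `addVerbatim` are hypothetical names, and a lowercase split on non-alphanumeric characters stands in for a real Analyzer. It shows that analyzed text can be found by its individual words, while an unanalyzed value can be found only by the exact string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index for one field: term -> list of document ids (illustration only). */
final class ToyInvertedIndex {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** TextField-like: split into lowercase word tokens and index each token. */
    void addAnalyzed(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                addTerm(docId, token);
            }
        }
    }

    /** StringField-like: index the whole value as one term, unchanged. */
    void addVerbatim(int docId, String value) {
        addTerm(docId, value);
    }

    private void addTerm(int docId, String term) {
        List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
        // Assumes documents are added in increasing id order, so a repeated
        // term within one document only needs a check against the last entry.
        if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
            docs.add(docId);
        }
    }

    /** Returns the ids of the documents that contain exactly this term. */
    List<Integer> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        ToyInvertedIndex title = new ToyInvertedIndex(); // analyzed field
        ToyInvertedIndex id = new ToyInvertedIndex();    // verbatim field

        title.addAnalyzed(0, "Lucene in Action");
        title.addAnalyzed(1, "Lucene for Dummies");
        id.addVerbatim(0, "unique_identifier");

        System.out.println(title.lookup("lucene"));           // [0, 1]
        System.out.println(title.lookup("Lucene in Action")); // []
        System.out.println(id.lookup("unique_identifier"));   // [0]
        System.out.println(id.lookup("unique"));              // []
    }
}
```

One index instance per field is used here because, in this sketch, terms of different fields should not be mixed in the same term dictionary.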

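IntPoint, LongPoint, FloatPoint, DoublePoint:
- Purpose: index numeric values and support range queries.
- Underlying storage: each value is converted to a byte array and stored in blocks (a BKD tree), which supports efficient range queries. A point field only indexes the value; add a StoredField as well if the original value must be returned with search results.
- Example:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;

Document doc = new Document();
int value = 123;
doc.add(new IntPoint("fieldName", value));
doc.add(new StoredField("fieldName", value)); // only needed if the original value must be retrievable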
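
To see why a byte-array encoding helps range queries, consider this minimal sketch of an order-preserving encoding for `int` values. It assumes a big-endian layout with the sign bit flipped, so that comparing the bytes as unsigned values gives the same order as comparing the numbers. `SortableIntBytes` and its methods are hypothetical names for illustration, not Lucene classes, and a real index additionally groups the encoded values into blocks.

```java
import java.util.Arrays;

/** Sketch of an order-preserving int -> byte[] encoding (illustration only). */
final class SortableIntBytes {

    /** Encodes an int as 4 big-endian bytes with the sign bit flipped. */
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000; // negative numbers now sort before positive ones
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    /** Reverses encode(). */
    static int decode(byte[] bytes) {
        int flipped = ((bytes[0] & 0xFF) << 24)
                | ((bytes[1] & 0xFF) << 16)
                | ((bytes[2] & 0xFF) << 8)
                | (bytes[3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    /** Inclusive range check that looks only at the encoded bytes. */
    static boolean inRange(byte[] value, byte[] lower, byte[] upper) {
        return Arrays.compareUnsigned(lower, value) <= 0
                && Arrays.compareUnsigned(value, upper) <= 0;
    }

    public static void main(String[] args) {
        byte[] encoded = encode(123);
        System.out.println(decode(encoded));                              // 123
        System.out.println(inRange(encoded, encode(100), encode(200)));   // true
        System.out.println(inRange(encode(-5), encode(100), encode(200))); // false
    }
}
```

Because the byte order matches the numeric order, a block of values can be accepted or skipped as a whole by comparing its smallest and largest encoded values against the query bounds.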

StoredField:
- Purpose: holds values that are not indexed and are only returned with search results.
- Underlying storage: the value is written to the document's stored fields and is not indexed, so it cannot be searched.
- Example:
```
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;

Document doc = new Document();
doc.add(new StoredField("fieldName", "This is the stored content."));
```

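Binary data (StoredField):
- Purpose: stores binary data. The example uses StoredField with a BytesRef value rather than a dedicated binary field class.
- Underlying storage: the bytes are written to the document's stored fields and are not indexed.
- Example:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
byte[] byteArray = new byte[] {1, 2, 3, 4, 5};
doc.add(new StoredField("fieldName", new BytesRef(byteArray)));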

SortedDocValuesField and NumericDocValuesField:
- Purpose: hold the per-document field values needed for sorting and for scoring.
- Underlying storage: values are kept in a compact, column-oriented format (doc values) that supports efficient sorting and score computation.
- Example:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
doc.add(new SortedDocValuesField("fieldName", new BytesRef("sortable value")));
doc.add(new NumericDocValuesField("numericField", 12345L));
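
One way to picture the compact format for sorted values is a sorted dictionary of unique values plus one small integer (an ordinal) per document. The sketch below follows that idea under simplifying assumptions: every document has exactly one value, and values are compared as Java strings rather than as bytes. `ToySortedDocValues` and its methods are hypothetical names, not Lucene's classes.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeSet;
import java.util.stream.IntStream;

/** Toy column of sorted doc values: sorted dictionary + one ordinal per document. */
final class ToySortedDocValues {
    private final String[] dictionary; // unique values in sorted order
    private final int[] ordByDoc;      // docId -> position of its value in the dictionary

    ToySortedDocValues(String[] valueByDoc) {
        dictionary = new TreeSet<>(Arrays.asList(valueByDoc)).toArray(new String[0]);
        ordByDoc = new int[valueByDoc.length];
        for (int doc = 0; doc < valueByDoc.length; doc++) {
            ordByDoc[doc] = Arrays.binarySearch(dictionary, valueByDoc[doc]);
        }
    }

    /** Ordinal of the document's value; smaller ordinal means smaller value. */
    int ord(int docId) {
        return ordByDoc[docId];
    }

    /** Resolves an ordinal back to the actual value. */
    String lookupOrd(int ord) {
        return dictionary[ord];
    }

    /** Document ids ordered by value, comparing integers only. */
    int[] docsSortedByValue() {
        return IntStream.range(0, ordByDoc.length)
                .boxed()
                .sorted(Comparator.comparingInt(doc -> ordByDoc[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        ToySortedDocValues column =
                new ToySortedDocValues(new String[] {"pear", "apple", "pear", "fig"});
        System.out.println(column.ord(0));                               // 2
        System.out.println(column.lookupOrd(0));                         // apple
        System.out.println(Arrays.toString(column.docsSortedByValue())); // [1, 3, 0, 2]
    }
}
```

Sorting then touches only the integer column, and repeated values cost one dictionary entry no matter how many documents share them.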

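How Lucene scores search results

In Apache Lucene, scoring means assigning each document a relevance score during a search, based on how well the document matches the query. The higher the score, the more relevant the document. Scores determine the order of the search results, that is, which documents are shown to the user first.

Basic scoring concepts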

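Relevance score:
- Every document in the search results has a relevance score; the higher the value, the better the document matches the query.
- The score is a floating-point number. It is not normalized to the range 0 to 1 and can exceed 1, so scores are only meaningful relative to other hits of the same query.
TF-IDF model:
- Lucene's classic scoring (ClassicSimilarity) computes relevance with the TF-IDF (Term Frequency-Inverse Document Frequency) model.
- TF (term frequency): how often a term occurs in a document. The higher the frequency, the more important the term is to that document.
- IDF (inverse document frequency): derived from the number of documents that contain the term. The fewer documents contain it, the more the term helps to tell documents apart and the higher its IDF.
BM25 algorithm:
- BM25 is Lucene's default scoring algorithm (since Lucene 6.0). It evolved from TF-IDF: the contribution of term frequency saturates instead of growing without bound, and matches are normalized by document length.
- BM25 takes term frequency, inverse document frequency, and document length relative to the average length into account.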
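
The factors listed above can be made concrete with a small calculation. This is a minimal sketch of a BM25-style score for one term in one document, assuming the commonly used parameters k1 = 1.2 and b = 0.75. `Bm25Sketch` is a hypothetical class, not Lucene's implementation; textbook BM25 also multiplies the term-frequency part by the constant (k1 + 1), which is left out here because it does not change the ranking.

```java
/** Sketch of a BM25-style score for a single query term (illustration only). */
final class Bm25Sketch {
    static final double K1 = 1.2;  // controls how quickly term frequency saturates
    static final double B = 0.75;  // controls how strongly document length is normalized

    /** Rarer terms (small docFreq relative to docCount) get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Grows with freq but stays below 1; longer-than-average documents are penalized. */
    static double tfPart(double freq, double docLength, double avgDocLength) {
        double lengthNorm = 1.0 - B + B * docLength / avgDocLength;
        return freq / (freq + K1 * lengthNorm);
    }

    static double score(long docCount, long docFreq,
                        double freq, double docLength, double avgDocLength) {
        return idf(docCount, docFreq) * tfPart(freq, docLength, avgDocLength);
    }

    public static void main(String[] args) {
        // 4 documents in total, the term occurs in 2 of them: idf = ln(2).
        System.out.println(idf(4, 2));
        // Same term frequency, short document versus long document.
        System.out.println(score(4, 2, 1, 3, 4));
        System.out.println(score(4, 2, 1, 8, 4));
    }
}
```

With these definitions, doubling the term frequency less than doubles the score, and the same match in a shorter document scores higher than in a longer one.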

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneScoringExample {

    public static void main(String[] args) throws Exception {
        // Create the analyzer
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Create an in-memory index. The original example used RAMDirectory, which
        // is deprecated in newer Lucene versions; ByteBuffersDirectory replaces it.
        Directory index = new ByteBuffersDirectory();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        IndexWriter writer = new IndexWriter(index, config);

        // Add documents
        addDoc(writer, "Lucene in Action", "193398817");
        addDoc(writer, "Lucene for Dummies", "55320055Z");
        addDoc(writer, "Managing Gigabytes", "55063554A");
        addDoc(writer, "The Art of Computer Science", "9900333X");
        writer.close();

        // Create and parse the query against the "title" field
        String querystr = "Lucene";
        Query query = new QueryParser("title", analyzer).parse(querystr);

        // Search for the top hits
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
        TopDocs docs = searcher.search(query, hitsPerPage);
        ScoreDoc[] hits = docs.scoreDocs;

        // Print each hit together with its relevance score
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title") + "\t" + hits[i].score);
        }
    }

    private static void addDoc(IndexWriter w, String title, String isbn) throws Exception {
        Document doc = new Document();
        // The title is tokenized for full-text search; the ISBN is indexed as a single term.
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new StringField("isbn", isbn, Field.Store.YES));
        w.addDocument(doc);
    }
}

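In Apache Lucene, scoring is a dynamic computation: relevance scores are not stored in the index ahead of time but are calculated at search time from the query and the documents. A score is therefore a temporary value and is never persisted in the index.

Computed dynamically:
- When a query runs, Lucene computes a relevance score for each matching document from the query and the document's content.
- The calculation depends on the query type, term frequency (TF), inverse document frequency (IDF), document length, and similar factors.
Not stored in the index:
- Relevance scores are not written to the index. What the index does hold includes the inverted index, term frequencies, and doc values.
- Every query recomputes the scores, so they always reflect the current query and the current contents of the index.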
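
A small experiment shows why a stored score would go stale. In this minimal sketch, the index keeps only statistics (the number of documents and, per term, the number of documents containing it), and the IDF weight is derived from them at query time using a BM25-style formula. `QueryTimeIdf` is a hypothetical class for illustration; the point is only that adding unrelated documents changes the weight of an existing term.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Keeps index statistics and derives a term weight from them on demand. */
final class QueryTimeIdf {
    private int docCount = 0;
    private final Map<String, Integer> docFreq = new HashMap<>();

    /** Indexing updates statistics only; no score is computed or stored here. */
    void addDocument(String... terms) {
        docCount++;
        // Count each distinct term once per document.
        for (String term : new HashSet<>(Arrays.asList(terms))) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    /** Computed at query time from the statistics as they are right now. */
    double idf(String term) {
        int n = docFreq.getOrDefault(term, 0);
        return Math.log(1.0 + (docCount - n + 0.5) / (n + 0.5));
    }

    public static void main(String[] args) {
        QueryTimeIdf stats = new QueryTimeIdf();
        stats.addDocument("lucene", "in", "action");
        stats.addDocument("lucene", "for", "dummies");
        double before = stats.idf("lucene");

        // Two more documents that do not mention the term at all.
        stats.addDocument("managing", "gigabytes");
        stats.addDocument("the", "art", "of", "computer", "science");
        double after = stats.idf("lucene");

        System.out.println(before + " -> " + after);
        System.out.println(after > before); // true
    }
}
```

The term became rarer relative to the collection, so its weight rose even though the documents containing it did not change.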
