
Section 301: Lucene Tutorial - Lucene Index Files

news 2025/12/8 19:22:50


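Indexing is the process of identifying documents and preparing them for search.

The following table lists the classes commonly used during indexing.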

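Class	Description
IndexWriter	Creates and updates the index during the indexing process.
Directory	Represents the storage location of the index.
Analyzer	Analyzes a document and extracts tokens (words) from its text.
Document	A virtual document made up of fields. An Analyzer can process a Document.
Field	The lowest-level unit of the indexing process. It represents a key-value pair, where the key identifies the value to be indexed.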
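To make the roles in the table more concrete, here is a minimal plain-Java sketch of an inverted index. It is only a conceptual illustration and not how Lucene is implemented: the class name ToyIndex and its methods are hypothetical, the analyze method loosely stands in for an Analyzer, and addDocument loosely stands in for IndexWriter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy in-memory inverted index (hypothetical; a conceptual stand-in, not Lucene code). */
final class ToyIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Analyzer role: lowercase the text and split it on non-letter, non-digit characters. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** IndexWriter role: record which terms occur in the new document and return its id. */
    int addDocument(String contents) {
        int docId = nextDocId++;
        for (String term : analyze(contents)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    /** Returns the ids of the documents containing the term, in ascending order. */
    List<Integer> search(String term) {
        TreeSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        if (hits == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(hits);
    }
}
```

Adding two documents and then calling search with a word they share returns both document ids. The work is done once at indexing time, so a lookup only has to read the posting list for the term.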

Example

The following code shows how to index text files with Lucene.

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Date;

/** Index all text files under a directory.
 * <p>
 * This is a command-line application demonstrating simple Lucene indexing.
 * Run it with no command-line arguments for usage information.
 */
public class Main {

  private Main() {}

  /** Index all text files under a directory. */
  public static void main(String[] args) {
    String usage = "java IndexFiles"
                 + " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
                 + "This indexes the documents in DOCS_PATH, creating a Lucene index"
                 + " in INDEX_PATH that can be searched with SearchFiles";
    String indexPath = "index";
    String docsPath = null;
    boolean create = true;
    for (int i = 0; i < args.length; i++) {
      if ("-index".equals(args[i])) {
        indexPath = args[i+1];
        i++;
      } else if ("-docs".equals(args[i])) {
        docsPath = args[i+1];
        i++;
      } else if ("-update".equals(args[i])) {
        create = false;
      }
    }

    if (docsPath == null) {
      System.err.println("Usage: " + usage);
      System.exit(1);
    }

    final File docDir = new File(docsPath);
    if (!docDir.exists() || !docDir.canRead()) {
      System.out.println("Document directory '" + docDir.getAbsolutePath()
          + "' does not exist or is not readable, please check the path");
      System.exit(1);
    }

    Date start = new Date();
    try {
      System.out.println("Indexing to directory '" + indexPath + "'...");

      Directory dir = FSDirectory.open(new File(indexPath));
      // :Post-Release-Update-Version.LUCENE_XY:
      Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_0);
      IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_4_10_0, analyzer);

      if (create) {
        // Create a new index in the directory, removing any
        // previously indexed documents:
        iwc.setOpenMode(OpenMode.CREATE);
      } else {
        // Add new documents to an existing index:
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
      }

      // Optional: for better indexing performance, if you
      // are indexing many documents, increase the RAM
      // buffer.  But if you do this, increase the max heap
      // size to the JVM (eg add -Xmx512m or -Xmx1g):
      //
      // iwc.setRAMBufferSizeMB(256.0);

      IndexWriter writer = new IndexWriter(dir, iwc);
      indexDocs(writer, docDir);

      // NOTE: if you want to maximize search performance,
      // you can optionally call forceMerge here.  This can be
      // a terribly costly operation, so generally it's only
      // worth it when your index is relatively static (ie
      // you're done adding documents to it):
      //
      // writer.forceMerge(1);

      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() + " total milliseconds");

    } catch (IOException e) {
      System.out.println(" caught a " + e.getClass() +
          "\n with message: " + e.getMessage());
    }
  }

  /**
   * Indexes the given file using the given writer, or if a directory is given,
   * recurses over files and directories found under the given directory.
   *
   * NOTE: This method indexes one document per input file.  This is slow.  For good
   * throughput, put multiple documents into your input file(s).  An example of this is
   * in the benchmark module, which can create "line doc" files, one document per line,
   * using the
   * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
   * >WriteLineDocTask</a>.
   *
   * @param writer Writer to the index where the given file/dir info will be stored
   * @param file The file to index, or the directory to recurse into to find files to index
   * @throws IOException If there is a low-level I/O error
   */
  static void indexDocs(IndexWriter writer, File file)
      throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
      if (file.isDirectory()) {
        String[] files = file.list();
        // an IO error could occur
        if (files != null) {
          for (int i = 0; i < files.length; i++) {
            indexDocs(writer, new File(file, files[i]));
          }
        }
      } else {

        FileInputStream fis;
        try {
          fis = new FileInputStream(file);
        } catch (FileNotFoundException fnfe) {
          // at least on windows, some temporary files raise this exception with an "access denied" message
          // checking if the file can be read doesn't help
          return;
        }

        try {

          // make a new, empty document
          Document doc = new Document();

          // Add the path of the file as a field named "path".  Use a
          // field that is indexed (i.e. searchable), but don't tokenize
          // the field into separate words and don't index term frequency
          // or positional information:
          Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
          doc.add(pathField);

          // Add the last modified date of the file a field named "modified".
          // Use a LongField that is indexed (i.e. efficiently filterable with
          // NumericRangeFilter).  This indexes to milli-second resolution, which
          // is often too fine.  You could instead create a number based on
          // year/month/day/hour/minutes/seconds, down the resolution you require.
          // For example the long value 2011021714 would mean
          // February 17, 2011, 2-3 PM.
          doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));

          // Add the contents of the file to a field named "contents".  Specify a Reader,
          // so that the text of the file is tokenized and indexed, but not stored.
          // Note that FileReader expects the file to be in UTF-8 encoding.
          // If that's not the case searching for special characters will fail.
          doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));

          if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
            // New index, so we just add the document (no old document can be there):
            System.out.println("adding " + file);
            writer.addDocument(doc);
          } else {
            // Existing index (an old copy of this document may have been indexed) so
            // we use updateDocument instead to replace the old one matching the exact
            // path, if present:
            System.out.println("updating " + file);
            writer.updateDocument(new Term("path", file.getPath()), doc);
          }

        } finally {
          fis.close();
        }
      }
    }
  }
}
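The comment on the "modified" field suggests storing a coarser number such as 2011021714 instead of raw milliseconds. The following is one possible sketch of that idea using only java.time. The class name CoarseTimestamp and the method toHourResolution are hypothetical, and the choice of UTC is an assumption; the original code does not say which time zone to use.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical helper: reduces a millisecond timestamp to hour resolution. */
final class CoarseTimestamp {

    private CoarseTimestamp() {}

    /** Encodes epoch milliseconds as a yyyyMMddHH long, interpreted in UTC (an assumption). */
    static long toHourResolution(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return t.getYear() * 1_000_000L
                + t.getMonthValue() * 10_000L
                + t.getDayOfMonth() * 100L
                + t.getHour();
    }
}
```

In the listing above, the result could be passed to the field in place of the raw value, for example new LongField("modified", CoarseTimestamp.toHourResolution(file.lastModified()), Field.Store.NO). Range filters on the field would then need bounds in the same yyyyMMddHH encoding.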
