The following code uses Tika to read the basic information (text content and metadata) of a PDF file:
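import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfMetadataExample { // class name is illustrative
    public static void main(String[] args) {
        // try-with-resources closes the input stream even if parsing fails
        try (FileInputStream inputstream = new FileInputStream(new File("D:/apache_software/solr/solr-7.5.0/example/exampledocs/solr-word.pdf"))) {
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext pcontext = new ParseContext();
            // parse the document using the PDF parser
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
            // print the extracted text content of the document
            System.out.println("Contents of the PDF :" + handler.toString());
            // print every metadata name/value pair of the document
            System.out.println("Metadata of the PDF:");
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + " : " + metadata.get(name));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}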
Output:
Metadata of the PDF:
date : 2008-11-13T13:35:51Z
pdf:PDFVersion : 1.3
xmp:CreatorTool : Microsoft Word
Keywords : solr, word, pdf
subject : solr word
AAPL:Keywords : solr, word, pdf
dc:creator : Grant Ingersoll
dcterms:created : 2008-11-13T13:35:51Z
Last-Modified : 2008-11-13T13:35:51Z
dcterms:modified : 2008-11-13T13:35:51Z
dc:format : application/pdf; version=1.3
title : solr-word
Last-Save-Date : 2008-11-13T13:35:51Z
meta:save-date : 2008-11-13T13:35:51Z
dc:title : solr-word
pdf:encrypted : false
modified : 2008-11-13T13:35:51Z
cp:subject : solr word
Content-Type : application/pdf
creator : Grant Ingersoll
meta:author : Grant Ingersoll
dc:subject : solr, word, pdf
meta:creation-date : 2008-11-13T13:35:51Z
created : Thu Nov 13 21:35:51 CST 2008
xmpTPg:NPages : 1
Creation-Date : 2008-11-13T13:35:51Z
meta:keyword : solr, word, pdf
Author : Grant Ingersoll
producer : Mac OS X 10.5.5 Quartz PDFContext
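This is why the configuration for importing PDF files with Tika looks like the following: each field marked meta="true" takes its column from one of the metadata names printed above.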
<entity name="pdf" processor="TikaEntityProcessor"
url="${file.fileAbsolutePath}" format="text">
<field column="Author" name="author" meta="true"/>
<!-- in the original PDF, the Author meta-field name is upper-cased,
but in Solr schema it is lower-cased
-->
<field column="title" name="title" meta="true"/>
<field column="dc:format" name="format" meta="true"/>
<field column="text" name="text"/>
</entity>
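
The column-to-name mapping in this configuration can be pictured with a small standalone sketch. This is not Solr's or Tika's actual implementation. It is a conceptual illustration using only the JDK, and the class `MetaFieldMapping` and the method `mapToSolrFields` are hypothetical names introduced here. It assumes the metadata has already been extracted into a plain map using the values shown in the output above. The `text` field is left out because it carries the document body rather than a metadata entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch only: mimics how <field column="..." name="..." meta="true"/>
// selects Tika metadata entries and renames them to Solr field names.
class MetaFieldMapping {

    // Hypothetical helper: copies each configured metadata column into the
    // Solr document under its target field name, skipping absent columns.
    static Map<String, String> mapToSolrFields(Map<String, String> tikaMetadata,
                                               Map<String, String> columnToField) {
        Map<String, String> solrDoc = new LinkedHashMap<>();
        for (Map.Entry<String, String> mapping : columnToField.entrySet()) {
            String value = tikaMetadata.get(mapping.getKey());
            if (value != null) {
                solrDoc.put(mapping.getValue(), value);
            }
        }
        return solrDoc;
    }

    public static void main(String[] args) {
        // A few of the metadata entries printed for solr-word.pdf
        Map<String, String> tikaMetadata = new LinkedHashMap<>();
        tikaMetadata.put("Author", "Grant Ingersoll");
        tikaMetadata.put("title", "solr-word");
        tikaMetadata.put("dc:format", "application/pdf; version=1.3");
        tikaMetadata.put("producer", "Mac OS X 10.5.5 Quartz PDFContext");

        // column -> name pairs taken from the <field ... meta="true"/> lines
        Map<String, String> columnToField = new LinkedHashMap<>();
        columnToField.put("Author", "author");
        columnToField.put("title", "title");
        columnToField.put("dc:format", "format");

        // prints {author=Grant Ingersoll, title=solr-word, format=application/pdf; version=1.3}
        System.out.println(mapToSolrFields(tikaMetadata, columnToField));
    }
}
```

Metadata entries that have no matching field line, such as `producer`, are simply not copied, which mirrors the fact that only the columns listed in the entity end up in the indexed document.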