Getting basic PDF information with Tika

2022-03-28 19:43:49

The following code reads the text content and the basic information (metadata) of a PDF file:

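    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    // The class name is only a wrapper so that the snippet compiles on its own.
    public class PdfMetadataDemo {
        public static void main(String[] args) {
            File pdf = new File("D:/apache_software/solr/solr-7.5.0/example/exampledocs/solr-word.pdf");

            // try-with-resources closes the stream even if parsing fails
            try (InputStream inputstream = new FileInputStream(pdf)) {
                BodyContentHandler handler = new BodyContentHandler();
                Metadata metadata = new Metadata();
                ParseContext pcontext = new ParseContext();

                // parse the document with the PDF parser
                PDFParser pdfparser = new PDFParser();
                pdfparser.parse(inputstream, handler, metadata, pcontext);

                // the extracted text content of the document
                System.out.println("Contents of the PDF :" + handler.toString());

                // the metadata of the document, one "name : value" pair per line
                System.out.println("Metadata of the PDF:");
                for (String name : metadata.names()) {
                    System.out.println(name + " : " + metadata.get(name));
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }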

Output (metadata part):

Metadata of the PDF:
 date : 2008-11-13T13:35:51Z
 pdf:PDFVersion : 1.3
 xmp:CreatorTool : Microsoft Word
 Keywords : solr, word, pdf
 subject : solr word
 AAPL:Keywords : solr, word, pdf
 dc:creator : Grant Ingersoll
 dcterms:created : 2008-11-13T13:35:51Z
 Last-Modified : 2008-11-13T13:35:51Z
 dcterms:modified : 2008-11-13T13:35:51Z
 dc:format : application/pdf; version=1.3
 title : solr-word
 Last-Save-Date : 2008-11-13T13:35:51Z
 meta:save-date : 2008-11-13T13:35:51Z
 dc:title : solr-word
 pdf:encrypted : false
 modified : 2008-11-13T13:35:51Z
 cp:subject : solr word
 Content-Type : application/pdf
 creator : Grant Ingersoll
 meta:author : Grant Ingersoll
 dc:subject : solr, word, pdf
 meta:creation-date : 2008-11-13T13:35:51Z
 created : Thu Nov 13 21:35:51 CST 2008
 xmpTPg:NPages : 1
 Creation-Date : 2008-11-13T13:35:51Z
 meta:keyword : solr, word, pdf
 Author : Grant Ingersoll
 producer : Mac OS X 10.5.5 Quartz PDFContext
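
Several values in this listing appear under more than one name; the author, for example, is reported as Author, creator, dc:creator and meta:author. The sketch below makes those aliases visible by grouping "name : value" lines by value. It is only an illustration and does not call Tika: the class name MetadataAliasSketch and the helper groupByValue are made up here, and the input is a hand-copied subset of the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: groups printed metadata lines by value.
class MetadataAliasSketch {

    /** Splits each "name : value" line and collects the names that share a value. */
    public static Map<String, List<String>> groupByValue(List<String> lines) {
        Map<String, List<String>> byValue = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(" : ");   // separator used by the println above
            if (sep < 0) {
                continue;                    // not a "name : value" line
            }
            String name = line.substring(0, sep).trim();
            String value = line.substring(sep + 3).trim();
            byValue.computeIfAbsent(value, k -> new ArrayList<>()).add(name);
        }
        return byValue;
    }

    public static void main(String[] args) {
        // A subset of the output shown above
        List<String> lines = Arrays.asList(
                " dc:creator : Grant Ingersoll",
                " title : solr-word",
                " dc:title : solr-word",
                " creator : Grant Ingersoll",
                " meta:author : Grant Ingersoll",
                " Author : Grant Ingersoll",
                " pdf:PDFVersion : 1.3");

        // Print only the values that are reported under more than one name
        for (Map.Entry<String, List<String>> e : groupByValue(lines).entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " <- " + e.getValue());
            }
        }
    }
}
```

Any one of the aliases can be used to read the value; which names are present depends on the Tika version and on the document itself, so it is worth printing them once, as above, before relying on a particular name.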

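These metadata names are also why the Solr configuration for importing PDF files through Tika looks like the following: each field with meta="true" takes its column from one of the metadata names printed above and stores the value in the Solr field given by name.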

    <entity name="pdf" processor="TikaEntityProcessor"
            url="${file.fileAbsolutePath}" format="text">
      <field column="Author" name="author" meta="true"/>
      <!-- in the original PDF, the Author meta-field name is upper-cased,
           but in Solr schema it is lower-cased
      -->
      <field column="title" name="title" meta="true"/>
      <field column="dc:format" name="format" meta="true"/>
      <field column="text" name="text"/>
    </entity>
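
To make the mapping concrete, here is a minimal sketch of what the three meta="true" fields amount to: look up each column in the extracted metadata and store the value under the Solr field name. This is not the code of TikaEntityProcessor. The class MetaFieldMappingSketch and the method mapMetaFields are hypothetical, and plain maps stand in for Tika's Metadata object and for the Solr document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the column -> name mapping; not Solr or Tika code.
class MetaFieldMappingSketch {

    /**
     * Copies the metadata entries named in columnToField into a new map keyed
     * by the Solr field name. Metadata names that are absent are skipped.
     */
    public static Map<String, String> mapMetaFields(Map<String, String> metadata,
                                                    Map<String, String> columnToField) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : columnToField.entrySet()) {
            // In this sketch the lookup is case-sensitive: "Author" != "author"
            String value = metadata.get(e.getKey());
            if (value != null) {
                doc.put(e.getValue(), value);
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // A few of the metadata entries from the output shown earlier
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Author", "Grant Ingersoll");
        metadata.put("title", "solr-word");
        metadata.put("dc:format", "application/pdf; version=1.3");
        metadata.put("producer", "Mac OS X 10.5.5 Quartz PDFContext");

        // column -> name pairs taken from the <field ... meta="true"/> elements
        Map<String, String> columnToField = new LinkedHashMap<>();
        columnToField.put("Author", "author");
        columnToField.put("title", "title");
        columnToField.put("dc:format", "format");

        // prints {author=Grant Ingersoll, title=solr-word, format=application/pdf; version=1.3}
        System.out.println(mapMetaFields(metadata, columnToField));
    }
}
```

The producer entry is dropped because no field refers to it, and the upper-cased Author ends up in the lower-cased author field, which is what the comment in the configuration describes.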
