org.opencms.search.extractors
Class A_CmsTextExtractorMsOfficeBase

java.lang.Object
  extended by org.opencms.search.extractors.A_CmsTextExtractor
      extended by org.opencms.search.extractors.A_CmsTextExtractorMsOfficeBase
All Implemented Interfaces:
I_CmsTextExtractor, org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
Direct Known Subclasses:
CmsExtractorMsExcel, CmsExtractorMsPowerPoint, CmsExtractorMsWord

public abstract class A_CmsTextExtractorMsOfficeBase
extends A_CmsTextExtractor
implements org.apache.poi.poifs.eventfilesystem.POIFSReaderListener

Base class for extracting summary information from MS Office documents.

Since:
6.0.0
Version:
$Revision: 1.7 $
Author:
Alexander Kandzior

Field Summary
protected static java.lang.String ENCODING_CP1252
          Windows Cp1252 encoding (Western Europe) is used as the default for single byte fields.
protected static java.lang.String ENCODING_UTF16
          UTF-16 encoding is used for double byte fields.
protected static java.lang.String POWERPOINT_EVENT_NAME
          Event name for a MS PowerPoint document.
protected static int PPT_TEXTBYTE_ATOM
          PPT text byte atom.
protected static int PPT_TEXTCHAR_ATOM
          PPT text char atom.
 
Fields inherited from class org.opencms.search.extractors.A_CmsTextExtractor
m_inputBuffer
 
Constructor Summary
A_CmsTextExtractorMsOfficeBase()
           
 
Method Summary
protected  void cleanup()
          Cleans up some internal memory.
protected  java.util.Map extractMetaInformation()
          Returns a map with the extracted meta information from the document.
 void processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent event)
           
 
Methods inherited from class org.opencms.search.extractors.A_CmsTextExtractor
extractText, extractText, extractText, extractText, getStreamCopy, removeControlChars
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ENCODING_CP1252

protected static final java.lang.String ENCODING_CP1252
Windows Cp1252 encoding (Western Europe) is used as the default for single byte fields.

See Also:
Constant Field Values

ENCODING_UTF16

protected static final java.lang.String ENCODING_UTF16
UTF-16 encoding is used for double byte fields.

See Also:
Constant Field Values
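
The difference between the two encoding constants can be illustrated with plain JDK calls. The following is a minimal sketch, not the class's actual code: the charset names "Cp1252" and "UTF-16LE" are assumptions (this page does not list the constant values), and the class and method names are hypothetical. Little-endian byte order is spelled out because Java's plain "UTF-16" charset assumes big-endian when no byte order mark is present.

```java
import java.nio.charset.Charset;

class OfficeFieldDecodingSketch {

    // Assumed charset names; this page names the constants but not their values.
    static final String SINGLE_BYTE_ENCODING = "Cp1252";
    static final String DOUBLE_BYTE_ENCODING = "UTF-16LE";

    /** Decodes a single byte field: one byte per character. */
    static String decodeSingleByte(byte[] raw) {
        return new String(raw, Charset.forName(SINGLE_BYTE_ENCODING));
    }

    /** Decodes a double byte field: two bytes per character, low byte first. */
    static String decodeDoubleByte(byte[] raw) {
        return new String(raw, Charset.forName(DOUBLE_BYTE_ENCODING));
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are the curly double quotes in Cp1252.
        byte[] single = {(byte) 0x93, 'H', 'i', (byte) 0x94};
        // 0xAC 0x20 is the euro sign (U+20AC) in little-endian UTF-16.
        byte[] dbl = {'H', 0, 'i', 0, (byte) 0xAC, 0x20};
        System.out.println(decodeSingleByte(single));
        System.out.println(decodeDoubleByte(dbl));
    }
}
```

Decoding a double byte field with the single byte charset (or the reverse) yields interleaved NUL characters or garbage, which is why the two cases are kept apart.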

POWERPOINT_EVENT_NAME

protected static final java.lang.String POWERPOINT_EVENT_NAME
Event name for a MS PowerPoint document.

See Also:
Constant Field Values

PPT_TEXTBYTE_ATOM

protected static final int PPT_TEXTBYTE_ATOM
PPT text byte atom.

See Also:
Constant Field Values

PPT_TEXTCHAR_ATOM

protected static final int PPT_TEXTCHAR_ATOM
PPT text char atom.

See Also:
Constant Field Values
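
To show what the two atom constants are for, here is a hedged sketch of walking the record headers of a PowerPoint document stream and collecting the text atoms. It is not taken from this class or its subclasses. The record type values 4000 (text char atom) and 4008 (text byte atom) come from the PowerPoint binary format and are assumed to match the constants above; the class name, method names, and the charset choices are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class PptTextAtomSketch {

    // Assumed to match PPT_TEXTCHAR_ATOM and PPT_TEXTBYTE_ATOM; values are not listed on this page.
    static final int TEXTCHAR_ATOM = 4000; // 0x0FA0, two bytes per character
    static final int TEXTBYTE_ATOM = 4008; // 0x0FA8, one byte per character

    /** Walks the 8-byte record headers and returns the text of all text atoms, in order. */
    static List<String> extractText(byte[] stream) {
        List<String> result = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int pos = 0;
        while (pos + 8 <= stream.length) {
            int verAndInstance = buf.getShort(pos) & 0xFFFF;
            int type = buf.getShort(pos + 2) & 0xFFFF;
            long length = buf.getInt(pos + 4) & 0xFFFFFFFFL;
            pos += 8;
            if ((verAndInstance & 0x000F) == 0x000F) {
                continue; // container record: its children follow directly
            }
            if (length > stream.length - pos) {
                break; // truncated or corrupt record, stop scanning
            }
            int len = (int) length;
            if (type == TEXTBYTE_ATOM) {
                result.add(new String(stream, pos, len, Charset.forName("Cp1252")));
            } else if (type == TEXTCHAR_ATOM) {
                result.add(new String(stream, pos, len, Charset.forName("UTF-16LE")));
            }
            pos += len; // skip the atom payload
        }
        return result;
    }

    /** Builds one record (header plus payload); used only to create demo data. */
    static byte[] record(int verAndInstance, int type, byte[] payload) {
        ByteBuffer b = ByteBuffer.allocate(8 + payload.length).order(ByteOrder.LITTLE_ENDIAN);
        b.putShort((short) verAndInstance).putShort((short) type).putInt(payload.length).put(payload);
        return b.array();
    }

    /** Concatenates two byte arrays. */
    static byte[] concat(byte[] a, byte[] b) {
        return ByteBuffer.allocate(a.length + b.length).put(a).put(b).array();
    }

    public static void main(String[] args) {
        byte[] bytesAtom = record(0, TEXTBYTE_ATOM, "Title".getBytes(Charset.forName("Cp1252")));
        byte[] charsAtom = record(0, TEXTCHAR_ATOM, "Body".getBytes(Charset.forName("UTF-16LE")));
        // Wrap both atoms in a container record (type 1000 chosen for the demo).
        byte[] stream = record(0x000F, 1000, concat(bytesAtom, charsAtom));
        System.out.println(extractText(stream)); // prints [Title, Body]
    }
}
```

The sketch descends into containers simply by continuing after their header, which keeps the walk iterative; a real extractor would also need to cope with streams that contain stale or damaged records.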
Constructor Detail

A_CmsTextExtractorMsOfficeBase

public A_CmsTextExtractorMsOfficeBase()
Method Detail

processPOIFSReaderEvent

public void processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent event)
Specified by:
processPOIFSReaderEvent in interface org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
See Also:
POIFSReaderListener.processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent)

cleanup

protected void cleanup()
Cleans up some internal memory.


extractMetaInformation

protected java.util.Map extractMetaInformation()
Returns a map with the extracted meta information from the document.

Returns:
a map with the extracted meta information from the document
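
Because the method returns an untyped java.util.Map, callers have to treat keys and values generically. The sketch below shows one way a caller might turn such a map into a readable summary. The class name, the method name, and the keys "title" and "author" are hypothetical; this page does not say which keys the map contains.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetaInformationSketch {

    /** Joins all non-null entries as "key=value" pairs separated by "; ". */
    static String describe(Map<?, ?> metaInfo) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<?, ?> entry : metaInfo.entrySet()) {
            if (entry.getValue() == null) {
                continue; // skip properties the document did not provide
            }
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(entry.getKey()).append('=').append(entry.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical keys and values, standing in for an extracted map.
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("title", "Quarterly report");
        meta.put("author", null);
        meta.put("keywords", "sales, forecast");
        System.out.println(describe(meta));
    }
}
```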