org.opencms.search.extractors
Class A_CmsTextExtractorMsOfficeBase

java.lang.Object
  extended by org.opencms.search.extractors.A_CmsTextExtractor
      extended by org.opencms.search.extractors.A_CmsTextExtractorMsOfficeBase
All Implemented Interfaces:
I_CmsTextExtractor, org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
Direct Known Subclasses:
CmsExtractorMsExcel, CmsExtractorMsPowerPoint, CmsExtractorMsWord

public abstract class A_CmsTextExtractorMsOfficeBase
extends A_CmsTextExtractor
implements org.apache.poi.poifs.eventfilesystem.POIFSReaderListener

Base class for extracting summary information from MS Office documents.

Since:
6.0.0
Version:
$Revision: 1.9 $
Author:
Alexander Kandzior

Field Summary
protected static java.lang.String ENCODING_CP1252
          Windows Cp1252 encoding (Western Europe) is used as default for single byte fields.
protected static java.lang.String ENCODING_UTF16
          UTF-16 encoding is used for double byte fields.
protected static java.lang.String POWERPOINT_EVENT_NAME
          Event name for a MS PowerPoint document.
protected static int PPT_TEXTBYTE_ATOM
          PPT text byte atom.
protected static int PPT_TEXTCHAR_ATOM
          PPT text char atom.
 
Fields inherited from class org.opencms.search.extractors.A_CmsTextExtractor
m_inputBuffer
 
Constructor Summary
A_CmsTextExtractorMsOfficeBase()
           
 
Method Summary
protected  void cleanup()
          Cleans up some internal memory.
protected  I_CmsExtractionResult createExtractionResult(java.lang.String rawContent)
          Creates the extraction result for this MS Office document.
 void processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent event)
           
 
Methods inherited from class org.opencms.search.extractors.A_CmsTextExtractor
combineContentItem, extractText, extractText, extractText, extractText, getStreamCopy, removeControlChars
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ENCODING_CP1252

protected static final java.lang.String ENCODING_CP1252
Windows Cp1252 encoding (Western Europe) is used as default for single byte fields.

See Also:
Constant Field Values

ENCODING_UTF16

protected static final java.lang.String ENCODING_UTF16
UTF-16 encoding is used for double byte fields.

See Also:
Constant Field Values
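
As an illustration of how the two encoding constants differ, the following minimal sketch decodes a single byte field with Cp1252 and a double byte field with UTF-16, using only the Java standard library. The constant values "Cp1252" and "UTF-16", the class name OfficeEncodingSketch and both helper methods are assumptions made for this example; the actual values are listed under Constant Field Values, and this is not the OpenCms implementation.

```java
import java.nio.charset.Charset;

public class OfficeEncodingSketch {

    // Assumed values for illustration; check "Constant Field Values" for the real ones.
    static final String ENCODING_CP1252 = "Cp1252";
    static final String ENCODING_UTF16 = "UTF-16";

    /** Decodes a single byte field: every byte maps to exactly one character. */
    static String decodeSingleByte(byte[] raw) {
        return new String(raw, Charset.forName(ENCODING_CP1252));
    }

    /**
     * Decodes a double byte field: two bytes per character.
     * Without a byte order mark, Java's "UTF-16" decoder assumes big-endian.
     */
    static String decodeDoubleByte(byte[] raw) {
        return new String(raw, Charset.forName(ENCODING_UTF16));
    }

    public static void main(String[] args) {
        // 0xE9 is the letter e with acute accent in Cp1252
        byte[] single = {0x43, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decodeSingleByte(single));

        // Byte order mark FE FF followed by "Hi" in big-endian UTF-16
        byte[] dbl = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x48, 0x00, 0x69};
        System.out.println(decodeDoubleByte(dbl));
    }
}
```

Text stored inside Office binary files is often little-endian, so a real extractor may need "UTF-16LE" or an explicit byte order mark; the sketch only shows the single byte versus double byte distinction.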

POWERPOINT_EVENT_NAME

protected static final java.lang.String POWERPOINT_EVENT_NAME
Event name for a MS PowerPoint document.

See Also:
Constant Field Values

PPT_TEXTBYTE_ATOM

protected static final int PPT_TEXTBYTE_ATOM
PPT text byte atom.

See Also:
Constant Field Values

PPT_TEXTCHAR_ATOM

protected static final int PPT_TEXTCHAR_ATOM
PPT text char atom.

See Also:
Constant Field Values

Constructor Detail

A_CmsTextExtractorMsOfficeBase

public A_CmsTextExtractorMsOfficeBase()

Method Detail

processPOIFSReaderEvent

public void processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent event)
Specified by:
processPOIFSReaderEvent in interface org.apache.poi.poifs.eventfilesystem.POIFSReaderListener
See Also:
POIFSReaderListener.processPOIFSReaderEvent(org.apache.poi.poifs.eventfilesystem.POIFSReaderEvent)

cleanup

protected void cleanup()
Cleans up some internal memory.


createExtractionResult

protected I_CmsExtractionResult createExtractionResult(java.lang.String rawContent)
Creates the extraction result for this MS Office document.

The extraction result contains the raw content, plus additional meta information as content items read from the MS Office document properties.

Parameters:
rawContent - the raw content extracted from the document
Returns:
the extraction result for this MS Office document
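
The idea of combining raw content with meta information as content items can be sketched roughly as follows. This is a self-contained illustration and not the OpenCms code: the Result class, the createResult method and the property names "title" and "author" are hypothetical stand-ins, and skipping empty property values is an assumption of this sketch.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractionResultSketch {

    /** Hypothetical stand-in for an extraction result: raw content plus named content items. */
    static final class Result {
        final String content;
        final Map<String, String> contentItems;

        Result(String content, Map<String, String> contentItems) {
            this.content = content;
            this.contentItems = Collections.unmodifiableMap(contentItems);
        }
    }

    /**
     * Builds a result from the raw text and the document properties.
     * Properties with a null or blank value are skipped (an assumption of this sketch).
     */
    static Result createResult(String rawContent, Map<String, String> documentProperties) {
        Map<String, String> items = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : documentProperties.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.trim().isEmpty()) {
                items.put(entry.getKey(), value.trim());
            }
        }
        return new Result(rawContent, items);
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("title", "Quarterly report");
        props.put("author", "");

        Result result = createResult("Text extracted from the document body", props);
        System.out.println(result.content);
        System.out.println(result.contentItems);
    }
}
```

Keeping the meta information in separate named items, rather than appending it to the raw text, lets a search index treat fields such as the title differently from the body.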