org.opencms.search.extractors
Class A_CmsTextExtractor

java.lang.Object
  extended by org.opencms.search.extractors.A_CmsTextExtractor
All Implemented Interfaces:
I_CmsTextExtractor
Direct Known Subclasses:
A_CmsTextExtractorMsOfficeBase, CmsExtractorHtml, CmsExtractorOpenOffice, CmsExtractorPdf, CmsExtractorRtf

public abstract class A_CmsTextExtractor
extends java.lang.Object
implements I_CmsTextExtractor

Base utility class that allows extraction of the indexable "plain" text from a given document format.

Since:
6.0.0
Version:
$Revision: 1.11 $
Author:
Alexander Kandzior

Field Summary
protected  byte[] m_inputBuffer
          A buffer in case the input stream must be read more than once.
 
Constructor Summary
A_CmsTextExtractor()
           
 
Method Summary
protected  void combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map contentItems)
          Combines a meta information item extracted from the document with the main content buffer and also stores the item individually in the Map of content items.
 I_CmsExtractionResult extractText(byte[] content)
          Extracts the text and meta information from the given binary document.
 I_CmsExtractionResult extractText(byte[] content, java.lang.String encoding)
          Extracts the text and meta information from the given binary document, using the specified content encoding.
 I_CmsExtractionResult extractText(java.io.InputStream in)
          Extracts the text and meta information from the document on the input stream.
 I_CmsExtractionResult extractText(java.io.InputStream in, java.lang.String encoding)
          Extracts the text and meta information from the document on the input stream, using the specified content encoding.
 java.io.InputStream getStreamCopy(java.io.InputStream in)
          Creates a copy of the original input stream so that the stream can be read more than once, as required for certain document types.
protected  java.lang.String removeControlChars(java.lang.String content)
          Removes "unwanted" control chars from the given content.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

m_inputBuffer

protected byte[] m_inputBuffer
A buffer in case the input stream must be read more than once.

Constructor Detail

A_CmsTextExtractor

public A_CmsTextExtractor()
Method Detail

extractText

public I_CmsExtractionResult extractText(byte[] content)
                                  throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the given binary document.

The encoding of the document is either not required (the document type may have one common default encoding) or the extractor is able to determine the encoding from the provided binary array automatically.

Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) with the encoding set to null.

Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[])
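
The relationship between the two byte[] overloads can be illustrated with a minimal sketch. The types below are hypothetical stand-ins, not the OpenCms classes: they return a plain String instead of an I_CmsExtractionResult, and the UTF-8 fallback is an assumption made only for this example.

```java
// Hypothetical stand-in for the abstract base class; not the OpenCms source.
abstract class TextExtractorSketch {

    // One-argument overload: delivers the same result as passing a null encoding.
    String extractText(byte[] content) throws Exception {
        return extractText(content, null);
    }

    // Subclasses implement this; a null encoding means no hint was given.
    abstract String extractText(byte[] content, String encoding) throws Exception;
}

// Hypothetical concrete extractor for plain text documents.
final class PlainTextExtractorSketch extends TextExtractorSketch {

    @Override
    String extractText(byte[] content, String encoding) throws Exception {
        // Assumption for this sketch: fall back to UTF-8 when no hint is given.
        String charset = (encoding != null) ? encoding : "UTF-8";
        return new String(content, charset);
    }
}
```

With this layout a concrete extractor only has to implement the two-argument method, and callers that have no encoding information can use the shorter overload.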

extractText

public I_CmsExtractionResult extractText(byte[] content,
                                         java.lang.String encoding)
                                  throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the given binary document, using the specified content encoding.

The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.

Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
encoding - the encoding to use
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[], java.lang.String)
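
How an extractor figures out the encoding when the hint is null depends on the document format and is not specified here. As one possible illustration only, the following sketch honours a non-null hint and otherwise looks for a byte order mark. The class name, the method name resolveEncoding, and the ISO-8859-1 fallback are all assumptions made for this example.

```java
// Hypothetical helper; not part of the OpenCms API.
final class EncodingHintSketch {

    // Returns the hinted encoding, or a guess based on a byte order mark when the hint is null.
    static String resolveEncoding(byte[] content, String encodingHint) {
        if (encodingHint != null) {
            // A caller-supplied hint always wins.
            return encodingHint;
        }
        if (content.length >= 3
            && (content[0] & 0xFF) == 0xEF
            && (content[1] & 0xFF) == 0xBB
            && (content[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (content.length >= 2) {
            int first = content[0] & 0xFF;
            int second = content[1] & 0xFF;
            if (first == 0xFE && second == 0xFF) {
                return "UTF-16BE";
            }
            if (first == 0xFF && second == 0xFE) {
                // Simplification: a UTF-32LE byte order mark starts with the same two bytes.
                return "UTF-16LE";
            }
        }
        // Assumption for this sketch: no byte order mark means ISO-8859-1.
        return "ISO-8859-1";
    }
}
```

Formats that carry their own encoding declaration (for example in a header or meta tag) would need a format-specific check instead of, or in addition to, the byte order mark test.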

extractText

public I_CmsExtractionResult extractText(java.io.InputStream in)
                                  throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream.

The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to divine the encoding from the provided input stream automatically.

Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) with the encoding set to null.

Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream)

extractText

public I_CmsExtractionResult extractText(java.io.InputStream in,
                                         java.lang.String encoding)
                                  throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
Extracts the text and meta information from the document on the input stream, using the specified content encoding.

The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.

Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)

getStreamCopy

public java.io.InputStream getStreamCopy(java.io.InputStream in)
                                  throws java.io.IOException
Creates a copy of the original input stream so that the stream can be read more than once, as required for certain document types.

Parameters:
in - the input stream to copy
Returns:
a copy of the original input stream
Throws:
java.io.IOException - in case of read errors from the original input stream
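
One common way to make a stream readable more than once is to drain it into a byte array and hand out fresh in-memory streams over that array. The sketch below shows this idea under stated assumptions: the class is a hypothetical stand-in, the inputBuffer field only mirrors the role described for m_inputBuffer, and the actual OpenCms implementation may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical stand-in; not the OpenCms implementation.
final class StreamCopySketch {

    // Plays the role described for m_inputBuffer: the bytes of the original stream.
    private byte[] inputBuffer;

    // Drains the original stream on the first call, then serves copies from the buffer.
    InputStream getStreamCopy(InputStream in) throws IOException {
        if (inputBuffer == null) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
            inputBuffer = out.toByteArray();
        }
        // Each call returns an independent stream positioned at the start of the data.
        return new ByteArrayInputStream(inputBuffer);
    }
}
```

The trade-off of this approach is memory: the whole document is held in the buffer for as long as the extractor instance is in use.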

combineContentItem

protected void combineContentItem(java.lang.String itemValue,
                                  java.lang.String itemKey,
                                  java.lang.StringBuffer content,
                                  java.util.Map contentItems)
Combines a meta information item extracted from the document with the main content buffer and also stores the item individually in the Map of content items.

Parameters:
itemValue - the value of the item to store
itemKey - the key in the Map of content items
content - the buffer to which the content item is appended
contentItems - the Map of individual content items
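
The described behaviour, appending to the content buffer and recording the item in the Map, might look roughly like the following sketch. The class name is hypothetical, the typed Map is a convenience for the example (the documented signature uses a raw java.util.Map), and both the newline separator and the skipping of empty values are assumptions.

```java
import java.util.Map;

// Hypothetical stand-in; not the OpenCms implementation.
final class CombineContentItemSketch {

    static void combineContentItem(
        String itemValue,
        String itemKey,
        StringBuffer content,
        Map<String, String> contentItems) {

        // Assumption: null or blank values are not worth indexing and are skipped.
        if (itemValue == null || itemValue.trim().isEmpty()) {
            return;
        }
        // Assumption: items are separated from earlier content by a newline.
        if (content.length() > 0) {
            content.append('\n');
        }
        content.append(itemValue);
        // Keep the item retrievable on its own under its key.
        contentItems.put(itemKey, itemValue);
    }
}
```

A subclass could call such a method once per meta field (for example title, author, or keywords), so that the value is searchable both as part of the full text and as a separate item.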

removeControlChars

protected java.lang.String removeControlChars(java.lang.String content)
Removes "unwanted" control chars from the given content.

Parameters:
content - the content to remove the unwanted control chars from
Returns:
the content with the unwanted control chars removed
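
Which control characters count as "unwanted" is not specified on this page. As a minimal sketch, assuming that every ISO control character except tab, line feed, and carriage return is unwanted, the filtering could be written as follows. The class name is hypothetical and the rule itself is an assumption.

```java
// Hypothetical stand-in; not the OpenCms implementation.
final class RemoveControlCharsSketch {

    static String removeControlChars(String content) {
        if (content == null) {
            return null;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            // Assumption: tab, line feed, and carriage return are kept as useful whitespace.
            boolean keptWhitespace = (c == '\t') || (c == '\n') || (c == '\r');
            if (Character.isISOControl(c) && !keptWhitespace) {
                continue; // drop the unwanted control character
            }
            result.append(c);
        }
        return result.toString();
    }
}
```

Dropping the characters outright can join two words that were separated only by a control character. Replacing them with a space would be an alternative if word boundaries matter for indexing.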