org.opencms.search.extractors
Interface I_CmsTextExtractor

All Known Implementing Classes:
A_CmsTextExtractor, A_CmsTextExtractorMsOfficeBase, CmsExtractorHtml, CmsExtractorMsExcel, CmsExtractorMsPowerPoint, CmsExtractorMsWord, CmsExtractorOpenOffice, CmsExtractorPdf, CmsExtractorRtf

public interface I_CmsTextExtractor

Allows extraction of the indexable "plain" text plus (optional) meta information from a given binary input document format.

Since:
6.0.0
Version:
$Revision: 1.9 $
Author:
Alexander Kandzior

Method Summary
 I_CmsExtractionResult extractText(byte[] content)
          Extracts the text and meta information from the given binary document.
 I_CmsExtractionResult extractText(byte[] content, java.lang.String encoding)
          Extracts the text and meta information from the given binary document, using the specified content encoding.
 I_CmsExtractionResult extractText(java.io.InputStream in)
          Extracts the text and meta information from the document on the input stream.
 I_CmsExtractionResult extractText(java.io.InputStream in, java.lang.String encoding)
          Extracts the text and meta information from the document on the input stream, using the specified content encoding.
 

Method Detail

extractText

I_CmsExtractionResult extractText(byte[] content)
                                  throws java.lang.Exception
Extracts the text and meta information from the given binary document.

The encoding of the document is either not required (the document type may have one common default encoding) or the extractor is able to divine the encoding from the provided byte array automatically.

Delivers the same result as calling extractText(byte[], String) with a null encoding.

Parameters:
content - the binary content of the document to extract the text from
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
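
The equivalence between the one-argument and two-argument byte[] overloads can be illustrated with a minimal sketch. The types below (Result, TextExtractor, PlainTextExtractor) are hypothetical stand-ins written for this example, not the OpenCms classes, and the UTF-8 fallback is an assumption chosen only to keep the toy extractor simple.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExtractTextDelegationSketch {

    /** Hypothetical stand-in for I_CmsExtractionResult (illustration only). */
    interface Result {
        String getContent();
    }

    /** Hypothetical stand-in mirroring the two byte[] overloads described above. */
    interface TextExtractor {
        Result extractText(byte[] content) throws Exception;

        Result extractText(byte[] content, String encoding) throws Exception;
    }

    /** Toy extractor that treats the document as plain text. */
    static class PlainTextExtractor implements TextExtractor {

        @Override
        public Result extractText(byte[] content) throws Exception {
            // Same result as the two-argument overload with a null encoding.
            return extractText(content, null);
        }

        @Override
        public Result extractText(byte[] content, String encoding) throws Exception {
            // Assumption: fall back to UTF-8 when no encoding hint is given.
            Charset charset = (encoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(encoding);
            String text = new String(content, charset);
            return () -> text;
        }
    }

    public static void main(String[] args) throws Exception {
        TextExtractor extractor = new PlainTextExtractor();
        byte[] document = "plain text document".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractor.extractText(document).getContent());
    }
}
```

Delegating from the shorter overload keeps the extraction logic in one place, so both calls cannot drift apart.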

extractText

I_CmsExtractionResult extractText(byte[] content,
                                  java.lang.String encoding)
                                  throws java.lang.Exception
Extracts the text and meta information from the given binary document, using the specified content encoding.

The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.

Parameters:
content - the binary content of the document to extract the text from
encoding - the encoding to use
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
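
How an extractor "figures out the encoding itself" is left to each implementation. As one possible approach, the sketch below honours a non-null hint and otherwise looks for a byte order mark. The class and method names are hypothetical, and both the BOM check and the UTF-8 default are assumptions for illustration, not the behaviour of any particular OpenCms extractor.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingHintSketch {

    /**
     * Resolves the charset to decode with: the hint if one is given,
     * otherwise a guess based on a leading byte order mark (BOM).
     */
    static Charset resolveCharset(byte[] content, String encodingHint) {
        if (encodingHint != null) {
            // The caller supplied a hint, so use it.
            return Charset.forName(encodingHint);
        }
        if (startsWith(content, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(content, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(content, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        // Assumption: default to UTF-8 when nothing can be detected.
        return StandardCharsets.UTF_8;
    }

    /** Returns true if the content begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] content, int... prefix) {
        if (content.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((content[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        System.out.println(resolveCharset(utf16, null));
    }
}
```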

extractText

I_CmsExtractionResult extractText(java.io.InputStream in)
                                  throws java.lang.Exception
Extracts the text and meta information from the document on the input stream.

The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to divine the encoding from the provided input stream automatically.

Delivers the same result as calling extractText(InputStream, String) with a null encoding.

Parameters:
in - the input stream for the document to extract the text from
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails

extractText

I_CmsExtractionResult extractText(java.io.InputStream in,
                                  java.lang.String encoding)
                                  throws java.lang.Exception
Extracts the text and meta information from the document on the input stream, using the specified content encoding.

The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.

Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
Returns:
the extracted text and meta information
Throws:
java.lang.Exception - if the text extraction fails
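
The InputStream and byte[] overloads accept the same document in two forms. One way an implementation might relate them is to buffer the stream into a byte array and then hand it to the byte[] variant. The helper below is a hypothetical sketch of that buffering step using only the JDK. Whether the shipped extractors work this way is not stated on this page.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToBytesSketch {

    /** Reads the stream to its end and returns everything that was read. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int count;
        // read() returns -1 once the end of the stream is reached.
        while ((count = in.read(buffer)) != -1) {
            out.write(buffer, 0, count);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = "document body".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(source)) {
            byte[] content = readFully(in);
            // content could now be passed to an extractText(byte[], String) call.
            System.out.println(content.length);
        }
    }
}
```

Buffering the whole document trades memory for simplicity, so a streaming parser may be preferable for very large files.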