Class A_CmsTextExtractor

    • Method Summary

      Modifier and Type Method Description
      protected void combineContentItem​(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,​java.lang.String> contentItems)
      Combines a meta information item extracted from the document with the main content buffer and also stores it as an individual item in the Map of content items.
      I_CmsExtractionResult extractText​(byte[] content)
      Extracts the text and meta information from the given binary document.
      I_CmsExtractionResult extractText​(byte[] content, java.lang.String encoding)
      Extracts the text and meta information from the given binary document, using the specified content encoding.
      I_CmsExtractionResult extractText​(java.io.InputStream in)
      Extracts the text and meta information from the document on the input stream.
      I_CmsExtractionResult extractText​(java.io.InputStream in, java.lang.String encoding)
      Extracts the text and meta information from the document on the input stream, using the specified content encoding.
      protected CmsExtractionResult extractText​(java.io.InputStream in, org.apache.tika.parser.Parser parser)
      Parses the given input stream with the provided parser and returns the result as a map of content items.
      protected java.lang.String removeControlChars​(java.lang.String content)
      Removes "unwanted" control chars from the given content.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Method Detail

      • extractText

        public I_CmsExtractionResult extractText​(byte[] content)
                                          throws java.lang.Exception
        Description copied from interface: I_CmsTextExtractor
        Extracts the text and meta information from the given binary document.

        The encoding of the document either is not required (the document type may have a single common default encoding), or the extractor is able to detect the encoding automatically from the provided byte array.

        Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) with a null encoding.

        Specified by:
        extractText in interface I_CmsTextExtractor
        Parameters:
        content - the binary content of the document to extract the text from
        Returns:
        the extracted text and meta information
        Throws:
        java.lang.Exception - if the text extraction fails
        See Also:
        I_CmsTextExtractor.extractText(byte[])
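
The relationship between the overloads described above (the single-argument call behaves like the two-argument call with a null encoding) can be sketched as follows. This is a minimal illustration, not the OpenCms source: `DelegatingExtractorSketch` and `PlainTextExtractorSketch` are hypothetical stand-ins, the methods return a plain `String` instead of `I_CmsExtractionResult`, and both the wrapping of the byte array in a stream and the UTF-8 fallback for a null encoding are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for an abstract extractor base class (not the OpenCms class).
abstract class DelegatingExtractorSketch {

    // No encoding given: behaves exactly like passing a null encoding.
    String extractText(byte[] content) throws Exception {
        return extractText(content, null);
    }

    // Assumption: the byte array is simply wrapped in a stream and handed on.
    String extractText(byte[] content, String encoding) throws Exception {
        return extractText(new ByteArrayInputStream(content), encoding);
    }

    // The only method a concrete extractor has to provide in this sketch.
    abstract String extractText(InputStream in, String encoding) throws Exception;
}

// Hypothetical concrete extractor for plain text documents.
final class PlainTextExtractorSketch extends DelegatingExtractorSketch {

    @Override
    String extractText(InputStream in, String encoding) throws IOException {
        // Assumption: fall back to UTF-8 when no encoding hint is supplied.
        Charset charset = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        return new String(in.readAllBytes(), charset);
    }
}
```

With this layout a subclass only decides how to read the stream, and all three entry points stay consistent with each other.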
      • extractText

        public I_CmsExtractionResult extractText​(byte[] content,
                                                 java.lang.String encoding)
                                          throws java.lang.Exception
        Description copied from interface: I_CmsTextExtractor
        Extracts the text and meta information from the given binary document, using the specified content encoding.

        The encoding is a hint for the text extractor. If the given value is null, the text extractor should try to determine the encoding itself.

        Specified by:
        extractText in interface I_CmsTextExtractor
        Parameters:
        content - the binary content of the document to extract the text from
        encoding - the encoding to use as a hint, or null to let the extractor determine it
        Returns:
        the extracted text and meta information
        Throws:
        java.lang.Exception - if the text extraction fails
        See Also:
        I_CmsTextExtractor.extractText(byte[], java.lang.String)
      • extractText

        public I_CmsExtractionResult extractText​(java.io.InputStream in,
                                                 java.lang.String encoding)
                                          throws java.lang.Exception
        Description copied from interface: I_CmsTextExtractor
        Extracts the text and meta information from the document on the input stream, using the specified content encoding.

        The encoding is a hint for the text extractor. If the given value is null, the text extractor should try to determine the encoding itself.

        Specified by:
        extractText in interface I_CmsTextExtractor
        Parameters:
        in - the input stream for the document to extract the text from
        encoding - the encoding to use as a hint, or null to let the extractor determine it
        Returns:
        the extracted text and meta information
        Throws:
        java.lang.Exception - if the text extraction fails
        See Also:
        I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
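
One way an extractor might treat the encoding as a hint is sketched below: an explicit encoding is used as given, and a null value triggers a simple detection step. The class `EncodingHintSketch` and its methods are hypothetical, and the detection strategy (checking only for a UTF-8 or UTF-16 byte order mark, then assuming UTF-8) is an assumption chosen for brevity. It is not how OpenCms or any particular parser detects encodings.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing "encoding as a hint" semantics.
final class EncodingHintSketch {

    // Decodes the stream using the hint, or a detected charset if the hint is null.
    static String decode(InputStream raw, String encodingHint) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        Charset charset = (encodingHint != null) ? Charset.forName(encodingHint) : sniffBom(in);
        return new String(in.readAllBytes(), charset);
    }

    // Looks at the first bytes for a byte order mark and consumes it if present.
    private static Charset sniffBom(BufferedInputStream in) throws IOException {
        in.mark(3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);
        in.reset();
        int b0 = n > 0 ? head[0] & 0xFF : -1;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            in.readNBytes(head, 0, 3); // skip the UTF-8 BOM
            return StandardCharsets.UTF_8;
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            in.readNBytes(head, 0, 2); // skip the UTF-16 big-endian BOM
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            in.readNBytes(head, 0, 2); // skip the UTF-16 little-endian BOM
            return StandardCharsets.UTF_16LE;
        }
        // Assumption: without a BOM, treat the content as UTF-8.
        return StandardCharsets.UTF_8;
    }
}
```

Note that in this sketch a non-null hint always wins, even if the content carries a byte order mark for a different encoding.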
      • combineContentItem

        protected void combineContentItem​(java.lang.String itemValue,
                                          java.lang.String itemKey,
                                          java.lang.StringBuffer content,
                                          java.util.Map<java.lang.String,​java.lang.String> contentItems)
        Combines a meta information item extracted from the document with the main content buffer and also stores it as an individual item in the Map of content items.

        Parameters:
        itemValue - the value of the item to store
        itemKey - the key in the Map of content items
        content - the buffer to which the content item is appended
        contentItems - the Map of individual content items
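
The behavior described for `combineContentItem` (append the item to the main buffer and also record it under its key) could look roughly like the sketch below. The class name `CombineContentItemSketch` is hypothetical, and the details are assumptions: blank values are skipped, values are trimmed, and a newline separates the item from any existing content. The actual OpenCms method may differ on each of these points.

```java
import java.util.Map;

// Hypothetical illustration of combining a meta item with the main content.
final class CombineContentItemSketch {

    static void combineContentItem(
        String itemValue,
        String itemKey,
        StringBuffer content,
        Map<String, String> contentItems) {

        // Assumption: missing or blank values are ignored entirely.
        if (itemValue == null || itemValue.trim().isEmpty()) {
            return;
        }
        String value = itemValue.trim();
        // Assumption: a newline separates the item from content already in the buffer.
        if (content.length() > 0) {
            content.append('\n');
        }
        content.append(value);
        // Keep the item retrievable on its own under the given key.
        contentItems.put(itemKey, value);
    }
}
```

Storing the value twice is deliberate in this sketch: the buffer supports full-text use of everything extracted, while the map keeps each meta item addressable by key.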
      • extractText

        protected CmsExtractionResult extractText​(java.io.InputStream in,
                                                  org.apache.tika.parser.Parser parser)
                                           throws java.lang.Exception
        Parses the given input stream with the provided parser and returns the result as a map of content items.

        Parameters:
        in - the input stream for the content to parse
        parser - the parser to use
        Returns:
        the result of the parsing as a map of content items
        Throws:
        java.lang.Exception - if parsing the input stream fails
      • removeControlChars

        protected java.lang.String removeControlChars​(java.lang.String content)
        Removes "unwanted" control chars from the given content.

        Parameters:
        content - the content to remove the unwanted control chars from
        Returns:
        the content with the unwanted control chars removed
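
The documentation does not say which control characters count as "unwanted". As a minimal sketch under one plausible assumption, the hypothetical `ControlCharSketch` below removes every ISO control character except tab, line feed, and carriage return, which are kept because they carry layout information. The actual OpenCms rule may be different.

```java
// Hypothetical illustration of stripping control characters from extracted text.
final class ControlCharSketch {

    static String removeControlChars(String content) {
        if (content == null || content.isEmpty()) {
            return content;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            // Assumption: tab, line feed and carriage return are "wanted".
            boolean layoutWhitespace = (c == '\t') || (c == '\n') || (c == '\r');
            if (layoutWhitespace || !Character.isISOControl(c)) {
                result.append(c);
            }
        }
        return result.toString();
    }
}
```

Filtering of this kind is useful because text extracted from binary formats often contains stray control bytes that can disturb downstream indexing or display.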