ModeShape / MODE-2497

German character sharp S not handled correctly by query parser

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 4.4.0.Final
    • Affects Version/s: 4.1.0.Final
    • Component/s: Query
    • Labels: None

      Tests in TokenStreamTest.java that demonstrate the issue:

      @Test
      public void shouldMatchUpperCaseVersionOfßCharacterWhenCaseInsensitive() {
          // "ß" upper-cases to "SS", so the case-insensitive stream should consume it as "SS".
          content = "ß";
          makeCaseInsensitive();
          tokens.consume("SS");
          assertThat(tokens.hasNext(), is(false));
      }

      @Test
      public void shouldHandleTokensAfterßCharacterWhenCaseInsensitive() {
          // Tokens that follow "ß" must still be found at the correct position.
          content = "ß and";
          makeCaseInsensitive();
          tokens.consume(TokenStream.ANY_VALUE);
          tokens.consume("AND");
          assertThat(tokens.hasNext(), is(false));
      }
      

    Description

      When a JCR-SQL2 query contains a string with the German ß character, the query is not parsed correctly.

      Exception when handling request.: javax.jcr.query.InvalidQueryException: The JCR-SQL2 query "SELECT metadatanode.*, document.'jcr:created' FROM [tresorxml:element] AS metadatanode INNER JOIN [tresorxml:document] AS document ON ISDESCENDANTNODE(metadatanode, document) WHERE NAME(metadatanode) = 'xaip:metaDataSection' AND PATH(document) LIKE '/tresorxml:vault[5]/My Folders/?/%' AND DEPTH(document) = CAST(4 AS LONG) ORDER BY document.'jcr:created' DESC" is not well-formed: Unexpected token 'AND' at line 1, column 250
      

      The reason is that the tokeniser parses queries case-insensitively, and the JVM converts the single character ß to the two-character sequence SS when upper-casing (see e.g. http://www.the-interweb.com/serendipity/index.php?/archives/80-Converting-strings-to-upper-case-is-tricky.html ).
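
The conversion described above can be reproduced in isolation. The following is a minimal sketch, not ModeShape code; the class name `SharpSUpperCase` and the helper `upper` are hypothetical names chosen for illustration, and `Locale.ROOT` is used only to keep the result locale-independent.

```java
import java.util.Locale;

// Illustrative only: the class and method names are hypothetical, not part of ModeShape.
class SharpSUpperCase {
    // "\u00DF" is the sharp S (ß), written as an escape to avoid source-encoding issues.
    static final String SHARP_S = "\u00DF";

    // Upper-cases a string the same way a case-insensitive comparison typically would.
    static String upper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String upper = upper(SHARP_S);
        // One input character becomes two output characters.
        System.out.println(SHARP_S.length() + " -> " + upper.length()); // 1 -> 2
        System.out.println(upper.equals("SS"));                          // true
    }
}
```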

      As a result, the upper-case string is one character longer than the original for every ß it contains, so offsets computed against one string no longer line up with the other. This throws off token positions within the TokenStream class when case-insensitive tokenising is used.
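
The index drift can be seen with the input from the second test, "ß and". This is a simplified model of the problem, assuming token offsets are computed on the original input and then applied to an upper-cased copy; it does not reproduce the actual TokenStream internals, and `IndexDriftDemo` and `sliceFromUpperCasedCopy` are hypothetical names.

```java
import java.util.Locale;

// Simplified model of the bug; not the real TokenStream implementation.
class IndexDriftDemo {
    // Takes offsets computed against the original input and applies them
    // to a fully upper-cased copy, which may be longer than the original.
    static String sliceFromUpperCasedCopy(String original, int start, int end) {
        return original.toUpperCase(Locale.ROOT).substring(start, end);
    }

    public static void main(String[] args) {
        String content = "\u00DF and";        // "ß and"
        int start = content.indexOf("and");   // 2 in the original input
        int end = start + "and".length();     // 5 in the original input
        // The upper-cased copy is "SS AND", so the same offsets select " AN".
        System.out.println("[" + sliceFromUpperCasedCopy(content, start, end) + "]"); // [ AN]
    }
}
```

Because the slice no longer equals "AND", a parser working this way would report an unexpected token, which is consistent with the exception quoted above.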

      The solution is to override the match method in CaseInsensitiveToken so that it upper-cases only the current token when matching, rather than storing an upper-cased copy of the entire input string, whose indexes may differ from those of the original.
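
The idea behind this fix can be sketched as follows. This is an illustration under stated assumptions, not the actual ModeShape patch: the nested `Token` class, its fields, and the `matches` method are hypothetical stand-ins for CaseInsensitiveToken and its match method.

```java
import java.util.Locale;

// Hypothetical sketch of per-token upper-casing; names do not come from ModeShape.
class PerTokenMatchSketch {
    static final class Token {
        private final String input; // original, unmodified input
        private final int start;    // inclusive offset into the original input
        private final int end;      // exclusive offset into the original input

        Token(String input, int start, int end) {
            this.input = input;
            this.start = start;
            this.end = end;
        }

        // Upper-cases only this token's own slice, so offsets are always
        // applied to the string they were computed against.
        boolean matches(String expectedUpperCase) {
            return input.substring(start, end)
                        .toUpperCase(Locale.ROOT)
                        .equals(expectedUpperCase);
        }
    }

    public static void main(String[] args) {
        String content = "\u00DF and"; // "ß and"
        System.out.println(new Token(content, 0, 1).matches("SS"));  // true
        System.out.println(new Token(content, 2, 5).matches("AND")); // true
    }
}
```

Upper-casing per token costs a small allocation on each match, but it keeps every offset tied to the original input regardless of how many characters expand when upper-cased.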

          People

            Assignee: Unassigned
            Reporter: Daniel Kelleher (dankelleher_jira, Inactive)