Index configuration

david
Hi,

One of my documents has a field whose type is Text but which contains HTML fragments (not a whole, valid HTML document; for instance, just a big <div> block containing a mix of HTML tags and content).

I want this field to be indexed (for full-text search purposes), but of course without the HTML markup tags being treated as 'normal' text...

How can I configure the Jackrabbit SearchIndex for this?

My last idea was to declare the field somewhere as text/html (jcr:mimeType), but I'm unable to add this property (violation...)...

Thanks.

--
David MARTIN

_______________________________________________
Hippo-cms7-user mailing list and forums
http://www.onehippo.org/cms7/support/forums.html

Re: Index configuration

Ard
On Tue, Feb 5, 2013 at 12:02 PM, David Martin <[hidden email]> wrote:

> I want this field to be indexed (for full-text search purposes), but of
> course without the HTML markup tags being treated as 'normal' text...
>
> How can I configure the Jackrabbit SearchIndex for this?

This is currently not possible: HTML tags are simply indexed along with
the text. In general, this has never been an issue. If it really is an
issue, we could improve things here. We could also add all HTML tags to
the list of stopwords, so they are ignored completely. You can also do
this yourself if you want.

Regards Ard


--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com

Re: Index configuration

david
Thanks Ard for your answer.
Hmm... not possible, right? Not exactly the answer I was hoping for, to be honest :)
Unfortunately, stop words are not enough in my case. For instance, a tag may have a 'class' attribute containing class names that can't be added to the stop word list, simply because it's not a finite list...
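
To illustrate the limitation described above, here is a minimal, self-contained Java sketch. It is not Jackrabbit's or Lucene's actual analyzer: the tokenizer, the class name StopwordSketch and the sample markup are assumptions made up for illustration. It only shows that filtering tag and attribute names as stop words still leaves attribute values, such as class names, in the list of indexed terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: a naive tokenizer with a stop word filter,
// standing in for whatever analyzer the repository really uses.
final class StopwordSketch {

    // Lowercases the text, splits it on anything that is not a letter or
    // digit, and drops empty tokens and stop words.
    static List<String> indexedTerms(String rawText, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : rawText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("div", "class"));
        String field = "<div class=\"promo-banner\">Spring sale</div>";
        // The tag and attribute names are filtered out, but the class name
        // parts "promo" and "banner" remain next to the real content.
        System.out.println(indexedTerms(field, stopWords));
        // prints: [promo, banner, spring, sale]
    }
}
```

Since class names are chosen freely by content authors, no fixed stop word list can cover them, which is the point made in the post.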

My idea now is that if I can't prevent the field from being indexed (which would be the best solution IMHO), maybe I can prevent the user from searching in it...
Is there any way to build a query like this:

final HstQuery hstQuery = manager.createQuery(scope, A.class, B.class, C.class);
final Filter filter = hstQuery.createFilter();
hstQuery.setFilter(filter);

final Filter fullTextFilter = hstQuery.createFilter();
// The line in question: "." should in fact be replaced with something
// meaning 'all but the @myns:htmlinside field'.
fullTextFilter.addContains(".", something);
filter.addOrFilter(fullTextFilter);
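
One possible way to approximate 'all but @myns:htmlinside' is to enumerate the properties that should be searchable and OR together one full-text constraint per property. The sketch below is plain Java that only builds a JCR XPath predicate string using jcr:contains(); it is not the HST Filter API. The class name, the property names myns:title and myns:summary, and the escaping rule (doubling apostrophes inside the string literal) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: builds "jcr:contains(@a, 'x') or jcr:contains(@b, 'x')"
// for an explicit list of searchable properties.
final class ContainsPredicateSketch {

    static String containsAnyOf(List<String> properties, String searchText) {
        // Assumption: an apostrophe inside the literal is escaped by doubling it.
        String literal = searchText.replace("'", "''");
        return properties.stream()
                .map(p -> "jcr:contains(@" + p + ", '" + literal + "')")
                .collect(Collectors.joining(" or "));
    }

    public static void main(String[] args) {
        // myns:title and myns:summary are made-up property names; the field
        // holding raw HTML (myns:htmlinside) is simply left out of the list.
        List<String> searchable = Arrays.asList("myns:title", "myns:summary");
        System.out.println(containsAnyOf(searchable, "sale"));
    }
}
```

With the HST API used in the post, the same idea would presumably translate into one addContains call per property scope instead of ".", combined with addOrFilter. This only stays manageable when the set of searchable properties is small and known in advance.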

David


On Tue, Feb 5, 2013 at 12:47 PM, Ard Schrijvers <[hidden email]> wrote:
> This is currently not possible: HTML tags are simply indexed along with
> the text. In general, this has never been an issue. If it really is an
> issue, we could improve things here. We could also add all HTML tags to
> the list of stopwords, so they are ignored completely. You can also do
> this yourself if you want.

Re: Index configuration

Bert Leunis
Hi David,

You are worried about the HTML tags in your Text field: if a user searches for "div", your document turns up as a result, but when you view the document on the site, the hit is unexplained. What Ard is trying to say is that you will have the same situation with the regular hippostd:html nodes. They contain HTML tags produced by the Xinha editor. Just try searching for "html": you get a lot of hits because of the <html> tag that is part of nearly every hippostd:html node.

Just excluding your special text field is not enough in your case.

With kind regards/Met vriendelijke groet,
Bert Leunis

On Tue, Feb 5, 2013 at 1:32 PM, David Martin <[hidden email]> wrote:
> Unfortunately, stop words are not enough in my case. For instance, a tag
> may have a 'class' attribute containing class names that can't be added
> to the stop word list, simply because it's not a finite list...
>
> My idea now is that if I can't prevent the field from being indexed
> (which would be the best solution IMHO), maybe I can prevent the user
> from searching in it...

Re: Index configuration

david
Thanks Bert. I now understand the issue... which is not specific to Text fields.
Isn't it possible to use Tika to process such fields, with the help of its HtmlParser? At least for hippostd:html.
I'm now trying to exclude these fields in the index configuration and to add some new, purely technical fields containing purified content (derived data), so that only those are indexed.

David


On Tue, Feb 5, 2013 at 1:56 PM, Bert Leunis <[hidden email]> wrote:
> You are worried about the HTML tags in your Text field: if a user
> searches for "div", your document turns up as a result, but when you view
> the document on the site, the hit is unexplained. What Ard is trying to
> say is that you will have the same situation with the regular
> hippostd:html nodes. They contain HTML tags produced by the Xinha editor.
>
> Just excluding your special text field is not enough in your case.

--
David MARTIN
Ippon Technologies


Re: Index configuration

Ard
On Tue, Feb 5, 2013 at 2:05 PM, David Martin <[hidden email]> wrote:
> Thanks Bert. I now understand the issue... which is not Text specific.
> Isn't it possible to use Tika to process such fields, with the help of the
> HtmlParser? At least for hippostd:html.

A long time ago I created an issue for pretty much the same problem.
I think we should still pick it up. I've never liked HTML tags
being indexed as text. Just marking some properties in the
indexing_configuration to be treated as HTML should be simple, and
extracting the text from HTML likewise.

Regards Ard

https://issues.onehippo.com/browse/REPO-201


Re: Index configuration

david
OK, the problem is now solved with a simple workaround.
Using a derived data function and a new field, I can now extract only the real content, without any HTML tags (using the Jsoup.clean method).
A modification to indexing_configuration.xml was also needed to exclude the 'rich' content field, simply by adding a 'nodetype' to 'excludefromnodescope'.

Hope this helps...
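
The cleaning step of this workaround can be sketched in plain Java. The post uses jsoup for it; the regex-based method below is a deliberately crude stand-in so that the example runs without extra libraries, and the class and method names are hypothetical. It does not decode entities or cope with malformed markup. With jsoup on the classpath, a call along the lines of Jsoup.clean(html, Safelist.none()) (Whitelist.none() in older jsoup releases) is presumably what the post refers to, but that exact call is an assumption.

```java
// Hypothetical stand-in for the jsoup-based cleaning described in the post:
// reduces an HTML fragment to the text that should end up in the index.
final class PlainTextSketch {

    static String stripTags(String htmlFragment) {
        // Replace every tag, including its attributes, with a space so that
        // words separated only by markup do not run together.
        String withoutTags = htmlFragment.replaceAll("<[^>]*>", " ");
        // Collapse the whitespace left behind by the removed tags.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String rich = "<div class=\"promo-banner\"><p>Hello <b>world</b></p></div>";
        System.out.println(stripTags(rich)); // prints: Hello world
    }
}
```

The cleaned string would then be stored in the new, derived field, while the original 'rich' field is kept out of the index as described above.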


On Tue, Feb 5, 2013 at 3:09 PM, Ard Schrijvers <[hidden email]> wrote:
> A long time ago I created an issue for pretty much the same problem.
> I think we should still pick it up. I've never liked HTML tags
> being indexed as text. Just marking some properties in the
> indexing_configuration to be treated as HTML should be simple, and
> extracting the text from HTML likewise.
>
> https://issues.onehippo.com/browse/REPO-201

Re: Index configuration

Bert Leunis
Thanks for sharing your solution, David!

With kind regards/Met vriendelijke groet,
Bert Leunis


On Tue, Feb 5, 2013 at 5:00 PM, David Martin <[hidden email]> wrote:
Ok, problem is now solved using a simple workaround.
Using a derived data function and a new field, I can now extract only the real content without any HTML tag (using JSoup.clean method).
A modification in indexing_configuration.xml was needed to exclude the 'rich' content field, by just adding a 'nodetype' to 'excludefromnodescope'.

If it can help...


On Tue, Feb 5, 2013 at 3:09 PM, Ard Schrijvers <[hidden email]> wrote:
On Tue, Feb 5, 2013 at 2:05 PM, David Martin <[hidden email]> wrote:
> Thanks Bert. I now understand the issue... which is not Text specific.
> Isn't it possible to use Tika to process such fields, with the help of the
> HtmlParser? At least for hippostd:html.

I've created a long time a issue that is pretty much the same problem.
I think we should still pick it up. I've never liked the html tags
being indexed as text. Just marking some properties in the
indexing_configuration to be treated as html should be simple, and
extracting text from html likewise

Regards Ard

https://issues.onehippo.com/browse/REPO-201

> I'm trying to exclude these fields from the index configuration, and add
> some new, technical only, fields, containing a purified content (derived
> data) and only let them be indexed.
>
> David
>
>
> On Tue, Feb 5, 2013 at 1:56 PM, Bert Leunis <[hidden email]> wrote:
>>
>> Hi David,
>>
>> You are worried about the html tags in your Text field. If the user
>> searches for "div" your document turns up as a result, but when you see the
>> document in the site, the hit is unexplained. What Ard tries to say is that
>> you will have the same situation with the regular hippostd:html nodes. They
>> contain html tags resulting from the xinha editor. Just try searching for
>> "html": you get a lot of hits because of the <html> tag that is part of
>> nearly every hippostd:html node.
>>
>> Just excluding your special text field is not enough in your case.
>>
>> With kind regards/Met vriendelijke groet,
>> Bert Leunis
>>
>> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
>> Boston - 1 Broadway, Cambridge, MA 02142
>>
>> US <a href="tel:%2B1%20877%20414%204776" value="+18774144776" target="_blank">+1 877 414 4776 (toll free)
>> Europe <a href="tel:%2B31%280%2920%20522%204466" value="+31205224466" target="_blank">+31(0)20 522 4466
>> www.onehippo.com
>>
>>
>> On Tue, Feb 5, 2013 at 1:32 PM, David Martin <[hidden email]> wrote:
>>>
>>> Thanks Ard for your answer.
>>> Hum... not possible, right? not exactly the answer I was waiting for to
>>> be honest :)
>>> Unfortunately stop words are not enough in my case. For instance a tag
>>> may have a 'class' attribute containing class names that can't be added to
>>> the stop words list, just because it's not a finite list...
>>>
>>> My idea now is if I can't prevent the field from being indexed (which is
>>> the best solution IMHO), maybe I can prevent the user from requesting things
>>> in it...
>>> If there any way to build a query like this:
>>>
>>> final HstQuery hstQuery = manager.createQuery(scope, A.class, B.class,
>>> C.class);
>>> final Filter filter = hstQuery.createFilter();
>>> hstQuery.setFilter(filter);
>>>
>>> final Filter fullTextFilter = hstQuery.createFilter();
>>> >>>>>> fullTextFilter.addContains(".", something); // in fact "." should
>>> >>>>>> be replaced with something meaning 'all but @myns:htmlinside field'
>>> filter.addOrFilter(fullTextFilter);
>>>
>>> David
>>>
>>>
>>> On Tue, Feb 5, 2013 at 12:47 PM, Ard Schrijvers
>>> <[hidden email]> wrote:
>>>>
>>>> On Tue, Feb 5, 2013 at 12:02 PM, David Martin <[hidden email]> wrote:
>>>> > Hi,
>>>> >
>>>> > One of my document has a field, which type is Text, but contains HTML
>>>> > fragments (not a whole, valid HTML document, but for instance, just a
>>>> > big
>>>> > <div> block containing a mix of HTML tags and content).
>>>> >
>>>> > I want this field for be indexed (for full text search purpose), but
>>>> > of
>>>> > course, without the HTML markup tags considered as 'normal' text...
>>>> >
>>>> > How can I configure Jackrabbit SearchIndex configuration ?
>>>> >
>>>> > My last idea was to declare somewhere the field as text/html
>>>> > (jcr:mimeType),
>>>> > but I'm unable to add this property (violation...)...
>>>>
>>>> This is currently not possible. html tags are also just indexed. In
>>>> general, it has never been an issue. If it is really an issue, we
>>>> could improve here. We could also add all html tags to the list of
>>>> stopwords, so they are ignored completely. You can also do this
>>>> yourself if you want
>>>>
>>>> Regards Ard
>>>>
>>>> >
>>>> > Thanks.
>>>> >
>>>> > --
>>>> > David MARTIN
>>>> >
>>>> > _______________________________________________
>>>> > Hippo-cms7-user mailing list and forums
>>>> > http://www.onehippo.org/cms7/support/forums.html
>>>>
>>>>
>>>>
>>>> --
>>>> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
>>>> Boston - 1 Broadway, Cambridge, MA 02142
>>>>
>>>> US <a href="tel:%2B1%20877%20414%204776" value="+18774144776" target="_blank">+1 877 414 4776 (toll free)
>>>> Europe <a href="tel:%2B31%280%2920%20522%204466" value="+31205224466" target="_blank">+31(0)20 522 4466
>>>> www.onehippo.com
>>>> _______________________________________________
>>>> Hippo-cms7-user mailing list and forums
>>>> http://www.onehippo.org/cms7/support/forums.html
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Hippo-cms7-user mailing list and forums
>>> http://www.onehippo.org/cms7/support/forums.html
>>
>>
>>
>> _______________________________________________
>> Hippo-cms7-user mailing list and forums
>> http://www.onehippo.org/cms7/support/forums.html
>
>
>
>
> --
> David MARTIN
> Ippon Technologies
>

--
David MARTIN
Ippon Technologies


Re: Index configuration

Ard
On Tue, Feb 5, 2013 at 5:04 PM, Bert Leunis <[hidden email]> wrote:
> Thanks for sharing your resolution David!

Nice job David!

Regards ard

>
> With kind regards/Met vriendelijke groet,
> Bert Leunis
>
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
>
>
> On Tue, Feb 5, 2013 at 5:00 PM, David Martin <[hidden email]> wrote:
>>
>> OK, the problem is now solved with a simple workaround.
>> Using a derived data function and a new field, I can now extract only the
>> real content, without any HTML tags (using the Jsoup.clean method).
>> A change to indexing_configuration.xml was needed to exclude the 'rich'
>> content field, by just adding a 'nodetype' entry to 'excludefromnodescope'.
>>
>> If it can help...
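
A minimal sketch of the text-extraction step described above, for readers who want to try the idea in isolation. The message used Jsoup.clean; this sketch swaps in the JDK's built-in Swing HTML parser as a stand-in so that it runs without extra dependencies. The class name HtmlTextExtractor and the method extractText are hypothetical, and wiring the result into a Hippo derived data function is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Hippo, Jackrabbit or Jsoup): reduces an
// HTML fragment to its text content so that only the text ends up in the
// derived field that gets indexed.
final class HtmlTextExtractor {

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for text runs only; tag names and attribute values
                // (for example class names) never reach this method.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // The JDK parser accepts fragments, not just whole documents.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }

    public static void main(String[] args) {
        String fragment = "<div class=\"teaser\"><p>Hello <b>world</b></p></div>";
        System.out.println(extractText(fragment));
    }
}
```

With Jsoup on the classpath, the body of extractText could presumably shrink to a single call such as Jsoup.parse(html).text(). The original 'rich' field still has to be excluded from the index as described in the message, otherwise the markup remains searchable.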
>>
>>
>> On Tue, Feb 5, 2013 at 3:09 PM, Ard Schrijvers <[hidden email]>
>> wrote:
>>>
>>> On Tue, Feb 5, 2013 at 2:05 PM, David Martin <[hidden email]> wrote:
>>> > Thanks Bert. I now understand the issue... which is not Text specific.
>>> > Isn't it possible to use Tika to process such fields, with the help of
>>> > the HtmlParser? At least for hippostd:html.
>>>
>>> A long time ago I created an issue for pretty much the same problem.
>>> I think we should still pick it up. I've never liked the html tags
>>> being indexed as text. Just marking some properties in the
>>> indexing_configuration to be treated as html should be simple, and
>>> extracting text from html likewise.
>>>
>>> Regards Ard
>>>
>>> https://issues.onehippo.com/browse/REPO-201
>>>
>>> > I'm trying to exclude these fields from the index configuration, and
>>> > add some new, technical-only fields containing purified content
>>> > (derived data), and let only those be indexed.
>>> >
>>> > David
>>> >
>>> >
>>> > On Tue, Feb 5, 2013 at 1:56 PM, Bert Leunis <[hidden email]>
>>> > wrote:
>>> >>
>>> >> Hi David,
>>> >>
>>> >> You are worried about the html tags in your Text field. If the user
>>> >> searches for "div" your document turns up as a result, but when you
>>> >> see the document in the site, the hit is unexplained. What Ard tries
>>> >> to say is that you will have the same situation with the regular
>>> >> hippostd:html nodes. They contain html tags resulting from the xinha
>>> >> editor. Just try searching for "html": you get a lot of hits because
>>> >> of the <html> tag that is part of nearly every hippostd:html node.
>>> >>
>>> >> Just excluding your special text field is not enough in your case.
>>> >>
>>> >> With kind regards/Met vriendelijke groet,
>>> >> Bert Leunis
>>> >>
>>> >>
>>> >> On Tue, Feb 5, 2013 at 1:32 PM, David Martin <[hidden email]> wrote:
>>> >>>
>>> >>> Thanks Ard for your answer.
>>> >>> Hmm... not possible, right? Not exactly the answer I was hoping for,
>>> >>> to be honest :)
>>> >>> Unfortunately stop words are not enough in my case. For instance, a
>>> >>> tag may have a 'class' attribute containing class names that can't
>>> >>> be added to the stop words list, simply because it's not a finite
>>> >>> list...
>>> >>>
>>> >>> My idea now is that if I can't prevent the field from being indexed
>>> >>> (which is the best solution IMHO), maybe I can prevent the user from
>>> >>> searching in it...
>>> >>> Is there any way to build a query like this:
>>> >>>
>>> >>> final HstQuery hstQuery = manager.createQuery(scope, A.class, B.class, C.class);
>>> >>> final Filter filter = hstQuery.createFilter();
>>> >>> hstQuery.setFilter(filter);
>>> >>>
>>> >>> final Filter fullTextFilter = hstQuery.createFilter();
>>> >>> // in fact "." should be replaced with something meaning
>>> >>> // 'all but the @myns:htmlinside field'
>>> >>> fullTextFilter.addContains(".", something);
>>> >>> filter.addOrFilter(fullTextFilter);
>>> >>>
>>> >>> David
>>> >>>
>>> >>>
>>> >>> On Tue, Feb 5, 2013 at 12:47 PM, Ard Schrijvers
>>> >>> <[hidden email]> wrote:
>>> >>>>
>>> >>>> On Tue, Feb 5, 2013 at 12:02 PM, David Martin <[hidden email]>
>>> >>>> wrote:
>>> >>>> > Hi,
>>> >>>> >
>>> >>>> > One of my documents has a field whose type is Text, but which
>>> >>>> > contains HTML fragments (not a whole, valid HTML document, but for
>>> >>>> > instance just a big <div> block containing a mix of HTML tags and
>>> >>>> > content).
>>> >>>> >
>>> >>>> > I want this field to be indexed (for full text search purposes),
>>> >>>> > but of course without the HTML markup tags being treated as
>>> >>>> > 'normal' text...
>>> >>>> >
>>> >>>> > How can I configure the Jackrabbit SearchIndex for this?
>>> >>>> >
>>> >>>> > My last idea was to declare the field somewhere as text/html
>>> >>>> > (jcr:mimeType), but I'm unable to add this property (violation...)...
>>> >>>>
>>> >>>> This is currently not possible. html tags are also just indexed. In
>>> >>>> general, it has never been an issue. If it is really an issue, we
>>> >>>> could improve here. We could also add all html tags to the list of
>>> >>>> stopwords, so they are ignored completely. You can also do this
>>> >>>> yourself if you want
>>> >>>>
>>> >>>> Regards Ard
>>> >>>>
>>> >>>> >
>>> >>>> > Thanks.
>>> >>>> >
>>> >>>> > --
>>> >>>> > David MARTIN
>>> >>>> >
>>> >
>>> > --
>>> > David MARTIN
>>> > Ippon Technologies
>>> >
>>
>> --
>> David MARTIN
>> Ippon Technologies
>>



--
Amsterdam - Oosteinde 11, 1017 WT Amsterdam
Boston - 1 Broadway, Cambridge, MA 02142

US +1 877 414 4776 (toll free)
Europe +31(0)20 522 4466
www.onehippo.com