A SPARQLing new autocomplete algorithm

Users of the ScOT vocabulary are able to find concepts either by browsing through the hierarchy, or by searching for concepts from a search field on the vocabulary pages’ upper-right corner.

Typing in the search field produces a drop-down-style list of suggested concepts, so user’s search is directed to exact matching terms. User’s input in the search field is auto-completed, based on a search of preferred , alternative, and hidden labels of concepts in the ScOT triple-store.

This dynamic updating of user’s input is achieved by AJAX, which sends XMLHttpRequests from the browser to the server and updates a section of the page (namely, the drop-down list) based on the server’s response, without having to refresh the whole page each time.

By default, the PoolParty software uses Apache’s Lucene information retrieval software library as the algorithm for querying the ScOT data and returning results based on user’s input in the search field (that is, the search is “powered by” Lucene). However, a recent release of PoolParty allows customization of the search algorithm by specifying custom SPARQL queries. That is, the search can be “powered by” ScOT’s own SPARQL end-point.

I’ve been working on developing a SPARQL query that will meet our specific preferences in relation to autocompletion of user’s search input. These are the parameters I am working from:

1)      The search text should be sought amongst the preferred-language skos:prefLabel, skos:altLabel, and skos:hiddenLabel elements of the ScOT data

2)      The relevance of all matching concepts should be determined by:

  1. Left-stemming: The distance of the search text from the left-hand side of the label in which the search text is found. That is, labels beginning with the search text are given preference over labels in which the search text is found somewhere within the label. Thus, the label “Achievement” will be rated more highly than “Machinery” for the search text “ach”
  2.  Proportional match: Within labels with equal value on the left-stemming rule, preference should be given to shorter labels. This not only produces a pleasant, tree-like visual effect, but gives preference to labels with a tighter lexical match to the search text. That is, the search text makes up a greater proportion of the total length of the label. For the search text “ach”, the label “Achievement” will be rated more highly than “Achievement testing”

3)      The URI and skos:prefLabel of the top 20 matches should be returned

I developed the following SPARQL query to meet these criteria (the search text “ach” is used in this example):

PREFIX :<http://www.w3.org/2004/02/skos/core#>

SELECT DISTINCT  ?uri ?label

WHERE  {?uri a :Concept.

?uri  :prefLabel ?label.

FILTER (lang(?label) = “en”).

?uri  ?b ?lab.

FILTER (?b= :prefLabel || ?b=:altLabel || ?b= :hiddenLabel)

FILTER regex (str($lab), “ach”, ‘i’)

FILTER (lang (?lab) = “en”).

BIND (STRLEN(STRBEFORE (str(?lab), “ach”)) AS ?place).

BIND (STRLEN(STR(?lab)) AS ?len)}

ORDER BY ?place ?len ?lab

LIMIT 20

Let’s do a walkthrough of this query:

  • Line 1: Specifies that the SKOS namespace is the default namespace for this query
  • Line 2: The data to be returned is the concept’s URI (as defined on Line 3) and skos:prefLabel (as defined on Line 4)
  • Line 3: Selects all elements defined as a skos:Concept
  • Line 4: Include the skos:prefLabel of those concepts in the query
  • Line 5: Filter out any results with a skos:prefLabel that is not in English
  • Line 6 and 7: Includes in the query all labels associated with the concept, whether they are skos:prefLabels, skos:altLabels, or skos:hiddenLabels
  • Line 8 and 9: Filter out of the query any results whose labels (from Line 6 and 7) do not contain the search text (which is case-insensitive) or are not in English
  • Line 10: Create a new variable representing the number of characters in the label that occur to the left of the search text (this is for evaluating left-stemming)
  • Line 11: Create a new variable representing the length of the label (this is for evaluating the proportional match)
  • Line 12: Order the remaining results in order of: the variable defined in Line 10 (left-stemming), the variable defined in Line 11 (proportional match), and the label itself (alphabetical ordering of the matching label)
  • Line 13: Of these results, return only the first 20 (these will be the top-matching concepts because of the ordering performed in Line 12)

So all the matching and ordering of results is done against a label – which may or may not be the preferred label – but the final result set consists of the URI and the preferred label.

I’ve encoded the query into this URL, so you can perform it dynamically and check out the results yourself.

Remember that this query is not ScOT specific — it will provide the same functionality on any SKOS-based vocabulary.

Out of interest, before SPARQL 1.1 came along, my draft autocomplete query was very different. Keywords such as BIND, STRLEN, STRBEFORE and AS were not available, so none for the ranking and ordering capabilities could be performed within SPARQL. Here is my much-more-basic SPARQL 1.0 autocomplete query:

PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
SELECT ?prefLabel ?URI
WHERE
{
?URI a skos:Concept.
?URI ?p ?label.
?URI skos:prefLabel ?prefLabel.
FILTER regex(str(?label), ‘dire’, ‘i’).
FILTER (?p = skos:prefLabel || ?p = skos:hiddenLabel || ?p = skos:altLabel).
FILTER (lang(?label) = “en”).
FILTER (lang(?prefLabel) = “en”).
}
ORDER BY ?label

All it does is returns the URI and English-language preferred label of all concepts whose English-language prefLabel, altLabel or hiddenLabel contain the search string. Results are orders alphabetically according to the matching label, but all ranking and limiting of results is left to the downstream system.

Leave a Reply

Your email address will not be published. Required fields are marked *