search-index

A persistent, network resilient, full text search library for the browser and Node.js

View the Project on GitHub fergiemcdowall/search-index

API Documentation for search-index

(Convention: it is assumed here that the search-index module is always assigned to the variable si, but you can of course assign it to whatever you want)

Module API

Importing and requiring

This module can be invoked with import and/or require depending on your environment:

import si from 'search-index'

or

const si = require('search-index')

Instantiating an index

Once the search-index module is assigned to a variable, you can instantiate an index by calling that variable, which returns a Promise:

const idx = await si(options)

When instantiated in a browser, search-index uses IndexedDB as its keystore by default; when instantiated in Node.js, it uses LevelDB. search-index can also use other keystores via the db parameter.

si(options)

si(options) returns a Promise that resolves to a search index.

options is an object that can contain the following properties:

Name | Type | Default | Description
caseSensitive | boolean | false | If true, case is preserved (so ‘BaNaNa’ != ‘banana’); if false, text matching is not case sensitive
db | abstract-leveldown store | leveldown | The underlying data store. If you want to run search-index on a different backend (for example Redis or Postgres), you can pass the appropriate abstract-leveldown compatible store
cacheLength | Number | 1000 | Length of the LRU cache. A bigger number gives faster reads but uses more memory. The cache is emptied after each write.
name | String | 'fii' | Name of the index. Corresponds to a physical folder on a filesystem (the default for Node.js) or a namespace in a database (the default for the web is IndexedDB), depending on which backend you use
tokenAppend | String | '#' | The string used to separate language tokens from scores in the underlying index. Should have a higher sort value than all text characters that are stored in the index. However, lower values are more platform independent (a consideration when replicating indices into web browsers, for instance)
stopwords | Array | [] | A list of words to be ignored when indexing and querying
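
As a rough illustration of how these defaults combine with user-supplied options, the sketch below merges a partial options object over the defaults listed in the table. The helper withDefaults and the index name 'fruit-index' are hypothetical and are not part of search-index; the library applies its own defaults internally, so this is only a way to reason about the effective configuration.

```javascript
// Defaults as documented in the options table above (db is omitted because
// its default depends on the environment).
const DOCUMENTED_DEFAULTS = {
  caseSensitive: false,
  cacheLength: 1000,
  name: 'fii',
  tokenAppend: '#',
  stopwords: []
}

// Hypothetical helper: returns the options that would be in effect.
const withDefaults = (options = {}) => ({ ...DOCUMENTED_DEFAULTS, ...options })

const effective = withDefaults({ name: 'fruit-index', stopwords: ['a', 'the'] })
console.log(effective)

// In real code the partial object would be passed straight to the module:
// const idx = await si({ name: 'fruit-index', stopwords: ['a', 'the'] })
```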

Index API

For brevity, this document assumes that a search index has been initialized in such a way that the functions below are available as variables:

const { INDEX, QUERY, UPDATE /* etc. */ } = await si()

It may be helpful to check out the tests for more examples.

Tokens

search-index is a text-oriented inverted index. This means that documents are retrieved by passing text tokens that they contain into queries. There are various ways to express tokens:

Find anywhere

'<token value>'

Example:

'banana'

Find in named field or fields

'<field name>:<token value>'

Example:

'fruit:banana'

// (can also be expressed as ->)
{
  FIELD: 'fruit',
  VALUE: 'banana'
}
// Find in two or more specified fields:
{
  FIELD: [ 'fruit', 'description' ], // array of field names
  VALUE: 'banana'
}
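
To make the equivalence between the string shorthand and the object form concrete, here is a minimal sketch of a conversion between the two. The helper toTokenObject is hypothetical (search-index accepts both forms directly, so nothing like it is needed in practice), and it assumes that the first colon separates the field name from the value.

```javascript
// Hypothetical helper: converts '<field name>:<token value>' into the
// equivalent { FIELD, VALUE } object. Strings without a colon are returned
// unchanged, since they mean "find anywhere".
const toTokenObject = token => {
  const separator = token.indexOf(':')
  if (separator === -1) return token
  return {
    FIELD: token.slice(0, separator),
    VALUE: token.slice(separator + 1)
  }
}

console.log(toTokenObject('fruit:banana')) // { FIELD: 'fruit', VALUE: 'banana' }
console.log(toTokenObject('banana'))       // banana
```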

Find within a range

{
  FIELD: fieldName,
  VALUE: {
    GTE: gte,        // greater than or equal to
    LTE: lte         // less than or equal to
  }
}

Example (get all fruits beginning with ‘a’, ‘b’ or ‘c’):

// this token range would capture 'banana'
{
  FIELD: 'fruit',
  VALUE: {
    GTE: 'a',
    LTE: 'c'
  }
}

Find where a field exists

// Find all documents that contain a 'price' field
{
  FIELD: 'price'
}

Create a tokenization pipeline when querying

Use the PIPELINE option of QUERY to alter tokens on their way into a query.

ALL_DOCUMENTS

See also DOCUMENTS

// Return all documents from index.
const documents = await ALL_DOCUMENTS(limit)
// "limit" is the maximum number of documents to be returned

BUCKETS

// Return the IDs of documents for each given token filtered by the
// query result
const buckets = await BUCKETS(token1, token2, ...)

CREATED

// find out when index was first created
const timestamp = await CREATED()

DELETE

NOTE: for an index to support deletion, documents must be indexed with storeVectors set to true

See also FLUSH

// Delete documents from the index
const result = await DELETE(id1, id2, id3 /*...*/)

DICTIONARY

See also DISTINCT

// Return each available field value for the given token space.
const dictionary = await DICTIONARY(token)

DISTINCT

See also DICTIONARY

// Return distinct field values from index
const distinct = await DISTINCT(token)

DOCUMENTS

See also ALL_DOCUMENTS

// Return named documents from index.
const documents = await DOCUMENTS(id1, id2, id3 /* ... */)

DOCUMENT_COUNT

// returns the total number of documents in the index
const totalDocs = await DOCUMENT_COUNT()

EXPORT

// creates a backup/export of an index
const indexExport = await EXPORT()

FACETS

// Return document ids for each distinct field/value combination for
// the given token space.
const facets = await FACETS(token)

FIELDS

// get every document field name that has been indexed:
const fields = await FIELDS()

FLUSH

// Delete everything and start again (including creation metadata)
await FLUSH()

IMPORT

// creates an index from a backup/export
await IMPORT(index)

INDEX

INDEX points to the underlying instance of fergies-inverted-index.

MAX

// get the maximum/last value of the given token space
const max = await MAX(token)

MIN

// get the minimum/first value of the given token space
const min = await MIN(token)

PUT

// Put documents into the index
const result = await PUT(documents, options)
// "result" shows the success or otherwise of the insertion
// "documents" is an Array of javascript Objects.
// "options" is an Object that contains indexing options

If any document does not contain an _id field, then one will be generated and assigned.

options is an optional object that can contain the following values. These values can also be set when initialising the index rather than in every PUT:

Name | Type | Default | Description
caseSensitive | boolean | false | If true, case is preserved (so ‘BaNaNa’ != ‘banana’); if false, text matching is not case sensitive
ngrams | object | { lengths: [ 1 ], join: ' ', fields: undefined } | An object that describes ngrams. See ngraminator for how to specify ngrams
replace | object | { fields: [], values: {} } | fields is an array that specifies the fields where replacements will happen; values is an object that maps a token to the tokens to be swapped in, for example: { values: { sheep: [ 'animal', 'livestock' ] } }
skipField | Array | [] | These fields will not be searchable, but they will still be stored
stopwords | Array | [] | A list of words to be ignored when indexing
storeRawDocs | boolean | true | Whether to store the raw document or not. In many cases it may be desirable to store it externally, or to skip storing when indexing if it is going to be updated directly later on
storeVectors | boolean | false | When true, documents will be deletable and overwritable, but will take up more space on disk
tokenizationPipeline | Array | [ SPLIT, SKIP, LOWCASE, REPLACE, NGRAMS, STOPWORDS, SCORE_TERM_FREQUENCY ] | Tokenisation pipeline. Stages can be added and reordered
tokenSplitRegex | RegExp | /[\p{L}\d]+/gu | The regular expression that splits strings into tokens

Tokenization pipeline when indexing

Every field of every document that is indexed is passed through the tokenization pipeline. The pipeline is a sequence of stages applied to the field in order, with the output of each stage providing the input for the next.

The default tokenization pipeline looks like this:

tokenizer: (tokens, field, ops) =>
  SPLIT([tokens, field, ops])
    .then(SKIP)
    .then(LOWCASE)
    .then(REPLACE)
    .then(NGRAMS)
    .then(STOPWORDS)
    .then(SCORE_TERM_FREQUENCY)
    .then(([tokens, field, ops]) => tokens)
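
To show how data flows from stage to stage, here is a minimal sketch that runs three stand-in stages over one field. The functions split, lowcase, stopwords and runStages are hypothetical simplifications written for this example, not the library's own implementations, and they are synchronous, whereas the real stages are chained with .then.

```javascript
// Stand-in stages (NOT the library's implementations). Each one takes
// [tokens, field, ops] and returns the same shape.
const split = ([text, field, ops]) =>
  [text.match(ops.tokenSplitRegex) || [], field, ops]
const lowcase = ([tokens, field, ops]) =>
  [ops.caseSensitive ? tokens : tokens.map(t => t.toLowerCase()), field, ops]
const stopwords = ([tokens, field, ops]) =>
  [tokens.filter(t => !ops.stopwords.includes(t)), field, ops]

// Feed the output of each stage into the next one.
const runStages = (stages, input) =>
  stages.reduce((acc, stage) => stage(acc), input)

const ops = {
  caseSensitive: false,
  stopwords: ['the'],
  tokenSplitRegex: /[\p{L}\d]+/gu
}
const [tokens] = runStages(
  [split, lowcase, stopwords],
  ['The quick brown Fox', 'title', ops]
)
console.log(tokens) // [ 'quick', 'brown', 'fox' ]
```

Note that the first stage receives the field as a string and every later stage receives an array of tokens, which is why the splitting stage has to come first.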

Reorder pipeline

Example: reorder the pipeline to remove stopwords before creating ngrams:

const { PUT, TOKENIZATION_PIPELINE_STAGES } = await si({
  name: 'pipeline-test'
})
await PUT(docs, {
  tokenizer: (tokens, field, ops) =>
    TOKENIZATION_PIPELINE_STAGES.SPLIT([tokens, field, ops])
      .then(TOKENIZATION_PIPELINE_STAGES.SKIP)
      .then(TOKENIZATION_PIPELINE_STAGES.LOWCASE)
      .then(TOKENIZATION_PIPELINE_STAGES.REPLACE)
      .then(TOKENIZATION_PIPELINE_STAGES.STOPWORDS) // <-- order switched
      .then(TOKENIZATION_PIPELINE_STAGES.NGRAMS)    // <-- order switched
      .then(TOKENIZATION_PIPELINE_STAGES.SCORE_TERM_FREQUENCY)
      .then(([tokens, field, ops]) => tokens)
})

Create custom pipeline stages

A custom pipeline stage must be in the following form:

// take tokens (Array of tokens), field (the field name), and options,
// and then return the (possibly modified) tokens together with field and options
([ tokens, field, ops ]) => {
  // some processing here...
  return [ tokens, field, ops ]
}
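
As a concrete instance of that form, the sketch below defines a stage that drops very short tokens. The name DROP_SHORT_TOKENS and the threshold MIN_LENGTH are hypothetical and chosen only for illustration. Because a stage is a plain function, it can be exercised on its own before it is chained into a pipeline with .then.

```javascript
// Hypothetical custom stage: drop tokens shorter than MIN_LENGTH characters.
const MIN_LENGTH = 3
const DROP_SHORT_TOKENS = ([tokens, field, ops]) => [
  tokens.filter(t => t.length >= MIN_LENGTH),
  field,
  ops
]

// Call the stage directly with the same [tokens, field, ops] shape that the
// pipeline would pass to it.
const [kept, fieldName] = DROP_SHORT_TOKENS([
  ['a', 'ripe', 'banana', 'is', 'ok'],
  'description',
  {}
])
console.log(kept)      // [ 'ripe', 'banana' ]
console.log(fieldName) // description
```

A stage like this expects an array of string tokens, so it presumably belongs after SPLIT and before SCORE_TERM_FREQUENCY, which is where the custom stages in this section are placed.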

Example: Normalize text characters:

const { PUT, TOKENIZATION_PIPELINE_STAGES } = await si({
  name: 'pipeline-test'
})
await PUT(docs, {
  tokenizer: (tokens, field, ops) =>
    TOKENIZATION_PIPELINE_STAGES.SPLIT([tokens, field, ops])
      .then(TOKENIZATION_PIPELINE_STAGES.SKIP)
      .then(TOKENIZATION_PIPELINE_STAGES.LOWCASE)
      .then(TOKENIZATION_PIPELINE_STAGES.REPLACE)
      .then(TOKENIZATION_PIPELINE_STAGES.NGRAMS)
      .then(TOKENIZATION_PIPELINE_STAGES.STOPWORDS)
      // björn -> bjorn, allé -> alle, etc.
      .then(([tokens, field, ops]) => [
        tokens.map(t => t.normalize("NFD").replace(/[\u0300-\u036f]/g, "")),
        field,
        ops
      ])
      .then(TOKENIZATION_PIPELINE_STAGES.SCORE_TERM_FREQUENCY)
      .then(([tokens, field, ops]) => tokens)
})
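
The normalization stage used above can also be written as a named function and checked in isolation. This is a minimal sketch; the name STRIP_DIACRITICS is hypothetical.

```javascript
// The diacritic-stripping stage from the example above, as a named function.
const STRIP_DIACRITICS = ([tokens, field, ops]) => [
  // NFD splits each accented letter into a base letter plus combining marks,
  // and the regex then removes the combining marks (U+0300 to U+036F).
  tokens.map(t => t.normalize('NFD').replace(/[\u0300-\u036f]/g, '')),
  field,
  ops
]

const [normalized] = STRIP_DIACRITICS([['björn', 'allé', 'banana'], 'name', {}])
console.log(normalized) // [ 'bjorn', 'alle', 'banana' ]
```

Letters that have no canonical decomposition, such as 'ø', pass through this stage unchanged and would need an explicit replacement.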

Example: stemmer:

const stemmer = require('stemmer')
const { PUT, TOKENIZATION_PIPELINE_STAGES } = await si({
  name: 'pipeline-test'
})
await PUT(docs, {
  tokenizer: (tokens, field, ops) =>
    TOKENIZATION_PIPELINE_STAGES.SPLIT([tokens, field, ops])
      .then(TOKENIZATION_PIPELINE_STAGES.SKIP)
      .then(TOKENIZATION_PIPELINE_STAGES.LOWCASE)
      .then(TOKENIZATION_PIPELINE_STAGES.REPLACE)
      .then(TOKENIZATION_PIPELINE_STAGES.NGRAMS)
      .then(TOKENIZATION_PIPELINE_STAGES.STOPWORDS)
      // stem each token, for example: running -> run
      .then(([tokens, field, ops]) => [
        tokens.map(stemmer),
        field,
        ops
      ])
      .then(TOKENIZATION_PIPELINE_STAGES.SCORE_TERM_FREQUENCY)
      .then(([tokens, field, ops]) => tokens)
})

PUT_RAW

// Put raw documents into the index
const result = await PUT_RAW(rawDocuments)
// "result" shows the success or otherwise of the insertion
// "rawDocuments" is an Array of javascript Objects that must
// contain an _id field

PUT_RAW writes raw documents to the index. Raw documents are the documents that the index returns. Use raw documents when the documents that are indexed are not the same as the ones that you want the index to return. This can be useful if you want documents to be retrievable for terms that don't appear in the actual document. It can also be useful if you want to store stripped-down versions of the document in the index in order to save space.

NOTE: if the documents that the index returns are very different to the corresponding documents that are indexed, it may make sense to set storeRawDocs: false when indexing (making indexing slightly faster), and instead add them with PUT_RAW afterwards.
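
One way to prepare data for that workflow is sketched below: a single set of source records is turned into stripped-down documents for indexing and full documents for returning. The records and the variable names are hypothetical, and the calls to PUT and PUT_RAW are shown only as comments because they need an initialised index.

```javascript
// Hypothetical source records: only "title" should be searchable, but the
// index should return the full record.
const records = [
  { _id: 'a1', title: 'Ripe bananas', body: 'A long description', price: 10 },
  { _id: 'b2', title: 'Green apples', body: 'Another long description', price: 12 }
]

// Stripped-down documents to index (pass to PUT with storeRawDocs: false).
const docsToIndex = records.map(({ _id, title }) => ({ _id, title }))

// Raw documents for the index to return (pass to PUT_RAW). Each keeps its _id
// so that it lines up with the indexed document.
const rawDocuments = records.map(record => ({ ...record }))

console.log(docsToIndex)

// With an initialised index this would be followed by:
// await PUT(docsToIndex, { storeRawDocs: false })
// await PUT_RAW(rawDocuments)
```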

QUERY

Running queries

QUERY is a function that allows you to run queries on the search index. It is called with a query object and returns a Promise:

const results = await QUERY(query, options)

options is an optional object that can contain the following properties:

Name | Type | Default | Description
BUCKETS | Array | [] | Aggregate on user defined buckets
DOCUMENTS | boolean | false | If true, return the entire document; otherwise return a reference to the document
FACETS | Array | [] | Aggregate on fields in the index
PAGE | object | { NUMBER: 0, SIZE: 20 } | Pagination
PIPELINE | object | token => new Promise(resolve => resolve(token)) | Query tokenization pipeline
SCORE | String | 'TFIDF' | Calculate a value per document
SORT | object | { TYPE: 'NUMERIC', DIRECTION: 'DESCENDING', FIELD: '_score' } | Sort documents
WEIGHT | Array | [] | Weight fields and/or values

Returning references or documents

QUERY can return both references to documents and the documents themselves.

References are returned by default. To return documents, pass the DOCUMENTS option:

    const results = await QUERY(query, { DOCUMENTS: true })

Nesting query verbs

Query verbs can be nested to create powerful expressions:

// Example: AND with a nested OR with a nested AND
{
  AND: [ token1, token2, {
    OR: [ token3, {
      AND: [ token4, token5 ]
    }]
  }]
}
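
To make the meaning of a nested expression concrete, the sketch below checks whether the tokens of a single document satisfy a query of this shape. The function matches and the sample tokens are hypothetical; this only illustrates the boolean logic and is not how search-index evaluates queries internally.

```javascript
// Illustration only: decide whether one document's set of tokens satisfies a
// nested query made of strings, AND and OR.
const matches = (query, docTokens) => {
  if (typeof query === 'string') return docTokens.has(query)
  if (query.AND) return query.AND.every(q => matches(q, docTokens))
  if (query.OR) return query.OR.some(q => matches(q, docTokens))
  throw new Error('Unsupported query: ' + JSON.stringify(query))
}

// AND with a nested OR with a nested AND
const query = {
  AND: ['fruit', 'yellow', {
    OR: ['banana', {
      AND: ['lemon', 'organic']
    }]
  }]
}

console.log(matches(query, new Set(['fruit', 'yellow', 'banana'])))           // true
console.log(matches(query, new Set(['fruit', 'yellow', 'lemon', 'organic']))) // true
console.log(matches(query, new Set(['fruit', 'yellow', 'lemon'])))            // false
```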

Manipulating result sets

Results can be scored, sorted and paginated with SCORE, SORT and PAGE:

// Example: get the second page of documents ordered by price
QUERY({
  FIELD: 'price'           // Select all documents that have a 'price'
}, {
  SCORE: 'SUM',            // Score on the sum of the price field
  SORT: {
    TYPE: 'NUMERIC',       // sort numerically, not alphabetically
    DIRECTION: 'ASCENDING' // cheapest first
                           // (SORT will sort on _score by default, but can
                           // optionally sort on a field specified by FIELD
                           // that is present in _match)
  },
  PAGE: {
    NUMBER: 1,             // '1' is the second page (pages counted from '0')
    SIZE: 20               // 20 results per page
  }
})

Query options

BUCKETS

See also BUCKETS

// Return the IDs of documents for each given token filtered by the
// query result
{
  BUCKETS: [ token1, token2, /* ... */ ]
}

DOCUMENTS

// Returns full documents instead of just metadata.
{
  DOCUMENTS: true
}

FACETS

See also FACETS

// Return document ids for each distinct field/value combination for
// the given token space, filtered by the query result.
{
  FACETS: token
}

PAGE

// show a single page of the result set
{
  PAGE: {
    NUMBER: pageNumber, // to count from the end of the result set
                        // use negative numbers
    SIZE: pageSize
  }
}
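
The sketch below illustrates how NUMBER and SIZE could select a window of results, including negative page numbers. It assumes that a page is a plain slice of the sorted result array; the function page is a hypothetical stand-in written for this example, not the library's code.

```javascript
// Assumed semantics: a page is a plain slice of the sorted result array.
const page = (results, { NUMBER = 0, SIZE = 20 } = {}) => {
  const start = NUMBER * SIZE
  // When start + SIZE is 0 (the last page, counted from the end) the slice
  // must run to the end of the array, so the end index is left undefined.
  const end = start + SIZE || undefined
  return results.slice(start, end)
}

const results = ['a', 'b', 'c', 'd', 'e']
console.log(page(results, { NUMBER: 0, SIZE: 2 }))  // [ 'a', 'b' ]
console.log(page(results, { NUMBER: 1, SIZE: 2 }))  // [ 'c', 'd' ]
console.log(page(results, { NUMBER: -1, SIZE: 2 })) // [ 'd', 'e' ]
console.log(page(results, { NUMBER: -2, SIZE: 2 })) // [ 'b', 'c' ]
```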

PIPELINE

// Alter a token on the way into a query
{
  PIPELINE: token =>
    new Promise(resolve => {
      // swap out all "ø" with "o"
      token.VALUE.GTE = token.VALUE.GTE.replace(/ø/g, 'o')
      token.VALUE.LTE = token.VALUE.LTE.replace(/ø/g, 'o')
      return resolve(token)
    })
}
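
A variant of the same idea that does not mutate the incoming token is sketched below. The helpers pipelineFrom and foldO are hypothetical, and the sketch assumes, as the example above does, that each token arrives with string values in VALUE.GTE and VALUE.LTE.

```javascript
// Hypothetical helper: build a PIPELINE function from a plain string mapper.
// Assumes that each token arrives with string values in VALUE.GTE and
// VALUE.LTE.
const pipelineFrom = mapValue => token =>
  Promise.resolve({
    ...token,
    VALUE: {
      GTE: mapValue(token.VALUE.GTE),
      LTE: mapValue(token.VALUE.LTE)
    }
  })

// swap out all "ø" with "o"
const foldO = value => value.replace(/ø/g, 'o')
const queryOptions = { PIPELINE: pipelineFrom(foldO) }

queryOptions
  .PIPELINE({ FIELD: 'name', VALUE: { GTE: 'bjørn', LTE: 'bjørn' } })
  .then(token => console.log(token.VALUE)) // { GTE: 'bjorn', LTE: 'bjorn' }
```

Keeping the string mapper separate from the Promise wrapper makes the mapping easy to test on its own.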

SCORE

// choose how the score of each document is calculated
{
  SCORE: scoreType // can be 'TFIDF', 'SUM', 'PRODUCT' or 'CONCAT'
}
}

SORT

SORT will sort on _score by default, or any field in _match (specified by the FIELD parameter). Therefore, if the FIELD parameter is specified, then that field must be present in the query. So, for example, if you want to sort on “price”, you have to include “price” in the query in order for it to appear in _match and therefore be available to sort on.

If performance is not your primary concern, it is also possible to use DOCUMENTS and then sort using JavaScript’s sort function.

// Sorts result by _score, or a field in _match
{
  SORT: {
    TYPE: type,              // can be 'NUMERIC' (default) or 'ALPHABETIC'
    DIRECTION: direction,    // can be 'ASCENDING' or 'DESCENDING' (default)
    FIELD: field             // field to sort on (defaults to _score)
  }
}
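
For the alternative mentioned above (returning documents and sorting them in JavaScript), a minimal sketch follows. The shape of the result object, with a RESULT array whose entries carry the document in _doc, is an assumption that this document does not confirm, and the sample data is hypothetical.

```javascript
// Assumed result shape (not confirmed by this document): QUERY called with
// DOCUMENTS: true resolves to { RESULT: [ { _id, _doc, ... }, ... ] }.
const queryResult = {
  RESULT: [
    { _id: 'a', _doc: { fruit: 'banana', price: 30 } },
    { _id: 'b', _doc: { fruit: 'apple', price: 10 } },
    { _id: 'c', _doc: { fruit: 'cherry', price: 20 } }
  ]
}

// Copy before sorting, because Array.prototype.sort sorts in place.
const byPriceAscending = [...queryResult.RESULT].sort(
  (a, b) => a._doc.price - b._doc.price
)

console.log(byPriceAscending.map(r => r._id)) // [ 'b', 'c', 'a' ]
```

Unlike SORT, this approach does not require the sort field to be part of the query, at the cost of fetching whole documents first.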

WEIGHT

// Weights fields and/or values
{
  WEIGHT: [{
    FIELD: fieldName,     // Name of field (matches all fields if not present)
    VALUE: fieldValue,    // Value of field (matches all values if not present)
    WEIGHT: weight        // A numeric factor that weights the field/value
  }, /* ... more weights here if required... */ ]
}

Query verbs

ALL_DOCUMENTS

// returns all documents. Use PAGE to limit how many you see
{
  ALL_DOCUMENTS: true
}

AND

// Boolean AND: Return results that contain all tokens
{
  AND: [ token1, token2, /* ... */ ]
}

NOT

// Boolean NOT: Return results that match INCLUDE but not EXCLUDE
{
  INCLUDE: queryExpression1,
  EXCLUDE: queryExpression2
}

OR

// Boolean OR: Return results that contain one or more tokens
{
  OR: [ token1, token2, /* ... */ ]
}

SEARCH

// equivalent to
// QUERY(q, {
//   SCORE: 'TFIDF',
//   SORT: true
// })
const results = await SEARCH(q)

TOKENIZATION_PIPELINE_STAGES

Tokenization pipeline stages can be added, removed or reordered when calling PUT by passing them to the tokenizationPipeline option.

Use this functionality when processing text on the way into the index. Typical tasks would be to add stemming, synonym replacement, character normalisation, or phone number normalisation.

It is possible to create your own tokenization pipeline stage. See the PUT section for more info.

Name | Description
SKIP | Skip these fields
LOWCASE | Convert all tokens to lower case
NGRAMS | Create ngrams
SCORE_TERM_FREQUENCY | Score frequency of terms
REPLACE | Replace terms with other terms (synonyms)
SPLIT | Splits string into tokens (note: this is always the first stage, and tokens is a string rather than an array)
SPY | Print output from the preceding stage to console.log
STOPWORDS | Remove stopwords