User Guide
==========

.. _building-an-index:

Building an Index
-----------------

Use :class:`~impact_index.IndexBuilder` to create a sparse index from
document impact vectors. Each document is represented as a set of term
indices with associated impact values.

.. code-block:: python

    import numpy as np
    import impact_index

    builder = impact_index.IndexBuilder("/path/to/index")

    # Add documents: docid, term_indices, impact_values
    terms = np.array([0, 5, 42], dtype=np.uintp)
    values = np.array([1.2, 0.5, 3.1], dtype=np.float32)
    builder.add(0, terms, values)

    # More documents...
    builder.add(1, np.array([2, 5, 8], dtype=np.uintp),
                np.array([0.3, 0.9, 1.1], dtype=np.float32))

    # Finalize and get a searchable index
    index = builder.build(in_memory=True)
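
Conceptually, the builder turns per-document vectors into per-term posting
lists. The following pure-Python sketch illustrates that inversion for the
two documents above. It is a simplified model for intuition, not the
library's on-disk layout, and ``build_postings`` is a hypothetical helper.

.. code-block:: python

    from collections import defaultdict

    def build_postings(documents):
        """Invert {docid: (terms, values)} into {term: [(docid, value), ...]}."""
        postings = defaultdict(list)
        for docid in sorted(documents):
            terms, values = documents[docid]
            for term, value in zip(terms, values):
                postings[term].append((docid, value))
        return dict(postings)

    documents = {
        0: ([0, 5, 42], [1.2, 0.5, 3.1]),
        1: ([2, 5, 8], [0.3, 0.9, 1.1]),
    }
    postings = build_postings(documents)
    print(postings[5])  # term 5 occurs in both documents: [(0, 0.5), (1, 0.9)]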

Builder options
~~~~~~~~~~~~~~~

Use :class:`~impact_index.BuilderOptions` to control checkpointing
(for crash recovery) and memory usage:

.. code-block:: python

    options = impact_index.BuilderOptions()
    options.checkpoint_frequency = 100000   # checkpoint every N documents
    options.in_memory_threshold = 1000000   # max postings per term before flush

    builder = impact_index.IndexBuilder("/path/to/index", options=options)

    # Resume from a checkpoint (returns None if no checkpoint exists)
    last_docid = builder.get_checkpoint_doc_id()
    if last_docid is not None:
        print(f"Resuming from document {last_docid}")

Storage dtype
~~~~~~~~~~~~~

By default, impact values are stored as ``float32``. You can choose a
different on-disk type to trade precision for space:

.. code-block:: python

    # Use float16 for smaller indices
    builder = impact_index.IndexBuilder("/path/to/index", dtype="float16")

Supported dtypes: ``"float32"`` (default), ``"float16"``, ``"bfloat16"``,
``"float64"``, ``"int32"``, ``"int64"``.
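
To get a feel for the trade-off, the sketch below round-trips one impact
value through 16-bit and 32-bit IEEE 754 encodings using only the standard
library. It illustrates the general precision/size trade-off and makes no
claim about how the library lays out values on disk; ``round_trip`` is a
hypothetical helper.

.. code-block:: python

    import struct

    def round_trip(value, fmt):
        """Encode value with the given struct format and decode it again."""
        encoded = struct.pack(fmt, value)
        return struct.unpack(fmt, encoded)[0], len(encoded)

    value = 3.1
    half, half_size = round_trip(value, "<e")      # IEEE 754 binary16
    single, single_size = round_trip(value, "<f")  # IEEE 754 binary32

    print(f"float16: {half_size} bytes, error {abs(half - value):.6f}")
    print(f"float32: {single_size} bytes, error {abs(single - value):.6f}")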


.. _searching:

Searching
---------

Load an existing index and search it with WAND or MaxScore. Both return
a list of :class:`~impact_index.ScoredDocument`:

.. code-block:: python

    import impact_index

    index = impact_index.Index.load("/path/to/index", in_memory=True)

    # Query: {term_index: query_weight}
    query = {5: 1.0, 10: 0.5, 42: 1.5}

    # WAND algorithm
    results = index.search_wand(query, top_k=10)
    for doc in results:
        print(f"Document {doc.docid}: {doc.score}")

    # MaxScore algorithm (often faster on compressed/split indices)
    results = index.search_maxscore(query, top_k=10)
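
WAND and MaxScore are dynamic pruning algorithms: they skip documents that
cannot enter the top-k rather than scoring every candidate. As a mental
model of the score being ranked, here is a brute-force scorer that sums
``query_weight * impact`` over matching terms. The ``postings`` dictionary
and ``search_exhaustive`` are illustrative names, not part of
``impact_index``, and the dot-product scoring is an assumption about how
query weights combine with impacts.

.. code-block:: python

    import heapq
    from collections import defaultdict

    def search_exhaustive(postings, query, top_k):
        """Score every matching document and return the top_k (docid, score) pairs."""
        scores = defaultdict(float)
        for term, weight in query.items():
            for docid, impact in postings.get(term, []):
                scores[docid] += weight * impact
        # Highest score first; ties broken by lower docid
        return heapq.nsmallest(
            top_k, scores.items(), key=lambda item: (-item[1], item[0])
        )

    postings = {
        5: [(0, 0.5), (1, 0.9)],
        42: [(0, 3.1)],
    }
    results = search_exhaustive(postings, {5: 1.0, 10: 0.5, 42: 1.5}, top_k=10)
    for docid, score in results:
        print(f"Document {docid}: {score:.2f}")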

Async search
~~~~~~~~~~~~

For non-blocking retrieval (e.g., in a web server):

.. code-block:: python

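    async def search_both(index, query):  # await requires a coroutine
        wand_results = await index.aio_search_wand(query, top_k=10)
        maxscore_results = await index.aio_search_maxscore(query, top_k=10)
        return wand_results, maxscore_results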
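
Because the ``aio_`` methods are awaitable, several queries can be issued
concurrently with ``asyncio.gather``. The sketch below uses a stand-in
``FakeIndex`` class (hypothetical, so the example runs without an index on
disk) that returns dummy results; with a real index you would pass the
loaded ``index`` object instead.

.. code-block:: python

    import asyncio

    class FakeIndex:
        """Stand-in exposing an awaitable search method, for illustration only."""

        async def aio_search_wand(self, query, top_k):
            await asyncio.sleep(0)  # yield to the event loop, as real I/O would
            # Dummy result: query terms ordered by decreasing weight
            return sorted(query, key=query.get, reverse=True)[:top_k]

    async def search_many(index, queries, top_k=10):
        """Run all queries concurrently and return results in query order."""
        tasks = [index.aio_search_wand(query, top_k=top_k) for query in queries]
        return await asyncio.gather(*tasks)

    queries = [{5: 1.0, 42: 1.5}, {10: 0.5}]
    all_results = asyncio.run(search_many(FakeIndex(), queries))
    print(len(all_results))  # one result list per query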

Iterating over postings
~~~~~~~~~~~~~~~~~~~~~~~

You can inspect individual posting lists. Each element is a
:class:`~impact_index.TermImpact`:

.. code-block:: python

    term_id = 42  # any term index present in the index
    iterator = index.postings(term_id)
    print(f"Length: {iterator.length()}")
    print(f"Max impact: {iterator.max_value()}")
    print(f"Max doc ID: {iterator.max_doc_id()}")

    for posting in iterator:
        print(f"Doc {posting.docid}: {posting.value}")


.. _bm25:

BM25 and Bag-of-Words Indexing
------------------------------

For traditional IR with BM25 scoring, use
:class:`~impact_index.BOWIndexBuilder` instead of
:class:`~impact_index.IndexBuilder`. It tracks document lengths
automatically and can optionally run text analysis (tokenization and
stemming) for you.

Pre-tokenized input
~~~~~~~~~~~~~~~~~~~

If you already have term indices and term-frequency values:

.. code-block:: python

    import numpy as np
    import impact_index

    builder = impact_index.BOWIndexBuilder("/path/to/index", dtype="int32")

    # Add documents: docid, term_indices, tf_values
    terms = np.array([0, 5, 42], dtype=np.uintp)
    tf = np.array([3, 1, 2], dtype=np.int32)
    builder.add(0, terms, tf)

    builder.add(1, np.array([2, 5, 8], dtype=np.uintp),
                np.array([1, 4, 1], dtype=np.int32))

    # build() returns a searchable index; document metadata is stored automatically
    index = builder.build(in_memory=True)

    # Create a BM25-scored index (doc lengths loaded automatically)
    scored = index.with_scoring(impact_index.BM25Scoring(k1=1.2, b=0.75))

    # Search with MaxScore (often the faster of the two algorithms)
    query = {0: 1.0, 5: 1.0}
    results = scored.search_maxscore(query, top_k=10)
    for doc in results:
        print(f"Document {doc.docid}: {doc.score}")
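
For reference, the textbook BM25 term weight that ``k1`` and ``b``
parameterize can be written as a small function. This is the classic
formulation with a Lucene-style smoothed IDF; the exact variant computed by
``BM25Scoring`` may differ, and taking the document length as the sum of
term frequencies is an assumption, so treat this as an explanatory sketch.

.. code-block:: python

    import math

    def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq,
                        k1=1.2, b=0.75):
        """Textbook BM25 contribution of one term to one document."""
        idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        length_norm = 1.0 - b + b * doc_len / avg_doc_len
        return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

    # Term 5 in the two documents above: tf=1 in doc 0, tf=4 in doc 1.
    # Both documents have length 6 (sum of their term frequencies).
    low = bm25_term_score(tf=1, doc_len=6, avg_doc_len=6.0, num_docs=2, doc_freq=2)
    high = bm25_term_score(tf=4, doc_len=6, avg_doc_len=6.0, num_docs=2, doc_freq=2)
    print(f"tf=1: {low:.3f}, tf=4: {high:.3f}")

The score grows with term frequency but saturates: quadrupling ``tf`` here
raises the score by well under a factor of four.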

Raw text input with stemming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For direct text indexing with automatic tokenization, stemming, and
vocabulary management:

.. code-block:: python

    import impact_index

    builder = impact_index.BOWIndexBuilder(
        "/path/to/index",
        dtype="int32",
        stemmer="porter",  # Lucene-compatible Porter stemmer
        stop_words=True,   # Lucene default English stop words
    )

    builder.add_text(0, "the quick brown fox jumps over the lazy dog")
    builder.add_text(1, "a quick brown cat jumps high")
    builder.add_text(2, "the lazy dog sleeps all day")

    # Build index (doc metadata and analyzer config saved automatically)
    index = builder.build(in_memory=True)

    # BM25 scoring (doc lengths loaded automatically from index)
    scored = index.with_scoring(impact_index.BM25Scoring())

    # Query analysis (analyzer loaded automatically from index)
    query = index.analyzer().analyze_query("quick fox")
    results = scored.search_maxscore(query, top_k=10)
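
The analysis steps behind ``add_text`` (tokenize, drop stop words, map
terms to indices, count frequencies) can be pictured with the simplified
sketch below. It omits stemming, uses a tiny hand-picked stop list, and the
``analyze`` helper and ``vocabulary`` dict are hypothetical; the real
analyzer's tokenization rules may differ.

.. code-block:: python

    import re
    from collections import Counter

    STOP_WORDS = {"a", "the"}  # tiny illustrative list, not the full Lucene set

    def analyze(text, vocabulary):
        """Return {term_index: term_frequency}, growing the vocabulary as needed."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        kept = [token for token in tokens if token not in STOP_WORDS]
        for token in kept:
            vocabulary.setdefault(token, len(vocabulary))
        return dict(Counter(vocabulary[token] for token in kept))

    vocabulary = {}
    doc0 = analyze("the quick brown fox jumps over the lazy dog", vocabulary)
    doc1 = analyze("a quick brown cat jumps high", vocabulary)
    print(vocabulary)
    print(doc0, doc1)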

Stop words
~~~~~~~~~~

Stop words (common words like "the", "is", "a") can be filtered during
indexing and querying to reduce index size and improve search speed.
Built-in stop word lists from Lucene are available for 17 languages.

.. code-block:: python

    import impact_index

    # Use default stop words for the language (matches Lucene/Pyserini)
    builder = impact_index.BOWIndexBuilder(
        "/path/to/index",
        stemmer="snowball",
        language="english",
        stop_words=True,
    )

    # Or provide an explicit list
    builder = impact_index.BOWIndexBuilder(
        "/path/to/index",
        stemmer="snowball",
        stop_words=["the", "a", "is", "in"],
    )

    # Get the stop word list for any supported language
    words = impact_index.get_stop_words("english")   # 33 words (Lucene default)
    words = impact_index.get_stop_words("french")     # 154 words
    words = impact_index.get_stop_words("german")     # 231 words

Supported languages: arabic, danish, dutch, english, finnish, french,
german, greek, hungarian, italian, norwegian, portuguese, romanian,
russian, spanish, swedish, turkish.

.. note::

    For fair comparison with Pyserini/Lucene, always enable stop words.
    Without them, high-frequency terms like "the" create very long
    posting lists that slow down search significantly.

Loading a saved index
~~~~~~~~~~~~~~~~~~~~~

``Index.load`` automatically detects and loads auxiliary components
(document metadata, analyzer config, vocabulary) found in the index
directory:

.. code-block:: python

    import impact_index

    index = impact_index.Index.load("/path/to/index", in_memory=True)

    # Doc metadata and analyzer are loaded automatically
    scored = index.with_scoring(impact_index.BM25Scoring())
    analyzer = index.analyzer()
    query = analyzer.analyze_query("quick fox")
    results = scored.search_maxscore(query, top_k=10)


.. _compression:

Compression and Transforms
--------------------------

By default, compressed indices use PFOR-delta for doc IDs and adaptive
bitpacking for values, organized in 128-posting blocks that enable
block-max pruning.
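
The idea behind delta-based doc ID compression is that sorted doc IDs have
small gaps, and small integers need few bits. The sketch below computes the
gaps per block and the bit width a plain bitpacker would need. It is a
simplified illustration (real PFOR additionally stores outliers as
exceptions), and ``block_bit_widths`` is a hypothetical helper.

.. code-block:: python

    def block_bit_widths(doc_ids, block_size=128):
        """Per block of sorted doc IDs: (first_id, bits_per_gap, last_id)."""
        summary = []
        for start in range(0, len(doc_ids), block_size):
            block = doc_ids[start:start + block_size]
            gaps = [b - a for a, b in zip(block, block[1:])]
            bits = max(gaps, default=0).bit_length()
            summary.append((block[0], bits, block[-1]))
        return summary

    doc_ids = list(range(0, 1024, 4))  # 256 IDs with a constant gap of 4
    print(block_bit_widths(doc_ids))   # [(0, 3, 508), (512, 3, 1020)]

Each gap fits in 3 bits here, and the per-block boundaries let a searcher
skip whole blocks without decoding them.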

Quick compression
~~~~~~~~~~~~~~~~~

The simplest way to compress an index:

.. code-block:: python

    import impact_index

    index = impact_index.Index.load("/path/to/raw_index", in_memory=True)

    # Compress with defaults (PFOR doc IDs, lossless integer TF, block_size=128)
    compressed = index.compress("/path/to/compressed")

    # The compressed index is standalone — includes vocab, docmeta, analyzer
    scored = compressed.with_scoring(impact_index.BM25Scoring())

    # For neural IR with float impacts, use quantization:
    compressed = index.compress("/path/to/compressed", nbits=8)

The compressed index is fully standalone: auxiliary files (vocabulary,
analyzer config, document metadata) are automatically copied from the
source index.

The default settings are optimized for BM25:

- **block_size=128**: aligns with SIMD registers and enables effective
  block-max pruning during search.
- **nbits=0** (default): lossless integer bitpacking for TF counts
  (~2-3 bits per value). Use ``nbits=8`` or ``nbits=16`` for quantized
  float compression (neural IR like SPLADE).

Advanced compression
~~~~~~~~~~~~~~~~~~~~

For full control over compressors, use
:class:`~impact_index.CompressionTransform`:

.. code-block:: python

    import impact_index

    index = impact_index.Index.load("/path/to/raw_index", in_memory=True)

    # Choose compressors
    docid_compressor = impact_index.BitPackingCompressor()   # SIMD (recommended)
    # docid_compressor = impact_index.EliasFanoCompressor()  # alternative

    # Fixed-range quantization (if you know the value range)
    impact_compressor = impact_index.ImpactQuantizer(nbits=8, min=0.0, max=10.0)

    # Or auto-ranging quantization (determines range from the index)
    impact_compressor = impact_index.GlobalImpactQuantizer(nbits=8)

    # Apply compression
    transform = impact_index.CompressionTransform(
        max_block_size=128,
        doc_ids_compressor=docid_compressor,
        impacts_compressor=impact_compressor,
    )
    transform.process("/path/to/compressed", index)

    # Load the compressed index
    compressed = impact_index.Index.load("/path/to/compressed", in_memory=True)
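
Fixed-range quantization with ``nbits`` maps each float impact onto one of
``2**nbits`` evenly spaced levels between ``min`` and ``max``. The following
sketch shows linear quantization and the resulting reconstruction error.
The rounding rule and the ``quantize``/``dequantize`` helpers are
assumptions for illustration and not necessarily what the library does
internally.

.. code-block:: python

    def quantize(value, nbits, lo, hi):
        """Map a float in [lo, hi] to an integer level in [0, 2**nbits - 1]."""
        levels = (1 << nbits) - 1
        clamped = min(max(value, lo), hi)  # out-of-range values saturate
        return round((clamped - lo) / (hi - lo) * levels)

    def dequantize(level, nbits, lo, hi):
        """Map an integer level back to the float it represents."""
        levels = (1 << nbits) - 1
        return lo + level / levels * (hi - lo)

    impacts = [1.2, 0.5, 3.1]
    for value in impacts:
        level = quantize(value, nbits=8, lo=0.0, hi=10.0)
        restored = dequantize(level, nbits=8, lo=0.0, hi=10.0)
        print(f"{value} -> level {level} -> {restored:.4f}")

With this scheme the error is at most half a quantization step, which is
why a range that tightly fits the data preserves more precision than an
overly wide one.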

Splitting by quantiles
~~~~~~~~~~~~~~~~~~~~~~

:class:`~impact_index.SplitIndexTransform` partitions each term's postings
into sub-lists by value ranges, enabling more aggressive pruning with
MaxScore:

.. code-block:: python

    base_transform = impact_index.CompressionTransform(
        max_block_size=128,
        doc_ids_compressor=impact_index.EliasFanoCompressor(),
        impacts_compressor=impact_index.GlobalImpactQuantizer(nbits=8),
    )

    split_transform = impact_index.SplitIndexTransform(
        quantiles=[0.5, 0.9],   # split at 50th and 90th percentile
        sink=base_transform,
    )
    split_transform.process("/path/to/split_index", index)
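
To see why splitting helps pruning, note that each sub-list carries its own
maximum impact, so low-value postings get a much tighter upper bound than
the term-wide maximum. The sketch below partitions one posting list at
value quantiles. The nearest-rank quantile rule and the
``split_by_quantiles`` helper are assumptions for illustration.

.. code-block:: python

    def split_by_quantiles(postings, quantiles):
        """Partition (docid, value) postings into sub-lists by value quantile."""
        values = sorted(value for _, value in postings)
        last = len(values) - 1
        # Nearest-rank thresholds taken from the sorted values
        thresholds = [values[min(int(q * len(values)), last)] for q in quantiles]
        parts = [[] for _ in range(len(thresholds) + 1)]
        for docid, value in postings:
            bucket = sum(value >= t for t in thresholds)
            parts[bucket].append((docid, value))
        return parts

    postings = [(docid, float(docid % 10)) for docid in range(100)]
    parts = split_by_quantiles(postings, [0.5, 0.9])
    for part in parts:
        print(len(part), max(value for _, value in part))

In this toy list, half of the postings end up in a sub-list whose maximum is
4.0 rather than the term-wide 9.0.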


.. _bmp:

BMP (Block-Max Pruning)
-----------------------

BMP implements "Faster Learned Sparse Retrieval with Block-Max Pruning"
(SIGIR 2024) for fast approximate search.

Converting to BMP format
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import impact_index

    index = impact_index.Index.load("/path/to/index", in_memory=True)

    # Streaming conversion (recommended — memory-efficient)
    index.to_bmp_streaming("/path/to/bmp_index.bin", bsize=64, compress_range=True)

    # Or legacy method (loads all postings into memory)
    # index.to_bmp("/path/to/bmp_index.bin", bsize=64, compress_range=True)

Searching with BMP
~~~~~~~~~~~~~~~~~~

Load a BMP index with :class:`~impact_index.BmpSearcher` and search:

.. code-block:: python

    searcher = impact_index.BmpSearcher("/path/to/bmp_index.bin")
    print(f"Documents: {searcher.num_documents()}")

    # Query uses string term IDs
    query = {"term1": 1.0, "term2": 0.5}
    doc_ids, scores = searcher.search(query, k=10, alpha=1.0, beta=1.0)
    for docid, score in zip(doc_ids, scores):
        print(f"{docid}: {score}")

BMP search parameters:

- ``k`` — number of results to return
- ``alpha`` — controls early termination aggressiveness (default: 1.0)
- ``beta`` — controls block skipping (default: 1.0)
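
The core idea of block-max pruning can be sketched in a few lines:
documents are grouped into fixed-size blocks, each term stores its maximum
impact per block, and blocks are visited in decreasing order of their query
upper bound until no remaining block can beat the current top-k. This is an
exact, simplified illustration that assumes non-negative impacts and
weights. It does not model ``alpha``, ``beta``, or the library's data
layout, and all names in it are hypothetical.

.. code-block:: python

    import heapq
    from collections import defaultdict

    def bmp_search(forward, query, k, bsize):
        """forward[docid] is {term: impact}. Returns (top-k, blocks visited)."""
        # Per-block, per-term maximum impact
        block_max = defaultdict(lambda: defaultdict(float))
        for docid, vector in enumerate(forward):
            for term, impact in vector.items():
                block = docid // bsize
                block_max[block][term] = max(block_max[block][term], impact)
        # Upper bound on any document score inside each block
        bounds = sorted(
            ((sum(w * maxes.get(t, 0.0) for t, w in query.items()), block)
             for block, maxes in block_max.items()),
            reverse=True,
        )
        top, visited = [], 0  # min-heap of (score, docid)
        for bound, block in bounds:
            if len(top) == k and bound <= top[0][0]:
                break  # no remaining block can improve the top-k
            visited += 1
            end = min((block + 1) * bsize, len(forward))
            for docid in range(block * bsize, end):
                score = sum(w * forward[docid].get(t, 0.0) for t, w in query.items())
                if len(top) < k:
                    heapq.heappush(top, (score, docid))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, docid))
        return sorted(top, reverse=True), visited

    forward = ([{"term1": 0.1}] * 64
               + [{"term1": 2.0, "term2": 1.0}]
               + [{"term2": 0.2}] * 63)
    results, visited = bmp_search(forward, {"term1": 1.0, "term2": 0.5}, k=1, bsize=64)
    print(results, visited)  # only one of the two blocks is scored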


.. _document-store:

Document Store
--------------

The document store provides compressed storage for document content and
metadata, using zstd block compression. Documents can be retrieved by
sequential number or by key fields.

Building a store
~~~~~~~~~~~~~~~~

Use :class:`~impact_index.DocumentStoreBuilder` to create a store:

.. code-block:: python

    import impact_index

    builder = impact_index.DocumentStoreBuilder(
        "/path/to/store",
        block_size=4096,    # documents per compressed block
        zstd_level=3,       # compression level
    )

    # Add documents with key-value metadata and binary content
    builder.add({"docno": "DOC001", "url": "http://example.com"}, b"document text here")
    builder.add({"docno": "DOC002", "url": "http://example.com/2"}, b"another document")

    # Finalize (can only be called once)
    builder.build()
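
Block compression trades a little random-access latency for a better
compression ratio: documents are grouped, each group is compressed as one
unit, and reading a document decompresses only its block. The sketch below
illustrates this with ``zlib`` from the standard library standing in for
zstd. The in-memory layout, the extra ``DOC003`` record, and the
``build_blocks``/``get_by_number`` helpers are hypothetical and unrelated
to the library's file format.

.. code-block:: python

    import json
    import zlib

    def build_blocks(documents, block_size, level=3):
        """Compress (keys, content) pairs in groups of block_size documents."""
        blocks = []
        for start in range(0, len(documents), block_size):
            # Content is assumed to be UTF-8 text so it can be stored as JSON
            group = [
                {"keys": keys, "content": content.decode("utf-8")}
                for keys, content in documents[start:start + block_size]
            ]
            blocks.append(zlib.compress(json.dumps(group).encode("utf-8"), level))
        return blocks

    def get_by_number(blocks, block_size, number):
        """Decompress only the block holding document `number` (0-based)."""
        group = json.loads(zlib.decompress(blocks[number // block_size]))
        record = group[number % block_size]
        return record["keys"], record["content"].encode("utf-8")

    documents = [
        ({"docno": "DOC001", "url": "http://example.com"}, b"document text here"),
        ({"docno": "DOC002", "url": "http://example.com/2"}, b"another document"),
        ({"docno": "DOC003", "url": "http://example.com/3"}, b"a third document"),
    ]
    blocks = build_blocks(documents, block_size=2)
    print(get_by_number(blocks, block_size=2, number=2))

Larger blocks usually compress better but make each lookup decompress more
data, which is the trade-off behind the ``block_size`` setting.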

Resumable builds (crash recovery)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For long-running ingests, pass ``checkpoint_frequency`` to periodically
persist the in-flight builder state. If the process crashes or exits
before ``build()``, re-instantiating the builder against the same folder
(with the same non-zero ``checkpoint_frequency``) resumes from the last
checkpoint. Documents added after that checkpoint but before the crash are
discarded, and the output files are rewound to a consistent state.

.. code-block:: python

    builder = impact_index.DocumentStoreBuilder(
        "/path/to/store",
        checkpoint_frequency=10_000,  # checkpoint every 10k documents
    )

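    # `documents` stands for your own iterable of records exposing
    # `keys` (metadata dict) and `content` (bytes)
    for doc in documents:
        builder.add(doc.keys, doc.content)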

    builder.build()  # clears the checkpoint on success

    # On resume after a crash:
    builder = impact_index.DocumentStoreBuilder(
        "/path/to/store",
        checkpoint_frequency=10_000,
    )
    print(builder.num_documents())  # docs restored from the last checkpoint
    # Continue adding from wherever your source left off, then build().

Call ``builder.checkpoint()`` directly to force a checkpoint at an
arbitrary point (e.g. before exiting normally so that no work is lost).
Passing ``checkpoint_frequency=0`` (the default) disables checkpointing
and truncates any stale checkpoint on open.
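
One way to continue from where the source left off is to skip as many
source records as the restored builder already holds. This sketch assumes
the source yields documents in a stable, repeatable order;
``remaining_documents`` is a hypothetical helper, and the integer argument
stands in for ``builder.num_documents()``.

.. code-block:: python

    from itertools import islice

    def remaining_documents(source, already_stored):
        """Skip the first `already_stored` records of a deterministic source."""
        return islice(source, already_stored, None)

    source = [f"doc-{i}" for i in range(25)]
    restored = 20  # stand-in for builder.num_documents() after reopening
    todo = list(remaining_documents(source, restored))
    print(todo)  # ['doc-20', 'doc-21', 'doc-22', 'doc-23', 'doc-24']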

Retrieving documents
~~~~~~~~~~~~~~~~~~~~

Load a store with :meth:`~impact_index.DocumentStore.load` and retrieve
:class:`~impact_index.Document` objects by number or key. Each document
has :attr:`~impact_index.Document.keys` (metadata dict) and
:attr:`~impact_index.Document.content` (bytes):

.. code-block:: python

    store = impact_index.DocumentStore.load(
        "/path/to/store",
        content_access="memory",  # or "mmap" or "disk"
    )

    print(f"Total documents: {store.num_documents()}")
    print(f"Key fields: {store.key_names()}")

    # By sequential number (0-based)
    docs = store.get_by_number([0, 1, 2])
    for doc in docs:
        print(doc.keys, doc.content)

    # By key field value
    docs = store.get_by_key("docno", ["DOC001", "DOC002"])
    for doc in docs:
        if doc is not None:
            print(doc.keys, doc.content)

The ``content_access`` parameter controls how content data is accessed:

- ``"memory"`` — loads all content into RAM (fastest, highest memory)
- ``"mmap"`` — memory-mapped I/O (OS manages caching)
- ``"disk"`` — reads from disk on demand (lowest memory)

Async retrieval
~~~~~~~~~~~~~~~

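Both lookups have awaitable counterparts for use in async code:

.. code-block:: python

    async def fetch(store):  # await requires a coroutine
        by_number = await store.aio_get_by_number([0, 1, 2])
        by_key = await store.aio_get_by_key("docno", ["DOC001", "DOC002"])
        return by_number, by_key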