bio | home | mobile | papers | videos

QUICK_GUIDE

-------------

Kirara1 (ver 1.5)
Kirara2 (ver 2.2)

look
        http://plan9.aichi-u.ac.jp/netlib/
for new version or bug fix version.

-------------

Kirara is a desktop search engine for Plan 9.

Indexing and search target:
        - alphanumeric words in Latin text including English, German, and etc.
        - alphabetic words in Greek, Coptic, Cyrillic, Armenian
        - CJK Kanji and Japanese Katakana

Personal use: index/retrieve local files.

Kirara is based on the idea similar to Glimpse.

(1) indexing + grep
(2) multi-level indexing

(a) small space for indexing
(b) small update time
(c) quick search

Note that:
small indexing   ->|<-  quick search    # comflicting
Kirara makes more index -> quick search
Glimpse is single-level indexing.

-------------

Query

Kirara2 is similar to Kirara1 but stands on different concept.
Kirara1 supports a sort of phrase search but slow.
Kirara2 improves this problem at the cost of indexing space and updating time.

phrase search of Kirara: search text lines with multiple query words.

Kirara1: The database is index of words that points to directory.
Kirara2: The database is index of words that points to file.

Both support:
QE mode (query expression mode)
        '&', '|', '*', '?', ' '
The example:
        'snoopy&html'
        'sn?o*y htm*'

Look MAN_KFIND for detail.

-------------

Words

Two or more aphanumeric letters including '_' for Latin text.
All words are converted to lower case.
The minimum number of letters is configurable.

Assumption:
Text is composed of words that are separated by spaces and symbols.
This is popular in English and many European Languages, but not in Japanese.
In extracting words from Japanese text, Hiragana is discarded. This is reasonable
strategy in handling Japanese.

-------------

The user's interface

Best match with Rio.
        term% kfind1 snoopy
        G1 'snoopy' /lib/dict/
        G1 'snoopy' /sys/lib/linux/var/lib/apt/lists/
        G1 'snoopy' /sys/lib/man/lookman/
        G1 'snoopy' /sys/lib/sysconfig/auth/
        G1 'snoopy' /sys/lib/sysconfig/proto/
        G1 'snoopy' /sys/man/
        G1 'snoopy' /sys/man/3/
        G1 'snoopy' /sys/man/8/
        G1 'snoopy' /sys/src/cmd/gs/doc/
        G1 'snoopy' /sys/src/cmd/ip/
        ...
This is an example for Kirara1.
In Kirara2, replace: kfind1 -> kfind2 and G1 -> G2
and also, directories -> files

Note that: two steps

Kirara1:
1. find directories
2. find files and the contents
Step 2 is actually 'grep'.

Kirara2:
1. find files
2. find the contents
Step 2 is 'grep' for small files and new tools (look MAN_MKDICT) for large files.

Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.

-------------

The indexing organization

My example

Kirara1: /n/other/kirara1/sysdb
Kirara2: /n/other/kirara2/sysdb
target=(/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc /rc)

Kirara1: /n/other/kirara1/usrdb
Kirara2: /n/other/kirara2/usrdb
target=$home/^(bin/rc lib netlib doc adm issues srclib src)

Now, you may have as many usr*db as you like.

Indexing target is fully configurable.
Excluding target is also supported.
It is wise not to include dictionary such as "/lib/dict" which is
annoying in retrieval.

-------------

Multi-Level Indexing (Kirara1)

(1) Indexing (top level)
word to directory mapping
sysdb/main                      # main index
sysdb/mtoc                      # meta index (table of contents for main)
sysdb/qdir/*            # index of each directory (only for large direcories)
sysdb/qtd                       # map table (qid, mtime, path_to_dir)
sysdb/qd                        # map table (qid, path_to_dir)

main            # word to dir qid
        aa      0000000000014f0a
        aa      000000000001a1e0
        aa      000000000001a26e

mtoc            # word to range in main
        aa 0 126669
        ab 126669 491569
        ac 491569 1258566
        ad 1258566 1852467
        ...

qdir/*/
        where '*' is qid of directories
in qdir/*/ we have
        qtn             # map table of files (qid, mtime, name)
and optionary, word to file mapping:
        ind.gz  # fine index of the directory (gzipped)

usrdb is same as sysdb.

-------------

Multi-Level Indexing (Kirara2)

(1) Indexing (top level)
word to file mapping
sysdb/main                      # main index
sysdb/mtoc                      # meta index (table of contents for main)
sysdb/qdir/*            # index of each file (only for large files)
sysdb/qtf                       # map table (qid, mtime, path_to_file)
sysdb/qf                        # map table (qid, path_to_file)

main            # word to file qid
        aa      000000000001a1e8
        aa      000000000001a26f
        aa      000000000001a277

mtoc            # word to range in main
        aa 5393489 5824028
        ab 5824028 6963148
        ac 6963148 9918376
        ad 9918376 11926357
        ...
qdir/*/
        where '*' is qid of files
in qdir/*/ we have files:
        ind.gz  # fine index of the file (gzipped)
        dict:X  # dictionary of the file
        list:X  # inverted list of the file
        indx:X  # map table of dict:X

usrdb is same as sysdb.

-------------

Experiment

(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 8GB (only 4GB is used in 386 kernel)

(b) software
9front
cwfs64x

-------------

NB: The performance data is of older version.


The performance (compression ratio)

(a) Kirara1

target   target  num_of_dirs        index
sysdb:   380 MB    1790 dirs            100 MB
usrdb:  1900 MB    9000 dirs            230 MB

target: the size is total ammount of text file size.
index: includes "main" + (all other files for indexing).
ratio: 100/380 (sysdb)
usrdb includes many large file which improved the ratio.
current version requies much indexing space than previous version.
this is because current version of word extractor supports non-ascii runes.

*               tiny    small   medium  large
sysdb   31227    313     12      4
usrdb   85246   1509    220     17

where
tiny    <100K
small   <1000K
medium  <10000K
large


(b) Kirara2

target   target   num_of_files      index
sysdb:   380 MB   36000 files           170 MB
usrdb:  1900 MB   92000 files           580 MB


-------------

The performance (retrieval time)
system dependent and cache state dependent

QE search       # kfind foo

latency:
        term% kfind snoopy
        0.2 seconds for Kirara1.
        0.2 seconds for Kirara2.

It is not important to make these times smaller.
(sufficiently small)

Normal QE search does not make much difference between Kirara1 and Kirara2.
The difference is in phrase search. The followings is an example.
(kfind1 and kfind2 are seach command for Kirara1 and Kirara2 respectively.)

term% date -n; kfind2 'snoopy proto' |rc; date -n
1381328072
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61:                 snoopy
...
1381328077
term% date -n; kfind1 'snoopy proto' |rc; date -n
1381328119
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61:                 snoopy
...
1381328137
term%

and another example:

term% date -n; kfind2 'include regexp' | rc |wc ; date -n
1381329707
    247    3406   37076
1381329767
term% date -n; kfind1 'include regexp' | rc |wc ; date -n
1381329805
    263    3454   38337
1381329983
term%

Kfind2 is three times faster than kfind1.

NOTE: both Kirara1 and Kirara2 show excessive "matched" lines such as
        /sys/lib/sysconfig/proto/standalone:61:                 snoopy
("proto" is matched to pathname)
This tendancy is stronger in Kirara1 than in Kirara2.

-------------

The performance (construction/update)

(a) Construction time
system dependent

Initial construction

        *                Kirara1        Kirara2
        sysdb           15m                     30m
        usrdb           50m                100m

        m: minutes

(b) Updating time
two commands for update

mkdb -r         # scans all dirs in the target to detect update

        *                Kirara1        Kirara2
        usrdb            2m                      6m

        these times depend much on state of the cache (dirs in the target and the file "main")

        NB: the use of "-r" option for Kirara1 may miss the change because of
        mtime bug in cwfs. (Fossil is free from this bug)

mkdb -L         # consults event log to detect update (only for usrdb)

        *                Kirara1        Kirara2
        usrdb            2m                      4m

        the "-L" option save the time for scanining dirs in the target.
        these times depend much on state of the cache (the file "main").
        the above result is the case that is poorly cached.
        if cached well, the 2m of Kirara1 reduces to 14sec.
        on the other hand, the 4m of Kirara2 does not reduce to such a small value,
        I suspect that is because the size of "main" is too large to be fully cached.

        NB: needs event log

-------------

Scalability

Main factors

(a) retrieval time
QE search: proportional to number of dirs that include the query

(b) initial construction time
proportional to total data

(c) update time
mkdb1 -r
        proportional to number of dirs and the changes.
mkdb1 -L
        proportional to changes and size of index (file named "main").

-------------

Used Tools

(1) rc

(2) grep, sed, awk, sort, diff, gzip, ...

(3) some new tools written in C

-------------

What Kirara means?

Kirara is name of a girl that appeared in a Japanese comic book.
(But I have never read the book.)
The name is seldom used in real world.
From the name we Japanese imagine something that is shining, twincling
or gleaming beautifully and gracefully.
I like the name.

-------------

References

[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html