bio | home | mobile |
papers | videos
QUICK_GUIDE
-------------
Kirara1 (ver 1.5)
Kirara2 (ver 2.2)
look
http://plan9.aichi-u.ac.jp/netlib/
for new version or bug fix version.
-------------
Kirara is a desktop search engine for Plan 9.
Indexing and search target:
- alphanumeric words in Latin text
including English, German, and etc.
- alphabetic words in Greek, Coptic,
Cyrillic, Armenian
- CJK Kanji and Japanese Katakana
Personal use: index/retrieve local files.
Kirara is based on the idea similar to Glimpse.
(1) indexing + grep
(2) multi-level indexing
(a) small space for indexing
(b) small update time
(c) quick search
Note that:
small indexing ->|<- quick search #
comflicting
Kirara makes more index -> quick search
Glimpse is single-level indexing.
-------------
Query
Kirara2 is similar to Kirara1 but stands on different concept.
Kirara1 supports a sort of phrase search but slow.
Kirara2 improves this problem at the cost of indexing space and
updating time.
phrase search of Kirara: search text lines with multiple query
words.
Kirara1: The database is index of words that points to
directory.
Kirara2: The database is index of words that points to file.
Both support:
QE mode (query expression mode)
'&', '|', '*', '?', ' '
The example:
'snoopy&html'
'sn?o*y htm*'
Look MAN_KFIND for detail.
-------------
Words
Two or more aphanumeric letters including '_' for Latin text.
All words are converted to lower case.
The minimum number of letters is configurable.
Assumption:
Text is composed of words that are separated by spaces and
symbols.
This is popular in English and many European Languages, but not in
Japanese.
In extracting words from Japanese text, Hiragana is discarded. This
is reasonable
strategy in handling Japanese.
-------------
The user's interface
Best match with Rio.
term% kfind1 snoopy
G1 'snoopy' /lib/dict/
G1 'snoopy'
/sys/lib/linux/var/lib/apt/lists/
G1 'snoopy' /sys/lib/man/lookman/
G1 'snoopy'
/sys/lib/sysconfig/auth/
G1 'snoopy'
/sys/lib/sysconfig/proto/
G1 'snoopy' /sys/man/
G1 'snoopy' /sys/man/3/
G1 'snoopy' /sys/man/8/
G1 'snoopy' /sys/src/cmd/gs/doc/
G1 'snoopy' /sys/src/cmd/ip/
...
This is an example for Kirara1.
In Kirara2, replace: kfind1 -> kfind2 and G1 -> G2
and also, directories -> files
Note that: two steps
Kirara1:
1. find directories
2. find files and the contents
Step 2 is actually 'grep'.
Kirara2:
1. find files
2. find the contents
Step 2 is 'grep' for small files and new tools (look MAN_MKDICT)
for large files.
Two-steps search is not a weekness, but a desirable feature.
Because we have so many files that are hit by the query.
-------------
The indexing organization
My example
Kirara1: /n/other/kirara1/sysdb
Kirara2: /n/other/kirara2/sysdb
target=(/lib /sys/lib /sys/src /sys/man /sys/include /sys/doc
/rc)
Kirara1: /n/other/kirara1/usrdb
Kirara2: /n/other/kirara2/usrdb
target=$home/^(bin/rc lib netlib doc adm issues srclib src)
Now, you may have as many usr*db as you like.
Indexing target is fully configurable.
Excluding target is also supported.
It is wise not to include dictionary such as "/lib/dict" which
is
annoying in retrieval.
-------------
Multi-Level Indexing (Kirara1)
(1) Indexing (top level)
word to directory mapping
sysdb/main
# main index
sysdb/mtoc
# meta index (table of contents for main)
sysdb/qdir/* # index of
each directory (only for large direcories)
sysdb/qtd
# map table (qid, mtime, path_to_dir)
sysdb/qd
# map table (qid, path_to_dir)
main # word to dir qid
aa
0000000000014f0a
aa
000000000001a1e0
aa
000000000001a26e
mtoc # word to range in
main
aa 0 126669
ab 126669 491569
ac 491569 1258566
ad 1258566 1852467
...
qdir/*/
where '*' is qid of directories
in qdir/*/ we have
qtn
# map table of files (qid, mtime, name)
and optionary, word to file mapping:
ind.gz # fine index of the
directory (gzipped)
usrdb is same as sysdb.
-------------
Multi-Level Indexing (Kirara2)
(1) Indexing (top level)
word to file mapping
sysdb/main
# main index
sysdb/mtoc
# meta index (table of contents for main)
sysdb/qdir/* # index of
each file (only for large files)
sysdb/qtf
# map table (qid, mtime, path_to_file)
sysdb/qf
# map table (qid, path_to_file)
main # word to file
qid
aa
000000000001a1e8
aa
000000000001a26f
aa
000000000001a277
mtoc # word to range in
main
aa 5393489 5824028
ab 5824028 6963148
ac 6963148 9918376
ad 9918376 11926357
...
qdir/*/
where '*' is qid of files
in qdir/*/ we have files:
ind.gz # fine index of the file
(gzipped)
dict:X # dictionary of the
file
list:X # inverted list of the
file
indx:X # map table of dict:X
usrdb is same as sysdb.
-------------
Experiment
(a) hardware
GA-H61M-USB3-B3
Intel Pentium G860 (3GHz)
DDR3 PC3 8GB (only 4GB is used in 386 kernel)
(b) software
9front
cwfs64x
-------------
NB: The performance data is of older version.
The performance (compression ratio)
(a) Kirara1
target target num_of_dirs
index
sysdb: 380 MB 1790 dirs
100 MB
usrdb: 1900 MB 9000 dirs
230 MB
target: the size is total ammount of text file size.
index: includes "main" + (all other files for indexing).
ratio: 100/380 (sysdb)
usrdb includes many large file which improved the ratio.
current version requies much indexing space than previous
version.
this is because current version of word extractor supports
non-ascii runes.
* tiny
small medium large
sysdb 31227 313 12
4
usrdb 85246 1509 220
17
where
tiny <100K
small <1000K
medium <10000K
large
(b) Kirara2
target target num_of_files
index
sysdb: 380 MB 36000 files
170 MB
usrdb: 1900 MB 92000 files
580 MB
-------------
The performance (retrieval time)
system dependent and cache state dependent
QE search # kfind foo
latency:
term% kfind snoopy
0.2 seconds for Kirara1.
0.2 seconds for Kirara2.
It is not important to make these times smaller.
(sufficiently small)
Normal QE search does not make much difference between Kirara1 and
Kirara2.
The difference is in phrase search. The followings is an
example.
(kfind1 and kfind2 are seach command for Kirara1 and Kirara2
respectively.)
term% date -n; kfind2 'snoopy proto' |rc; date -n
1381328072
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61:
snoopy
...
1381328077
term% date -n; kfind1 'snoopy proto' |rc; date -n
1381328119
/sys/lib/man/lookman/index:48401: proto /sys/man/8/snoopy
/sys/lib/sysconfig/proto/standalone:61:
snoopy
...
1381328137
term%
and another example:
term% date -n; kfind2 'include regexp' | rc |wc ; date -n
1381329707
247 3406 37076
1381329767
term% date -n; kfind1 'include regexp' | rc |wc ; date -n
1381329805
263 3454 38337
1381329983
term%
Kfind2 is three times faster than kfind1.
NOTE: both Kirara1 and Kirara2 show excessive "matched" lines such
as
/sys/lib/sysconfig/proto/standalone:61:
snoopy
("proto" is matched to pathname)
This tendancy is stronger in Kirara1 than in Kirara2.
-------------
The performance (construction/update)
(a) Construction time
system dependent
Initial construction
*
Kirara1 Kirara2
sysdb
15m
30m
usrdb
50m
100m
m: minutes
(b) Updating time
two commands for update
mkdb -r # scans all dirs in the target
to detect update
*
Kirara1 Kirara2
usrdb
2m
6m
these times depend much on state of the
cache (dirs in the target and the file "main")
NB: the use of "-r" option for Kirara1
may miss the change because of
mtime bug in cwfs. (Fossil is free from
this bug)
mkdb -L # consults event log to detect
update (only for usrdb)
*
Kirara1 Kirara2
usrdb
2m
4m
the "-L" option save the time for
scanining dirs in the target.
these times depend much on state of the
cache (the file "main").
the above result is the case that is
poorly cached.
if cached well, the 2m of Kirara1
reduces to 14sec.
on the other hand, the 4m of Kirara2
does not reduce to such a small value,
I suspect that is because the size of
"main" is too large to be fully cached.
NB: needs event log
-------------
Scalability
Main factors
(a) retrieval time
QE search: proportional to number of dirs that include the
query
(b) initial construction time
proportional to total data
(c) update time
mkdb1 -r
proportional to number of dirs and the
changes.
mkdb1 -L
proportional to changes and size of
index (file named "main").
-------------
Used Tools
(1) rc
(2) grep, sed, awk, sort, diff, gzip, ...
(3) some new tools written in C
-------------
What Kirara means?
Kirara is name of a girl that appeared in a Japanese comic
book.
(But I have never read the book.)
The name is seldom used in real world.
From the name we Japanese imagine something that is shining,
twincling
or gleaming beautifully and gracefully.
I like the name.
-------------
References
[1] GLIMPSE: A Tool to Search Through Entire File Systems
Udi Manber and Sun Wu (1993)
http://webglimpse.net/pubs/glimpse.pdf
[2] Glimpse Documentation
http://webglimpse.net/gdocs/glimpsehelp.html