About dclib's .bin files

Intro

DClib stores data about the files in your share in various .bin files kept in the ~/.dc folder. They are maintained by the CSearchIndex class. However, CSearchIndex is really an internal class only used by CFileManager. It is CFileManager that does much of the work during filelist refresh.

Advantages: Loading the database from disk is just reading the .bin files (not hashleaves) into memory and saving is just dumping the memory to disk. Minimal memory overhead, the data is just stored in memory, no containers are used.

Disadvantages: Working with the data in memory is all pointers / memcmp / memcpy.

Update 14/07/2008

The keyword indexing code has been removed from CSearchIndex and replaced with a new class called CKeywordIndex, which is very simple. CKeywordIndex cannot be saved to disk and must be generated on valknut startup and in the second stage of the filelist refresh process. However the old keyword indexes in CSearchIndex were always re-generated from empty during filelist refresh and this is by far the shortest stage of the refresh process.

Also, CKeywordIndex includes keywords from the full path and filename so that searches are handled the same way as by DC++. Also, all components of the name are indexed. But 1 or 2 character words are only looked up in the index if there were no longer components, usually they are used to filter the results before sending them as before. This is because looking for "1" in the index will not find "01" "101" "1x01" etc. so is not normally useful.

Update 16/07/2008

CKeywordIndex only existed for about two days and has now been removed. Instead the full pathname+filename for every file in your share is just searched for each search keyword. This system may be slower than the old but appears unavoidable. However it may be faster because for searches that would return many results, just the first 5 or 10 results are returned and the search can stop. The bad case is now searches that return no results, causing every filename to be searched. The number of string searches cannot be reduced but it may be possible to reduce the amount of data that is searched by handling path names better since there are usually far fewer paths than files.

The files

Files database

Search indexes - no longer used in 0.3.19svn and later.

The contents of these files was not well understood.

New search lists in 0.3.19svn

Hash database

The files and hash databases are fairly simple, the old search index is not as well understood and was completely replaced for 0.3.19.

What they contain

database.bin is a list of filebaseobjects, which are fixed size structs.

filebase.bin is a list of file names, each of varying length and terminated with \0 (like normal strings).

pathbase.bin is a list of file paths, again of varying length.

Each filebaseobject stores the position of the filename and pathname.

The hash database has similar design.

hashbase.bin stores fixed size hashbaseobjects.

hashfilebase.bin stores variable length filenames, null terminated.

hashpathbase.bin stores variable length pathnames, null terminated.

hashindex.bin stores fixed length TTH roots, in binary, not base32 encoded.

hashleaves.bin stores varying length TTH leaf data. Each entry starts with the length of the leaf data stored as a ulonglong, followed by the data.

The new search index:

casefoldedbase.bin is a list of unsigned longs, storing the offset in casefoldeddata.bin of the file's path+name. Adding this to the filebaseobject struct would have changed the format of database.bin. It uses the same index as database.bin.

casefoldeddata.bin is supposed to store the case-folded path+name of the file needed to make searching case insensitive. However it currently just uses CString::ToUpper() which just uses the C function toupper() so only works on plain ASCII letters. Unfortunately towupper() doesn't seem to change the case of accented characters either so a case-folding class needs to be added.

What happens on filelist refresh

It's a three stage process.

First, database, filebase and pathbase are cleared and recreated, with the data of files currently shared.

Next, the search indexes are cleared and recreated based on the contents of the files database. Essentially, each filename is stored, then each sub component of the filename is also stored. Data is somehow keyed on first letter.

Finally, the existing hash database is updated, it is not cleared first. This is so that once hashed, files can be unshared, then later re-shared, without re-hashing, provided the filename, path and timestamp match.

But: both filebaseobject and hashbaseobject contain a hash index field, which stores the position in hashindex.bin that the TTH root is. When database.bin is created in the first stage, the hash index is unknown and set as -1. The filebaseobject's hash index is filled in during the third stage.

What's changed recently

The original design taken from the unreleased work in dcgui.berlios.de cvs kept the hash leaves in a byte array, in memory. Since the hash leaf data can be quite large (50MiB for a few hundred GiB of share), dclib 0.3.9 kept the hash leaves on disk for read access, but loaded them all into memory during filelist refresh. 0.3.15 will just append new hash leaves to the end of the hashleaves.bin file, saving lots of memory.

To make search results available during filelist refresh, which is required so that uploads using TTHF can start, major changes to CSearchIndex and CFileManager were required. It was necessary to re-create/update independent byte arrays, which then replace the old arrays once all 3 stages of filelist refresh have been completed. However, some CSearchIndex methods are used both to find a file by TTH in response to a search request, and to check if a file already has a hash during hash database update. These methods were just copied and renamed, the original method used for search handling accessing the old byte arrays, and the new method used during filelist refresh accessing the byte arrays that are being updated.

Also, there was an unforseen problem caused by folders being indexed. DClib was stopping the search when "exact matches" were found. Folders are likely to be just a series or artist name, and so would be an exact match, and be the only result returned, preventing files, which have the series or artist plus other information in their name, from also being returned. Fortunately the fix for this was trivial.

What's still todo

One problem with dclib's search system, that has been bugging me since the beginning (but apparently no-one else knows or cares about) is that dclib cannot handle components of search queriers shorter than 3 characters. Components of filenames shorter than 3 chars are not put into the indexes, and components of search queries shorter than 3 chars are ignored.

So, what happens when you put the short bits into the indexes, and stop ignoring them? Well, the result is not particularly useful. Each word in the index is tested for equality with the part of the search query, so the indexes might contain "1x01" or "s01e01" which do not match "1" or "01".

So, what about using CString::Find to search for the query in each index item? Well it didn't work either. Possibly because things were too likely to become an "exact match" and stop the search - although this behaviour has just been changed. But, doing this makes the search indexes redundant - there's no need to store the whole filename plus each component of the filename since if CString::Find is going to match, it will always match in the whole filename.

Any solution is likely to require full understanding of the existing search indexes - well unless they're going to be completely replaced. Also, what DC++ does needs to be checked, because last time I checked, searching for "battle" does not return any results for "battlestar", but you can include "5" or "05" in your search to get the fifth episode.

Update 10/06/2008

Although nothing in CFileManager / CSearchIndex and the .bin files was changed, CQueryManager was modified to handle short (less than 3 chars) components of search queries. They are used to filter out results from CSearchIndex that do not match, in the same place as size and type are checked. A query consisting of only short components will still return no results, but a search for "3" or "in" is unlikely to produce useful results anyway.

Update 14/07/2008

Oh dear, it appears DC++ does handle substring queries differently. DC++ will match "battlestar" for "battle" or "star" but dclib will not. Fixing this makes any keyword index useless and you just have to use string find on every path+filename in the share?

Update 16/07/2008

All keyword indexing code has now been removed and instead dclib just has to do the string searching.