summaryrefslogtreecommitdiff
path: root/bindings/rs-dablooms/dablooms/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'bindings/rs-dablooms/dablooms/README.md')
-rw-r--r--bindings/rs-dablooms/dablooms/README.md259
1 files changed, 259 insertions, 0 deletions
diff --git a/bindings/rs-dablooms/dablooms/README.md b/bindings/rs-dablooms/dablooms/README.md
new file mode 100644
index 0000000..d606233
--- /dev/null
+++ b/bindings/rs-dablooms/dablooms/README.md
@@ -0,0 +1,259 @@
+Dablooms: A Scalable, Counting, Bloom Filter
+----------------------------------
+
+_Note_: this project has been mostly unmaintained for a while.
+
+### Overview
+This project aims to demonstrate a novel Bloom filter implementation that can
+scale, and provide not only the addition of new members, but reliable removal
+of existing members.
+
+Bloom filters are a probabilistic data structure that provide space-efficient
+storage of elements at the cost of possible false positive on membership
+queries.
+
+**dablooms** implements such a structure that takes additional metadata to classify
+elements in order to make an intelligent decision as to which Bloom filter an element
+should belong.
+
+### Features
+**dablooms**, in addition to the above, has several features.
+
+* Implemented as a static C library
+* Memory mapped
+* 4 bit counters
+* Sequence counters for clean/dirty checks
+* Python wrapper
+
+For performance, the low-level operations are implemented in C. It is also
+memory mapped which provides async flushing and persistence at low cost.
+In an effort to maintain memory efficiency, rather than using integers, or
+even whole bytes as counters, we use only four bit counters. These four bit
+counters allow up to 15 items to share a counter in the map. If more than a
+small handful are sharing said counter, the Bloom filter would be overloaded
+(resulting in excessive false positives anyway) at any sane error rate, so
+there is no benefit in supporting larger counters.
+
+The Bloom filter also employs change sequence numbers to track operations performed
+on the Bloom filter. These allow the application to determine if a write might have
+only partially completed (perhaps due to a crash), leaving the filter in an
+inconsistent state. The application can thus determine if a filter is ok or needs
+to be recreated. The sequence number can be used to determine what a consistent but
+out-of-date filter missed, and bring it up-to-date.
+
+There are two sequence numbers (and helper functions to get them): "mem_seqnum" and
+"disk_seqnum". The "mem" variant is useful if the user is sure the OS didn't crash,
+and the "disk" variant is useful if the OS might have crashed since the Bloom filter
+was last changed. Both values could be "0", meaning the filter is possibly
+inconsistent from their point of view, or a non-zero sequence number that the filter
+is consistent with. The "mem" variant is often non-zero, but the "disk" variant only
+becomes non-zero right after a (manual) flush. This can be expensive (it's an fsync),
+so the value can be ignored if not relevant for the application. For example, if the
+Bloom file exists in a directory which is cleared at boot (like `/tmp`), then the
+application can safely assume that any existing file was not affected by an OS crash,
+and never bother to flush or check disk_seqnum. Schemes involving batching up changes
+are also possible.
+
+The dablooms library is not inherently thread safe, this is the clients responsibility.
+Bindings are also not thread safe, unless they state otherwise.
+
+### Installing
+Clone the repo, or download and extract a tarball of a tagged version
+[from github](https://github.com/bitly/dablooms/tags).
+In the source tree, type `make`, `make install` (`sudo` may be needed).
+This will only install static and dynamic versions of the C dablooms library "libdablooms".
+
+To use a specific build directory, install prefix, or destination directory for packaging,
+specify `BLDDIR`, `prefix`, or `DESTDIR` to make. For example:
+`make install BLDDIR=/tmp/dablooms/bld DESTDIR=/tmp/dablooms/pkg prefix=/usr`
+
+Look at the output of `make help` for more options and targets.
+
+Also available are bindings for various other languages:
+
+#### Python (pydablooms)
+To install the Python bindings "pydablooms" (currently only compatibly with python 2.x)
+run `make pydablooms`, `make install_pydablooms` (`sudo` may be needed).
+
+To use and install for a specific version of Python installed on your system,
+use the `PYTHON` option to make. For example: `make install_pydablooms PYTHON=python2.7`.
+You can override the module install location with the `PY_MOD_DIR` option to make,
+and the `BLDDIR` and `DESTDIR` options also affect pydablooms.
+
+The Makefile attempts to determine the python module location `PY_MOD_DIR`
+automatically. It prefers a location in `/usr/local`, but you can specify
+`PY_MOD_DIR_ARG=--user` to try to use the location which `pip install --user`
+would use in your HOME dir. You can instead specify `PY_MOD_DIR_ARG=--system`
+to prefer the normal/central system python module dir.
+
+See pydablooms/README.md for more info.
+
+#### Go (godablooms)
+The Go bindings "godablooms" are not integrated into the Makefile.
+Install libdablooms first, then look at `godablooms/README.md`
+
+### Contributing
+If you make changes to C portions of dablooms which you would like merged into the
+upstream repository, it would help to have your code match our C coding style. We use
+[astyle](http://astyle.sourceforge.net/), svn rev 353 or later, on our code, with the
+following options:
+
+ astyle --style=1tbs --lineend=linux --convert-tabs --preserve-date \
+ --fill-empty-lines --pad-header --indent-switches \
+ --align-pointer=name --align-reference=name --pad-oper -n <files>
+
+### Testing
+To run a quick and dirty test, type `make test`. This test uses a list of words
+and defaults to `/usr/share/dict/words`. If your path differs, you can use the
+`WORDS` flag to specific its location, such as `make test WORDS=/usr/dict/words`.
+
+This will run a simple test that iterates through a word list and
+adds each word to dablooms. It iterates again, removing every fifth
+element. Lastly, it saves the file, opens a new filter, and iterates a third time
+checking the existence of each word. It prints results of the true negatives,
+false positives, true positives, and false negatives, and the false positive rate.
+
+The false positive rate is calculated by "false positives / (false positivies + true negatives)".
+That is, what rate of real negatives are false positives. This is the interesting
+statistic because the rate of false negatives should always be zero.
+
+The test uses a maximum error rate of .05 (5%) and an initial capacity of 100k. If
+the dictionary is near 500k, we should have created 4 new filters in order to scale to size.
+
+A second test adds every other word in the list, and removes no words, causing each
+used filter to stay at maximum capacity, which is a worse case for accuracy.
+
+Check out the performance yourself, and checkout the size of the resulting file!
+
+## Bloom Filter Basics
+Bloom filters are probabilistic data structures that provide
+space-efficient storage of elements at the cost of occasional false positives on
+membership queries, i.e. a Bloom filter may state true on query when it in fact does
+not contain said element. A Bloom filter is traditionally implemented as an array of
+`M` bits, where `M` is the size of the Bloom filter. On initialization all bits are
+set to zero. A filter is also parameterized by a constant `k` that defines the number
+of hash functions used to set and test bits in the filter. Each hash function should
+output one index in `M`. When inserting an element `x` into the filter, the bits
+in the `k` indices `h1(x), h2(x), ..., hk(X)` are set.
+
+In order to query a Bloom filter, say for element `x`, it suffices to verify if
+all bits in indices `h1(x), h2(x), ..., hk(x)` are set. If one or more of these
+bits is not set then the queried element is definitely not present in the
+filter. However, if all these bits are set, then the element is considered to
+be in the filter. Given this procedure, an error probability exists for positive
+matches, since the tested indices might have been set by the insertion of other
+elements.
+
+### Counting Bloom Filters: Solving Removals
+The same property that results in false positives *also* makes it
+difficult to remove an element from the filter as there is no
+easy means of discerning if another element is hashed to the same bit.
+Unsetting a bit that is hashed by multiple elements can cause **false
+negatives**. Using a counter, instead of a bit, can circumvent this issue.
+The bit can be incremented when an element is hashed to a
+given location, and decremented upon removal. Membership queries rely on whether a
+given counter is greater than zero. This reduces the exceptional
+space-efficiency provided by the standard Bloom filter.
+
+### Scalable Bloom Filters: Solving Scale
+Another important property of a Bloom filter is its linear relationship between size
+and storage capacity. If the maximum allowable error probability and the number of elements to store
+are both known, it is relatively straightforward to dimension an appropriate
+filter. However, it is not always possible to know how many elements
+will need to be stored a priori. There is a trade off between over-dimensioning filters or
+suffering from a ballooning error probability as it fills.
+
+Almeida, Baquero, Preguiça, Hutchison published a paper in 2006, on
+[Scalable Bloom Filters](http://www.sciencedirect.com/science/article/pii/S0020019006003127),
+which suggested a means of scalable Bloom filters by creating essentially
+a list of Bloom filters that act as one large Bloom filter. When greater
+capacity is desired, a new filter is added to the list.
+
+Membership queries are conducted on each filter with the positives
+evaluated if the element is found in any one of the filters. Naively, this
+leads to an increasing compounding error probability since the probability
+of the given structure evaluates to:
+
+ 1 - 𝚺(1 - P)
+
+It is possible to bound this error probability by adding a reducing tightening
+ratio, `r`. As a result, the bounded error probability is represented as:
+
+ 1 - 𝚺(1 - P0 * r^i) where r is chosen as 0 < r < 1
+
+Since size is simply a function of an error probability and capacity, any
+array of growth functions can be applied to scale the size of the Bloom filter
+as necessary. We found it sufficient to pick .9 for `r`.
+
+## Problems with Mixing Scalable and Counting Bloom Filters
+Scalable Bloom filters do not allow for the removal of elements from the filter.
+In addition, simply converting each Bloom filter in a scalable Bloom filter into
+a counting filter also poses problems. Since an element can be in any filter, and
+Bloom filters inherently allow for false positives, a given element may appear to
+be in two or more filters. If an element is inadvertently removed from a filter
+which did not contain it, it would introduce the possibility of **false negatives**.
+
+If however, an element can be removed from the correct filter, it maintains
+the integrity of said filter, i.e. prevents the possibility of false negatives. Thus,
+a scaling, counting, Bloom filter is possible if upon additions and deletions
+one can correctly decide which Bloom filter contains the element.
+
+There are several advantages to using a Bloom filter. A Bloom filter gives the
+application cheap, memory efficient set operations, with no actual data stored
+about the given element. Rather, Bloom filters allow the application to test,
+with some given error probability, the membership of an item. This leads to the
+conclusion that the majority of operations performed on Bloom filters are the
+queries of membership, rather than the addition and removal of elements. Thus,
+for a scaling, counting, Bloom filter, we can optimize for membership queries at
+the expense of additions and removals. This expense comes not in performance,
+but in the addition of more metadata concerning an element and its relation to
+the Bloom filter. With the addition of some sort of identification of an
+element, which does not need to be unique as long as it is fairly distributed, it
+is possible to correctly determine which filter an element belongs to, thereby able
+to maintain the integrity of a given Bloom filter with accurate additions
+and removals.
+
+## Enter dablooms
+dablooms is one such implementation of a scaling, counting, Bloom filter that takes
+additional metadata during additions and deletions in the form of a (generally)
+monotonically increasing integer to classify elements (possibly a timestamp).
+This is used during additions/removals to easily determine the correct Bloom filter
+for an element (each filter is assigned a range). Checking an item against the Bloom
+filter, which is assumed to be the dominant activity, does not use the id (it works
+like a normal scaling Bloom filter).
+
+dablooms is designed to scale itself using these identifiers and the given capacity.
+When a Bloom filter is at capacity, dablooms will create a new Bloom filter which
+starts at the next id after the greatest id of the previous Bloom filter. Given the
+fact that the identifiers monotonically increase, new elements will be added to the
+newest Bloom filter. Note, in theory and as implemented, nothing prevents one from
+adding an element to any "older" filter. You just run the increasing risk of the
+error probability growing beyond the bound as it becomes "overfilled".
+
+You can then remove any element from any Bloom filter using the identifier to intelligently
+pick which Bloom filter to remove from. Consequently, as you continue to remove elements
+from Bloom filters that you are not continuing to add to, these Bloom filters will become
+more accurate.
+
+The "id" of an element does not need to be known to check the Bloom filter, but does need
+to be known when the element is removed (and the same as when it was added). This might
+be convenient if the item already has an appropriate id (almost always increasing for new
+items) associated with it.
+
+### Example use case
+There is a database with a collection of entries. There is a series of items, each of which
+you want to look up in the database; most will have no entry in the database, but some
+will. Perhaps it's a database of spam links. If you use dablooms in front of the database,
+you can avoid needing to check the database for almost all items which won't be found in
+it anyway, and save a lot of time and effort. It's also much easier to distribute the
+Bloom filter than the entire database. But to make it work, you need to determine an "id"
+whenever you add to or remove from the Bloom filter. You could store the timestamp when
+you add the item to the database as another column in the database, and give it to
+`scaling_bloom_add()` as well. When you remove the item, you look it up in the database
+first and pass the timestamp stored there to `scaling_bloom_remove()`. The timestamps for
+new items will be equal or greater, and definitely greater over time. Instead of
+timestamps, you could also use an auto-incrementing index. Checks against the Bloom
+don't need to know the id and should be quick. If a check comes back negative, you can be
+sure the item isn't in the database, and skip that query completely. If a check comes
+back positive, you have to query the database, because there's a slight chance that the
+item isn't actually in there.