This Perl module implements a peer-to-peer filesystem with local caching and POSIX filesystem semantics.
Every inode of the filesystem has one or more owners. An “owner” is simply a node which has stored the complete contents of an inode (including all xattrs) in its local cache. Owners respond to queries from other nodes, which send multicast packets asking where to find a copy of the file.
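The module itself is written in Perl; as an illustration only, here is a Python sketch of how an owner might answer such a query. The packet fields and the `WHO_HAS`/`I_HAVE` message names are hypothetical, not the module's actual wire format.

```python
def handle_query(local_cache, packet):
    """An owner answers a WHO_HAS multicast query only if it holds the
    complete inode (data plus all xattrs) in its local cache."""
    ino = packet["inode"]
    entry = local_cache.get(ino)
    if entry and entry["complete"]:
        return {"type": "I_HAVE", "inode": ino, "owner": entry["node"]}
    # Partial copies don't make a node an owner, so it stays silent.
    return None
```

A node holding only a partial copy does not reply, since ownership requires the complete contents.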
If a node has extra disk space, it will look for useful things to do with that space, following a specific strategy. I expect this strategy to evolve with time and experience, but my current ideas are as follows: the node will first attempt to complete partially-downloaded inodes, so it can become an owner. Partially-downloaded inodes are the most likely to have been used locally at least once in the past. If there are no partially-downloaded inodes, the node will search for inodes which have only one owner, preferring those whose sole owner is claiming to be under disk pressure, or (failing that) those whose sole owner is far away. It will begin to download them, in an attempt to provide maximum data redundancy in case of a network outage.
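As a sketch of that priority order (the module is Perl; this Python is illustrative, and all field names are assumptions, not the module's API):

```python
def pick_prefetch_candidate(partial, remote):
    """Choose what to prefetch next, per the strategy above:
    1. finish a partially-downloaded inode (here: the one closest to done,
       an arbitrary tie-break assumed for illustration);
    2. else a single-owner inode whose owner reports disk pressure;
    3. else the single-owner inode that is farthest away."""
    if partial:
        return min(partial, key=lambda i: i["remaining"])
    singles = [i for i in remote if i["owners"] == 1]
    pressured = [i for i in singles if i["owner_pressured"]]
    if pressured:
        return pressured[0]
    if singles:
        return max(singles, key=lambda i: i["distance"])
    return None  # nothing worth prefetching
```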
If a node has exceeded its normal maximum disk usage, it will look for things to free. It will begin by cancelling any outstanding prefetches (see the above paragraph) and freeing any partially-downloaded inodes which have not been recently used. If still under storage pressure, it will search for inodes it owns but is not the sole owner of, and free those after a suitable handshake (so the other owners don't free their copies at the same time). Finally, if it can't find anything suitable to free, it will start returning -ENOSPC to write requests.
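The same escalation can be sketched as a priority function (again an illustrative Python sketch, not the module's Perl internals; the state fields are hypothetical):

```python
def next_free_action(state):
    """Return the next freeing step, in the escalation order above:
    cancel prefetches, then free stale partial inodes, then free
    co-owned inodes (after a handshake), and finally give up (ENOSPC)."""
    if state["prefetches"]:
        return ("cancel_prefetch", state["prefetches"][0])
    stale = [i for i in state["partial"] if not i["recently_used"]]
    if stale:
        return ("free_partial", stale[0]["inode"])
    shared = [i for i in state["owned"] if i["other_owners"] > 0]
    if shared:
        # The handshake keeps the other owners from freeing simultaneously.
        return ("handshake_and_free", shared[0]["inode"])
    return ("enospc", None)
```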
Disk pressure is calculated according to 3 settings: min, max, and hardmax, where “min” < “max” < “hardmax”. “hardmax” is not to be exceeded under any circumstances; running into it will result in -ENOSPC. “max” is the reasonable maximum storage size; exceeding it will cause something else to get freed, as above. “min” is a minimum size the filesystem will seek to fill; if the storage layer is using less than “min”, the filesystem will seek something off-node to fill it.
The distance between “min” and “max”, and the distance between “max” and “hardmax”, should each be at least as large as any file you expect to store in the filesystem. (This could therefore be as large as several gigs, in many cases.) Having too little room between max and hardmax will result in -ENOSPC errors occurring more often, because the disk-freeing process is asynchronous. Having too little room between min and max will result in cache grinding and added bandwidth consumption.
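A quick sanity check of those constraints, as a hedged Python sketch (the function name and parameters are mine, for illustration):

```python
def gaps_are_sane(min_bytes, max_bytes, hardmax_bytes, largest_expected_file):
    """Verify min < max < hardmax, and that both gaps (min..max and
    max..hardmax) are at least as large as the biggest file you expect
    to store, per the sizing advice above."""
    assert min_bytes < max_bytes < hardmax_bytes
    return (max_bytes - min_bytes >= largest_expected_file and
            hardmax_bytes - max_bytes >= largest_expected_file)
```

For example, with a 4 GiB largest file, min/max/hardmax of 10/14/18 GiB is fine, while 10/11/18 GiB leaves too little room between min and max and invites cache grinding.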
All read requests are served out of local storage if possible, and wait for the data to be fetched from the network otherwise. If fetching the requested data would bump the disk usage above “max”, presumably something else will get knocked out of the cache. If fetching the requested data would bump the disk usage above “hardmax”, the data is served in “degraded mode”; the data is fetched over the network and returned directly to the application, without being cached (similar in concept to NFS's normal mode of operation).
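The read path above can be sketched like this (illustrative Python, not the Perl implementation; the second return value labels which path was taken and is purely for illustration):

```python
def read_inode(ino, cache, usage, max_bytes, hardmax_bytes, fetch):
    """Serve a read: local cache first; otherwise fetch from the network.
    Caching may temporarily exceed 'max' (the async freer catches up),
    but a fetch that would exceed 'hardmax' is served in degraded mode,
    straight to the application and uncached."""
    if ino in cache:
        return cache[ino], "local"
    data = fetch(ino)
    if usage["bytes"] + len(data) > hardmax_bytes:
        return data, "degraded"          # never cached, NFS-style
    cache[ino] = data
    usage["bytes"] += len(data)
    mode = "cached-over-max" if usage["bytes"] > max_bytes else "cached"
    return data, mode
```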