diff --git a/docs/directories.rst b/docs/directories.rst index 985d9ba38..92f3646ac 100644 --- a/docs/directories.rst +++ b/docs/directories.rst @@ -1,19 +1,21 @@ Directories and files =========================== -When talking about file systems, many people would assume directories, list files under a directory, etc. These are expected if we want to hook up Seaweed File System with linux by FUSE, or with Hadoop, etc. +When talking about file systems, many people would assume directories, +list files under a directory, etc. These are expected if we want to hook up +Seaweed File System with linux by FUSE, or with Hadoop, etc. Sample usage ##################### -Two ways to start a weed filer +Two ways to start a weed filer in standalone mode: .. code-block:: bash - + # assuming you already started weed master and weed volume weed filer - # Or assuming you have nothing started yet, - # this command starts master server, volume server, and filer in one shot. + # Or assuming you have nothing started yet, + # this command starts master server, volume server, and filer in one shot. # It's strictly the same as starting them separately. weed server -filer=true @@ -80,10 +82,10 @@ This assumed differences between directories and files lead to the design that t * efficient to move/rename/list_directories * Store files in a sorted string table in format - + * efficient to list_files, just simple iterator * efficient to locate files, binary search - + Complexity ################### @@ -131,4 +133,4 @@ Helps Wanted This is a big step towards more interesting Seaweed-FS usage and integration with existing systems. -If you can help to refactor and implement other directory meta data, or file meta data storage, please do so. \ No newline at end of file +If you can help to refactor and implement other directory meta data, or file meta data storage, please do so. 
diff --git a/docs/distributed_filer.rst b/docs/distributed_filer.rst
new file mode 100644
index 000000000..278347481
--- /dev/null
+++ b/docs/distributed_filer.rst
@@ -0,0 +1,101 @@
+Distributed Filer
+===========================
+
+The default weed filer runs in standalone mode, storing file metadata on disk.
+It is quite efficient at traversing deep directory paths and can handle
+millions of files.
+
+However, having no single point of failure (SPOF) is a must-have requirement for many projects.
+
+Luckily, SeaweedFS is so flexible that we can use a completely different way
+to manage file metadata.
+
+This distributed filer uses Cassandra to store the metadata.
+
+Cassandra Setup
+#####################
+Here is the CQL to create the table used by CassandraStore.
+Optionally you can adjust the keyspace name and replication settings.
+For production servers, you would want to set replication_factor to 3.
+
+.. code-block:: bash
+
+  create keyspace seaweed WITH replication = {
+    'class':'SimpleStrategy',
+    'replication_factor':1
+  };
+
+  use seaweed;
+
+  CREATE TABLE seaweed_files (
+    path varchar,
+    fids list<varchar>,
+    PRIMARY KEY (path)
+  );
+
+
+Sample usage
+#####################
+
+To start a weed filer in distributed mode:
+
+.. code-block:: bash
+
+  # assuming you already started weed master and weed volume
+  weed filer -cassandra.server=localhost
+
+Now you can add/delete files and read them back by their full path:
+
+.. code-block:: bash
+
+  # POST a file and read it back
+  curl -F "filename=@README.md" "http://localhost:8888/path/to/sources/"
+  curl "http://localhost:8888/path/to/sources/README.md"
+  # POST a file with a new name and read it back
+  curl -F "filename=@Makefile" "http://localhost:8888/path/to/sources/new_name"
+  curl "http://localhost:8888/path/to/sources/new_name"
+
+Limitation
+############
+Listing sub folders and files is not supported because Cassandra does not support
+prefix search.
+
+Flat Namespace Design
+#####################
+Instead of using both directory and file metadata, this implementation uses
+a flat namespace.
+
+If each directory's metadata were stored separately, there would be multiple
+network round trips to fetch directory information for deep directories,
+impeding system performance.
+
+A flat namespace would take more space because the parent directories are
+repeatedly stored. But disk space is a lesser concern, especially for
+distributed systems.
+
+Complexity
+###################
+
+For one file retrieval, the full_filename=>file_id lookup will be O(logN)
+using Cassandra. But very likely the one additional network hop would
+take longer than the Cassandra internal lookup.
+
+Use Cases
+#########################
+
+Clients can access one "weed filer" via HTTP, create files via HTTP POST, and read files via HTTP GET directly.
+
+Although one "weed filer" can only sit on one machine, you can start multiple "weed filer" instances on several machines, each "weed filer" instance running in its own collection, having its own namespace, but sharing the same Seaweed-FS storage.
+
+Future
+###################
+
+The Cassandra implementation can be switched to another distributed hash table.
+
+Help Wanted
+########################
+
+Please implement your preferred metadata store!
+
+Just follow the cassandra_store/cassandra_store.go file and send me a pull
+request. I will handle the rest.
diff --git a/docs/index.rst b/docs/index.rst
index bf696558a..429e76d64 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -6,14 +6,14 @@ Welcome to weed-fs's documentation! ===================================
-An official mirror of code.google.com/p/weed-fs .
-Moving to github.com to make cooperations easier.
+An official mirror of code.google.com/p/weed-fs .
+Moving to github.com to make cooperations easier.
 This repo and the google code repo will be kept synchronized.
For documents and bug reporting, Please visit http://weed-fs.googlecode.com - + For pre-compiled releases, https://bintray.com/chrislusf/Weed-FS/seaweed @@ -30,6 +30,7 @@ Contents: ttl failover directories + distributed_filer usecases optimization benchmarks
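Review note on docs/distributed_filer.rst: the flat namespace and its listing limitation may be easier to follow with a small model. The sketch below is only an in-memory stand-in for the seaweed_files table (path => fids), not the actual cassandra_store.go logic. The class name, method names, and the sample file id are hypothetical, and it assumes each write replaces the fids for a path.

```python
class FlatNamespaceStore:
    """Hypothetical in-memory model of the seaweed_files table (path -> fids)."""

    def __init__(self):
        # One row per full file path; no separate directory rows exist.
        self._rows = {}

    def put(self, full_path, fid):
        # Assumption: a write replaces whatever fids were stored for the path.
        self._rows[full_path] = [fid]

    def get(self, full_path):
        # A single keyed lookup, no matter how deep the path is.
        return self._rows.get(full_path)

    def delete(self, full_path):
        self._rows.pop(full_path, None)

    def list_dir(self, directory):
        # Without prefix search, listing has to scan every row,
        # which is why a key-only store cannot offer it cheaply.
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self._rows if p.startswith(prefix))


if __name__ == "__main__":
    store = FlatNamespaceStore()
    store.put("/path/to/sources/README.md", "3,01637037d6")  # illustrative fid
    print(store.get("/path/to/sources/README.md"))
    print(store.get("/path/to/sources"))  # directories have no row of their own
    print(store.list_dir("/path/to/sources"))
```

Reading a file needs one lookup on the full path, while list_dir has to touch every stored path. That trade-off is what the Limitation and Flat Namespace Design sections describe.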