The Big Data Tools plugin seamlessly integrates HDFS into your IDE and provides access to different cloud storage systems (AWS S3, MinIO, Linode, DigitalOcean Spaces, Google Cloud Storage, Azure). But is this the end? Have we implemented everything, and has progress now stopped? Of course not.
In this short digest, we'll take a look at 15 popular distributed file systems available on the market and try to get a sense of their individual advantages.
Almost all of these systems are free or open-source, and you can find the sources on GitHub. The sites of these projects, their documentation, and online reviews provide most of the information we’ll consider here. Other than HDFS, none of these technologies have been implemented yet in Big Data Tools. But who knows? Perhaps someday we'll see them in our plugin.
Hadoop Distributed File System (HDFS)
HDFS is the default distributed file system for Big Data projects, and our story starts here. It's highly fault-tolerant and designed to be deployed on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets.
So it's designed to cope with hardware failures. One of the core architectural goals of HDFS is the detection of faults and quick, automatic recovery from them. It's also fine-tuned for a streaming access model: the emphasis is on high throughput of data access rather than low latency. Interactive use and full POSIX semantics are clearly traded for high throughput rates.
HDFS can store gigabytes to terabytes of data, large files included, and tens of millions of files in a single instance. But all this comes at a cost: HDFS applications need a write-once-read-many access model for files, which means files cannot be changed except by appends and truncates.
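To make the access model concrete, here's a sketch of the write-once-read-many pattern using the third-party `hdfs` WebHDFS client for Python. The namenode address, user, and paths are placeholder assumptions, not anything from this article:

```python
# Sketch of HDFS's write-once-read-many model via WebHDFS, using the
# third-party `hdfs` package. Endpoint, user, and paths are assumptions.
WEBHDFS_URL = "http://namenode.example.com:9870"  # assumed namenode address

def demo():
    from hdfs import InsecureClient  # pip install hdfs

    client = InsecureClient(WEBHDFS_URL, user="hadoop")
    # Write once: create the file.
    client.write("/logs/events.log", data=b"event-1\n")
    # The only permitted mutations are appends (and truncates).
    client.write("/logs/events.log", data=b"event-2\n", append=True)
    # Read many: stream the file back as often as needed.
    with client.read("/logs/events.log") as reader:
        print(reader.read())

# demo() requires a live cluster, so it isn't invoked here.
```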
Ambry
Ambry is a distributed object store designed to store and serve photos, videos, PDFs, and other media types for web companies. Ambry combines the writes of small objects into a sequential log to ensure that all writes are batched, flushed asynchronously, and cause minimal disk fragmentation. Ambry can store trillions of small immutable objects (50 KB to 100 KB) and billions of larger objects.
Ambry is a highly available and eventually consistent system. Writes are made to a local data center and asynchronous replication copies the data to the other data centers. You can configure what to do if a machine is not available locally, how to proxy requests to other data centers, and so on. For example, objects can be written to the same partition in any data center and can be read from any other data center.
Apache Kudu
Apache Kudu completes Hadoop's storage layer to enable fast analytics on fast data.
A Kudu cluster stores tables that look just like the tables used in relational SQL databases. It's very easy to port legacy applications, and you can build new applications with an SQL mindset. There's no need to mess around with binary blobs or quirky JSON. You can just work with plain old PRIMARY KEYs and columns, and the data model is fully typed. Tables are self-describing, so you can use standard tools like SQL engines or Spark to analyze your data.
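As an illustration of that fully typed model, here's a hedged sketch using the `kudu-python` client. The master address, table name, and columns are all assumptions made up for this example:

```python
# Sketch of Kudu's relational data model: typed columns plus an explicit
# PRIMARY KEY. Master address, table name, and schema are assumptions.
TABLE = "metrics"
PRIMARY_KEY = ["host", "ts"]

def create_table():
    import kudu  # pip install kudu-python
    from kudu.client import Partitioning

    client = kudu.connect(host="kudu-master.example.com", port=7051)
    builder = kudu.schema_builder()
    builder.add_column("host", type_=kudu.string, nullable=False)
    builder.add_column("ts", type_=kudu.unixtime_micros, nullable=False)
    builder.add_column("value", type_=kudu.double)
    builder.set_primary_keys(PRIMARY_KEY)  # plain old PRIMARY KEY, no blobs
    partitioning = Partitioning().add_hash_partitions(["host"], 3)
    client.create_table(TABLE, builder.build(), partitioning)

# create_table() needs a running Kudu master, so it isn't invoked here.
```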
It's a live storage system that supports low-latency millisecond-scale access to individual rows. Despite the fact that Kudu isn't designed to be an OLTP system, if a subset of the data fits in your local memory, it can perform OLTP tasks pretty well.
The most important thing about Kudu is that it was designed to fit in with the Hadoop ecosystem. You can stream data from live real-time data sources using the Java client and then process it immediately using Spark, Impala, or MapReduce. You can even transparently join Kudu tables with data stored in other Hadoop storage such as HDFS or HBase.
BeeGFS
BeeGFS, formerly FhGFS, is a parallel distributed file system. It runs lightweight, high-performance service daemons in user space on top of an existing local file system, such as ext4, xfs, or zfs.
The userspace architecture of BeeGFS makes it possible to keep metadata access latency (e.g., directory lookups) to a minimum and distributes the metadata across multiple servers so that each of the metadata servers stores a part of the global file system namespace. BeeGFS is built on highly efficient and scalable multithreaded core components with native RDMA support. File system nodes can serve RDMA (InfiniBand, Omni-Path, RoCE, and TCP/IP) network connections at the same time and automatically switch to a redundant connection path in the event that any of them fail.
Hadoop can be configured to use BeeGFS as its distributed file system, and in some cases it may be a more convenient and faster option than HDFS. You can use it as a Hadoop connector (for systems that lack low-latency, high-throughput networks and could therefore benefit from data locality) or as a POSIX mountpoint, as if it were a local file system (for systems where data locality is not as important as parallelism).
Ceph
Ceph provides a traditional file system interface with POSIX semantics. It can be used as a drop-in replacement for the Hadoop File System (HDFS), and the Ceph documentation describes how to install Ceph and configure it for use with Hadoop.
Ceph's file system runs on top of the same system responsible for object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and file names of the file system to objects stored within RADOS clusters. The metadata server cluster can expand or contract, and it can rebalance the file system dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.
You can integrate Ceph with Hadoop using the Java CephFS Hadoop plugin. You just need to add the JAR dependencies to the Hadoop installation CLASSPATH and update core-site.xml.
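For reference, the core-site.xml changes look roughly like this. The property names follow the CephFS Hadoop plugin documentation, while the monitor address and paths are assumptions:

```xml
<!-- Sketch of core-site.xml entries for the CephFS Hadoop plugin.
     The monitor host/port and ceph.conf path are assumptions. -->
<property>
  <name>fs.defaultFS</name>
  <value>ceph://mon-host.example.com:6789/</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
<property>
  <name>ceph.conf.file</name>
  <value>/etc/ceph/ceph.conf</value>
</property>
```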
Disco Distributed Filesystem (DDFS)
Disco itself is an implementation of MapReduce for distributed computing. Disco supports parallel computations over large data sets stored on an unreliable cluster of computers. Disco Distributed Filesystem (DDFS) provides a distributed storage layer for Disco. It can store massive amounts of immutable data, for instance log data, large binary objects (photos, videos, indices), or incrementally collected raw data, such as web crawls.
It's not a general-purpose POSIX-compatible filesystem. Rather, it is a special purpose storage layer similar to Hadoop Distributed Filesystem (HDFS). DDFS is schema-free, so you can use it to store arbitrary, non-normalized data. But the most exciting thing about DDFS is that it is a tag-based file system.
Instead of having to organize data in directory hierarchies, you can tag sets of objects with arbitrary names and retrieve them later based on the given tags. For instance, tags can be used to timestamp different versions of data or to denote its source or owner. Tags can contain links to other tags, and data can be referred to by multiple tags. Tags thus form a network or a directed graph of the metadata. This provides a flexible way to manage terabytes of data assets. DDFS also provides a mechanism for storing arbitrary attributes with the tags, for instance to denote data type.
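To make the tag graph idea concrete, here's a toy model in Python. This is not the DDFS API, just an illustration of how tag-to-tag links resolve to blobs:

```python
# Tags in DDFS form a directed graph: a tag points at blobs and may link to
# other tags. This toy model (not the DDFS API) shows how retrieving a tag
# resolves every blob reachable through its links.
def resolve(tags, name, seen=None):
    """Collect all blobs reachable from `name`, following tag-to-tag links."""
    seen = set() if seen is None else seen
    if name in seen:  # tolerate cycles in the tag graph
        return set()
    seen.add(name)
    blobs, links = tags[name]
    result = set(blobs)
    for link in links:
        result |= resolve(tags, link, seen)
    return result

# Each tag maps to (blobs, linked tags); data can be referred to by many tags.
tags = {
    "crawl:2012-05": ({"blob-1", "blob-2"}, set()),
    "crawl:2012-06": ({"blob-3"}, set()),
    "crawl:all":     (set(), {"crawl:2012-05", "crawl:2012-06"}),
}

print(sorted(resolve(tags, "crawl:all")))  # ['blob-1', 'blob-2', 'blob-3']
```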
GridGain File System (GGFS)
GGFS is a plug-and-play alternative to the disk-based HDFS. It improves performance for I/O-, CPU-, or network-intensive Hadoop MapReduce jobs running on tens or hundreds of computers in a typical Hadoop cluster. It's part of the GridGain In-Memory Accelerator.
GridGain claims that ETL is no longer necessary. GGFS makes it possible to work with data that is stored directly in Hadoop. Whether the in-memory file system is used in primary mode, or in secondary mode acting as an intelligent caching layer over the primary disk-based HDFS, it can eliminate the time-consuming and costly process of extracting, transforming, and loading (ETL) data to and from Hadoop. You can find details on the GridGain website or in their whitepaper.
Lustre file system
Many databases claim they are distributed, parallel, scalable, high-performance, high-availability, and so on. The exciting thing about Lustre is that it has exascale capabilities. The majority of the top 100 fastest computers, as measured by top500.org, use Lustre for their high performance, scalable storage.
Lustre is purpose-built to provide a coherent, global POSIX-compliant namespace for very large scale computer infrastructure, including the world's largest supercomputer platforms. It can support hundreds of petabytes of data storage and hundreds of gigabytes per second in simultaneous, aggregate throughput. Some of the largest current installations have individual file systems in excess of fifty petabytes of usable capacity and have reported throughput speeds exceeding one terabyte/sec.
Lustre has a very complex architecture, but this document can help you become familiar with it.
Microsoft Azure Data Lake Store
Azure is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through data centers managed by Microsoft.
Azure Data Lake Storage Gen1 is a repository for big data analytic workloads. It enables you to capture data of any size, type, and ingestion speed in one place for operational and exploratory analytics. It can be accessed from Hadoop using the WebHDFS-compatible REST API. It's designed to make it possible to perform high-performance analytics on stored data.
ADL support for Hadoop is provided as a separate JAR module. You can read and write data stored in an Azure Data Lake Storage account and reference the data with an adl scheme (adl://<Account Name>.azuredatalakestore.net/directory). You can also use it as a source or sink of data in a MapReduce job.
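The adl scheme from the paragraph above can be assembled like this (the account name and path are placeholders):

```python
# Build an adl:// URI as described in the text. The account name and path
# are placeholder assumptions; the URI shape follows the adl scheme shown above.
def adl_uri(account: str, path: str) -> str:
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"

print(adl_uri("mystore", "/clickstream/2020/01"))
# adl://mystore.azuredatalakestore.net/clickstream/2020/01
```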
Quantcast File System
QFS is an alternative to the Hadoop Distributed File System (HDFS) for large-scale batch data processing. It is a production-hardened, 100% open-source distributed file system. It is fully integrated with Hadoop and delivers significantly improved performance while consuming 50% less disk space (or at least that's what their research says).
QFS is implemented in C++ and carefully manages its own memory within a fixed footprint. It uses low-level direct I/O APIs for big sequential bursts instead of the operating system's less efficient default file I/O.
QFS uses Reed-Solomon (RS) error correction. HDFS uses triple replication, which expands data 3x. QFS uses half as much disk space by leveraging the same error correction technique CDs and DVDs do, providing better recovery power with only a 1.5x expansion.
Red Hat GlusterFS
Gluster is a scalable, distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. It scales to several petabytes, handles thousands of clients, maintains POSIX compatibility, and provides replication, quotas, and geo-replication. And you can access it over NFS and SMB!
GlusterFS is a userspace filesystem. Its developers opted for this approach in order to avoid the need to have modules in the Linux kernel, and as a result it is quite safe and easy to use.
GlusterFS provides compatibility with Apache Hadoop, using the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing MapReduce-based applications can use GlusterFS seamlessly. You can use it to provide simultaneous file-based and object-based access within Hadoop and eliminate the need for a centralized metadata server.
Alluxio
Alluxio enables data orchestration for compute in any cloud. It unifies data silos on-premise and across any cloud to give you the data locality, accessibility, and elasticity needed to reduce the complexities associated with orchestrating data for today’s big data and AI/ML workloads.
Alluxio's architecture is quite simple. Master nodes manage file and object metadata. Worker nodes manage local storage resources (such as memory, SSDs, and disks) and serve file and object blocks, with an under store such as HDFS beneath them. Clients provide an interface for analytics and AI/ML applications.
It is meant to be used with Spark, Hive, HBase, MapReduce and even Presto (a distributed SQL query engine), covering all popular technology stacks. It also has a feature called "zero-copy bursting", which enables you to burst, or move, your remote data closer, to leverage some computing capacity in the cloud. You can watch this talk by Ashish Tadose on how Walmart uses all these features.
Tahoe-LAFS
Tahoe-LAFS is a decentralized storage system with what its developers call “provider-independent security”.
You run a client program on your computer, which talks to one or more storage servers on other computers. When you tell your client to store a file, it will encrypt that file, encode it into multiple pieces, then spread those pieces out among multiple servers. The pieces are all encrypted and protected against modifications. Later, when you ask your client to retrieve the file, it will find the necessary pieces, make sure they haven't been corrupted, reassemble them, and decrypt the result.
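As a sketch, here's what that round trip looks like through a local Tahoe gateway's web API. The gateway address is the conventional default, and the exact endpoints should be treated as assumptions:

```python
# Sketch of a store/retrieve round trip via a local Tahoe-LAFS gateway's
# web API. Gateway address and endpoints are assumptions; the gateway itself
# performs the encryption, erasure coding, and share placement.
import urllib.parse

GATEWAY = "http://127.0.0.1:3456"  # assumed local gateway

def cap_url(cap: str) -> str:
    # Capability strings contain ':' and must be percent-encoded in the URL.
    return f"{GATEWAY}/uri/{urllib.parse.quote(cap, safe='')}"

def demo():
    import requests  # third-party; assumed available
    # Upload: the gateway encrypts, splits, and spreads the file, returning
    # a capability string, the only key that can locate and decrypt the file.
    cap = requests.put(f"{GATEWAY}/uri", data=b"secret notes").text.strip()
    # Download: the gateway fetches enough shares, verifies, and decrypts.
    print(requests.get(cap_url(cap)).content)

# demo() requires a running Tahoe gateway, so it isn't invoked here.
```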
A crucial aspect of this process is that the service provider never has the ability to read or modify your data. You're invulnerable to security breaches of your cloud storage providers, because you never trusted them. Not only is it easy and inexpensive for the service provider to maintain the security of your data, but in fact they couldn't violate its security if they tried. This is what they mean by provider-independent security.
The Baidu File System
The Baidu File System (BFS) is a distributed file system that is able to handle Baidu-scale projects. Together with Galaxy and Tera, BFS supports many real-time products in Baidu, including its web page database, incremental indexing system, and user behavior analysis system.
Technically it should support real-time applications. Like many other distributed file systems, BFS is highly fault-tolerant, but it's designed to provide low read/write latency while maintaining high throughput rates.
Its biggest problem is the lack of documentation, or at least public documentation. The first impression is that it's like HDFS with a distributed NameNode/Nameserver, but there's simply not enough information to understand how it works under the hood.
SeaweedFS
I’d like to finish this digest with a small filesystem that isn't developed by giants like Microsoft or Baidu. SeaweedFS is a one-man operation supported by almost a hundred committers and several backers on Patreon. It has 4,324 commits and 123 releases on GitHub, and the latest commit was made just a few hours ago. SeaweedFS is a simple and highly scalable distributed file system with two main objectives: to store billions of files and to serve those files fast.
SeaweedFS implements Facebook's Haystack design paper and erasure coding with ideas from f4, Facebook’s Warm BLOB Storage System. With hot data on local clusters, and warm data on the cloud with O(1) access time, SeaweedFS can achieve both fast local access times and elastic cloud storage capacity, without any client side changes. Each file’s metadata only has 40 bytes of disk storage overhead.
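Here's a hedged sketch of that Haystack-style write path against a local SeaweedFS setup. The master address and response fields follow the project's README examples; treat the details as assumptions:

```python
# Sketch of SeaweedFS's two-step write path: ask the master for a file id,
# then talk to the volume server directly. The fid encodes the volume, so
# reads are O(1). Master address and response fields are assumptions.
def volume_of(fid: str) -> int:
    """The volume id is the part of the fid before the comma."""
    return int(fid.split(",", 1)[0])

def demo():
    import requests  # third-party; assumed available
    master = "http://localhost:9333"
    # Step 1: the master assigns a file id and a volume server URL,
    # e.g. {"fid": "3,01637037d6", "url": "127.0.0.1:8080", ...}.
    assign = requests.get(f"{master}/dir/assign").json()
    upload_url = f"http://{assign['url']}/{assign['fid']}"
    # Step 2: upload the file straight to the volume server.
    requests.post(upload_url, files={"file": b"hello, seaweed"})
    print(requests.get(upload_url).content)  # read it back from the same URL

# demo() requires a running SeaweedFS master and volume server.
```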