Overview of how HDFS works with OneFS

This chapter provides information about how the Hadoop Distributed File System (HDFS) can be implemented with Isilon OneFS.

How Hadoop is implemented on OneFS

In a Hadoop implementation on an Isilon cluster, Isilon OneFS serves as the file system for Hadoop compute clients. The Hadoop Distributed File System (HDFS) is supported as a protocol, which Hadoop compute clients use to access data on the HDFS storage layer.

Hadoop compute clients can access the data that is stored on an Isilon cluster by connecting to any node over the HDFS protocol. All nodes that are configured for HDFS provide NameNode and DataNode functionality, as shown in the following illustration.

Figure 1. EMC Isilon Hadoop Deployment

Each node that you add boosts performance and expands the cluster's capacity. For Hadoop analytics, the Isilon scale-out distributed architecture minimizes bottlenecks, rapidly serves Big Data, and optimizes performance.

How an Isilon OneFS Hadoop implementation differs from a traditional Hadoop deployment

A Hadoop implementation with OneFS differs from a typical Hadoop implementation in the following ways:

  • The Hadoop compute and HDFS storage layers are on separate clusters instead of the same cluster.
  • Instead of storing data within a Hadoop distributed file system, the storage layer functionality is fulfilled by OneFS on an Isilon cluster. Nodes on the Isilon cluster function as both a NameNode and a DataNode.
  • The compute layer is established on a Hadoop compute cluster that is separate from the Isilon cluster. The Hadoop MapReduce framework and its components are installed on the Hadoop compute cluster only.
  • Instead of a storage layer, HDFS is implemented on OneFS as a native, lightweight protocol layer between the Isilon cluster and the Hadoop compute cluster. Clients from the Hadoop compute cluster connect over HDFS to access data on the Isilon cluster.
  • In addition to HDFS, clients from the Hadoop compute cluster can connect to the Isilon cluster over any protocol that OneFS supports, such as NFS, SMB, FTP, and HTTP. Isilon OneFS is the only non-standard implementation of HDFS offered that allows for multi-protocol access. Isilon is an alternative storage system to native HDFS that combines HDFS services with enterprise-grade data management features.
  • Hadoop compute clients can connect to any node on the Isilon cluster that functions as a NameNode instead of being routed by a single NameNode.

Hadoop distributions supported by OneFS

You can run most common Hadoop distributions with the Isilon cluster.

OneFS supports many distributions of the Hadoop Distributed File System (HDFS). These distributions are updated independently of OneFS and on their own schedules.

For the latest information about Hadoop distributions that OneFS supports, see the Hadoop Distributions and Products Supported by OneFS page on the Isilon Community Network.

HDFS files and directories

You must configure one HDFS root directory in each OneFS access zone that will contain data accessible to Hadoop compute clients. When a Hadoop compute client connects to the cluster, the user can access all files and sub-directories in the specified root directory. The default HDFS directory is /ifs.

Note the following:

  • Associate each IP address pool on the cluster with an access zone. When Hadoop compute clients connect to the Isilon cluster through a particular IP address pool, the clients can access only the HDFS data in the associated access zone. This configuration isolates data within access zones and allows you to restrict client access to the data.
  • Unlike NFS mounts or SMB shares, clients connecting to the cluster through HDFS cannot be given access to individual folders within the root directory. If you have multiple Hadoop workflows that require separate sets of data, you can create multiple access zones and configure a unique HDFS root directory for each zone.
  • When you set up directories and files under the root directory, make sure that they have the correct permissions so that Hadoop clients and applications can access them. Directories and permissions will vary by Hadoop distribution, environment, requirements, and security policies.
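
The relationship between IP address pools, access zones, and HDFS root directories described above can be sketched in a few lines of Python. This is an illustration only, not OneFS code: the pool names, zone names, and directory paths are hypothetical, and the function `resolve_onefs_path` is invented for this example. It shows how a path that a Hadoop client requests would correspond to a location under the HDFS root directory of the access zone tied to the pool the client connected through.

```python
import posixpath

# Hypothetical layout: IP address pool -> access zone -> HDFS root directory.
# None of these names come from a real cluster.
POOL_TO_ZONE = {
    "pool-analytics": "zone-analytics",
    "pool-finance": "zone-finance",
}
ZONE_TO_HDFS_ROOT = {
    "zone-analytics": "/ifs/data/analytics/hadoop",
    "zone-finance": "/ifs/data/finance/hadoop",
}


def resolve_onefs_path(pool: str, hdfs_path: str) -> str:
    """Map a client's HDFS path to a path under the zone's HDFS root."""
    root = ZONE_TO_HDFS_ROOT[POOL_TO_ZONE[pool]]
    # Normalize against "/" so that ".." segments cannot climb above the root.
    relative = posixpath.normpath("/" + hdfs_path.lstrip("/")).lstrip("/")
    return posixpath.join(root, relative) if relative else root


if __name__ == "__main__":
    print(resolve_onefs_path("pool-analytics", "/user/alice"))
    print(resolve_onefs_path("pool-finance", "/"))
```

Because each pool resolves to exactly one zone and each zone to exactly one root, two workflows that need separate data sets are modeled here as two zones rather than as two folders under one root.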

For more information about access zones, refer to the OneFS CLI Administration Guide or OneFS Web Administration Guide for your version of OneFS.

Hadoop user and group accounts

Before implementing Hadoop, ensure that the user and group accounts that you will need to connect over HDFS are configured on the Isilon cluster.

Additionally, ensure that the user accounts that your Hadoop distribution requires are configured on the Isilon cluster on a per-zone basis. The user accounts that you need and the associated owner and group settings vary by distribution, requirements, and security policies. The profiles of the accounts, including UIDs and GIDs, on the Isilon cluster should match the profiles of the accounts on your Hadoop compute clients.
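
One way to check that account profiles match is to compare the name, UID, and GID of each account as seen on the Hadoop compute clients against the same account on the Isilon cluster. The following Python sketch assumes that you have already collected both sets of accounts into dictionaries; how you collect them is outside its scope. The account names and ID values are illustrative, and `find_mismatches` is a hypothetical helper, not part of OneFS or Hadoop.

```python
from typing import Dict, List, Tuple

# An account profile is modeled as (UID, primary GID).
Account = Tuple[int, int]


def find_mismatches(cluster: Dict[str, Account],
                    clients: Dict[str, Account]) -> List[str]:
    """Report accounts that are missing on the cluster or whose IDs differ."""
    problems = []
    for name, ids in sorted(clients.items()):
        if name not in cluster:
            problems.append(f"{name}: missing on cluster")
        elif cluster[name] != ids:
            problems.append(
                f"{name}: cluster has {cluster[name]}, client has {ids}")
    return problems


if __name__ == "__main__":
    # Illustrative values only; real accounts depend on your distribution.
    on_cluster = {"hdfs": (501, 500), "yarn": (502, 500)}
    on_clients = {"hdfs": (501, 500), "yarn": (512, 500), "mapred": (503, 500)}
    for line in find_mismatches(on_cluster, on_clients):
        print(line)
```

An empty result means every account found on the compute clients exists on the cluster with the same UID and GID. Run the comparison once per access zone, because accounts are configured on a per-zone basis.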

OneFS must be able to look up a local Hadoop user or group by name. If there are no directory services, such as Active Directory or LDAP, that can perform a user lookup, you must create a local Hadoop user or group. If directory services are available, a local user account or user group is not required.

HDFS and SmartConnect

You can configure a SmartConnect DNS zone to manage connections from Hadoop compute clients.

SmartConnect is a module that specifies how the DNS server on an Isilon cluster handles connection requests from clients. For each IP address pool on the Isilon cluster, you can configure a SmartConnect DNS zone, which is identified by a fully qualified domain name (FQDN).

For more information on SmartConnect, refer to the OneFS CLI Administration Guide or OneFS Web Administration Guide for your version of OneFS.

Note the following:

  • Hadoop compute clients can connect to the cluster through the SmartConnect DNS zone name, and SmartConnect evenly distributes NameNode requests across IP addresses and nodes in the pool.
  • When a Hadoop compute client makes an initial DNS request to connect to the SmartConnect zone, the Hadoop client is routed to the IP address of an Isilon node that serves as a NameNode. Subsequent requests from the Hadoop compute client go to the same node. When a second Hadoop client makes a DNS request for the SmartConnect zone, SmartConnect balances traffic and routes the client connection to a different node than the one used by the previous Hadoop compute client.
  • If you specify a SmartConnect DNS zone that you want Hadoop compute clients to connect through, you must add a Name Server (NS) record as a delegated domain to the authoritative DNS zone that contains the Isilon cluster.
  • On the Hadoop compute cluster, you must set the value of the fs.defaultFS property to the SmartConnect DNS zone name in the core-site.xml file.
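
As an illustration of the last item, the following Python sketch builds the fs.defaultFS property element for core-site.xml. It rests on several assumptions: the zone name hadoop.example.com is a placeholder, the value is written as an hdfs:// URI as is usual for this Hadoop property, and port 8020 is assumed as the HDFS port, so confirm the port that your cluster uses. The function `render_core_site` is a hypothetical helper. Most distributions manage core-site.xml through their own tooling, so treat this as a way to see the expected shape of the property rather than as a deployment method.

```python
import xml.etree.ElementTree as ET


def render_core_site(smartconnect_zone: str, port: int = 8020) -> str:
    """Return a <configuration> element that sets fs.defaultFS.

    The port default of 8020 is an assumption; verify it for your cluster.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{smartconnect_zone}:{port}"
    # The result is serialized on a single line without indentation.
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # "hadoop.example.com" stands in for your SmartConnect DNS zone name.
    print(render_core_site("hadoop.example.com"))
```

Pointing fs.defaultFS at the SmartConnect zone name rather than at the address of a single node lets SmartConnect distribute client connections across the nodes in the pool.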