Posts Tagged: Distributed File Systems


25
Jun 09

MogileFS Installation Study Notes

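Start here

This is not original work, just a record of a process.
There are already plenty of articles about MogileFS online; the useful parts are collected below.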

My platform
Operating system: CentOS release 5.3.
Hardware architecture: i386.
Other: minimal install, with the "Development Tools" group installed.

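References

The main reference is this article: http://durrett.net/mogilefs_setup.html.
You can also read the official wiki: http://mogilefs.pbwiki.com/. (It may be blocked by the GFW; if so, installing the gladder add-on for Firefox lets you view it.)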

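MogileFS features

MogileFS is a distributed file storage solution developed by Six Apart. Some of its features are listed below (translated from the introduction on the MogileFS page, http://www.danga.com/mogilefs/).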

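  • Application level: no special kernel components are required.
  • No single point of failure: all three components of a MogileFS setup (storage nodes, trackers, and the trackers' database) can run on multiple machines, so there is no single point of failure. (You can also run trackers and storage nodes on the same machines, so you do not need 4 machines.) At least two machines are recommended.
  • Automatic file replication: based on its "class", a file is automatically replicated to enough different storage nodes with free space to satisfy the minimum replica count of that class. For a photo site, for example, you can require at least three copies of the original JPEGs while thumbnails and scaled versions keep only 1 or 2 copies; if the only copy of a thumbnail is lost, it can simply be rebuilt. In this way MogileFS (without RAID) saves disk space that would otherwise be spent storing multiple copies of data unnecessarily.
  • "Better than RAID": in a non-SAN RAID setup the disks are redundant but the host is not; if the whole machine dies, the files become inaccessible. MogileFS replicates files between different machines, so files are always available.
  • Transport neutral, no proprietary protocol: MogileFS clients can talk to MogileFS storage nodes over NFS or HTTP, but they must consult a tracker first.
  • Flat namespace: files are identified by a given key in a single global namespace. You can create as many namespaces as you like, so that applications whose keys might otherwise conflict can share the same MogileFS installation.
  • Shared-nothing: MogileFS does not depend on an expensive SAN with shared disks; each machine only maintains its own disks.
  • No RAID required: disks in MogileFS can be in a RAID or not. If the goal is safety, there is no need to buy RAID, because MogileFS already provides it.
  • Local filesystem agnostic: disks on MogileFS storage nodes can be formatted with the filesystem of your choice (ext3, ReiserFS, and so on). MogileFS does its own internal directory hashing, so it does not run into filesystem limits such as the maximum number of files per directory. Use whatever you are comfortable with.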
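To make the "internal directory hashing" point concrete, here is a minimal Python sketch of spreading numeric file ids over nested directories. The function name `fid_path` and the exact layout (a 10-digit zero-padded id split into 1/3/3-digit directory levels) are assumptions for illustration; the real on-disk layout used by mogstored may differ.

```python
def fid_path(devid: int, fid: int) -> str:
    """Map a numeric file id to a nested path under a device directory.

    Hypothetical layout: the id is zero-padded to 10 digits and its
    leading digits become directory levels, so each leaf directory
    holds at most 1000 files no matter how many files are stored.
    """
    if devid < 0 or not 0 <= fid < 10**10:
        raise ValueError("devid must be >= 0 and fid must fit in 10 digits")
    padded = f"{fid:010d}"
    top, mid, low = padded[0], padded[1:4], padded[4:7]
    return f"/dev{devid}/{top}/{mid}/{low}/{padded}.fid"


if __name__ == "__main__":
    print(fid_path(3, 1234567))
```

Because the directory levels come from the id itself, the per-directory file count stays bounded regardless of which local filesystem is underneath.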

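Components of MogileFS

1) Database (MySQL)
You can initialize the database with the mogdbsetup program. The database holds all MogileFS metadata. You can give it a dedicated server or run it alongside other programs. The database is critical, as important as the authentication center of a mail system: if it goes down, the whole MogileFS installation becomes unavailable. It is therefore best deployed in an HA setup.
2) Storage nodes
Starting the mogstored program turns the local machine into a storage node. On startup it reads /etc/mogilefs/mogstored.conf by default; see the configuration section for details. Once mogstored is running, the machine can be added to the cluster with mogadm. A machine can run just one mogstored as a storage node, or run other programs at the same time.
3) Trackers
mogilefsd is the tracker program. As described on the MogileFS wiki, trackers do a lot of work: Replication, Deletion, Query, Reaper, Monitor, and so on. Every mogadm and mogtool operation goes through the trackers, and clients also need the trackers defined for their operations, so it is best to run several trackers at once for load balancing. Trackers can also run on a single machine, or alongside other programs, as long as the configuration file is set up; by default it is /etc/mogilefs/mogilefsd.conf.
4) Tools
These are mainly mogadm and mogtool, which are used on the command line to control the whole MogileFS system, check its status, and so on.
5) Client
The client is actually a Perl module (.pm); programs call this module to use the MogileFS system and to read and write across the whole system.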
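Since clients have to be told about the trackers, and running several of them is recommended, a client-side choice between trackers might look roughly like the Python sketch below. It is not part of any MogileFS client library: `pick_tracker`, `tcp_probe`, and the example addresses are hypothetical names made up for illustration.

```python
import random
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(address: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to 'host:port' can be opened."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False


def pick_tracker(
    trackers: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Shuffle the configured trackers and return the first reachable one.

    Shuffling spreads clients across trackers (crude load balancing);
    skipping unreachable ones gives simple failover.
    """
    candidates = list(trackers)
    (rng or random).shuffle(candidates)
    for address in candidates:
        if probe(address):
            return address
    return None


if __name__ == "__main__":
    # Hypothetical tracker addresses; replace with your own hosts.
    print(pick_tracker(["10.0.0.1:7001", "10.0.0.2:7001"]))
```

The probe is injectable so the selection logic can be exercised without a running tracker.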

MogileFS PHP extension
http://www.capoune.net/mogilefs/ provides a PHP extension for using MogileFS from PHP.
There is also an SVN source repository at http://svn.usrportage.de/php-mogilefs/trunk/

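Key concepts when using MogileFS

domain: the top-level scope; a key is unique within a domain.
class: belongs to a domain; the number of copies to keep can be defined per class.
key: the unique identifier of a file.
file: the file itself.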
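The relationship between these four concepts can be sketched as a small in-memory model in Python. This is only an illustration of the ideas above, not MogileFS's actual schema; the type names and the `min_copies` field are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class FileClass:
    name: str
    min_copies: int  # how many copies this class asks for


@dataclass
class StoredFile:
    key: str
    class_name: str
    devices: Set[int] = field(default_factory=set)  # devices holding a copy


@dataclass
class Domain:
    name: str
    classes: Dict[str, FileClass] = field(default_factory=dict)
    files: Dict[str, StoredFile] = field(default_factory=dict)

    def add_class(self, name: str, min_copies: int) -> None:
        self.classes[name] = FileClass(name, min_copies)

    def put(self, key: str, class_name: str, devices: Iterable[int]) -> None:
        """Store a file under a key; a key is unique within the domain,
        so writing the same key again replaces the earlier file."""
        if class_name not in self.classes:
            raise KeyError(f"unknown class: {class_name}")
        self.files[key] = StoredFile(key, class_name, set(devices))

    def missing_copies(self, key: str) -> int:
        """How many more copies are needed to satisfy the file's class."""
        stored = self.files[key]
        wanted = self.classes[stored.class_name].min_copies
        return max(0, wanted - len(stored.devices))


if __name__ == "__main__":
    photos = Domain("photos")
    photos.add_class("original", 3)
    photos.put("img/1.jpg", "original", [1])
    print(photos.missing_copies("img/1.jpg"))
```

Keeping the copy count on the class rather than on each file is what lets one setting cover every file of that kind.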

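Where MogileFS fits

Because MogileFS does not support random reads and writes within a file, it only suits certain kinds of applications, such as image serving or static HTML serving: applications where a file rarely needs to change once it has been written. You can, of course, generate a new file and overwrite the old one.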

How MogileFS works (translated)

MogileFS is made up of the following parts:

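  • Application: the application that wants to store/load files.
  • Tracker (the mogilefsd process): an event-based parent process/message bus that manages all interaction coming from client applications (requesting operations to be performed), including load-balancing those requests onto "query workers" and handing them to mogilefsd child processes. Run two trackers on different machines for high availability, or more than two if you also need load balancing. The mogilefsd child processes are:
    • Replication: replicates files between machines
    • Deletion: deletes from the namespace are immediate; deletes from the filesystem are asynchronous
    • Query: answers requests from clients
    • Reaper: re-enqueues files for replication after a disk fails
    • Monitor: monitors the health and status of hosts and devices
  • Database: stores the MogileFS metadata (the namespace, and which files are where). It should be set up in a high-availability (HA) configuration to avoid a single point of failure.
  • Storage Nodes: where the files actually live. A storage node is just an HTTP server that handles DELETE, PUT, and so on. Any WebDAV server will do, but mogstored is recommended. mogilefsd can be configured to use two servers on different ports: mogstored for all DAV operations (and traffic monitoring), and a fast HTTP server of your choice for GET operations (serving files to clients). A typical setup has one large SATA disk per mount point, each mounted at /var/mogdata/devNN.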

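High-level flow:

  • The application asks to open a file (via RPC to a tracker, using whichever one is available). It makes a "create_open" request.
  • The tracker makes some load-balancing decisions about where the file could go and gives the application a few possible locations.
  • The application writes to one of those locations (if the write fails, it retries and writes to another location).
  • The application (client) tells the tracker where it wrote the file via "create_close".
  • The tracker links that name into the domain's namespace (via the database).
  • The tracker, in the background, starts replicating the file until it satisfies the replication policy of the file's class.
  • Later, the application issues a "get_paths" request for the domain+key (key == "filename"). After consulting the database/memcache/etc. internally, the tracker replies with the full list of URLs where the file is available, weighted by how busy the I/O is at each location.
  • The application then tries those URLs in order. (The tracker continuously monitors the state of hosts and devices, so it will not return dead locations; by default it also double-checks the first item in the returned list, unless you tell it not to.)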
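The flow above can be walked through with a toy, in-memory stand-in for the tracker. This Python sketch only mirrors the order of the steps (create_open, write, create_close, background replication, get_paths); `ToyTracker`, `store`, and the example URLs are hypothetical, and nothing here speaks the real tracker protocol or does real load balancing.

```python
import itertools
from typing import Callable, Dict, List, Set, Tuple


class ToyTracker:
    """In-memory stand-in for a tracker, for illustration only."""

    def __init__(self, devices: Dict[int, str]):
        self.devices = devices                 # devid -> base URL
        self.down: Set[int] = set()            # devices considered dead
        self.namespace: Dict[Tuple[str, str], int] = {}  # (domain, key) -> fid
        self.locations: Dict[int, List[int]] = {}        # fid -> devids
        self._fids = itertools.count(1)

    def _url(self, devid: int, fid: int) -> str:
        return f"{self.devices[devid]}/{fid:010d}.fid"

    def create_open(self, domain: str, key: str):
        """Hand out a new fid plus candidate write locations on live devices."""
        fid = next(self._fids)
        live = [d for d in sorted(self.devices) if d not in self.down]
        return fid, [(d, self._url(d, fid)) for d in live]

    def create_close(self, domain: str, key: str, fid: int, devid: int) -> None:
        """Record where the client wrote and link the key into the namespace."""
        self.namespace[(domain, key)] = fid
        self.locations[fid] = [devid]

    def replicate(self, fid: int, min_copies: int) -> None:
        """Background step: add copies until the class policy is met."""
        for devid in sorted(self.devices):
            if len(self.locations[fid]) >= min_copies:
                break
            if devid not in self.down and devid not in self.locations[fid]:
                self.locations[fid].append(devid)

    def get_paths(self, domain: str, key: str) -> List[str]:
        """Return URLs for the key, leaving out devices known to be dead."""
        fid = self.namespace[(domain, key)]
        return [self._url(d, fid) for d in self.locations[fid] if d not in self.down]


def store(tracker: ToyTracker, domain: str, key: str,
          write: Callable[[str], bool]) -> str:
    """Client side: try each offered location until one write succeeds."""
    fid, candidates = tracker.create_open(domain, key)
    for devid, url in candidates:
        if write(url):
            tracker.create_close(domain, key, fid, devid)
            return url
    raise RuntimeError("no storage location accepted the write")


if __name__ == "__main__":
    tracker = ToyTracker({1: "http://10.0.0.1:7500/dev1",
                          2: "http://10.0.0.2:7500/dev2"})
    store(tracker, "photos", "img/1.jpg", lambda url: True)
    tracker.replicate(1, min_copies=2)
    print(tracker.get_paths("photos", "img/1.jpg"))
```

Note that the key only becomes visible in the namespace at create_close, so a failed or abandoned write never shows up in get_paths.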

13
Sep 08

A Brief Summary of Distributed and Cluster File Systems

In no particular order:

Lustre
Lustre is a scalable, secure, robust, highly-available cluster file system. It is designed, developed and maintained by Sun Microsystems, Inc.
Designed to meet the demands of the world’s largest high-performance compute clusters, the Lustre file system redefines scalability and provides groundbreaking I/O and metadata throughput. An object-based cluster, Lustre currently supports tens of thousands of nodes, petabytes of data, and billions of files — and development is underway to support one million nodes, trillions of files, and zetta to yotta bytes.
http://www.sun.com/software/products/lustre/
http://wiki.huihoo.com/index.php?title=Lustre

AFS
AFS Reference Page

OpenAFS
What is AFS?
AFS is a distributed filesystem product, pioneered at Carnegie Mellon University and supported and developed as a product by Transarc Corporation (now IBM Pittsburgh Labs). It offers a client-server architecture for file sharing, providing location independence, scalability and transparent migration capabilities for data. OpenAFS is the Transarc source code released as it looked like around AFS3.6 under IBM Public License IPL.

Arla
Arla is a free AFS implementation.
The main goal is to make a fully functional client with all capabilities of AFS as formerly sold by Transarc and today available as OpenAFS. Other stuff, such as servers and management tools are being developed, but currently not considered stable.

Coda
Coda distributed file system: http://www.bsdmap.com/diary/coda.php
Coda File System http://www.coda.cs.cmu.edu/
Coda is a forked version of AFS that supports disconnected and weakly connected operation better than AFS.

InterMezzo
InterMezzo is a new distributed file system with a focus on high availability. InterMezzo will be suitable for replication of servers, mobile computing, managing system software on large clusters, and for maintenance of high availability clusters.

xFS
xFS is a Serverless Network File Service.

CFS
Cluster File Systems, Inc. is the leading developer of next generation technology for scalable high-performance file systems. Our Lustre® file system redefines scalability and has been designed from the ground up to meet the demands of the world’s largest high-performance computer clusters.

GlusterFS
GlusterFS is a cluster file-system capable of scaling to several peta-bytes. It aggregates various storage bricks over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS is based on a stackable user space design without compromising performance.

Scalable File Share
HP StorageWorks Scalable File Share
A high-bandwidth, scalable storage appliance for Linux clusters
http://h20311.www2.hp.com/HPC/cache/276636-0-0-0-121.html

MogileFS
MogileFS is our open source distributed filesystem. Its properties and features include:
-1. Application level
-2. No single point of failure
-3. Automatic file replication
-4. “Better than RAID”
-5. Flat Namespace
-6. Shared-Nothing
-7. No RAID required
-8. Local filesystem agnostic

Hadoop
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:
* Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.
* HBase builds on Hadoop Core to provide a scalable, distributed database.
* ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.

PVFS
http://www.pvfs.org/
http://www.parl.clemson.edu/pvfs/
PVFS is designed to provide high performance for parallel applications, where concurrent, large IO and many file accesses are common. PVFS provides dynamic distribution of IO and metadata, avoiding single points of contention, and allowing for scaling to high-end terascale and petascale systems.

GFS

http://en.wikipedia.org/wiki/Global_File_System
http://www.redhat.com/docs/manuals/csgfs/
GFS (Global File System) is a cluster file system. It allows a cluster of computers to simultaneously use a block device that is shared between them (with FC, iSCSI, NBD, etc…). GFS reads and writes to the block device like a local filesystem, but also uses a lock module to allow the computers to coordinate their I/O so filesystem consistency is maintained. One of the nifty features of GFS is perfect consistency — changes made to the filesystem on one machine show up immediately on all other machines in the cluster.

External links About GFS

1. HP OpenVMS
————–
The first to work with a CFS is HP OpenVMS. Oracle Parallel Server and RAC always used
the OpenVMS filesystem (RMS) for its database.

2 HP Tru64
————
CFS is a layer on top of Advfs, the filesystem of HP Tru64. Oracle uses
the Direct I/O feature available in CFS. Direct I/O enables Oracle to bypass
the buffer cache (no caching at filesystem level). Oracle manages the
concurrent access to the file itself; as it does on raw devices. On CFS,
without Direct I/O enabled on files – file access goes through a CFS server.
A CFS server runs on a cluster member and serves a file domain. A file
domain can be relocated from one cluster member to another cluster member
online. A file domain may contain one or more filesystems.

Direct I/O does not go through the CFS server, but file creation and resizing
is seen as metadata operation by advfs and this has to be done by the CFS
server.  The consequence is to run file creations and resizing on the node
where the CFS server is located. File operations might take longer when the
CFS server is remote.

Oracle recommends not using the tempfile option, as tempfiles might not be
allocated until the tempfile blocks are accessed and so cause
‘remote metadata operations’ for advfs.

3 Veritas
———–
VERITAS Database Edition(TM) / Advanced Cluster for Oracle9i RAC enables Oracle
to use the CFS.  The VERITAS Cluster File System is an extension of the VERITAS
File System (VxFS).  Veritas CFS allows the same filesystem to be simultaneously
mounted on multiple nodes.  Veritas CFS is designed with a master/slave
architecture.  Any node can initiate a metadata operation (create, delete, or
resize data), but the actual operation is carried out by the master node. All
other (non-metadata) I/O goes directly to the disk.

CFS is used in DBE/AC to manage a filesystem in a large database environment.
When used in DBE/AC for Oracle9i RAC, Oracle accesses data files stored on CFS
filesystems by bypassing the filesystem buffer and filesystem locking for data.

4 Oracle Cluster File System
——————————
Oracle Cluster File System (OCFS) is a shared filesystem designed specifically
for Oracle Real Application Clusters. OCFS eliminates the requirement for Oracle
database files to be linked to logical drives and enables all nodes to share a
single Oracle_Home (current capabilities are detailed in section 2.8) instead
of requiring each node to have its own local copy. OCFS volumes can span one
shared disk or multiple shared disks for redundancy and performance
enhancements.

5. Netapp(R) Filer
——————-
Netapp Filer offers CFS functionality via NFS to the server machines. These
filesystems are mounted using special mount options. For details please see
Netapp documentation.

Netapp certifications can be found at:

http://www.netapp.com/part…

To understand the architecture and Oracle installation please see these
documents:

Note 210889.1: RAC Installation with a NetApp Filer in Red Hat Linux Environment
and
Oracle9i RAC Installation with a NetApp Filer on Fujitsu-Siemens Primepower
(Solaris8 Operating System) at http://www.netapp.com/tech…

6 AIX
——-
IBM’s General Parallel File System (GPFS) allows users shared access to files
that may span multiple disk drives on multiple nodes. GPFS provides access to
all data from all nodes of the cluster.  It can be configured with multiple
copies of metadata allowing continued operation should the paths to a disk or
the disk itself be broken. Metadata is the filesystem data that describes
the user data.  GPFS allows the use of RAID or other hardware redundancy
capabilities to enhance reliability.

In Oracle9i GPFS is only supported with HACMP/ES in a RAC configuration.
When placing datafiles on GPFS no CRM (Concurrent Resource Manager) needs to be
installed. Starting with Oracle10g HACMP is no longer required to use GPFS.

Metalink contains certification information and information about required
patches for having a cluster database on a GPFS.

7 Sun GFS
———–
Global File Service (GFS or Cluster File System) is a filesystem that is
accessible from all nodes in the cluster. GFS is based on global devices and
has a client/server architecture. GFS provides transparent and concurrent file
access.

Note that Sun GFS is not supported for Oracle datafiles, see section 3.10.

8 Sun StorEdge QFS
——————–
QFS software is a file manager that provides a shared filesystem where multiple
servers can read and write simultaneously to the same file in the same filesystem.

9 Other Linux Cluster Filesystems
———————————–
There are various third party cluster filesystems available on Linux.
Consult the Oracle Certify website for the policy regarding support for third party
cluster file systems on Linux. Also, consult the RAC Technology Compatibility Matrix (RTCM)
for Linux (http://www.oracle.com/tech… … generic_linux.html)
for the latest information on which third party cluster file systems are supported
by RAC release and platform.

10 Which Platforms support what?
———————————-

Platform and                                Storage for           Storage for
[Cluster Software]                          Oracle installation   datafiles

AIX [HACMP]                                 LFS (1) or CFS (2)    CFS and/or Raw Devices
AIX [CRS]                                   LFS or CFS            CFS and/or Raw Devices
HP/UX [MC/Service Guard]                    LFS or CFS (3)        CFS (3) and/or Raw Devices
HP/UX PA-Risc [Veritas DBE/AC]              LFS or CFS            CFS and/or Raw Devices
Linux [oracm, CRS]                          LFS                   OCFS (4) and/or Raw Devices, also NFS (5)
OpenVMS                                     CFS                   CFS
Sun Solaris [Fujitsu Siemens Primecluster]  LFS                   Raw Devices/NFS (5)
Sun Solaris [Sun Cluster]                   LFS or CFS (6,7)      CFS (7) Raw Devices/NFS (5)
Sun Solaris [Veritas DBE/AC]                LFS or CFS            CFS and/or Raw Devices
Tru64 Unix                                  LFS or CFS            CFS and/or Raw Devices
Windows NT/2000 [oracm, CRS]                LFS or CFS            OCFS and/or Raw Devices
Windows 2003 (32/64bit) [oracm, CRS]        LFS or CFS            OCFS and/or Raw Devices

(1) LFS is the abbreviation for local filesystem and is only accessible directly
by the node that mounted the disk
(2) CFS is the abbreviation for Cluster FileSystem. The implementation
depends on the operating software vendor or cluster software vendor.
(3) MC ServiceGuard 11.17 includes a CFS which is supported with Oracle 10gR2
(4) OCFS: Oracle Cluster FileSystem
(5) NFS is supported with Netapp(R) Filer, see Metalink certification
(6) Sun GFS can only be used for Oracle_Home and archivelogs.
(7) Sun StorEdge QFS

Local Filesystem means that the Oracle Universal Installer replicates the
RAC software installation automatically to every private filesystem of the
selected nodes in the cluster. The Oracle installation products
are cluster aware and will not install the Oracle software to over-write itself.

Oracm is the Oracle Cluster manager, which is available on Linux and Windows
NT/2000. No other cluster manager is needed to set up Real Application Clusters.

Cluster Ready Services (CRS) are new in Oracle10g and also provide cluster
manager functionality.

Oracle will validate cluster filesystems of other vendors when they become
available. Oracle will support the Oracle software when running on a validated
cluster filesystem.

11 Cluster File System names
——————————

Platform or Cluster Vendor  CFS name

AIX                         GPFS
HP/UX MC/ServiceGuard       CFS
Linux [oracm, CRS]          OCFS
OpenVMS                     RMS
Tru64 Unix                  CFS
SunCluster                  GFS, QFS
Veritas DBE/AC              CFS
Windows NT/2000             OCFS
Windows 2003 (32/64bit)     OCFS

For more information on certified configuration please see the certification
matrix available on Metalink.  Instructions for accessing the certification
matrix can be found in the following note:

Note 184875.1
How To Check The Certification Matrix for Real Application Clusters

12 When to use CFS over raw?
——————————
This option is very dependent on the availability of a CFS on your platform.
A CFS offers:
- Simpler management
- Use of Oracle Managed Files with RAC
- Single Oracle Software installation
- Autoextend enabled on Oracle datafiles
- Uniform accessibility to archive logs in case of physical node failure
- With Oracle_Home on CFS, when you apply Oracle patches CFS guarantees that
the updated Oracle_Home is visible to all nodes in the cluster.