Chapter 12. Advanced Administration

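This chapter revisits some aspects we already described, with a different perspective: instead of installing one single computer, we will study mass-deployment systems; instead of creating RAID or LVM volumes at install time, we'll learn to do it by hand so we can later revise our initial choices. Finally, we will discuss monitoring tools and virtualization techniques. As a consequence, this chapter is aimed mainly at professional administrators, and somewhat less at individuals responsible for their home network.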

12.1. RAID and LVM

Chapter 4, Installation presented these technologies from the point of view of the installer, and how it integrated them to make their deployment easy from the start. After the initial installation, an administrator must be able to handle evolving storage space needs without having to resort to an expensive re-installation. They must therefore understand the required tools for manipulating RAID and LVM volumes.

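RAID and LVM are both techniques to abstract the mounted volumes from their physical counterparts (actual hard-disk drives or partitions thereof); the former ensures the security and availability of the data in case of hardware failure by introducing redundancy, the latter makes volume management more flexible and independent of the actual size of the underlying disks. In both cases, the system ends up with new block devices, which can be used to create filesystems or swap space, without necessarily having them mapped to one physical disk. RAID and LVM come from quite different backgrounds, but their functionality can overlap somewhat, which is why they are often mentioned together.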

PERSPECTIVE Btrfs combines LVM and RAID

While LVM and RAID are two distinct kernel subsystems that come between the disk block devices and their filesystems, btrfs is a filesystem, initially developed at Oracle, that purports to combine the feature sets of LVM and RAID and much more.

→ https://btrfs.wiki.kernel.org/

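Among its noteworthy features is the ability to take a snapshot of a filesystem tree at any point in time. This snapshot copy doesn't initially use any disk space, the data only being duplicated when one of the copies is modified. The filesystem also handles transparent compression of files, and checksums ensure the integrity of all stored data.

In both the RAID and LVM cases, the kernel provides a block device file, similar to the ones corresponding to a hard disk drive or a partition. When an application, or another part of the kernel, requires access to a block of such a device, the appropriate subsystem routes the block to the relevant physical layer. Depending on the configuration, this block can be stored on one or several physical disks, and its physical location may not be directly correlated to the location of the block in the logical device.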

12.1.1. Software RAID

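RAID means Redundant Array of Independent Disks. The goal of this system is to prevent data loss and ensure availability in case of hard disk failure. The general principle is quite simple: data are stored on several physical disks instead of only one, with a configurable level of redundancy. Depending on this amount of redundancy, and even in the event of an unexpected disk failure, data can be losslessly reconstructed from the remaining disks.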

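RAID can be implemented either by dedicated hardware (RAID modules integrated into SCSI or SATA controller cards) or by software abstraction (the kernel). Whether hardware or software, a RAID system with enough redundancy can transparently stay operational when a disk fails; the upper layers of the stack (applications) can even keep accessing the data in spite of the failure. Of course, this "degraded mode" can have an impact on performance, and redundancy is reduced, so a further disk failure can lead to data loss. In practice, therefore, one will strive to stay in this degraded mode only for as long as it takes to replace the failed disk. Once the new disk is in place, the RAID system can reconstruct the required data so as to return to a safe mode. The applications won't notice anything, apart from potentially reduced access speed, while the array is in degraded mode or during the reconstruction phase.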

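When RAID is implemented by hardware, its configuration generally happens within the BIOS setup tool, and the kernel will consider a RAID array as a single disk, which will work as a standard physical disk, although the device name may be different (depending on the driver).

We only focus on software RAID in this book.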

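12.1.1.1. Different RAID Levels

RAID is actually not a single system, but a range of systems identified by their levels; the levels differ by their layout and the amount of redundancy they provide. The more redundant, the more failure-proof, since the system will be able to keep working with more failed disks. The counterpart is that the usable space shrinks for a given set of disks; seen the other way, more disks will be needed to store a given amount of data.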

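Linear RAID: Even though the kernel's RAID subsystem allows creating "linear RAID", this is not proper RAID, since this setup doesn't involve any redundancy. The kernel merely aggregates several disks end-to-end and provides the resulting aggregated volume as one virtual disk (one block device). That is about its only function. This setup is rarely used by itself (see later for the exceptions), especially since the lack of redundancy means that one disk failing makes the whole aggregate, and therefore all the data, unavailable.
RAID-0: This level doesn't provide any redundancy either, but disks aren't simply stuck on end one after another: they are divided in stripes, and the blocks on the virtual device are stored on stripes on alternating physical disks. In a two-disk RAID-0 setup, for instance, even-numbered blocks of the virtual device will be stored on the first physical disk, while odd-numbered blocks will end up on the second physical disk.
This system doesn't aim at increasing reliability, since (as in the linear case) the availability of all the data is jeopardized as soon as one disk fails, but at increasing performance: during sequential access to large amounts of contiguous data, the kernel will be able to read from both disks (or write to them) in parallel, which increases the data transfer rate. The disks are used entirely by the RAID device, so they should have the same size in order not to lose performance.
RAID-0 use is shrinking, its niche being filled by LVM (see later).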
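RAID-1: This level, also known as "RAID mirroring", is both the simplest and the most widely used setup. In its standard form, it uses two physical disks of the same size, and provides a logical volume of the same size again. Data are stored identically on both disks, hence the "mirror" nickname. When one disk fails, the data is still available on the other. For really critical data, RAID-1 can of course be set up on more than two disks, with a direct impact on the ratio of hardware cost versus available payload space.
NOTE Disks and volume sizes
If two disks of different sizes are set up in a mirror, the bigger one will not be fully used, since it will contain the same data as the smallest one and nothing more. The useful available space provided by a RAID-1 volume therefore matches the size of the smallest disk in the array. This still holds for RAID volumes with a higher RAID level, even though redundancy is stored differently.
It is therefore important, when setting up RAID arrays (except for RAID-0 and "linear RAID"), to only assemble disks of identical, or very close, sizes, to avoid wasting resources.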
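NOTE Spare disks
RAID levels that include redundancy allow assigning more disks than required to an array. The extra disks are used as spares when one of the main disks fails. For instance, in a mirror of two disks plus one spare, if one of the first two disks fails, the kernel will automatically (and immediately) reconstruct the mirror using the spare disk, so that redundancy stays assured after the reconstruction time. This can be used as another kind of safeguard for critical data.
One would be forgiven for wondering how this is better than simply mirroring on three disks to start with. The advantage of the "spare disk" configuration is that the spare disk can be shared across several RAID volumes. For instance, one can have three mirrored volumes, with redundancy assured even in the event of one disk failure, with only seven disks (three pairs, plus one shared spare), instead of the nine disks that would be required by three triplets.
This RAID level, although expensive (since only half of the physical storage space, at best, is useful), is widely used in practice. It is simple to understand, and it allows very simple backups: since both disks have identical contents, one of them can be temporarily extracted with no impact on the working system. Read performance is often increased since the kernel can read half of the data on each disk in parallel, while write performance isn't too severely degraded. In the case of a RAID-1 array of N disks, the data stays available even with N-1 disk failures.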
CAUTION RAID is not Backup
RAID systems are not backup mechanisms. While RAID increases the redundancy - and therefore the availability of a system - and protects against disk failures, backups are done to protect data from being altered, deleted, getting corrupted, etc., and to be able to restore them if necessary. To demonstrate this: If you remove one or all files by accident, a RAID will mirror this change, but it will not provide the means to restore the file(s). So while there is clearly an overlap, they are not the same and should be used in conjunction with each other.
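RAID-4: This RAID level, not widely used, uses N disks to store useful data, and an extra disk to store redundancy information. If that extra disk fails, the system can reconstruct its contents from the other N. If one of the N data disks fails, the remaining N-1 combined with the "parity" disk contain enough information to reconstruct the required data.
RAID-4 isn't too expensive since it only involves a one-in-N increase in costs and has no noticeable impact on read performance, but writes are slowed down. Furthermore, since a write to any of the N disks also involves a write to the parity disk, the latter sees many more writes than the former, and its lifespan can shorten dramatically as a consequence. Data on a RAID-4 array is safe only up to one failed disk (of the N+1).
RAID-5: RAID-5 addresses the asymmetry issue of RAID-4: parity blocks are spread over all of the N+1 disks, with no single disk having a particular role.
Read and write performance are identical to RAID-4. Here again, the system stays functional with up to one failed disk (of the N+1), but no more.
RAID-6: RAID-6 can be considered an extension of RAID-5, where each series of N blocks involves two redundancy blocks, and each such series of N+2 blocks is spread over N+2 disks.
This RAID level is slightly more expensive than the previous two, but it brings some extra safety since up to two drives (of the N+2) can fail without compromising data availability. The counterpart is that write operations now involve writing one data block and two redundancy blocks, which makes them even slower.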
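RAID-1+0: This isn't strictly speaking a RAID level, but a stacking of two RAID groupings. Starting from 2×N disks, one first sets them up by pairs into N RAID-1 volumes; these N volumes are then aggregated into one, either by "linear RAID" or (increasingly) by LVM. This last case goes farther than pure RAID, but there is no problem with that.
RAID-1+0 can survive multiple disk failures: up to N in the 2×N array described above, provided that at least one disk keeps working in each of the RAID-1 pairs.
GOING FURTHER RAID-10
RAID-10 is generally considered a synonym of RAID-1+0, but a Linux specificity makes it actually a generalization. This setup allows a system where each block is stored on two different disks, even with an odd number of disks, the copies being spread out along a configurable model.
Performance will vary depending on the chosen repartition model, the redundancy level, and the workload of the logical volume.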
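
The parity idea behind RAID-4 and RAID-5 described above can be illustrated with plain shell arithmetic. The following is only a minimal sketch, assuming XOR parity and three hypothetical one-byte "data blocks"; the function name parity and the sample values are made up for the illustration and are unrelated to mdadm itself.

```shell
# Compute the XOR parity of any number of integer "blocks".
parity() (
    p=0
    for block in "$@"; do
        p=$(( p ^ block ))
    done
    echo "$p"
)

# Three hypothetical data blocks (one byte each), one per data disk.
d0=165
d1=60
d2=15
p=$(parity "$d0" "$d1" "$d2")

# Simulate losing the disk that held d1: XOR-ing the surviving data
# blocks with the parity block yields the missing value again.
recovered=$(parity "$d0" "$d2" "$p")
echo "parity=$p recovered=$recovered"
```

The same property explains why a single-parity array survives exactly one failed disk: with two blocks missing, one XOR equation is no longer enough to recover both.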

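Obviously, the RAID level will be chosen according to the constraints and requirements of each application. Note that a single computer can have several distinct RAID arrays with different configurations.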
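
To compare levels when choosing one, the usable space of each can be estimated from the descriptions above. This is a rough sketch under simplifying assumptions: all disks have the same size, metadata overhead is ignored, RAID-10 keeps two copies of each block, and the disk count is the total number of disks in the array (so "N+1 disks" in the text corresponds to DISKS = N+1 here). The helper name raid_usable is hypothetical.

```shell
# usage: raid_usable LEVEL DISKS SIZE_GB   (all disks assumed identical)
raid_usable() (
    level=$1
    n=$2
    size=$3
    case "$level" in
        linear|0) echo $(( n * size )) ;;        # no redundancy at all
        1)        echo "$size" ;;                # every disk holds a full copy
        4|5)      echo $(( (n - 1) * size )) ;;  # one disk's worth of parity
        6)        echo $(( (n - 2) * size )) ;;  # two disks' worth of redundancy
        10)       echo $(( n * size / 2 )) ;;    # two copies of every block
        *)        echo "unknown level: $level" >&2; exit 1 ;;
    esac
)

for level in 0 1 5 6 10; do
    echo "RAID-$level with 4 x 4 GB disks: $(raid_usable "$level" 4 4) GB usable"
done
```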

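12.1.1.2. Setting up RAID

Setting up RAID volumes requires the mdadm package; it provides the mdadm command, which allows creating and manipulating RAID arrays, as well as scripts and tools integrating it with the rest of the system, including the monitoring system.

Our example will be a server with a number of disks, some of which are already used, the rest being available to set up RAID. We initially have the following disks and partitions: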

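the sdb disk, 4 GB, is entirely available;
the sdc disk, 4 GB, is also entirely available;
on the sdd disk, only partition sdd2 (about 4 GB) is available;
finally, the sde disk, 4 GB, is entirely available.

We're going to use these physical elements to build two volumes, one RAID-0 and one mirror (RAID-1). Let's start with the RAID-0 volume: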

# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
# mdadm --query /dev/md0
/dev/md0: 7.99GiB raid0 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Mon Feb 28 01:54:24 2022
        Raid Level : raid0
        Array Size : 8378368 (7.99 GiB 8.58 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 01:54:24 2022
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

            Layout : -unknown-
        Chunk Size : 512K

Consistency Policy : none

              Name : debian:0  (local to host debian)
              UUID : a75ac628:b384c441:157137ac:c04cd98c
            Events : 0

    Number   Major   Minor   RaidDevice State
       0       8        0        0      active sync   /dev/sdb
       1       8       16        1      active sync   /dev/sdc
# mkfs.ext4 /dev/md0
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 2094592 4k blocks and 524288 inodes
Filesystem UUID: ef077204-c477-4430-bf01-52288237bea0
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

# mkdir /srv/raid-0
# mount /dev/md0 /srv/raid-0
# df -h /srv/raid-0
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        7.8G   24K  7.4G   1% /srv/raid-0

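The mdadm --create command requires several parameters: the name of the volume to create (/dev/md*, with MD standing for Multiple Device), the RAID level, the number of disks (which is compulsory despite being mostly meaningful only with RAID-1 and above), and the physical drives to use. Once the device is created, we can use it like we'd use a normal partition: create a filesystem on it, mount that filesystem, and so on. Note that our creation of a RAID-0 volume on md0 is nothing but coincidence, and the numbering of the array doesn't need to be correlated to the chosen amount of redundancy. It is also possible to create named RAID arrays, by giving mdadm parameters such as /dev/md/linear instead of /dev/md0.

Creation of a RAID-1 follows a similar fashion, the differences only being noticeable after the creation: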

# mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdd2 /dev/sde
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: largest drive (/dev/sdd2) exceeds size (4189184K) by more than 1%
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md1 started.
# mdadm --query /dev/md1
/dev/md1: 4.00GiB raid1 2 devices, 0 spares. Use mdadm --detail for more detail.
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:08:09 2022
             State : clean, resync
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : resync

    Rebuild Status : 13% complete

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 17

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       1       8       48        1      active sync   /dev/sde
# mdadm --detail /dev/md1
/dev/md1:
[...]
          State : clean
[...]

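A few remarks are in order. First, mdadm notices that the physical elements have different sizes; since this implies that some space will be lost on the bigger element, a confirmation is required.

More importantly, note the state of the mirror. The normal state of a RAID mirror is that both disks have exactly the same contents. However, nothing guarantees this is the case when the volume is first created. The RAID subsystem will therefore provide that guarantee itself, and there will be a synchronization phase as soon as the RAID device is created. After some time (the exact amount will depend on the actual size of the disks...), the RAID array switches to the "active" or "clean" state. Note that during this reconstruction phase, the mirror is in a degraded mode, and redundancy isn't assured. A disk failing during that risk window could lead to losing all the data. Large amounts of critical data, however, are rarely stored on a freshly created RAID array before its initial synchronization. Note that even in degraded mode, /dev/md1 is usable, and a filesystem can be created on it, as well as some data copied onto it.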
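
Progress of the synchronization phase is visible in the "Rebuild Status" line of the mdadm --detail output shown above. As a small sketch, the hypothetical helper below (rebuild_percent is not part of mdadm) extracts that percentage from text read on standard input; it assumes the line keeps the format seen in the listing.

```shell
# Read "mdadm --detail" output on stdin and print the rebuild percentage,
# or "none" when no "Rebuild Status" line is present.
rebuild_percent() {
    sed -n 's/^ *Rebuild Status : *\([0-9][0-9]*\)% complete.*/\1/p' | grep . || echo none
}

# Sample lines copied from the listing above; on a real system one would
# run something like:  mdadm --detail /dev/md1 | rebuild_percent
printf '%s\n' 'Consistency Policy : resync' '' '    Rebuild Status : 13% complete' | rebuild_percent
```

The kernel also exposes the same progress information in /proc/mdstat, which can simply be read with cat.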

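TIP Setting up a mirror without synchronization

RAID-1 volumes are often created to be used as a new disk, often considered blank. The actual initial contents of the disk are therefore not very relevant, since one only needs to know that the data written after the creation of the volume, in particular the filesystem, can be accessed later.

One might therefore wonder about the point of synchronizing both disks at creation time. Why care whether the contents are identical on zones of the volume that we know will only be read after we have written to them?

Fortunately, this synchronization phase can be avoided by passing the --assume-clean option to mdadm. However, this option can lead to surprises in cases where the initial data will be read (for instance if a filesystem is already present on the physical disks), which is why it isn't enabled by default.

Now let's see what happens when one of the elements of the RAID-1 array fails. mdadm, in particular its --fail option, allows simulating such a disk failure: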

# mdadm /dev/md1 --fail /dev/sde
mdadm: set /dev/sde faulty in /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:15:34 2022
             State : clean, degraded 
    Active Devices : 1
   Working Devices : 1
    Failed Devices : 1
     Spare Devices : 0

Consistency Policy : resync

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 19

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       -       0        0        1      removed

       1       8       48        -      faulty   /dev/sde

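The contents of the volume are still accessible (and, if it is mounted, the applications don't notice a thing), but the data safety isn't assured anymore: should the sdd disk fail in turn, the data would be lost. We want to avoid that risk, so we'll replace the failed disk with a new one, sdf: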

# mdadm /dev/md1 --add /dev/sdf
mdadm: added /dev/sdf
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:25:34 2022
             State : clean, degraded, recovering 
    Active Devices : 1
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 1

Consistency Policy : resync

    Rebuild Status : 47% complete

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 39

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      spare rebuilding   /dev/sdf

       1       8       48        -      faulty   /dev/sde
# [...]
[...]
# mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Mon Feb 28 02:07:48 2022
        Raid Level : raid1
        Array Size : 4189184 (4.00 GiB 4.29 GB)
     Used Dev Size : 4189184 (4.00 GiB 4.29 GB)
      Raid Devices : 2
     Total Devices : 3
       Persistence : Superblock is persistent

       Update Time : Mon Feb 28 02:25:34 2022
             State : clean
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 1
     Spare Devices : 0

Consistency Policy : resync

              Name : debian:1  (local to host debian)
              UUID : 2dfb7fd5:e09e0527:0b5a905a:8334adb8
            Events : 41

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      active sync   /dev/sdf

       1       8       48        -      faulty   /dev/sde

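Here again, the kernel automatically triggers a reconstruction phase during which the volume, although still accessible, is in a degraded mode. Once the reconstruction is over, the RAID array is back to a normal state. One can then tell the system that the sde disk is about to be removed from the array, so as to end up with a classical RAID mirror on two disks: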

# mdadm /dev/md1 --remove /dev/sde
mdadm: hot removed /dev/sde from /dev/md1
# mdadm --detail /dev/md1
/dev/md1:
[...]
    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdd2
       2       8       64        1      active sync   /dev/sdf

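From then on, the drive can be physically removed when the server is next switched off, or even hot-removed when the hardware configuration allows hot-swap. Such configurations include some SCSI controllers, most SATA disks, and external drives operating on USB or Firewire.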
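
The failure-and-replacement cycle above depends on noticing the "degraded" state in the first place. The sketch below shows one way to test for it in a script; is_degraded is a hypothetical helper that only inspects the "State :" line of mdadm --detail output supplied on standard input, assuming the format shown in the listings.

```shell
# Succeeds (exit status 0) when the "State :" line of "mdadm --detail"
# output read from stdin mentions "degraded".
is_degraded() {
    grep -E '^ *State :' | grep -q degraded
}

# Samples copied from the listings above; on a real system one would run
# something like:  mdadm --detail /dev/md1 | is_degraded
if printf '%s\n' '             State : clean, degraded ' | is_degraded; then
    echo "array is degraded: replace the failed disk"
fi
if ! printf '%s\n' '             State : clean' | is_degraded; then
    echo "array is healthy"
fi
```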

12.1.1.3. Backing up the Configuration

Most of the meta-data concerning RAID volumes are saved directly on the disks that make up these arrays, so that the kernel can detect the arrays and their components and assemble them automatically when the system starts up. However, backing up this configuration is encouraged, because this detection isn't fail-proof, and it is likely to fail precisely in sensitive circumstances. In our example, if the sde disk failure had been real (instead of simulated) and the system had been restarted without removing this sde disk, this disk could start working again due to having been probed during the reboot. The kernel would then have three physical elements, each claiming to contain half of the same RAID volume. In practice this leads to the RAID being started from the individual disks alternately, with the data also being distributed alternately, depending on which disk started the RAID in degraded mode. Another source of confusion can come when RAID volumes from two servers are consolidated onto one server only. If these arrays were running normally before the disks were moved, the kernel would be able to detect and reassemble the pairs properly; but if the moved disks had been aggregated into an md1 on the old server, and the new server already has an md1, one of the mirrors would be renamed.

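Backing up the configuration is therefore important, if only for reference. The standard way to do it is by editing the /etc/mdadm/mdadm.conf file, an example of which is listed here:

Example 12.1. mdadm configuration file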

# mdadm.conf
#
# !NB! Run update-initramfs -u after updating this file.
# !NB! This will ensure that initramfs has an uptodate copy.
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
DEVICE /dev/sd*

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays
ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0
ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1
# This configuration was auto-generated on Mon, 28 Feb 2022 01:53:48 +0100 by mkconf

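One of the most useful details is the DEVICE option, which lists the devices where the system will automatically look for components of RAID volumes at start-up time. In our example, we replaced the default value, partitions containers, with an explicit list of device files, since we chose to use entire disks and not only partitions, for some volumes.

The two ARRAY lines at the end of our example are those allowing the kernel to safely pick which volume number to assign to which array. The metadata stored on the disks themselves are enough to re-assemble the volumes, but not to determine the volume number (and the matching /dev/md* device name).

Fortunately, these lines can be generated automatically: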

# mdadm --misc --detail --brief /dev/md?
ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0
ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1

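The contents of these two ARRAY lines don't depend on the list of disks included in the volume. It is therefore not necessary to regenerate these lines when replacing a failed disk with a new one. On the other hand, care must be taken to update the file when creating or deleting a RAID array.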
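
Since the file has to be updated by hand after creating an array, a quick consistency check can help. The following sketch compares ARRAY lines (as printed by the command above, or by mdadm --detail --scan) with a configuration file and prints the ones whose UUID is not recorded yet. The helper name missing_arrays and the temporary file are purely illustrative.

```shell
# usage: missing_arrays CONF_FILE < array_lines
# Prints every ARRAY line from stdin whose UUID does not appear in CONF_FILE.
missing_arrays() (
    conf_file=$1
    grep '^ARRAY ' | while read -r line; do
        uuid=$(printf '%s\n' "$line" | sed -n 's/.*UUID=\([0-9a-f:]*\).*/\1/p')
        grep -q "UUID=$uuid" "$conf_file" || printf '%s\n' "$line"
    done
)

# Demonstration with a throw-away file that only knows about md/0.
tmp_conf=$(mktemp)
echo 'ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0' > "$tmp_conf"
printf '%s\n' \
    'ARRAY /dev/md/0  metadata=1.2 UUID=a75ac628:b384c441:157137ac:c04cd98c name=debian:0' \
    'ARRAY /dev/md/1  metadata=1.2 UUID=2dfb7fd5:e09e0527:0b5a905a:8334adb8 name=debian:1' |
    missing_arrays "$tmp_conf"
rm -f "$tmp_conf"
```

On a real system the input would come from mdadm and the file would be /etc/mdadm/mdadm.conf; as the header of that file reminds us, update-initramfs -u should be run after editing it.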

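12.1.2. LVM

LVM, the Logical Volume Manager, is another approach to abstracting logical volumes from their physical supports, which focuses on increasing flexibility rather than increasing reliability. LVM allows changing a logical volume transparently as far as the applications are concerned; for instance, it is possible to add new disks, migrate the data to them, and remove the old disks, without unmounting the volume.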

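12.1.2.1. LVM Concepts

This flexibility is attained by a level of abstraction involving three concepts.

First, the PV (Physical Volume) is the entity closest to the hardware: it can be a partition on a disk, a full disk, or even any other block device (including, for instance, a RAID array). Note that when a physical element is set up to be a PV for LVM, it should only be accessed via LVM, otherwise the system will get confused.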

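A number of PVs can be clustered in a VG (Volume Group), which can be compared to disks both virtual and extensible. VGs are abstract, and don't appear in a device file in the /dev hierarchy, so there is no risk of using them directly.

The third kind of object is the LV (Logical Volume), which is a chunk of a VG; if we keep the VG-as-disk analogy, the LV compares to a partition. The LV appears as a block device with an entry in /dev, and it can be used as any other physical partition can be (most commonly, to host a filesystem or swap space).

The important thing is that the splitting of a VG into LVs is entirely independent of its physical components (the PVs). A VG with only a single physical component (a disk for instance) can be split into a dozen logical volumes; similarly, a VG can use several physical disks and appear as a single large logical volume. The only constraint, obviously, is that the total size allocated to LVs can't be bigger than the total capacity of the PVs in the volume group.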

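It often makes sense, however, to have some kind of homogeneity among the physical components of a VG, and to split the VG into logical volumes that will have similar usage patterns. For instance, if the available hardware includes fast disks and slower disks, the fast ones could be clustered into one VG and the slower ones into another; chunks of the first one can then be assigned to applications requiring fast data access, while the second one will be kept for less demanding tasks.

In any case, keep in mind that an LV isn't particularly attached to any one PV. It is possible to influence where the data from an LV are physically stored, but this possibility isn't required for day-to-day use. On the contrary: when the set of physical components of a VG evolves, the physical storage locations corresponding to a particular LV can be migrated across disks (while staying within the PVs assigned to the VG, of course).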

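12.1.2.2. Setting up LVM

Let us now follow, step by step, the process of setting up LVM for a typical use case: we want to simplify a complex storage situation. Such a situation usually happens after some long and convoluted history of accumulated temporary measures. For the purposes of illustration, we'll consider a server where the storage needs have changed over time, ending up in a maze of available partitions split over several partially used disks. In more concrete terms, the following partitions are available: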

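on the sdb disk, a sdb2 partition, 4 GB;
on the sdc disk, a sdc3 partition, 3 GB;
the sdd disk, 4 GB, is fully available;
on the sdf disk, a sdf1 partition, 4 GB; and a sdf2 partition, 5 GB.

In addition, let's assume that disks sdb and sdf are faster than the other two.

Our goal is to set up three logical volumes for three different applications: a file server requiring 5 GB of storage space, a database (1 GB) and some space for backups (12 GB). The first two need good performance, but backups are less critical in terms of access speed. All these constraints prevent the use of partitions on their own; using LVM can abstract the physical size of the devices, so the only limit is the total available space.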
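
Before creating anything, it is worth checking that the plan fits. The short sketch below adds up the nominal sizes listed above, assuming (as a planning choice, not something LVM requires) that the partitions on the fast disks are grouped together for the file server and the database, and the remaining ones for the backups. The helper name sum is hypothetical.

```shell
# Sum a list of sizes given in GB.
sum() (
    total=0
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo "$total"
)

fast_pv=$(sum 4 4)      # sdb2 + sdf1
slow_pv=$(sum 3 4 5)    # sdc3 + sdd + sdf2
fast_need=$(sum 5 1)    # file server + database
slow_need=12            # backups

echo "fast: need $fast_need GB of $fast_pv GB"
echo "slow: need $slow_need GB of $slow_pv GB"
```

These are nominal figures: LVM keeps a little space for its own metadata on each physical volume, so a volume group is always slightly smaller than the sum of its parts.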

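The required tools are in the lvm2 package and its dependencies. When they're installed, setting up LVM takes three steps, matching the three levels of concepts.

First, we prepare the physical volumes using pvcreate: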

# pvcreate /dev/sdb2
  Physical volume "/dev/sdb2" successfully created.
# pvdisplay
  "/dev/sdb2" is a new physical volume of "4.00 GiB"
  --- NEW Physical volume ---
  PV Name               /dev/sdb2
  VG Name               
  PV Size               4.00 GiB
  Allocatable           NO
  PE Size               0   
  Total PE              0
  Free PE               0
  Allocated PE          0
  PV UUID               yK0K6K-clbc-wt6e-qk9o-aUh9-oQqC-k1T71B

# for i in sdc3 sdd sdf1 sdf2 ; do pvcreate /dev/$i ; done
  Physical volume "/dev/sdc3" successfully created.
  Physical volume "/dev/sdd" successfully created.
  Physical volume "/dev/sdf1" successfully created.
  Physical volume "/dev/sdf2" successfully created.
# pvdisplay -C
  PV         VG Fmt  Attr PSize PFree
  /dev/sdb2     lvm2 ---  4.00g 4.00g
  /dev/sdc3     lvm2 ---  3.00g 3.00g
  /dev/sdd      lvm2 ---  4.00g 4.00g
  /dev/sdf1     lvm2 ---  4.00g 4.00g
  /dev/sdf2     lvm2 ---  5.00g 5.00g

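So far, so good; note that a PV can be set up on a full disk as well as on individual partitions of it. As shown above, the pvdisplay command lists the existing PVs, with two possible output formats.

Now let's assemble these physical elements into VGs using vgcreate. We'll gather only PVs from the fast disks into a vg_critical VG; the other VG, vg_normal, will also include slower elements.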

# vgcreate vg_critical /dev/sdb2 /dev/sdf1
  Volume group "vg_critical" successfully created
# vgdisplay
  --- Volume group ---
  VG Name               vg_critical
  System ID             
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               7.99 GiB
  PE Size               4.00 MiB
  Total PE              2046
  Alloc PE / Size       0 / 0   
  Free  PE / Size       2046 / 7.99 GiB
  VG UUID               JgFWU3-emKg-9QA1-stPj-FkGX-mGFb-4kzy1G

# vgcreate vg_normal /dev/sdc3 /dev/sdd /dev/sdf2
  Volume group "vg_normal" successfully created
# vgdisplay -C
  VG          #PV #LV #SN Attr   VSize   VFree  
  vg_critical   2   0   0 wz--n-   7.99g   7.99g
  vg_normal     3   0   0 wz--n- <11.99g <11.99g

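Here again, commands are rather straightforward (and vgdisplay offers two output formats). Note that it is quite possible to use two partitions of the same physical disk in two different VGs. Note also that we used a vg_ prefix to name our VGs, but it is nothing more than a convention.

We now have two "virtual disks", sized about 8 GB and 12 GB respectively. Let's now carve them up into "virtual partitions" (LVs). This involves the lvcreate command, and a slightly more complex syntax: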

# lvdisplay
# lvcreate -n lv_files -L 5G vg_critical
  Logical volume "lv_files" created.
# lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg_critical/lv_files
  LV Name                lv_files
  VG Name                vg_critical
  LV UUID                Nr62xe-Zu7d-0u3z-Yyyp-7Cj1-Ej2t-gw04Xd
  LV Write Access        read/write
  LV Creation host, time debian, 2022-03-01 00:17:46 +0100
  LV Status              available
  # open                 0
  LV Size                5.00 GiB
  Current LE             1280
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

# lvcreate -n lv_base -L 1G vg_critical
  Logical volume "lv_base" created.
# lvcreate -n lv_backups -L 11.98G vg_normal
  Rounding up size to full physical extent 11.98 GiB
  Rounding up size to full physical extent 11.98 GiB
  Logical volume "lv_backups" created.
# lvdisplay -C
  LV         VG          Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_base    vg_critical -wi-a-----  1.00g                                                    
  lv_files   vg_critical -wi-a-----  5.00g                                                    
  lv_backups vg_normal   -wi-a----- 11.98g

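Two parameters are required when creating logical volumes; they must be passed to lvcreate as options. The name of the LV to be created is specified with the -n option, and its size is generally given using the -L option. We also need to tell the command what VG to operate on, of course, hence the last parameter on the command line.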

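GOING FURTHER lvcreate options

The lvcreate command has several options to allow tweaking how the LV is created.

Let's first describe the -l option, with which the LV's size can be given as a number of blocks (as opposed to the "human" units we used above). These blocks (called PEs, physical extents, in LVM terms) are contiguous units of storage space in PVs, and they can't be split across LVs. When one wants to define storage space for an LV with some precision, for instance to use the full available space, the -l option will probably be preferred over -L.

It is also possible to hint at the physical location of an LV, so that its extents are stored on a particular PV (while staying within the ones assigned to the VG, of course). Since we know that sdb is faster than sdf, we may want to store lv_base there if we want to give an advantage to the database server compared to the file server. The command line becomes: lvcreate -n lv_base -L 1G vg_critical /dev/sdb2. Note that this command can fail if the PV doesn't have enough free extents. In our example, we would probably have to create lv_base before lv_files to avoid this situation, or free up some space on sdb2 with the pvmove command.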
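
The relation between -L sizes and -l extent counts is simple arithmetic. As a sketch, the hypothetical helper below converts a size in MiB into a number of extents, rounding up because an LV always occupies whole extents; it uses the 4 MiB PE size reported by vgdisplay above.

```shell
# usage: size_to_extents SIZE_MIB PE_SIZE_MIB
# Rounds up, since an LV always occupies whole extents.
size_to_extents() (
    size=$1
    pe=$2
    echo $(( (size + pe - 1) / pe ))
)

size_to_extents $(( 5 * 1024 )) 4   # 5 GiB with 4 MiB extents
size_to_extents $(( 6 * 1024 )) 4   # 6 GiB with 4 MiB extents
```

The two results, 1280 and 1536, match the "Current LE" value and the extent counts that LVM itself reports for lv_files in this section.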

Logical volumes, once created, end up as block device files in /dev/mapper/:

# ls -l /dev/mapper
total 0
crw------- 1 root root 10, 236 Mar  1 00:17 control
lrwxrwxrwx 1 root root       7 Mar  1 00:19 vg_critical-lv_base -> ../dm-1
lrwxrwxrwx 1 root root       7 Mar  1 00:17 vg_critical-lv_files -> ../dm-0
lrwxrwxrwx 1 root root       7 Mar  1 00:19 vg_normal-lv_backups -> ../dm-2 
# ls -l /dev/dm-*
brw-rw---- 1 root disk 253, 0 Mar  1 00:17 /dev/dm-0
brw-rw---- 1 root disk 253, 1 Mar  1 00:19 /dev/dm-1
brw-rw---- 1 root disk 253, 2 Mar  1 00:19 /dev/dm-2

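NOTE Auto-detecting LVM volumes

When the computer boots, the lvm2-activation systemd service unit executes vgchange -aay to "activate" the volume groups: it scans the available devices; those that have been initialized as physical volumes for LVM are registered into the LVM subsystem, those that belong to volume groups are assembled, and the relevant logical volumes are started and made available. There is therefore no need to edit configuration files when creating or modifying LVM volumes.

Note, however, that the layout of the LVM elements (physical and logical volumes, and volume groups) is backed up in /etc/lvm/backup, which can be useful in case of a problem (or just to take a look under the hood).

To make things easier, convenience symbolic links are also created in directories matching the VGs: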

# ls -l /dev/vg_critical
total 0
lrwxrwxrwx 1 root root 7 Mar  1 00:19 lv_base -> ../dm-1
lrwxrwxrwx 1 root root 7 Mar  1 00:17 lv_files -> ../dm-0 
# ls -l /dev/vg_normal
total 0
lrwxrwxrwx 1 root root 7 Mar  1 00:19 lv_backups -> ../dm-2

The LVs can then be used exactly like standard partitions:

# mkfs.ext4 /dev/vg_normal/lv_backups
mke2fs 1.47.1 (20-May-2024)
Discarding device blocks: done                            
Creating filesystem with 3140608 4k blocks and 786432 inodes
Filesystem UUID: 7eaf0340-b740-421e-96b2-942cdbf29cb3
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done 

# mkdir /srv/backups
# mount /dev/vg_normal/lv_backups /srv/backups
# df -h /srv/backups
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_normal-lv_backups   12G   24K   12G   1% /srv/backups
# [...]
[...]
# cat /etc/fstab
[...]
/dev/vg_critical/lv_base    /srv/base       ext4 defaults 0 2
/dev/vg_critical/lv_files   /srv/files      ext4 defaults 0 2
/dev/vg_normal/lv_backups   /srv/backups    ext4 defaults 0 2

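From the applications' point of view, the myriad small partitions have now been abstracted into one large 12 GB volume, with a friendlier name.

12.1.2.3. LVM Over Time

Even though the ability to aggregate partitions or physical disks is convenient, this is not the main advantage brought by LVM. The flexibility it brings is especially noticed as time passes, when needs evolve. In our example, let's assume that new large files must be stored, and that the LV dedicated to the file server is too small to contain them. Since we haven't used the whole space available in vg_critical, we can grow lv_files. For that purpose, we'll use the lvresize command, then resize2fs to adapt the filesystem accordingly: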

# df -h /srv/files/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files  4.9G  4.2G  485M  90% /srv/files
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_files vg_critical -wi-ao---- 5.00g                                                    
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree
  vg_critical   2   2   0 wz--n- 7.99g 1.99g
# lvresize -L 6G vg_critical/lv_files
  Size of logical volume vg_critical/lv_files changed from 5.00 GiB (1280 extents) to 6.00 GiB (1536 extents).
  Logical volume vg_critical/lv_files successfully resized.
# lvdisplay -C vg_critical/lv_files
  LV       VG          Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_files vg_critical -wi-ao---- 6.00g                                                    
# resize2fs /dev/vg_critical/lv_files
resize2fs 1.47.1 (20-May-2024)
Filesystem at /dev/vg_critical/lv_files is mounted on /srv/files; on-line resizing required
old_desc_blocks = 1, new_desc_blocks = 1
The filesystem on /dev/vg_critical/lv_files is now 1572864 (4k) blocks long.

# df -h /srv/files/
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_files  5.9G  4.2G  1.5G  75% /srv/files

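CAUTION Resizing filesystems

Not all filesystems can be resized online; resizing a volume can therefore require unmounting the filesystem first and remounting it afterwards. Of course, if one wants to shrink the space allocated to an LV, the filesystem must be shrunk first; the order is reversed when the resizing goes in the other direction: the logical volume must be grown before the filesystem on it. This is rather straightforward, since at no time must the filesystem size be larger than the block device where it resides (whether that device is a physical partition or a logical volume).

The ext3, ext4 and xfs filesystems can be grown online, without unmounting; shrinking ext3 or ext4 requires an unmount (xfs has traditionally not supported shrinking). The reiserfs filesystem allows online resizing. The older ext2 filesystem does not allow online resizing and always requires unmounting.

We could proceed in a similar fashion to extend the volume hosting the database, only we've reached the VG's available space limit: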
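
The ordering rule for shrinking can be written down as a small script. This is only a sketch for an ext4 filesystem, with hypothetical helper names (run, shrink_ext4_lv) and example arguments; by default it merely prints the commands, and it would only execute them if DRY_RUN=0 were set, which should not be done without a backup.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# usage: shrink_ext4_lv VG/LV MOUNTPOINT NEW_SIZE   (e.g. NEW_SIZE=4G)
shrink_ext4_lv() (
    dev=/dev/$1
    run umount "$2"
    run e2fsck -f "$dev"        # resize2fs expects a freshly checked filesystem
    run resize2fs "$dev" "$3"   # 1. shrink the filesystem first
    run lvresize -L "$3" "$1"   # 2. then shrink the logical volume
    run mount "$dev" "$2"
)

shrink_ext4_lv vg_critical/lv_files /srv/files 4G
```

For the common case, lvresize also has a -r (--resizefs) option that asks LVM to resize the filesystem together with the logical volume, which avoids getting the order wrong.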

# df -h /srv/base/
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base  974M  883M   25M  98% /srv/base
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize VFree   
  vg_critical   2   2   0 wz--n- 7.99g 1016.00m

No matter, since LVM allows adding physical volumes to existing volume groups. For instance, maybe we've noticed that the sdb3 partition, which was so far used outside of LVM, only contained archives that could be moved to lv_backups. We can now recycle it and integrate it to the volume group, and thereby reclaim some available space. This is the purpose of the vgextend command. Of course, the partition must be prepared as a physical volume beforehand. Once the VG has been extended, we can use similar commands as previously to grow the logical volume then the filesystem:

# pvcreate /dev/sdb3
  Physical volume "/dev/sdb3" successfully created.
# vgextend vg_critical /dev/sdb3
  Volume group "vg_critical" successfully extended
# vgdisplay -C vg_critical
  VG          #PV #LV #SN Attr   VSize   VFree 
  vg_critical   3   2   0 wz--n- <12.99g <5.99g 
# lvresize -L 2G vg_critical/lv_base
[...]
# resize2fs /dev/vg_critical/lv_base
[...]
# df -h /srv/base/
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/vg_critical-lv_base  2.0G  886M  991M  48% /srv/base

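12.1.3. RAID or LVM?

RAID and LVM both bring indisputable advantages as soon as one leaves the simple case of a desktop computer with a single hard disk where the usage pattern doesn't change over time. However, RAID and LVM go in two different directions, with diverging goals, and it is legitimate to wonder which one should be adopted. The most appropriate answer will of course depend on current and foreseeable requirements.

There are a few simple cases where the question doesn't really arise. If the requirement is to safeguard data against hardware failures, then obviously RAID will be set up on a redundant array of disks, since LVM doesn't really address this problem. If, on the other hand, the need is for a flexible storage scheme where the volumes are made independent of the physical layout of the disks, RAID doesn't help much and LVM will be the natural choice.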

NOTE If performance matters…

If input/output speed is of the essence, especially in terms of access times, using LVM and/or RAID in one of the many combinations may have some impact on performance, and this may influence decisions as to which to pick. However, these differences in performance are really minor, and will only be measurable in a few use cases. If performance matters, the best gain to be obtained would be to use non-rotating storage media (solid-state drives or SSDs); their cost per megabyte is higher than that of standard hard disk drives, and their capacity is usually smaller, but they provide excellent performance for random accesses. If the usage pattern includes many input/output operations scattered all around the filesystem, for instance for databases where complex queries are routinely being run, then the advantage of running them on an SSD far outweighs whatever could be gained by picking LVM over RAID or the reverse. In these situations, the choice should be determined by other considerations than pure speed, since the performance aspect is most easily handled by using SSDs.

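The third notable use case is when one just wants to aggregate two disks into one volume, either for performance reasons or to have a single filesystem that is larger than any of the available disks. This case can be addressed both by a RAID-0 (or even linear-RAID) and by an LVM volume. When in this situation, and barring extra constraints (for instance, keeping in line with the rest of the computers if they only use RAID), the configuration of choice will often be LVM. The initial setup is barely more complex, and that slight increase in complexity is more than made up for by the extra flexibility that LVM brings if the requirements change or if new disks need to be added.

Then of course, there is the really interesting use case, where the storage system needs to be made both resistant to hardware failure and flexible when it comes to volume allocation. Neither RAID nor LVM can address both requirements on their own; no matter, this is where we use both at the same time, or rather, one on top of the other. The scheme that has all but become a standard since RAID and LVM have reached maturity is to ensure data redundancy first by grouping disks in a small number of large RAID arrays, and to use these RAID arrays as LVM physical volumes; logical volumes will then be carved out of the resulting volume groups to host filesystems. The selling point of this setup is that when a disk fails, only a small number of RAID arrays will need to be reconstructed, thereby limiting the time spent by the administrator for recovery.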

Let's take a concrete example: the public relations department at Falcot Corp needs a workstation for video editing, but the department's budget doesn't allow investing in high-end hardware from the bottom up. A decision is made to favor the hardware that is specific to the graphic nature of the work (monitor and video card), and to stay with generic hardware for storage. However, as is widely known, digital video does have some particular requirements for its storage: the amount of data to store is large, and the throughput rate for reading and writing this data is important for the overall system performance (more than typical access time, for instance). These constraints need to be fulfilled with generic hardware, in this case two 960 GB SATA hard disk drives; the system data must also be made resistant to hardware failure, as well as some of the user data. Edited video clips must indeed be safe, but video rushes pending editing are less critical, since they're still on the videotapes.

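RAID-1 and LVM are combined to satisfy these constraints. The disks are attached to two different SATA controllers to optimize parallel access and reduce the risk of a simultaneous failure, and they therefore appear as sda and sdc. They are partitioned identically along the following scheme: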

# sfdisk -l /dev/sda
Disk /dev/sda: 894.25 GiB, 960197124096 bytes, 1875385008 sectors
Disk model: SAMSUNG MZ7LM960
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BB14C130-9E9A-9A44-9462-6226349CA012

Device         Start        End   Sectors   Size Type
/dev/sda1        2048       4095      2048     1M BIOS boot
/dev/sda2        4096  100667391 100663296    48G Linux RAID
/dev/sda3   100667392  134221823  33554432    16G Linux RAID
/dev/sda4   134221824  763367423 629145600   300G Linux RAID
/dev/sda5   763367424 1392513023 629145600   300G Linux RAID
/dev/sda6  1392513024 1875384974 482871951 230.3G Linux LVM

The first partitions of both disks are BIOS boot partitions.
The next two partitions sda2 and sdc2 (about 48 GB) are assembled into a RAID-1 volume, md0. This mirror is directly used to store the root filesystem.
The sda3 and sdc3 partitions are assembled into a RAID-0 volume, md1, and used as swap partition, providing a total 32 GB of swap space. Modern systems can provide plenty of RAM and our system won't need hibernation. So with this amount added, our system is unlikely to run out of memory.
The sda4 and sdc4 partitions, as well as sda5 and sdc5, are assembled into two new RAID-1 volumes of about 300 GB each, md2 and md3. Both these mirrors are initialized as physical volumes for LVM, and assigned to the vg_raid volume group. This VG thus contains about 600 GB of safe space.
The remaining partitions, sda6 and sdc6, are directly used as physical volumes, and assigned to another VG called vg_bulk, which therefore ends up with roughly 460 GB of space.
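
The layout described in this list could be assembled with the commands from the previous sections. The following is only a sketch under stated assumptions: the installer normally creates the root mirror itself, the helper names run and build_layout are hypothetical, and by default the script prints the commands rather than executing them (DRY_RUN=0 would run them for real and destroy existing data on those partitions).

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

build_layout() {
    # Mirrors for the root filesystem and for the two redundant LVM PVs.
    run mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc2
    run mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda4 /dev/sdc4
    run mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda5 /dev/sdc5
    # Striped (RAID-0) swap space.
    run mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sda3 /dev/sdc3
    run mkswap /dev/md1
    # Redundant VG on top of the mirrors, bulk VG directly on partitions.
    run pvcreate /dev/md2 /dev/md3 /dev/sda6 /dev/sdc6
    run vgcreate vg_raid /dev/md2 /dev/md3
    run vgcreate vg_bulk /dev/sda6 /dev/sdc6
}

build_layout
```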

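Once the VGs are created, they can be partitioned in a very flexible way. One must keep in mind that LVs created in vg_raid will be preserved even if one of the disks fails, which will not be the case for LVs created in vg_bulk; on the other hand, the latter will be allocated in parallel on both disks, which allows higher read or write speeds for large files.

We will therefore create the lv_var and lv_home LVs on vg_raid, to host the matching filesystems; another large LV, lv_movies, will be used to host the definitive versions of movies after editing. The other VG will be split into a large lv_rushes, for data straight out of the digital video cameras, and a lv_tmp for temporary files. The location of the work area is a less straightforward choice to make: while good performance is needed for that volume, is it worth risking losing work if a disk fails during an editing session? Depending on the answer to that question, the relevant LV will be created on one VG or the other.

We now have both some redundancy for important data and much flexibility in how the available space is split across the applications.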

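NOTE Why three RAID-1 volumes?

We could have set up one RAID-1 volume only, to serve as a physical volume for vg_raid. Why create three of them, then?

The rationale for the first split (md0 vs. the others) is about data safety: data written to both elements of a RAID-1 mirror are exactly the same, and it is therefore possible to bypass the RAID layer and mount one of the disks directly. In case of a kernel bug, for instance, or if the LVM metadata become corrupted, it is still possible to boot a minimal system to access critical data such as the layout of disks in the RAID and LVM volumes; the metadata can then be reconstructed and the files can be accessed again, so that the system can be brought back to its nominal state.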

The rationale for the second split (md2 vs. md3) is less clear-cut, and more related to acknowledging that the future is uncertain. When the workstation is first assembled, the exact storage requirements are not necessarily known with perfect precision; they can also evolve over time. In our case, we can't know in advance the actual storage space requirements for video rushes and complete video clips. If one particular clip needs a very large amount of rushes, and the VG dedicated to redundant data is less than halfway full, we can re-use some of its unneeded space. We can remove one of the physical volumes, say md3, from vg_raid and either assign it to vg_bulk directly (if the expected duration of the operation is short enough that we can live with the temporary drop in performance), or undo the RAID setup on md3 and integrate its components sda5 and sdc5 into the bulk VG (which grows by 600 GB instead of 300 GB); the lv_rushes logical volume can then be grown according to requirements.
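
The first of the two options just described (handing md3 over to vg_bulk as it is) maps onto three LVM commands. The sketch below prints them in order; run and move_pv are hypothetical helper names, and the script stays in dry-run mode unless DRY_RUN=0 is set. It assumes the other PVs of vg_raid have enough free extents to receive the data currently stored on md3.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# usage: move_pv PV FROM_VG TO_VG
move_pv() (
    run pvmove "$1"          # migrate allocated extents to other PVs of FROM_VG
    run vgreduce "$2" "$1"   # detach the now-empty PV from FROM_VG
    run vgextend "$3" "$1"   # attach it to TO_VG
)

move_pv /dev/md3 vg_raid vg_bulk
```

Once vg_bulk has grown, lv_rushes can be enlarged with lvresize and resize2fs as shown earlier in this section.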
