Managing and rebuilding software RAID on servers using legacy boot (BIOS) mode
Objective
Redundant Array of Independent Disks (RAID) is a technology that mitigates data loss on a server by replicating data across two or more disks.
The default RAID level for OVHcloud server installations is RAID 1, which doubles the space taken up by your data, effectively halving the usable disk space.
This guide explains how to manage and rebuild a software RAID in the event of a disk replacement on your server in legacy boot mode (BIOS).
Before we begin, please note that this guide focuses on Dedicated servers that use legacy boot (BIOS) mode. If your server uses UEFI mode (newer motherboards), refer to the guide Managing and rebuilding software RAID on servers in UEFI boot mode.
To check whether a server runs on legacy BIOS or UEFI mode, run the following command:
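One simple check, assuming a Linux system: the directory /sys/firmware/efi only exists when the system was booted in UEFI mode.

```shell
# Prints "UEFI" if the firmware booted in UEFI mode, "Legacy BIOS" otherwise
[ -d /sys/firmware/efi ] && echo "UEFI" || echo "Legacy BIOS"
```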
Requirements
- A Dedicated server with a software RAID configuration
- Administrative (sudo) access to the server via SSH
- Understanding of RAID and partitions
Instructions
When you purchase a new server, you may want to run a series of tests and actions. One such test is simulating a disk failure, which lets you walk through the RAID rebuild process and be prepared if a real failure ever occurs.
Content overview
Basic Information
In a command line session, type the following code to determine the current RAID status:
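On a Linux server with mdadm software RAID, the status is read from /proc/mdstat. The output below is illustrative of an NVMe setup such as the one described here; block counts are elided:

```shell
cat /proc/mdstat
# Personalities : [raid1]
# md4 : active raid1 nvme1n1p4[1] nvme0n1p4[0]
#       ... blocks super 1.2 [2/2] [UU]
#
# md2 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
#       ... blocks super 1.2 [2/2] [UU]
#
# unused devices: <none>
```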
This command shows that two software RAID devices are currently set up, with md4 being the largest. The md4 RAID device consists of two partitions, named nvme0n1p4 and nvme1n1p4.
The [UU] means that all the disks are working normally. A _ would indicate a failed disk.
If you have a server with SATA disks, you would get the following results:
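An illustrative equivalent for SATA disks, where the member partitions appear under the sda/sdb names:

```shell
cat /proc/mdstat
# Personalities : [raid1]
# md4 : active raid1 sda4[0] sdb4[1]
#       ... blocks super 1.2 [2/2] [UU]
#
# md2 : active raid1 sda2[0] sdb2[1]
#       ... blocks super 1.2 [2/2] [UU]
#
# unused devices: <none>
```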
Although this command returns our RAID volumes, it doesn't tell us the size of the partitions themselves. We can find this information using fdisk -l:
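For example (the exact device names and sizes depend on your hardware):

```shell
# Lists every disk, its partition table type (GPT or MBR) and partition sizes
sudo fdisk -l
```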
The fdisk -l command also allows you to identify your partition type. This is important information when rebuilding your RAID after a disk failure.
For GPT partitions, line 6 will display: Disklabel type: gpt. This information can only be seen when the server is in normal mode.
Still based on the results of fdisk -l, we can see that /dev/md2 is 888.8 GB in size and /dev/md4 is 973.5 GB.
Alternatively, the lsblk command offers a different view of the partitions:
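For example:

```shell
# Prints the block device tree with sizes, types and mount points
lsblk
```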
Note the devices, partitions, and mount points, as this is important, especially after replacing a disk. This will allow you to verify that the partitions are correctly mounted on their respective mount points on the new disk.
In our example, we have:
- Partitions part of md2 (/): sda2 and sdb2.
- Partitions part of md4 (/home): sda4 and sdb4.
- Swap partitions: sda3 and sdb3.
- BIOS boot partitions: sda1 and sdb1.
The sda5 partition is a config drive, i.e. a read-only volume that provides the server with its initial configuration data. It is only read once during initial boot and can be removed afterwards.
Simulating a disk failure
We now have all the necessary information to simulate a disk failure. In this example, we will fail the disk sda.
The preferred way to do this is via the OVHcloud rescue mode environment.
First reboot the server in rescue mode and log in with the provided credentials.
To remove a disk from the RAID, the first step is to mark it as Failed and remove the partitions from their respective RAID arrays.
From the above output, sda consists of two partitions in RAID which are sda2 and sda4.
Removing the failed disk
First we mark the partitions sda2 and sda4 as Failed.
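With mdadm, each partition is marked as failed within its own array; here sda2 belongs to md2 and sda4 to md4:

```shell
sudo mdadm --manage /dev/md2 --fail /dev/sda2
sudo mdadm --manage /dev/md4 --fail /dev/sda4
```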
We have now simulated a RAID failure. When we run the cat /proc/mdstat command, we get the following output:
As we can see above, the (F) next to the partitions indicates that the disk has failed or is faulty.
Next, we remove these partitions from the RAID arrays.
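Using the same mdadm syntax as above:

```shell
sudo mdadm --manage /dev/md2 --remove /dev/sda2
sudo mdadm --manage /dev/md4 --remove /dev/sda4
```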
To simulate a clean disk, run the following command. Replace sda with your own values:
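A sketch of one way to do this: overwriting the first few megabytes of the disk destroys its partition table, so it looks like a blank drive. This is destructive, so only run it against the disk you intend to replace:

```shell
# Overwrite the first 10 MB of the disk in a single pass (destroys the partition table)
sudo shred -s10M -n1 /dev/sda
```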
If we run the following command, we can see that the disk has been successfully "wiped" and now appears as a new, empty drive:
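Listing the device should show no partitions left (illustrative output):

```shell
lsblk /dev/sda
# NAME MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
# sda    8:0    0  ...   0 disk
```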
Our RAID status should now look like this:
From the results above, we can see that only two partitions now appear in the RAID arrays. We have successfully failed the disk sda and we can proceed with the disk replacement.
For more information on how to prepare and request a disk replacement, consult this guide.
If you run the following command, you can get more details on the RAID array(s):
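The mdadm detail view shows the array state, the member devices and their status:

```shell
sudo mdadm --detail /dev/md4
```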
Rebuilding the RAID
This process may vary depending on the operating system installed on your server. We recommend that you consult the official documentation for your operating system to obtain the appropriate commands.
For most servers in software RAID, after a disk replacement, the server is able to boot in normal mode (on the healthy disk) to rebuild the RAID. However, if the server is not able to boot in normal mode, it will be rebooted in rescue mode to proceed with the RAID rebuild.
Rebuilding the RAID in normal mode
In our example, we have replaced the disk sda.
Once the replacement is done, the next step is to copy the partition table from the healthy disk (in this example, sdb) to the new one (sda).
For GPT partitions, the command should be in this format: sgdisk -R /dev/newdisk /dev/healthydisk.
For MBR partitions, the command should be in this format: sfdisk -d /dev/healthydisk | sfdisk /dev/newdisk.
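Applied to this example (healthy disk sdb, new disk sda); run only the variant matching your partition table type (see the Disklabel type reported by fdisk -l):

```shell
# GPT partition table: replicate sdb's table onto sda
sudo sgdisk -R /dev/sda /dev/sdb

# MBR partition table: dump sdb's table and write it to sda
sudo sfdisk -d /dev/sdb | sudo sfdisk /dev/sda
```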
Once this is done, the next step is to randomise the GUID of the new disk to prevent GUID conflicts with other disks:
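For GPT disks, sgdisk can randomise the disk and partition GUIDs in one step:

```shell
sudo sgdisk -G /dev/sda
```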
If you receive a message stating that the kernel is still using the old partition table, run partprobe. If you still cannot see the newly created partitions (e.g. with lsblk), you need to reboot the server before continuing.
Next, we add the partitions to the RAID:
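Continuing the example (replace the partition names with your own):

```shell
sudo mdadm --manage /dev/md2 --add /dev/sda2
sudo mdadm --manage /dev/md4 --add /dev/sda4
```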
Use the following command to monitor the RAID rebuild:
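While the rebuild is in progress, /proc/mdstat shows a recovery line (illustrative output):

```shell
cat /proc/mdstat
# md4 : active raid1 sda4[2] sdb4[1]
#       ... blocks super 1.2 [2/1] [_U]
#       [==>..................]  recovery = ...%
```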
Lastly, we add a label and mount the [SWAP] partition (if applicable).
To add a label to the SWAP partition:
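A sketch, assuming the swap partition on the replaced disk is sda4 as in the fstab example below (the label name is arbitrary; adjust both to your own layout):

```shell
# Re-initialises the partition as swap and assigns it a label
sudo mkswap -L swap-sda4 /dev/sda4
```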
Next, retrieve the UUIDs of both SWAP partitions:
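blkid reports the UUID of each swap partition (illustrative output; the new UUID was generated by mkswap above):

```shell
sudo blkid | grep -i swap
# /dev/sda4: LABEL="swap-sda4" UUID="b3c9e03a-52f5-4683-81b6-cc10091fcd15" TYPE="swap"
# /dev/sdb4: UUID="..." TYPE="swap"
```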
We replace the old UUID of the SWAP partition (sda4) with the new one in the /etc/fstab file.
Example:
Based on the above results, the old UUID is b7b5dd38-9b51-4282-8f2d-26c65e8d58ec and should be replaced with the new one b3c9e03a-52f5-4683-81b6-cc10091fcd15.
Make sure you replace the correct UUID.
Next, we verify that everything is properly mounted with the following command:
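One way to verify, assuming the standard util-linux tools: mount -a attempts to mount every entry in /etc/fstab and -v reports the result for each one:

```shell
sudo mount -av
```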
We run the following command to enable the SWAP partition:
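swapon with -a enables every swap entry in /etc/fstab, and -v prints what it does:

```shell
sudo swapon -av
```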
Then reload the system with the following command:
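On systemd-based distributions, this picks up the /etc/fstab changes:

```shell
sudo systemctl daemon-reload
```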
The RAID rebuild is complete.
Rebuilding the RAID in rescue mode
If your server is unable to reboot in normal mode after a disk replacement, it will be rebooted in rescue mode by our datacentre team.
In this example, we have replaced the disk sdb.
Once the disk has been replaced, we need to copy the partition table from the healthy disk (in this example, sda) to the new one (sdb).
For GPT partitions, the command should be in this format: sgdisk -R /dev/newdisk /dev/healthydisk
Example:
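In this example (healthy disk sda, new disk sdb; in rescue mode you are typically logged in as root):

```shell
sgdisk -R /dev/sdb /dev/sda
```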
For MBR partitions, the command should be in this format: sfdisk -d /dev/healthydisk | sfdisk /dev/newdisk
Example:
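In this example (healthy disk sda, new disk sdb):

```shell
sfdisk -d /dev/sda | sfdisk /dev/sdb
```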
Once this is done, the next step is to randomise the GUID of the new disk to prevent GUID conflicts with other disks:
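For GPT disks:

```shell
sgdisk -G /dev/sdb
```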
If you receive a message stating that the kernel is still using the old partition table, you can simply run the partprobe command.
We can now rebuild the RAID array by adding the new partitions (sdb2 and sdb4) back:
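Using the same mdadm syntax as in the normal-mode procedure:

```shell
mdadm --manage /dev/md2 --add /dev/sdb2
mdadm --manage /dev/md4 --add /dev/sdb4
```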
Use the cat /proc/mdstat command to monitor the RAID rebuild:
Lastly, we add a label and mount the [SWAP] partition (if applicable).
Once the RAID rebuild is complete, we mount the partition containing the root of our operating system on /mnt. In our example, that partition is md2.
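In this example:

```shell
mount /dev/md2 /mnt
```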
We add the label to our SWAP partition with the command:
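A sketch, assuming the swap partition on the replaced disk is sdb4 as in the fstab example below (label name arbitrary; adjust to your layout):

```shell
mkswap -L swap-sdb4 /dev/sdb4
```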
Next, we mount the following directories to make sure any manipulation we make in the chroot environment works properly:
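A common minimal set of bind mounts (some systems also need /run):

```shell
mount --bind /dev  /mnt/dev
mount --bind /proc /mnt/proc
mount --bind /sys  /mnt/sys
```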
Next, we access the chroot environment:
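This starts a shell rooted in the installed system:

```shell
chroot /mnt /bin/bash
```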
We retrieve the UUIDs of both swap partitions:
Example:
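Illustrative output; the new UUID on sdb4 was generated by mkswap above:

```shell
blkid | grep -i swap
# /dev/sda4: UUID="..." TYPE="swap"
# /dev/sdb4: LABEL="swap-sdb4" UUID="b3c9e03a-52f5-4683-81b6-cc10091fcd15" TYPE="swap"
```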
Next, we replace the old UUID of the swap partition (sdb4) with the new one in /etc/fstab:
Example:
In our example above, the old UUID d6af33cf-fc15-4060-a43c-cb3b5537f58a should be replaced with the new one, b3c9e03a-52f5-4683-81b6-cc10091fcd15.
Make sure you replace the proper UUID.
Next, we make sure everything is properly mounted:
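As in the normal-mode procedure, mount -av checks every /etc/fstab entry from inside the chroot:

```shell
mount -av
```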
Enable the SWAP partition with the following command:
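Still inside the chroot:

```shell
swapon -av
```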
We exit the chroot environment with exit and reload the system:
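A sketch of the two steps, assuming a systemd-based distribution:

```shell
exit                      # leave the chroot environment
systemctl daemon-reload   # pick up the configuration changes
```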
We unmount all the disks:
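From the rescue environment, a recursive unmount releases /mnt and everything bound under it:

```shell
umount -R /mnt
```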
The RAID rebuild is complete. Reboot the server in normal mode.
Go further
For specialised services (SEO, development, etc.), contact OVHcloud partners.
If you would like assistance using and configuring your OVHcloud solutions, please refer to our support offers.
If you need training or technical assistance to implement our solutions, contact your sales representative or click on this link to get a quote and ask our Professional Services experts to assist with your specific use case.
Join our community of users.
- Secure Shell (SSH): a secure network protocol used to establish connections between a client and a server. It allows commands to be executed remotely in a secure manner. ↩