JUN0.DEV

Longhorn Mount Conflict and Possible multipathd Interference

Author: Junyoung Yang
GitHub: pnu-code-place/code-place, the repository for Code Place, Pusan National University's coding practice platform

While operating Code Place, I ran into a mount error while recovering a PostgreSQL instance and Longhorn volume after WAL growth. The overall recovery flow is covered in the Longhorn and CNPG recovery post. This post focuses only on the part where the Longhorn volume could not be mounted on the actual node.

At first, it looked like the database was failing to start. But Kubernetes events showed mount failures, and Longhorn also showed unhealthy share-manager state.

At first glance, it looked like a Longhorn problem. But as I followed the logs, I had to consider several things together: the "already mounted or mount point busy" error, ext4-related messages, AppArmor logs, and possible multipathd interference.

This post is a record of tracing the mount error during the recovery process beyond the Kubernetes and Longhorn screens, down to Linux device management.

Problem

The first messages that caught my attention were related to mount failures.

  • already mounted or mount point busy
  • format of disk failed
  • ext4 messages saying that a device was already in use
  • Longhorn share-manager not coming up normally
  • Logs that looked like AppArmor denial

These messages did not all seem to point to the same cause at first. Some looked like Kubernetes mount problems, some looked like filesystem problems, and some looked like security policy problems.

But there was one common point. Longhorn expected to take the device and mount it in a certain way, but the system was saying that the device was already in use or not ready.

Longhorn UI Check

At first, I looked at the volume state and replica state in the Longhorn UI. Since the system used Longhorn, that was the natural place to start.

But there were parts that Longhorn UI alone could not explain.

The PVC existed, and the Longhorn volume also existed in some state. But the workload could not use that volume normally. There was a gap between the state shown by Longhorn and the actual attach and mount process used by the Pod.

From that point, I had to check not only Longhorn state but also Kubernetes events and node logs. Storage problems often do not end at the operator UI.

Mount Busy Check

The "already mounted or mount point busy" message looks simple, but it can come from several situations.

  • Was a previous mount not cleaned up correctly?
  • Is another process holding the same device?
  • Did the path expected by share-manager differ from the actual mount state?
  • Did the node assign the device name or link differently than expected?

At first, I suspected leftover mounts or Longhorn process state. But as the same symptom repeated, I started thinking that this might not be solved just by deleting a mount point.

The question I had to check was "Why does the system see this device as busy?" Instead of treating the Kubernetes event line itself as the thing to fix, I had to check the device state underneath it.
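One way to start answering that question on the node is to see whether the kernel already lists the device as mounted somewhere. The following is a minimal Python sketch, not something taken from the original incident. It scans text in the format of /proc/mounts for a given source device. The function name is hypothetical, and any device path passed to it is an assumption about the environment.

```python
def mounts_for_device(device: str, mounts_text: str) -> list[str]:
    """Return the mount points whose source field equals `device`.

    `mounts_text` is expected to look like /proc/mounts: one mount per line,
    whitespace-separated, with the source first and the mount point second.
    """
    mount_points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) >= 2 and fields[0] == device:
            mount_points.append(fields[1])
    return mount_points
```

On a node, it could be fed the contents of /proc/mounts, for example `mounts_for_device(dev, open("/proc/mounts").read())`. If the result is empty while the busy error keeps appearing, the device is probably being held by something other than a leftover mount.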

ext4 Message Check

I also saw format failure and ext4-related messages. If I looked only at those messages, it would be easy to suspect filesystem corruption or a formatting problem.

But in this case, the ext4 messages looked like a downstream failure rather than the starting point. Something else already seemed to be holding the device, and because of that, format or mount was failing.

So the direction changed.

Instead of asking "Why did ext4 fail?", I started asking "Who is already using the device that ext4 is trying to access?"

After looking at it this way, I could not treat the problem as only a filesystem issue. I had to check the node-level device management state as well.
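On Linux, sysfs shows which other block devices sit on top of a given device through its holders directory, for example /sys/class/block/sdb/holders. A device-mapper map built on top of a disk appears there as an entry such as dm-0. The sketch below lists those entries. The function name and the `sysfs_root` parameter are illustrative, and `sdb` is only an example device name.

```python
from pathlib import Path


def device_holders(dev_name: str, sysfs_root: str = "/sys/class/block") -> list[str]:
    """List the block devices that hold `dev_name`, according to sysfs.

    Returns an empty list when the device or its holders directory is missing.
    """
    holders_dir = Path(sysfs_root) / dev_name / "holders"
    if not holders_dir.is_dir():
        return []
    return sorted(entry.name for entry in holders_dir.iterdir())
```

A non-empty result for the device behind a Longhorn volume, such as `device_holders("sdb")` returning a `dm-*` name, would be a hint that something else has layered a mapping over the device.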

share-manager State Check

In Longhorn, share-manager is the component that serves RWX (ReadWriteMany) volumes. If it cannot come up normally, the workloads that depend on it may fail to mount the volume.

But the share-manager failure could also be a result, not the root cause. If share-manager tried to mount the device but the device was already busy or was handled differently by the system, share-manager could also fail.

So I did not want to suspect only the share-manager component just because it was failing. The reason share-manager failed could be below it, in the node device state.

AppArmor Log Check

I also saw logs that looked like AppArmor denial or ptrace-related messages. Security-related logs stand out, so at first they can look like the cause.

But when I looked at the full flow, the messages that explained the symptom more directly were mount busy, ext4 in use, and format failed. AppArmor logs still needed to be checked, but I could not immediately treat them as the center of this mount failure.

When looking at failures, I had to separate logs that only looked noticeable from logs that actually explained the symptom better.
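As a rough illustration of that separation, the sketch below groups collected log lines by the symptom they mention, so the mount and format messages can be read side by side and apart from the AppArmor lines. The group names and substrings are assumptions based on the messages listed in this post, not an exhaustive or official set of patterns.

```python
# Substrings are matched case-insensitively; the first matching group wins.
SYMPTOM_PATTERNS = {
    "mount": ("already mounted or mount point busy",),
    "format": ("format of disk",),
    "apparmor": ("apparmor",),
}


def group_by_symptom(lines):
    """Group log lines by the first symptom pattern they contain."""
    groups = {name: [] for name in SYMPTOM_PATTERNS}
    groups["other"] = []
    for line in lines:
        lowered = line.lower()
        for name, needles in SYMPTOM_PATTERNS.items():
            if any(needle in lowered for needle in needles):
                groups[name].append(line)
                break
        else:
            # No known symptom matched this line.
            groups["other"].append(line)
    return groups
```

Grouping does not decide which log is the cause. It only makes it easier to compare how often, and in what order, each kind of message appears.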

multipathd Check

In the end, the question I had to answer was simple.

"Why does the system think this device is already in use?"

If another component on the system is managing the device separately from what Longhorn expects, this kind of problem can happen. One of the candidates at that point was multipathd.

multipathd is the daemon behind Linux device-mapper multipath, which manages storage devices that are reachable over multiple paths. Depending on the environment, it can touch or claim block devices that Longhorn expects to manage, or make a device appear differently from what Longhorn expects.

I did not assume from the beginning that multipathd was the only cause. But when I looked at messages such as mount busy, ext4 in use, and format failure together, it was reasonable to suspect that Longhorn was not able to manage the device exclusively.

So instead of checking only Longhorn state, I also checked which processes were holding or managing the device on the node and how the system was seeing that device.
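One way to see how the system views its block devices is the JSON output of `lsblk`, for example `lsblk --json -o NAME,TYPE,MOUNTPOINT`. In that output, a multipath map appears as a child of type `mpath` under the disk it was built on. The sketch below is an illustration rather than the exact check used during this incident, and the function name is hypothetical. It takes the JSON text and reports which top-level devices have such a child.

```python
import json


def disks_claimed_by_multipath(lsblk_json: str) -> dict[str, list[str]]:
    """Map each top-level device to the names of its direct `mpath` children.

    Devices without a multipath child are left out of the result.
    """
    tree = json.loads(lsblk_json)
    claimed = {}
    for dev in tree.get("blockdevices", []):
        maps = [
            child["name"]
            for child in dev.get("children", [])
            if child.get("type") == "mpath"
        ]
        if maps:
            claimed[dev["name"]] = maps
    return claimed
```

If the device behind a Longhorn volume shows up in the result, multipath has built a map on top of it, which fits the picture of Longhorn not having the device to itself.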

Disabling multipathd

In the end, the mount problem was resolved after stopping and disabling multipathd.

This was different from simply rebooting the node and resetting state. There was a possibility that multipathd was recognizing or touching the block device that Longhorn needed to manage. From Longhorn's point of view, that could make it fail to format or mount the device normally.

So I did not treat the problem only as "Longhorn mount failed." I checked which process on the node could be managing the device. After disabling multipathd, the mount proceeded, and the issue looked closer to a conflict with node-level device management than a Longhorn-only problem.

This does not mean that multipathd should always be disabled. When using a storage system like Longhorn, which directly manages block devices, I need to check whether another device management process on the node can touch the same devices.
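A first step in that check is simply knowing whether multipathd is running on a node. The sketch below asks systemd with `systemctl is-active`, which prints a state such as `active` or `inactive`. The function name and the injectable `run` parameter are illustrative choices, and the unit name `multipathd` is an assumption that may differ between distributions.

```python
import subprocess
from typing import Callable


def unit_state(
    unit: str,
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> str:
    """Return the state printed by `systemctl is-active <unit>`.

    `systemctl is-active` exits non-zero for units that are not active,
    so the exit code is deliberately not treated as an error here.
    """
    result = run(["systemctl", "is-active", unit], capture_output=True, text=True)
    return result.stdout.strip() or "unknown"
```

Calling `unit_state("multipathd")` on each storage node before and after a change gives a quick record of whether the daemon was part of the picture.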

Checkpoints

After this incident, my way of checking storage failures changed. Before, I mostly interpreted the problem from Longhorn or Kubernetes event messages. Now I also check the actual device state underneath them.

The points I now keep in mind are:

  • A PVC or Longhorn Volume existing does not mean it is actually usable.
  • A PVC being Bound and a volume being successfully attached and mounted are different states.
  • Mount failures may not be fully explained by Kubernetes events or the Longhorn UI.
  • When mount busy appears, check who is holding the device on the node.
  • share-manager failure can be a cause, but it can also be a result of a lower-level device problem.
  • Even if AppArmor logs stand out, check whether they actually explain the symptom.
  • Check whether a system process such as multipathd can touch the device that Longhorn is supposed to manage.

When I see a similar mount failure, I now check Kubernetes events, Longhorn volume and share-manager state, node device state, and system processes, in that order.

Takeaway

This incident showed me that looking only at the Longhorn volume state can miss part of the problem. From the top, it looked like a PostgreSQL instance problem. In the middle, it looked like a Longhorn mount failure. But to solve it, I had to go down to node-level device state and system processes.

Mount failure, format failure, and share-manager problems looked unrelated when viewed separately. But when I looked at them from the possibility that multipathd could interfere with the device Longhorn needed, messages such as mount busy, ext4 in use, and format failed started to connect as one flow.

When troubleshooting an operations failure, I need to check not only where the error is shown, but also where that error is created. This issue was not solved at the Longhorn screen alone. It had to be checked down to the node's device management process.