Degraded RAID array | Missing ".ocdata"

My NC had been running great for 50 days, but yesterday my co-workers told me that they were experiencing problems when uploading bigger files. So I edited the php.ini file as follows:
Max file uploads 20->20 000
Upload max filesize 2M->10G
Memory limit 512M->2048M
post max size 8M->5G
max exec time 30->120
max input time 60->120
chunk size 10MB->30MB
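For reference, those settings map to the php.ini directives below (values copied from the list above; the file location depends on your setup, e.g. somewhere under /etc/php/ for Apache or FPM). The chunk size is a Nextcloud setting rather than a php.ini directive:

```ini
; sketch of the changed directives, not a complete php.ini
max_file_uploads    = 20000
upload_max_filesize = 10G
post_max_size       = 5G
memory_limit        = 2048M
max_execution_time  = 120
max_input_time      = 120
```

Note that for non-chunked uploads post_max_size normally needs to be at least as large as upload_max_filesize, otherwise the smaller value wins; here 5G would cap a 10G upload.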

I am not sure if that is connected, but later in the day I got an error about a missing “.ocdata” file in the data directory. I was not sure what to do, so at first I restarted my Ubuntu machine and the system was up and running again … but only for a few hours, and now restarting is not helping. The whole error I get is this:

Your data directory is invalid. Ensure there is a file called “.ocdata” in the root of the data directory. Cannot create “data” directory. This can usually be fixed by giving the webserver write access to the root directory. See Installation wizard — Nextcloud latest Administration Manual.

I am using 2x 4TB Seagate IronWolf NAS HDDs in a RAID1 array, and I believe it is the source of the main problem. After running “sudo mdadm --query --detail /dev/md0” I see that the status of the RAID is “clean, degraded”, and “cat /proc/mdstat” shows “U_”.

I was able to find some instructions for repairing a RAID array, but since they were not specifically for NC, I wanted to make sure I am doing the right thing, so I decided to ask here.
I already tried “sudo mdadm /dev/md0 -a /dev/sdc”, but it gives the error “cannot load array metadata for /dev/md0”.

So long story short I have 2 questions:

  1. Could my changes to php.ini be the cause of this problem?
  2. Do you also think the problem is in the RAID array, and how can I fix it without losing data?

Thanks in advance!

Hello

First, your changes to php.ini have nothing to do with this. It is just bad luck that it happened when you changed php.ini.

Second, your RAID1 array is defective. Why, we don’t know yet, but it is not related to NC. This is an OS-level issue.

I am currently in the same situation with my RAID, but I have simply removed the defective HDD from the array and everything is still running fine.

What you should do is:

Look in dmesg for the failing HDD.
If possible, check smartctl for the failing HDD.
Remove the defective HDD from the array.
Replace the defective HDD with a new one.
Partition the new HDD as Linux RAID autodetect.
Add the new HDD to the RAID array.

It will then start to rebuild.
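Assuming the failed member really is /dev/sdc (it might be a partition like /dev/sdc1 instead), the sequence above would look roughly like this. The commands need root and the correct device names, so treat this as a sketch, not a recipe:

```shell
# mark the failing member as faulty (if the kernel has not already)
mdadm --manage /dev/md0 --fail /dev/sdc
# remove it from the array
mdadm --manage /dev/md0 --remove /dev/sdc
# after physically swapping in the new disk, give it a partition of
# type "Linux RAID autodetect" (fd) with fdisk, then add it back
mdadm --manage /dev/md0 --add /dev/sdc
# watch the rebuild progress
cat /proc/mdstat
```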

A few questions from our side.

Is your data directory located on the /dev/md0 RAID?
You tried to re-add /dev/sdc to your RAID; are you sure this is correct? (Maybe it is /dev/sdc1, but that is speculation.)
Please create a backup (user data / SQL / NC installation) asap if not already done.

please post the complete output of

sudo mdadm --detail /dev/md0

and

sudo fdisk -l

I am glad that my changes are not the reason.

What do you mean by “removed the defective hdd”? Physically detached it from the motherboard, or removed it virtually from the RAID array?

I am not able to replace it right away, so I hope there is a way to run it again as it is without losing any data, and to check the health of the bad HDD (I still believe something else triggered the problem and there is no damage to the HDD, because it is brand new - only 2 months of usage).

DMESG:

md/raid1:md0: Disk failure on sdc, disabling device. md/raid1:md0: Operation continuing on 1 devices.
Buffer I/O error on dev md0, logical block 976721600, async page read
md super_written gets error=-5
Buffer I/O error on dev md0, logical block 0, lost sync page write
EXT4-fs (md0): I/O error while writing superblock
Buffer I/O error on dev md0, logical block 0, async page read
Dev md0: unable to read RDB block 0
There were other errors as well, but I copied only the ones which I think are related to the RAID.

sudo smartctl -t short /dev/sdc

SMART overall-health self-assessment test result: PASSED
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1230 -

‘datadirectory’ => ‘/mnt/md0/nextcloud/data’

Upon checking this I see the mismatch, but I have no explanation for how it worked until now?!

I was also not sure, but checking now with the DMESG I think I was correct.

Is there a smart way to create a backup, or should I just plug in an external HDD and copy everything?

/dev/md0:
Version : 1.2
Creation Time : Sun Jun 19 18:39:38 2022
Raid Level : raid1
Array Size : 3906886464 (3725.90 GiB 4000.65 GB)
Used Dev Size : 3906886464 (3725.90 GiB 4000.65 GB)
Raid Devices : 2
Total Devices : 1
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Aug 8 09:38:21 2022
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Number   Major   Minor   RaidDevice   State
   0       8      16         0        active sync
   -       0       0         1        removed

Disk /dev/loop0: 400,82 MiB, 420265984 bytes, 820832 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop1: 81,27 MiB, 85209088 bytes, 166424 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop2: 4 KiB, 4096 bytes, 8 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop3: 54,24 MiB, 56872960 bytes, 111080 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop4: 46,98 MiB, 49233920 bytes, 96160 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop5: 91,7 MiB, 96141312 bytes, 187776 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop6: 61,98 MiB, 64970752 bytes, 126896 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop7: 254,1 MiB, 266436608 bytes, 520384 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/sda: 465,78 GiB, 500107862016 bytes, 976773168 sectors
Disk model: ST9500325AS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x21b9fc3f

Device Boot Start End Sectors Size Id Type
/dev/sda1 * 2048 1050623 1048576 512M b W95 FAT32
/dev/sda2 1050624 2101247 1050624 513M b W95 FAT32
/dev/sda3 2101248 3151871 1050624 513M b W95 FAT32
/dev/sda4 3153918 976771071 973617154 464,3G 5 Extended
/dev/sda5 3153920 976771071 973617152 464,3G 83 Linux

Disk /dev/loop8: 61,98 MiB, 64970752 bytes, 126896 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/loop9: 46,98 MiB, 49242112 bytes, 96176 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

The primary GPT table is corrupt, but the backup appears OK, so that will be used.
Disk /dev/sdc: 3,65 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000VN008-2DR1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: BDD95849-5D83-4F94-A4D4-2510A6D27DCD

The primary GPT table is corrupt, but the backup appears OK, so that will be used.
Disk /dev/sdd: 3,65 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: ST4000VN008-2DR1
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 9F71BB6E-FAEC-4D3A-ABCA-A2CA7A94F1B3

Here I am not sure where sdd comes from?!

As some kind of conclusion, I think the problem is in the RAID itself and not in the disk, but I might be wrong.

I meant remove it virtually from the array.
First make sure it’s not active (we can see it is deactivated in dmesg), but double check.

If it is still active, this will set the disk as faulty:

mdadm --manage /dev/md0 --fail /dev/sdc

then remove it from the array with

mdadm --manage /dev/md0 --remove /dev/sdc

The device /dev/md0 is like a drive or partition: it is just a disk which must be mounted before you can functionally use it. This is done at /mnt/md0(/nextcloud/data),
where the part in brackets may be a folder on the disk /dev/md0, or it might be an existing (user-created) folder where /dev/md0 is mounted.
Check it in /etc/fstab or with

df -h
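For illustration, an /etc/fstab entry for this mount could look like the line below. The UUID is a placeholder (blkid /dev/md0 prints the real one), and ext4 matches the EXT4-fs messages in the dmesg output above:

```
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/md0  ext4  defaults,nofail  0  2
```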

I would use the smart way Backup — Nextcloud latest Administration Manual latest documentation

Most important is the database, which cannot be copied like files; you need to extract it from the SQL server.
I use something similar to the manual:

mysqldump --single-transaction --default-character-set=utf8mb4 -h [server] -u [username] -p[password] [db_name] > nextcloud-sqlbkp_$(date +"%Y%m%d").sql
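Note that the date suffix relies on shell command substitution, i.e. $(date +"%Y%m%d"); without it you get a literal file name. A quick check of what the resulting name looks like:

```shell
# build the date-stamped dump file name; $(...) runs the date
# command and splices its output into the string
backup_name="nextcloud-sqlbkp_$(date +"%Y%m%d").sql"
echo "$backup_name"
```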

For the data I use rsync; of course, make sure the MOUNTED_USB/backupdir/ either exists or is corrected:

rsync -ah --stats \
  --exclude={'lost+found','var/','mirror/','skel/','updater-*','appdata_*'} \
  /mnt/md0/nextcloud/data/ \
  /media/MOUNTED_USB/backupdir/ \
  &>/tmp/rsync.log

Your output of fdisk is missing /dev/md0.
/dev/md0 was probably created with /dev/sdc and /dev/sdd, as you said,

so you must have 2 disks.
Please post the full output of

fdisk -l /dev/md0

cat /etc/mdadm/mdadm.conf

mdadm -D /dev/md0

and

cat /proc/mdstat

in order to determine which disk is still OK, and whether the RAID is still set up correctly to rebuild or assemble.

Also, in the meantime, check that /mnt/md0/nextcloud/data exists and that

sudo -u www-data touch file.tmp

creates a file there, and that the file .ocdata is there:

ls -a
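Put together, a minimal check might look like this (it assumes the datadirectory from earlier in the thread and a www-data webserver user):

```shell
# verify the data directory is reachable, writable by the
# webserver user, and still contains the .ocdata marker file
cd /mnt/md0/nextcloud/data
sudo -u www-data touch file.tmp && echo "webserver can write here"
ls -a | grep -F .ocdata && echo ".ocdata is present"
```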

I had a very long and interesting day fixing my system. I will explain everything, although I don’t believe it will help anybody; I am just looking for confirmation of my theory.

First, when I woke up in the morning I saw that Vincent had replied (thank you for your continued support!!!) and I was motivated to fix the problem. I tried to remotely access my machine, but AnyDesk couldn’t connect to it, so I went to the machine and saw very strange activity - it was constantly turning on and off every 3-5 seconds, and pressing the power button didn’t seem to help. I spent a few hours isolating the problem to the power adapter and PSU (built into the mini-ITX case), so I had to mount another PSU (this time a standard ITX one) and the machine was working again … or at least booting properly, but after the Ubuntu logo there was only a black screen and no responsiveness. Another few hours were spent resolving the black screen, and finally I was able to access my machine somehow.

The strange thing was that at this point I accidentally checked my NC (I had a browser open) and it was working!! I ran “sudo mdadm --detail” and the status of the RAID is now “clean” (not degraded as before).

So at this point I am wondering: if the previous PSU was running close to its maximum, could I have damaged it somehow when I boosted the file transfer settings for NC?! And if the PSU was somehow damaged, could that be the cause of the RAID not working properly?

Do you think I will be all good if I only replace the PSU, or should I also purchase a new HDD, although it works fine for now? I ran the short SMART self-test on both disks and the results are “OK”.

Glad to hear the problem is resolved.

Check your hardware and source its power consumption, either from the box, the manual, or online.

Use a tool like inxi to find out hardware serial numbers or types.

Your disks are probably fine, but having a spare can’t hurt. You could also add spares to your RAID.

There are many ways a PSU can die; make sure the PSU has at least 10% more power than your max usage, preferably 25% or more.

Set up smartd with something like the below in /etc/smartd.conf. The -s schedule regex is T/MM/DD/d/HH, so the first line runs a short self-test Monday-Friday at 14:00 and a long one Wednesday and Saturday at 01:00:

/dev/sdc -a -o on -S on -s (S/../../(1|2|3|4|5)/14|L/../../(3|6)/01) -m root -M exec /usr/share/smartmontools/smartd-runner
/dev/sdd -a -o on -S on -s (S/../../(1|2|3|4|5)/10|L/../../(3|6)/07) -m 'your@email.com' -M exec /usr/share/smartmontools/smartd-runner

Thank you for the advice!
I have already set up email notifications! :slight_smile: