1 00:00:06,000 --> 00:00:11,441 Welcome to my talk about loop mounting and going beyond ordinary loop mounting. 2 00:00:11,441 --> 00:00:16,492 Hopefully most of you here know what loop mounting is 3 00:00:16,493 --> 00:00:22,330 but in case you don't I'm just going to give you a quick demo 4 00:00:20,124 --> 00:00:25,600 of what I'm going to call traditional, old fashioned loop mounting. 5 00:00:26,570 --> 00:00:30,279 I've got a file on my laptop here, a Fedora disk image. 6 00:00:30,280 --> 00:00:35,008 It's 6 GB in size. 7 00:00:32,162 --> 00:00:40,444 I use the command "losetup" to associate files with Linux kernel block devices. 8 00:00:40,581 --> 00:00:43,494 I don't have any loops set up at the moment. 9 00:00:43,494 --> 00:00:48,717 But I can associate this file with a Linux kernel block device called /dev/loop0. 10 00:00:50,166 --> 00:00:54,312 And I can look at partitions. 11 00:00:55,621 --> 00:00:56,857 And I can mount them -- that's loop mounting. 12 00:00:59,118 --> 00:01:06,211 That's just ordinary loop mounting, it's not what we're going to be talking about today although conceptually it's similar. 13 00:01:06,555 --> 00:01:09,250 Traditional loop mounting is fine if it's a plain file. 14 00:01:09,930 --> 00:01:11,686 It falls over very quickly. 15 00:01:12,286 --> 00:01:15,558 What happens if you've got a compressed disk image? 16 00:01:16,100 --> 00:01:23,080 This isn't a disk image that I've compressed, this is how cloud images are distributed by the Fedora community. 17 00:01:23,160 --> 00:01:24,835 And this is one that I just downloaded. 18 00:01:25,520 --> 00:01:28,914 From the Fedora website in exactly this format. I didn't modify it in any way. 19 00:01:28,918 --> 00:01:30,687 It's XZ-compressed. 20 00:01:31,496 --> 00:01:33,883 And of course you could turn that into a loop device. 21 00:01:34,435 --> 00:01:38,391 And what would happen is you'd end up with a device which contained XZ-compressed data. 
22 00:01:39,034 --> 00:01:43,358 And it is my contention today that this is not what you meant by loop mounting this. 23 00:01:44,638 --> 00:01:50,478 What you'd want it to do is you'd like it to transparently uncompress so you can see the data inside. 24 00:01:51,100 --> 00:01:53,119 And we can do that. 25 00:01:53,835 --> 00:01:56,835 Using a command called "nbdkit" which I'm going to talk about in a minute. 26 00:01:56,315 --> 00:01:59,519 And that creates a process. 27 00:02:02,036 --> 00:02:06,794 And now I use a command to associate that process with a loop device. 28 00:02:07,926 --> 00:02:10,755 It's a different command, "nbd-client", but it's conceptually similar. 29 00:02:12,676 --> 00:02:16,997 And you see there the size -- 4096 MB, 4 GB. 30 00:02:17,846 --> 00:02:20,519 So that's not the compressed size, that's the uncompressed size. 31 00:02:21,033 --> 00:02:24,082 What this is doing is it's transparently uncompressing. It's not uncompressing the whole thing. 32 00:02:25,042 --> 00:02:28,895 It's only uncompressing the blocks that are being used when you request them. 33 00:02:30,396 --> 00:02:33,048 And I can mount that. 34 00:02:34,686 --> 00:02:43,198 And I can look at it, and I can even create files. 35 00:02:46,249 --> 00:02:48,487 So that is what I'm going to be talking about today. 36 00:02:49,192 --> 00:02:51,964 How does this differ from loop mounting? 37 00:02:52,688 --> 00:02:57,519 In both cases we've got a kernel module, on the left hand side "loop.ko". 38 00:02:58,994 --> 00:03:02,714 And it's configured using a command line utility, "losetup". 39 00:03:03,374 --> 00:03:07,003 And you use that to create Linux kernel block devices. 40 00:03:07,568 --> 00:03:08,607 Like /dev/loop0. 41 00:03:09,130 --> 00:03:11,477 On the right hand side I've got a kernel module, "nbd.ko". 42 00:03:12,198 --> 00:03:15,196 It's configured using a command line client "nbd-client". 
43 00:03:15,677 --> 00:03:19,377 And it creates loop devices, Linux kernel block devices, "/dev/nbd0" etc. 44 00:03:20,562 --> 00:03:22,886 But the back end as you can see is a little bit different. 45 00:03:23,396 --> 00:03:29,754 On the left hand side the back end is talking over internal Linux kernel APIs like the VFS. 46 00:03:30,435 --> 00:03:33,191 To the file which is associated with the loop device. 47 00:03:33,679 --> 00:03:37,098 On the right hand side we've got a user process running. 48 00:03:37,633 --> 00:03:44,250 This is critical: We've got a user process in this case "nbdkit", but I should say that other NBD servers are available. 49 00:03:45,276 --> 00:03:47,114 Other very very good NBD servers are available. 50 00:03:48,722 --> 00:03:52,523 The kernel is talking to that process using 51 00:03:53,018 --> 00:03:57,717 a TCP port or a Unix domain socket as you require. 52 00:03:59,957 --> 00:04:05,156 I'm going to demonstrate in this talk nbdkit which is an NBD server which I wrote 53 00:04:05,714 --> 00:04:07,564 with a guy called Eric Blake 54 00:04:07,971 --> 00:04:09,838 who's a brilliant free software hacker. 55 00:04:10,601 --> 00:04:14,061 Nbdkit is slightly different from other NBD servers 56 00:04:14,526 --> 00:04:16,903 in that we have a plugin API. It's a stable API. 57 00:04:17,484 --> 00:04:21,336 Which means you can write a plugin now 58 00:04:21,754 --> 00:04:23,719 or you could have written a plugin back in 2013 when we started the project 59 00:04:24,409 --> 00:04:26,677 and it would still be compilable with nbdkit now 60 00:04:27,135 --> 00:04:29,289 and it will still be compilable in the future. 61 00:04:29,929 --> 00:04:31,881 We're not going to break plugins at the source level. 62 00:04:32,779 --> 00:04:35,326 It also has an ABI guarantee. 
63 00:04:35,916 --> 00:04:38,368 That means you can compile your plugin 64 00:04:39,085 --> 00:04:42,004 and you can distribute it separately from nbdkit 65 00:04:42,838 --> 00:04:44,038 as a binary 66 00:04:44,434 --> 00:04:49,037 and load it into nbdkit at some point later. We're not going to break that even as we evolve nbdkit 67 00:04:49,394 --> 00:04:51,812 and we evolve the API we don't break source or binary compatibility. 68 00:04:52,633 --> 00:04:55,679 If you don't want to write a plugin -- and I'm going to show you in a minute how you can write a plugin simply -- 69 00:04:56,763 --> 00:05:00,242 if you don't want to write a plugin, many other plugins are available. 70 00:05:00,777 --> 00:05:03,369 I've listed the ones which are in nbdkit 1.10 on here. 71 00:05:05,326 --> 00:05:07,422 Some of these plugins aren't quite like the others. 72 00:05:09,290 --> 00:05:12,873 These are plugins like Perl, Python and they are gateways to writing plugins 73 00:05:13,249 --> 00:05:15,526 in non-C languages. 74 00:05:15,887 --> 00:05:17,396 So you can write plugins in scripting languages, 75 00:05:17,849 --> 00:05:21,414 even in shell script, if you're not very happy with writing C plugins. 76 00:05:23,963 --> 00:05:26,239 The other concept that nbdkit has is "filters". 77 00:05:26,648 --> 00:05:29,045 You can think of a plugin as like a data source. 78 00:05:29,527 --> 00:05:30,753 It's like a source of disk images. 79 00:05:31,522 --> 00:05:35,277 But filters apply modifications or changes to 80 00:05:36,279 --> 00:05:37,769 that data source. 81 00:05:38,355 --> 00:05:43,454 An example here is the partition [filter]. If your source is a whole disk image, a partitioned disk image, 82 00:05:43,476 --> 00:05:46,914 but you only want to serve one of the partitions over NBD, 83 00:05:47,196 --> 00:05:49,112 you can apply the partition filter, 84 00:05:49,628 --> 00:05:50,675 which selects a partition. 
85 00:05:51,157 --> 00:05:55,388 Each running nbdkit instance must have exactly one plugin running in it. 86 00:05:56,682 --> 00:05:58,885 But it can have zero or any number of filters. 87 00:05:59,196 --> 00:06:01,297 In this case I've selected the "file" plugin. 88 00:06:01,973 --> 00:06:03,353 So my source is a local file. 89 00:06:03,773 --> 00:06:07,313 But as it's a compressed file I'm going to apply the "xz" filter on top 90 00:06:07,956 --> 00:06:09,193 to transparently uncompress it. 91 00:06:09,714 --> 00:06:11,516 And then I'm going to apply the "partition" filter 92 00:06:12,086 --> 00:06:13,052 to select a partition. 93 00:06:13,406 --> 00:06:15,680 And then I'm going to apply the "cow" filter 94 00:06:16,200 --> 00:06:19,089 because I want to make a writable overlay which I can write out to a qcow2 file later. 95 00:06:20,198 --> 00:06:24,173 This is how you would express all that on the nbdkit command line. 96 00:06:26,613 --> 00:06:28,445 So you put "nbdkit", the name of the program. 97 00:06:28,763 --> 00:06:32,396 The list of the filters. You might think of these filters as being in reverse order. 98 00:06:32,960 --> 00:06:37,842 They're in reverse order from the distance they are from the plugin. 99 00:06:38,594 --> 00:06:42,556 Or another way to think about it is: When an NBD request comes into the server, 100 00:06:42,832 --> 00:06:44,515 it travels through the filters in this order. 101 00:06:45,849 --> 00:06:48,929 At the bottom I've got the name of the plugin. 102 00:06:49,357 --> 00:06:53,888 And then any parameters that the plugin needs. Obviously the file plugin needs to know which file you want to serve. 103 00:06:54,169 --> 00:06:56,315 So I give it the disk name ... the file name. 104 00:06:56,775 --> 00:07:01,678 And filters may also require parameters as well. 
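The command line described here can be sketched roughly as follows. This is a hedged reconstruction, not the exact slide: the image file name is a placeholder, and the function only prints the command so the sketch can be read and run without nbdkit installed. The filter names (cow, partition, xz), the file plugin and the partition parameter are the ones named in the talk.

```shell
#!/bin/sh
# Sketch: print the nbdkit command line for file + xz + partition + cow.
# The image name below is a placeholder, not the file used in the talk.
image="fedora-cloud.raw.xz"

nbdkit_cmdline() {
    # Filters are listed in the order a request passes through them:
    # cow first, then partition, then xz, and finally the file plugin.
    echo nbdkit \
        --filter=cow --filter=partition --filter=xz \
        file "file=$image" \
        partition=1
}

nbdkit_cmdline
```

Dropping the `echo` would start the server for real, assuming nbdkit and those filters are installed and the image exists.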
105 00:07:01,992 --> 00:07:05,563 In this case the partition filter wants to know which partition you want to serve, 106 00:07:06,637 --> 00:07:07,931 so you have to give that as a parameter. 107 00:07:10,687 --> 00:07:15,006 Now ... I wanted to demonstrate actually writing a plugin, 108 00:07:15,446 --> 00:07:17,926 and I want to do it very quickly so I don't bore you, 109 00:07:18,525 --> 00:07:23,237 and I was trying to think what could I do to demonstrate 110 00:07:23,522 --> 00:07:24,996 writing a plugin? 111 00:07:25,270 --> 00:07:26,963 I thought I'd write a test device, 112 00:07:27,284 --> 00:07:29,116 so I'm going to write a Linux kernel device, 113 00:07:29,487 --> 00:07:30,978 to test the "badblocks" command. 114 00:07:31,952 --> 00:07:33,468 It's quite a young audience here, 115 00:07:33,827 --> 00:07:36,741 and we haven't used the badblocks command for a really long time, 116 00:07:37,250 --> 00:07:40,112 perhaps since we've had IDE disks in the early 90s. 117 00:07:40,383 --> 00:07:47,450 But before then old grey-haired people will remember RLL and MFM disks? 118 00:07:48,921 --> 00:07:52,390 Everyone's looking a bit confused. Floppy disks? 119 00:07:54,200 --> 00:07:57,710 In those systems, when there was an error on the surface of the disk 120 00:07:58,347 --> 00:08:02,105 that would appear at the filesystem layer. 121 00:08:02,586 --> 00:08:03,909 So you had to run the badblocks command first, 122 00:08:05,073 --> 00:08:07,874 to find these bad sectors, 123 00:08:08,343 --> 00:08:12,059 and it would produce a list of blocks which were bad, 124 00:08:12,292 --> 00:08:15,877 and it would pass that over to mkfs, and mkfs could work around this by, ... 125 00:08:19,372 --> 00:08:20,494 So that's the badblocks command. 126 00:08:22,076 --> 00:08:24,370 And this is the device I'm going to write to test that. 127 00:08:24,675 --> 00:08:26,152 It's going to be a big virtual device. 
128 00:08:26,405 --> 00:08:27,863 It's going to have a bad sector somewhere in it. 129 00:08:29,225 --> 00:08:32,368 And the idea is whenever the kernel requests, 130 00:08:32,687 --> 00:08:33,539 or reads from, 131 00:08:34,242 --> 00:08:37,099 that bad sector, so whenever my request contains the bad sector, 132 00:08:37,560 --> 00:08:38,667 it's going to return an error. 133 00:08:39,023 --> 00:08:41,686 But any other place in the disk that it tries to read, 134 00:08:42,234 --> 00:08:43,825 it's just going to return some data. 135 00:08:44,446 --> 00:08:47,491 So a nice and simple demo, let's write that now. 136 00:08:48,039 --> 00:08:51,751 What's a good language for writing Linux kernel block devices in? 137 00:08:54,016 --> 00:08:54,631 bash 138 00:08:55,678 --> 00:08:56,765 yup, bash 139 00:08:58,257 --> 00:09:04,033 The first thing nbdkit is going to do is it's going to send me a request for the size of the disk. 140 00:09:04,352 --> 00:09:10,145 So I'm just going to return any size -- it doesn't matter -- 64 MB is fine. 141 00:09:11,823 --> 00:09:14,805 And then nbdkit will send me a request any time there's a read. 142 00:09:15,386 --> 00:09:18,195 The request is called "pread". 143 00:09:19,362 --> 00:09:23,563 And the parameters for that: $1 is the literal string "pread". 144 00:09:24,789 --> 00:09:26,964 $2 is a handle, which we're not using here. 145 00:09:27,397 --> 00:09:30,174 $3 is the size in bytes. 146 00:09:32,962 --> 00:09:37,042 And $4 is the offset in bytes of the request. 147 00:09:39,983 --> 00:09:45,463 Error case. The error case is if my request contains the bad sector or bad byte. 148 00:09:46,103 --> 00:09:47,811 So I'm going to put the bad byte at 100,000. 149 00:09:48,996 --> 00:09:54,905 So if my offset is less than the bad byte and the offset + size is bigger than the bad byte, 150 00:09:55,307 --> 00:09:57,109 that means that the bad byte is in the request. 151 00:09:58,277 --> 00:09:58,900 Agreed? 
152 00:09:59,440 --> 00:10:02,230 Has anyone done pair programming where you have people looking over your shoulder? 153 00:10:02,952 --> 00:10:05,239 Have you done pair programming with 100 people looking over your shoulder? 154 00:10:06,584 --> 00:10:10,704 My offset less than the bad byte at 100,000, 155 00:10:12,846 --> 00:10:13,551 and ... 156 00:10:14,796 --> 00:10:18,945 and the size, sorry the offset + the size, 157 00:10:19,687 --> 00:10:24,945 offset is $4 plus the size [$3] if that's greater than the 158 00:10:25,434 --> 00:10:26,384 bad byte, 159 00:10:28,004 --> 00:10:28,631 (I hope I've got the right number of 000's there) 160 00:10:29,371 --> 00:10:30,724 so this is my error case ... 161 00:10:31,391 --> 00:10:35,150 So I just have to echo the error number that I want. 162 00:10:35,530 --> 00:10:36,351 EIO 163 00:10:37,156 --> 00:10:38,505 and just something that goes into syslog. 164 00:10:38,876 --> 00:10:39,886 And I have to send that to stderr. 165 00:10:40,627 --> 00:10:42,880 And I have to exit with an error code. 166 00:10:43,683 --> 00:10:45,346 So that's the error case. 167 00:10:45,870 --> 00:10:47,881 The other case is where I'm reading somewhere else in the disk. 168 00:10:49,239 --> 00:10:54,624 I have to return a block of size bytes back to nbdkit. 169 00:10:54,958 --> 00:10:57,322 I'm going to return just zeroes, it doesn't matter what I return. 170 00:10:58,247 --> 00:10:59,587 So if I use "dd", 171 00:11:00,447 --> 00:11:02,663 from /dev/zero, 172 00:11:05,865 --> 00:11:10,166 and I want to return exactly $3 [size] bytes 173 00:11:10,531 --> 00:11:14,985 So if I set the block size [bs] to $3, 174 00:11:16,122 --> 00:11:19,268 and I set the count to 1, that should return that number of zero [bytes], right? 175 00:11:21,026 --> 00:11:28,674 So that's my complete bash script Linux block device. 176 00:11:29,874 --> 00:11:32,526 So to [test] that I'm just going to run it. 
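The script built up in this part of the talk can be sketched as below. It is a reconstruction from the spoken description, not the code shown on screen. Three details are assumptions of this sketch: the logic is wrapped in a function called `badsector` (a made-up name) so it can be exercised without nbdkit, the comparison uses "less than or equal" so that a request starting exactly on the bad byte also fails (the talk says "less than"), and dd's statistics are sent to /dev/null.

```shell
#!/bin/sh
# Sketch of the bad-sector test device as an nbdkit "sh" plugin script.
# nbdkit runs the script with the method name in $1.

bad=100000        # offset of the bad byte, as chosen in the talk

badsector() {
    case "$1" in
        get_size)
            # Virtual size of the disk.
            echo 64M
            ;;
        pread)
            # $2 = handle (unused), $3 = size in bytes, $4 = offset in bytes
            if [ "$4" -le "$bad" ] && [ $(( $4 + $3 )) -gt "$bad" ]; then
                # Error name plus a message on stderr, then a failing status.
                echo "EIO bad block" >&2
                return 1
            fi
            # Otherwise return exactly $3 zero bytes on stdout.
            dd if=/dev/zero bs="$3" count=1 2>/dev/null
            ;;
        *)
            # Status 2 tells the sh plugin that this method is not implemented.
            return 2
            ;;
    esac
}

# Dispatch only when invoked with a method name, as nbdkit does.
if [ $# -gt 0 ]; then
    badsector "$@"
    exit $?
fi
```

Saved as an executable file (say `badsector.sh`, a hypothetical name), it would be served with `nbdkit sh ./badsector.sh`, which is the invocation the talk goes on to describe.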
177 00:11:33,164 --> 00:11:38,647 So run "nbdkit", name of the plugin which is "sh", and the "sh" plugin needs the name of the script which I've just written. 178 00:11:40,471 --> 00:11:42,032 The moment of truth here. 179 00:11:43,010 --> 00:11:46,319 If I associate that ... so the good thing there is the size, 180 00:11:46,761 --> 00:11:47,512 64 megabytes, 181 00:11:48,033 --> 00:11:51,589 remember we set the size to 64 MB? 182 00:11:51,991 --> 00:11:53,270 So that's good. 183 00:11:54,024 --> 00:11:56,035 And now if I run "badblocks". 184 00:11:56,552 --> 00:11:58,686 That worked. 185 00:11:59,454 --> 00:12:03,192 Now you might say why has that printed out 4 numbers where there was only 1 bad block? 186 00:12:03,384 --> 00:12:07,476 And the reason for this is the badblocks command reads the disk in 4K chunks, 187 00:12:08,691 --> 00:12:10,623 and when it hits a bad 4K chunk, 188 00:12:11,941 --> 00:12:13,819 it wants the output to be in 1K chunks, 189 00:12:14,047 --> 00:12:15,668 it says there must be 4 bad [1K] blocks. 190 00:12:15,889 --> 00:12:18,163 It doesn't go in any deeper, 191 00:12:18,639 --> 00:12:20,483 and try to work out which of the blocks is bad, 192 00:12:20,753 --> 00:12:22,324 it's just the way badblocks works. 193 00:12:22,633 --> 00:12:25,704 So it's good that badblocks, we've proven here, doesn't have any bugs, 194 00:12:26,361 --> 00:12:28,972 even though nobody's used it since 1992. 195 00:12:38,248 --> 00:12:39,835 You don't have to write plugins. 196 00:12:40,187 --> 00:12:41,562 You can use existing plugins. 197 00:12:41,912 --> 00:12:45,727 We've got loads of them, I don't know what to demonstrate, so I'm going to demonstrate 2 at random. 198 00:12:46,316 --> 00:12:48,913 The "floppy" plugin, and the "memory" plugin which is a RAM disk. 199 00:12:49,786 --> 00:12:51,063 The "floppy" plugin first. 200 00:12:51,505 --> 00:12:52,110 Real simple. 
201 00:12:53,202 --> 00:12:59,445 "nbdkit", name of the plugin which is "floppy", any directory. This happens to be the directory of the source code of nbdkit. 202 00:13:02,279 --> 00:13:05,503 And same old nbd-client command to associate that with a loop device. 203 00:13:06,587 --> 00:13:08,967 And it should pop up in a second. There it goes. 204 00:13:10,536 --> 00:13:11,079 That popped up. 205 00:13:11,519 --> 00:13:16,070 That is a floppy disk image, 206 00:13:16,909 --> 00:13:19,123 which contains the source of nbdkit. 207 00:13:21,077 --> 00:13:22,669 And what exactly happened there? 208 00:13:22,991 --> 00:13:25,787 I took a filesystem from my host, 209 00:13:26,079 --> 00:13:31,114 I turned it into a floppy disk, a FAT formatted, MBR partitioned disk image, 210 00:13:31,884 --> 00:13:33,523 and then I loop mounted it on my host again. 211 00:13:34,603 --> 00:13:35,318 Why is that useful? 212 00:13:36,966 --> 00:13:43,348 One thing you can genuinely use this for is to export disk images really easily to virtual machines. 213 00:13:44,161 --> 00:13:45,179 Or to containers, 214 00:13:46,126 --> 00:13:49,476 some container systems let you [import] a disk image which gets loop mounted inside the container. 215 00:13:50,205 --> 00:13:53,905 When this is really useful is actually not in the loop mounting case. 216 00:13:54,344 --> 00:13:58,653 That's when you're creating a PXE client, [correction:] PXE server, 217 00:13:59,441 --> 00:14:01,711 and your PXE client needs to be given a root filesystem, 218 00:14:02,445 --> 00:14:05,817 and the traditional way you do that is you create a massive initramfs, 219 00:14:06,168 --> 00:14:09,204 that you TFTP over to the client at boot time. 220 00:14:09,511 --> 00:14:12,478 Which is slow, TFTP is unreliable, it's not encrypted, etc. 
221 00:14:12,927 --> 00:14:15,833 NBD is encrypted and authenticated, 222 00:14:16,274 --> 00:14:19,105 it's a super-efficient protocol, 223 00:14:19,557 --> 00:14:23,107 and it's a much better protocol [to use] 224 00:14:23,433 --> 00:14:25,706 because it only fetches the bits it needs to read, 225 00:14:26,032 --> 00:14:29,729 so you can have a much bigger root filesystem. So it's a useful thing to do. 226 00:14:30,827 --> 00:14:33,986 My next demonstration is ... 227 00:14:34,943 --> 00:14:36,704 a RAM disk. 228 00:14:38,945 --> 00:14:43,206 Linux of course has a RAM disk driver inside [the kernel]. 229 00:14:43,793 --> 00:14:47,319 It is however much more convenient to be able to 230 00:14:47,872 --> 00:14:50,387 write RAM disks in userspace. 231 00:14:50,873 --> 00:14:53,357 We've written a simple RAM disk called the "memory" plugin, 232 00:14:54,539 --> 00:14:58,594 which is implemented using a sparse array, 233 00:14:59,279 --> 00:15:02,917 so it's not limited by the size of RAM, 234 00:15:03,824 --> 00:15:04,763 (in virtual size) 235 00:15:05,013 --> 00:15:07,278 you can actually create really massive disks. 236 00:15:07,597 --> 00:15:11,463 In this case I'm going to create the most massive disk that you can create. 237 00:15:12,070 --> 00:15:14,589 The biggest that Linux supports, 238 00:15:15,120 --> 00:15:18,517 until we eventually move to 128 bit block sizes, 239 00:15:18,984 --> 00:15:24,440 this is 2^63-1, it's the largest signed 64 bit integer you can have. 240 00:15:25,030 --> 00:15:28,544 How big is that -- 2^63 - 1 -- in terms of disks? 241 00:15:28,799 --> 00:15:33,327 I went on to Amazon to try to work out how much it would cost you to buy that many disks. 242 00:15:34,264 --> 00:15:38,553 And it turns out that's €300 million. 
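To get a feel for the number 2^63 - 1, here is a small arithmetic sketch. It assumes the shell does 64-bit signed arithmetic, which bash and dash do on typical 64-bit Linux systems; the variable names are made up for illustration.

```shell
#!/bin/sh
# Arithmetic on the largest disk size mentioned in the talk: 2^63 - 1 bytes.
size=9223372036854775807          # 2^63 - 1
eib=1152921504606846976           # 2^60 bytes, one exbibyte

whole_eib=$(( size / eib ))       # complete EiB that fit
full_sectors=$(( size / 512 ))    # complete 512-byte sectors
tail_bytes=$(( size % 512 ))      # bytes left in the final partial sector

echo "whole EiB:    $whole_eib"
echo "full sectors: $full_sectors"
echo "tail bytes:   $tail_bytes"
```

The size is one byte short of 8 EiB, so only 7 whole EiB fit, and the final 512-byte sector is incomplete. A RAM disk of this size would be requested with something like `nbdkit memory size=9223372036854775807`.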
243 00:15:38,908 --> 00:15:42,315 I was very disappointed that Amazon doesn't let you create an order 244 00:15:42,913 --> 00:15:46,185 for €300 million that I could have screenshotted here 245 00:15:46,711 --> 00:15:49,829 the field isn't big enough, it doesn't let you do that, 246 00:15:50,225 --> 00:15:51,055 you can't do that. 247 00:15:51,678 --> 00:15:53,386 But anyway it's €300 million on Amazon, 248 00:15:53,788 --> 00:15:56,724 but we can create it here much more cheaply. 249 00:15:57,269 --> 00:15:59,785 Associate it with a loop device. 250 00:16:00,313 --> 00:16:02,065 You can see the size there is just massive. 251 00:16:04,158 --> 00:16:07,950 I'm going to use GPT for partitioning, because MBR is limited to 2 TB. 252 00:16:08,892 --> 00:16:10,347 All defaults. 253 00:16:11,640 --> 00:16:13,276 That's what the partition size looks like. 254 00:16:14,055 --> 00:16:17,322 8 exabytes [EB]. It's actually 8 EB minus 1 byte. 255 00:16:19,143 --> 00:16:21,509 I'm going to use btrfs, but what are my other choices? 256 00:16:21,889 --> 00:16:23,823 I could have used ext4 ... could I? 257 00:16:24,758 --> 00:16:26,591 What's the limit on filesystem [size] on ext4? 258 00:16:27,794 --> 00:16:29,248 Nobody knows. It's 1 EB. 259 00:16:29,667 --> 00:16:31,113 We'd have 7 EB wasted. 260 00:16:31,866 --> 00:16:35,943 XFS is possible, but XFS has quite a high metadata overhead. 261 00:16:36,271 --> 00:16:37,470 Actually that's unfair on XFS. 262 00:16:37,705 --> 00:16:39,789 XFS has a really nice, low metadata overhead, 263 00:16:40,037 --> 00:16:41,224 but it's about 1%. 264 00:16:41,543 --> 00:16:45,303 1% of 8 EB is too big for my laptop. 265 00:16:47,090 --> 00:16:49,446 So I'm going to use btrfs. 266 00:16:49,799 --> 00:16:51,785 You can see there, btrfs is an absolute champ. 267 00:16:52,353 --> 00:16:55,012 It totally just creates an 8 EB filesystem. 268 00:16:56,551 --> 00:16:58,223 And I can mount it. 269 00:16:59,671 --> 00:17:02,237 And I can ... 
we've got 8 EB ... 270 00:17:03,714 --> 00:17:09,202 I'm just going to "chown" this so I can go in there and show you. 271 00:17:09,906 --> 00:17:13,991 I played around with this [before]. 272 00:17:17,588 --> 00:17:19,140 I missed that question I'm afraid. 273 00:17:20,289 --> 00:17:23,553 (inaudible question) 274 00:17:24,403 --> 00:17:28,560 The question was: How many bugs in anything 275 00:17:28,862 --> 00:17:33,787 do you hit when you try to use the very last block which is only 511 bytes long? 276 00:17:34,587 --> 00:17:36,412 The answer is you definitely hit bugs in qemu. 277 00:17:36,725 --> 00:17:37,994 Qemu can't handle that case. 278 00:17:39,885 --> 00:17:43,627 You can create btrfs subvolume[s]. 279 00:17:47,209 --> 00:17:51,305 What is it? "btrfs filesystem df", I think? 280 00:17:51,823 --> 00:17:54,838 And it just works great. 281 00:17:55,263 --> 00:17:58,751 And the next thing is when I click to the next slide ... that's gone. 282 00:17:59,317 --> 00:18:02,607 This software I'm using will kill nbdkit, 283 00:18:02,914 --> 00:18:04,489 everything's destroyed and it goes away. 284 00:18:04,886 --> 00:18:06,944 So it's great for testing. 285 00:18:07,703 --> 00:18:11,369 Other things that are useful for testing. There are some plugins there which are very useful for testing. 286 00:18:12,041 --> 00:18:13,746 And some filters I'm going to talk about now. 287 00:18:14,642 --> 00:18:17,924 Which are super-useful if you're testing filesystems or the limits of filesystems. 288 00:18:20,431 --> 00:18:23,866 The first filter I'm going to talk about which is useful for testing is the "delay" filter. 289 00:18:25,124 --> 00:18:26,520 You can inject delays ... 290 00:18:26,870 --> 00:18:30,636 into the nbdkit request. You can specify the number of seconds, 291 00:18:31,036 --> 00:18:32,254 or the number of milliseconds. 292 00:18:32,918 --> 00:18:35,668 This is useful if you were testing, say, a distributed filesystem. 
293 00:18:36,503 --> 00:18:38,005 You want to test it all on one machine, 294 00:18:38,351 --> 00:18:41,734 but you want to simulate the effects of having a really remote node, 295 00:18:41,999 --> 00:18:43,956 that has a long delay, 296 00:18:44,294 --> 00:18:46,458 you just inject delays into that device 297 00:18:46,935 --> 00:18:47,873 to simulate that. 298 00:18:48,307 --> 00:18:50,264 So it's a very simple filter. 299 00:18:51,356 --> 00:18:53,046 This filter's also a lot of fun. 300 00:18:53,352 --> 00:18:55,637 It's the "error" filter, and it injects errors. 301 00:18:56,051 --> 00:18:57,748 Obvious use for testing here. 302 00:18:58,291 --> 00:18:59,503 There's two ways to use this. 303 00:18:59,736 --> 00:19:04,425 The first way is we want this particular error [EIO] and we want a generalized error rate of 10%. 304 00:19:05,164 --> 00:19:07,202 This means that at random 10% of [requests] are going to fail. 305 00:19:07,887 --> 00:19:10,948 However I think the second way of doing this is more useful for 306 00:19:11,562 --> 00:19:12,594 most people. 307 00:19:13,102 --> 00:19:15,205 Here we're saying the error rate is 100% so 308 00:19:15,559 --> 00:19:19,043 100% of requests are going to fail reliably all the time. 309 00:19:19,565 --> 00:19:20,945 However it's gated ... 310 00:19:22,033 --> 00:19:23,945 on the error file. Now what that means is 311 00:19:24,266 --> 00:19:25,836 if that error file doesn't exist, 312 00:19:26,150 --> 00:19:28,448 or you delete it, no errors are injected, 313 00:19:28,836 --> 00:19:30,225 the error filter is turned off. 314 00:19:30,827 --> 00:19:32,923 When you create that file, 315 00:19:33,241 --> 00:19:35,444 the error filter is turned on. This is while nbdkit is running, 316 00:19:36,090 --> 00:19:37,948 so it's checking that error file [on every request]. 
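The two ways of using the error filter can be sketched as command lines. This is an illustration, not the slide contents: the memory plugin, its size and the trigger file path are placeholders, and the functions only print the commands so the sketch runs without nbdkit installed. The parameter names `error`, `error-rate` and `error-file` are the error filter's own.

```shell
#!/bin/sh
# Sketch: the two ways of using the error filter described in the talk.
# The memory plugin, its size and the trigger path are placeholders.
trigger=/tmp/inject-errors

random_errors_cmd() {
    # EIO on roughly 10% of requests, chosen at random.
    echo nbdkit --filter=error memory size=64M error=EIO error-rate=10%
}

gated_errors_cmd() {
    # Every request fails, but only while the trigger file exists.
    echo nbdkit --filter=error memory size=64M \
        error=EIO error-rate=100% "error-file=$trigger"
}

random_errors_cmd
gated_errors_cmd

# While nbdkit is running, injection would be toggled like this:
#   touch /tmp/inject-errors    # errors on
#   rm /tmp/inject-errors       # errors off
```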
317 00:19:38,602 --> 00:19:42,241 And that's super-useful for testing because obviously you can inject errors 318 00:19:42,600 --> 00:19:44,477 when you want them to be injected 319 00:19:44,795 --> 00:19:50,472 and then turn off error injection and see if your filesystem recovers or whatever it's supposed to do. 320 00:19:51,755 --> 00:19:56,256 And the third filter which is a very simple filter but also useful is the 321 00:19:56,670 --> 00:19:59,031 "log" filter. You give it the name of a log file 322 00:19:59,345 --> 00:20:00,719 and it writes the log file in that format. 323 00:20:02,031 --> 00:20:04,450 In the next demonstration I'm going to show you 324 00:20:05,035 --> 00:20:08,085 we're going to have some graphical visualization of what happens 325 00:20:08,424 --> 00:20:10,797 inside filesystems when you do things like creating filesystems. 326 00:20:11,280 --> 00:20:14,835 It's important to note that nbdkit is not a graphical tool. 327 00:20:15,863 --> 00:20:19,172 Nbdkit knows nothing about graphics or anything like that. 328 00:20:19,433 --> 00:20:21,624 What's actually happening here is we're using the log filter, 329 00:20:22,036 --> 00:20:23,263 we're writing a log file, 330 00:20:23,784 --> 00:20:26,910 we've got a second graphical program, a program I wrote for this talk, 331 00:20:27,486 --> 00:20:29,118 which is tailing that log file and then 332 00:20:29,341 --> 00:20:31,992 creating the visualizations which you'll see. 333 00:20:32,465 --> 00:20:36,034 So nbdkit is not a graphical program, it's just a command line tool / server. 334 00:20:37,729 --> 00:20:41,731 Let's have a look at what it looks like to create a filesystem. 335 00:20:42,764 --> 00:20:44,506 Slightly long nbdkit command line here, 336 00:20:44,841 --> 00:20:46,882 but hopefully you should be able to understand what's going on. 337 00:20:47,643 --> 00:20:50,841 We're creating ... the "memory" plugin, so we're creating a RAM disk. 
338 00:20:51,256 --> 00:20:52,179 64 megabytes 339 00:20:52,815 --> 00:20:57,224 We're using that log [filter] to create the log file which we're going to tail with a second process. 340 00:20:57,967 --> 00:21:01,944 And we're inserting delays. Now the delays are just so it slows it down a little bit, 341 00:21:02,430 --> 00:21:03,911 to make it a little bit easier to see. 342 00:21:04,345 --> 00:21:05,792 Otherwise everything goes past too quickly. 343 00:21:06,719 --> 00:21:08,431 So I'll run nbdkit. 344 00:21:08,924 --> 00:21:12,326 And this is my second program which is going to visualize things. 345 00:21:14,916 --> 00:21:19,590 Same old command to associate the nbdkit instance 346 00:21:20,141 --> 00:21:21,230 with a loop device. 347 00:21:22,426 --> 00:21:24,394 Now hopefully you could see that. 348 00:21:24,710 --> 00:21:26,103 Little black flashes going on. 349 00:21:26,357 --> 00:21:27,463 Those are reads. 350 00:21:27,796 --> 00:21:30,224 What's happening there is because we've created a Linux kernel block device, 351 00:21:30,881 --> 00:21:34,909 the kernel, udev, are looking at that and saying "is there an LVM PV [physical volume] there?" 352 00:21:35,226 --> 00:21:39,043 "Is there a filesystem there? Is there a partition there I should know about?" 353 00:21:39,320 --> 00:21:42,239 It's a RAM disk so it's empty, but it has to check. 354 00:21:44,417 --> 00:21:46,272 Now let's partition it. I'm going to use GPT. 355 00:21:48,104 --> 00:21:49,625 All defaults. 356 00:21:52,796 --> 00:22:00,081 GPT works by creating a partition table at the beginning of the disk and a secondary or backup PT at the end of the disk. 357 00:22:00,951 --> 00:22:02,846 Those are represented in red. Those are writes. 358 00:22:03,236 --> 00:22:08,274 And you probably also saw little black flashes there. We've created another Linux kernel device, 359 00:22:08,760 --> 00:22:11,085 and again udev has to check it. 
360 00:22:13,479 --> 00:22:15,406 Let's create a filesystem in there. 361 00:22:18,242 --> 00:22:21,080 The big thing that happens there is this lump of blue that happens at the beginning. [Not shown because of a technical problem in the video] 362 00:22:21,609 --> 00:22:24,673 Blue in this diagram represents discards. 363 00:22:25,981 --> 00:22:31,319 Modern mkfs always issues a big discard or trim over the entire partition. 364 00:22:32,202 --> 00:22:35,169 The reason for that is it makes SSDs work more efficiently. 365 00:22:36,674 --> 00:22:42,365 Other notable features: The red bar here is some kind of metadata. 366 00:22:42,791 --> 00:22:45,199 I'm in a Storage [Track] room full of filesystem experts, 367 00:22:45,810 --> 00:22:50,618 so hopefully you know better than I do what's going on here. But that's probably an inode table. 368 00:22:51,526 --> 00:22:55,435 Big lump of red here. Could be the journal, maybe? 369 00:22:56,623 --> 00:22:59,704 Little red dots. I think those are backup superblocks. 370 00:23:00,039 --> 00:23:03,037 If you notice there are 4 red dots and 4 backup superblocks. 371 00:23:04,472 --> 00:23:05,822 Let's mount that. 372 00:23:08,218 --> 00:23:10,989 I'm not touching the laptop here but something funny happens in a second. 373 00:23:14,445 --> 00:23:15,241 Here it goes. 374 00:23:15,826 --> 00:23:17,555 Do you see it's writing? 375 00:23:17,928 --> 00:23:19,467 We've just mounted the disk but it's writing to it. 376 00:23:19,797 --> 00:23:25,023 This is lazy blockgroup initialization, it's another feature of modern filesystems. 377 00:23:26,145 --> 00:23:34,087 Because disks are really big these days, and writing to them (relative to the size of the disk) is really slow, 378 00:23:35,265 --> 00:23:39,306 so you wouldn't want your mkfs to sit there, 379 00:23:39,669 --> 00:23:42,998 for hours on end writing all the blockgroup metadata. 380 00:23:44,162 --> 00:23:47,824 And in any case why would you do that? 
Because you can't 381 00:23:48,189 --> 00:23:51,285 use all of those blockgroups for writing because writing is so slow 382 00:23:51,797 --> 00:23:53,037 compared to the size of the disks. 383 00:23:53,926 --> 00:23:57,944 So it makes so much more sense for filesystems to defer all this [work] to the kernel, 384 00:23:58,349 --> 00:24:04,108 so when the disk is mounted the kernel sees there are uninitialized block groups 385 00:24:04,748 --> 00:24:05,961 blockgroup metadata 386 00:24:06,240 --> 00:24:08,206 and it creates that in the background. 387 00:24:08,545 --> 00:24:11,197 It doesn't matter anyway because you can't write to those new blockgroups 388 00:24:11,530 --> 00:24:14,679 faster than they're being created 389 00:24:14,942 --> 00:24:16,210 so it's fine. 390 00:24:17,039 --> 00:24:18,390 So let's mount this. 391 00:24:19,030 --> 00:24:25,626 It's mounted so I'm going to chown it to make it convenient for me to put some files on there. 392 00:24:26,385 --> 00:24:29,104 Let's again copy the nbdkit source code. 393 00:24:31,639 --> 00:24:34,197 You see that nothing actually is written until I "sync". 394 00:24:34,783 --> 00:24:36,183 We know this, right? 395 00:24:36,569 --> 00:24:38,493 When you write to a disk, 396 00:24:38,843 --> 00:24:42,706 the writes don't hit the disk immediately, they get stored in memory for a bit 397 00:24:43,080 --> 00:24:44,510 and they get written a few seconds later, 398 00:24:44,913 --> 00:24:46,914 unless you do a "sync" which forces that write. 399 00:24:47,755 --> 00:24:52,231 And of course when you delete that directory, 400 00:24:52,504 --> 00:24:53,945 even when I "sync", 401 00:24:54,675 --> 00:24:56,365 it's not going to [update] that. 402 00:24:57,189 --> 00:25:02,562 You know why this is. 
You know when you delete files it doesn't really delete them, it simply marks them in the block group 403 00:25:02,886 --> 00:25:03,764 as being unused, 404 00:25:04,263 --> 00:25:09,416 and later on those blocks get reused for other files you create. 405 00:25:09,663 --> 00:25:14,833 But there is a command -- for modern filesystems -- we can use 406 00:25:15,152 --> 00:25:18,238 to actually tell it to discard them 407 00:25:18,503 --> 00:25:19,839 and that's the "fstrim" command. 408 00:25:20,199 --> 00:25:25,904 That issues a discard request to the filesystem. 409 00:25:36,525 --> 00:25:39,881 This is my final demo. 410 00:25:40,183 --> 00:25:42,811 That was a nice one showing a single filesystem, 411 00:25:43,094 --> 00:25:46,844 but I think more interesting is when you run multiple copies of nbdkit 412 00:25:47,790 --> 00:25:48,880 to create multiple devices. 413 00:25:49,790 --> 00:25:53,392 And this is the longest nbdkit command line that you'll probably ever see. 414 00:25:54,960 --> 00:25:56,863 There are only two important changes here. 415 00:25:58,759 --> 00:26:01,844 The first one is ... 416 00:26:02,996 --> 00:26:05,361 Previously I was only running one copy of nbdkit. 417 00:26:06,668 --> 00:26:11,804 So I could have it listen on TCP port 10809 418 00:26:12,187 --> 00:26:13,983 which is the default port for NBD. 419 00:26:15,408 --> 00:26:18,439 However I'm going to be running 5 copies of nbdkit this time, and they can't all be listening on the same port. 420 00:26:19,560 --> 00:26:24,484 So I'm going to be using a Unix domain socket and that's the purpose of the -U option here. 421 00:26:25,547 --> 00:26:28,435 And the second change is I'm using the error filter. 422 00:26:30,042 --> 00:26:32,090 I'm using this in the way we described before 423 00:26:32,410 --> 00:26:34,752 where you set the error rate to 100%, 424 00:26:35,232 --> 00:26:39,347 but we gate this on the presence or absence of an error file. 
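The gating just described can be modelled in a few lines. This is only a toy sketch in Python, not nbdkit's actual error filter: the class name `GatedErrorDisk` and its methods are hypothetical, and it assumes a 100% error rate whenever the trigger file is present.

```python
import errno
import os


class GatedErrorDisk:
    """Toy in-memory disk that fails every request while a trigger file exists.

    Hypothetical illustration only: it mimics the idea of gating a 100%
    error rate on the presence of an "error file".
    """

    def __init__(self, size, error_file):
        self.data = bytearray(size)
        self.error_file = error_file

    def _check(self):
        # The file is looked up on every request, so errors can be
        # switched on and off from outside while the disk is in use.
        if os.path.exists(self.error_file):
            raise OSError(errno.EIO, "injected I/O error")

    def pread(self, count, offset):
        self._check()
        return bytes(self.data[offset:offset + count])

    def pwrite(self, buf, offset):
        self._check()
        self.data[offset:offset + len(buf)] = buf
```

While the trigger file is absent, reads and writes succeed as normal. Creating the file makes every later request fail with EIO, and deleting it turns the errors off again.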
425 00:26:39,749 --> 00:26:42,128 So the error filter is turned off because that error file doesn't exist. 426 00:26:43,226 --> 00:26:44,726 But it gets turned on later on. 427 00:26:45,715 --> 00:26:48,041 I'm going to start 5 copies of nbdkit. 428 00:26:55,323 --> 00:26:57,064 I'll just show you what's going on on the filesystem here. 429 00:26:58,545 --> 00:26:59,869 We've got 5 log files as you'd expect. 430 00:27:00,626 --> 00:27:04,430 Those are going to be tailed by the graphical viewer. 431 00:27:05,278 --> 00:27:09,583 We've got 5 sockets. There are 5 copies of nbdkit hiding behind those Unix domain sockets. 432 00:27:12,624 --> 00:27:14,907 Let me run the graphical viewer. 433 00:27:18,115 --> 00:27:19,266 5 devices this time. 434 00:27:20,586 --> 00:27:21,521 Hopefully not too small. OK. 435 00:27:24,505 --> 00:27:30,328 And now I'm going to associate the 5 nbdkits with the 5 devices. 436 00:27:36,960 --> 00:27:41,071 Now I'm going to create a RAID 5 array. 437 00:27:42,166 --> 00:27:45,606 I'm going to use the first 4 disks as data disks, 438 00:27:46,675 --> 00:27:48,972 and the last disk as a hot spare. 439 00:27:53,549 --> 00:27:54,318 Let's get that going. 440 00:27:57,004 --> 00:28:00,268 You can see what's happening here is it's reading the first 3 disks 441 00:28:00,508 --> 00:28:02,273 and creating a parity disk on the 4th disk. 442 00:28:03,197 --> 00:28:07,377 People who know about RAID will be thinking: "Why's that parity disk not being striped 443 00:28:08,523 --> 00:28:09,170 over all of the data disks?" 444 00:28:09,711 --> 00:28:12,992 The reason is because these disks are so small, they're 64 MB, 445 00:28:13,303 --> 00:28:16,639 the stripe size is actually [larger] than the entire disk. 446 00:28:17,625 --> 00:28:20,524 So there's 1 parity disk and there are 3 data disks. 447 00:28:21,633 --> 00:28:25,258 Let's have a look at the kernel messages which will be interesting in a minute. 
448 00:28:27,644 --> 00:28:31,624 Let's partition that as before. 449 00:28:33,573 --> 00:28:34,663 All defaults. 450 00:28:36,387 --> 00:28:38,564 And we can create a filesystem on there as well. 451 00:28:43,437 --> 00:28:46,203 This looks a lot like it did last time, 452 00:28:46,671 --> 00:28:48,956 except there's no trim. 453 00:28:49,432 --> 00:28:54,270 Now the MD device in the kernel doesn't believe you can send discard requests to devices. 454 00:28:54,623 --> 00:28:56,286 I guess because they've been burned [by faulty hardware] in the past. 455 00:28:57,594 --> 00:29:01,126 There is a way to do this, by setting a kernel command line flag, 456 00:29:01,583 --> 00:29:03,230 which is something weird like 457 00:29:03,474 --> 00:29:07,185 [raid456.devices_handle_discard_safely=Y] 458 00:29:08,107 --> 00:29:11,304 However as this is quite literally my work laptop, 459 00:29:11,759 --> 00:29:14,109 I don't have that on my kernel command line 460 00:29:14,480 --> 00:29:18,830 so it's not issuing discards to the underlying devices. 461 00:29:20,506 --> 00:29:22,025 And I can mount this. 462 00:29:24,876 --> 00:29:30,865 Let's chown it so I can write to it. 463 00:29:32,028 --> 00:29:34,175 Let's create some files in there. 464 00:29:36,869 --> 00:29:41,949 And the interesting thing is what happens when I inject an error into this? 465 00:29:43,343 --> 00:29:46,086 Well you can see what happened there, quite dramatically. 466 00:29:46,515 --> 00:29:50,185 It detected first of all that the error occurred 467 00:29:50,543 --> 00:29:52,285 on the second disk 468 00:29:53,191 --> 00:29:58,514 and the second disk is called "/dev/nbd1" here because I'm starting the disk numbering from "/dev/nbd0". 469 00:29:59,886 --> 00:30:02,192 You can see also it started to do a recovery. 470 00:30:02,492 --> 00:30:06,089 So it started to read from the remaining good disks, 471 00:30:06,553 --> 00:30:09,897 and it created an extra parity disk on the hot spare.
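The recovery step relies on RAID 5 parity being the XOR of the data blocks, so the contents of any one missing disk can be recomputed from the survivors. Here is a minimal sketch in Python, assuming the layout from the demo (3 data disks plus 1 parity disk, no striping). The helper names `xor_blocks` and `rebuild` are hypothetical and this is not how the kernel MD driver is written.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(disks, failed):
    """Recompute the contents of disks[failed] from all the other disks.

    `disks` holds the data blocks followed by the parity block.
    """
    survivors = [d for i, d in enumerate(disks) if i != failed]
    return xor_blocks(survivors)


# Three data disks and one parity disk, as in the demo.
data = [b"nbdk", b"it!!", b"demo"]
parity = xor_blocks(data)
disks = data + [parity]

# Pretend the second disk (index 1, "/dev/nbd1") has failed and
# rebuild what it held; this is what would go onto the hot spare.
recovered = rebuild(disks, failed=1)
```

Every surviving disk has to be read in full to rebuild the lost one, which is why the viewer shows reads across all the remaining good disks during recovery.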
472 00:30:10,903 --> 00:30:13,710 It took a little bit of time. We're injecting delays here. 473 00:30:15,065 --> 00:30:19,824 Although we're injecting delays so it's a bit slower than normal, 474 00:30:20,115 --> 00:30:22,955 you can imagine how it would be if this wasn't a 64 MB disk 475 00:30:23,223 --> 00:30:26,180 but this was 6.4 TB or larger. 476 00:30:27,199 --> 00:30:30,085 Recovery on RAID 5 takes a really long time, 477 00:30:30,507 --> 00:30:34,985 and unfortunately the way that RAID 5 works is that if you then get another disk failing, 478 00:30:36,245 --> 00:30:37,786 at certain points during the recovery, 479 00:30:38,103 --> 00:30:39,429 you can lose all your data. 480 00:30:39,642 --> 00:30:43,324 That's why we don't use RAID 5 in production, 481 00:30:43,649 --> 00:30:45,951 certainly on larger systems these days. 482 00:30:46,983 --> 00:30:49,477 However it's still a good demo. 483 00:30:50,863 --> 00:30:55,985 I should just note that when I clicked the "Error" button there 484 00:30:56,388 --> 00:30:59,914 the graphical tool didn't start injecting errors. 485 00:31:00,182 --> 00:31:02,677 All that happened was the graphical tool created a file called "error2". 486 00:31:03,749 --> 00:31:07,025 And then nbdkit notices that the file exists and starts to inject errors on that disk. 487 00:31:08,892 --> 00:31:11,399 Now although all that dramatic stuff happened in the background, 488 00:31:11,873 --> 00:31:14,435 the actual filesystem is fine. 489 00:31:14,841 --> 00:31:17,999 There are no errors on the filesystem or array level. 490 00:31:18,348 --> 00:31:20,480 The dramatic stuff happened below there. 491 00:31:21,511 --> 00:31:25,443 And of course I can inject more errors on a second disk. 492 00:31:26,307 --> 00:31:28,606 And now we're running in degraded mode. 493 00:31:28,871 --> 00:31:30,865 This is the minimum that this RAID array can support 494 00:31:31,605 --> 00:31:33,431 without actually failing. 
495 00:31:33,703 --> 00:31:36,511 Although there was another error 496 00:31:36,869 --> 00:31:39,634 I'm still just about OK, although if there was another error, 497 00:31:40,688 --> 00:31:42,465 if I clicked another button, two things would happen: 498 00:31:44,147 --> 00:31:47,114 You'd see errors appearing at the filesystem level. 499 00:31:47,676 --> 00:31:50,483 And the second thing that would happen is I'd have to reboot my laptop. 500 00:31:50,943 --> 00:31:53,080 Because you cannot ... and I could not work out how to do this ... 501 00:31:53,561 --> 00:31:57,219 You cannot then unmount a RAID array 502 00:31:57,464 --> 00:31:59,578 that's in that state. It's just impossible 503 00:32:00,002 --> 00:32:02,277 and I've no idea why. There's lots about this on Stack Overflow. 504 00:32:02,929 --> 00:32:05,553 I don't want to reboot my laptop in the middle of the talk 505 00:32:05,914 --> 00:32:09,346 so I'm not going to do that. 506 00:32:09,693 --> 00:32:11,706 It's probably a kernel bug or something, I don't know. 507 00:32:12,792 --> 00:32:16,154 Instead what I'm going to do is umount the filesystem 508 00:32:16,522 --> 00:32:18,343 and stop the RAID array. 509 00:32:18,515 --> 00:32:21,515 [end of subtitles]