1
00:00:06,000 --> 00:00:11,441
Welcome to my talk about loop mounting and going beyond ordinary loop mounting.

2
00:00:11,441 --> 00:00:16,492
Hopefully most of you here know what loop mounting is

3
00:00:16,493 --> 00:00:22,330
but in case you don't I'm just going to give you a quick demo

4
00:00:20,124 --> 00:00:25,600
of what I'm going to call traditional, old fashioned loop mounting.

5
00:00:26,570 --> 00:00:30,279
I've got a file on my laptop here, a Fedora disk image.

6
00:00:30,280 --> 00:00:35,008
It's 6 GB in size.

7
00:00:32,162 --> 00:00:40,444
I use the command "losetup" to associate files with Linux kernel block devices.

8
00:00:40,581 --> 00:00:43,494
I don't have any loops set up at the moment.

9
00:00:43,494 --> 00:00:48,717
But I can associate this file with a Linux kernel block device called /dev/loop0.

10
00:00:50,166 --> 00:00:54,312
And I can look at partitions.

11
00:00:55,621 --> 00:00:56,857
And I can mount them -- that's loop mounting.

12
00:00:59,118 --> 00:01:06,211
That's just ordinary loop mounting, it's not what we're going to be talking about today although conceptually it's similar.

13
00:01:06,555 --> 00:01:09,250
Traditional loop mounting is fine if it's a plain file.

14
00:01:09,930 --> 00:01:11,686
But it falls over very quickly with anything else.

15
00:01:12,286 --> 00:01:15,558
What happens if you've got a compressed disk image?

16
00:01:16,100 --> 00:01:23,080
This isn't a disk image that I've compressed, this is how cloud images are distributed by the Fedora community.

17
00:01:23,160 --> 00:01:24,835
And this is one that I just downloaded.

18
00:01:25,520 --> 00:01:28,914
From the Fedora website in exactly this format. I didn't modify it in any way.

19
00:01:28,918 --> 00:01:30,687
It's XZ-compressed.

20
00:01:31,496 --> 00:01:33,883
And of course you could turn that into a loop device.

21
00:01:34,435 --> 00:01:38,391
And what would happen is you'd end up with a device which contained XZ-compressed data.

22
00:01:39,034 --> 00:01:43,358
And it is my contention today that this is not what you meant by loop mounting this.

23
00:01:44,638 --> 00:01:50,478
What you'd like it to do is transparently uncompress so you can see the data inside.

24
00:01:51,100 --> 00:01:53,119
And we can do that.

25
00:01:53,835 --> 00:01:56,835
Using a command called "nbdkit" which I'm going to talk about in a minute.

26
00:01:56,315 --> 00:01:59,519
And that creates a process.

27
00:02:02,036 --> 00:02:06,794
And now I use a command to associate that process with a loop device.

28
00:02:07,926 --> 00:02:10,755
It's a different command, "nbd-client", but it's conceptually similar.

29
00:02:12,676 --> 00:02:16,997
And you see there the size -- 4096 MB, 4 GB.

30
00:02:17,846 --> 00:02:20,519
So that's not the compressed size, that's the uncompressed size.

31
00:02:21,033 --> 00:02:24,082
What this is doing is it's transparently uncompressing.  It's not uncompressing the whole thing.

32
00:02:25,042 --> 00:02:28,895
It's only uncompressing the blocks that are being used when you request them.

33
00:02:30,396 --> 00:02:33,048
And I can mount that.

34
00:02:34,686 --> 00:02:43,198
And I can look at it, and I can even create files.

35
00:02:46,249 --> 00:02:48,487
So that is what I'm going to be talking about today.

36
00:02:49,192 --> 00:02:51,964
How does this differ from loop mounting?

37
00:02:52,688 --> 00:02:57,519
In both cases we've got a kernel module, on the left hand side "loop.ko".

38
00:02:58,994 --> 00:03:02,714
And it's configured using a command line utility, "losetup".

39
00:03:03,374 --> 00:03:07,003
And you use that to create Linux kernel block devices.

40
00:03:07,568 --> 00:03:08,607
Like /dev/loop0.

41
00:03:09,130 --> 00:03:11,477
On the right hand side I've got a kernel module, "nbd.ko".

42
00:03:12,198 --> 00:03:15,196
It's configured using a command line client "nbd-client".

43
00:03:15,677 --> 00:03:19,377
And it creates loop devices, Linux kernel block devices, "/dev/nbd0" etc.

44
00:03:20,562 --> 00:03:22,886
But the back end as you can see is a little bit different.

45
00:03:23,396 --> 00:03:29,754
On the left hand side the back end is talking over internal Linux kernel APIs like the VFS.

46
00:03:30,435 --> 00:03:33,191
To the file which is associated with the loop device.

47
00:03:33,679 --> 00:03:37,098
On the right hand side we've got a user process running.

48
00:03:37,633 --> 00:03:44,250
This is critical: We've got a user process in this case "nbdkit", but I should say that other NBD servers are available.

49
00:03:45,276 --> 00:03:47,114
Other very very good NBD servers are available.

50
00:03:48,722 --> 00:03:52,523
The kernel is talking to that process using

51
00:03:53,018 --> 00:03:57,717
a TCP port or a Unix domain socket as you require.

52
00:03:59,957 --> 00:04:05,156
I'm going to demonstrate in this talk nbdkit which is an NBD server which I wrote

53
00:04:05,714 --> 00:04:07,564
with a guy called Eric Blake

54
00:04:07,971 --> 00:04:09,838
who's a brilliant free software hacker.

55
00:04:10,601 --> 00:04:14,061
Nbdkit is slightly different from other NBD servers

56
00:04:14,526 --> 00:04:16,903
in that we have a plugin API.  It's a stable API.

57
00:04:17,484 --> 00:04:21,336
Which means you can write a plugin now

58
00:04:21,754 --> 00:04:23,719
or you could have written a plugin back in 2013 when we started the project

59
00:04:24,409 --> 00:04:26,677
and it would still be compilable with nbdkit now

60
00:04:27,135 --> 00:04:29,289
and it will still be compilable in the future.

61
00:04:29,929 --> 00:04:31,881
We're not going to break plugins at the source level.

62
00:04:32,779 --> 00:04:35,326
It also has an ABI guarantee.

63
00:04:35,916 --> 00:04:38,368
That means you can compile your plugin

64
00:04:39,085 --> 00:04:42,004
and you can distribute it separately from nbdkit

65
00:04:42,838 --> 00:04:44,038
as a binary

66
00:04:44,434 --> 00:04:49,037
and load it into nbdkit at some point later.  We're not going to break that even as we evolve nbdkit

67
00:04:49,394 --> 00:04:51,812
and evolve the API: we don't break source or binary compatibility.

68
00:04:52,633 --> 00:04:55,679
If you don't want to write a plugin -- and I'm going to show you in a minute how you can write a plugin simply

69
00:04:56,763 --> 00:05:00,242
if you don't want to write a plugin, many other plugins are available.

70
00:05:00,777 --> 00:05:03,369
I've listed the ones which are in nbdkit 1.10 on here.

71
00:05:05,326 --> 00:05:07,422
Some of these plugins aren't quite like the others.

72
00:05:09,290 --> 00:05:12,873
These are plugins like Perl, Python and they are gateways to writing plugins

73
00:05:13,249 --> 00:05:15,526
in non-C languages.

74
00:05:15,887 --> 00:05:17,396
So you can write plugins in scripting languages,

75
00:05:17,849 --> 00:05:21,414
even in shell script, if you're not very happy with writing C plugins.

76
00:05:23,963 --> 00:05:26,239
The other concept that nbdkit has is "filters".

77
00:05:26,648 --> 00:05:29,045
You can think of a plugin as like a data source.

78
00:05:29,527 --> 00:05:30,753
It's like a source of disk images.

79
00:05:31,522 --> 00:05:35,277
But filters apply modifications or changes to

80
00:05:36,279 --> 00:05:37,769
that data source.

81
00:05:38,355 --> 00:05:43,454
An example here is the partition filter.  If your source is a whole disk image, a partitioned disk image,

82
00:05:43,476 --> 00:05:46,914
but you only want to serve one of the partitions over NBD,

83
00:05:47,196 --> 00:05:49,112
you can apply the partition filter,

84
00:05:49,628 --> 00:05:50,675
which selects a partition.

85
00:05:51,157 --> 00:05:55,388
Each running nbdkit instance must have exactly one plugin running in it.

86
00:05:56,682 --> 00:05:58,885
But it can have zero or any number of filters.

87
00:05:59,196 --> 00:06:01,297
In this case I've selected the "file" plugin.

88
00:06:01,973 --> 00:06:03,353
So my source is a local file.

89
00:06:03,773 --> 00:06:07,313
But as it's a compressed file I'm going to apply the "xz" filter on top

90
00:06:07,956 --> 00:06:09,193
to transparently uncompress it.

91
00:06:09,714 --> 00:06:11,516
And then I'm going to apply the "partition" filter

92
00:06:12,086 --> 00:06:13,052
to select a partition.

93
00:06:13,406 --> 00:06:15,680
And then I'm going to apply the "cow" filter

94
00:06:16,200 --> 00:06:19,089
because I want to make a writable overlay which I can write out to a qcow2 file later.

95
00:06:20,198 --> 00:06:24,173
This is how you would express all that on the nbdkit command line.

96
00:06:26,613 --> 00:06:28,445
So you put "nbdkit", the name of the program.

97
00:06:28,763 --> 00:06:32,396
The list of the filters.  You might think of these filters as being in reverse order.

98
00:06:32,960 --> 00:06:37,842
They're listed in reverse order of their distance from the plugin: the filter furthest from the plugin comes first.

99
00:06:38,594 --> 00:06:42,556
Or another way to think about it is: When an NBD request comes into the server,

100
00:06:42,832 --> 00:06:44,515
it travels through the filters in this order.

101
00:06:45,849 --> 00:06:48,929
At the bottom I've got the name of the plugin.

102
00:06:49,357 --> 00:06:53,888
And then any parameters that the plugin needs.  Obviously the file plugin needs to know which file you want to serve.

103
00:06:54,169 --> 00:06:56,315
So I give it the disk name ... the file name.

104
00:06:56,775 --> 00:07:01,678
And filters may also require parameters as well.

105
00:07:01,992 --> 00:07:05,563
In this case the partition filter wants to know which partition you want to serve,

106
00:07:06,637 --> 00:07:07,931
so you have to give that as a parameter.
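[Sketch, not taken from the slide: one way the stack described above might be spelled out. It only builds and prints the command line rather than running nbdkit. The image name and partition number are placeholders, and the parameter spellings (file=, partition=) are assumptions about the file plugin and partition filter that should be checked against the man pages of your nbdkit version.]
```bash
#!/bin/bash
# Placeholder image name; substitute your own XZ-compressed disk image.
image="fedora-cloud.raw.xz"
# Filters are listed furthest-from-plugin first (cow, partition, xz),
# then the plugin name (file), then the plugin and filter parameters.
cmd=(nbdkit --filter=cow --filter=partition --filter=xz file "file=$image" partition=1)
# Print instead of executing, so this is safe to run anywhere.
echo "${cmd[*]}"
```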

107
00:07:10,687 --> 00:07:15,006
Now ... I wanted to demonstrate actually writing a plugin,

108
00:07:15,446 --> 00:07:17,926
and I want to do it very quickly so I don't bore you,

109
00:07:18,525 --> 00:07:23,237
and I was trying to think what could I do to demonstrate

110
00:07:23,522 --> 00:07:24,996
writing a plugin?

111
00:07:25,270 --> 00:07:26,963
I thought I'd write a test device,

112
00:07:27,284 --> 00:07:29,116
so I'm going to write a Linux kernel device,

113
00:07:29,487 --> 00:07:30,978
to test the "badblocks" command.

114
00:07:31,952 --> 00:07:33,468
It's quite a young audience here,

115
00:07:33,827 --> 00:07:36,741
and we haven't used the badblocks command for a really long time,

116
00:07:37,250 --> 00:07:40,112
perhaps since we've had IDE disks in the early 90s.

117
00:07:40,383 --> 00:07:47,450
But before then old grey-haired people will remember RLL and MFM disks?

118
00:07:48,921 --> 00:07:52,390
Everyone's looking a bit confused.  Floppy disks?

119
00:07:54,200 --> 00:07:57,710
In those systems, when there was an error on the surface of the disk

120
00:07:58,347 --> 00:08:02,105
that would appear at the filesystem layer.

121
00:08:02,586 --> 00:08:03,909
So you had to run the badblocks command first,

122
00:08:05,073 --> 00:08:07,874
to find these bad sectors,

123
00:08:08,343 --> 00:08:12,059
and it would produce a list of blocks which were bad,

124
00:08:12,292 --> 00:08:15,877
and it would pass that over to mkfs, and mkfs could work around this by, ...

125
00:08:19,372 --> 00:08:20,494
So that's the badblocks command.

126
00:08:22,076 --> 00:08:24,370
And this is the device I'm going to write to test that.

127
00:08:24,675 --> 00:08:26,152
It's going to be a big virtual device.

128
00:08:26,405 --> 00:08:27,863
It's going to have a bad sector somewhere in it.

129
00:08:29,225 --> 00:08:32,368
And the idea is whenever the kernel requests,

130
00:08:32,687 --> 00:08:33,539
or reads from,

131
00:08:34,242 --> 00:08:37,099
that bad sector, so whenever my request contains the bad sector,

132
00:08:37,560 --> 00:08:38,667
it's going to return an error.

133
00:08:39,023 --> 00:08:41,686
But any other place in the disk that it tries to read,

134
00:08:42,234 --> 00:08:43,825
it's just going to return some data.

135
00:08:44,446 --> 00:08:47,491
So a nice and simple demo, let's write that now.

136
00:08:48,039 --> 00:08:51,751
What's a good language for writing Linux kernel block devices in?

137
00:08:54,016 --> 00:08:54,631
Bash.

138
00:08:55,678 --> 00:08:56,765
Yup, bash.

139
00:08:58,257 --> 00:09:04,033
The first thing nbdkit is going to do is it's going to send me a request for the size of the disk.

140
00:09:04,352 --> 00:09:10,145
So I'm just going to return any size -- it doesn't matter -- 64 MB is fine.

141
00:09:11,823 --> 00:09:14,805
And then nbdkit will send me a request any time there's a read.

142
00:09:15,386 --> 00:09:18,195
The request is called "pread".

143
00:09:19,362 --> 00:09:23,563
And the parameters for that: $1 is the literal string "pread".

144
00:09:24,789 --> 00:09:26,964
$2 is a handle, which we're not using here.

145
00:09:27,397 --> 00:09:30,174
$3 is the size in bytes.

146
00:09:32,962 --> 00:09:37,042
And $4 is the offset in bytes of the request.

147
00:09:39,983 --> 00:09:45,463
Error case.  The error case is if my request contains the bad sector or bad byte.

148
00:09:46,103 --> 00:09:47,811
So I'm going to put the bad byte at 100,000.

149
00:09:48,996 --> 00:09:54,905
So if my offset is less than the bad byte and the offset + size is bigger than the bad byte,

150
00:09:55,307 --> 00:09:57,109
that means that the bad byte is in the request.

151
00:09:58,277 --> 00:09:58,900
Agreed?

152
00:09:59,440 --> 00:10:02,230
Has anyone done pair programming where you have people looking over your shoulder?

153
00:10:02,952 --> 00:10:05,239
Have you done pair programming with 100 people looking over your shoulder?

154
00:10:06,584 --> 00:10:10,704
My offset less than the bad byte at 100,000,

155
00:10:12,846 --> 00:10:13,551
and ...

156
00:10:14,796 --> 00:10:18,945
and the size, sorry the offset + the size,

157
00:10:19,687 --> 00:10:24,945
offset is $4 plus the size [$3] if that's greater than the

158
00:10:25,434 --> 00:10:26,384
bad byte,

159
00:10:28,004 --> 00:10:28,631
(I hope I've got the right number of zeroes there)

160
00:10:29,371 --> 00:10:30,724
so this is my error case ...

161
00:10:31,391 --> 00:10:35,150
So I just have to echo the error number that I want.

162
00:10:35,530 --> 00:10:36,351
EIO

163
00:10:37,156 --> 00:10:38,505
and just something that goes into syslog.

164
00:10:38,876 --> 00:10:39,886
And I have to send that to stderr.

165
00:10:40,627 --> 00:10:42,880
And I have to exit with an error code.

166
00:10:43,683 --> 00:10:45,346
So that's the error case.

167
00:10:45,870 --> 00:10:47,881
The other case is where I'm reading somewhere else in the disk.

168
00:10:49,239 --> 00:10:54,624
I have to return a block of size bytes back to nbdkit.

169
00:10:54,958 --> 00:10:57,322
I'm going to return just zeroes, it doesn't matter what I return.

170
00:10:58,247 --> 00:10:59,587
So if I use "dd",

171
00:11:00,447 --> 00:11:02,663
from /dev/zero,

172
00:11:05,865 --> 00:11:10,166
and I want to return exactly $3 bytes, the size.

173
00:11:10,531 --> 00:11:14,985
So if I set the block size [bs] to $3,

174
00:11:16,122 --> 00:11:19,268
and I set the count to 1, that should return that number of zero [bytes], right?

175
00:11:21,026 --> 00:11:28,674
So that's my complete bash script Linux block device.
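[Sketch reconstructing the script described above, not a copy of what was typed on screen. BAD_BYTE, main and the error text are illustrative names. It uses "less than or equal" for the lower bound so that a request starting exactly on the bad byte also fails, which is slightly stricter than the "less than" spoken in the talk. Exiting with status 2 for unhandled methods is assumed to be the sh plugin's convention for "not implemented"; check the sh plugin man page for your version.]
```bash
#!/bin/bash
BAD_BYTE=100000   # offset of the single bad byte
main() {
    case "$1" in
        get_size)
            # Size of the virtual disk.
            echo 64M
            ;;
        pread)
            # $2 is the handle (unused), $3 the size, $4 the offset.
            local size=$3 offset=$4
            if [ "$offset" -le "$BAD_BYTE" ] && [ $((offset + size)) -gt "$BAD_BYTE" ]; then
                # Error name plus a message on stderr, then a failing exit code.
                echo "EIO bad block" >&2
                exit 1
            fi
            # Otherwise return exactly $size zero bytes on stdout.
            dd if=/dev/zero bs="$size" count=1 2>/dev/null
            ;;
        *)
            # Any other method: report it as not implemented.
            exit 2
            ;;
    esac
}
# nbdkit passes the method name as $1; the guard only keeps the script
# quiet when it is sourced or run with no arguments.
if [ $# -gt 0 ]; then main "$@"; fi
```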

176
00:11:29,874 --> 00:11:32,526
So to [test] that I'm just going to run it.

177
00:11:33,164 --> 00:11:38,647
So run "nbdkit", name of the plugin which is "sh", and the "sh" plugin needs the name of the script which I've just written.

178
00:11:40,471 --> 00:11:42,032
The moment of truth here.

179
00:11:43,010 --> 00:11:46,319
If I associate that ... so the good thing there is the size,

180
00:11:46,761 --> 00:11:47,512
64 megabytes,

181
00:11:48,033 --> 00:11:51,589
remember we set the size to 64 MB?

182
00:11:51,991 --> 00:11:53,270
So that's good.

183
00:11:54,024 --> 00:11:56,035
And now if I run "badblocks".

184
00:11:56,552 --> 00:11:58,686
That worked.

185
00:11:59,454 --> 00:12:03,192
Now you might say why has that printed out 4 numbers when there was only 1 bad block?

186
00:12:03,384 --> 00:12:07,476
And the reason for this is the badblocks command reads the disk in 4K chunks,

187
00:12:08,691 --> 00:12:10,623
and when it hits a bad 4K chunk,

188
00:12:11,941 --> 00:12:13,819
because it reports its output in 1K blocks,

189
00:12:14,047 --> 00:12:15,668
it says there must be 4 bad [1K] blocks.

190
00:12:15,889 --> 00:12:18,163
It doesn't go in any deeper,

191
00:12:18,639 --> 00:12:20,483
and try to work out which of the blocks is bad,

192
00:12:20,753 --> 00:12:22,324
it's just the way badblocks works.

193
00:12:22,633 --> 00:12:25,704
So it's good that badblocks, we've proven here, doesn't have any bugs,

194
00:12:26,361 --> 00:12:28,972
even though nobody's used it since 1992.

195
00:12:38,248 --> 00:12:39,835
You don't have to write plugins.

196
00:12:40,187 --> 00:12:41,562
You can use existing plugins.

197
00:12:41,912 --> 00:12:45,727
We've got loads of them, I don't know what to demonstrate, so I'm going to demonstrate 2 at random.

198
00:12:46,316 --> 00:12:48,913
The "floppy" plugin, and the "memory" plugin which is a RAM disk.

199
00:12:49,786 --> 00:12:51,063
The "floppy" plugin first.

200
00:12:51,505 --> 00:12:52,110
Real simple.

201
00:12:53,202 --> 00:12:59,445
"nbdkit", name of the plugin which is "floppy", any directory.  This happens to be the directory of the source code of nbdkit.

202
00:13:02,279 --> 00:13:05,503
And same old nbd-client command to associate that with a loop device.

203
00:13:06,587 --> 00:13:08,967
And it should pop up in a second.  There it goes.

204
00:13:10,536 --> 00:13:11,079
That popped up.

205
00:13:11,519 --> 00:13:16,070
That is a floppy disk image,

206
00:13:16,909 --> 00:13:19,123
which contains the source of nbdkit.

207
00:13:21,077 --> 00:13:22,669
And what exactly happened there?

208
00:13:22,991 --> 00:13:25,787
I took a filesystem from my host,

209
00:13:26,079 --> 00:13:31,114
I turned it into a floppy disk, a FAT formatted, MBR partitioned disk image,

210
00:13:31,884 --> 00:13:33,523
and then I loop mounted it on my host again.

211
00:13:34,603 --> 00:13:35,318
Why is that useful?

212
00:13:36,966 --> 00:13:43,348
One thing you can genuinely use this for is to export disk images really easily to virtual machines.

213
00:13:44,161 --> 00:13:45,179
Or to containers,

214
00:13:46,126 --> 00:13:49,476
some container systems let you [import] a disk image which gets loop mounted inside the container.

215
00:13:50,205 --> 00:13:53,905
Where this is really useful is actually not the loop mounting case.

216
00:13:54,344 --> 00:13:58,653
That's when you're creating a PXE server,

217
00:13:59,441 --> 00:14:01,711
and your PXE client needs to be given a root filesystem,

218
00:14:02,445 --> 00:14:05,817
and the traditional way you do that is you create a massive initramfs,

219
00:14:06,168 --> 00:14:09,204
that you TFTP over to the client at boot time.

220
00:14:09,511 --> 00:14:12,478
Which is slow, TFTP is unreliable, it's not encrypted, etc.

221
00:14:12,927 --> 00:14:15,833
NBD can be encrypted and authenticated [using TLS],

222
00:14:16,274 --> 00:14:19,105
it's a super-efficient protocol,

223
00:14:19,557 --> 00:14:23,107
and it's a much better protocol [to use]

224
00:14:23,433 --> 00:14:25,706
because it only fetches the bits it needs to read,

225
00:14:26,032 --> 00:14:29,729
so you can have a much bigger root filesystem.  So it's a useful thing to do.

226
00:14:30,827 --> 00:14:33,986
My next demonstration is ...

227
00:14:34,943 --> 00:14:36,704
a RAM disk.

228
00:14:38,945 --> 00:14:43,206
Linux of course has a RAM disk driver inside [the kernel].

229
00:14:43,793 --> 00:14:47,319
It is however much more convenient to be able to

230
00:14:47,872 --> 00:14:50,387
write RAM disks in userspace.

231
00:14:50,873 --> 00:14:53,357
We've written a simple RAM disk called the "memory" plugin,

232
00:14:54,539 --> 00:14:58,594
which is implemented using a sparse array,

233
00:14:59,279 --> 00:15:02,917
so it's not limited by the size of RAM,

234
00:15:03,824 --> 00:15:04,763
(in virtual size)

235
00:15:05,013 --> 00:15:07,278
you can actually create really massive disks.

236
00:15:07,597 --> 00:15:11,463
In this case I'm going to create the most massive disk that you can create.

237
00:15:12,070 --> 00:15:14,589
The biggest that Linux supports,

238
00:15:15,120 --> 00:15:18,517
until we eventually move to 128 bit block sizes,

239
00:15:18,984 --> 00:15:24,440
this is 2^63-1, it's the largest signed 64 bit integer you can have.

240
00:15:25,030 --> 00:15:28,544
How big is that -- 2^63 - 1 -- in terms of disks?

241
00:15:28,799 --> 00:15:33,327
I went on to Amazon to try to work out how much it would cost you to buy that many disks.

242
00:15:34,264 --> 00:15:38,553
And it turns out that's €300 million.

243
00:15:38,908 --> 00:15:42,315
I was very disappointed that Amazon doesn't let you create an order

244
00:15:42,913 --> 00:15:46,185
for €300 million that I could have screenshotted here

245
00:15:46,711 --> 00:15:49,829
the field isn't big enough, it doesn't let you do that,

246
00:15:50,225 --> 00:15:51,055
you can't do that.

247
00:15:51,678 --> 00:15:53,386
But anyway it's €300 million on Amazon,

248
00:15:53,788 --> 00:15:56,724
but we can create it here much more cheaply.

249
00:15:57,269 --> 00:15:59,785
Associate it with a loop device.

250
00:16:00,313 --> 00:16:02,065
You can see the size there is just massive.

251
00:16:04,158 --> 00:16:07,950
I'm going to use GPT for partitioning, because MBR is limited to 2 TB.

252
00:16:08,892 --> 00:16:10,347
All defaults.

253
00:16:11,640 --> 00:16:13,276
That's what the partition size looks like.

254
00:16:14,055 --> 00:16:17,322
8 exabytes [EB].  It's actually 8 EB minus 1 byte.
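[Worked arithmetic, added for illustration. The "8 EB" above is read here as binary exabytes of 2^60 bytes each, which is an assumption about the units shown on screen. The variable names are illustrative.]
```bash
#!/bin/bash
max=9223372036854775807      # 2^63 - 1, the largest signed 64 bit integer
eib=$((1 << 60))             # 2^60 bytes, one binary exabyte
whole=$((max / eib))         # complete exabytes that fit in max
rest=$((max - whole * eib))  # bytes left over after those
echo "$whole EiB + $rest bytes"
# The remainder is one byte short of another full exabyte,
# so the total is 8 EiB minus 1 byte.
echo "short of 8 EiB by: $((eib - rest)) byte"
```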

255
00:16:19,143 --> 00:16:21,509
I'm going to use btrfs, but what are my other choices?

256
00:16:21,889 --> 00:16:23,823
I could have used ext4 ... could I?

257
00:16:24,758 --> 00:16:26,591
What's the limit on filesystem [size] on ext4?

258
00:16:27,794 --> 00:16:29,248
Nobody knows.  It's 1 EB.

259
00:16:29,667 --> 00:16:31,113
We'd have 7 EB wasted.

260
00:16:31,866 --> 00:16:35,943
XFS is possible, but XFS has quite a high metadata overhead.

261
00:16:36,271 --> 00:16:37,470
Actually that's unfair on XFS.

262
00:16:37,705 --> 00:16:39,789
XFS has a really nice, low metadata overhead,

263
00:16:40,037 --> 00:16:41,224
but it's about 1%.

264
00:16:41,543 --> 00:16:45,303
1% of 8 EB is too big for my laptop.

265
00:16:47,090 --> 00:16:49,446
So I'm going to use btrfs.

266
00:16:49,799 --> 00:16:51,785
You can see there, btrfs is an absolute champ.

267
00:16:52,353 --> 00:16:55,012
It totally just creates an 8 EB filesystem.

268
00:16:56,551 --> 00:16:58,223
And I can mount it.

269
00:16:59,671 --> 00:17:02,237
And I can ... we've got 8 EB ...

270
00:17:03,714 --> 00:17:09,202
I'm just going to "chown" this so I can go in there and show you.

271
00:17:09,906 --> 00:17:13,991
I played around with this [before].

272
00:17:17,588 --> 00:17:19,140
I missed that question I'm afraid.

273
00:17:20,289 --> 00:17:23,553
(inaudible question)

274
00:17:24,403 --> 00:17:28,560
The question was: How many bugs in anything

275
00:17:28,862 --> 00:17:33,787
do you hit when you try to use the very last block which is only 511 bytes long?

276
00:17:34,587 --> 00:17:36,412
The answer is you definitely hit bugs in qemu.

277
00:17:36,725 --> 00:17:37,994
Qemu can't handle that case.

278
00:17:39,885 --> 00:17:43,627
You can create btrfs subvolume[s].

279
00:17:47,209 --> 00:17:51,305
What is it?  "btrfs filesystem df", I think?

280
00:17:51,823 --> 00:17:54,838
And it just works great.

281
00:17:55,263 --> 00:17:58,751
And the next thing is when I click to the next slide ...  that's gone.

282
00:17:59,317 --> 00:18:02,607
This software I'm using will kill nbdkit,

283
00:18:02,914 --> 00:18:04,489
everything's destroyed and it goes away.

284
00:18:04,886 --> 00:18:06,944
So it's great for testing.

285
00:18:07,703 --> 00:18:11,369
Other things that are useful for testing.  There are some plugins there which are very useful for testing.

286
00:18:12,041 --> 00:18:13,746
And some filters I'm going to talk about now.

287
00:18:14,642 --> 00:18:17,924
Which are super-useful if you're testing filesystems or the limits of filesystems.

288
00:18:20,431 --> 00:18:23,866
The first filter I'm going to talk about which is useful for testing is the "delay" filter.

289
00:18:25,124 --> 00:18:26,520
You can inject delays ...

290
00:18:26,870 --> 00:18:30,636
into nbdkit requests.  You can specify the number of seconds,

291
00:18:31,036 --> 00:18:32,254
or the number of milliseconds.

292
00:18:32,918 --> 00:18:35,668
This is useful if you were testing, say, a distributed filesystem.

293
00:18:36,503 --> 00:18:38,005
You want to test it all on one machine,

294
00:18:38,351 --> 00:18:41,734
but you want to simulate the effects of having a really remote node,

295
00:18:41,999 --> 00:18:43,956
that has a long delay,

296
00:18:44,294 --> 00:18:46,458
you just inject delays into that device

297
00:18:46,935 --> 00:18:47,873
to simulate that.

298
00:18:48,307 --> 00:18:50,264
So it's a very simple filter.

299
00:18:51,356 --> 00:18:53,046
This filter's also a lot of fun.

300
00:18:53,352 --> 00:18:55,637
It's the "error" filter, and it injects errors.

301
00:18:56,051 --> 00:18:57,748
Obvious use for testing here.

302
00:18:58,291 --> 00:18:59,503
There are two ways to use this.

303
00:18:59,736 --> 00:19:04,425
The first way is we want this particular error [EIO] and we want a generalized error rate of 10%.

304
00:19:05,164 --> 00:19:07,202
This means that at random 10% of [requests] are going to fail.

305
00:19:07,887 --> 00:19:10,948
However I think the second way of doing this is more useful for

306
00:19:11,562 --> 00:19:12,594
most people.

307
00:19:13,102 --> 00:19:15,205
Here we're saying the error rate is 100% so

308
00:19:15,559 --> 00:19:19,043
100% of requests are going to fail reliably all the time.

309
00:19:19,565 --> 00:19:20,945
However it's gated ...

310
00:19:22,033 --> 00:19:23,945
on the error file.  Now what that means is

311
00:19:24,266 --> 00:19:25,836
if that error file doesn't exist,

312
00:19:26,150 --> 00:19:28,448
or you delete it, no errors are injected,

313
00:19:28,836 --> 00:19:30,225
the error filter is turned off.

314
00:19:30,827 --> 00:19:32,923
When you create that file,

315
00:19:33,241 --> 00:19:35,444
the error filter is turned on.  This is while nbdkit is running,

316
00:19:36,090 --> 00:19:37,948
so it's checking that error file [on every request].

317
00:19:38,602 --> 00:19:42,241
And that's super-useful for testing because obviously you can inject errors

318
00:19:42,600 --> 00:19:44,477
when you want them to be injected

319
00:19:44,795 --> 00:19:50,472
and then turn off error injection and see if your filesystem recovers or whatever it's supposed to do.
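[Sketch of the two ways of using the error filter described above. It only prints the command lines. Using the memory plugin as the data source, the 64M size and the /tmp/inject trigger path are illustrative choices, not taken from the slide; the parameter names error=, error-rate= and error-file= are assumed to match the error filter's man page for your nbdkit version.]
```bash
#!/bin/bash
# First way: fail a random 10% of requests with EIO.
random=(nbdkit --filter=error memory size=64M error=EIO error-rate=10%)
# Second way: fail every request, but only while the trigger file exists.
trigger=/tmp/inject            # hypothetical path
gated=(nbdkit --filter=error memory size=64M error=EIO error-rate=100% "error-file=$trigger")
# Print instead of executing, so this is safe to run anywhere.
echo "${random[*]}"
echo "${gated[*]}"
# With the second server running, creating the trigger file (touch)
# starts the errors and deleting it (rm) stops them again.
```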

320
00:19:51,755 --> 00:19:56,256
And the third filter which is a very simple filter but also useful is the

321
00:19:56,670 --> 00:19:59,031
"log" filter.  You give it the name of a log file

322
00:19:59,345 --> 00:20:00,719
and it writes the log file in that format.

323
00:20:02,031 --> 00:20:04,450
In the next demonstration I'm going to show you

324
00:20:05,035 --> 00:20:08,085
we're going to have some graphical visualization of what happens

325
00:20:08,424 --> 00:20:10,797
inside filesystems when you do things like creating filesystems.

326
00:20:11,280 --> 00:20:14,835
It's important to note that nbdkit is not a graphical tool.

327
00:20:15,863 --> 00:20:19,172
Nbdkit knows nothing about graphics or anything like that.

328
00:20:19,433 --> 00:20:21,624
What's actually happening here is we're using the log filter,

329
00:20:22,036 --> 00:20:23,263
we're writing a log file,

330
00:20:23,784 --> 00:20:26,910
we've got a second graphical program, a program I wrote for this talk,

331
00:20:27,486 --> 00:20:29,118
which is tailing that log file and then

332
00:20:29,341 --> 00:20:31,992
creating the visualizations which you'll see.

333
00:20:32,465 --> 00:20:36,034
So nbdkit is not a graphical program, it's just a command line tool / server.

334
00:20:37,729 --> 00:20:41,731
Let's have a look at what it looks like to create a filesystem.

335
00:20:42,764 --> 00:20:44,506
Slightly long nbdkit command line here,

336
00:20:44,841 --> 00:20:46,882
but hopefully you should be able to understand what's going on.

337
00:20:47,643 --> 00:20:50,841
We're creating ... the "memory" plugin, so we're creating a RAM disk.

338
00:20:51,256 --> 00:20:52,179
64 megabytes

339
00:20:52,815 --> 00:20:57,224
We're using that log [filter] to create the log file which we're going to tail with a second process.

340
00:20:57,967 --> 00:21:01,944
And we're inserting delays.  Now the delays are just so it slows it down a little bit,

341
00:21:02,430 --> 00:21:03,911
to make it a little bit easier to see.

342
00:21:04,345 --> 00:21:05,792
Otherwise everything goes past too quickly.
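[Sketch of a command line along the lines described above: a 64 MB RAM disk with the log and delay filters on top. It only prints the command. The log file path, the 100ms delay values and the order of the two filters are illustrative guesses, and the parameter names logfile=, rdelay= and wdelay= are assumed to match the log and delay filter man pages for your nbdkit version.]
```bash
#!/bin/bash
logfile=/tmp/nbdkit.log        # hypothetical path, tailed by the second program
# log and delay filters in front of the memory plugin (a RAM disk).
cmd=(nbdkit --filter=log --filter=delay memory size=64M "logfile=$logfile" rdelay=100ms wdelay=100ms)
# Print instead of executing, so this is safe to run anywhere.
echo "${cmd[*]}"
```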

343
00:21:06,719 --> 00:21:08,431
So I'll run nbdkit.

344
00:21:08,924 --> 00:21:12,326
And this is my second program which is going to visualize things.

345
00:21:14,916 --> 00:21:19,590
Same old command to associate the nbdkit instance

346
00:21:20,141 --> 00:21:21,230
with a loop device.

347
00:21:22,426 --> 00:21:24,394
Now hopefully you could see that.

348
00:21:24,710 --> 00:21:26,103
Little black flashes going on.

349
00:21:26,357 --> 00:21:27,463
Those are reads.

350
00:21:27,796 --> 00:21:30,224
What's happening there is because we've created a Linux kernel block device,

351
00:21:30,881 --> 00:21:34,909
the kernel, udev, are looking at that and saying "is there an LVM PV [physical volume] there?"

352
00:21:35,226 --> 00:21:39,043
"Is there a filesystem there?  Is there a partition there I should know about?"

353
00:21:39,320 --> 00:21:42,239
It's a RAM disk so it's empty, but it has to check.

354
00:21:44,417 --> 00:21:46,272
Now let's partition it.  I'm going to use GPT.

355
00:21:48,104 --> 00:21:49,625
All defaults.

356
00:21:52,796 --> 00:22:00,081
GPT works by creating a partition table at the beginning of the disk and a secondary or backup PT at the end of the disk.

357
00:22:00,951 --> 00:22:02,846
Those are represented in red.  Those are writes.

358
00:22:03,236 --> 00:22:08,274
And you probably also saw little black flashes there.  We've created another Linux kernel device,

359
00:22:08,760 --> 00:22:11,085
and again udev has to check it.

360
00:22:13,479 --> 00:22:15,406
Let's create a filesystem in there.

361
00:22:18,242 --> 00:22:21,080
The big thing that happens there is this lump of blue that happens at the beginning.  [Not shown because of a technical problem in the video]

362
00:22:21,609 --> 00:22:24,673
Blue in this diagram represents discards.

363
00:22:25,981 --> 00:22:31,319
Modern mkfs always issues a big discard or trim over the entire partition.

364
00:22:32,202 --> 00:22:35,169
The reason for that is it makes SSDs work more efficiently.

365
00:22:36,674 --> 00:22:42,365
Other notable features: The red bar here is some kind of metadata.

366
00:22:42,791 --> 00:22:45,199
I'm in a Storage [Track] room full of filesystem experts,

367
00:22:45,810 --> 00:22:50,618
so hopefully you know better than I do what's going on here.  But that's probably an inode table.

368
00:22:51,526 --> 00:22:55,435
Big lump of red here.  Could be the journal, maybe?

369
00:22:56,623 --> 00:22:59,704
Little red dots.  I think those are backup superblocks.

370
00:23:00,039 --> 00:23:03,037
If you notice there are 4 red dots and 4 backup superblocks.

371
00:23:04,472 --> 00:23:05,822
Let's mount that.

372
00:23:08,218 --> 00:23:10,989
I'm not touching the laptop here but something funny happens in a second.

373
00:23:14,445 --> 00:23:15,241
Here it goes.

374
00:23:15,826 --> 00:23:17,555
Do you see it's writing?

375
00:23:17,928 --> 00:23:19,467
We've just mounted the disk but it's writing to it.

376
00:23:19,797 --> 00:23:25,023
This is lazy blockgroup initialization, another feature of modern filesystems.

377
00:23:26,145 --> 00:23:34,087
Disks are really big these days, and writing to them (relative to the size of the disk) is really slow,

378
00:23:35,265 --> 00:23:39,306
so you wouldn't want your mkfs to sit there

379
00:23:39,669 --> 00:23:42,998
for hours on end writing all the blockgroup metadata.

380
00:23:44,162 --> 00:23:47,824
And in any case why would you do that?  Because you can't

381
00:23:48,189 --> 00:23:51,285
use all of those blockgroups for writing because writing is so slow

382
00:23:51,797 --> 00:23:53,037
compared to the size of the disks.

383
00:23:53,926 --> 00:23:57,944
So it makes so much more sense for filesystems to defer all this [work] to the kernel,

384
00:23:58,349 --> 00:24:04,108
so when the disk is mounted the kernel sees there are uninitialized block groups

385
00:24:04,748 --> 00:24:05,961
blockgroup metadata

386
00:24:06,240 --> 00:24:08,206
and it creates that in the background.

387
00:24:08,545 --> 00:24:11,197
It doesn't matter anyway because you can't write to those new blockgroups

388
00:24:11,530 --> 00:24:14,679
faster than they're being created

389
00:24:14,942 --> 00:24:16,210
so it's fine.

390
00:24:17,039 --> 00:24:18,390
So let's mount this.

391
00:24:19,030 --> 00:24:25,626
It's mounted so I'm going to chown it to make it convenient for me to put some files on there.

392
00:24:26,385 --> 00:24:29,104
Let's again copy the nbdkit source code.

393
00:24:31,639 --> 00:24:34,197
You see that nothing is actually written until I "sync".

394
00:24:34,783 --> 00:24:36,183
We know this, right?

395
00:24:36,569 --> 00:24:38,493
When you write to a disk,

396
00:24:38,843 --> 00:24:42,706
the writes don't hit the disk immediately, they get stored in memory for a bit

397
00:24:43,080 --> 00:24:44,510
and they get written a few seconds later,

398
00:24:44,913 --> 00:24:46,914
unless you do a "sync" which forces that write.

399
00:24:47,755 --> 00:24:52,231
And of course when I delete that directory,

400
00:24:52,504 --> 00:24:53,945
even when I "sync",

401
00:24:54,675 --> 00:24:56,365
it's not going to [update] that.

402
00:24:57,189 --> 00:25:02,562
You know why this is.  When you delete files the filesystem doesn't really delete them; it simply marks them in the blockgroup

403
00:25:02,886 --> 00:25:03,764
as being unused,

404
00:25:04,263 --> 00:25:09,416
and later on those blocks get reused for other files you create.

405
00:25:09,663 --> 00:25:14,833
But there is a command -- for modern filesystems -- we can use

406
00:25:15,152 --> 00:25:18,238
to actually tell it to discard them

407
00:25:18,503 --> 00:25:19,839
and that's the "fstrim" command.

408
00:25:20,199 --> 00:25:25,904
That issues a discard request to the filesystem.

409
00:25:36,525 --> 00:25:39,881
This is my final demo.

410
00:25:40,183 --> 00:25:42,811
That was a nice one showing a single filesystem,

411
00:25:43,094 --> 00:25:46,844
but I think more interesting is when you run multiple copies of nbdkit

412
00:25:47,790 --> 00:25:48,880
to create multiple devices.

413
00:25:49,790 --> 00:25:53,392
And this is the longest nbdkit command line that you'll probably ever see.

414
00:25:54,960 --> 00:25:56,863
There are only two important changes here.

415
00:25:58,759 --> 00:26:01,844
The first one is ...

416
00:26:02,996 --> 00:26:05,361
Previously I was only running one copy of nbdkit.

417
00:26:06,668 --> 00:26:11,804
So I could have it listen on TCP port 10809

418
00:26:12,187 --> 00:26:13,983
which is the default port for NBD.

419
00:26:15,408 --> 00:26:18,439
However I'm going to be running 5 copies of nbdkit this time, and they can't all be listening on the same port.

420
00:26:19,560 --> 00:26:24,484
So I'm going to be using a Unix domain socket and that's the purpose of the -U option here.

421
00:26:25,547 --> 00:26:28,435
And the second change is I'm using the error filter.

422
00:26:30,042 --> 00:26:32,090
I'm using this in the way we described before

423
00:26:32,410 --> 00:26:34,752
where you set the error rate to 100%,

424
00:26:35,232 --> 00:26:39,347
but we gate this on the presence or absence of an error file.

425
00:26:39,749 --> 00:26:42,128
So the error filter is turned off because that error file doesn't exist.

426
00:26:43,226 --> 00:26:44,726
But it gets turned on later on.

427
00:26:45,715 --> 00:26:48,041
I'm going to start 5 copies of nbdkit.

428
00:26:55,323 --> 00:26:57,064
I'll just show you what's going on on the filesystem here.

429
00:26:58,545 --> 00:26:59,869
We've got 5 log files as you'd expect.

430
00:27:00,626 --> 00:27:04,430
Those are going to be tailed by the graphical viewer.

431
00:27:05,278 --> 00:27:09,583
We've got 5 sockets.  There are 5 copies of nbdkit hiding behind those Unix domain sockets.

432
00:27:12,624 --> 00:27:14,907
Let me run the graphical viewer.

433
00:27:18,115 --> 00:27:19,266
5 devices this time.

434
00:27:20,586 --> 00:27:21,521
Hopefully not too small.  OK.

435
00:27:24,505 --> 00:27:30,328
And now I'm going to associate the 5 nbdkits with the 5 devices.

436
00:27:36,960 --> 00:27:41,071
Now I'm going to create a RAID 5 array.

437
00:27:42,166 --> 00:27:45,606
I'm going to use the first 4 disks as data disks,

438
00:27:46,675 --> 00:27:48,972
and the last disk as a hot spare.

439
00:27:53,549 --> 00:27:54,318
Let's get that going.

440
00:27:57,004 --> 00:28:00,268
You can see what's happening here is it's reading the first 3 disks

441
00:28:00,508 --> 00:28:02,273
and creating a parity disk on the 4th disk.

442
00:28:03,197 --> 00:28:07,377
People who know about RAID will be thinking: "Why's that parity disk not being striped

443
00:28:08,523 --> 00:28:09,170
over all of the data disks?"

444
00:28:09,711 --> 00:28:12,992
The reason is that these disks are so small (they're 64 MB) that

445
00:28:13,303 --> 00:28:16,639
the stripe size is actually [larger] than the entire disk.

446
00:28:17,625 --> 00:28:20,524
So there's 1 parity disk and there are 3 data disks.

447
00:28:21,633 --> 00:28:25,258
Let's have a look at the kernel messages which will be interesting in a minute.

448
00:28:27,644 --> 00:28:31,624
Let's partition that as before.

449
00:28:33,573 --> 00:28:34,663
All defaults.

450
00:28:36,387 --> 00:28:38,564
And we can create a filesystem on there as well.

451
00:28:43,437 --> 00:28:46,203
This looks a lot like it did last time,

452
00:28:46,671 --> 00:28:48,956
except there's no trim.

453
00:28:49,432 --> 00:28:54,270
Now the MD device in the kernel doesn't believe you can send discard requests to devices.

454
00:28:54,623 --> 00:28:56,286
I guess because they've been burned [by faulty hardware] in the past.

455
00:28:57,594 --> 00:29:01,126
There is a way to do this, by setting a kernel command line flag,

456
00:29:01,583 --> 00:29:03,230
which is something weird like

457
00:29:03,474 --> 00:29:07,185
[raid456.devices_handle_discard_safely=Y]

458
00:29:08,107 --> 00:29:11,304
However as this is quite literally my work laptop,

459
00:29:11,759 --> 00:29:14,109
I don't have that on my kernel command line

460
00:29:14,480 --> 00:29:18,830
so it's not issuing discards to the underlying devices.

461
00:29:20,506 --> 00:29:22,025
And I can mount this.

462
00:29:24,876 --> 00:29:30,865
Let's chown it so I can write to it.

463
00:29:32,028 --> 00:29:34,175
Let's create some files in there.

464
00:29:36,869 --> 00:29:41,949
And the interesting thing is what happens when I inject an error into this?

465
00:29:43,343 --> 00:29:46,086
Well you can see what happened there, quite dramatically.

466
00:29:46,515 --> 00:29:50,185
It detected first of all that the error occurred

467
00:29:50,543 --> 00:29:52,285
on the second disk

468
00:29:53,191 --> 00:29:58,514
and the second disk is called "/dev/nbd1" here because I'm starting the disk numbering from "/dev/nbd0".

469
00:29:59,886 --> 00:30:02,192
You can see also it started to do a recovery.

470
00:30:02,492 --> 00:30:06,089
So it started to read from the remaining good disks,

471
00:30:06,553 --> 00:30:09,897
and it created an extra parity disk on the hot spare.

472
00:30:10,903 --> 00:30:13,710
It took a little bit of time.  We're injecting delays here.

473
00:30:15,065 --> 00:30:19,824
Although we're injecting delays so it's a bit slower than normal,

474
00:30:20,115 --> 00:30:22,955
you can imagine how it would be if this wasn't a 64 MB disk

475
00:30:23,223 --> 00:30:26,180
but was 6.4 TB or larger.

476
00:30:27,199 --> 00:30:30,085
Recovery on RAID 5 takes a really long time,

477
00:30:30,507 --> 00:30:34,985
and unfortunately the way that RAID 5 works is that if you then get another disk failing,

478
00:30:36,245 --> 00:30:37,786
at certain points during the recovery,

479
00:30:38,103 --> 00:30:39,429
you can lose all your data.

480
00:30:39,642 --> 00:30:43,324
That's why we don't use RAID 5 in production,

481
00:30:43,649 --> 00:30:45,951
certainly on larger systems these days.

482
00:30:46,983 --> 00:30:49,477
However it's still a good demo.

483
00:30:50,863 --> 00:30:55,985
I should just note that when I clicked the "Error" button there

484
00:30:56,388 --> 00:30:59,914
the graphical tool didn't start injecting errors.

485
00:31:00,182 --> 00:31:02,677
All that happened was the graphical tool created a file called "error2".

486
00:31:03,749 --> 00:31:07,025
And then nbdkit notices that the file exists and starts to inject errors on that disk.

487
00:31:08,892 --> 00:31:11,399
Now although all that dramatic stuff happened in the background,

488
00:31:11,873 --> 00:31:14,435
the actual filesystem is fine.

489
00:31:14,841 --> 00:31:17,999
There are no errors at the filesystem or array level.

490
00:31:18,348 --> 00:31:20,480
The dramatic stuff happened below there.

491
00:31:21,511 --> 00:31:25,443
And of course I can inject more errors on a second disk.

492
00:31:26,307 --> 00:31:28,606
And now we're running in degraded mode.

493
00:31:28,871 --> 00:31:30,865
This is the minimum number of working disks that this RAID array can support

494
00:31:31,605 --> 00:31:33,431
without actually failing.

495
00:31:33,703 --> 00:31:36,511
Although there has now been a second error

496
00:31:36,869 --> 00:31:39,634
I'm still just about OK.  But if there was another error,

497
00:31:40,688 --> 00:31:42,465
if I clicked another button, two things would happen:

498
00:31:44,147 --> 00:31:47,114
You'd see errors appearing at the filesystem level.

499
00:31:47,676 --> 00:31:50,483
And the second thing that would happen is I'd have to reboot my laptop.

500
00:31:50,943 --> 00:31:53,080
Because you cannot ... and I could not work out how to do this ...

501
00:31:53,561 --> 00:31:57,219
You cannot then unmount a RAID array

502
00:31:57,464 --> 00:31:59,578
that's in that state.  It's just impossible

503
00:32:00,002 --> 00:32:02,277
and I've no idea why.  There's lots about this on Stack Overflow.

504
00:32:02,929 --> 00:32:05,553
I don't want to reboot my laptop in the middle of the talk

505
00:32:05,914 --> 00:32:09,346
so I'm not going to do that.

506
00:32:09,693 --> 00:32:11,706
It's probably a kernel bug or something, I don't know.

507
00:32:12,792 --> 00:32:16,154
Instead what I'm going to do is unmount the filesystem

508
00:32:16,522 --> 00:32:18,343
and stop the RAID array.

509
00:32:18,515 --> 00:32:21,515
[end of subtitles]