TRiRODS: Deploying Logical Quotas with iRODS at CU Boulder Research Computing

So welcome tonight to TRiRODS. We're going to have Jonathon Anderson, who is associate director of University of Colorado Boulder Research Computing. He's going to talk about logical quotas along with iRODS at CU Boulder for the PetaLibrary. Thanks, Jonathon.
Thanks, Terrell, and thanks, everyone. We're pretty new to iRODS still, so I've got some stuff in here that's just background: a little bit on Research Computing and a little bit on what brought us to this state. I'm sure everyone there is quite a bit more familiar with iRODS than I or anyone here is, so if you see something that we're doing and you think "why are you doing it that way?", that question deserves to be raised, because there's a good chance that we don't know that we shouldn't be doing it that way. But of course, for any other questions about our experiences with the two main newish components in our deployment, those being NFSRODS and logical quotas, I'm happy to talk about that, or really anything else that comes up.
Like Terrell said, I'm Jonathon Anderson. I'm with the Research Computing group here at CU Boulder. People primarily think of us as an HPC center: we run a couple of compute clusters, one as part of an NSF grant and the other a condo system that we aggregate from contributions from a bunch of different users. But we also run a research data storage system that we call the PetaLibrary, and that's in two parts that we'll get into. Let's get this one, then, here we go. So, the PetaLibrary has
been in operation for about eight years at Research Computing; it predates me by a fair bit. It starts with the tape library that you can see (sorry, I've got two slides here on my screen). The tape library on your right is one of these standard two-frame IBM tape library things that was expanded at one point. It still is accessed using hierarchical storage management with GPFS and TSM: users primarily interact with the GPFS file system, and the TSM policy management stuff stubs files off to tape relatively transparently. And then on the left is a DDN GRIDScaler cluster based on an SFA 10K that we got on a bit of a backdoor deal. I don't really know how this happened (again, it predates me), but my understanding is that DDN had a used one lying around and offered a special deal, and so it ended up in our racks. So the one on the right became PetaLibrary Archive, which we mount pretty much just on our Globus transfer nodes and login nodes, where people can put data that they want to keep but not necessarily use at any given point. And then on the left is PetaLibrary Active, which we only mount over NFS, but it's relatively performant and we give access to it from all of our compute clusters. In particular, our condo users, who don't have a dedicated scratch environment, have tended to use PetaLibrary Active for their active data sets and things like that. We sell access to both of these systems at a terabyte-per-year rate. What I like to say is that our competition isn't Amazon or any cloud service, but going out to Best Buy and buying a hard drive that you keep on your desk; that's where we try to match price points. So what we
end up doing is we buy or pay for infrastructure somehow, and then we base our storage rate around the price of the media that goes into it. The goal is to get researchers to do the right thing with their data, which is kind of a theme with using iRODS as well: rather than putting it on a bare hard drive that doesn't even have an enclosure and keeping it in a drawer, we want them to have it in a RAID array. Most of our users interact with both of these systems at some point, certainly the Archive via Globus; we have a number of users that use Globus shared endpoints in particular to share data with users at other institutions. So, over the last year
or so, I have in my notes, though I am depressed to admit it has actually been almost two years, we've been migrating our storage from that infrastructure to what you see on the screen here. On the left is a series of SAS enclosures; they're relatively standard 60-bay HGST enclosures. In fact, I think they are the same enclosures that are in the SFA 10K, just direct from HGST instead of through DDN. Those are aggregated using ZFS, and then we run BeeGFS on top of that: ZFS for storage, plus a pair of servers providing metadata for BeeGFS that use XFS. And then on the right is our new Spectra T380 tape library. That uses four LTO-8 drives, and we use the, either "LTO-7" or "LTO-8", I see it said different ways, Type M media: that's an LTO-7 cartridge formatted with the LTO-8 format, so we get nine terabytes per cartridge, which is kind of cool. Above it is a StrongBox LTFS storage appliance. If you're not familiar with this, it has some local cache, and you write into an NFS or SMB share or read from it, and it handles stubbing that out to tape. I think it's actually using a database inside; there's no real stubbing going on on the file system, but it stages those files in and out of tape depending on the state of the cache. And then, just by happenstance, the two servers above that are the database servers that ended up holding our iRODS catalog, which is a whole other story. So, the main thing that was giving us problems, even with the legacy Archive,
is this: the way we like to hand out this storage is to create a directory and then set a maximum size on that directory. In GPFS you do this with filesets: you create a fileset, junction that fileset into your namespace somewhere, and then you can set a quota on that fileset. It's effectively just another ID that gets attached to every directory and file, a fileset ID alongside the UID and GID; everything within that sub-namespace gets the fileset ID, and then you can track quotas on it the same way you would track UID and GID quotas. ZFS does this a little differently: you create a dataset and then you can set a quota on that dataset, and that's exposed as a file system. So we can do that in traditional file systems, but when it gets hierarchical, it gets messy.
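Concretely, the flat (non-hierarchical) case looks something like this in the two systems (a sketch; the pool, fileset, and path names are illustrative, not ours):

```
# ZFS: a dataset is its own subtree, and the quota applies to it recursively
zfs create tank/projects/alice
zfs set quota=10T tank/projects/alice

# GPFS: create a fileset, junction it into the namespace, then set a block
# quota on the fileset ID that everything underneath inherits
mmcrfileset gpfs0 alice
mmlinkfileset gpfs0 alice -J /gpfs0/projects/alice
mmsetquota gpfs0:alice --block 10T:10T
```

Neither mechanism knows anything about data that HSM has since migrated off to tape, and that is exactly where it falls apart for us.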
Even with GPFS and TSM together, when GPFS stubs data off onto tape, that data comes off of your quota and no longer impacts it. So historically we've had a series of terrible scripts, well, not even ours: the IBM consultant that deployed all of this originally wrote them in ksh, to write the current occupancy of your dataset on tape into a file, and that file was consulted when deciding whether or not to stub data off onto tape. It's always been fragile, it's always been a mess, and we've never really liked it. One of the original goals of the new PetaLibrary was to fix this and make it feel better, to make it feel on purpose instead of something that we hacked together. So, I had heard of iRODS
a while ago. At the last site I was at, I had a co-worker who had used it at a previous job of his in the pharmaceutical industry, and he was always saying we should have iRODS, we really should be using iRODS. I definitely hadn't used it before, so I didn't really understand what it would be for or what the value of it would be, but it was interesting and I trusted the guy, so I started playing around with it on the side in my spare time; I have a server just here under my desk, which I keep kicking on accident, that's been very useful. But when we started designing this actual service, iRODS was one of those things where I thought: maybe we should be doing this with iRODS. But we wanted NFS access, basically file-system-level access that we could mount, and at the time iRODS wasn't ready for that, and it also didn't do the quota stuff. It looked like we would be able to do the quotas, but without NFS access we'd have to go somewhere else, so we went with a proprietary competitor, for this and many other features that were promised and didn't actually exist, which is part of why it's taken us multiple years to
do this migration. So when that proprietary competitor to iRODS did not pan out, we reached out to the iRODS Consortium (Jason, and I think Terrell, were part of that conversation at the time) and started talking about what it would look like to do this deployment with iRODS. I think NFSRODS was at something like 0.7 then, but "this is coming," and we thought, okay, that sounds encouraging. We also gave our ideas for what it might look like to do what ended up being collection-based, now "logical," quotas in iRODS, and got some good feedback on that. So ultimately we put the archive aside for a little bit, focused on the active side, and got that up and running; that left time for some development, which was nice. Just in the last few months we've been pursuing this in earnest, since the BeeGFS file system has relatively stabilized. We've been early testers of NFSRODS: we have it running on one server, and I want to deploy it out onto basically all of our data transfer nodes. I'm still whiteboarding what that architecture should look like; I'm thinking of fanning our resource servers out there as well, but that's still in the early planning stages. And like I said, we had an opportunity to co-develop the logical quotas plugin, though we didn't write much code: I messed around with prototyping a little when logical quotas were still written in the traditional rule engine, but none of that actually got committed; Kory's done pretty much all of this. And I want to say here that I
can't overstate how responsive we've found our interactions with the Consortium; it's just been really great. We were kind of in a bad way when we came to the Consortium for help, and it's been really good to get such an encouraging response, both on these two features and just in figuring out how to do things in general. And that was even before we were Consortium members; I'm glad to say that we are Consortium members now, and I expect that to continue in perpetuity. Let's see. So, this is what we
have now in terms of the Archive. We have our StrongBox LTFS appliance fronting tape; StrongBox exports a series of either NFS or SMB shares (we do it all with NFS), and these shares are mounted on our resource server, which right now is the same host as the catalog server, though we might do that elsewhere in the future, and then added into iRODS as resources. All of this is fronted with NFSRODS, the main goal there being access to iRODS via Globus, and then, because it's what our users expect, we will also make those mounts available on our login nodes. That's about where it ends for the Archive. In the future, if we get Active into this as well, we might mount this everywhere, on all the compute nodes and everything, but
one step at a time. Another little bit of history: we were trying to work out getting access to iRODS with the current versions of Globus, and that didn't work out; it's another place where the Consortium really stepped up and helped us make this work. We don't have this exported via Globus yet, but since it's just NFS with NFSRODS, there's no reason why we couldn't have those mounts there. Actually, I think the mounts are even there, so it probably is technically possible; we are using the data transfer nodes for some data ingest right now. So, over here on the left, in
our ilsresc output, we aggregate those shares with a, what is it called, I wrote it down here but I don't see it; anyway, they're children of a parent resource that randomly allocates files to one of the four. That's because of the way StrongBox works: each share will only ever use one tape drive at a time, so doing it this way lets multiple users doing I/O across the whole system utilize up to all of the drives at the same time. And then we have these "single" and "double" parent resources because we also want to use this to replicate data from Active: the "double" resources automatically replicate between two tapes in the background, and we don't need to replicate a thing that is itself already a replica, so we have the "single" resource configured so we can keep just a single copy of some data.
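In iadmin terms, that tree can be sketched roughly like this (resource, host, and path names here are invented for illustration, not our real ones):

```
# A 'random' parent spreads new data objects across its children, so
# concurrent writers can keep up to all four tape drives busy at once
iadmin mkresc tape_double random

# One unixfilesystem child per StrongBox NFS share mounted on the resource server
iadmin mkresc tape_share0 unixfilesystem rs.example.edu:/mnt/strongbox/share0
iadmin mkresc tape_share1 unixfilesystem rs.example.edu:/mnt/strongbox/share1
iadmin addchildtoresc tape_double tape_share0
iadmin addchildtoresc tape_double tape_share1
```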
So, NFSRODS. I guess it's on Docker Hub, but we're running it directly out of git because we're making some changes. Ultimately, what's recommended, and what we are doing, is running it as a Docker container. We do it slightly differently than in the README: the README has you put a passwd and shadow file into the container, or mount them into the container, and that's where your ID mapping happens, but ultimately that's just using NSS for ID mapping. So we modified the container to install SSSD, and we mount /var/lib/sss, which is where all the pipes live, to connect the host SSSD to the one running inside the container. Or rather, to the libraries inside the container: there's no actual SSSD daemon in there, but libnss_sss inside the container contacts the host's SSSD and gets all the users that are in our LDAP. That way we can have any user in our domain without having to put them all in some static passwd file or anything like that, and it just worked, immediately. I have a pull request for optionally having SSSD available in that container that is hanging open, but I don't think there's any pushback on it; there's just a lot going on.
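Our run command ends up looking something like this (a sketch: the image name is our local build, and the config mount path is from the README as we recall it, so verify against the current README):

```
# The host runs SSSD joined to campus LDAP; the container gets only the SSSD
# pipes, so its libnss_sss resolves users against the host daemon.
docker run -d --name nfsrods \
  -p 2049:2049 \
  -v /var/lib/sss:/var/lib/sss \
  -v /opt/nfsrods/config:/nfsrods_config:ro \
  local/nfsrods-sssd
```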
Configuration is largely these three files. The log4j file is just a standard log4j configuration file, so I don't know much about it; we just copied it out of the example. Then there's an exports file, which is largely static, and the main stuff is in server.json. The original point of this talk is not all the NFSRODS stuff, but for the sake of completeness: it's pretty simple, and again, we just copied it out of the example. What I would say is that, as we've been following development of NFSRODS, the format of this and the other config files, which attributes go where and how they are spelled, has changed over time. So if any of you are setting up NFSRODS yourself, I recommend that every time there's an upgrade you double-check your current config against the example, because things like that were a big pain for us at one point, namely with this exports file. So, the way
NFSRODS does ID mapping, sorry, not ID mapping, permissions mapping, is actually really cool: it exposes all the iRODS ACLs as NFSv4 ACLs. But that is apparently a relatively new feature, and the original pull of NFSRODS that we got, before we started really testing it in earnest, did not have it, and our exports file did not have the ACL bit either. We had one on our system that we'd copied out of the example, and then it took us about a week to find what was missing when things didn't work; we sorted that out eventually, with a lot of help. So in general, I've had a
very polar experience with NFSRODS, which is: what is working is awesome, in ways better than I expected. The example I've given in the past is that I didn't even know iRODS had a POSIX-style streaming API before, so I was expecting NFSRODS to just be caching puts and gets; but no, it's fully POSIX. I was able to write a random file with dd onto it and then use an offset to do I/O into a random chunk of the middle of the file, and it all worked exactly like it was supposed to.
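That partial-overwrite test looks like this (a sketch; MNT would be your NFSRODS mount point, but it defaults to a local temp directory here so the commands can be tried anywhere):

```shell
# Write 4 MiB of random data, then overwrite 1 MiB starting at a 2 MiB offset.
# conv=notrunc keeps dd from truncating the file: a true in-place POSIX write,
# which a simple put/get cache could not satisfy.
MNT=${MNT:-/tmp/nfsrods-demo}
mkdir -p "$MNT"
dd if=/dev/urandom of="$MNT/random.bin" bs=1M count=4 2>/dev/null
dd if=/dev/zero of="$MNT/random.bin" bs=1M seek=2 count=1 conv=notrunc 2>/dev/null
stat -c %s "$MNT/random.bin"   # 4194304: same size, middle rewritten in place
```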
It's clearly still early days, though. Every now and again, when we're trying to do something that we hadn't done with it before, we find something not working; just in preparing this presentation I found one bug, and then I filed another. I imagine they will get sorted out, but it's clearly still relatively early. Still, I'm very excited; it's a great thing to have. And so, logical quotas, which I think are
the real point here, are also very cool. It's deployed now as a policy (rule engine) plugin. The original prototype was a set of rules written in the iRODS native rule language, and my understanding is that the primary motivator for taking it from that to C++ was that we hit a maximum size limit on the integers the rule language supports: since it measures the size of your data set in bytes, that overflows pretty quickly. But it's all working now and it's pretty good. There's an RPM, and I assume a deb (we aren't using any Debian-based distros, but there's a package available). You can install it, add this rule engine into your list of rule engines, and away you go.
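Turning it on is then a matter of adding the plugin to the front of the rule-engine list in server_config.json; this fragment is a sketch patterned on the plugin's README, so double-check attribute names against the current README:

```json
"rule_engines": [
    {
        "instance_name": "irods_rule_engine_plugin-logical_quotas-instance",
        "plugin_name": "irods_rule_engine_plugin-logical_quotas",
        "plugin_specific_configuration": {
            "namespace": "irods::logical_quotas",
            "metadata_attribute_names": {
                "maximum_number_of_data_objects": "maximum_number_of_data_objects",
                "maximum_size_in_bytes": "maximum_size_in_bytes",
                "total_number_of_data_objects": "total_number_of_data_objects",
                "total_size_in_bytes": "total_size_in_bytes"
            }
        }
    }
]
```

The native rule-language engine stays in the list after it; the quotas plugin just sees the policy enforcement points first.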
You can interact with it with the straight native irule command, but we took some examples that Kory had written up and committed to git, and wrote a tiny little Python script around them; it's like a couple dozen lines, but it ends up making a lot of this very straightforward. We intend to contribute it into the repository; we just haven't gotten to it yet. Up here on the top left, on our iRODS catalog server, as a user logged in as rodsadmin or something like that, you can see we're able to start monitoring a collection (this is my home collection here) and then list the AVUs defined on it; our script's ls will filter down to the attributes that matter with respect to quotas. We can see it has determined that there are fourteen data objects in there, and that's recursive: if there are sub-collections, it will go find all of those and record how many data objects there are in the whole collection, at the top of it. And the same for total size in bytes: it sums up the sizes of all of the data objects and stores that at the top. The idea is that any time you make a change, the plugin checks whether you're in a collection that is tracked, and updates the values stored against that collection as you make changes. We're finding corner cases here and there of things that aren't getting caught yet (I think I saw that Kory opened a bug recently for it not handling streaming I/O), but it's just a matter of catching all of the cases; it seems relatively straightforward to add them in.
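The bookkeeping idea can be sketched in a few lines of Python (the real plugin is C++ inside iRODS; the class and attribute names here are made up for illustration):

```python
# Toy model of logical-quotas bookkeeping: totals live as metadata on the
# tracked collection, and every put beneath it is checked, then counted.

class QuotaViolation(Exception):
    pass

class TrackedCollection:
    def __init__(self, path, max_objects=None, max_bytes=None):
        self.path = path
        self.max_objects = max_objects   # like maximum_number_of_data_objects
        self.max_bytes = max_bytes       # like maximum_size_in_bytes
        self.total_objects = 0           # recursive totals stored at the top
        self.total_bytes = 0

    def put(self, size_in_bytes):
        # Checked before the write is allowed, as in the plugin's policy hooks
        if self.max_objects is not None and self.total_objects + 1 > self.max_objects:
            raise QuotaViolation("would exceed maximum number of data objects")
        if self.max_bytes is not None and self.total_bytes + size_in_bytes > self.max_bytes:
            raise QuotaViolation("would exceed maximum size in bytes")
        self.total_objects += 1
        self.total_bytes += size_in_bytes

# The demo from the slides: a 10 MB (decimal) limit and repeated 1 MiB puts.
home = TrackedCollection("/cu/home/anderson", max_bytes=10_000_000)
for _ in range(9):
    home.put(1_048_576)          # nine puts succeed; total is just under 10 MB
try:
    home.put(1_048_576)          # the tenth would cross the limit
except QuotaViolation as err:
    print(err)                   # would exceed maximum size in bytes
```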
So here we see that I have 14 objects in here. If I set my limit to 14 and then try to put a file into it, on the right, logged in as me, I try to put a test file into my collection and I get an error saying that doing so would exceed my maximum number of objects, which is exactly what we want. We care about the number of files because lots of files on tape gets kind of dicey, with seek times and things like that, so for tape what we tend to do is have a ratio of size to number of files that we allow people to store, to try to encourage them to pack things into larger files. A more normal example here: I unset that max-objects limit on the left as rodsadmin and set a max size of 10 megabytes (again, this is expressed in bytes), and I can see that value has been set as a maximum size in bytes over there. Then on the right, logged in as me, I created a one-megabyte file and just started putting it in there; I'm using this weird syntax so that each put gets a new file name, so it's a new data object each time. After I put in a few files, it eventually comes back with: look, this would exceed your maximum data size, exactly as you would want. And over here on this last one, we see that the total size in bytes for my collection is just under 10 megabytes, as you would expect.
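For reference, the same admin operations go through irule against the plugin instance when you don't have a wrapper script; this is a sketch patterned on the plugin's README (the zone and collection path are made up, and operation names may have shifted between versions):

```
# Begin tracking a collection (computes the recursive totals)...
irule -r irods_rule_engine_plugin-logical_quotas-instance \
  '{"operation": "logical_quotas_start_monitoring_collection", "collection": "/cuZone/home/anderson"}' \
  null ruleExecOut

# ...set a 10 MB limit, expressed in bytes...
irule -r irods_rule_engine_plugin-logical_quotas-instance \
  '{"operation": "logical_quotas_set_maximum_size_in_bytes", "collection": "/cuZone/home/anderson", "value": "10000000"}' \
  null ruleExecOut

# ...and the totals and limits are visible as plain AVUs on the collection.
imeta ls -C /cuZone/home/anderson
```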
As with NFSRODS, it is also clearly early days here: there are a few bugs in the tracker, but most of them are either corner cases for things that aren't being tracked yet, or little permissions issues around who is allowed to set quotas and who can change quotas, things like that. This seems like a much more straightforward thing than the NFSRODS stuff, so I imagine it will become production-ready for us without too much additional work. (The link to the GitHub for NFSRODS was on the slide back there as well.) The biggest thing right now with the quotas is that they aren't integrating particularly well with NFSRODS, and that's not the streaming part: when we write into a collection with NFSRODS and the write exceeds quota, we just get a generic I/O error back, and if we look at the logs, NFSRODS is getting a traceback that it's just not handling. Hopefully it will be possible to pass a quota-exceeded error or something like that back, or at least handle that exception. We'll continue working on that and submitting bugs, and mostly sitting around lazily while we wait for the Consortium to sort out those bugs.
That's everything that I have prepared. I'm happy to talk more about it and take questions, and you're welcome to reach out to me after the fact; there's my contact information. What can I tell people about how we've set this up and what our experience has been like?

Thank you so much. Has anybody else touched it besides you?

Me and the storage admin here. I've done most of the configuration, but Patricia's doing all of the ingesting and things like that.

Right. And has anybody else come up against the quotas stuff yet? Maybe in your testing you've managed cases, but nobody's seen it live?

That's correct, yeah. Even the ingesting that we're doing, we're doing without quotas, and we'll apply them after the fact, so no, it's mostly just me messing around with it. We did apply a limit to one of our larger allocations (this one is clearly just a test that has nothing in it), and it's been performant for backporting tracking onto an existing collection. We've been in iRODS using the native commands, and all of that seems to be working, with the one caveat of a little bit of inconsistency in who is allowed to modify the attributes, but I imagine that'll get sorted out. And it also isn't a big deal: our concern isn't adversarial users who might try to get around the quota system at this point; it's just having something there that tells people they need to contact us to buy more storage.

Right. So, in terms of other clients, and by that
I mean software pieces of software that talk to irods
I was wondering if anything other than AI commands and NFS rods have been used
to put anything in biological quotas are firing and they’re behaving yeah so I
expect we will also be very interested in in DAF rods I have that in my
personal environment but I don’t have it here yet we haven’t tested it probably
what I’ll do is configure the logical quotas on my personal environment and
start messing around with that there rather than then try and get to have
rats working for for production at this point I know that one question we’re
going to get about this is: with all the extra bookkeeping, what kind of performance effect does this have? And I don't know how to measure that yet.

You're our first hope for measuring that in a real way, probably.

Sure. I mean, I haven't looked at the C++ code, but it shouldn't be too bad, because ultimately it should only ever be an additional call during open and close, right? So I wouldn't expect it to add that much overhead; ultimately it's just adding to an integer or subtracting from an integer.

Right, it's just that the slowest part of iRODS is talking to the database, probably, and we're doing that a couple more times. It's not going to be zero.

Right. And our initial deployment here is for the Archive, so performance isn't as big of a concern; but it's totally fair that if we wanted to start doing this to front the BeeGFS file system, or its underlying ZFS, which is my ultimate goal, that's where we'd really start seeing that hit. We haven't tested any of that yet.

Okay. Anybody else have any questions? So, you're using this so that
you can get iRODS data also hoisted up into Globus via NFS; that's kind of your way of getting around the transition?

Yeah. And ultimately I kind of prefer that solution, to be honest: it's just more pluggable, maybe, rather than having custom stuff everywhere. NFS is a common protocol, and plugging iRODS into Globus via NFS makes some sense to me.

Well, CyVerse is looking to do the exact same thing, and they know about it now because I just texted them. I think it would be interesting to see how this works with that Globus installation, and that might actually be a good use case for tuning; you know, it's always different when you see somebody else's software trying to go through your interface, in terms of revealing corner cases and stuff like that.

So, our current
production Globus cluster is a three-node cluster that's really aging, and the CPU and memory are not keeping up with the demands, so we have another four-node cluster that we've started building that isn't quite in production yet. My current thinking: right now we have NFSRODS running on the VM that is also running the ICAT. What I imagine we'll do is run NFSRODS on each of those Globus servers and have each mount itself over NFS, so they all talk to iRODS in parallel instead of all going over a single NFS mount. I'm also imagining that we might move the resource servers for each of those StrongBox resources out of the ICAT server, which again is a weird funnel for all of that data to be going through, and move those to the DTNs as well, so that each resource server is reaching NFS itself, in parallel with the other servers, rather than it all going through one. Though I don't think we've actually tried plugging two NFSRODS instances into the same catalog yet.

It shouldn't matter; it's just a client.

Yeah. Famous last words.

I wonder, is there any need for
distributed caching between the nodes? Like, at what level are you caching, and in what software? In NFSRODS? Would you want something like a distributed cache for performance tuning? I think that's really far down the road.

Yeah. I mean, the only concern we would have with a cache would be coherency, and they all have their own timeouts; again, that's probably totally fine for the archive side of this. It might get into weirdness if we start trying to do it with Active.

Yeah, I feel the same: it really shouldn't matter to have more than one connected. At most you might see a couple of seconds where a file has been written into one and hasn't appeared in the other yet, but that's not unheard of with NFS anyway, and we've got knobs for that if it's the thing you really care about. Anything else? Thank you so much; this will be on
the Internet shortly.

Thank you very much. I'm glad to have an opportunity to participate more in the community, and I want to be in more of those morning planning meetings and things; it's just not working out with the schedule and the time difference.

Right, right. Okay, you're exactly two hours into the morning, understood. Thanks, I appreciate it.

Daniel Ostrander
