Metadata Standards

November 19, 2019

Good day, everyone. This is Jack Van Horn from the Big Data to
Knowledge Training Coordinating Center located at the University of Southern California. And I’d like to welcome you to the next in
our series of Big Data to Knowledge Guide to the Fundamentals of Data Science. We’re absolutely delighted to have Susanna-Assunta
Sansone from the University of Oxford speaking to us today on metadata standards. Susanna is the associate director at the Oxford
e-Research centre, which is a department at the University of Oxford. Her research focuses on data representation,
curation management to support data reproducibility, and the evolution of scholarly publishing,
which drives science as well as discoveries. She is the principal investigator leading
a number of programs in the UK and the European Commission, as well as being a participant
in the NIH’s BD2K activities. She’s also the vice chair of the Dryad Board
and a board member of the ELIXIR UK Node, which is sort of a European analog of the
Big Data to Knowledge program, the Research Data Alliance, and the Force11 task force. She is a Founding Honorary Academic Editor
of Springer Nature's Scientific Data open-access journal. And today she's going to be talking about
metadata standards and interoperability of standards to be able to promote operational
processes, underlying exchange, and the sharing of information between different systems and
how these relate to the FAIR principles. That is, that data should be findable, accessible,
interoperable, and reusable. And without further ado, I’ll turn it over
to Susanna. Thank you so much for joining us, Susanna. Thank you so much for having me. And good morning or good afternoon, depending on where you are. OK, here is the outline of my lecture today. I plan to first introduce the landscape of standards, starting with interoperability standards. Then I'll narrow down and focus on metadata and content standards: I'll explain what those are and what they're useful for, and also illustrate the lifecycle of those efforts and a few key players. And I'll close with why it's important for both producers and consumers of standards: producers need to make these resources discoverable, and consumers need to be able to find the right resource appropriate to their use case. So let's start with a definition. What is a standard? This is, of course, just one definition, because
there are many. But let's say that a standard is an agreed-upon convention for doing something. It can be established by an authority or by a grassroots group. What I really want to define are interoperability standards, which are the type of standard most relevant to what we do and to the scope of this talk. The best example to use is nuts and bolts, so let's use engineering. Engineering in the 19th century wasn't as easy as it is now. Now you go and buy nuts that fit the right bolts, and that's it; you know that they will work together. That certainly wasn't the case before standards were defined, in the mid-19th century. But those standards, although initiated quite early, were only widely adopted after the Second World War, a century later. And that's quite an interesting thing to bear in mind: between the development and the adoption of standards, there is a very big time gap. Now, let's talk about interoperability
standards in our area, the life sciences and biomedical sciences. What we are interested in are interoperability standards for digital objects: research outputs such as data, code, algorithms, workflows, models, software, or even papers. In our case, these are agreed-upon specifications, guidance, or criteria. They are designed to ensure that digital research objects are FAIR, which means findable, accessible, interoperable, and reusable. So interoperability standards are enablers: they enable us to do better science in a more efficient way, with straightforward access to and reuse of these digital objects. If you're not familiar with what FAIR stands
for, the work behind it, or what the principles say in detail, I will refer you to this reading material. This is a publication in Scientific Data that a group of grassroots organizations has promoted and endorsed, along with, as you can see, funders such as BD2K, infrastructure developers and service providers such as ELIXIR, and other grassroots organizations and advocates, including [INAUDIBLE] and Force11. The FAIR principles are a precise and measurable set of guidelines that allow digital object and service providers to build infrastructure for those digital objects and to make sure that they are actually FAIR: findable, accessible, interoperable, and reusable. The third principle in particular stresses that a digital object should not just be human-readable, but should also be accessible to machines. The article contains a few examples of implementations that are working towards making all digital objects FAIR, measured against metrics that have been devised by a working group within the BD2K. Those are just a few examples, and as you can see there are groups within the BD2K which are also working towards making data, or even standards, FAIR, but I will touch on those as we go along in the talk. So let's go back and define the fundamentals
of interoperability standards. They are essential for discovering data, or code, or software, et cetera. But they are also essential to allow us to cite those digital objects, and therefore give credit and recognition to the person who has actually shared them and made them FAIR. So we can say that the nuts and bolts of interoperability standards are metadata and identifiers. These are the essential elements to build an interoperable machine. Metadata and identifiers are essential to the underlying operational process of interoperability, which allows data or any other digital object to be aggregated, to be integrated with one another,
to be compared, and to be exchanged. Now, identifiers are not the focus of this presentation, because they really require a presentation of their own, but I would like to give you some pointers and some reading material, because there is a lot of effort around defining unique, resolvable, and measurable identifiers, which are essential for digital objects. There are several efforts within the Force11 group, such as the Resource Identification Initiative, and I believe Anita Bandrowski might actually present this in the next lecture, next week. There are identifiers for researchers, defined by ORCID, identifiers for datasets, defined by DataCite, and so on and so forth. There is also quite a nice document that provides simple rules for designing, providing, and reusing persistent identifiers, and that's the reference you can see there. This is a preprint in [INAUDIBLE], and the paper has been submitted.
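To make the idea of unique, resolvable identifiers concrete, here is a minimal sketch of how compact identifiers expand into resolvable HTTPS URLs. The resolver URL patterns for ORCID and DOI are real and widely documented, but the specific identifiers below are standard documentation examples, not records referenced in this lecture.

```python
# Resolver URL patterns for two widely used persistent identifier
# schemes: ORCID (researchers) and DOI (datasets and papers).
RESOLVERS = {
    "orcid": "https://orcid.org/{id}",
    "doi": "https://doi.org/{id}",
}

def resolve(scheme: str, identifier: str) -> str:
    """Expand a compact identifier into a resolvable HTTPS URL."""
    try:
        return RESOLVERS[scheme].format(id=identifier)
    except KeyError:
        raise ValueError(f"unknown identifier scheme: {scheme}")

# Documentation-example identifiers, used here purely for illustration:
print(resolve("orcid", "0000-0002-1825-0097"))  # https://orcid.org/0000-0002-1825-0097
print(resolve("doi", "10.1000/xyz123"))         # https://doi.org/10.1000/xyz123
```

The point of the pattern is that a machine holding only the scheme and the compact identifier can always reconstruct a working link, which is what makes the identifier resolvable.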
So let's focus now especially on the metadata elements of these interoperability standards. Both identifiers and metadata need to be implemented by technical experts. To have a machine that works well, you need to put together the nuts and the bolts, and that's where you need the technical expertise: you need the modelers, you need the ontologists, you need the software developers, and so on and so forth. And we need tools; we need registries, we need catalogs of these digital objects, we need databases for data, we need services around all of this. Because we need to ensure that researchers can actually find, store, and manage all this information, and can do everything they need to do in their own research activity. We need to make all of this quite seamless for them, and to do that we need to make sure that these interoperability standards, these nuts and bolts, are invisible to them. Researchers usually have very little knowledge of, and unfortunately very little interest in, what this metadata is and what they should use, and so on and so forth. So making interoperability standards invisible
is the key to their adoption and their success in the end. So let's talk about metadata in particular. Metadata is a set of descriptors for a digital object that helps someone understand what that digital object is, where it is, how it can be accessed, who owns it, and so on and so forth. The type of descriptors that we can use for a digital object really varies, depending on the object itself. And the breadth and depth will also vary, depending on what we want to enable with them. What this actually means is that the descriptors for discovery, or for citation, or for credit, might be different from the metadata required to enable reproducibility of a certain dataset. Reproducibility will require a richer and deeper level of metadata.
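The contrast between the two tiers can be sketched as follows. The field names and values are entirely made up for illustration; they are not taken from any actual standard mentioned in this lecture.

```python
# Shallow metadata: enough to find and cite the dataset.
citation_metadata = {
    "title": "Liver gene expression in mice",   # invented example
    "creator": "A. Researcher",
    "year": 2016,
    "identifier": "doi:10.1000/xyz123",         # hypothetical DOI
}

# Deep metadata: what reproducing the study additionally requires.
reproducibility_metadata = dict(
    citation_metadata,
    organism="Mus musculus",
    strain="C57BL/6",
    age="7 weeks",
    organ="liver",
    protocol="RNA extraction protocol v2",      # illustrative
)

# Every citation descriptor is also present in the richer record,
# which only adds depth; it never discards the discovery layer.
assert set(citation_metadata) <= set(reproducibility_metadata)
print(len(citation_metadata), "vs", len(reproducibility_metadata), "descriptors")
```

The richer record is a superset of the citation record: reproducibility metadata extends, rather than replaces, the metadata used for discovery and credit.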
Let me give you some examples now, starting with software as a digital object. There are several efforts out there trying to tackle this area, especially because the infrastructure that supports the discovery and presentation of software lags behind that of many other digital objects. There is actually a great report from an NIH workshop held in 2014, the Software Discovery Index workshop, which documents the need for infrastructure around software, and some of the efforts which are in progress. I want to highlight one of those efforts, which is called CodeMeta. CodeMeta brings together academics from around the world, as well as vendors and commercial entities which are stakeholders in the area. They are working toward developing what they call a crosswalk table that translates between the different descriptors used for code and software. This would certainly enhance the discoverability and reuse of software.
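The idea of a crosswalk table can be sketched in a few lines. The vocabulary and descriptor names below are invented for illustration; they are not the actual CodeMeta crosswalk.

```python
# A toy crosswalk: the same concept is named differently by two
# (hypothetical) software metadata vocabularies.
CROSSWALK = {
    "vocabA_to_vocabB": {
        "author": "creator",
        "name": "title",
        "codeRepository": "repo_url",
    }
}

def translate(record: dict, mapping_name: str) -> dict:
    """Rename a record's descriptors using a crosswalk mapping;
    descriptors without a mapping are kept as-is."""
    mapping = CROSSWALK[mapping_name]
    return {mapping.get(k, k): v for k, v in record.items()}

record = {"author": "J. Smith", "name": "fastqc-lite", "version": "1.2"}
print(translate(record, "vocabA_to_vocabB"))
# {'creator': 'J. Smith', 'title': 'fastqc-lite', 'version': '1.2'}
```

A real crosswalk is just a larger version of this mapping, maintained by the community so that a record written against one vocabulary can be read by tools that expect another.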
There is another effort focused on the metadata for content in websites and in services, data services for example. This effort is called Bioschemas. Bioschemas brings together groups from the NIH, from ELIXIR, from commercial entities like Google, and many others, in support of the use of schema.org, the structured semantic markup used by Google, Bing, Yahoo, and the other major search engines. It is coordinating the extension of schema.org into the life science domain, so that data, events, software, training materials, and standards which are described in these services will be more discoverable by those search engines. And I want to flag that data is one of the key areas; [INAUDIBLE] already presented the effort by [INAUDIBLE] and the data metadata index, which is underpinned by a model that is also being mapped to and expressed with the Bioschemas.org vocabulary. And because this lecture is about standards, I should also flag that standards are one of the content types being tackled with schema.org, and the BioSharing group, which I will mention later in the talk, is also participating. This will ensure that standards described in this registry will also be more discoverable to widely used search engines. And this is all because of the common metadata from schema.org, extended to those domains.
So metadata standards are key to science, and I want to use the words that Mark Musen used in his first lecture, which opened this series. He said that you can only be a data parasite if you can actually find and access the data, the data is available in a standard format, and there is enough metadata that you can actually understand what the data is in order to reuse it. That's really what metadata standards are: they are enablers for doing better science. The truth is, as we already know, most of the data and the digital objects which are shared aren't exactly findable, accessible, interoperable, and reusable, for many reasons. Because the metadata isn't rich enough. Because the metadata is not harmonized; there is heterogeneity. Because things are not very well linked, they are not well identified, and therefore they're not well cited. Or they're not stored in the right place. And so on and so forth. And this is also because those who share those digital objects see this as a very time-consuming activity. Having to add rich metadata to describe your work so that somebody else can reuse it is, unfortunately, not exactly anyone's priority. And so curation activities are seen as some sort of second-class-citizen work. Which is not the case, and here I want to echo the presentation by Pascale Gaudet about biocuration and the importance of professionalizing curation and the use of standards in data curation, because this is essential. This is just one example, among many you can find, of the importance of metadata. Say we share a data file, in this case an Excel spreadsheet with some information in it. Having annotations expressed in a clear, unambiguous manner is key to ensuring that the data is actually FAIR in the end. This is an example where the information isn't clear: it's only clear to the person who shared it, not to a third party. And this is an example where everything is clearer and well expressed, and so on and so forth. Let's now move from metadata standards
to the content standards. Content standards are a type of metadata standard. Let me also say that the classification you see here is something I put together with several colleagues, because there isn't an accepted way to classify types of standards; it's simply a way to show that they are indeed nested. So a content standard is a type of metadata standard which covers descriptors at the domain level, the ones essential for interpretation, and essential for verification, reproducibility, and reusability of a dataset. As I mentioned before, the metadata for citing a dataset is less rich than the metadata for interpreting or reproducing it, where you need a high level of detail about what was actually done. And the breadth and depth of those descriptors vary depending on the type of study one is doing. But generally, what a content standard covers is: what was done? Who did it? When was it done? How was it done? And why? Content standards allow the experimental components, the design, the conditions, and the parameters of the experiment, the fundamental biological entity that has been studied, or any other concept that has been studied, an analytical process, a mathematical model, or any simulation that has been done, to be harmonized with respect to structure, format, and annotation. Then, when we share the data, we will actually know which part is the experiment, what the design was, what the conditions were, and which parameters were used. So structure is essential, not just for discovery,
but also to interpret, to understand, and then be able to reuse the data. Let's use a very, very simple example. This is just some free text describing part of an experiment, a very simple one. So what does it mean to structure the information? Structuring information means abstracting the key elements, the key content metadata, out of this information, so that when you look at the text you see the structure. You immediately see that seven weeks is the age. You immediately see that the mouse is the strain name, and that it is the subject of this experiment. You immediately see that the liver is the anatomical part that has been used. You immediately see where the protocols are, and the type of protocol this information is part of, et cetera.
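The structuring step described above can be sketched in code. The free-text sentence and the descriptor names below are illustrative stand-ins for the slide's example, not an actual curation pipeline.

```python
import re

# A made-up free-text fragment like the one on the slide.
text = "Liver was collected from 7 week old C57BL/6 mice."

# Abstracting the key content metadata out of the free text, so the
# structure becomes explicit and machine-readable.
structured = {
    "organ": "liver" if "liver" in text.lower() else None,
    "age": re.search(r"(\d+\s*week)", text).group(1),
    "strain": "C57BL/6" if "C57BL/6" in text else None,
    "subject": "mouse",
}
print(structured)
# {'organ': 'liver', 'age': '7 week', 'strain': 'C57BL/6', 'subject': 'mouse'}
```

In practice a curator does this far more carefully than pattern matching, and anchors each value to a controlled vocabulary term, but the output has the same shape: named descriptors instead of prose.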
Now let me choose another example. I also work for Scientific Data, so allow me to use it as a simple example. This is a data journal where data publications are anchored to structured metadata, as simple as that. If you go on the website, in the center of the homepage, you will find the "isaexplorer." The isaexplorer is a simple tool that reads and visualizes the experimental metadata, which has been curated by our curation editor, Varsha, who goes from free text to structure. What is implemented in the tool then allows users to immediately find the data papers relevant to the type of design they're interested in, or the type of organ, or the technology, or a combination of these. And I can't stress enough the value of curation. Again, here I have a tiny screenshot from Pascale's presentation, because it is essential: standards are enablers, but professionals like curators are the ones who put the standards into action. Now, let's look at the types of content
standards which are out there. Again, there isn't any widely agreed classification of content standards; this is just a way to give some structure to what is out there. I would group content standards into three types: guidelines, terminologies, and formats. The guidelines, starting from the right-hand side, are what people also call minimum information guidelines, or checklists. You might know the MIAME guidelines, for example, one of the first, published in 2001: the Minimum Information About a Microarray Experiment. MIAME launched a trend, and there are now a lot of minimum information guidelines covering different domains and different technologies. The guidelines are essential because the community designs and defines the metadata that need to be reported: what is the essential metadata that must be reported so that somebody else can understand what was done?
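A minimum-information checklist is, in effect, an agreed list of required descriptors, so checking a record against one is straightforward. The checklist fields below are invented for illustration and are not the actual MIAME items.

```python
# Hypothetical minimum-information checklist: the descriptors a
# community agreed must be reported for this kind of experiment.
CHECKLIST = {"organism", "organ", "age", "protocol"}

def missing_fields(record: dict) -> set:
    """Return the checklist descriptors absent from a metadata record."""
    return CHECKLIST - set(record)

record = {"organism": "Mus musculus", "organ": "liver", "age": "7 weeks"}
print(missing_fields(record))  # {'protocol'}
```

A submission system can run exactly this kind of check at deposit time, telling the author which agreed-upon descriptors are still missing before the data is shared.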
Then there are terminologies. Terminologies broadly cover controlled vocabularies, taxonomies, thesauri, ontologies, and so on and so forth. They are essential because they provide an unambiguous identification and definition of the core information that is being reported; the Gene Ontology is one example. I'm not going to go into more detail, because [INAUDIBLE] did an excellent job presenting ontologies. The third type of standard is the format. These are commonly called conceptual models, schemas, exchange formats, et cetera. They represent the structure that interrelates information and allows, for example, data to be shared from one system to another. A FASTA file is an example of a format.
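Because FASTA is such a simple, widely shared exchange format, parsing it fits in a few lines; the agreed structure (a ">" header line followed by sequence lines) is what lets any two systems exchange sequences. The record below is a made-up example.

```python
# A tiny, made-up FASTA record: a '>' header line, then sequence lines.
fasta = """>seq1 example mouse liver transcript
ATGCGTAC
GATTACA
"""

def parse_fasta(text: str) -> dict:
    """Return a {header: sequence} mapping from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif line and header is not None:
            records[header] += line
    return records

print(parse_fasta(fasta))
# {'seq1 example mouse liver transcript': 'ATGCGTACGATTACA'}
```

Any tool that honors this convention can read sequences written by any other, which is precisely the interoperability a format standard buys you.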
There is a fourth type of standard, which is slightly different because it has been developed in a specific domain and is driven specifically by health care industry needs. These are the common data elements, or CDEs, which support both patient care and secondary uses of the data, for example disease surveillance, population and public health, clinical research, and also reimbursement. Let's now look at content standards in numbers, because there are a lot of them. I would say there are, more or less, a thousand known guidelines, terminologies, and formats; the counts, as you can see, come from the BioPortal and BioSharing resources combined. In terms of common data elements, taking the NIH Common Data Elements Repository as a reference, there are almost 20,000 elements. Let's now start looking a little bit at the
communities which develop those standards: what their motivation is, who they are, how they're assembled, how they create the standards, and what the lifecycle of a standard is. It's very important to understand who produces the standards and why they're being produced, et cetera. Again, there isn't an agreed way to classify the groups, but I divide them between the de facto standards, which come from grassroots initiatives, like the Global Alliance for Genomics and Health, the Proteomics Standards Initiative, the Metabolomics Standards Initiative, the Genomic Standards Consortium, the imaging groups, and so on and so forth. Those are really bottom-up communities which have an interest in, perhaps, sharing data between systems, and the standards they develop are, obviously, free to use. They are mostly, in fact entirely, volunteer efforts, and they tend to have very little money to do the work, and also to provide training, et cetera. Conversely, you have the standards developing organizations on the other side, which are more formal authorities. The openness to participation here really varies; not all of them are open. You may need to be a member, or even nationally nominated as your national representative; it really depends on the group. In many cases, the standards produced by those groups are actually sold under a license, and in the best-case scenario the license is at no cost, but not always. Usually, charges apply if you want training or certain kinds of access to the standards; that's why they're self-supporting. And that's also why the grassroots initiatives struggle with funds: everything is free, everything is open. It is a complex landscape out there. In order to understand what a group
does, it’s also important to understand that a standard be developed from this different
perspective and they have a different focus, actually. There are groups which are focused on a specific
biological or clinical domain. They are interested in neuroscience. But there are others that come together because
they want to model processes, or there are others which are interested in bioimaging,
no matter the sample, no matter the application. Some come from the biological perspective,
and other from the technological perspective. The motivation for which this group come together
also vary. There are some that actually create new standard,
or they want to fill a gap that exists between certain standards. Some actually come together to map, to harmonize
effort, which can be complementary or contrasting. Some try to extensions. Some try for repurpose the standard for other
places. Also the type of people that participate in
this effort, it’s really very diverse, and actually, it should be the case. Because it is essential that all the stakeholders
are represented. It is essential to have the technologists,
the one that bring the knowledge, the builders, the data ontologist, the modelers, et cetera,
but it’s also important to have the researcher, the handy users because they bring the use
cases. You need both. You need the consumer and the producer at
the same time. And generally this group too have representative
from academia, industry, governmental, and funding agency, as well. And you see people that are actually involved
in the sector, and they cover different roles. But they all manage somehow digital object. Because that’s the reason why you would want
to volunteer and participate in these efforts. Understanding the landscape, understanding what's out there, how we can use content standards, what's ready to use, what needs to be further developed, et cetera, has been the focus of a couple of BD2K workshops on standards. Here is the reference to one in 2013 and one in 2015; the 2013 workshop also has a longer report available, and the 2015 one has a report available as well. Both workshops highlighted that, of course, standards are essential: we need to build them, and we need to understand how to help the community build them. In understanding how to help, it's also important to highlight what the challenges and the pain points are, and both workshops found that the pain points are much the same. It's not just a technical problem; it's also a social problem, and social engineering is as important as technical engineering in these cases. They also illustrated the fact that the lifecycles of standards seem to be much the same, regardless of the type of organization. Here, I'm going to spend a couple of slides
in explaining what the lifecycle of standard development looks like. There are three phases. The first phase is formulation, where the groups come together and the right people are assembled. It's very important here to assemble the right people, to have the right expertise, and the right variety and heterogeneity as well. This is when the use cases are defined and the scope is defined: what is in and what is out, which is also important, as is prioritization. For example, collecting competency questions is one of the first things these groups do; those questions then guide them in deciding what's in scope and what's not. From those questions, they start to extract the metadata that are needed, and then decide how it can be structured and represented, et cetera. The formulation is followed by a development
phase. The standard iteration here is usually done by a smaller group. This is especially true in grassroots organizations, where there is little money, so very few people actually have the time and the possibility to come and do the work for the others. But it's also important to maintain continued engagement with the broader community, for testing, for getting feedback, and for evaluation cycles; that's essential. And in this phase it's also very important to be able to analyze the different perspectives of the people working on the effort, and what the options are, both the technical options and the solutions. The third phase of the lifecycle is called
maintenance. Maintenance is when the community starts building exemplar implementations. This is the phase where you actually start implementing, making the standard invisible: you have the tools, the databases, and the services that make the standard invisible and usable. This is also the phase where it's essential to have documentation, both technical documentation and user guides, education material, and some metrics by which people can understand whether they actually meet the standard or not. It's also the phase where the communities start thinking about how to sustain themselves and how to move to the next phase. Because a standard is not something that is done once and then static; it's a very dynamic thing. Science changes: there is a new type of process to explore, or the process is enriched by new information, or a new technology comes out, and therefore standards have to be extended, adjusted, adapted. They need to evolve as science evolves. There are versions, and there is backward compatibility between versions; there are conversion models, et cetera. It's not a trivial thing, and backing and developing standards is a kind of lifelong commitment, I would say. Let me also highlight a couple of pain points, because it's important, if you're a producer
or a consumer of standards, to be aware of the difficulties in building or using them. There are key issues. One is fragmentation: there are so many activities. This is great, because they are rooted in real needs and use cases; if you are an epidemiologist, why would you want to develop a standard for plants? Of course you don't: you assemble a community which, like you, has the same interest. But what actually happens is that because these communities are so diverse, and each has its own environment to work in, they tend to unnecessarily duplicate metadata elements which are common, regardless of the type of biological sample or the technology used. This picture is actually trying to illustrate the problem of fragmentation. I could go on forever explaining fragmentation, but I don't want to; I think this gives enough of a picture of the difficulties. There are other pain points besides fragmentation: the need to coordinate, to harmonize, to handle extensions. How do you incentivize the people that work with you, the contributors? How do you bring somebody new in when you need new expertise? How do you manage governance when you have, for example, industry participating? Who owns the standard? Who owns the definition of the standard? Who owns the use cases of the standard? There is no ownership, especially in grassroots
initiatives. But it matters, because people want some reward and some credit when contributing. Funding streams are always a problem, although it's important to acknowledge that the NIH in particular has launched a specific funding program for standards, which is going to be essential to help the community. There is also an important pain point around indicators and evaluation methods: it's really hard sometimes to tell whether a standard is successful, has been successful, why it has helped, and how. Implementations, as you have seen, are essential: it's essential to make the standard invisible, so we need the tools. Outreach and engagement are always critical; this is the social engineering part. There is another important factor here, which is the synergy, or I would say the lack of synergy, between the efforts around basic research and those that come more from health care and clinical and medical care. There's very much a division there; it's a divide, and we need a bridge. And of course education, documentation, and training are really very time-consuming, and grassroots efforts in particular, which are volunteer jobs, struggle to develop enough material and to keep the material up to date. And of course, a business model for sustainability will be essential, and this is something we have at heart
to actually identify. OK. This brings me, with 14 minutes left, to the last part of my lecture. What I want to discuss now is this: we have this wealth of standards out there, especially the content standards that I'm focusing on. But if I am a consumer of these standards, where do I go to find what I need, and who can help me find it? Or, if I am a producer of a standard, say I have just produced a new format with my community: how do I make sure that my format is visible to other people operating in the same area, who could perhaps join forces with us? Or at least, by making it visible, I can make sure that they are not duplicating it. Understanding a standard, and the context associated with it, is essential for both consumers and producers of standards. And beside the example questions I've just mentioned, there are many others. For example, you have curators and developers
who want to use the latest version of a certain ontology, to make sure they are annotating with the right version. Or a developer wants to create a data submission tool and wants to know the latest format used by the database they're targeting, or whether something has been deprecated and replaced by something else. Researchers, of course, have questions which are in a way simpler: they just want to know which standard is good for their toxicological data, or which standard is endorsed or recommended. But we also need to bear in mind that among the stakeholders in this scenario there are also the funders and the journal editors, and others such as librarians and research data managers, who are creating guidelines and policies for authors and the research community. They also need to be guided; they have the same questions. Which standard do I recommend in this policy? Is there a list approved by this community? Which ones have been funded? Which ones are implemented in a certain database? OK. Let me give you a couple of directions here
where to go to find things. But let me also tell you that there isn't a single place where you can find everything, because this is a landscape which is still being mapped. If you focus on ontologies, then certainly, I would say, you go to BioPortal, where you deposit your ontology files and make them visible; BioPortal isn't just a repository, it also provides you with tools to do many other things, such as mapping and annotation. If you are doing ontology engineering, building an ontology, and you want to make sure that you follow proper guidelines and principles, so that your ontology will be a piece of the puzzle, interoperable and orthogonal to other ontologies, then I would say the OBO Foundry is the place to go. Get the guidelines there, and make sure that your ontology is also present in the OBO Foundry, so that you become part of this larger community of people creating orthogonal ontologies. Now, ontologies are only one type of standard, one type
of content standard, as I expressed before. There are formats; there are the checklists. But then there are also the tools that implement a standard, and the databases that implement a standard. And I think what is important is to bring this landscape together. Because in order to tell you whether a standard is mature, one good indicator of maturity could be that a database has implemented it: it means that there has been more testing, that there is annotated data, and that at least you can verify whether that standard works for you and whether it has improved the quality of reporting for that dataset. Even indicators of maturity like these are still being developed. That is also why I was pointing to BioSharing, which is part of the ELIXIR infrastructure. BioSharing is a curated, informative, and educational resource which links standards, at the moment only content standards, to databases and to policies, to understand who uses what. It is really there to map the landscape, and to monitor the development and evolution of standards. It isn't just to say that a standard exists: is it the latest version? Is the committee still supporting it? Has it been implemented? Has it been used? So BioSharing connects to BioPortal, because it is essentially the same list of terminologies and ontologies; but BioSharing extends to formats, models, reporting guidelines, et cetera, and links these to databases.
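The kind of linking described here, standards connected to the databases that implement them and the policies that recommend them, can be sketched as a small data structure. A minimal Python illustration; the record names and fields are invented for the example, not the actual BioSharing schema:

```python
# Sketch of a registry linking content standards to the databases that
# implement them and the policies that recommend them. All names and
# fields are illustrative, not real BioSharing records.
registry = {
    "standards": {
        "MIAME": {"type": "reporting guideline"},
        "SBML":  {"type": "format"},
    },
    "databases": {
        "GEO":          {"implements": ["MIAME"]},
        "ArrayExpress": {"implements": ["MIAME"]},
    },
    "policies": {
        "Example journal data policy": {"recommends": ["GEO"]},
    },
}

def who_uses(standard):
    """Return the databases that implement a given standard."""
    return [db for db, rec in registry["databases"].items()
            if standard in rec["implements"]]

print(who_uses("MIAME"))  # both example databases implement MIAME
```

Even this toy structure answers the "who uses what" question the speaker raises; the real registry adds curation, versioning, and many more record types.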
Another message I want to bring up is that standards are digital objects. So far I have talked about standards as enablers: standards enable digital objects to be FAIR. But standards are digital objects in their own right, and they also need to be FAIR. Standards need to be findable, at the very least, and accessible; and of course interoperable and reusable. OK. Let's say that these portals, at least, really
address the findability of standards and, specifically, their accessibility. This is one of those slides that nobody can read, but the reason I wanted to take a screenshot of a record in the BioSharing portal that describes a standard, in this case the Systems Biology Markup Language, SBML, is that records of standards themselves need content descriptors. For a standard to be findable, it needs its own key descriptors, so that people can use those descriptors to find the information they want. It is important to describe the standard: which group is creating it? For which taxonomic range is the standard applicable? What area does the standard cover? And this can be really tricky: in this case it is, of course, a [INAUDIBLE] model, but it also covers molecular entities, pathways, et cetera. It is important to tag the standard, as a way to make it discoverable. And it is important to say whether there is support: who the group is, who maintains and updates the record, who has funded the standard, whether there are associated publications, and whether there are related standards. This is what I meant by mapping the landscape: describing the standards so that they are findable. But describing alone is not enough. For a consumer to actually pick the
right one, and I put "right" in quotation marks, you need to have indicators. Now, those indicators do not yet exist, but they are being developed, and hopefully pretty soon there will be a set of indicators to help you choose. Those indicators will become richer and richer as the community agrees on which are the right indicators to use. At the moment BioSharing, for example, uses only a few indicators to start with. It marks what is ready for implementation, what is in development, what has an uncertain status, and what has been deprecated; and all of these tags are vetted with the community behind each standard. An uncertain status might mean that the group is no longer active: the standard is there, it is still good, and it could perhaps even be used by databases, but the community behind it is no longer active. In development means work has just started. Ready for implementation means the standard is either ready but not yet implemented, or already implemented.
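The status indicators just listed (ready for implementation, in development, uncertain, deprecated) can be modeled as a small vocabulary that a registry could filter on. A sketch only; the standard names are placeholders and the labels simply mirror the talk:

```python
from enum import Enum

class Status(Enum):
    READY = "ready for implementation"
    IN_DEVELOPMENT = "in development"
    UNCERTAIN = "uncertain"        # e.g. the group behind it is no longer active
    DEPRECATED = "deprecated"

# Illustrative records; the names are placeholders, not registry content.
standards = {
    "Standard A": Status.READY,
    "Standard B": Status.IN_DEVELOPMENT,
    "Standard C": Status.DEPRECATED,
}

def adoptable(stds):
    """Keep only the standards a consumer should consider adopting now."""
    return [name for name, status in stds.items() if status is Status.READY]

print(adoptable(standards))
```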
Let me show you an example using standards from the LINCS group, which is part of the NIH BD2K program. These are just two records, showing the LINCS standard for reporting a cell line; it is a reporting guideline. On the left-hand side is a version which has been deprecated: version 1, superseded by version 2. And this is important, because you may have people who used version 1. They still need to be able to find that version of the standard, but they will need to be directed to the latest version. And where possible, and where it is known, BioSharing works with the community, in this case the LINCS community, to understand the reason for the deprecation and to record it, so that the reason becomes public, and somebody who comes to the registry and discovers the LINCS standard knows why it has been deprecated.
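The version-1 to version-2 relationship just described, where a deprecated record points to the record that supersedes it, suggests a simple resolution step. A sketch with hypothetical record identifiers, not the actual LINCS or BioSharing records:

```python
# Each record may name the record that supersedes it; following that
# chain leads a user of a deprecated version to the current one.
records = {
    "cell-line-guideline-v1": {"deprecated": True,  "superseded_by": "cell-line-guideline-v2"},
    "cell-line-guideline-v2": {"deprecated": False, "superseded_by": None},
}

def latest(record_id):
    """Follow superseded_by links until reaching a current record."""
    seen = set()
    while records[record_id]["superseded_by"] and record_id not in seen:
        seen.add(record_id)  # guard against accidental cycles
        record_id = records[record_id]["superseded_by"]
    return record_id

print(latest("cell-line-guideline-v1"))  # resolves to the v2 record
```

Note that the deprecated record is kept, not deleted: as the speaker says, people who used version 1 still need to find it, and then be directed onward.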
Searching for standards can be tricky at the moment, because it is important to classify and categorize them, and this portal is of course a work in progress. But I also want to underline the fact that there are ways to get packaged information. There are ways to package a set of standards that is important for your domain, whether that is around pathway information, or clinical data, or imaging; it is possible to group records, and we are working with the communities to prepackage standards, so that we can direct you to a group of standards for a certain domain and make search and discovery easier. But I want to go one level deeper. I know I have four minutes in front
of me, so I should make it. This is the last topic I will cover, and it goes deeper. What I show you here is that the three types of standard, guidelines, terminologies, and formats, are actually related to each other. There are relations that need to be built: for someone to know which guideline is implemented by which database, whether there is a terminology the guideline recommends, or a format the guideline recommends, we need to actually build these relations, and we need them to be discoverable, so that standards are used. It is a complex web of relations, so let me give you an example, hopefully a pretty simple one. On the left-hand side you see two databases; I have taken the most well-known databases for gene expression and transcriptomics data, NCBI GEO and EBI ArrayExpress. Both databases implement the same checklist, a reporting guideline called MIAME. But the two databases require different formats for data submission. GEO implements the [INAUDIBLE] format, and not just that one, it is one out of many, but I want to keep the example simple; ArrayExpress requires data submission in MAGE-TAB. Now, these relations need to be built. And indeed, what I am showing you is that those databases use more than just these standards; they use many more. Nevertheless, they both use the same reporting guideline, implemented in different formats and expressed with different terminologies. If a developer comes and says, I want to build a data submission tool for GEO, which standards does GEO use? They will be able to trace back and understand, by looking at the registry and at these relations.
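The GEO and ArrayExpress example can be written down as a tiny set of relations, and the developer's question, which standards does GEO use, becomes a lookup. A sketch; only MIAME and MAGE-TAB come from the talk, and GEO's submission format is left as a placeholder:

```python
# Relations between databases and the standards they use, as in the
# slide: both databases implement the MIAME reporting guideline but
# require different submission formats.
uses = {
    "GEO":          {"guideline": "MIAME", "format": "GEO submission format"},
    "ArrayExpress": {"guideline": "MIAME", "format": "MAGE-TAB"},
}

def standards_for(database):
    """Trace back from a database to the standards it uses."""
    return uses[database]

print(standards_for("GEO")["guideline"])  # MIAME
```

A tool builder targeting GEO would follow exactly this trace: registry record for the database, then outward along its relations to guideline and format.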
The next couple of slides look very hard to read, and they really are, but they show how this network of relations is quite complicated, although very useful. This is just an illustration of the relations between standards, databases, and the policies of different journals that recommend them. But the [INAUDIBLE] of these relations is essential. Because, as you can see from this now very dense network graph, although there is a relation between certain standards and the databases that a policy recommends, there are many more standards and databases that, for example, a journal could recommend and that a user could pick as relevant to them; they are just not necessarily immediately visible, because there is so much. The user needs to be guided, you need to be guided, in looking for the right standard, and what we are creating is a network that will inform and guide what is really important. Let me just point very briefly to the right-hand side, where you can see orange dots representing databases which are recommended by the PLOS data-sharing policy. Those databases are model-organism databases, and all of them use standards; all of them use an ontology, GO, which is in grey just behind. Showing the PLOS editors that the databases they recommend also use a standard will be essential, because the PLOS policy can evolve, and they could recommend GO as a standard for annotation. Crediting and recommending standards in this way, and giving them broader visibility, is essential to help the community keep building them and make them even more usable. I have just realized from my watch that I have
reached the 50 minutes, so I want to close with this summary slide, which actually isn't a summary slide: it is a reading-material slide. It points again to the reports from the two NIH workshops, which I think are very informative in terms of standards, metadata, and community; many people contributed to those workshops and reports. I also want to point you to the FAIR paper by Wilkinson et al., and to a report that a colleague and I put together as part of a Wellcome Trust literature review, which unfortunately is not yet available; this is the link, and I have been told it will be available in a week or so. And I want to close with the message that, as data science grows, digital outputs are being recognized as first-class citizens. I think we need to go one step further and also recognize that standards, interoperability standards, are digital objects in their own right, and that they should count, with their own associated research, development, and educational activities. Thank you. Susanna, that was really outstanding. Thank you so much for this fantastic lecture
on metadata standards; so comprehensive, really a tour de force. Thank you. We do have a few questions already lined up. For those of you listening, if you have questions, you can use the question feature on the GoToWebinar panel, probably on the right-hand side of your screen; please submit them, and we will try to get to them all if we have time. The first question that has caught my attention is: what suggestions do you have, Susanna, for how to bridge the experts, the decision makers, and the implementers in this process of standards development, use, and practice? Well, the best advice is really what the
communities are doing: they are bringing those stakeholders to the table from the beginning, at the formulation phase. It is essential, as I illustrated in the lifecycle, that the formulation phase includes all stakeholders: not only those who will build the standard, but also those who will use and implement it. It is essential to bring them to the table from the very beginning. It doesn't assure adoption, but it certainly gives you the best shot. Yeah, I would say that is a very important element of this; otherwise, you run the danger of a "not invented here" syndrome among people who may feel they have been alienated somehow. Another question is: can you talk about
taking ontologies and standards and making them computable using the semantic web? Yeah, definitely. Semantic web technologies are widely used by many of these standards communities. When I say format, as I said, this is really broad: there are standards which are expressed in RDF, for example. So absolutely.
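Expressing a standard's metadata in RDF means stating it as subject, predicate, object triples. A minimal sketch using plain Python tuples; the predicate names are invented for illustration and are not a published vocabulary:

```python
# Triples describing a standard, in the subject-predicate-object shape
# that RDF uses; the predicates here are invented for illustration.
triples = [
    ("SBML", "type",         "format"),
    ("SBML", "domain",       "systems biology"),
    ("SBML", "maintainedBy", "SBML community"),
]

def objects(subject, predicate):
    """Query the toy triple store for matching objects."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("SBML", "domain"))
```

Real RDF tooling adds URIs for every term and a query language (SPARQL) over exactly this triple shape; the point is only that the record-style descriptors discussed earlier map naturally onto triples.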
When one has data dictionaries for metadata standards, these sort of go hand in hand. Is there any implication for reproducible science that one can draw from the role of data dictionaries, linking them with standards, and then using those to help promote reproducible science? Does that make sense? What is the definition of a data dictionary? I think we all have different definitions. For having-
You're going to need a dictionary to keep track of all the terms for data dictionaries and such. I think it is really a list; it is almost a conflated concept in itself, in the sense that you have a data element, and it has, say, a range or a data type, and these are the allowed values, and if there is a missing value, it might hold this sort of value. On one hand, it is almost synonymous with an ontology, or some sort of metadata standard, but perhaps there is a subtle distinction there. Or are we really talking about the same thing?
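The data-element description in this question, a name with a data type, allowed values, and a missing-value code, might look like the following as one data-dictionary entry. The field names and values are illustrative:

```python
# One entry of a hypothetical data dictionary, as described in the
# question: a data element with a type, its allowed values, and a
# code used when the value is missing.
data_dictionary = {
    "smoking_status": {
        "type": "string",
        "allowed_values": ["current", "former", "never"],
        "missing_value": "unknown",
    }
}

def is_valid(element, value):
    """Check a value against the dictionary entry for an element."""
    entry = data_dictionary[element]
    return value in entry["allowed_values"] or value == entry["missing_value"]

print(is_valid("smoking_status", "former"))
```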
Well, I think there is a subtle distinction. And to be honest, the classification I have shown, guidelines, terminologies, and formats, is really just a way to try to classify things, because I have colleagues who argue that ontologies have a structure, have a format. Of course they do; they express a format. I think it is very, very hard to box these things, because there are a lot of different ways to structure the information, and it is hard to say which is the right way. The problem is that, because there is such a variety of ways to structure and express standards, there is a lot of fragmentation as well. But it's also essential-
I think- Yeah. Someone once said that the best thing about standards is that there are so many of them. In a sense, there is a variety of them, and people will often vote with their feet for the one they like and that does the best job. That also gets to the point of versioning of standards, and I'm sure you probably touched on this, at least briefly or in part. I'm wondering how versioning factors into the evolution of standards, how important it is, and whether you want to comment? It is actually essential. Essential. Well, I would say that versioning, in terms of practice, is used especially in the ontology world, but it is also true for formats. Again, here the practice is not widespread and is still heterogeneous, depending on the community and their approach. Right. Let's see, are there any other questions? One last question is: are there
any emerging metadata standards that have been developed specifically for the computational sciences that you might be aware of? Computational sciences: I would refer you especially to the COMBINE group. COMBINE brings together computational modeling and systems biology, and it gathers the type of standards this person is perhaps interested in. Otherwise, please contact me, and I'll try
to point you to the right one. That sounds wonderful. Well, everyone, we are at the top of the hour now, and I'd like to thank Susanna once again for a really fantastic and rich survey of metadata standards and all that it entails. I want to thank you all for joining us this morning, and to encourage you, once again, to tune in next week, as they say, for the next in our series of data science seminars. Enjoy your weekend. Thank you again. Thank you for having me. Thank you. Thank you, Susanna.

Daniel Ostrander
