Discussion #2661

[Register Federation] Registration process

Added by Daniele Francioli over 3 years ago. Updated over 3 years ago.

Status:Assigned
Priority:Normal
Assignee:Daniele Francioli

Description

This discussion describes the registration process needed to insert a registry in the RoR.

Background

We decided to split the Registry descriptor and the Register descriptor for the following reasons:

  • Avoid one huge file (difficult to edit and load)
  • Avoid a big load on the RoR: the RoR would have to read a huge file for each registry, which could increase harvesting time
  • Allow multi-threaded execution of the harvesting: the system could potentially contain hundreds of registers to be harvested. Keeping the files separate allows the workload to be split in a multi-threaded implementation of the RoR.
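
To illustrate the last point, here is a minimal sketch of how separate register files could be harvested in parallel. It is not the RoR implementation: the names harvest_all and fake_harvest and the example.org URLs are hypothetical, and the real download and indexing step is replaced by a stand-in function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def harvest_all(register_urls: Iterable[str],
                harvest_one: Callable[[str], str],
                max_workers: int = 4) -> Dict[str, str]:
    """Harvest every register file in its own task and collect the outcomes."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(harvest_one, url): url for url in register_urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failing register must not stop the others
                results[url] = f"failed: {exc}"
    return results


def fake_harvest(url: str) -> str:
    """Hypothetical stand-in for downloading and indexing one register file."""
    if url.endswith("broken.rdf"):
        raise ValueError("not well-formed")
    return "ok"


if __name__ == "__main__":
    urls = ["http://example.org/theme.rdf", "http://example.org/broken.rdf"]
    print(harvest_all(urls, fake_harvest))
```

Because each register lives in its own file, a failure in one file only affects that register's entry in the result.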

Description

The RoR user interface will have a private area (using ECAS authentication) where you can specify the URL of the Registry descriptor.

The RoR will accept:

For detailed information related to the Registry and Register descriptor read the Conformance Classes page.

The system will import all the metadata related to the registry, all the metadata for each of the registers specified in the list, and all the relations between the registers. The RoR will then start building the search index for each of the registers.

The harvesting process can be started manually or can be automatically handled by the RoR based on the update frequency specified in the Registry and Register exchange files (in this case the Registry and Register descriptor shall be conformant to the Automatic Harvesting Conformance class).

registry.rdf - Working draft of the registry exchange file (9.04 KB) Andrea Perego, 21 Jan 2016 10:02 am

eionet_Registry_example.rdf (4.82 KB) Daniele Francioli, 25 Jan 2016 09:25 am

inspire_Registry_example.rdf (9.04 KB) Daniele Francioli, 25 Jan 2016 09:25 am

eionet_DesignationSchemeValue_example_hv.rdf - Modified example eionet_DesignationSchemeValue (corrected use of xml:base) (4.45 KB) Heidi Vanparys, 05 Feb 2016 04:46 pm

inspire_DesignationSchemeValue_example_hv.rdf - Modified example inspire_DesignationSchemeValue (corrected use of xml:base) (4.34 KB) Heidi Vanparys, 05 Feb 2016 04:46 pm

History

#1 Updated by Daniele Francioli over 3 years ago

  • Description updated (diff)

#2 Updated by Andrea Perego over 3 years ago

  • File registry.rdf added

A working draft for the registry exchange file is attached to this page ( registry.rdf ).

This file is a live document, and it will be modified based on the comments and decisions taken.

#3 Updated by Daniele Francioli over 3 years ago

  • Description updated (diff)

#4 Updated by Daniele Francioli over 3 years ago

  • File deleted (registry.rdf)

#5 Updated by Andrea Perego over 3 years ago

  • File registry.rdf added

#6 Updated by Michael Lutz over 3 years ago

As a follow-up from the discussion on Monday, please find some further details on the proposed registration process in this issue. Please add your comments and questions as comments.

Thanks!

The JRC registry team

#7 Updated by Christian Ansorge over 3 years ago

Dear colleagues,

 

Thank you very much for all your work. It is clearly a big step forward. Still, during today's discussions some aspects of the proposals you presented took us by surprise, as they hadn't been discussed in that form earlier. We needed to discuss some of the aspects you showed in more detail internally to get a clearer picture for ourselves.

 

Harvesting approach and Metadata on Registry:

We acknowledge that from the very beginning we assumed that in a registry federation there would be two mechanisms of data exchange, harvesting on the one hand and file upload on the other. While there is formally nothing wrong with the harvesting approach and the related suggestion for metadata exchange about registries, we consider this step an optimisation which should be addressed later in the development of the federation. And as said in the last meeting, it might add a level of complexity and vulnerability which isn’t needed. Instead we simply assumed that for the test bed we would go for the simplest way of data exchange, i.e. the upload of exchange files. We would suggest to focus on the exchange file and manual upload for the first stage of the testbed (where we have still enough to work on and test) and continue with harvesting once things are stabilized. If performance becomes a problem at a later stage, then separating the metadata from the data would be a good measure to look into. The underlying assumption here is then that the exchange file only contains the registries to be shared with the federation.

 

Exchange file for Metadata on Registries:

If this approach is to be used, we think that the time of last update should be given for each dataset listed individually in order to make it work. Otherwise the RoR would still have to access each dataset to check for updates on the level of the concepts. Maybe this was the original intention, but it was unclear to us from the example provided.

e.g.

<dcat:dataset rdf:resource="http://inspire.ec.europa.eu/theme"/> .. [add date of last update and status] ..

<dcat:dataset rdf:resource="http://inspire.ec.europa.eu/applicationschema"/> .. [add date of last update and status] ..

As the RoR shall just harvest the metadata about Registries, why do we add individual update frequencies? As file size is not an issue and we do not expect frequent changes anyway, we might just agree on a daily harvest interval as default for the sake of simplicity.

 

Exchange file for Registers:

There are some questions and potential issues we discovered when taking a closer look.

  • We agree with what was said during the meeting that the exchange file should not contain any (external) concepts from the register it aims to extend.
  • We are missing the status of the concept (e.g. valid, retired, …)
  • In the example on https://ies-svn.jrc.ec.europa.eu/issues/2615 under “Section 3: Concepts”, the concept is identified by rdf:about="CurrentUseValue/2", while the actual URI and ID of the concept is described in the dcterms:source element. Is there a specific reason to do so, or can we simplify it further by removing dcterms:source and provide that information in rdf:about instead, as it was initially proposed and discussed?

 

RoR Interface:

In the RoR interface we are missing the publisher information, which would be very useful and is actually delivered in the exchange format.

 

Thank you very much. As I said earlier it is clearly a big step forward, just that we have to agree on the general approach to be used for the test bed.

 

Best regards

Chris & Michael

#8 Updated by Daniele Francioli over 3 years ago

  • File deleted (registry.rdf)

#9 Updated by Andrea Perego over 3 years ago

#10 Updated by Andrea Perego over 3 years ago

New working draft of the registry file, including direct download URLs of the registers' files ( registry.rdf )

#11 Updated by Daniele Francioli over 3 years ago

Thank you for your feedback. We always try to follow the outcome of the discussions during the web meetings and to address it in the prototype. These discussions are very useful for finding the best way to implement the project.

Harvesting approach and Metadata on Registry:

The main issue here is the registration of the Registry and the related list of Registers to be federated.

In detail, we have 4 options:

  1. The manual registration of the Registry metadata using a web form including the manual specification of a list of URLs pointing to the registers’ exchange format. The RoR will then automatically harvest the files based on the update frequency specified by the user and the modification date in the exchange format.
  2. The manual registration of the Registry metadata using a web form plus a manual upload of the registers exchange format. The RoR will run the harvest only once after the upload.
  3. The manual registration of the Registry metadata using a web form containing all the information of all the federated registers (one big file). The RoR will run the harvest only once  after the upload.
  4. The automatic harvesting based on the Registry and Registers Exchange format. The only action that a user needs to perform is the specification of a URL pointing to the Registry exchange format (that can be done also manually due to its simplicity). This file can be stored in any web-accessible space. As explained above, the RoR will then read each of the Registers. 

All 4 options are actually quite similar, because they depend on the creation of the RDF/SKOS representation for each register according to the agreed exchange format. The main difference is how to communicate the registry metadata and the registers to be included in the federation to the RoR – either manually through a web form (options 1-3) or through a registry metadata file (again according to an agreed format) that is published on the web by the registry owner.

We believe that it will ultimately be less work for the registry owner to simply update the registry metadata file on their own web server than to have to manually update the information in some web form. Also, it would be less error-prone (less easy to forget the update in the RoR). Also, it will probably be less implementation effort on our side to only implement the harvesting approach (and not also a web form). Ultimately, this task will in most cases be done by some kind of registry software and not manually.

The "big file" solution has additional problems related to performance and reliability (if the system starts harvesting the file and something goes wrong, it has to start again from the beginning). In addition, if something changes you have to re-run the complete procedure even if there are only a few changes.

Exchange file for Metadata on Registries:

The idea is to keep the modification date in the Register file in order to avoid frequent modifications of the Registry metadata file. So basically the idea is to start by reading the Registry metadata file and, as a starting point, to compare the dates.

The Register file has its own update frequency because it could be that different registers have different update frequencies.
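
As a rough illustration of the date comparison, the sketch below selects the registers that need to be harvested again. It assumes the modification dates have already been read from each Register file (for instance from a property such as dcterms:modified, which is an assumption here, not something fixed in the exchange format). The function name registers_to_harvest is hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional


def registers_to_harvest(published: Dict[str, datetime],
                         last_harvested: Dict[str, Optional[datetime]]) -> List[str]:
    """Return the register URIs whose published modification date is newer
    than the date stored at the previous harvest, or that were never harvested."""
    due = []
    for uri, modified in published.items():
        previous = last_harvested.get(uri)
        if previous is None or modified > previous:
            due.append(uri)
    return sorted(due)


if __name__ == "__main__":
    published = {
        "http://inspire.ec.europa.eu/theme": datetime(2016, 1, 20),
        "http://inspire.ec.europa.eu/applicationschema": datetime(2016, 2, 1),
    }
    stored = {
        "http://inspire.ec.europa.eu/theme": datetime(2016, 1, 20),
        "http://inspire.ec.europa.eu/applicationschema": datetime(2016, 1, 15),
    }
    # Only the application schema register changed since the last harvest.
    print(registers_to_harvest(published, stored))
```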

Exchange file for Registers:

  • The register file has been updated in order to remove the information replicated from the extended item (#2615). Please note that we did not propose to leave out the re-used concepts completely, only to not require anything else than their id and the inScheme relationship.
  • We added the status only to the relations. Is it useful to have the status of each concept in the exchange file? You can get it from the register following the URI.
  • dcterms:source vs. rdf:about - To be discussed

RoR Interface:

The next version of the RoR will add the publisher information as well as other updates from the last meeting.

 

Best regards,

The JRC registry Team

#12 Updated by Daniele Francioli over 3 years ago

I've added some example files:

#13 Updated by Christian Ansorge over 3 years ago

Dear Daniele,

Thank you for your reply on our feedback which helps us to better understand your point of view. I would like to summarize our internal discussions. 

In our point of view the RoR should have two ways of data input:

  • manual data upload of register files (which contain the metadata of the registry and concept scheme), either as one multi-register file or as individual register files. We are aware that this would lead to a centralised approach, as the uploaded files do not necessarily have to be hosted externally.
  • harvesting of the metadata file about the registry  (and consequently the registers listed in it).

Both approaches will find their users and for the beginning we recommend to focus on the manual upload until the register exchange gets stable. We are not against the proposed harvesting approach, but from our point of view this adds another layer of complexity and we should solve it stepwise.

Best regards

Chris and Michael

 

#14 Updated by Christian Ansorge over 3 years ago

Maybe I should add the following, as it might otherwise be misunderstood:

The rdf/skos representation of a register (including meta information about concept scheme and registry) is the key component of this system. The same rdf/skos files used for the manual upload approach are referenced by the registry metadata file. Therefore we see the register file as first and currently most important step and the harvesting approach as a follow-up step reusing the same exchange format.

 

#15 Updated by Michael Lutz over 3 years ago

  • Status changed from New to Assigned
  • Assignee set to Daniele Francioli

Dear Christian, Michael, all,

thanks for your additional comments. 

Our aim is to create, from the beginning, a strong and reliable system to handle and maintain the federation. Providing two quite different approaches to accomplish the same task is difficult to maintain and could lead to inconsistencies (e.g. the federated resources that use the harvesting system will always be up to date, whereas the resources federated through manually uploaded files could remain outdated). We prefer to define a single approach and go on with that, in order to provide the most convenient solution from the beginning.

What exactly do you consider a problem with hosting the register exchange files on some web server? The approach we currently propose would only require that the (relatively small) file is available somewhere on the web - at any URL. On the other hand, with the proposed approach, a user does not have to manually log into the RoR and re-upload a register file in case of updates or additions. They only need to update the file they host on their server, and the RoR will then harvest it automatically.

One main argument against the "upload approach" you propose is that this would (as you rightly point out) create a centralised database, rather than just an entry point to a federation of distributed registers, which only stores limited metadata (and indices) of the registers.

Since we seem to have quite different views on the vision (a federation of distributed registers vs. a huge centralized registry), we would like to ask also the view of the other members of the group.

Can you please share your points of view here as well (or at the next meeting)?

Thanks & best regards,

The JRC Registry Team

#16 Updated by Daniele Francioli over 3 years ago

  • Description updated (diff)

#17 Updated by Heidi Vanparys over 3 years ago

Christian Ansorge wrote:

In the example on https://ies-svn.jrc.ec.europa.eu/issues/2615 under “Section 3: Concepts”, the concept is identified by rdf:about="CurrentUseValue/2", while the actual URI and ID of the concept is described in the dcterms:source element. Is there a specific reason to do so, or can we simplify it further by removing dcterms:source and provide that information in rdf:about instead, as it was initially proposed and discussed?

There seems to be confusion here, caused by the way the code lists in the ELF project were made. See e.g. the latest version of code list CurrentUseValue. Those code lists were written manually, and condensed as much as possible to avoid too much writing and copy-pasting. This was done by adding the xml:base attribute to element <rdf:RDF> to specify a base URI for the RDF document and giving it value http://locationframework.eu/codelist/ . By doing this, the value of the rdf:about and rdf:resource attributes could be abbreviated in all cases where the full URI is a URI starting with http://locationframework.eu/codelist/.

The following code snippets are semantically equivalent:

<rdf:RDF
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xml:base="http://locationframework.eu/codelist/">
    <skos:ConceptScheme rdf:about="CurrentUseValue">
        <skos:prefLabel xml:lang="en">Current use</skos:prefLabel>
        <dcterms:description xml:lang="en">This is an extension of code list http://inspire.ec.europa.eu/codelist/CurrentUseValue.</dcterms:description>
    </skos:ConceptScheme>
    <skos:Concept rdf:about="http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices">
        <skos:inScheme rdf:resource="CurrentUseValue" />
        <skos:topConceptOf rdf:resource="CurrentUseValue" />
    </skos:Concept>
    <skos:Concept rdf:about="CurrentUseValue/1">
        <skos:inScheme rdf:resource="CurrentUseValue" />
        <skos:inScheme rdf:resource="http://inspire.ec.europa.eu/codelist/CurrentUseValue" />
        <skos:prefLabel xml:lang="en">Hospital</skos:prefLabel>
        <skos:definition xml:lang="en">An institution or establishment providing inpatient medical or surgical treatment for the ill or wounded.</skos:definition>
        <dcterms:source rdf:resource="https://www.dgiwg.org/FAD/fdd/view?i=106995" />
        <skos:broader rdf:resource="http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices" />
    </skos:Concept>
    <!-- ... -->
</rdf:RDF>

<rdf:RDF
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/">
    <skos:ConceptScheme rdf:about="http://locationframework.eu/codelist/CurrentUseValue">
        <skos:prefLabel xml:lang="en">Current use</skos:prefLabel>
        <dcterms:description xml:lang="en">This is an extension of code list http://inspire.ec.europa.eu/codelist/CurrentUseValue.</dcterms:description>
    </skos:ConceptScheme>
    <skos:Concept rdf:about="http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices">
        <skos:inScheme rdf:resource="http://locationframework.eu/codelist/CurrentUseValue" />
        <skos:topConceptOf rdf:resource="http://locationframework.eu/codelist/CurrentUseValue" />
    </skos:Concept>
    <skos:Concept rdf:about="http://locationframework.eu/codelist/CurrentUseValue/1">
        <skos:inScheme rdf:resource="http://locationframework.eu/codelist/CurrentUseValue" />
        <skos:inScheme rdf:resource="http://inspire.ec.europa.eu/codelist/CurrentUseValue" />
        <skos:prefLabel xml:lang="en">Hospital</skos:prefLabel>
        <skos:definition xml:lang="en">An institution or establishment providing inpatient medical or surgical treatment for the ill or wounded.</skos:definition>
        <dcterms:source rdf:resource="https://www.dgiwg.org/FAD/fdd/view?i=106995" />
        <skos:broader rdf:resource="http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices" />
    </skos:Concept>
    <!-- ... -->
</rdf:RDF>

The <dcterms:source> element, where used, holds a reference to the web site where the original code list or code list item is defined. In the ELF project, many code lists are based on items in the DGIWG FDD Register, others are based on information in other sources. <dcterms:source> seemed the best element to hold this kind of information (but other suggestions are welcome).

So the statement "the actual URI and ID of the concept is described in the dcterms:source element" is not correct. The actual URI of concept "Hospital" is http://locationframework.eu/codelist/CurrentUseValue/1.
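
The resolution of the abbreviated values can be checked with a few lines of Python. This is only a sketch: it uses urljoin from the standard library to apply the usual relative-reference resolution against the base URI http://locationframework.eu/codelist/ mentioned above, and the helper name resolve is hypothetical.

```python
from urllib.parse import urljoin

# Base URI as given for xml:base in the explanation above.
BASE = "http://locationframework.eu/codelist/"


def resolve(reference: str, base: str = BASE) -> str:
    """Resolve an rdf:about / rdf:resource value against the document base URI."""
    return urljoin(base, reference)


if __name__ == "__main__":
    # Relative values are expanded with the base URI.
    print(resolve("CurrentUseValue/1"))
    # -> http://locationframework.eu/codelist/CurrentUseValue/1
    print(resolve("CurrentUseValue"))
    # -> http://locationframework.eu/codelist/CurrentUseValue
    # Absolute URIs are left untouched.
    print(resolve("http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices"))
    # -> http://inspire.ec.europa.eu/codelist/CurrentUseValue/publicServices
```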

The files with the examples for DesignationSchemeValue are not completely correct because they still have xml:base="http://locationframework.eu/codelist/". They should look like the files attached here (that is, when xml:base is used, else the base URI should be copied to all relevant places and xml:base removed).

#18 Updated by Michael Noren over 3 years ago

Some comments from me (without having consulted Christian).

Harvesting vs upload

@Michael L., I think we agree on the long-term perspective. It's just that for the testbed it could prove easier to upload the exchange files, in order to verify that the exchange of content works as it should without also testing the mechanism for exchanging the content at the same time. For example, if we publish the exchange file but nothing happens on the RoR, was our exchange file badly formatted, did the harvesting fail (wrong URL provided? harvester broken?), or did it not run at all? Also, it would be nice not to have to wait an hour or a day after an update to see how it is reflected in the RoR.

 

Separate file for exchange of metadata

@Daniele, I’m sorry but I’m not sure I completely understand your four options for exchange. In terms of content I see the separation of metadata and data in two files as an optimisation done a bit early. If the reason for splitting is to avoid huge files and heavy processing (https://ies-svn.jrc.ec.europa.eu/issues/2661#Background), what amounts of data are expected to be provided by each participant? Even with 500 code list extensions, I guess the exchange file still would be less than 0.5-1 MB?

About knowing when to harvest, is there a reason not to simply harvest everything on a daily basis? With a bit of luck, HTTP headers (e.g. Last-Modified, If-Modified-Since, ETag…) could be used to determine whether the published exchange files have been modified. In the worst case, downloading up to 1 MB from each EU country each day would not break the infrastructure.
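
A minimal sketch of such a conditional download, using only the Python standard library, could look as follows. It assumes the harvester stored the ETag and Last-Modified values from the previous response and that the publishing server supports conditional requests. The function names are hypothetical.

```python
from typing import Dict, Optional, Tuple
import urllib.error
import urllib.request


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build the request headers that ask the server for the file only if it changed."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url: str,
                     etag: Optional[str] = None,
                     last_modified: Optional[str] = None
                     ) -> Optional[Tuple[bytes, Optional[str], Optional[str]]]:
    """Return (body, etag, last_modified), or None when the server answers 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (response.read(),
                    response.headers.get("ETag"),
                    response.headers.get("Last-Modified"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged since the last harvest, nothing to do
            return None
        raise
```

If the server ignores these headers, the harvester simply downloads the whole file again, which matches the worst case described above.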

So, for the sake of simplicity, my preference would be that each participant in the federation just publishes the data they want to exchange, in the format specified in #2615. In the worst case, if there are any registers in the file that do not extend an Inspire register, it would be fairly simple to filter them out in the harvesting process.

What I think we are looking for is something like:

  1. We register our organisation in a webform provided by RoR, where we supply the URL to our exchange file and a contact email address.
  2. We publish the exchange file (#2615, no separate metadata file) on a server and RoR will check it daily and harvest it if needed.

 

Centralised registry vs distributed

Is there necessarily a contradiction between these two? As a consumer of the information it would be convenient to be able to find both the Inspire registers and the full information on the federation provided extensions together in the RoR/central Inspire registry. At the same time, someone like the EEA may want to provide the same thing, but possibly a subset, and therefore access to the extensions data provided by the federation members would be essential. While it would feel better if they can be retrieved directly from the federation members (i.e. RoR must publish the exchange file URLs provided by the members for others to harvest also), it would probably work fine also if the RoR/central Inspire registry allows for download/harvesting of those.

 

 

#19 Updated by Christian Ansorge over 3 years ago

Michael and I had some internal discussions about the current RoR architecture. Thanks to Michael for writing them down:

 

Registry exchange file, metadata + content file(s)

  • Is the sole purpose of the metadata file to be able to provide a list of all the separate register content files to be provided to the federation?
  • Can there be more than one register contained in one of the content files?
  • Should we rather use VoID instead of DCAT to describe the metadata file, since it is RDF? See https://lists.w3.org/Archives/Public/public-gld-comments/2012Apr/0027.html.
  • How is this expected to be done in practice, in the case of EEA some options are that we update the RDF export of Data Dictionary to comply with the register content exchange file specification or we make a query from our SPARQL endpoint that has all DD data to construct the proper RDF file(s). In the latter case it is probably simpler to provide all our extensions in one file, will this be ok? 
  • Then there is the DCAT (or VoID) file, which we also might want to generate somehow to avoid manual maintenance. This can probably be done without too many complications, except that the update frequencies don’t fit so well, as they are not really part of our data. Perhaps they can be static variables in the script, but it feels like they don’t really fit here. The problem this setting is trying to solve might not be a problem, as discussed in https://ies-svn.jrc.ec.europa.eu/issues/2661#note-18

 

ECAS and registration in RoR

It’s a good solution for the start, but in the long run it might be useful with a possibility to register another email address than the one that belongs to the ECAS account. This since responding to questions about the registers and registry management might be a task split among more than one person, and therefore the correspondence might need to be addressed to a functional mailbox, rather than a personal.

 

Uploading vs harvesting

Harvesting - incl. the possibility to trigger harvesting directly - is ok for us, also for the testbed. This is said without having tried it yet, but it would be useful if there is information so it is easy to track down what has possibly failed in the process, e.g. was the file not reachable at the URL, or is the file itself not valid etc.

#20 Updated by Michael Lutz over 3 years ago

Hi Christian and Michael,

Registry exchange file, metadata + content file(s)

  • Is the sole purpose of the metadata file to be able to provide a list of all the separate register content files to be provided to the federation?

The purpose of the Registry file is to simplify the registration/addition/modification of information and to automate the workflow. The main principle is that you tell the RoR once where you publish your registry descriptor (containing the location of the registers to be included in the federation) and then you simply update your descriptor and register files locally when there is a change.

This way there is no need to log in every time and update the information in a web form whenever there is a change.

  • Can there be more than one register contained in one of the content files?

No. The Registry descriptor contains the reference to the RDF file for each of the registers (one Register, one file).
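
As an illustration of "one Register, one file", the sketch below reads the list of registers from a Registry descriptor. The descriptor shown is a simplified, hypothetical example (a dcat:Catalog with dcat:dataset references, in the style of the snippet discussed earlier in this thread), not the agreed exchange format. The catalog URI and the function name listed_registers are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCAT = "http://www.w3.org/ns/dcat#"

# Hypothetical, simplified Registry descriptor: one dcat:dataset per register.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcat="http://www.w3.org/ns/dcat#">
    <dcat:Catalog rdf:about="http://example.org/registry">
        <dcat:dataset rdf:resource="http://inspire.ec.europa.eu/theme"/>
        <dcat:dataset rdf:resource="http://inspire.ec.europa.eu/applicationschema"/>
    </dcat:Catalog>
</rdf:RDF>"""


def listed_registers(descriptor_xml: str) -> List[str]:
    """Return the URIs referenced by dcat:dataset elements, in document order."""
    root = ET.fromstring(descriptor_xml)
    resource = f"{{{RDF}}}resource"
    return [el.attrib[resource]
            for el in root.iter(f"{{{DCAT}}}dataset")
            if resource in el.attrib]


if __name__ == "__main__":
    for uri in listed_registers(SAMPLE):
        print(uri)  # each entry would then be harvested from its own RDF file
```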

DCAT can be used to describe RDF files as well. See the DCAT specification (https://www.w3.org/TR/vocab-dcat/):

Data can come in many formats, ranging from spreadsheets over XML and RDF to various speciality formats. DCAT does not make any assumptions about the format of the datasets described in a catalog. Other, complementary vocabularies may be used together with DCAT to provide more detailed format-specific information. For example, properties from the VoID vocabulary [void] can be used to express various statistics about a DCAT-described dataset if that dataset is in RDF format.

What benefits would you see from using VoID?

  • How is this expected to be done in practice, in the case of EEA some options are that we update the RDF export of Data Dictionary to comply with the register content exchange file specification or we make a query from our SPARQL endpoint that has all DD data to construct the proper RDF file(s). In the latter case it is probably simpler to provide all our extensions in one file, will this be ok? 

It's fine to generate the content through a SPARQL query, but it should be done register by register (see above).

  • Then there is the DCAT (or VoID) file, which we also might want to generate somehow to avoid manual maintenance. This can probably be done without too many complications, except that the update frequencies don’t fit so well, as they are not really part of our data. Perhaps they can be static variables in the script, but it feels like they don’t really fit here. The problem this setting is trying to solve might not be a problem, as discussed in https://ies-svn.jrc.ec.europa.eu/issues/2661#note-18

After our discussions on Monday, we decided to move the update frequency field out of the core conformance class (i.e. to make it "optional"). It has to be used only if you want to enable automatic harvesting. Check the conformance classes.

 

ECAS and registration in RoR

It’s a good solution for the start, but in the long run it might be useful with a possibility to register another email address than the one that belongs to the ECAS account. This since responding to questions about the registers and registry management might be a task split among more than one person, and therefore the correspondence might need to be addressed to a functional mailbox, rather than a personal.

In the (draft) registration process we ask you to send your ECAS id and the name of the organization to which you want to be assigned to the RoR contact e-mail. If the organization is already available in the system, the new user is associated to the organization. Otherwise, the new organization is created, including a contact email related to this organization. This could be the “functional email”.

 

Uploading vs harvesting

Harvesting - incl. the possibility to trigger harvesting directly - is ok for us, also for the testbed. This is said without having tried it yet, but it would be useful if there is information so it is easy to track down what has possibly failed in the process, e.g. was the file not reachable at the URL, or is the file itself not valid etc.

The harvesting feedback will be available as a “detailed report” via mail or directly on the interface (downloadable file).
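
A sketch of the kind of distinction such a report could make is shown below. It is not the RoR implementation: the function name check_register is hypothetical, the download step is passed in as a parameter so it can be replaced, and the check is limited to XML well-formedness rather than conformance to the exchange format.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def check_register(url: str, fetch: Callable[[str], bytes]) -> str:
    """Return one report line saying whether the file was reachable and parseable."""
    try:
        body = fetch(url)
    except OSError as exc:  # urllib.error.URLError and timeouts are OSError subclasses
        return f"{url}: not reachable ({exc})"
    try:
        ET.fromstring(body)
    except ET.ParseError as exc:
        return f"{url}: not well-formed XML ({exc})"
    return f"{url}: ok"


if __name__ == "__main__":
    def unreachable(url: str) -> bytes:
        raise OSError("connection refused")

    print(check_register("http://example.org/a.rdf", unreachable))
    print(check_register("http://example.org/b.rdf", lambda u: b"<rdf>"))
    print(check_register("http://example.org/c.rdf", lambda u: b"<rdf/>"))
```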

The JRC registry team

#21 Updated by Michael Noren over 3 years ago

Michael, thank you for further explaining the reasoning behind your decisions. While we might not fully agree on everything, it feels like we have more to gain now by moving on and seeing how this will work in practice. Good to hear that the update frequency is optional, as it will also provide a better separation between the federation content and the implementation of the federation from the RoR point of view.


Cheers,
Michael (& Christian)

#22 Updated by Daniele Francioli over 3 years ago

  • Description updated (diff)
