Support #2799

IT - Ministero dell'ambiente: duplicate metadata

Added by Angelo Quaglia about 4 years ago. Updated over 2 years ago.

Status: Feedback
Start date: 23 Jun 2016
Priority: Urgent
Due date:
Assignee: Angelo Quaglia
% Done: 0%
Category: Harvesting results
Target version: -
Submitting Organisation: IT - Ministero dell'ambiente
Knowledge-Base relevant?:
Proactive: Yes
Keyword #1:
Country: IT - Italy
Keyword #2:
Originating UI:
Keyword #3:

Description

Dear Laura,

there is a quite serious problem of duplicate fileIdentifiers coming from Italy's National Discovery Service.

I have identified at least two scenarios:

1) the same metadata documents are served multiple times. 

2) the same fileIdentifier is used in different metadata documents describing different resources
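A minimal sketch of how the two scenarios could be told apart automatically, assuming the harvested records are available as (fileIdentifier, XML text) pairs. The function name and the sample data are hypothetical, and the byte-for-byte hash is only one possible notion of "same content":

```python
import hashlib
from collections import defaultdict


def classify_duplicates(records):
    """Label every fileIdentifier that occurs more than once.

    records: iterable of (file_identifier, xml_text) pairs.
    Returns {identifier: "same-content"} for scenario 1 and
    {identifier: "different-content"} for scenario 2.
    Identifiers seen only once are omitted.
    """
    digests = defaultdict(list)
    for file_identifier, xml_text in records:
        # Byte-for-byte hash; a real comparison might normalise whitespace first.
        digest = hashlib.sha256(xml_text.encode("utf-8")).hexdigest()
        digests[file_identifier].append(digest)

    result = {}
    for file_identifier, seen in digests.items():
        if len(seen) < 2:
            continue
        same = len(set(seen)) == 1
        result[file_identifier] = "same-content" if same else "different-content"
    return result


# Hypothetical sample data, not taken from the RNDT catalogue.
sample = [
    ("id-a", "<md>one</md>"),
    ("id-a", "<md>one</md>"),
    ("id-b", "<md>two</md>"),
    ("id-b", "<md>three</md>"),
    ("id-c", "<md>four</md>"),
]
print(classify_duplicates(sample))
```

Identifiers labelled "same-content" would point to the service (scenario 1), while "different-content" would point to the metadata provider (scenario 2).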

 

 

Example for case 1:

fileIdentifier:

gea:00151:20090727:081902

It appears in four metadata documents that seem to have the same content.

They come from a linked Discovery Service (RNDT - Repertorio Nazionale dei Dati Territoriali - Servizio di ricerca)

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344/services/1/PullResults/41-60/services/17/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1

The records were returned in batches 21-40, 41-60, 61-80 and 81-100.

They appear to have the same content:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344/services/1/PullResults/41-60/services/17/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/21-40/datasets/13/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344/services/1/PullResults/41-60/services/17/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/41-60/datasets/13/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344/services/1/PullResults/41-60/services/17/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/61-80/datasets/13/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344/services/1/PullResults/41-60/services/17/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/81-100/datasets/13/

This looks like a technical issue of the remote Discovery Service.

 

Example for case 2

c_c912:00001:20141006:093254

This looks like a problem with the metadata provider who reused the same fileIdentifier for different metadata records:

 

adbpo:SW01:20121212:104832

is used in 13 different metadata documents (again collected from the RNDT catalogue):

 

The INSPIRE Geoportal Resource Browser (at http://inspire-geoportal.ec.europa.eu/proxybrowser/) makes it easy to find the duplicates:

 


Related issues

Related to Geoportal Helpdesk - Support #3029: IT - RNDT: Several Abormal GetRecordsResponse received Feedback 03 Nov 2017
Related to Geoportal Helpdesk - Support #2894: EU: Summary and status of duplicate metadata fileIdentifiers Assigned 23 Dec 2016
Copied to Geoportal Helpdesk - Support #2818: IT - Ministero dell'ambiente: duplicated portions in serv... Feedback 23 Jun 2016

History

#1 Updated by Angelo Quaglia about 4 years ago

I spoke with Antonio Rotundo during the last MIWP-8 meeting.

Some of the duplicates result from splitting a metadata document that contains both the series metadata and the dataset metadata.

The list of fileIdentifiers occurring more than once and coming from the RNDT service can be obtained with the following URL:

http://inspire-geoportal.ec.europa.eu/solr/select?facet=true&q=id:\/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20160620-122344\/services\/1\/PullResults\/41-60\/services\/17\/resourceLocator1\/discovery\/services\/1\/linkedDiscoveryService\/services\/1\/PullResults*&facet.field=remoteMetadataIdentifier&facet.limit=-1&facet.mincount=2&rows=0

For example:
<int name="adbpo:Bac01:20130910:135241">5</int>
<int name="adbpo:SW01:20121212:104832">13</int>
 
The number represents the times the identifier is found.
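As an illustration, the facet counts could be extracted with a short script like the one below. It is only a sketch: it assumes the response uses Solr's standard XML layout (a `lst` element named after the facet field, containing `int` elements), and it parses an inline sample built from the two entries quoted above rather than calling the live service.

```python
import xml.etree.ElementTree as ET

# Inline sample; the wrapper elements are assumed from Solr's usual XML output.
SAMPLE_RESPONSE = """<response>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="remoteMetadataIdentifier">
        <int name="adbpo:Bac01:20130910:135241">5</int>
        <int name="adbpo:SW01:20121212:104832">13</int>
      </lst>
    </lst>
  </lst>
</response>"""


def duplicate_counts(solr_xml, field="remoteMetadataIdentifier"):
    """Return (fileIdentifier, occurrences) pairs, most frequent first."""
    root = ET.fromstring(solr_xml)
    counts = {}
    for node in root.findall(f".//lst[@name='{field}']/int"):
        counts[node.get("name")] = int(node.text)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


for identifier, count in duplicate_counts(SAMPLE_RESPONSE):
    print(f"{count:>4}  {identifier}")
```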
 

#2 Updated by Angelo Quaglia about 4 years ago

From: Rotundo Antonio [mailto:antonio.rotundo@agid.gov.it]
Sent: 12 July 2016 11:16
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>
Subject: R: fileIdentifier

 

Hi Angelo,

I saw the issue because I received the notification.

I would like to clarify directly there (so that MATTM reads it too) both that the duplicates caused by RNDT will be fixed very shortly, and the matter of the duplicates caused by the INSPIRE Geoportal harvesting RNDT and the regional catalogues at the same time (the latter are in turn harvested by RNDT), which depends on MATTM.

Regards,

Antonio

 

#3 Updated by Antonio Rotundo about 4 years ago

Dear Angelo,

as agreed in the last MIWP-8 meeting, in the coming days we'll check and fix the problems concerning the duplicated fileIdentifiers coming from RNDT.
As already pointed out in the email you posted above, to improve the quality of the metadata coming from the Italian National Discovery Service, the duplicates caused by harvesting both RNDT and the other regional or national catalogues (which RNDT already harvests, as required by the law establishing RNDT) should also be removed by changing the getCapabilities document and the linked catalogues of the National Discovery Service.

Regards,

Antonio

 

#4 Updated by Angelo Quaglia almost 4 years ago

  • Category set to Harvesting results

#5 Updated by Angelo Quaglia almost 4 years ago

  • Proactive set to Yes

#7 Updated by Antonio Rotundo almost 4 years ago

Dear Angelo,

We haven't fixed the duplicate fileIdentifiers coming from RNDT yet. 

We will do that in the coming weeks as we are to start the re-engineering of the portal.

I will inform you promptly.

#8 Updated by Angelo Quaglia almost 4 years ago

Dear Antonio,

thank you very much for the prompt update.

#9 Updated by Angelo Quaglia almost 4 years ago

  • Country set to IT - Italy

#10 Updated by Angelo Quaglia over 3 years ago

Dear Antonio,

any updates on this issue?

I still see 2172 duplicates 

 

The identifier p_pi:00001:20091203:163001 mentioned above is still used in 1 series (Piano Faunistico della Provincia di Pisa (Provincia di Pisa)) and 8 datasets (e.g. AAC (Provincia di Pisa)).

 

But there is another type of issue which seems to be caused by a technical issue instead of a functional one:

For example, the following fileIdentifier is received three times from the RNDT service:

in batch 2601-2650 at position 8

in batch 3151-3200 at position 16

in batch 9301-9350 at position 29

ARLPA_TO:07.03.09-D_2011-05-03_12:50   (Arpa Piemonte - Evento alluvionale 13-16 ottobre 2000 - Torrente Stura)

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/2601-2650/datasets/8/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/3151-3200/datasets/16/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/9301-9350/datasets/29/

 

#11 Updated by Antonio Rotundo over 3 years ago

Dear Angelo,

as I explained during the last INSPIRE Conference, we are working on the new RNDT portal and catalogue; as part of that activity, which will take a few months, all the duplicates will be definitively fixed.

In the meantime, I started to check and try to fix some duplicated fileIdentifiers; as soon as I finish that, I will alert you.

I think that many duplicates are caused by the second type of issue you mentioned (i.e. the same fileIdentifier received multiple times from the RNDT service).

#12 Updated by Angelo Quaglia over 3 years ago

Dear Antonio,

considering that the new RNDT portal and catalogue will take a few months to be ready, it might be worth understanding the outstanding issues so that we can be really sure they will go away with the new version.

I would like to start with the "second type of issue"

 

The three metadata files above are absolutely identical but they are received in batches that are quite "distant" in terms of startPosition:

in batch 2601-2650 at position 8

in batch 3151-3200 at position 16

in batch 9301-9350 at position 29

That is strange because every GetRecords request contains a sortBy clause:

<csw:GetRecords
    xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0"
    service="CSW" version="2.0.2"
    maxRecords="1" startPosition="6045"
    resultType="results"
    outputSchema="http://www.isotc211.org/2005/gmd"
    outputFormat="application/xml">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy>
      <ogc:SortProperty>
        <ogc:PropertyName>apiso:Identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>
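To spell out why the sortBy clause matters: if the server applies a fixed ordering, paging with startPosition and maxRecords should return every record exactly once. The sketch below simulates that with an in-memory list; the helper names and the identifiers are hypothetical and no real CSW endpoint is involved.

```python
def get_page(sorted_identifiers, start_position, max_records):
    """Return one page; startPosition is 1-based, as in CSW GetRecords."""
    begin = start_position - 1
    return sorted_identifiers[begin:begin + max_records]


def harvest(identifiers, max_records):
    """Collect all pages from a catalogue that honours a fixed sort order."""
    ordered = sorted(identifiers)  # stands in for the SortBy clause
    collected = []
    for start in range(1, len(ordered) + 1, max_records):
        collected.extend(get_page(ordered, start, max_records))
    return collected


# Hypothetical identifiers, used only to exercise the paging logic.
ids = [f"rec:{n:04d}" for n in range(1, 238)]
result = harvest(ids, max_records=50)

# With a stable order, no identifier is returned twice and none is missed.
print(len(result) == len(set(result)) == len(ids))
```

Under this assumption, the same record appearing in three distant batches suggests that the ordering is not applied consistently between requests, or that the record is stored more than once on the server.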

 

P.S.:

I am collecting in Issue #2869 the software implementations used in INSPIRE by Member States for their National Discovery Services.

What is the current implementation RNDT is using and what will be the upcoming one?

Do you know anything about MinAmbiente?

 

 

 

 

 

#13 Updated by Antonio Rotundo over 3 years ago

Dear Angelo,

I'll check the strange behaviour of the service and I'll let you know asap. As I explained when we met during the INSPIRE Conference, I had already surmised that it could be such a problem.

Concerning the current implementation of RNDT, it is based on a self-developed solution. The upcoming one will be based on ESRI Geoportal Server (so that will ensure the resolution of the issues with the duplicates).

 

#14 Updated by Angelo Quaglia over 3 years ago

Dear Antonio,

thank you for the information.

The INSPIRE Geoportal harvests with 5 concurrent threads but the three batches above were collected at very different times.

#15 Updated by Antonio Rotundo over 3 years ago

Angelo,

a question: how are the parameters maxRecords and startPosition set in the concurrent threads? 

#16 Updated by Angelo Quaglia over 3 years ago

The maxRecords is inherited from the settings of the parent Discovery Service and is equal to 50.

The startPosition is set accordingly for each batch.

If you look at my comment above you can infer what the startPosition was for each batch (i.e. 2601, 3151 and 9301):

in batch 2601-2650 at position 8

in batch 3151-3200 at position 16

in batch 9301-9350 at position 29

ARLPA_TO:07.03.09-D_2011-05-03_12:50   (Arpa Piemonte - Evento alluvionale 13-16 ottobre 2000 - Torrente Stura)

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/2601-2650/datasets/8/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/3151-3200/datasets/16/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-49e7361d-2a39-11e5-928d-52540004b857_20170102-003338/services/1/PullResults/1-50/services/1/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/9301-9350/datasets/29/
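The relation between maxRecords, startPosition and the batch labels can be sketched as follows. The helper names are hypothetical, and the absolute position assumes that the position within a batch is counted from 1, which is not confirmed here.

```python
def batch_start_positions(total_records, max_records=50):
    """startPosition values used to page through total_records records."""
    return list(range(1, total_records + 1, max_records))


def batch_label(start_position, max_records=50):
    """Label of a batch as it appears in the PullResults URLs."""
    return f"{start_position}-{start_position + max_records - 1}"


def absolute_position(start_position, position_in_batch):
    """Overall position of a record, assuming 1-based positions in a batch."""
    return start_position + position_in_batch - 1


for start, position in [(2601, 8), (3151, 16), (9301, 29)]:
    print(batch_label(start), absolute_position(start, position))
```

Under that assumption, the three copies sit at overall positions 2608, 3166 and 9329, which is far apart for a result set sorted by identifier.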

#17 Updated by Antonio Rotundo over 3 years ago

Angelo,

thank you very much!

#18 Updated by Angelo Quaglia over 2 years ago

From Issue #3029

Dear Antonio,

it is important to determine whether the duplicates actually hide good records, as is happening with the GeoNetwork bug, for example.

One efficient way to do this is to compare the expected fileIdentifiers with those that are currently in the INSPIRE Geoportal index.

You can get them with this URL:

http://inspire-geoportal.ec.europa.eu/solr/select?facet=true&facet.mincount=1&facet.limit=-1&facet.field=remoteMetadataIdentifier&q=*:*&fl=remoteMetadataIdentifier,resourceTitle&fq=memberStateCountryCode:it&fq=sourceMetadataResourceLocator:\/*&rows=0

The value of each name attribute is the fileIdentifier, while the element value is the number of occurrences of that identifier:

...
<int name="r_emiro:BGBDJ">1</int>
<int name="r_emiro:BHRCG">1</int>
<int name="r_emiro:BIVGF">1</int>
<int name="r_emiro:BJUQF">2</int>
<int name="r_emiro:BKONW">1</int>
<int name="r_emiro:BLDUJ">3</int>
<int name="r_emiro:BLNML">1</int>
<int name="r_emiro:BMJOD">1</int>
...
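A minimal sketch of the comparison, assuming the expected fileIdentifiers are available as a plain list and the index counts as a dictionary built from the facet output. The function name is hypothetical, and so is the identifier "r_emiro:XXXXX", which stands for a record that is expected but not indexed.

```python
def compare_identifiers(expected, indexed_counts):
    """Compare the provider's identifiers with those found in the index.

    expected: iterable of fileIdentifiers the provider believes it serves.
    indexed_counts: {fileIdentifier: occurrences} taken from the facet output.
    """
    expected = set(expected)
    indexed = set(indexed_counts)
    return {
        "missing": sorted(expected - indexed),      # possibly hidden by duplicates
        "unexpected": sorted(indexed - expected),   # indexed but not expected
        "duplicated": sorted(k for k, v in indexed_counts.items() if v > 1),
    }


indexed_counts = {
    "r_emiro:BGBDJ": 1,
    "r_emiro:BJUQF": 2,
    "r_emiro:BLDUJ": 3,
}
# "r_emiro:XXXXX" is a made-up identifier used only for illustration.
expected = ["r_emiro:BGBDJ", "r_emiro:BJUQF", "r_emiro:BLDUJ", "r_emiro:XXXXX"]

print(compare_identifiers(expected, indexed_counts))
```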
 
Concerning the duplicates, instead, I would rather continue the discussion in the following dedicated issue:
Issue #2799 IT - Ministero dell'ambiente: duplicate metadata
 
That issue was reported to Italy on 23rd June 2016.
I understand you fixed the duplications that were due to the splitting of series metadata you are doing in order to comply with INSPIRE requirements.
 
Please note that in
Issue #2894 Summary and status of duplicate metadata fileIdentifiers
you will see that duplicates have occurred, and still occur, in the National Discovery Services of various countries.
That happens for a number of disparate reasons (load balancing errors, indexing errors, etc.),
even when everything seems to be functioning normally from the Service Provider's side.
 
This URL will return only the fileIdentifiers that have a multiplicity greater than one:
 
 
I will post this information also to Issue #2799
 
Best regards,
Angelo

#19 Updated by Angelo Quaglia over 2 years ago

From Issue #3029

Antonio Rotundo wrote:

Dear Angelo, as you know, we are about to release the new version of the national Catalogue and discovery service, which should allow us to overcome all the open issues. So, what you described is really a temporary situation. Since I had fixed most of the duplicated fileIDs, I checked a sample of them and I can confirm that those IDs are present only once in the DB, except for 150 of them, which I will fix asap. All those fileIDs refer to good records. Best regards, Antonio
