Support #2900

SE: Duplicate fileIdentifiers

Added by Angelo Quaglia over 3 years ago. Updated over 2 years ago.

Status: Feedback
Start date: 10 Jan 2017
Priority: Normal
Due date:
Assignee: Angelo Quaglia
% Done: 0%
Category: Harvesting results
Target version: -
Submitting Organisation: SE
Knowledge-Base relevant?:
Proactive: Yes
Keyword #1:
Country: SE - Sweden
Keyword #2:
Originating UI:
Keyword #3:

Description

Dear Michael,

I am hunting down duplicate file identifiers in metadata collected from Member States.

The current status is documented in Issue #2894.

Today, the INSPIRE Geoportal finds the following two duplicates coming from https://www.geodata.se/InspireCswProxy/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities:

Started checking /INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults ...
ERROR: duplicate fileIdentifiers were found. 
           c1e1cddc-fd37-491a-a5af-fb7f135f6eb4 2
           d51b94c9-1e71-4cf8-bcbd-706e249407bc 2
Finished checking /INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults

 

 

c1e1cddc-fd37-491a-a5af-fb7f135f6eb4 (Nationell miljöövervakning: Landskap, sträckfågelräkning vid Falsterb)

http://inspire-geoportal.ec.europa.eu/INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults/301-350/datasets/50/

http://inspire-geoportal.ec.europa.eu/INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults/351-400/datasets/1/

 

d51b94c9-1e71-4cf8-bcbd-706e249407bc (Områden skyddade enligt fiskvattendirektivet )

http://inspire-geoportal.ec.europa.eu/INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults/351-400/datasets/50/

http://inspire-geoportal.ec.europa.eu/INSPIREWebServices/resources/INSPIREResource/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170107-021139/services/1/PullResults/401-450/datasets/1/

 

This is likely due to a software problem, even more so if you are using GeoNetwork.

Could you please share the vendor and version you are using?

I will update Issue #2869 where I keep track of Discovery Service implementations in Europe.

 

Best regards,

Angelo

 

439.iso19139.xml (24.8 KB) Angelo Quaglia, 17 Jan 2017 10:51 am

36.iso19139.xml (25.5 KB) Angelo Quaglia, 17 Jan 2017 10:51 am


Related issues

Related to Geoportal Helpdesk - Support #2894: EU: Summary and status of duplicate metadata fileIdentifiers Assigned 23 Dec 2016
Copied to Geoportal Helpdesk - Support #2914: NL - KADASTER: Duplicate fileIdentifiers Resolved 10 Jan 2017

History

#1 Updated by Angelo Quaglia over 3 years ago

  • Subject changed from BE - Brussels Region: Duplicate fileIdentifiers to SE: Duplicate fileIdentifiers
  • Submitting Organisation changed from BE - Brussels Region to SE
  • Country changed from BE - Belgium to SE - Sweden

#2 Updated by Michael Östling over 3 years ago

Hi,
The catalogue contains only a single instance of each of the above fileIdentifiers.

I guess this is related to the paging function of the CSW API: if a record is not properly indexed in Lucene, it could show up in multiple
requests. I think we had a similar issue previously when some records were not indexed correctly.
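
To make the paging concrete, here is a minimal sketch of how a harvester might split a full harvest into GetRecords pages. The endpoint is the one from this ticket and the parameter names are standard CSW 2.0.2 KVP names, but the exact parameter set this server accepts is an assumption, and `page_starts` and `get_records_url` are hypothetical helper names. No request is sent. If the server's result order is not stable between two page requests, the same record can appear on two pages, which would match the guess above.

```python
from urllib.parse import urlencode

# Endpoint taken from this issue; whether the server accepts exactly these
# parameters is an assumption.
ENDPOINT = "https://www.geodata.se/InspireCswProxy/csw"


def page_starts(total, page_size):
    """Return the 1-based startPosition of every page needed to cover `total` records."""
    return list(range(1, total + 1, page_size))


def get_records_url(start, page_size):
    """Build a GetRecords URL for one page (nothing is requested here)."""
    params = {
        "SERVICE": "CSW",
        "VERSION": "2.0.2",
        "REQUEST": "GetRecords",
        "typeNames": "gmd:MD_Metadata",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "elementSetName": "full",
        "resultType": "results",
        "startPosition": start,
        "maxRecords": page_size,
    }
    return ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # 50 is the batch size mentioned in this ticket; 120 is an arbitrary total.
    for start in page_starts(120, 50):
        print(get_records_url(start, 50))
```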

I could not access the URLs in the ticket above. Are they password-protected?

Let me discuss with Jose Garcia at Geocat tomorrow how we could trace this.

/Michael

#3 Updated by Angelo Quaglia over 3 years ago

  • Description updated (diff)

#4 Updated by Angelo Quaglia over 3 years ago

  • Description updated (diff)

#5 Updated by Angelo Quaglia over 3 years ago

Hi Michael,

I overrode the settings on the server to download all the records in one batch, and I found that you have corrupted records in your catalogue:

 

Huvud- och delavrinningsområden, vattendelare (SVAR2012) - datamängd 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170117-090253/services/1/PullResults/1-487/datasets/439/resourceReport/

 

Dammar (SVAR2013) - nedladdningstjänst

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170117-090253/services/1/PullResults/1-487/services/36/resourceReport/

 

I have attached the files.

#6 Updated by Angelo Quaglia over 3 years ago

The problem with the files is that they contain the spurious characters "../../" between the first two elements:

 

<?xml version="1.0" encoding="UTF-8"?>
<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:srv="http://www.isotc211.org/2005/srv" xmlns:gml="http://www.opengis.net/gml" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:geonet="http://www.fao.org/geonetwork" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.isotc211.org/2005/gmd http://schemas.opengis.net/iso/19139/20060504/gmd/gmd.xsd">
      ../../
      <gmd:fileIdentifier>
    <gco:CharacterString>e857abd7-8b9f-4450-85b6-d248f11c1eaa</gco:CharacterString>
  </gmd:fileIdentifier>

 

 

<?xml version="1.0" encoding="UTF-8"?><gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:srv="http://www.isotc211.org/2005/srv" xmlns:gml="http://www.opengis.net/gml" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:geonet="http://www.fao.org/geonetwork" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.isotc211.org/2005/srv http://schemas.opengis.net/iso/19139/20060504/srv/srv.xsd">
      ../../
      <gmd:fileIdentifier>
            <gco:CharacterString>1a2f3f94-18df-46fb-bce9-407c4f466f81</gco:CharacterString>
         </gmd:fileIdentifier>
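
A document like this is still well-formed XML, so a plain parser will not complain about it. The following sketch shows one way to spot the stray text with the Python standard library. The function name `stray_text` is hypothetical, and the sample is a closed-off version of the first excerpt above (the real files are much longer).

```python
import xml.etree.ElementTree as ET


def stray_text(xml_text):
    """Return non-whitespace text found directly inside the root element.

    Covers text before the first child (root.text) and text after each
    child (child.tail). An element-only root should yield an empty list.
    """
    root = ET.fromstring(xml_text)
    found = []
    if root.text and root.text.strip():
        found.append(root.text.strip())
    for child in root:
        if child.tail and child.tail.strip():
            found.append(child.tail.strip())
    return found


# Shortened, closed-off copy of the first excerpt, for illustration only.
CORRUPTED = (
    '<gmd:MD_Metadata xmlns:gmd="http://www.isotc211.org/2005/gmd" '
    'xmlns:gco="http://www.isotc211.org/2005/gco">\n'
    "      ../../\n"
    "      <gmd:fileIdentifier>\n"
    "    <gco:CharacterString>e857abd7-8b9f-4450-85b6-d248f11c1eaa</gco:CharacterString>\n"
    "  </gmd:fileIdentifier>\n"
    "</gmd:MD_Metadata>"
)

if __name__ == "__main__":
    print(stray_text(CORRUPTED))  # ['../../']
```

The sketch only inspects the direct content of the root element, which is where the characters show up in these records.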

 

#7 Updated by Angelo Quaglia over 3 years ago

Dear Michael, Christina,

please note that you have 107 ISO 19139 metadata files which fail the XML validation.

 

Best regards,

Angelo

#8 Updated by Angelo Quaglia over 3 years ago

Please note that with the new MD guidelines, XML validity has become a requirement.

#9 Updated by Angelo Quaglia over 3 years ago

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 25 January 2017 18:24
To: 'Fredrik.persater@lm.se' <Fredrik.persater@lm.se>
Cc: Michael Östling (michael.ostling@metagis.se) <michael.ostling@metagis.se>; 'Wasström Christina' <Christina.Wasstrom@lm.se>
Subject: Many invalid metadata records coming from the Swedish National INSPIRE Discovery Service

 

Dear Fredrik,

 

I am having quite a hard time with the ISO 19139 metadata returned by the service at:

https://www.geodata.se/InspireCswProxy/csw?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

 

  1. Three XML files are returned with spurious characters (“../../”) between the first two elements:

 

     <gmd:MD_Metadata>
      ../../
      <gmd:fileIdentifier>
            <gco:CharacterString>4647d99c-960f-4537-b3de-acdb9e61e72b</gco:CharacterString>
         </gmd:fileIdentifier>

      </gmd:MD_Metadata>

      <gmd:MD_Metadata >
      ../../
      <gmd:fileIdentifier>
            <gco:CharacterString>90850615-fe44-42fa-b2bc-2b351e7aafdf</gco:CharacterString>
         </gmd:fileIdentifier>

     <gmd:MD_Metadata>
      ../../
      <gmd:fileIdentifier>
            <gco:CharacterString>e857abd7-8b9f-4450-85b6-d248f11c1eaa</gco:CharacterString>
         </gmd:fileIdentifier>
         <gmd:language>

 

You can check the downloaded file, exactly as it was received, here:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-c99ee4e6-66ee-11e3-8e38-52540004b857_20170125-170342/services/1/PullResults/1-487/downloaded

 

  2. 107 files fail the XML validation.

  3. 28 metadata documents fail one critical internal processing phase of the INSPIRE Geoportal.

  4. Probably because of the issues above, the service is returning duplicate metadata when harvested in small batches.
    However, it is difficult to analyse the situation with so many problematic metadata records.

 

This is being tracked here:

https://ies-svn.jrc.ec.europa.eu/issues/2900

 

If you would like to access the system, you need to create an account here using your email address and notify me when done:

https://webgate.ec.europa.eu/cas/eim/external/register.cgi

 

 

Best regards,

Angelo Quaglia

#10 Updated by Angelo Quaglia over 3 years ago

From: Persäter Fredrik [mailto:Fredrik.Persater@lm.se]
Sent: 26 January 2017 08:29
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>
Cc: Michael Östling <michael.ostling@metagis.se>; Wasström Christina <Christina.Wasstrom@lm.se>
Subject: SV: Many invalid metadata records coming from the Swedish National INSPIRE Discovery Service

 

Dear Angelo,

 

Thank you for your information. I will discuss this with Michael and come back to you.

 

Kind Regards

Fredrik

#11 Updated by Angelo Quaglia over 3 years ago

From: Persäter Fredrik [Fredrik.Persater@lm.se]
Sent: 01 February 2017 11:17
To: Angelo Quaglia
Subject: SV: Many invalid metadata records coming from the Swedish National INSPIRE Discovery Service
 
 

Dear Angelo,

 

I have discussed the issues you are experiencing with metadata from our national directory service with my colleagues. The problems are different in nature and require actions at different levels. What we currently have done is to correct the error with the strange characters in some metadata documents.

 

As for the 107 files that do not pass validation, we are trying to push this work forward. Unfortunately, this takes time and is a more long-term initiative. The difference between these files and those with the strange characters is that we must resolve the issues with the authorities responsible for the files, which is not so straightforward.

 

Is it possible to get a list of the 28 files that fail during processing? We are interested in both the files and the kind of error they cause.

 

Regarding metadata validation, we are strongly of the opinion that the current warning about the coordinate system of a dataset is wrong. This has to be changed in the guidance documents by MIG. There can be standard coordinate systems on the distribution side, but the coordinate system in which data is collected, or in which a dataset is stored in the local database, depends on local needs and requirements. Because the check is not correct, we think there should not be a warning for this kind of issue.

 

And yes, it would be nice to be able to access the system. The e-mail address I am using in the EU Login is fredrik.persater@lm.se.

 

Kind Regards

 

Fredrik Persäter
Project Manager


Swedish Mapping, Cadastre and Land Registration authority

#12 Updated by Angelo Quaglia over 3 years ago

Dear Fredrik,

I have created a user for you and you should now be able to access and update this issue or create new ones.
 
I do understand the time that is required to get each responsible organisation to fix their metadata.
However, please note that the deadline for metadata was 03/12/2013.
 
You can easily get a list of the 28 most critical ones by using the Resource Browser at:
 
 
Then you select:
- "se" in Member State
- "Show only metadata resources" in Selection Criteria
- "error.geoportal.proxy.iso2inspire.failed" under Error Counts 
 
 
 
Of those 28, the nastier ones are those for which the errors were so serious that no information at all could be extracted from the metadata.
You can recognise those because they are displayed as:

*no title specified* (*no Responsible Organisation Name specified*)

 
 
If you click on "Validation Report" you will see the errors encountered during the transformation.
You will also be able to see the original metadata and the fileIdentifier:
 
You might also discover some funny things, for example that in two metadata documents the Responsible Organisation Email is set to "--- xyz@xyz.se "
 
 
The warning about the CRS is in fact reported as a warning and not an error.
It was introduced mainly to help users remember that this piece of information is going to be mandatory, since it is required by the Interoperability Regulation, which came later than the Metadata Regulation.
I agree the check needs to be refined but, as I said, it is just a warning.
In any case, I think it should be discussed in a separate issue, which you are very welcome to create.
Please note that there are other similar issues already opened.
 
Best regards,
Angelo

#13 Updated by Angelo Quaglia over 3 years ago

Dear Fredrik,

the problem of the spurious ../../ is a creepy one.

I have just executed a GetRecords request for all the records from my browser; look at the results:

I have also tried to harvest in batches of 50 and I got two duplicates:
D1E0D552-4FB1-4254-9AB6-0A13BC4121ED
c1645388-b1a0-4ba6-b6bf-d75d3ad664fe
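
When harvesting in batches, it helps to know where each duplicate turned up, not just that it exists. The sketch below is one way to do that. It assumes the identifiers of each GetRecords page have already been extracted into a list, and the name `locate_duplicates` is hypothetical.

```python
from collections import defaultdict


def locate_duplicates(batches):
    """Map each identifier seen more than once to its (batch, position) pairs.

    `batches` is a list of lists of identifiers, one inner list per
    GetRecords page, in harvest order. Batch and position are 1-based.
    Identifiers are compared exactly as received.
    """
    seen = defaultdict(list)
    for batch_no, batch in enumerate(batches, start=1):
        for pos, ident in enumerate(batch, start=1):
            seen[ident].append((batch_no, pos))
    return {ident: locs for ident, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    # Made-up identifiers: "c" closes the first batch and opens the second.
    pages = [["a", "b", "c"], ["c", "d"], ["e"]]
    print(locate_duplicates(pages))  # {'c': [(1, 3), (2, 1)]}
```

A duplicate sitting at the end of one batch and the start of the next would hint at a paging or ordering problem on the server side rather than at a record stored twice.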

 

 

#14 Updated by Angelo Quaglia over 2 years ago

  • Estimated time set to 2.00

Dear Michael,

the INSPIRE Geoportal is still receiving duplicates from the Discovery Service of Sweden.

Since you are using GeoNetwork, this is likely a bug that not only sends duplicates but also lets the duplicates take the place of good records, which we are then not receiving.

http://inspire-geoportal.ec.europa.eu/solr/select?q=(sourceMetadataResourceLocator:\/* AND memberStateCountryCode:SE)&facet=true&facet.field=remoteMetadataIdentifier&facet.limit=-1&facet.mincount=2&rows=0

<lst name="remoteMetadataIdentifier">

<int name="c5fdfbf0-7ed9-4956-9d09-16414d4e1080">2</int>
<int name="c60c9f26-8bfc-4188-92d7-64a376dc3fd3">2</int>
</lst>
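
The facet list above can also be read programmatically. The following is a minimal sketch that parses only the fragment quoted here. The full Solr response wraps this list in further elements that are not shown in this ticket, so the function simply searches for the `lst` with the given name, and `facet_counts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Fragment of the Solr response quoted above.
FACET_XML = """
<lst name="remoteMetadataIdentifier">
<int name="c5fdfbf0-7ed9-4956-9d09-16414d4e1080">2</int>
<int name="c60c9f26-8bfc-4188-92d7-64a376dc3fd3">2</int>
</lst>
"""


def facet_counts(xml_text, field):
    """Return {value: count} for the facet list named `field`."""
    root = ET.fromstring(xml_text.strip())
    counts = {}
    # iter() also visits the root itself, so the bare fragment works too.
    for lst in root.iter("lst"):
        if lst.get("name") == field:
            for item in lst.findall("int"):
                counts[item.get("name")] = int(item.text)
    return counts


if __name__ == "__main__":
    for ident, n in facet_counts(FACET_XML, "remoteMetadataIdentifier").items():
        print(ident, n)
```

Because the query uses facet.mincount=2, every identifier returned in this list is one that occurs at least twice in the index.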
 
 
Do you have a check in place to ensure we are receiving all the records we are supposed to receive?
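
Such a check could be as simple as comparing the identifiers the catalogue is known to hold with those actually harvested. The sketch below assumes both lists are available as plain strings, and `harvest_report` is a hypothetical name. If a duplicate takes the place of a good record, the total count still looks right, but the good record shows up under "missing".

```python
def harvest_report(expected_ids, received_ids):
    """Compare the identifiers a catalogue should serve with those harvested."""
    expected = set(expected_ids)
    received = list(received_ids)
    unique = set(received)
    return {
        "missing": sorted(expected - unique),        # never received
        "unexpected": sorted(unique - expected),     # received but not expected
        "duplicate_count": len(received) - len(unique),
    }


if __name__ == "__main__":
    # Three records expected and three received, yet "c" is missing.
    print(harvest_report(["a", "b", "c"], ["a", "a", "b"]))
```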
 
Best regards,
Angelo

#15 Updated by Angelo Quaglia over 2 years ago

  • Estimated time deleted (2.00)
