Support #2796

UK: duplicate fileIdentifiers

Added by Angelo Quaglia about 4 years ago. Updated almost 4 years ago.

Status: Closed
Start date: 23 Jun 2016
Priority: Normal
Due date:
Assignee: Angelo Quaglia
% Done: 0%
Category: Harvesting results
Target version: -
Submitting Organisation: UK
Knowledge-Base relevant?:
Proactive: Yes
Keyword #1:
Country: UK - United Kingdom
Keyword #2:
Originating UI:
Keyword #3:

Description

Dear Alex,

there is a serious problem of duplicate fileIdentifiers coming from UK's National Discovery Service.

It seems the same metadata documents are served multiple times.

For example:

fileIdentifier:

cff79fcb-3ca4-3d84-8c30-a28075371fe5

It appears in three metadata documents that seem to have the same content.

They were returned in batches 6421-6440, 16601-16620, 19601-19620.

They appear to have the same content:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/6421-6440/datasets/1/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/16601-16620/datasets/4/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160622-223120/services/1/PullResults/19601-19620/datasets/1/
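A check of this kind can be sketched in a few lines of Python. This is only an illustration, not the Geoportal's own code: the function name and the input shape (a mapping from batch label to the fileIdentifiers found in that batch) are assumptions, and the identifiers marked as placeholders are invented.

```python
from collections import defaultdict


def find_duplicate_identifiers(batches):
    """Map each fileIdentifier seen more than once to the batches that returned it.

    `batches` maps a batch label (e.g. "6421-6440") to the list of
    fileIdentifiers found in that batch.
    """
    seen = defaultdict(list)
    for label, identifiers in batches.items():
        for identifier in identifiers:
            seen[identifier].append(label)
    return {fid: labels for fid, labels in seen.items() if len(labels) > 1}


# The batch labels and the duplicated identifier come from the report above;
# the "placeholder-*" identifiers are invented for illustration only.
batches = {
    "6421-6440": ["cff79fcb-3ca4-3d84-8c30-a28075371fe5", "placeholder-a"],
    "16601-16620": ["placeholder-b", "cff79fcb-3ca4-3d84-8c30-a28075371fe5"],
    "19601-19620": ["cff79fcb-3ca4-3d84-8c30-a28075371fe5"],
}
duplicates = find_duplicate_identifiers(batches)
```

Here `duplicates` holds the single repeated identifier together with the three batches in which it was returned.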

 

The INSPIRE Geoportal Resource Browser (at http://inspire-geoportal.ec.europa.eu/proxybrowser/) makes it easy to find the duplicates:


Related issues

Related to Geoportal Helpdesk - Support #2801: PL - CODGIK: Duplicate metadata Closed 08 Sep 2016 08 Sep 2016
Related to Geoportal Helpdesk - Support #2894: EU: Summary and status of duplicate metadata fileIdentifiers Assigned 23 Dec 2016

History

#1 Updated by Angelo Quaglia about 4 years ago

FYI.

#2 Updated by Alex Ramage about 4 years ago

Angelo,

Thanks for bringing this to my attention.  I will let the UKNCP know as well and will get it investigated.

Alex Ramage

#3 Updated by Angelo Quaglia about 4 years ago

On 28 Jun 2016, at 11:29, David Read <david.read@hackneyworkshop.com> wrote:

Angelo,

 

I've been looking at these duplicate INSPIRE records from the UK. I've written some test code to examine records seen on data.gov.uk, our CSW service and your European geoportal (via the SOLR API) and have run it this morning (full detail is included below if you want it).

 

I see that there is only one record that is a duplicate on the European Geoportal - 141f4451-e028-37b3-908c-b2a531a434c7. Yesterday there were 5 e.g. 2e05984b-4674-3414-83b6-8dc87e779014. The CSW you're getting the records from has no duplicates, so either my test is wrong or something has gone awry with your CSW client.

 

I wonder if you are harvesting at the same time as the CSW's data is changing, and either the CSW server (GeoNetwork), your client, or the CSW spec can't cope well with this?

 

Comparing the numbers between our CSW and the Geoportal, the numbers of records are 21113 and 20923, so 190 records are either missed or discarded e.g. ced1ef64-cc56-4384-a679-23dfb5c10070. I remember you saying you don't accept some records that we have - perhaps because they lack an INSPIRE keyword - does that explain it?
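The comparison described here (records, unique records, and records missing from the next catalogue in the chain) can be sketched as follows. This is not the test code linked below; the function names are assumptions and the identifiers are hypothetical.

```python
from collections import Counter


def summarise(identifiers):
    """Return (total, unique, duplicated identifiers) for one catalogue."""
    counts = Counter(identifiers)
    duplicated = sorted(fid for fid, n in counts.items() if n > 1)
    return len(identifiers), len(counts), duplicated


def missing_downstream(upstream, downstream):
    """Identifiers present upstream that never reached the downstream catalogue."""
    return sorted(set(upstream) - set(downstream))


# Hypothetical identifiers, for illustration only.
os_csw = ["id-1", "id-2", "id-3", "id-4"]
europe = ["id-1", "id-2", "id-2", "id-4"]

total, unique, duplicated = summarise(europe)
missing = missing_downstream(os_csw, europe)
```

In this toy data the downstream catalogue has four records but three unique ones, with `id-2` duplicated and `id-3` missing.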

 

Regards,

David

 

https://github.com/datagovuk/ckanext-dgu/blob/a0847390348309e80ccaeead7caa69c944bb0c28/ckanext/dgu/lib/reports.py#L999-L1272

 

2016-06-28 08:57:05,837 ERROR [ckanext.dgu.lib.reports] data.gov.uk: Duplicate guids 3 = {2a9cfb16-de9a-4ed9-8f97-f2582f7a4485} strategic-flood-risk-assessment-zone-3ai, 29dcc62a-6a3a-40b6-be13-31da66e06e07 enpa-former-orchard, {B3D1B233-ECD5-4F91-B7AA-736E185A0EFD} roads-service-assets-inspire-view-service

2016-06-28 08:57:05,869 DEBUG [ckanext.dgu.lib.reports] data.gov.uk: 21793 records, 21790 unique

2016-06-28 08:57:07,209 ERROR [ckanext.dgu.lib.reports] data.gov.uk CSW: Duplicate guids 6 = 7cddc839-8e7c-4fd2-b9b4-85e59f1ae463 producers-leasing-direct-sales-milk-quota-by-county-2000-to-2001, 4bf2043b-a2e7-4694-935d-00aa6a0c5080 tree-preservation-orders36, 01a57b8cd776c92e4c3b02e19fa5542d 1990-southampton-oceanography-centre-the-infauna-of-the-handfast-point-maerl-bed-day-grab-surve, 552baeb4-9d98-47ec-83e4-dd971bdb2cb2 agricultural-land-classification-detailed-post1988-survey-alcr01893, 61293137-fc5a-4481-badd-465df76d88b9 lidar-dsm-time-stamped-tiles-2008-50cm, 58db7066-147a-4818-8d99-71a0833f0a6b milk-quota-holding-sizes-1994-to-2010

2016-06-28 08:57:07,221 DEBUG [ckanext.dgu.lib.reports] data.gov.uk CSW: 21790 records, 21784 unique

2016-06-28 08:57:07,237 ERROR [ckanext.dgu.lib.reports] OS CSW: No duplicate guids

2016-06-28 08:57:07,241 DEBUG [ckanext.dgu.lib.reports] OS CSW: 21113 records, 21113 unique

2016-06-28 08:57:07,473 ERROR [ckanext.dgu.lib.reports] Europe: Duplicate guids 1 = 141f4451-e028-37b3-908c-b2a531a434c7 bathymetric-survey-2003-10-02-alfred-dock-entrance

2016-06-28 08:57:07,481 DEBUG [ckanext.dgu.lib.reports] Europe: 20924 records, 20923 unique

2016-06-28 08:57:07,541 DEBUG [ckanext.dgu.lib.reports] dgu->dgu_csw: Records reduced 21790->21784

2016-06-28 08:57:08,954 ERROR [ckanext.dgu.lib.reports] dgu->dgu_csw: Records missing 6 = ea528945-a6cc-4a6e-86dc-2d304aa3d950 agricultural-land-classification-detailed-post-1988-survey-alcb09294, CEFAS6e4b11b9-8279-402f-afa1-4fe27edf5d9f 2008-2008-centre-for-environment-fisheries-aquaculture-science-cefas-north-sea-conducti-18-2008, 36ca7d2b-aa7f-46ab-948a-606874830dc5 agricultural-land-classification-detailed-post-1988-survey-alcw05395, CEFAS9fa8b61e-5e5e-486f-9714-9495d3613b10 1986-1986-centre-for-environment-fisheries-aquaculture-science-cefas-survey-ecst-1-86-part-of-i, c72f02d4-60dd-44c2-a5e4-09335de8b6e1 community-uses, e927d5fe-edd6-4bb9-8574-f2e44173cd1a agricultural-land-classification-detailed-post-1988-survey-alcr16793 (created in last 24h)

2016-06-28 08:57:08,982 DEBUG [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records reduced 21784->21113

2016-06-28 08:57:10,147 ERROR [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records added 5 = CEFAS9fa8b61e-5e5e-486f-9714-9495d3613b10 1986-1986-centre-for-environment-fisheries-aquaculture-science-cefas-survey-ecst-1-86-part-of-i, c72f02d4-60dd-44c2-a5e4-09335de8b6e1 community-uses, CEFAS6e4b11b9-8279-402f-afa1-4fe27edf5d9f 2008-2008-centre-for-environment-fisheries-aquaculture-science-cefas-north-sea-conducti-18-2008, 36ca7d2b-aa7f-46ab-948a-606874830dc5 agricultural-land-classification-detailed-post-1988-survey-alcw05395, ea528945-a6cc-4a6e-86dc-2d304aa3d950 agricultural-land-classification-detailed-post-1988-survey-alcb09294

2016-06-28 08:59:39,372 ERROR [ckanext.dgu.lib.reports] dgu_csw->os_csw: Records missing 676 e.g. 5bb29aeb-77ea-4a5a-902a-733217b1a6fd foot-and-mouth-disease-2001-daily-overview-maps-week-commencing-24-09-2001 (created in last 24h), a297a285-6d69-4ad7-a468-bd499a4e6573 allelic-diversity-among-12-commonest-spoligotype-of-btb-in-gb (created in last 24h), 366f8e55-e59f-4d9c-89a8-7fbaf63dba4b laboratory-test-figures-international-trade-miscellaneous-species-2012 (created in last 24h), 8a9024d4-1f19-46f0-8d44-59d9d0694ae5 laboratory-tests-commercial-pigs-samples-2008 (created in last 24h), 7c7cb942-633e-4cc1-a7bf-b79a765d298c vertical-aerial-photography-rgb-2006-20cm (created in last 24h), 877089dc-d7d1-402a-b3b8-f5f22418b609 laboratory-tests-endemic-research-surveillance-miscellaneous-species-samples-2006 (created in last 24h), 9e0890ad-581d-4599-8c8e-9b579c7bf5cd vertical-aerial-photography-rgb-2008-10cm (created in last 24h), f444a4e9-9250-48d3-814b-834e8acf26c9 laboratory-test-figures-food-and-environment-surveillance-cattle-2009 (created in last 24h), f1f23192-e30a-492f-9d24-0f9b011f34ed laboratory-tests-commercial-miscellaneous-species-samples-2007 (created in last 24h), 28968494-62a4-4211-b43a-6f02e8156547 laboratory-tests-commercial-pigs-samples-2013 (created in last 24h) Counts: defaultdict(<type 'int'>, {'created in last 24h': 590})

2016-06-28 08:59:39,416 DEBUG [ckanext.dgu.lib.reports] os_csw->europe: Records reduced 21113->20923

2016-06-28 09:00:21,339 ERROR [ckanext.dgu.lib.reports] os_csw->europe: Records missing 190 e.g. ced1ef64-cc56-4384-a679-23dfb5c10070 allerdale-disabled-facilities-grant-land-charge, 83b41246-a37c-4d66-8fb2-435607b32c45 third-uk-habitats-directive-report-2013-uk-level-species-details, 80d13463-8ba2-4862-ab67-86ad5317723c allerdale-closing-order-land-charge, 1fff9004-3a27-47cc-a300-eff82b607888 allerdale-land-compensation-act-land-charge, fe58b76b-894a-4603-b21e-9d7d84490a40 allerdale-recycling-site-data, 0c3946cb-900a-4d13-bb2b-64b3f413dbd3 allerdale-policy-town-centre, 4ae910b1-c79c-426e-9a88-4b66c6fd42df allerdale-building-control-register, 9df8df51-6401-37a8-e044-0003ba9b0d98 digital-geochronological-index, d3914c1c-e252-4b49-a3b1-c2a0d180d6c5 allerdale-policy-tourism, 8bcf6f03-8954-4d61-9e16-4f322e95063a status-and-trends-for-individual-bird-species-tenth-uk-report-for-article-12-of-the-e-2008-2012 Counts: defaultdict(<type 'int'>, {})

 

 

#4 Updated by Angelo Quaglia about 4 years ago

On 28 Jun 2016, at 14:43, Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu> wrote:

Dear David,

 

Many thanks for coming back to me about this issue.

 

I do expect issues linked to reindexing and to the fact that the harvesting is not performed in a single transaction, but I would not expect those issues to happen systematically.

Is there any scheduled maintenance (i.e. re-indexing) running on the catalogue during the night?

If yes, it is possible to schedule the harvesting so that it runs inside a specific day/time window.

 

Indeed, today the only duplicate is:

141f4451-e028-37b3-908c-b2a531a434c7

 

Geoportal Representations:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/9601-9620/datasets/20/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/21101-21113/datasets/13/

 

The record was downloaded twice, in two distinct GetRecords responses that are quite far apart in terms of startPosition:

 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/9601-9620/downloaded.xml

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/21101-21113/downloaded.xml

 

I am also reviewing duplicates coming from other catalogues.

Could you please tell me which version of GeoNetwork you are currently running?

 

Best regards,

Angelo

 

 

P.S.:

I have been asked to request that people I interact with about INSPIRE Geoportal issues get access to the MIG collaboration space.

 

You can do so following the instructions below:

  

 

Dear discovery service contact points, MIG representatives and NCPs,

we would like to inform you about a number of changes in how we develop / maintain the INSPIRE geoportal (including its harvesting and validation components) and how we communicate with Member States about any questions or issues related to the harvesting and validation of metadata from the discovery services in the Member States.

Geoportal development

In the past, the INSPIRE validator, which is operated as part of the Geoportal harvester, was updated whenever new issues were identified or when the time schedule of the legal obligations required this.

In order to guarantee a more stable behaviour over longer time periods, we decided to switch to a managed release cycle, accompanied by a clear list of the changes and by a timely communication before the release. Relevant communications will be published through the MIG collaboration space, and contact points will receive an automatic update e-mail (see below).

Communication

In order to streamline and keep better track of the communication between the geoportal team and the contact points for the discovery services in the Member States, we will start using a dedicated project on the MIG collaboration space (including a wiki, issue tracker and news section) as the main communication channel instead of e-mail. The project will be private and will be open to all registered discovery service contact points as well as interested INSPIRE NCPs or MIG representatives.

Getting access
In order to get access to this Geoportal project on the MIG collaboration space, please send an e-mail to inspire-geoportal@jrc.ec.europa.eu (or reply to this mail). If you have never used the MIG collaboration space for other projects, please also send us your ECAS login.

If you have any questions, please let us know.

Best regards,

The JRC INSPIRE team.

 

#5 Updated by Angelo Quaglia about 4 years ago

On 28 Jun 2016, at 16:45, David Read <david.read@hackneyworkshop.com> wrote:

 

Angelo,

 

The GeoNetwork with the CSW is managed by OS, but Peter's away for a few days, but I do have some info. It gets its datasets every few days from data.gov.uk:

https://github.com/datagovuk/ckanext-dgu/issues/441#issuecomment-229006407

and it shouldn't change between then. I see the Geoportal requests were about 4am, which does avoid the regular times that the Geonetwork updates itself, which is good. And the Geonetwork doesn't appear to have harvested from data.gov.uk at all last night so should not have been changing during then. So, that isn't the problem after all.

My tests still show that our GeoNetwork CSW doesn't respond with duplicates, during simple GetRecords calls. I have provided a simple script below which you can use to satisfy yourself. I suggest you adapt it to the exact calls your software makes, to try to reproduce the problem, if it is indeed on our side.

 

David

 

#!/bin/bash
# Page through the CSW in batches and compare total vs. unique identifiers.

FILE='identifiers.txt'
BATCH=20

# -f: do not complain if the file does not exist yet
rm -f "$FILE"

for (( i=1; i<=21113; i=i+BATCH ))
do
   echo "Getting $BATCH records from $i"
   # The trailing backslash keeps the pipe on the same logical line as curl
   curl -s "http://csw.data.gov.uk/geonetwork/srv/en/csw?service=CSW&request=GetRecords&constraintLanguage=CQL_TEXT&typeNames=csw%3ARecord&resultType=results&version=2.0.2&esn=brief&maxrecords=$BATCH&startposition=$i" \
     | grep dc:identifier >> "$FILE"
done

echo "Number of identifiers: $(wc -l < "$FILE")"
echo "Number of unique identifiers: $(sort -u "$FILE" | wc -l)"
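
The script above counts lines that contain dc:identifier. As a hedged alternative, the identifiers can be extracted by parsing each response, so that elements are counted rather than lines. The sketch below assumes the brief records carry dc:identifier in the usual Dublin Core namespace; the sample response is invented for illustration and is not output from csw.data.gov.uk.

```python
import xml.etree.ElementTree as ET

NS = {
    "csw": "http://www.opengis.net/cat/csw/2.0.2",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def identifiers_in_response(xml_text):
    """Return the text of every dc:identifier element in a GetRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//dc:identifier", NS) if el.text]


# Invented sample response, for illustration only.
SAMPLE = """<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <csw:SearchResults numberOfRecordsMatched="2" numberOfRecordsReturned="2" nextRecord="3">
    <csw:BriefRecord><dc:identifier>id-1</dc:identifier></csw:BriefRecord>
    <csw:BriefRecord><dc:identifier>id-2</dc:identifier></csw:BriefRecord>
  </csw:SearchResults>
</csw:GetRecordsResponse>"""

identifiers = identifiers_in_response(SAMPLE)
```

The resulting lists from all batches can then be concatenated and compared against their set of unique values.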

#6 Updated by Angelo Quaglia about 4 years ago

On 28 Jun 2016, at 17:26, Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu> wrote:

Dear David,

 

I am not totally surprised that you cannot reproduce the same behaviour now, as we have already observed that the problem manifestation keeps changing with every harvesting.

 

I think that the evidence stored by the Geoportal is enough to locate the problem on the CSW service side.

 

The GetRecordsResponse envelopes are stored verbatim and the nextRecord value, together with numberOfRecordsReturned, gives evidence of the GetRecordsRequest that was sent to the CSW server by the INSPIRE Geoportal:

 

The same fileIdentifier 141f4451-e028-37b3-908c-b2a531a434c7 is found in the two clearly distinct GetRecords responses I already mentioned:

 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/9601-9620/downloaded.xml

<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">

<csw:SearchStatus timestamp="2016-06-28T04:39:14"/>

<csw:SearchResults numberOfRecordsMatched="21113" numberOfRecordsReturned="20" elementSet="full" nextRecord="9621">

 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160628-050108/services/1/PullResults/21101-21113/downloaded.xml

<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">

<csw:SearchStatus timestamp="2016-06-28T05:25:29"/>

<csw:SearchResults numberOfRecordsMatched="21113" numberOfRecordsReturned="13" elementSet="full" nextRecord="21114">

 

I have already seen, with other countries using GeoNetwork, signs of malfunctioning under the strain of a concurrent full harvesting.

 

The INSPIRE Geoportal keeps all past harvesting so we will be able to analyse them to find some pattern.
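
The reasoning about nextRecord and numberOfRecordsReturned can be made explicit with a small sketch. It assumes, as the two responses quoted above suggest, that the server reports nextRecord as startPosition plus the number of records returned; a server that reports nextRecord as 0 at the end of the result set would need separate handling. The function name is an assumption.

```python
import re


def requested_window(search_results_tag):
    """Infer (first, last) record positions served by one GetRecords response."""
    next_record = int(re.search(r'nextRecord="(\d+)"', search_results_tag).group(1))
    returned = int(
        re.search(r'numberOfRecordsReturned="(\d+)"', search_results_tag).group(1)
    )
    start_position = next_record - returned
    return start_position, next_record - 1


# The two SearchResults tags quoted in the message above.
first = requested_window(
    '<csw:SearchResults numberOfRecordsMatched="21113" '
    'numberOfRecordsReturned="20" elementSet="full" nextRecord="9621">'
)
second = requested_window(
    '<csw:SearchResults numberOfRecordsMatched="21113" '
    'numberOfRecordsReturned="13" elementSet="full" nextRecord="21114">'
)
```

The inferred windows, 9601-9620 and 21101-21113, match the batch names in the two Geoportal URLs.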

 

Best regards,

Angelo

 

#7 Updated by Angelo Quaglia about 4 years ago

On 28 Jun 2016, at 17:51, David Read <david.read@hackneyworkshop.com> wrote:

Angelo,

And do you record requests too? Hopefully you can understand that
unless you can help us reproduce the problem ourselves, even if it
only happens occasionally, then we can't do much more from our side,
I'm afraid.

Regards,
David

#8 Updated by Angelo Quaglia about 4 years ago

On 29 Jun 2016, at 14:32, Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu> wrote:

Dear David,
I understand.
I will keep an eye on the results, and as soon as I spot some kind of pattern in the occurrence of the malfunction, or find the same happening with some other country, I will come back to you.

Best regards,
Angelo

#9 Updated by Angelo Quaglia about 4 years ago

-----Original Message-----
From: Alex.Ramage@transport.gov.scot [mailto:Alex.Ramage@transport.gov.scot]
Sent: 29 June 2016 14:34
To: angelo.quaglia@ext.jrc.ec.europa.eu
Cc: david.read@hackneyworkshop.com
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo, David,

 

I note that there are more records duplicated today in the GeoPortal if that is helpful.

 

 

 

Alexander D. Ramage

Head of Management Information Systems

Asset Management and Procurement

Trunk Road and Bus Operation

 

 

#10 Updated by Angelo Quaglia about 4 years ago

-----Original Message-----
From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 30 June 2016 10:51
To: 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Peter Parslow' <Peter.Parslow@os.uk>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Dear David,

 

The GetRecords requests appear only in the web server log.

I will add them to the report, as well.

I expect them to be all the same, though, except for the startPosition parameter value which can be inferred from the nextRecord value in the response.

 

I have identified another service where the same thing is happening.

It is a Polish catalogue linked to the National one and is harvested recursively.

I am investigating that as well in https://ies-svn.jrc.ec.europa.eu/issues/2801

However, there are evident signs of index corruption:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-d81e48c4-b4cf-11e3-a455-52540004b857_20160623-180235/services/1/PullResults/5451-5500/services/34/resourceLocator1/discovery/services/1/linkedDiscoveryService/services/1/PullResults/

 

In the meantime, could you please confirm that you are using GeoNetwork?

I would need to know the exact version and whether there are customizations applied.

 

 

Best regards,

Angelo

 

#11 Updated by Angelo Quaglia about 4 years ago

  • Priority changed from Urgent to Normal

#12 Updated by Angelo Quaglia about 4 years ago

-----Original Message-----
From: Peter Parslow [mailto:Peter.Parslow@os.uk]
Sent: 30 June 2016 11:20
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>; 'David Read' <david.read@hackneyworkshop.com>
Cc: Alex.Ramage@transport.gov.scot; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo,

A question simple enough for me to answer, even whilst away from my office (ironically, in JRC - "study on model extensions").

 

The UK Discovery Service (http://csw.data.gov.uk/geonetwork/) is indeed running GeoNetwork. It's still on version 2.6.4. We haven't made any customisations.

 

Peter

 

#13 Updated by Angelo Quaglia about 4 years ago

-----Original Message-----
From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 01 July 2016 09:47
To: 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Peter Parslow' <Peter.Parslow@os.uk>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Dear David,

 

Interestingly enough, the linked (not the National one) Polish catalogue showing the same erratic behaviour is also running GeoNetwork 2.6.4.

 

This inconsistent sorting defeats the purpose of the startPosition parameter and makes pagination impossible.

 

A workaround could be using the sortBy parameter.

I will try that out.
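
Why an unstable result order breaks startPosition-based paging, and why an explicit sort helps, can be shown with a toy simulation. Everything here is hypothetical: the "servers" are plain Python functions, and the rotation by one position per request merely stands in for whatever reordering the real catalogue performs.

```python
def harvest(fetch_page, total, batch):
    """Page through `total` records using a 1-based start position."""
    collected = []
    for start in range(1, total + 1, batch):
        collected.extend(fetch_page(start, batch))
    return collected


def make_unstable_server(records):
    """Toy server whose result order rotates by one position on every request."""
    state = {"calls": 0}

    def fetch_page(start, batch):
        shift = state["calls"] % len(records)
        state["calls"] += 1
        ordering = records[shift:] + records[:shift]
        return ordering[start - 1:start - 1 + batch]

    return fetch_page


def make_sorted_server(records):
    """Toy server that applies the same explicit sort to every request."""

    def fetch_page(start, batch):
        return sorted(records)[start - 1:start - 1 + batch]

    return fetch_page


records = ["a", "b", "c", "d", "e", "f"]
unstable = harvest(make_unstable_server(records), len(records), 2)
stable = harvest(make_sorted_server(records), len(records), 2)
```

In this toy run the unstable harvest returns six records but only four distinct ones ("a" and "b" twice, "c" and "f" never), while the sorted harvest returns each record exactly once.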

 

 

Best regards,

Angelo

 

#14 Updated by Angelo Quaglia about 4 years ago

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 01 July 2016 10:09
To: 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Peter Parslow' <Peter.Parslow@os.uk>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Dear David,

 

The geoportal will be using the following query, where the part in bold (the ogc:SortBy element) is the new one:

 

<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:ogc="http://www.opengis.net/ogc" xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0" service="CSW" version="2.0.2" maxRecords="10" startPosition="5" resultType="results" outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd">

  <csw:Query typeNames="gmd:MD_Metadata">

    <csw:ElementSetName>full</csw:ElementSetName>

    <ogc:SortBy>

      <ogc:SortProperty>

        <ogc:PropertyName>apiso:Identifier</ogc:PropertyName>

        <ogc:SortOrder>ASC</ogc:SortOrder>

      </ogc:SortProperty>

    </ogc:SortBy>

  </csw:Query>

</csw:GetRecords>

 

 

Angelo
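
For anyone wishing to reproduce the harvest by hand, a request like the one above could be generated per batch as sketched below. This is an assumption-laden illustration, not the Geoportal's code: the template and function names are invented, the xsi:schemaLocation attribute is omitted for brevity, and the request would still have to be POSTed to the CSW endpoint separately.

```python
import xml.etree.ElementTree as ET

GET_RECORDS_TEMPLATE = """<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    xmlns:ogc="http://www.opengis.net/ogc"
    xmlns:gmd="http://www.isotc211.org/2005/gmd"
    xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0"
    service="CSW" version="2.0.2" resultType="results"
    outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml"
    maxRecords="{max_records}" startPosition="{start_position}">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy>
      <ogc:SortProperty>
        <ogc:PropertyName>apiso:Identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>"""


def build_get_records(start_position, max_records):
    """Return a GetRecords body with an explicit sort for one batch."""
    body = GET_RECORDS_TEMPLATE.format(
        start_position=int(start_position), max_records=int(max_records)
    )
    ET.fromstring(body)  # raises ParseError if the body is not well-formed
    return body


request_body = build_get_records(start_position=9601, max_records=20)
```

Only startPosition and maxRecords vary between batches; the sort clause stays identical so that every page is cut from the same ordering.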

#15 Updated by Angelo Quaglia about 4 years ago

-----Original Message-----
From: d.t.read@gmail.com [mailto:d.t.read@gmail.com] On Behalf Of David Read
Sent: 01 July 2016 11:04
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>
Cc: Alex.Ramage@transport.gov.scot; Peter Parslow <Peter.Parslow@os.uk>; Johnny Dixon <john.dixon@defra.gsi.gov.uk>; King, Jason (SCFS) <Jason.King@defra.gsi.gov.uk>
Subject: Re: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo,

 

Thanks for doing this - let's hope that 'sorts' out the problem.

 

By the way, Peter, we've long been planning to move to new CSW software (based on PyCSW) to be able to retire the OS GeoNetwork box, and we plan to do that in the next couple of months.

 

Dave

#16 Updated by Angelo Quaglia about 4 years ago

This issue was raised also for pycsw:

https://github.com/geopython/pycsw/issues/301

It was recognized as being not a bug but a shortcoming of the CSW 2.0.2 standard.

It has been addressed in CSW 3.0 as follows:

http://docs.opengeospatial.org/is/12-176r7/12-176r7.html

Requirement-108

If no sort is specified then the server shall sort the results according to its default sort which shall be declared in the capabilities doc (see Table 20).

 

Requirement-109

If no sort is specified and if no default sort is specified in the capabilities document then it is assumed that the server will sort responses alphabetically by Title in ascending order

 

#17 Updated by Angelo Quaglia about 4 years ago

Dear  David,

Last night a harvesting took place with the new "sort by" clause.

I have analyzed the results.

After reading that, I would appreciate if you could suggest a time after which it would be best to kick off the harvesting.

Perhaps 3am GMT (4am GMT+1, 5am GMT+2)?

 

The harvesting started at 05 Jul 2016, 22:20:36 GMT  and ended at 06 Jul 2016, 00:03:51 GMT

Your server seems to be GMT+1, so the time frame was 05 Jul 2016, 23:20:36 GMT+1  => 06 Jul 2016, 01:03:51 GMT+1

During the harvesting the numberOfRecordsMatched went from 22041 to 22045.

That can certainly happen but there was a serious inconsistency in the nextRecord value of the last GetRecordsResponse (available at http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/22041-22041/downloaded.xml):

<csw:SearchStatus timestamp="2016-07-06T00:54:52"/>
<csw:SearchResults numberOfRecordsMatched="22045" numberOfRecordsReturned="1" elementSet="full" nextRecord="22042">

The corresponding GetRecordsRequest (based on the initial count of 22041) had startPosition=22041.

This might be due to the fact that by default GeoNetwork triggers a Lucene index optimization at midnight, every night.

Could you please confirm this is indeed the case for your implementation?

Lucene Index Optimizer

Configuration settings in this group determine when the Lucene Index Optimizer is run. By default, this takes place at midnight each day. With recent upgrades to Lucene, particularly Lucene 3.6.1, the optimizer is becoming less useful, so this configuration group will very likely be removed in future versions.

I installed GeoNetwork 2.6.4 and indeed the default is there:

 

 
 
 
 

The report is here:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/

First GetRecordsResponse:

<csw:GetRecordsResponse xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.opengis.net/cat/csw/2.0.2 http://schemas.opengis.net/csw/2.0.2/CSW-discovery.xsd"><csw:SearchStatus timestamp="2016-07-05T23:20:54"/><csw:SearchResults numberOfRecordsMatched="22041" numberOfRecordsReturned="20" elementSet="full" nextRecord="21">

 
Last GetRecordsResponse:
<csw:SearchStatus timestamp="2016-07-06T00:54:52"/>
<csw:SearchResults numberOfRecordsMatched="22045" numberOfRecordsReturned="1" elementSet="full" nextRecord="22042">
 

Result of the interaction with the Discovery Service

Resources available for discovery: 22041
Expected Resource Count: 22041
Actual Resource Count: 22041

 
 
 
Other GetRecordsResponses showing the increase in the number of records

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/13601-13620/series/20/

<csw:SearchStatus timestamp="2016-07-06T00:13:39"/>
<csw:SearchResults numberOfRecordsMatched="22043" numberOfRecordsReturned="20" elementSet="full" nextRecord="13621">
 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/13621-13640/series/1/

<csw:SearchStatus timestamp="2016-07-06T00:13:45"/>
<csw:SearchResults numberOfRecordsMatched="22044" numberOfRecordsReturned="20" elementSet="full" nextRecord="13641">
 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/14281-14300/services/20/

<csw:SearchStatus timestamp="2016-07-06T00:15:59"/>
<csw:SearchResults numberOfRecordsMatched="22044" numberOfRecordsReturned="20" elementSet="full" nextRecord="14301">
 

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160706-002019/services/1/PullResults/14301-14320/services/1/

<csw:SearchStatus timestamp="2016-07-06T00:16:04"/>
<csw:SearchResults numberOfRecordsMatched="22045" numberOfRecordsReturned="20" elementSet="full" nextRecord="14321">
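
The drift in numberOfRecordsMatched can be spotted automatically by comparing consecutive responses. The sketch below is an illustration with an assumed function name; the timestamps and counts are the ones quoted in this comment, and since only a sample of the responses is listed, a jump of more than one between two entries is expected.

```python
def count_changes(observations):
    """Return (timestamp, previous, current) for each change in the matched count."""
    changes = []
    previous = None
    for timestamp, matched in observations:
        if previous is not None and matched != previous:
            changes.append((timestamp, previous, matched))
        previous = matched
    return changes


# (SearchStatus timestamp, numberOfRecordsMatched) as quoted above.
observations = [
    ("2016-07-05T23:20:54", 22041),
    ("2016-07-06T00:13:39", 22043),
    ("2016-07-06T00:13:45", 22044),
    ("2016-07-06T00:15:59", 22044),
    ("2016-07-06T00:16:04", 22045),
    ("2016-07-06T00:54:52", 22045),
]
changes = count_changes(observations)
```

A harvester could use such a check to flag a run as suspect whenever the catalogue changed size while it was being paged through.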
 

 

#18 Updated by Angelo Quaglia about 4 years ago

From: d.t.read@gmail.com [mailto:d.t.read@gmail.com] On Behalf Of David Read
Sent: 06 July 2016 12:07
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>
Cc: Alex.Ramage@transport.gov.scot; Johnny Dixon <john.dixon@defra.gsi.gov.uk>; King, Jason (SCFS) <Jason.King@defra.gsi.gov.uk>; Peter Parslow <Peter.Parslow@os.uk>
Subject: Re: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo,

 

I'll ask Peter Parslow to respond, who manages the CSW for UK. If there's a simple option to change to fix this, then great. But as previously mentioned, we will move to a new service anyway later this summer.

 

David

#19 Updated by Angelo Quaglia about 4 years ago

  • Status changed from Assigned to Feedback

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 06 July 2016 16:46
To: 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>; 'Peter Parslow' <Peter.Parslow@os.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

David,

Thanks.

No modification is needed.

 

I just need confirmation that the Lucene index optimization happens at midnight, as per the GeoNetwork default settings.

 

Assuming that is the case, I have added a timeframe (in bold, below) so that the harvesting starts only after 3am GMT:

 

    <ns9:SchedulerDetails>
        <ns9:RecachingFrequency>daily</ns9:RecachingFrequency>
        <ns9:RecachingStartTimeFrame>
                <ns9:TimeWindow>
                        <ns9:AfterSpecificTime>03:00:00</ns9:AfterSpecificTime>
                </ns9:TimeWindow>
        </ns9:RecachingStartTimeFrame>

    </ns9:SchedulerDetails>

Best regards,

Angelo
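
The effect of the start-time rule in the scheduler fragment above can be sketched as follows. How the Geoportal scheduler actually evaluates AfterSpecificTime is not documented in this thread, so the interpretation here (a simple comparison against the time of day in GMT) and both function names are assumptions.

```python
from datetime import datetime, time, timedelta, timezone


def may_start(now, after="03:00:00"):
    """True if the GMT time of day of the aware datetime `now` is at or past `after`."""
    earliest = time.fromisoformat(after)
    return now.astimezone(timezone.utc).time() >= earliest


def local_equivalents(after="03:00:00", offsets_hours=(0, 1, 2)):
    """Express the GMT start time in GMT, GMT+1 and GMT+2 wall-clock time."""
    base = datetime.combine(
        datetime(2016, 7, 7).date(), time.fromisoformat(after), tzinfo=timezone.utc
    )
    return [
        base.astimezone(timezone(timedelta(hours=h))).strftime("%H:%M")
        for h in offsets_hours
    ]


allowed = may_start(datetime(2016, 7, 7, 3, 1, 12, tzinfo=timezone.utc))
blocked = may_start(datetime(2016, 7, 6, 0, 3, 51, tzinfo=timezone.utc))
equivalents = local_equivalents()
```

Under this reading the rule only excludes the slot between midnight and 03:00 GMT; a start later in the day, such as 22:20 GMT, would still be allowed.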

#20 Updated by Angelo Quaglia about 4 years ago

The new harvesting started this morning just after 5am GMT+2 and ended a bit less than two hours later:

Page created: 07 Jul 2016, 03:01:12 GMT
Page modified: 07 Jul 2016, 04:53:50 GMT

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160707-050054/services/1/PullResults/

The number of records did not change throughout the harvesting:

<csw:SearchStatus timestamp="2016-07-07T05:46:03"/>
<csw:SearchResults numberOfRecordsMatched="22050" numberOfRecordsReturned="10" elementSet="full" nextRecord="22051">
 

I see only one duplicate (fileIdentifier "e664e184-a45c-4c3d-b70e-1e6c76105610")

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160707-050054/services/1/PullResults/19041-19060/datasets/20/

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160707-050054/services/1/PullResults/19061-19080/datasets/1/

I guess the index rebuild phase was not over, yet?

I will push the harvesting start one further.

#21 Updated by Angelo Quaglia about 4 years ago

From: Peter Parslow [mailto:Peter.Parslow@os.uk]
Sent: 08 July 2016 14:57
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>; 'David Read' <david.read@hackneyworkshop.com>
Cc: Alex.Ramage@transport.gov.scot; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo, David,

Just to let you all know; we (OS) are looking at this. Csw.data.gov.uk (GeoNetwork) has actually been harvesting from data.gov.uk (CKAN) four times per day, and re-indexing overnight. Each harvest fetches everything again (CKAN currently has limited CSW query support). This means there’s currently no window that’s good. We’ll adjust this to harvest just once/day from CKAN and reindex. We’ll run this for a couple of days & check the logs again – after which we should be able to let you know a suitable window during which Europe can best harvest.

 

Peter

#22 Updated by Angelo Quaglia about 4 years ago

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 08 July 2016 17:00
To: 'Peter Parslow' <Peter.Parslow@os.uk>; 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Dear Peter,

Many thanks for the update.

 

This issue has some similarities with what is happening with one Polish linked Discovery Service running with GeoNetwork 2.6.4:

https://ies-svn.jrc.ec.europa.eu/issues/2801

 

Best regards,

Angelo

#23 Updated by Angelo Quaglia about 4 years ago

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 12 July 2016 10:34
To: 'Peter Parslow' <Peter.Parslow@os.uk>; 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Peter, Dave,

The harvesting started at 6pm GMT+2 yesterday and resulted in no duplicates.

While checking I noticed that you have many different formats for the fileIdentifier.

Do you have any policy in place?

Those in curly braces can be problematic when included in URL query strings, since the braces need to be percent-encoded.

Here are some examples:

00013b1e-cfe1-433a-927b-2054c486b5a4
000a6fb78e0b0d0383b511fba7866401
0118d763f2e91bc64560f0359731ea8b39080f5b
01BD473DC4224F2BBEB5080086051FBD
1881
2013
2013ef14-e047-4918-b738-70e0245d3da1
CEFAS472a3cac-c4c1-4a2f-a27d-8387f96ae691
CEFAS47E82234-304F-4BAD-A5C1-3C6B0F6B4FCD
CU-LANDIS-SERIES_CORRELATIVE
ContaminatedLand
DDC_car_park2
GBWBCA
Marine_Scotland_FishDAC_1037
blackburn-playgrounds-01-12-2014
{00000000-0000-0000-0000-00000000000x}
{06C70DB1-EE2E-43A9-A705-422C58F106EA}
{cherwell-tpo}
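To illustrate the percent-encoding point, here is a short sketch using Python's standard library. The identifiers are taken from the list above; the query parameters mirror a CSW GetRecordById request and are included only as an example.

```python
from urllib.parse import quote, urlencode

file_ids = [
    "{cherwell-tpo}",
    "{06C70DB1-EE2E-43A9-A705-422C58F106EA}",
    "00013b1e-cfe1-433a-927b-2054c486b5a4",
]

# quote() with safe="" encodes everything except unreserved characters,
# so "{" becomes %7B and "}" becomes %7D; plain UUIDs pass through unchanged.
encoded = [quote(file_id, safe="") for file_id in file_ids]

# urlencode() applies the same kind of encoding to each query parameter value.
query = urlencode({
    "service": "CSW",
    "version": "2.0.2",
    "request": "GetRecordById",
    "id": "{cherwell-tpo}",
})
```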

You can extract those with this URL (add &wt=json to get the JSON format)
http://inspire-geoportal.ec.europa.eu/solr/select?facet=true&q=id:\/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857*&facet.field=remoteMetadataIdentifier&facet.limit=-1&facet.mincount=1&rows=0

 

URL for extracting duplicates:

http://inspire-geoportal.ec.europa.eu/solr/select?facet=true&q=id:\/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857*&facet.field=remoteMetadataIdentifier&facet.limit=-1&facet.mincount=2&rows=0

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">6</int>
    <lst name="params">
      <str name="q">id:\/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857*</str>
      <str name="facet.limit">-1</str>
      <str name="facet.field">remoteMetadataIdentifier</str>
      <str name="facet.mincount">2</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
    </lst>
  </lst>
  <result name="response" numFound="25691" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="remoteMetadataIdentifier"/>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
</response>
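For anyone wanting to script this check, a rough sketch follows. It builds the same facet query with wt=json added and reads the facet counts from the JSON response. It assumes Solr's default JSON layout for facet fields (a flat list alternating value and count); the function names and the sample response are hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

SOLR_SELECT = "http://inspire-geoportal.ec.europa.eu/solr/select"

def duplicates_query_url(resource_prefix, min_count=2):
    """Build the facet query URL; special characters are percent-encoded."""
    params = {
        "facet": "true",
        "q": f"id:\\/{resource_prefix}*",
        "facet.field": "remoteMetadataIdentifier",
        "facet.limit": -1,
        "facet.mincount": min_count,
        "rows": 0,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)

def parse_facet_counts(response_text, field="remoteMetadataIdentifier"):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

url = duplicates_query_url("INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857")

# Hypothetical response body showing what one duplicate would look like.
sample = json.dumps({"facet_counts": {"facet_fields": {
    "remoteMetadataIdentifier": ["e664e184-a45c-4c3d-b70e-1e6c76105610", 2]}}})
duplicates = parse_facet_counts(sample)
```

An empty list for remoteMetadataIdentifier, as in the XML response above, yields an empty dict, meaning no identifier occurs twice.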

 

Validation report for yesterday (expires tomorrow):

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160711-183033/services/1/PullResults

 

Best regards,

Angelo

 

 

#24 Updated by Angelo Quaglia about 4 years ago

From: d.t.read@gmail.com [mailto:d.t.read@gmail.com] On Behalf Of David Read
Sent: 12 July 2016 18:23
To: Peter Parslow <Peter.Parslow@os.uk>
Cc: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>; Alex.Ramage@transport.gov.scot; Johnny Dixon <john.dixon@defra.gsi.gov.uk>; King, Jason (SCFS) <Jason.King@defra.gsi.gov.uk>
Subject: Re: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Great to hear about the progress all round.

 

Peter, please can we take you up on your suggestion of harvesting DGU at midnight? It is a not insignificant load for us, which is not ideal at peak time.

 

Regards,

David

#25 Updated by Angelo Quaglia about 4 years ago

From: Peter Parslow [mailto:Peter.Parslow@os.uk]
Sent: 12 July 2016 18:13
To: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>; 'David Read' <david.read@hackneyworkshop.com>
Cc: Alex.Ramage@transport.gov.scot; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Angelo et al,

Two things: I can confirm that csw.data.gov.uk had been harvesting (complete collection) from data.gov.uk every six hours (03:00 GMT, 09:00 GMT, 15:00 GMT, 21:00 GMT), with the optimize reindexing at midnight. This had given some problems, where a harvest was still in progress when the optimizer ran, and occasionally harvests would overlap (although usually only taking one hour).

 

We have now changed the settings, so that it harvests once per day, at noon. Harvests for the past few days have taken an hour. The optimizer is now running in about one minute, at midnight. (David: it may be more sensible for us to swap these round – what do you think?)

 

Therefore, clients harvesting from csw.data.gov.uk are much less likely to have any problems.
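The timing argument can be checked with a small interval-overlap sketch. The times come from this thread (the Geoportal run of 7 July reported 03:01 to 04:53 GMT; 6pm GMT+2 is 16:00 GMT). The two-hour duration of the later Geoportal run, reading the noon setting as GMT, and the assumption that the new schedule was already in effect on 11 July are mine.

```python
from datetime import datetime

def overlaps(start_a, end_a, start_b, end_b):
    """True when the half-open intervals [start, end) intersect."""
    return start_a < end_b and start_b < end_a

# Geoportal run of 7 July, in GMT, as shown on the results page.
geoportal_old = (datetime(2016, 7, 7, 3, 1), datetime(2016, 7, 7, 4, 54))
# Old csw.data.gov.uk schedule: a harvest at 03:00 GMT lasting about an hour.
csw_harvest_old = (datetime(2016, 7, 7, 3, 0), datetime(2016, 7, 7, 4, 0))

# Geoportal run of 11 July at 16:00 GMT; the two-hour length is assumed.
geoportal_new = (datetime(2016, 7, 11, 16, 0), datetime(2016, 7, 11, 18, 0))
# New schedule: harvest at noon for about an hour, optimizer at midnight for about a minute.
csw_harvest_new = (datetime(2016, 7, 11, 12, 0), datetime(2016, 7, 11, 13, 0))
csw_optimizer_new = (datetime(2016, 7, 12, 0, 0), datetime(2016, 7, 12, 0, 1))

old_clash = overlaps(*geoportal_old, *csw_harvest_old)
new_clash = (overlaps(*geoportal_new, *csw_harvest_new)
             or overlaps(*geoportal_new, *csw_optimizer_new))
```

Under these assumptions the 7 July run coincided with the 03:00 GMT harvest, while the 11 July run avoided both the harvest and the optimizer.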

 

Second thing: fileIdentifiers. The current UK encoding guidance simply states “The content of the XML element shall be a unique managed identifier, such as a system generated UUID. Once the identifier has been set for a metadata instance it shall not change.” A lot of different systems are used to create the metadata instances, and each is left to manage its own identifiers. Conflicts would be reported when the records are collected to data.gov.uk.

 

If there are certain characters which give a problem to the European portal, it would be sensible to mention that in the revised European metadata guidance. However, we can include it in the next revision of the UK one (due this year) – but it may take some time for all the participating data publishers to implement whatever changes we propose.

 

It may be better to code defensively. Especially given that some systems like to wrap their UUIDs in curly braces.

 

I note I can use this URL in my browser (Firefox, Internet Explorer) http://csw.data.gov.uk/geonetwork/srv/en/csw?service=CSW&version=2.0.2&request=GetRecordById&id={cherwell-tpo} and it retrieves the record you would expect.

 

This article gives a summary of the issue: that URLs shouldn’t have curly braces, but many browsers accept them: http://stackoverflow.com/questions/23064605/when-if-ever-should-characters-like-and-curly-braces-be-percent-encoded

 

Now you have highlighted it, I see that RFC3986 is quite clear that ‘curly braces’ are not among the characters allowable in a URI. I hadn’t realised that, and I guess that many system builders don’t either. So it makes sense to disallow them from the fileIdentifier. Perhaps only allowing characters that are allowed in URIs? Or do we have call for Internationalised Resource Identifiers (RFC 3987)? But this should really be at the European guidance level, not specifically UK – although I’m happy to include it in our revised guidance later this year.
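One possible reading of "only allowing characters that are allowed in URIs" is to accept just the unreserved set from RFC 3986 section 2.3, which never needs percent-encoding. The sketch below takes that stricter interpretation as an assumption; it is not an agreed rule, and the function name is hypothetical.

```python
import re

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = re.compile(r"[A-Za-z0-9\-._~]+")

def is_uri_safe(file_identifier):
    """True when the identifier uses only unreserved URI characters."""
    return UNRESERVED.fullmatch(file_identifier) is not None

checks = {
    file_id: is_uri_safe(file_id)
    for file_id in [
        "00013b1e-cfe1-433a-927b-2054c486b5a4",
        "CU-LANDIS-SERIES_CORRELATIVE",
        "{cherwell-tpo}",
        "{06C70DB1-EE2E-43A9-A705-422C58F106EA}",
    ]
}
```

With this rule the brace-wrapped identifiers are rejected, while plain UUIDs and names using hyphens or underscores pass.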

 

I’m glad things are generally working now.

 

Peter

#26 Updated by Angelo Quaglia about 4 years ago

From: Angelo Quaglia [mailto:angelo.quaglia@ext.jrc.ec.europa.eu]
Sent: 12 July 2016 19:15
To: 'Peter Parslow' <Peter.Parslow@os.uk>; 'David Read' <david.read@hackneyworkshop.com>
Cc: 'Alex.Ramage@transport.gov.scot' <Alex.Ramage@transport.gov.scot>; 'Johnny Dixon' <john.dixon@defra.gsi.gov.uk>; 'King, Jason (SCFS)' <Jason.King@defra.gsi.gov.uk>
Subject: RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers

 

Peter,

 

That’s great. I will continue to keep an eye on the harvesting results and report any inconsistency.

 

About the fileIdentifiers, browsers do a very good job when it comes to making a URL work but, as you correctly point out, curly braces are unsafe characters.

For that reason the INSPIRE Geoportal Validator complains harshly when it finds those characters in query strings of URLs inside capabilities or metadata files and they are not percent-encoded.

This is based on the current Metadata guidelines which reference IETF RFC1738 and IETF RFC 2056 as domain for URLs.

 

As for the actual format, there is no recommendation in INSPIRE, with the exception of the recommendation in ISO AP 1.0 that they should be UUIDs.

What concerns me most is that they should be unique across the INSPIRE Infrastructure, so that a GetRecordById can always return a single record even after all metadata have been collected inside the INSPIRE Geoportal.

 

Best regards,

Angelo

 

 

#27 Updated by Angelo Quaglia about 4 years ago

  • Status changed from Feedback to Resolved

From: Peter Parslow [mailto:Peter.Parslow@os.uk]
Sent: 13 July 2016 15:29
To: David Read <david.read@hackneyworkshop.com>
Cc: Angelo Quaglia <angelo.quaglia@ext.jrc.ec.europa.eu>; Alex.Ramage@transport.gov.scot; Johnny Dixon <john.dixon@defra.gsi.gov.uk>; King, Jason (SCFS) <Jason.King@defra.gsi.gov.uk>
Subject: harvest times (was RE: [Geoportal Helpdesk - Support #2796] UK: duplicate fileIdentifiers)

 

Done – we’ve swapped the harvest & optimize times.

 

As Jason says, the conversation has drifted. Perhaps this incident should now be closed.

 

Peter

#28 Updated by Angelo Quaglia almost 4 years ago

Just checked: no duplicates found today, after the harvesting performed yesterday at 6pm:

INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857_20160802-180049

http://inspire-geoportal.ec.europa.eu/solr/select?facet=true&q=id:\/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857*&facet.field=remoteMetadataIdentifier&facet.limit=-1&facet.mincount=2&rows=0

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">6</int>
<lst name="params">
<str name="q">id:\/INSPIRE-f89f4772-05f5-11e1-b7de-52540004b857*</str>
<str name="facet.limit">-1</str>
<str name="facet.field">remoteMetadataIdentifier</str>
<str name="facet.mincount">2</str>
<str name="rows">0</str>
<str name="facet">true</str>
</lst>
</lst>
<result name="response" numFound="25730" start="0"/>
<lst name="facet_counts">
<lst name="facet_queries"/>
<lst name="facet_fields">
<lst name="remoteMetadataIdentifier"/>
</lst>
<lst name="facet_dates"/>
<lst name="facet_ranges"/>
</lst>
</response>

#29 Updated by Angelo Quaglia almost 4 years ago

  • Category set to Harvesting results

#30 Updated by Angelo Quaglia almost 4 years ago

  • Proactive set to Yes

#31 Updated by Angelo Quaglia almost 4 years ago

  • Status changed from Resolved to Closed

#32 Updated by Angelo Quaglia almost 4 years ago

  • Country set to UK - United Kingdom
