Support #3439

FR: Harvesting of french geocatalogue

Added by Angelo Quaglia over 1 year ago. Updated about 1 month ago.

Status: Closed
Start date: 11 Dec 2018
Priority: Normal
Due date:
Assignee: Daniele Francioli
% Done: 0%
Category: Harvesting process
Target version: -
Submitting Organisation: FR
Knowledge-Base relevant?: No
Proactive: No
Country: FR - France
Originating UI:
Keyword #1:
Keyword #2:
Keyword #3:

Description

From: Vilmus Thierry [T.Vilmus@brgm.fr]
Sent: 11 December 2018 12:28
To: QUAGLIA Angelo (JRC-ISPRA-EXT)
Subject: Harvesting of french geocatalogue
 
 

Dear Angelo,

It has been a long time since we were last in touch!

I have noticed that the FR Geocatalogue has not been harvested since 29 October. Is there a problem?

Could you please run a harvest so that we can see whether the improvements we have made to downloadable and viewable data are working?

Many thanks in advance!

Thierry

History

#1 Updated by Angelo Quaglia over 1 year ago

Dear Thierry,

it has been a long time indeed.

Yes, there is a big problem.

The last time the INSPIRE Geoportal was able to run a harvest was in mid-November:

http://inspire-geoportal.ec.europa.eu/resources/errors/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20181113-035117/services/1/PullResults/

 

However, the results were discarded because only about half of the records could actually be retrieved:

Resources available for discovery: 85638
Expected Resource Count: 85638
Actual Resource Count: 48220

The failed batches returned a GetRecordsResponse envelope with no records inside, for example:

http://inspire-geoportal.ec.europa.eu/resources/errors/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20181113-035117/services/1/PullResults/4701-4750/resourceReport/

 

However, that is not the main issue.

The other, bigger problem is that the time required to process the results (not just the time to fetch the metadata, which takes only half an hour or so) has been increasing dramatically.

I have been investigating the issue, and the reason for the huge slowdown is the very high number of false matches between datasets on one side and layers and download service datasets on the other.

The root cause is empty Unique Resource Identifiers declared in dataset metadata, for example:

<gmd:identifier xlink:type="simple">
    <gmd:MD_Identifier>
        <gmd:code xmlns:gco="http://www.isotc211.org/2005/gco" gco:nilReason="missing">
            <gco:CharacterString/>
        </gmd:code>
    </gmd:MD_Identifier>
</gmd:identifier>

All these false matches inflate all the files and make the processing very costly.
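
A minimal sketch of how such empty identifiers could be detected in a metadata record before any matching is attempted. This is not the INSPIRE Geoportal code: the helper name `count_empty_identifier_codes` is hypothetical, and only the Python standard library is assumed.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as used in ISO 19139 metadata.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def count_empty_identifier_codes(xml_text):
    """Count MD_Identifier codes whose gco:CharacterString is missing or blank."""
    root = ET.fromstring(xml_text.strip())
    empty = 0
    for code in root.findall(".//gmd:MD_Identifier/gmd:code", NS):
        value = code.find("gco:CharacterString", NS)
        if value is None or not (value.text or "").strip():
            empty += 1
    return empty


# The fragment quoted in this ticket, with the namespace declarations
# it needs in order to be parsed on its own.
SAMPLE = """
<gmd:identifier xmlns:gmd="http://www.isotc211.org/2005/gmd"
                xmlns:xlink="http://www.w3.org/1999/xlink"
                xlink:type="simple">
  <gmd:MD_Identifier>
    <gmd:code xmlns:gco="http://www.isotc211.org/2005/gco" gco:nilReason="missing">
      <gco:CharacterString/>
    </gmd:code>
  </gmd:MD_Identifier>
</gmd:identifier>
"""

print(count_empty_identifier_codes(SAMPLE))
```

A record for which this count is non-zero carries at least one identifier that cannot serve as a matching key.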

For example, for this dataset the INSPIRE Geoportal found 45 layers and 1695 download service datasets:

http://inspire-geoportal.ec.europa.eu/resources/errors/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20181203-125029/services/1/PullResults/22401-22450/datasets/20/

The identifier is declared as an empty code:

Unique Resource Identifier
Code:

 

I have now introduced a workaround in the INSPIRE Geoportal code, but it will take some time to publish the fix to production.

Therefore, yesterday I started a fresh harvest and let it run alone on the server, with no other harvest running.

I am unable to estimate the time it will take to finish.

If it completes successfully, the harvesting report will be accessible here:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20181210-121422/services/1/PullResults/

I will keep you posted.

Best regards,

Angelo

#2 Updated by Angelo Quaglia over 1 year ago

  • Status changed from Assigned to Feedback

Dear Thierry,

I do not have good news.

As explained in my previous comment, many French dataset and series metadata contain empty identifiers like the following one:

<gmd:identifier xlink:type="simple">
    <gmd:MD_Identifier>
        <gmd:code xmlns:gco="http://www.isotc211.org/2005/gco" gco:nilReason="missing">
            <gco:CharacterString/>
        </gmd:code>
    </gmd:MD_Identifier>
</gmd:identifier>

This causes a very high number of false positives when linking datasets with download services. For example:

Point Of Contact: Région Guyane – Cellule SIG
E-mail: guyane-sig@cr-guyane.fr

http://inspire-geoportal.ec.europa.eu/resources/errors/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20181203-125029/services/1/PullResults/21151-21200/series/18/

The INSPIRE Geoportal finds 1695 datasets offered by Download Services.

I have modified the code to exclude those abnormal identifiers from being matched, and I will deploy the updated code at the first opportunity.
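
The effect of empty identifiers on linking, and of excluding them, can be illustrated with a small sketch. It is an assumption about the general shape of such a matching step, not the actual Geoportal implementation; `link_by_identifier` and the sample names are hypothetical.

```python
from collections import defaultdict


def link_by_identifier(datasets, service_entries, skip_empty=True):
    """Map each dataset name to the service entries that declare the same identifier.

    Both arguments are lists of (name, identifier) pairs.
    """
    index = defaultdict(list)
    for entry_name, identifier in service_entries:
        key = (identifier or "").strip()
        if skip_empty and not key:
            continue  # an empty identifier is not a usable key
        index[key].append(entry_name)

    links = {}
    for dataset_name, identifier in datasets:
        key = (identifier or "").strip()
        if skip_empty and not key:
            links[dataset_name] = []
        else:
            links[dataset_name] = list(index.get(key, []))
    return links


datasets = [("ds-a", "id-1"), ("ds-b", "")]
services = [("svc-x", "id-1"), ("svc-y", ""), ("svc-z", "")]

# Without the exclusion, the empty string behaves like a shared identifier.
print(link_by_identifier(datasets, services, skip_empty=False))
# With the exclusion, only genuine identifiers are matched.
print(link_by_identifier(datasets, services, skip_empty=True))
```

With thousands of records sharing the empty identifier, every such dataset would be linked to every such service entry, which is consistent with the 1695 matches reported above.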

Best regards,

Angelo

#3 Updated by Angelo Quaglia over 1 year ago

Dear Thierry,

ignoring the corrupt Spatial Data Set Unique Resource Identifiers in the metadata, i.e.

<gmd:identifier xlink:type="simple">
    <gmd:MD_Identifier>
        <gmd:code xmlns:gco="http://www.isotc211.org/2005/gco" gco:nilReason="missing">
            <gco:CharacterString/>
        </gmd:code>
    </gmd:MD_Identifier>
</gmd:identifier>

did the trick.

The latest harvesting is available:

http://inspire-geoportal.ec.europa.eu/resources/INSPIRE-5145fa60-0067-11e5-9ea6-52540004b857_20190111-194111/services/1/PullResults/

Best regards,

Angelo

#4 Updated by Thierry Vilmus 9 months ago

Dear Angelo,

I've tried many things on my end to solve the geoportal harvesting problem, but none seems to work...

According to Tomas, we have too many metadata records?

I'd like to try another endpoint to send at least the priority datasets and the national-scope datasets. The harvesting console says that I can manage discovery endpoints, but I can't find where to change the URL of the CSW service. Is there somewhere I can do that, or do you have to do it on your side?

Thanks in advance,

Kind regards,

Thierry

#5 Updated by Angelo Quaglia 9 months ago

  • Knowledge-Base relevant? set to No

Dear Thierry,

this is to let you know that I have accepted a new career opportunity in Brussels at the European Commission, DG-GROW.

Your question will be addressed by the other members of my old team.

Best regards,

Angelo

#6 Updated by Daniele Francioli 9 months ago

Dear Thierry,

Currently, if you want to change the endpoint, we have to do it. If you provide us with the new endpoint, we will change it for you.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#7 Updated by Daniele Francioli 9 months ago

  • Assignee changed from Angelo Quaglia to Daniele Francioli

#8 Updated by Thierry Vilmus 9 months ago

Dear Daniele,

 

As it seems we have too many metadata records, is it possible to keep the same endpoint but filter on a certain keyword, 'directive' for example? Something like:

http://www.geocatalogue.fr/api-public/inspire/servicesRest?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetRecords&TYPENAMES=gmd:MD_Metadata&outputSchema=http://www.isotc211.org/2005/gmd&startPosition=1&maxRecords=20&resultType=results&elementSetName=full&constraintlanguage=CQL_TEXT&constraint=subject%20LIKE%20%27*directive*%27

Currently, we have 933 records that match this criterion.
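
A sketch, using only the Python standard library, of how a request URL like the one above can be assembled so that the CQL constraint is percent-encoded consistently. `build_getrecords_url` is a hypothetical helper; the parameter names and values are copied from the URL above.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.geocatalogue.fr/api-public/inspire/servicesRest"


def build_getrecords_url(keyword, start_position=1, max_records=20):
    """Build a CSW 2.0.2 GetRecords GET URL filtering on the 'subject' queryable."""
    params = {
        "SERVICE": "CSW",
        "VERSION": "2.0.2",
        "REQUEST": "GetRecords",
        "TYPENAMES": "gmd:MD_Metadata",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "startPosition": start_position,
        "maxRecords": max_records,
        "resultType": "results",
        "elementSetName": "full",
        "constraintlanguage": "CQL_TEXT",
        "constraint": "subject LIKE '*%s*'" % keyword,
    }
    # quote_via=quote encodes spaces as %20 (as in the URL above) rather than '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(build_getrecords_url("directive"))
```

Unlike the hand-written URL, this also encodes `*` as `%2A` and `:` as `%3A`; both forms decode to the same parameter values.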

Best regards,

Thierry

 

#9 Updated by Daniele Francioli 9 months ago

Dear Thierry,

It is possible to filter, but through an OGC filter (XML-based Filter Encoding - https://www.opengeospatial.org/standards/filter).

Can you please implement the filter and provide us with the XML fragment?

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#10 Updated by Thierry Vilmus 9 months ago

Dear Daniele,

 

Here is the POST request that I use:

Host: http://www.geocatalogue.fr

Path: /api-public/inspire/servicesRest

Parameters:

content-type: application/xml

body:

<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:ogc="http://www.opengis.net/ogc" xmlns:gmd="http://www.isotc211.org/2005/gmd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0"
                xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/"
                service="CSW" version="2.0.2" maxRecords="10" startPosition="1" resultType="results" outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml">

  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
      <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
        <ogc:PropertyIsLike wildCard="*" singleChar="#" escapeChar="!">
            <ogc:PropertyName>subject</ogc:PropertyName>
            <ogc:Literal>*directive*</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>

 

Maybe you just need the filter fragment:

      <ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
        <ogc:PropertyIsLike wildCard="*" singleChar="#" escapeChar="!">
            <ogc:PropertyName>subject</ogc:PropertyName>
            <ogc:Literal>*directive*</ogc:Literal>
        </ogc:PropertyIsLike>
      </ogc:Filter>
 

Please tell me if this is the expected syntax.
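
A minimal sketch of how the filter fragment could be embedded in a GetRecords body and prepared as a POST request with the Python standard library. `build_request` is a hypothetical helper, the template is a shortened version of the body above, and the request object is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

CSW_URL = "http://www.geocatalogue.fr/api-public/inspire/servicesRest"

FILTER_FRAGMENT = """<ogc:Filter xmlns:ogc="http://www.opengis.net/ogc">
  <ogc:PropertyIsLike wildCard="*" singleChar="#" escapeChar="!">
    <ogc:PropertyName>subject</ogc:PropertyName>
    <ogc:Literal>*directive*</ogc:Literal>
  </ogc:PropertyIsLike>
</ogc:Filter>"""

BODY_TEMPLATE = """<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
    service="CSW" version="2.0.2" resultType="results"
    startPosition="{start}" maxRecords="{size}"
    outputSchema="http://www.isotc211.org/2005/gmd" outputFormat="application/xml">
  <csw:Query xmlns:gmd="http://www.isotc211.org/2005/gmd" typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <csw:Constraint version="1.1.0">
{filter_xml}
    </csw:Constraint>
  </csw:Query>
</csw:GetRecords>"""


def build_request(filter_xml, start=1, size=10):
    """Wrap an OGC filter fragment in a GetRecords body and return a POST request."""
    body = BODY_TEMPLATE.format(start=start, size=size, filter_xml=filter_xml)
    ET.fromstring(body)  # raises ParseError if the combined document is not well-formed
    return urllib.request.Request(
        CSW_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


request = build_request(FILTER_FRAGMENT)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the well-formedness check only catches XML syntax errors, not filters the server does not support.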

Best regards,

Thierry

#11 Updated by Daniele Francioli 9 months ago

Dear Thierry,

Thank you for the query. We are testing it in our staging environment.

We will come back to you once the test is completed.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#12 Updated by Daniele Francioli 8 months ago

Dear Thierry,

The harvest with the OGC filter has been completed in our staging environment.

The number of metadata records found is 936, but the numbers of downloadable and viewable datasets are both 0. Are you aware of this?

If you want, we can set up the OGC filter in production so that you can start the harvest and check the results.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

 

 

#13 Updated by Thierry Vilmus 8 months ago

Dear Daniele,

I'm happy that the filter is working well.

Yes, I'm aware of the problems with downloadable and viewable data. In fact, this is not the right filter: we should filter datasets that have 'INSPIRE priority data set' as thesaurus title, but I don't know how to do that... Do you know how to write a filter on a thesaurus title?

If we find such a filter, I would be very interested in having it in production before Friday, as we have a meeting that day with data producers and I'd like to show them some results on the EU geoportal!

Best regards,

Thierry

#14 Updated by Thierry Vilmus 8 months ago

I have another request:

We have too many metadata records to harvest in one go (about 90000).

So we should split our metadata across several endpoints.

Do you know the approximate number of metadata records that your system is able to harvest, so that I know how many endpoints to configure?

Last question: do you know why harvesting in one go does not work anymore? It used to work a few months ago...

Thanks in advance, best regards !

Thierry

 

#15 Updated by Daniele Francioli 8 months ago

Dear Thierry,

Regarding the OGC filter, we are sorry, but since we are short of resources we cannot help you with this.

Regarding splitting the endpoint, it is possible but please keep the number of endpoints reasonable (e.g. 10 - 15).

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#16 Updated by Thierry Vilmus 8 months ago

Dear Daniele,

I don't want to split my endpoints.

I am forced to do this because you are unable to harvest the single endpoint we have now, for an unknown reason.

So I need a CLEAR answer from your side: how many records are you able to harvest in one go?

Thanks for a quick and clear answer.

Regards,

Thierry

#17 Updated by Daniele Francioli 8 months ago

Dear Thierry,

Before splitting the endpoints, we can propose two solutions:

    - We have one harvest (started on 23 October) that has been completed in our internal staging environment. We can move this harvest to production so that you can check the results. Meanwhile, we can exceptionally start another harvest session in our internal staging environment.
    - If your system can support multiple requests in a short time frame (without banning us), we can fetch the entire catalogue content in one session and then process it locally.

Can you please tell us which scenario you prefer?
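
The second scenario amounts to paging through the whole catalogue with successive GetRecords requests. A minimal sketch of the paging arithmetic follows. The batch size of 50 is an assumption inferred from report folder names such as 4701-4750 earlier in this ticket, not a figure confirmed by the Geoportal team, and `batch_ranges` is a hypothetical helper.

```python
def batch_ranges(total_records, batch_size=50):
    """Yield (startPosition, maxRecords) pairs covering records 1..total_records."""
    start = 1
    while start <= total_records:
        # The last batch may be smaller than batch_size.
        yield start, min(batch_size, total_records - start + 1)
        start += batch_size


# 85638 is the record count reported for the November harvest in this ticket.
batches = list(batch_ranges(85638))
print(len(batches), batches[0], batches[-1])
```

Each pair maps directly onto the `startPosition` and `maxRecords` attributes of a GetRecords request, so the number of requests the catalogue must serve in one session can be estimated in advance.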

Thank you in advance,

Daniele on behalf of the JRC INSPIRE Support team

#18 Updated by Thierry Vilmus 8 months ago

Dear Daniele,

Many thanks for your proposals.

I would prefer the second scenario. Do I have to schedule a harvest in the console, or can you start it from your end?

If it is not working we'll try to provide you with 2 or 3 endpoints that I hope will be easier to harvest...

Best regards,

Thierry

#19 Updated by Daniele Francioli 8 months ago

Dear Thierry,

We have set up the configuration in our staging environment. We are running a test harvest.

We will come back to you once completed.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#20 Updated by Thierry Vilmus 7 months ago

Dear Daniele,

How did the test in your staging environment go?

Anyway, we are coming with a new endpoint, with only about a hundred records:

http://www.geocatalogue.fr/api-public/prior/servicesRest?SERVICE=CSW&VERSION=2.0.2

(inspire is replaced by prior in the URL)

This is the endpoint for our priority datasets.

Could you please harvest this endpoint in your production environment?

I have scheduled a new harvest in the geoportal harvesting console; please use the new endpoint.

I'm eager to see what happens!

Best regards,

Thierry

#21 Updated by Daniele Francioli 7 months ago

Dear Thierry,

Can you please check the URL (and service configuration) of the endpoint you posted?

We made a test GetCapabilities call and got the error shown below.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

----

curl -X GET -i 'http://www.geocatalogue.fr/api-public/prior/servicesRest?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities'

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ows:ExceptionReport version="1.0.0" language="en" xmlns="http://www.w3.org/2001/SMIL20/" xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:gml="http://www.opengis.net/gml" xmlns:gts="http://www.isotc211.org/2005/gts" xmlns:ows="http://www.opengis.net/ows"
  xmlns:ogc="http://www.opengis.net/ogc" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:gmx="http://www.isotc211.org/2005/gmx" xmlns:dct="http://purl.org/dc/terms/" xmlns:srv="http://www.isotc211.org/2005/srv" xmlns:ns15="http://www.w3.org/2001/SMIL20/Language"
  xmlns:ns14="http://www.opengis.net/gml/3.2" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:gmd="http://www.isotc211.org/2005/gmd">
  <ows:Exception exceptionCode="NoApplicableCode" locator="">
    <ows:ExceptionText>the capabilities document is not found</ows:ExceptionText>
  </ows:Exception>
</ows:ExceptionReport>
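
A short sketch of how a harvester-side check could recognise such a response as an OWS exception rather than a capabilities document. It is an illustration only: `parse_exception_report` is a hypothetical helper, and the sample is a trimmed copy of the response above.

```python
import xml.etree.ElementTree as ET

OWS_NS = "http://www.opengis.net/ows"


def parse_exception_report(xml_text):
    """Return (exceptionCode, text) pairs, or [] if the root is not ows:ExceptionReport."""
    root = ET.fromstring(xml_text)
    if root.tag != "{%s}ExceptionReport" % OWS_NS:
        return []
    errors = []
    for exc in root.findall("{%s}Exception" % OWS_NS):
        text = exc.findtext("{%s}ExceptionText" % OWS_NS, default="")
        errors.append((exc.get("exceptionCode"), text.strip()))
    return errors


SAMPLE = (
    '<ows:ExceptionReport version="1.0.0" language="en" '
    'xmlns:ows="http://www.opengis.net/ows">'
    '<ows:Exception exceptionCode="NoApplicableCode" locator="">'
    "<ows:ExceptionText>the capabilities document is not found</ows:ExceptionText>"
    "</ows:Exception>"
    "</ows:ExceptionReport>"
)

for code, message in parse_exception_report(SAMPLE):
    print(code, "-", message)
```

An empty result means the document is something other than an exception report; it does not by itself prove that the capabilities document is valid.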

#22 Updated by Thierry Vilmus 7 months ago

Dear Daniele,

You're right, the capabilities file has not been migrated to the production environment.

Sorry! I will keep you informed when it is OK.

Best regards,

Thierry

#23 Updated by Thierry Vilmus 7 months ago

Dear Daniele,

It should be OK now.

Best regards,

Thierry

 

#24 Updated by Daniele Francioli 7 months ago

Dear Thierry,

We launched the harvest in our staging environment. I will let you know once finished.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#25 Updated by Daniele Francioli 7 months ago

Dear Thierry,

The new endpoint successfully completed the harvest in our staging environment (you have a preview of the results in the following image).

We moved the new endpoint to production. I saw there is a scheduled request in the harvest console. The next harvest will run on the new endpoint.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#26 Updated by Thierry Vilmus 7 months ago

Dear Daniele,

 

Thank you very much !

The harvest in production ran OK with the new endpoint.

Our endpoint was missing service metadata; that's why we have so few downloadable and viewable datasets. We have fixed this. Can you please run a new harvest today? I'm not able to log in to the harvesting console to schedule it...

As a next step, we'll have to set up two more endpoints (insp1 and insp2) so that you can harvest all the French INSPIRE metadata.

How will it work in the harvesting console? Will I be able to schedule a harvest for a specific endpoint, or will all three endpoints be harvested at the same time?

 

Best regards,

Thierry

 

#27 Updated by Daniele Francioli 7 months ago

Dear Thierry,

We tested the current endpoint in our staging environment. You can find a screenshot of the results here.

Can you please provide the URL of the 2 additional endpoints?

Thank you,

Daniele on behalf of the JRC INSPIRE Support team

#28 Updated by Thierry Vilmus 7 months ago

Dear Daniele,

 

Thank you for the screenshot. So we still have some work to do on downloadable and viewable data!

Can you please run the harvest in the production environment?

The two additional endpoints will be:

http://www.geocatalogue.fr/api-public/insp1/servicesRest?

http://www.geocatalogue.fr/api-public/insp2/servicesRest?

but they are still empty at the moment.

Best regards,

Thierry

#29 Updated by Daniele Francioli 7 months ago

Dear Thierry,

We tested both of the endpoints ([1],[2]) and we get the error below. 

[1] http://www.geocatalogue.fr/api-public/insp1/servicesRest?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

[2] http://www.geocatalogue.fr/api-public/insp2/servicesRest?SERVICE=CSW&VERSION=2.0.2&REQUEST=GetCapabilities

<ows:ExceptionReport version="1.0.0" language="en">
<ows:Exception exceptionCode="OperationNotSupported" locator="request">
<ows:ExceptionText>
The operation GetCapabilities' is not supported by the service
</ows:ExceptionText>
</ows:Exception>
</ows:ExceptionReport>

Can you please check?

Thank you,

Daniele on behalf of the JRC INSPIRE Support team

#30 Updated by Thierry Vilmus 7 months ago

This is normal: the two endpoints are not ready yet.

I will keep you informed.

Best regards,

Thierry

#31 Updated by Daniele Francioli 7 months ago

Dear Thierry,

In order for us to create the two new endpoints in the system, their GetCapabilities responses must be available.

Can you please provide them?

Thank you,

Daniele on behalf of the JRC INSPIRE Support team

#32 Updated by Thierry Vilmus about 1 month ago

Dear JRC Team,

 

First, I would like to thank you for all the work done to handle our large INSPIRE endpoint and to reduce the time needed to harvest it. It's a great thing!

BUT I really don't understand what is happening with downloadable datasets. Their number is going down at every harvest. Last time, it was very low at 2226 downloadable datasets, and now you find only 728 downloadable datasets among our 40000+ datasets.

Same thing with our priority datasets: we have not changed anything, yet we now have 0 downloadable datasets according to your reports!

Please let us know what we should do to fix this.

Best regards,

Thierry Vilmus

#33 Updated by Daniele Francioli about 1 month ago

  • Status changed from Feedback to Assigned

Dear Thierry,

Thank you for your message. We are checking this issue and we will come back to you as soon as possible.

Best regards,

Daniele on behalf of the JRC INSPIRE Support team

#34 Updated by Daniele Francioli about 1 month ago

  • Status changed from Assigned to Closed

I am closing this issue, since the last message is a duplicate of issue #3855.
