Discussion #2683

Feedback on v0.7 of the WCS-based download service TG

Added by Michael Lutz over 4 years ago. Updated over 4 years ago.

Status:Assigned
Priority:High
Assignee:James Passmore

Description

Please use this issue to provide feedback on the draft version of the WCS download service TG (v0.7) until 19 February.

You can provide your feedback either as tracked changes/comments in the Word document uploaded here, or as simple comments on this issue.

Based on the feedback received, we will then have further discussions in the group until 11 March.

History

#1 Updated by Mikko Visa over 4 years ago

Some notes about grids, especially in relation to the AC-MF INSPIRE theme and James's comment 13 (Have any other data themes defined different grids?). The WCS draft TG states:

"The grid is hierarchical, with resolutions of 1m, 10m, 100m, 1 000m, 10 000m and 100 000m."

This is not the case for grids in meteorology, for instance, where the resolution could be anything. The INSPIRE IR says so in section 13.3 (1):

"13.3. Theme-specific Requirements

(1) By way of derogation from the requirements of Section 2.2 of Annex II, gridded data related to the themes Atmospheric Conditions and Meteorological Geographical Features may be made available using any appropriate grid."

So in answer to your question: yes, the AC-MF data theme allows basically any kind of grid.

#2 Updated by Mikko Visa over 4 years ago

#3 Updated by James Passmore over 4 years ago

Thanks Mikko,

I see there is also a discussion on the INSPIRE Geographical grid systems on the INSPIRE thematic cluster:

https://themes.jrc.ec.europa.eu/discussion/view/10935/usability-of-the-zoned-geographic-grid-grid-etrs89-grs80

 

I think the conclusion for us, with respect to the WCS TG, is that Geographical grid systems are not in scope here. They are more in scope for OGC WMTS when serving, for example, aerial orthophotos.

#4 Updated by Peter Baumann over 4 years ago

Editorial comments:

- p 7: "Note that while..." - this sentence seems a leftover from another document, not completely applicable to the document on hand.

- p 12: "OGC 09-146r1 (also known as Coverage Implementation Schema, and commonly known as GMLCOV) tells us that: " - suggest to rephrase this so as to be in sync with OGC and ISO in future: "OGC 09-146r1 (Coverage Implementation Schema 1.0 or CIS 1.0, formerly known as GML 3.2.1 Application Schema Coverages or GMLCOV) tells us that: "

- p 12: "However, ..." is in italics indicating we are still in citation mode, which AFAICS is not the case

- p 14: References include OGC O&M, which does not apply here AFAICS

- p 15: (10) features is singular, phenomena plural - intended?

- p 16: (17) "range <coverage>" is confusing syntax - to a programmer it looks like a template. Suggestion: "range [of a coverage]"

- p 16: (18) register is talking about files, which is really low-level (what about databases?) - can this be made more bit-independent = high-level?

 

Technical discussion:

- p 12: as the comment on p 13 indicates, discussion should be reviewed and possibly adjusted wrt CIS 1.1.

- p 16 (12) geo grid systems: As is underlined by the corresponding discussion item, it is IMHO questionable whether this should be included here as (i) it is a completely orthogonal issue and (ii) concerns about practical feasibility of DGGS, for example, have been raised during the OGC RFC period.

- p 16 (13) GML coverage - this definition is misleading as it couples coverages to GML in the eye of the reader, which is not the case as coverages essentially are format independent. If you want to denominate what the definition body says, I suggest "CIS 1.0 coverage" as opposed to "CIS 1.1 coverage". Generally, as these constitute OGC's coverage model, it can also be referred to as "OGC coverage" (ignoring the fact that ISO is adopting the OGC standard as well). Further, giving two uncorrelated definitions of a coverage in (6) and (13) is confusing to the reader. I suggest talking about coverages only and pointing out that an abstract definition is given in ISO 19123 and a concrete, interoperable one in OGC 09-146rX. This is in line with ISO's path towards 19123-1 (=19123) and 19123-2 (=09-146).

 

(apologies, had to stop here for time constraints)

 

 

#5 Updated by James Passmore over 4 years ago

Thanks Peter

I am still considering the technical comments, so looking at the editorial comments first:

- p 7: "Note that while..." - this sentence seems a leftover from another document, not completely applicable to the document on hand.

Agreed. The whole section comes from another document, and I originally thought it might be culled entirely (as there is generic guidance in draft). To avoid confusion until we agree whether or not to keep the section, I have now cut that whole sentence.

- p 12: "OGC 09-146r1 (also known as Coverage Implementation Schema, and commonly known as GMLCOV) tells us that: " - suggest to rephrase this so as to be in sync with OGC and ISO in future: "OGC 09-146r1 (Coverage Implementation Schema 1.0 or CIS 1.0, formerly known as GML 3.2.1 Application Schema Coverages or GMLCOV) tells us that: "

I've changed the text to this

- p 12: "However, ..." is in italics indicating we are still in citation mode, which AFAICS is not the case

The text comes from the second paragraph of the Introduction of OGC 09-146r1.

- p 14: References include OGC O&M, which does not apply here AFAICS

Cut.

- p 15: (10) features is singular, phenomena plural - intended?

Good catch, I have changed to "features" and kept "abstraction of real world phenomena".

- p 16: (17) "range <coverage>" is confusing syntax - to a programmer it looks like a template. Suggestion: "range [of a coverage]"

I've changed this as suggested.

- p 16: (18) register is talking about files, which is really low-level (what about databases?) - can this be made more bit-independent = high-level?

I've changed to database as suggested.

You can access (READ-ONLY) my working copy of the document here:

https://dl.dropboxusercontent.com/u/71658964/Technical-Guidance-for-INSPIRE-Download-Services_WCS_Draft-latest.docx

 

#6 Updated by Mikko Visa over 4 years ago

A few more comments regarding chapter 7 / QoS.

Chapter 7.1 General requirements:

Responses can sometimes be very large, at least in theory, for example when no bbox is specified (not the usual case, but possible). I think we cannot require an initial response within 10 or so seconds: preparing terabyte- or petabyte-scale data is very unlikely to take less than that. Should the text state something like "for requests > 1GB" (just an example) the initial response time cannot be promised to be under X seconds? It is still valid and possible to query the whole data set, or at least a very big subset; unlikely, of course, but still.

Chapter 7.3.2 / TG Requirement 19:

We believe "A measurement shall take place at least once before launching the service in a production environment and monitored at regular intervals thereof to ensure that the compliance with the capacity requirement is still ensured" should not be a requirement but rather a recommendation. You could require that the service adheres to the response time requirements, but it should be up to the service provider how to fulfil them.

BR,

Mikko

#7 Updated by Ilkka Rinne over 4 years ago

Mikko Visa wrote:

As responses can be very large sometimes, at least in theory if not specifying a bbox or so (not the usual case but possible). I think we cannot require an initial response of 10 or so seconds. Preparing of tera/petabyte-magnitude data is very unlikely to go under that.

From a practical point of view I would like to disagree:

Firstly, if the service allows huge results to be delivered as responses, it should probably use some kind of streaming or chunked response, and thus be able to return the initial response (bytes) within 10 (or 30) seconds. This is also important for the functionality of the server, as the services are unlikely to be able to keep the entire tera/petabyte result in memory anyway while preparing the response.

Another related case is where the calculations required for preparing the result take a long time due to the huge volume of the processed data, even if the final result is only a limited-size subset. This can certainly happen for particular combinations of source datasets and query parameters, especially if a lot of internal processing is required. However, I would say that for relatively open-access services like the INSPIRE Download Services, such services would be an unintentional DoS attack waiting to happen. It would be better if uneducated users could not bring the services down just by making a few requests.

So I would say that if an INSPIRE Download Service is not able to return the initial bytes in under 10 seconds, it's a problem. This is not to say that some other WCS could not take longer, if its clients (and the server technology) can be expected to handle it.
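To illustrate the streaming argument above, here is a minimal Python sketch (not taken from any WCS server) in which a large uncompressed raster is produced block by block, so the first bytes can be sent without holding the full result in memory. The function name stream_rows, the 256-row block size and the zero-filled placeholder pixels are all assumptions made for illustration.

```python
from typing import Iterator


def stream_rows(width: int, height: int, chunk_rows: int = 256) -> Iterator[bytes]:
    """Yield an 8-bit, single-band raster in blocks of rows.

    Only one block is held in memory at a time, so the first block can be
    written to the HTTP response as soon as it has been produced.
    """
    for start in range(0, height, chunk_rows):
        rows = min(chunk_rows, height - start)
        # Placeholder pixel data (zeros); a real server would read these
        # rows from its storage backend here.
        yield bytes(width * rows)


if __name__ == "__main__":
    total = 0
    for block in stream_rows(width=10_000, height=10_000):
        total += len(block)  # stands in for writing the block to the client
    print(f"streamed {total} bytes in blocks of at most {10_000 * 256} bytes")
```

With this shape, peak memory is bounded by the block size rather than by the size of the requested coverage.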

#8 Updated by Peter Baumann over 4 years ago

This is why I discussed whether such a general approach is feasible: just stating a response time, but not the task associated. Among the impact factors (several listed above already) are: download volume, which depends on the data format (compression lossless? lossy?) and its encoding effort; CRS reprojection requested? How many bands selected? Item in cache already? Metadata requested, too? Regular or irregular grid? (The latter has more data associated even when containing the same number of pixels: a ReferenceableGridCoverage, for example, can double its volume this way, and certainly more.)

Measuring the first byte arriving is technically not that easy, in particular for naive service providers without deep system knowledge.

I'd rather suggest that INSPIRE provides (smallish) sample coverages which providers can import together with the concrete requests to be benchmarked, and then comparable statements can be made about download speed (ignoring potential network latencies).

#9 Updated by Ilkka Rinne over 4 years ago

Peter Baumann wrote:

Measuring the first byte arriving is technically not that easy, in particular for naive service providers without deep system knowledge. I'd rather suggest that INSPIRE provides (smallish) sample coverages which providers can import together with the concrete requests to be benchmarked, and then comparable statements can be made about download speed (ignoring potential network latencies).

Sure, it's not easy to measure the first byte, but this is how the Performance QoS indicator testing procedures for all INSPIRE Download Service types have been defined AFAIK. If you look at the table of allowed download times for the complete result of a single request at the end of 7.2.2, the download of a 1GB dataset is allowed to take 34 minutes. If the services are allowed to take that long before they return any data to the users, they are completely useless.

I also understand your sample data sets & predefined test queries approach: indeed it would make comparing (benchmarking) the server products and the online services a lot easier, if that was the goal. However, if I have understood correctly, the idea of the QoS criteria of the INSPIRE Network Services is not to allow benchmarking between services, but to define a minimum level of experienced service quality from the users' point of view. In the extreme, this view could result in data providers having to limit the selection of INSPIRE data sets and/or query functionality to stay compliant with the INSPIRE Performance limits. I'm not entirely sure this would be a bad thing, as not all services have to be published as INSPIRE compliant anyway.
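The 34-minute figure for 1GB mentioned above can be roughly reproduced from the two limits in the regulation (30 seconds to the initial response, then a sustained 0.5 MB/s). The sketch below is only an illustration: it assumes 1 GB = 1024 MB and assumes the initial response time is added on top of the transfer time, neither of which is confirmed by the TG table itself. The function name is hypothetical.

```python
def max_download_seconds(size_mb: float,
                         initial_response_s: float = 30.0,
                         sustained_mb_per_s: float = 0.5) -> float:
    """Upper bound on total download time implied by the two QoS limits.

    Assumption: the initial response delay and the sustained transfer
    simply add up.
    """
    return initial_response_s + size_mb / sustained_mb_per_s


if __name__ == "__main__":
    seconds = max_download_seconds(1024)  # assumption: 1 GB = 1024 MB
    print(f"1 GB may take up to {seconds / 60:.1f} minutes")
```

Under these assumptions the bound comes out at about 34.6 minutes, which is in line with the 34 minutes quoted from the table.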

 

#10 Updated by Jukka Rahkonen over 4 years ago

Here is a quick analysis about large GeoTIFF responses and possible streaming from a MapServer developer. Some issues are specific to MapServer, like the creation of an in-memory image first, which is memory bound and makes it impossible to serve very large responses. Creation of a temporary in-memory image also means that a 10 second initial response time is impossible with the current WCS implementation of MapServer. From my own experience I am pretty sure that GeoServer also creates a local in-memory or on-disk file first before it starts to send the WCS response, so it can't fulfil the requirement of a 10 second initial response either.

The other comment about compressed GeoTIFFs is generic: even if it were possible to stream lossless LZW- or deflate-compressed images, clients would have to store the whole stream and wait until they get the final bits, because those contain essential information about offsets and sizes. Therefore GeoTIFFs should be streamed uncompressed. Compared to LZW compression, this would mean roughly 40% more bandwidth for aerial images, which is not such a huge difference. I do not know how well other formats like NetCDF are suited to streaming.

I do not mean that the Quality of Service should be written to match with the capabilities of MapServer and GeoServer. There are other WCS servers which are better with handling large responses. With MapServer and GeoServer the practical size limit of GeoTIFF output seems to be about 1-5 GB.

The developer's analysis:

There are several issues:

- MapServer currently composes an in-memory image of the result, which is limited to MAXSIZE x MAXSIZE pixels (defaults to 2048 if unspecified). If you want to generate really big images, then the architecture of the WCS renderer should be significantly changed. A possibility would perhaps be to use GDAL VRT to have an on-the-fly result raster that is not memory bound.

- If the above is not a problem or is solved, you also need an output format with streaming capabilities. For GeoTIFF, GDAL 2.0 supports creating *uncompressed* GeoTIFF in streaming mode (see http://gdal.org/frmt_gtiff.html). It could likely be extended to generate compressed GeoTIFF in streaming mode, but I didn't go to that point, as such GeoTIFFs couldn't be read in a streaming way (since the tables with tile/strip offsets and sizes must be written at the end of the file, once all offsets and sizes are known) and the objective was to be able to pipe out / pipe in the files.

But even if the output format is streamable, MapServer currently generates either an in-memory or an on-disk temporary output file before sending the bytes. This would require changes in it too.
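The point about offsets and sizes can be shown with a small Python sketch. It is a deliberate simplification and not a real TIFF writer: it only computes where each strip of pixel data would start, assuming an 8-byte header followed directly by the strips. The function names are hypothetical. For uncompressed data every offset is known before any pixel is produced; for deflate-compressed data the sizes are only known after each strip has been compressed.

```python
import zlib
from typing import List


def strip_offsets_uncompressed(width: int, height: int, rows_per_strip: int,
                               bytes_per_pixel: int = 1,
                               header_size: int = 8) -> List[int]:
    """Byte offset of every strip, computed without touching pixel data."""
    offsets = []
    offset = header_size
    for start in range(0, height, rows_per_strip):
        rows = min(rows_per_strip, height - start)
        offsets.append(offset)
        offset += width * rows * bytes_per_pixel
    return offsets


def strip_sizes_compressed(strips: List[bytes]) -> List[int]:
    """Compressed size of every strip; requires the actual pixel data."""
    return [len(zlib.compress(strip)) for strip in strips]


if __name__ == "__main__":
    print(strip_offsets_uncompressed(width=100, height=250, rows_per_strip=100))
    flat = bytes(10_000)                                   # uniform area
    noisy = bytes((i * 37) % 251 for i in range(10_000))   # varied content
    print(strip_sizes_compressed([flat, noisy]))
```

Because the compressed sizes depend on the content of each strip, a writer cannot emit the offset table up front, which is why the table ends up at the end of the file in the streaming case described above.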

 

#11 Updated by James Passmore over 4 years ago

Comment on QoS

The QoS section in the download service WCS TG document is taken from the existing download service (WFS +Atom) TG, and as mentioned in a note is unchanged (other than comments) because it was originally thought that the QoS section was generic.

It should be noted that this TG is not legally binding, it is an attempt to quantify how the legal requirements can be mapped to required operations and in the QoS section how the performance MIGHT be measured; so the QoS section as it stands is an interpretation, and it may be that an alternate interpretation is possible for WCS (and other download services).

The INSPIRE Network Services regulation has the following requirement for the Get Spatial Data Set operation and for the Get Spatial Object operation:

For the Get Spatial Data Set operation and for the Get Spatial Object operation, and for a query consisting exclusively of a bounding box, the response time for sending the initial response shall be maximum 30 seconds in normal situation then, and still in normal situation, the download service shall maintain a sustained response greater than 0,5 Megabytes per second or greater than 500 Spatial Objects per second.

As I read it, there is no requirement for a service to (for example) provide a coverage of any size in any time period, just for it to start to respond within 30 seconds and then maintain a steady response in the order of 0.5 MB/sec.

I'm not sure it is incumbent on the service provider to work out how to actually measure this level of service in a live service; that's something for the Commission testers. However, as per Peter's suggestion, a provider might be able to work out whether their service is in theory conformant by uploading a test coverage of some size and running some INSPIRE test suite against the service.
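As a rough illustration of that reading, the check below takes three measured values for one download and compares them with the two limits. It is a sketch under stated assumptions: "sustained response" is interpreted as the average rate after the first byte, a megabyte is taken as 1 000 000 bytes, and the Transfer class and function name are invented for this example.

```python
from dataclasses import dataclass

MB = 1_000_000  # assumption: decimal megabytes


@dataclass
class Transfer:
    first_byte_s: float  # seconds from sending the request to the first byte
    total_s: float       # seconds from sending the request to the last byte
    total_bytes: int     # size of the complete response


def meets_download_qos(t: Transfer,
                       max_initial_s: float = 30.0,
                       min_mb_per_s: float = 0.5) -> bool:
    """True if the transfer starts in time and then stays above the rate."""
    if t.first_byte_s > max_initial_s:
        return False
    streaming_s = t.total_s - t.first_byte_s
    if streaming_s <= 0:
        return True  # the whole response arrived with the first byte
    return t.total_bytes / MB / streaming_s > min_mb_per_s


if __name__ == "__main__":
    print(meets_download_qos(Transfer(12.0, 112.0, 100 * MB)))  # 1.0 MB/s
    print(meets_download_qos(Transfer(5.0, 405.0, 100 * MB)))   # 0.25 MB/s
```

Note that nothing in this check depends on the size of the coverage, which matches the observation that the regulation does not tie the limits to any particular volume.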

Interestingly, I see that for a GetMap request the INSPIRE Network Services regulation has:

For a 470 Kilobytes image (e.g. 800 × 600 pixels with a colour depth of 8 bits), the response time for sending the initial response to a Get Map Request to a view service shall be maximum 5 seconds in normal situation.

So it's a bit odd that there isn't something similar for a Download service.

#12 Updated by Peter Baumann over 4 years ago

IMHO the MapServer example underlines the difficulty in defining response times without specifying the task. Whatever tool, there will always be massive differences between downloading a MB or a GB.

I like the idea of following WMS: specify one result image (ideally same size as WMS). If nothing else is said vendors will by themselves omit CRS changes, difficult formats, even subsetting, etc., so that it will boil down to a simple delivery. At this simple bottom line, practically acceptable thresholds could be set (such as 1-digit seconds - who wants to wait 30 seconds today... ;-) ).

#13 Updated by Ilkka Rinne over 4 years ago

Peter Baumann wrote:

IMHO the MapServer example underlines the difficulty in defining response times without specifying the task. Whatever tool, there will always be massive differences between downloading a MB or a GB.

Yes, it would seem a bit stupid to me to require something that the currently widely used software products cannot reach in practice. However, the QoS criteria for INSPIRE Download Service operations are currently set in the legal regulations (No 1088/2010 amending 976/2009, http://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:02009R0976-20101228&from=EN), Annex I:

1 Performance

- -

For the Get Download Service Metadata operation, the response time for
sending the initial response shall be maximum 10 seconds in normal
situation.

For the Get Spatial Data Set operation and for the Get Spatial Object
operation, and for a query consisting exclusively of a bounding box, the
response time for sending the initial response shall be maximum 30 seconds
in normal situation then, and still in normal situation, the download service
shall maintain a sustained response greater than 0,5 Megabytes per second or
greater than 500 Spatial Objects per second.

For the Describe Spatial Data Set operation and for the Describe Spatial
Object Type operation, the response time for sending the initial response
shall be maximum 10 seconds in normal situation then, and still in normal
situation, the download service shall maintain a sustained response greater
than 0,5 Megabytes per second or greater than 500 descriptions of Spatial
Objects per second.

If we want to declare something else for WCS, we probably have to change the INSPIRE regulation too, not just the Technical Guidance.

Peter Baumann wrote:

I like the idea of following WMS: specify one result image (ideally same size as WMS). If nothing else is said vendors will by themselves omit CRS changes, difficult formats, even subsetting, etc., so that it will boil down to a simple delivery. At this simple bottom line, practically acceptable thresholds could be set (such as 1-digit seconds - who wants to wait 30 seconds today... ;-) ).

"The 470 kilobyte image in 5 seconds" requirement for Get Map (not just WMS, but any INSPIRE View Service) is also explicitly mentioned in the legal text (which I find all too technical for a legal text, but that's another issue). However, I think this kind of technical requirement is OK, because it does not require any predefined data sets to be made available just for evaluating the QoS: any INSPIRE View Service can be tested against this requirement by creating enough Get Map requests, discarding the ones returning considerably smaller or bigger images than 470 kilobytes, and comparing the response times to the required 5s maximum. When doing this continuously, at least 90% of the qualifying requests shall take less than 5s (this is the definition of "normal situation", excluding the 10% of time allowed for degraded QoS due to traffic peaks).

If we could come up with a similar technical requirement for the WCS operations, without requiring data providers or vendors to set up something specifically for this testing, it would be great.
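The View Service test procedure described above could be sketched as follows. This is only an illustration: the 20% size tolerance for "considerably smaller or bigger" is an assumption, the 5 second limit is taken from the regulation text quoted in #11 and is passed as a parameter, and the function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def share_within_limit(samples: Iterable[Tuple[int, float]],
                       target_bytes: int = 470_000,
                       tolerance: float = 0.2,
                       limit_s: float = 5.0) -> Optional[float]:
    """Share of qualifying responses that arrived within the time limit.

    samples: (response size in bytes, response time in seconds) pairs.
    Responses whose size differs from target_bytes by more than the
    tolerance are discarded. Returns None if nothing qualifies.
    """
    qualifying = [seconds for size, seconds in samples
                  if abs(size - target_bytes) <= tolerance * target_bytes]
    if not qualifying:
        return None
    within = sum(1 for seconds in qualifying if seconds <= limit_s)
    return within / len(qualifying)


def in_normal_situation(samples: Iterable[Tuple[int, float]],
                        required_share: float = 0.9) -> bool:
    """True if at least the required share of qualifying requests passed."""
    share = share_within_limit(samples)
    return share is not None and share >= required_share


if __name__ == "__main__":
    measured = [(470_000, 1.2)] * 9 + [(470_000, 8.0), (50_000, 0.1)]
    print(share_within_limit(measured), in_normal_situation(measured))
```

A comparable check for WCS would need an agreed reference response size, which is exactly the piece the regulation does not currently provide for download services.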

#14 Updated by Peter Baumann over 4 years ago

Seems like technically we are not far apart. Maybe the WCS case could constitute an opportunity to "escalate up" the issue, so that it is on record for any potential revision/amendment.

#15 Updated by James Passmore over 4 years ago

For practical advice on how to measure the initial response (albeit approximate), could we not suggest using an HTTP HEAD request?

So using a WFS download service (test) example for 500 features we might use:

curl --silent --head --output /dev/null --write-out "%{time_starttransfer}\n" "http://194.66.252.155/m4eu/services?service=WFS&request=GetFeature&version=2.0.0&typename=ge:MappedFeature&MAXFEATURES=500&"

#16 Updated by Ilkka Rinne over 4 years ago

James Passmore wrote:

For practical advice on how to measure the initial response (albeit approximate), could we not suggest using an HTTP HEAD request? So using a WFS download service (test) example for 500 features we might use: curl --silent --write-out "%{time_starttransfer}" --request HEAD "http://194.66.252.155/m4eu/services?service=WFS&request=GetFeature&version=2.0.0&typename=ge:MappedFeature&MAXFEATURES=500&"

This would be too specific for TG text IMHO. We had discussions about using HTTP HEAD for verifying that the URLs actually resolve to a resource when drafting the abstract tests for the Network Services. The problem was that not all HTTP servers support HEAD requests. Even if the WCS servers (and their front-ends) do, the response time would probably have nothing to do with actual requests, which require data fetching & processing.

#17 Updated by James Passmore over 4 years ago

OK, I take on board that it might be too technical for the technical guidance, but on the other hand it's probably better than just having a statement about there being a mandated response time, and then providing no help at all on how a service provider might test it.

For someone who is testing their own service (rather than someone doing a remote test), I think they should be able to configure their service to accept a HEAD request, even if they block that ability at some later stage when it's in production.

My understanding of the HEAD request is that its response is the same as a GET response, but without the actual content, so you can get the content-length as a response (depending on how the data will be sent); this implies that the server must have done the processing, just not sent the data. So as an approximation of how to measure the requirement, this is probably a help.

Let's say you do the test and the response time is way over 30 seconds, then you know you have a problem.
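For a provider who prefers to time a real GET rather than rely on HEAD support, the same approximate measurement can be made with the Python standard library. This is a sketch, not a proposed test procedure: the function name is hypothetical, the measured time includes connection set-up, and it assumes the URL contains no user credentials.

```python
import http.client
import time
from urllib.parse import urlsplit


def time_to_first_byte(url: str, timeout: float = 60.0) -> float:
    """Seconds from opening the connection until the first body byte of a GET."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    start = time.perf_counter()
    try:
        conn.request("GET", path)       # connects, then sends the request
        response = conn.getresponse()   # returns once the headers arrived
        response.read(1)                # wait for the first body byte
        return time.perf_counter() - start
    finally:
        conn.close()                    # the rest of the body is not downloaded
```

Calling time_to_first_byte with a GetCoverage or GetFeature URL and comparing the result with 30 seconds gives the same kind of quick indication as the curl command, while exercising the full request processing on the server.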

But maybe it isn't the purpose of the technical guidance to mention how to measure the requirements at all. Should we cut most of the text, just leave the requirements, and perhaps provide some tips on how to achieve a performant service?

There is, I think, a second issue: whether we think that the regulations need to be changed or better quantified.

#18 Updated by James Passmore over 4 years ago

[double posting due to some rails issue]

#19 Updated by Jukka Rahkonen over 4 years ago

More thoughts about streaming output with existing WCS implementations, this time from GeoServer developers

https://sourceforge.net/p/geoserver/mailman/geoserver-devel/thread/8afc75e8125e4c81a4f4bc5891a3dec3%40C119S212VM022.msvyvi.vaha.local/#msg34839687

Summary: The current implementation does not support streaming, but it should be doable for some output formats: uncompressed TIFF and GML, and maybe some NetCDF variants (nobody thought about ASCII grid, which to me feels simple and streamable).

Completing a bit what Even and Jody already said: GeoServer WCS can already deal with images larger than memory, as we try not to allocate them in a single surface. However, we are writing the output to a local file, because most formats cannot be written "from beginning to end" but require one to go back and forth.

One example is tiled and compressed TIFF: the tile directory needs to contain the offsets of all the tiles, but those are known only when all of them are written, since we cannot predict how big they will be (each will compress by a different ratio).

With some modifications we should be able to write uncompressed GeoTIFF directly to the output stream, since its contents are fully predictable. We'll just have to make a special case for it so that we don't rely on the generic JAI ImageIO architecture, which instead assumes one might have to go back and forth in the output file.

About NetCDF: Jody mentions DAP, but I believe you are interested in streaming from the INSPIRE point of view, where DAP would not be an option... or would it? In any case, it would be a different protocol. I did search a bit on the internet and could find some experiences of people implementing a WCS fronting a DAP server, but not a merger between the two protocols. Looking at the UCAR NetCDF library we are using, it demands a file as a target, which makes me assume the file structure is not suitable for direct streaming.

Looks like the NetCDF developers have been experimenting with a streamable format, called ncstream, but I'm unclear if this got any traction:
https://www.unidata.ucar.edu/software/thredds/v4.3/netcdf-java/stream/NcStream.html

As an aside, a format that is ugly and big, but that I believe we can stream directly today, is the GML coverage format. Mind, pure GML, not GML/JP2, which we do not support today (although it would be an interesting development, but looking at its spec, it does not seem like they are offering a tile-able approach that would make streaming possible).

#20 Updated by Peter Baumann over 4 years ago

Yes, most formats are not made for streaming, unfortunately. CSV, XML, and a few other exceptions exist, but they are not always considered convenient.

Note that CIS 1.1 will allow partitioned coverages - this allows timeslicing, but also mosaics, including all mixings. This could help the server to stream out smaller packages one by one. However, there are still some issues left, such as marshalling asynchronous requests in the OWS world. I would have loved to simply use WPS as a carrier, but a long discussion in OGC (with WPS folks involved) ultimately showed this to be infeasible. Therefore, the WCS group is likely to start on a spec for async requests in the near future. An opportunity to lobby for INSPIRE requirements.

#21 Updated by Michael Lutz over 4 years ago

From Pete Trevelyan:

Hi Michael,

I am sorry I am a day late, but I have added a few comments. I have attached the document, but do not worry as I have only added a few.

Best regards,

Pete Trevelyan

 
