
TRTH V2 REST API Download Extracted Files

After I have scheduled my report template and the extracted files are in place, how do I download them? The API tutorial uses ExtractedFiles('id')/$value. I tried it, but somehow it downloads a partial file. I am also concerned about this approach, as the actual file size is 500MB zipped: what is the point of pushing the data through an unzipping process again? The DataScope interface seems to provide a link for direct download, which makes me believe there should be a better way to achieve this in the REST API.
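For reference, what I tried looks roughly like this (a sketch; the file ID and token are placeholders):

import requests

url = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/ExtractedFiles('<file id>')/$value"
r = requests.get(url, headers={'Authorization': 'Token <my token>'})
# Requests transparently decompresses gzip-encoded bodies by default,
# so what ends up on disk is unzipped plain text
with open('extracted.csv', 'wb') as f:
    f.write(r.content)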

tick-history-rest-api

Accepted

Yes, the code I gave works just fine.

Further research shows that the zlib module used by python requests (v 2.8.1) has a compatibility issue: zlib starts emitting zero bytes of decompressed data at some point, which is what gave me the partial file. However, without decoding on the fly (i.e., taking zlib out of the process), the downloaded gzipped file is intact, and the gzip/zcat utilities work fine on it afterwards.
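In practice that means streaming the raw bytes to disk and telling the underlying urllib3 stream not to decode them. A minimal sketch (the file ID, token and output name are placeholders):

import requests

url = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/ExtractedFiles('<file id>')/$value"
r = requests.get(url, headers={'Authorization': 'Token <my token>'}, stream=True)
with open('data.csv.gz', 'wb') as f:
    # decode_content=False keeps zlib out of the loop, so the bytes written
    # are exactly the gzipped payload the server sent
    for chunk in r.raw.stream(decode_content=False):
        f.write(chunk)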


@jsi-data

Thank you very much for the clarification. I agree that this looks like a compatibility issue in the way the zlib module decompresses the extracted file.


@jsi-data, you do not mention what programming language you are using. I shall assume you are using Java. If that is not the case, then please tell us what you are using.

You can save the data file to hard disk without decompressing it, using the following code (input parameters are the output file path and name, and the file ID):

public void extractAndSaveFile(String filename, String extractedFileId) {
  String urlGet = urlHost + "/Extractions/ExtractedFiles('" + extractedFileId + "')/$value";
  try {
    URL myURL = new URL(urlGet);
    HttpURLConnection myURLConnection = (HttpURLConnection) myURL.openConnection();
    myURLConnection.setRequestProperty("Authorization", "Token " + sessionToken);
    myURLConnection.setRequestProperty("Accept-Charset", "UTF-8");
    myURLConnection.setRequestMethod("GET");
    // Copy the response body straight to disk. No decompression is applied,
    // so the file is saved exactly as delivered by the server (gzip).
    try (DataInputStream readerIS = new DataInputStream(myURLConnection.getInputStream())) {
      Files.copy(readerIS, Paths.get(filename));
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
}

This is an extract from our sample DSS2ImmediateScheduleTicksTRTHClient2, which is part of the Java code sample set available for download.



Sorry, I was using Python with direct HTTP requests.



Here is my http response listing the extracted files. I can see the file names for the Note file and the actual data file in zipped format. I remember that in the previous version of the SOAP API, I could just use wget on an FTP download link to retrieve the zipped data file directly. I don't understand why I have to go through the $value link and save the data, which, as far as I can tell, is unzipped plain text.

-----------DUMP OF RESPONSE-----------
{'Content-Length': '804', 'X-Request-Execution-Correlation-Id': '5ccb6ba9-8084-4e32-ab85-4e718e65a02b', 'X-App-Id': 'Custom.RestApi', 'Set-Cookie': 'DSSAPI-COOKIE=R3148268809; path=/', 'Expires': '-1', 'Server': 'Microsoft-IIS/7.5', 'X-App-Version': '11.0.487.64', 'Pragma': 'no-cache', 'Cache-Control': 'no-cache', 'Date': 'Fri, 12 May 2017 13:25:13 GMT', 'Content-Type': 'application/json; charset=utf-8'} 200 OK
{"@odata.context": "https://hosted.datascopeapi.reuters.com/RestApi/v1/$metadata#ExtractedFiles",
 "value": [
   {"ExtractedFileId": "VjF8MHgwNWI3MGU4OGNiMmIyZjg2fA",
    "ReportExtractionId": "2000000000407358",
    "ScheduleId": "0x05b70e670e9b2f86",
    "FileType": "Note",
    "ExtractedFileName": "9012092.DailyExtractionByTemplate_20170511_0922.20170512.092254.2000000000407358.tm01n01.csv.gz.notes.txt",
    "LastWriteTimeUtc": "0001-01-01T00:00:00.000Z",
    "ContentsExists": false,
    "Size": 0},
   {"ExtractedFileId": "VjF8MHgwNWI3MGU4ODc3NmIyZjc2fA",
    "ReportExtractionId": "2000000000407358",
    "ScheduleId": "0x05b70e670e9b2f86",
    "FileType": "Full",
    "ExtractedFileName": "9012092.DailyExtractionByTemplate_20170511_0922.20170512.092254.2000000000407358.tm01n01.csv.gz",
    "LastWriteTimeUtc": "2017-05-12T13:25:11.000Z",
    "ContentsExists": true,
    "Size": 87590966}]}
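Picking the data file out of a listing like this is straightforward; a sketch, assuming resp holds the response dumped above:

files = resp.json()["value"]
# the Note file has FileType "Note"; the gzipped data file has FileType "Full"
data_file = next(f for f in files if f["FileType"] == "Full")
file_id = data_file["ExtractedFileId"]   # feed this to ExtractedFiles('<id>')/$value
expected_size = data_file["Size"]        # 87590966 bytes in the dump above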



@jsi-data

Do you use the Python Requests module for the http request?

The API generally provides the extracted files in the format stated by the server. However, according to its documentation, the Requests library automatically decompresses gzip- and deflate-encoded data, so you will always receive unzipped plain text.

To retrieve the actual data format, you can use the raw response content to get the actual gzip file.

Below is sample code to write the extracted file in its actual format. You have to add stream=True in the get() method.

import requests

urlFile = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/ExtractedFiles('VjF8MHgwNWI3MGMyOTI5YmIyZjc2fA')/$value"
headers = {'Authorization': myToken, 'Accept-Encoding': 'compress'}
resp = requests.get(urlFile, headers=headers, stream=True)
...
# resp.raw exposes the undecoded body, so the gzip file is written as-is
with open(extractFileName, 'wb') as fd:
    fd.write(resp.raw.read())
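If the file is large, a variant that avoids holding the whole body in memory is to stream the raw (still compressed) bytes to disk with the standard library; a sketch using the same urlFile, headers and extractFileName as above:

import shutil

resp = requests.get(urlFile, headers=headers, stream=True)
with open(extractFileName, 'wb') as fd:
    # copyfileobj reads resp.raw in chunks; no decoding is applied
    shutil.copyfileobj(resp.raw, fd)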


I do use python requests. I was using a session object and prepared requests to get the raw content. However, I also tried what you suggested above, and still got a partial file. I downloaded the file twice; the actual file size is ~4M zipped, but I only got ~94K of data. No error was reported. You can try this:

url = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions/ExtractedFiles('VjF8MHgwNWI4MDk4ZjZiOWIyZjg2fA')/$value"
r = requests.get(url, headers=headers, stream=True, proxies=PROXIES, auth=("9012092", "xxxxxxx"))
with gzip.open("t1", 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        f.write(chunk)

And this is the file I got, whereas the actual file should be ~4M, as shown in the http response.

-rw-r--r-- 1 jsi_trth jsi_data_group 94914 May 15 10:38 t1

{"@odata.context": "https://hosted.datascopeapi.reuters.com/RestApi/v1/$metadata#ExtractedFiles",
 "value": [
   {"ExtractedFileId": "VjF8MHgwNWI4MDk4ZjZiOWIyZjg2fA",
    "ReportExtractionId": "2000000000413472",
    "ScheduleId": "0x05b8096826bb3016",
    "FileType": "Full",
    "ExtractedFileName": "9012092.DailyExtractionByTemplate_20170511_1029.20170515.102932.2000000000413472.tm03n01.csv.gz",
    "LastWriteTimeUtc": "2017-05-15T14:32:12.000Z",
    "ContentsExists": true,
    "Size": 4317179}]}


@jsi-data

I have tried your code and found the same behavior. I understand that your code gets data that has already been decompressed by Python Requests and then compresses it again to create a gzip file. It seems there is an issue when Requests tries to decompress the downloaded data.

I have tried the code I provided again with an extracted file of 14 MB. The code creates a file of the same size as the one shown in the http response.

Can you dump the response header to verify the actual response size?
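For example, something like this (url and headers as in your snippet):

resp = requests.get(url, headers=headers, stream=True)
print(resp.status_code, resp.reason)
print('Content-Length:', resp.headers.get('Content-Length'))
print('Content-Encoding:', resp.headers.get('Content-Encoding'))
print('Transfer-Encoding:', resp.headers.get('Transfer-Encoding'))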


@jsi-data

What you describe reminds me of a case a customer had recently, where a download of a large file resulted in a very small file. This was due to a proxy, and I see you are using one.

Could you test your code on a different machine, or maybe even from home? The point is to check whether the proxy is the issue or not.



A little more research shows some interesting findings.

1. Of the two files I got from the scheduled extraction, one Note file and one data file, both gzipped, the responses to the $value requests differ. The Note file's response headers use chunked transfer encoding and carry no content size, whereas the data file's response has a content size and no 'chunked' method. I suspect the content size header confused the python requests lib, resulting in the partial download. However, I think Reuters should find out why the responses differ; I can download the first file without problem, and from the requests lib docs, chunked seems to be the right header info here.

2. If you download the file as the requests lib suggests, using iter_content, then the data is decoded (unzipped in my case) even if I pass decode_unicode=False. In that case, the size returned by the lib is actually the unzipped byte count, and this may have confused the lib, as the content header carries the gzipped file size. The working method is to go directly to the underlying lib and do the below. It opens a generator on the data stream through which you can iterate over the raw bytes and save the gzip file.

for data in r.raw.stream(decode_content=False):
    f.write(data)


@jsi-data

The Note file and the data file are generally created in different formats; the Note file is plain text, while the data file is gzip, so the way the data is delivered may differ.

Anyway, could you please confirm whether you are able to retrieve the correct gzip file with the code you provided in 2)?

If not, was the Content-Length of the response header the same as the size returned from the /Extractions/ExtractedFiles({id}) endpoint?
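For example, a sketch along these lines (file_id and headers as before); both numbers should match the 4317179 bytes reported in your listing:

base = "https://hosted.datascopeapi.reuters.com/RestApi/v1/Extractions"

# size according to the ExtractedFiles({id}) metadata endpoint
meta = requests.get(base + "/ExtractedFiles('" + file_id + "')", headers=headers)
print('metadata Size:', meta.json()['Size'])

# Content-Length when requesting the payload itself
resp = requests.get(base + "/ExtractedFiles('" + file_id + "')/$value",
                    headers=headers, stream=True)
print('Content-Length:', resp.headers.get('Content-Length'))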
