DF Document tagging changes

On the LinkQ project we need to display the following tags related to an article:

Company Relevance (like: Nike 80%), Topics and Events.

For Events we plan to use "detectedEvents_attr" and for Topics, "DocCat_attr".

The question is about the Organizations. We have them in "calais_relevance_*_attr", which is an array of strings containing Locations, Industries, Organizations, etc.

To distinguish which strings are organizations, we need to make an additional request for each string.

The DF request entity/search returns the following relevance info:

"calais_relevance_80_attr":[
"Application Software", - Industry
"Cleveland", - Location
"National Basketball Association", - Company
],

It looks like, to identify the organizations here, we need to look into "Organization_attr" and then find each such company in "calais_relevance_*_attr". The other option is to make a request for each string and find out which ones are organizations.
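The first option (matching "Organization_attr" against the relevance arrays on the client) could be sketched roughly as follows. This is only an illustration under assumptions: it assumes "Organization_attr" is a list of organization names, that those names appear verbatim in the relevance arrays, and that the number in the field name is the relevance percentage. The helper name organizations_with_relevance is made up for this example.

```python
import re

# Field names are taken from the question; the data shapes are assumptions.
RELEVANCE_FIELD = re.compile(r"^calais_relevance_(\d+)_attr$")


def organizations_with_relevance(doc):
    """Return {organization name: highest relevance percent} for one document."""
    # Assumption: Organization_attr is a list of plain organization names.
    org_names = set(doc.get("Organization_attr", []))
    result = {}
    for field, labels in doc.items():
        match = RELEVANCE_FIELD.match(field)
        if not match:
            continue
        # Assumption: the number in the field name is the relevance percentage.
        relevance = int(match.group(1))
        for label in labels:
            if label in org_names:
                result[label] = max(relevance, result.get(label, 0))
    return result


# Sample document built from the values shown in the question.
doc = {
    "Organization_attr": ["National Basketball Association"],
    "calais_relevance_80_attr": [
        "Application Software",
        "Cleveland",
        "National Basketball Association",
    ],
}
print(organizations_with_relevance(doc))
```

This avoids the per-string request, but it still returns only names, not organization ids.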

We wonder whether "calais_relevance_*_attr" could contain a type description, like the following:

"calais_relevance_80_attr":[
{"_type": "Industry",
"label": "Application Software"},
{"_type": "Location",
"label": "Cleveland"},
{"_type": "Organization",
"OrganizationId or Uri" : "..........",
"label": "National Basketball Association"}
]

Here is an example of a permid.org tagging response for an organization:

{
"_typeGroup":"entities",
"_type":"Organization",
"relevance":0.8
...}


data-fusion
Excellent question. If that's all you're doing, and you don't need further integration of the TRIT output into the Thomson Reuters Knowledge Graph, then your use case would be better served by consuming TRIT output directly and skipping Data Fusion processing altogether.

Why not query for the connected organizations (analyze/search) and filter by score on the client?

Looking into analyze/search. We noticed that some news items are returned without tagging (although PermID tags them), and the earliest news we receive is from (now - 2.5 hours). The earliest news also has no tagging. Does tagging take some time to appear?

Ingest, tagging, and indexing are independent processes. Our objective is to have the tagged news visible no later than 12 hours after ingest.

The following query is used to test for it:

GET /datafusion/api/entity/search?sort=related_uri_count&dir=desc&limit=10&extraFields=id_attr,lastModified_attr_dt&searchString=lastModified_attr_dt:[NOW-12HOURS TO NOW]

A document is considered tagged if it contains the following field:

"id_attr": "http://id.opencalais.com/
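As a rough client-side illustration of this check, the sketch below rebuilds the test query and decides whether a returned document counts as tagged. It assumes the "id_attr" value starts with the prefix shown above and may arrive either as a string or as a list of strings; the helper names and the sample id suffix are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Prefix taken from the answer above; treating it as a "starts with" test is an assumption.
TAGGED_PREFIX = "http://id.opencalais.com/"


def is_tagged(doc):
    """Return True if the document carries an id_attr with the Calais prefix."""
    value = doc.get("id_attr")
    if value is None:
        return False
    # Assumption: the field may be a single string or a list of strings.
    values = value if isinstance(value, list) else [value]
    return any(isinstance(v, str) and v.startswith(TAGGED_PREFIX) for v in values)


def recent_documents_path(hours=12, limit=10):
    """Build the path and query string of the test query shown above."""
    params = {
        "sort": "related_uri_count",
        "dir": "desc",
        "limit": limit,
        "extraFields": "id_attr,lastModified_attr_dt",
        "searchString": f"lastModified_attr_dt:[NOW-{hours}HOURS TO NOW]",
    }
    # urlencode percent-encodes the brackets, colon and spaces in searchString.
    return "/datafusion/api/entity/search?" + urlencode(params)


path = recent_documents_path()
# Decoding the query again recovers the original search string.
print(parse_qs(urlsplit(path).query)["searchString"][0])
# The id suffix below is a made-up placeholder.
print(is_tagged({"id_attr": TAGGED_PREFIX + "example-id"}))
```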

Currently we have to make 2 requests:

The 1st one to receive news (entity/search), which has the connected company names as strings, and the 2nd one to receive the connected company ids (analyze/search). This causes extra requests and, as a result, extra time, which becomes significant in the case of thousands of news items.

That is why we propose to have everything in one place: for a document-type entity/search request, return a structured response with the company type and id:

"calais_relevance_80_attr":[
{"_type": "Organization",
"OrganizationId or Uri" : "..........",
"label": "National Basketball Association"}
.... ]


That would defeat the purpose of having a graph database, wouldn't it?

If you can deal with the output in bulk, you can tokenize the search (entity/search/tokenize) and then plug the token back into a second search call that you can page through. The following query will give you the orgs connected to the original search list through the OrganizationToDocument predicate:

GET /datafusion/api/entity/search?limit=10&parentUris=039e81f32c70ab168c8a1c8cf49cabfb&parentPredicateFilters=120|||http://s.opencalais.com/1/pred/OrganizationToDocument
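For illustration only, the second call could be assembled on the client along these lines. The token is whatever entity/search/tokenize returned for the first search (its response format is not shown in this thread, so the sketch takes the token as a plain string), the "120" prefix is copied from the query above without interpretation, and the helper name connected_orgs_path is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Predicate URI copied from the query above.
ORG_TO_DOC = "http://s.opencalais.com/1/pred/OrganizationToDocument"


def connected_orgs_path(token, limit=10, filter_prefix="120", predicate=ORG_TO_DOC):
    """Build the path and query string for the follow-up entity/search call."""
    params = {
        "limit": limit,
        # Token obtained from entity/search/tokenize for the original search.
        "parentUris": token,
        # Same "prefix|||predicate" layout as in the example query.
        "parentPredicateFilters": f"{filter_prefix}|||{predicate}",
    }
    # urlencode percent-encodes "|", ":" and "/" in the filter value.
    return "/datafusion/api/entity/search?" + urlencode(params)


path = connected_orgs_path("039e81f32c70ab168c8a1c8cf49cabfb")
# Decoding the query again recovers the filter exactly as written above.
print(parse_qs(urlsplit(path).query)["parentPredicateFilters"][0])
```

The percent-encoded form is equivalent to the raw query once decoded by the server; the raw form shown above can be used instead if the client does not encode it.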

Separately, we're working on improving the performance of parallel queries and of queries in general. ETA Q2.

See also:

https://community.developers.refinitiv.com/questions/7229/how-do-you-limit-the-number-of-levels-returned.html

https://community.developers.refinitiv.com/questions/10447/filter-entities-by-relationship-type.html

Thank you, Tomasz

I understand the purpose of the analyze/search request. However, if we already have all the necessary information (the organization name and its relevance), why not include the id in the entity/search response as well?

In our case we need to make 2nd request just to retrieve org id.
