
Query about locked RFA threads...

We have been having an issue in our applications using RFA (versions used: 7.6.0.E9/8.0.1E1). The issue occurred around the time of a scheduled hardware upgrade on the infrastructure side. Before the upgrade, we used adsmon to disable ports on those ADS and dropped the mounts/client connections. According to the RFA documentation and the TR team, RFA should move over to the next ADS in the serverList. Instead, a number of applications experienced locked threads (see the trace below) and had to be bounced.

Based on the trace, I have the following questions:
- Is the thread waiting to get the Service Directory list?
- What does RFA do if the ADS allows a Login but does not send the Service Directory information? Would that result in this issue?
- Is there a way to move to the next ADS in the serverList if we do not receive specific events within a certain amount of time?
- Any input on preventing this lock from happening?

Our config is in the following format -


RFASessionName = Marketdata::ADS


Connections.delayedADS1.connectionType = RSSL
Connections.delayedADS1.serverList = Delayed1_NOPT1, Delayed1_NOPT2, Delayed1_NOPT3, Delayed1_NOPT4
Connections.delayedADS1.portNumber = 1004
Connections.delayedADS1.throttleType = timer
Connections.delayedADS1.throttleTimerInterval = 1
Connections.delayedADS1.throttleRequestsPerInterval = 10000
Connections.delayedADS1.logFileName = console


Connections.delayedADS2.connectionType = RSSL
Connections.delayedADS2.serverList = Delayed2_OPT1, Delayed2_OPT2, Delayed2_OPT3, Delayed2_OPT4
Connections.delayedADS2.portNumber = 1004
Connections.delayedADS2.throttleType = timer
Connections.delayedADS2.throttleTimerInterval = 1
Connections.delayedADS2.throttleRequestsPerInterval = 10000
Connections.delayedADS2.logFileName = console

Connections.realtimeADS1.connectionType = RSSL
Connections.realtimeADS1.serverList = Realtime1_NOPT1, Realtime1_NOPT2, Realtime1_NOPT3, Realtime1_NOPT4
Connections.realtimeADS1.portNumber = 1002
Connections.realtimeADS1.throttleType = timer
Connections.realtimeADS1.throttleTimerInterval = 1
Connections.realtimeADS1.throttleRequestsPerInterval = 10000
Connections.realtimeADS1.logFileName = console

Connections.realtimeADS2.connectionType = RSSL
Connections.realtimeADS2.serverList = Realtime2_OPT1, Realtime2_OPT2, Realtime2_OPT3, Realtime2_OPT4
Connections.realtimeADS2.portNumber = 1002
Connections.realtimeADS2.throttleType = timer
Connections.realtimeADS2.throttleTimerInterval = 1
Connections.realtimeADS2.throttleRequestsPerInterval = 10000
Connections.realtimeADS2.logFileName = console

Marketdata.Sessions.ADS.connectionList = delayedADS1, delayedADS2, realtimeADS1, realtimeADS1

Thread stack -

- locked <0x000000039c73f828> (a com.reuters.rfa.internal.session.omm.OMMItemEventMsg)
at com.reuters.rfa.internal.connection.ConnectionImpl.dispatch(ConnectionImpl.java:538)
at com.reuters.rfa.internal.connection.rssl.RSSLDirectoryHandler.createRefreshForRequestFromCache(RSSLDirectoryHandler.java:1025)
at com.reuters.rfa.internal.connection.rssl.RSSLDirectoryHandler.processMsg(RSSLDirectoryHandler.java:94)
at com.reuters.rfa.internal.connection.rssl.RSSLConnection.processTransportData(RSSLConnection.java:709)
at com.reuters.rfa.internal.rssl.RsslClientConnectionManager.processTransportData(RsslClientConnectionManager.java:672)
at com.reuters.rfa.internal.rssl.RsslClientConnection.processTransportData(RsslClientConnection.java:319)
at com.reuters.ipc.ConnectionImpl.dispatchTransportData(ConnectionImpl.java:688)
at com.reuters.ipc.ConnectionImpl.dispatchData(ConnectionImpl.java:681)
at com.reuters.ipc.ConnectionImpl.readAndDispatchData(ConnectionImpl.java:304)
at com.reuters.ipc.ConnectionImpl.readDispatch(ConnectionImpl.java:140)
at com.reuters.ipc.SubConnection.readDispatch(SubConnection.java:412)
at com.reuters.mainloop.channel.ChannelSession.readDispatch(ChannelSession.java:382)
at com.reuters.mainloop.channel.ChannelSession.readSelected(ChannelSession.java:294)
at com.reuters.mainloop.channel.ChannelMainLoop.processSelection(ChannelMainLoop.java:81)
at com.reuters.mainloop.channel.ChannelMainLoop.selectFor(ChannelMainLoop.java:342)
at com.reuters.mainloop.channel.ChannelMainLoop.run(ChannelMainLoop.java:199)
at com.reuters.rfa.internal.common.EventQueueMLThread.runImpl(EventQueueMLThread.java:46)
at com.reuters.rfa.internal.common.InterruptibleThread.run(InterruptibleThread.java:80)
at java.lang.Thread.run(Thread.java:682)

"Timer-0" daemon prio=10 tid=0x00002ba0856a2340 nid=0x7870 in Object.wait() [0x00002ba0a4a43000]


Hi @mangesh,

As suggested, can you turn on logging within your application? We should be able to see where in the recovery/failover your application is before we see the locked thread.


Hi @mangesh

Q 1: Is the connectionTimeout setting for the total time for the connection (login, directory...) or is it just for the login?

Answer: The connectionTimeout sets the time that the API allows for "establishing a connection to the Provider IP and RSSL port" only. It does not include the login or any subsequent steps.

Q 2: Is there a timeout that I can set in properties that would control the maximum time RFA tries to connect to a provider before moving on to the next provider in the list?

Answer: The maximum time RFA tries to connect to each server in the list can be set via the "connectionTimeout" parameter. After all servers in the list have been tried, the API waits for the "connectRetryInterval" time before starting again with the first server in the list.

example:
serverList: "A,B,C" 
connectionTimeout: 10000 (10 seconds)
connectRetryInterval: 30000 (30 seconds)

1. The API first connects to server A
2. The API waits for 10 seconds, then logs "connection fail for server A"
3. The API connects to server B
4. The API waits for 10 seconds, then logs "connection fail for server B"
5. The API connects to server C
6. The API waits for 10 seconds, then logs "connection fail for server C"
7. The API waits for 30 seconds, then connects to server A again
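
The timeline above can be reproduced with a small stand-alone simulation. This is only a sketch of the schedule described in the steps, not RFA code: the class and method names (`FailoverSchedule`, `simulate`) are made up for illustration, and every attempt is assumed to fail by timing out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: models the serverList retry schedule described above.
// It does not call RFA; every connection attempt is assumed to time out.
public class FailoverSchedule {

    public static List<String> simulate(List<String> serverList,
                                        long connectionTimeoutMs,
                                        long connectRetryIntervalMs,
                                        int passes) {
        List<String> log = new ArrayList<String>();
        long now = 0; // elapsed time in milliseconds
        for (int pass = 0; pass < passes; pass++) {
            for (String server : serverList) {
                log.add("t=" + now + "ms connect " + server);
                now += connectionTimeoutMs; // attempt fails after connectionTimeout
                log.add("t=" + now + "ms connection fail for " + server);
            }
            // All servers tried: wait connectRetryInterval before the next pass.
            now += connectRetryIntervalMs;
        }
        return log;
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("A", "B", "C");
        for (String line : simulate(servers, 10000, 30000, 2)) {
            System.out.println(line);
        }
    }
}
```

With the example values, the second attempt on server A starts 60 seconds after the first one (3 x 10 seconds of timeouts plus the 30 second retry interval).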

I am still checking on the rest of your questions.

Please note that the API does not fail over to the next ADS in your scenario because the "actual network connection" between the API and the ADS IP/port is still alive. A "disconnect" means that the network connection between the ADS IP/port and the API machine has been dropped.

Have you checked on your ADS why it does not send the Service Directory refresh message to the API?


Hi @Wasin

Did you manage to look into the other parts of my question?


Hi @mangesh

The API failover is controlled by the following configuration parameters.

  • connectionTimeout: The time (in milliseconds) to wait for a connection attempt to succeed. After a connection has been established, this value is used to set a timer to check for 3 consecutive missed pings. The default value is 10000 (10 seconds).
  • maxRetryCount: The maximum number of times RFA retries an unsuccessful connection. "-1" means retry indefinitely; "0" means no retry (if the connection fails, RFA will not retry). The default value is "-1".
  • connectRetryInterval: The time (in milliseconds) to wait before attempting to recover a failed RSSL connection. This controls how often RFA attempts to recover a failed connection after it has been established. It is only used after all servers in the serverList have been attempted. The default value is 15000 (15 seconds).
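
As an illustration only, these three parameters might be set per connection in the same properties format as the configuration posted in the question. The `Connections.realtimeADS1.` key prefix is assumed from that configuration, and the values shown are simply the defaults listed above.

```
Connections.realtimeADS1.connectionTimeout = 10000
Connections.realtimeADS1.maxRetryCount = -1
Connections.realtimeADS1.connectRetryInterval = 15000
```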

Basically, the connection process consists of the following steps:

  1. The API sends a LOGIN request to the ADS
  2. The ADS sends a LOGIN response back to the API

The API then automatically sends a DIRECTORY request to the ADS. However, the application can also request the DIRECTORY information from the ADS manually. Does your application let the API request the DIRECTORY information, or does it request it manually?

Can you replicate this issue on demand in a controlled environment? If so, could you please enable the RFA trace file so that we can check the issue in detail? You can set the following RFA Java configuration parameters to enable the XML trace file:

  • <namespace>/Connections/<Connection Name>/ipcTraceFlags = 23
  • <namespace>/Connections/<Connection Name>/mountTrace = True
  • <namespace>/Connections/<Connection Name>/logFileName = <path to log file>

Hi @mangesh, I have replicated the scenario with the RFA Java StarterConsumer and StarterProvider_Interactive examples. If the provider does not send the DIRECTORY response to the API, the API sends the Status message

{OPEN, SUSPECT, NONE, "Waiting for service <service name> UP. Item recovery in progress..."} 

to the application if the application subscribes to data.

The API also will not perform a failover because "a connection" between the API and the Provider is still alive. It keeps waiting for the DIRECTORY response message from the Provider. The API switches to the next Provider only when it is "disconnected" from the current Provider.


Hi @Wasin,

Thanks for your response. It was really helpful in understanding what happens when we see the lock.

Our application does a manual request for Login, Directory, Dictionary (we don't have the dictionary stored locally).

You had mentioned the following -

"The API also will not perform a failover because "a connection" between the API and Provider still alive. It keeps waiting for the DIRECTORY response message from the Provider. The API switches to the next Provider when it "disconnect" with the current Provider only."

connectionTimeout: The time (in milliseconds) to wait for a connection attempt to succeed.

How long will the application wait for the directory message? When the RFA API fails over from one provider to another, it is completely an RFA-only process. Is there a timeout that I can set in properties that would control the maximum time RFA tries to connect to a provider before moving on to the next provider in the list? Is the connectionTimeout setting for the total time for the connection (login, directory...) or is it just for the login?

It's not possible for us to enable RFA-level logs in PROD without impacting clients. We will try to create something that mimics the same scenario.


Hi @mangesh

Q: How long will the application wait for the directory message?

RFA requests the directory message directly. The application simply registers interest in receiving the directory updates. There is no explicit timeout for a directory request, but it falls under the standard item request timeout. The itemRequestTimeout configuration parameter defaults to 45 seconds.

Q: Is there a timeout that I can set in properties that would control the maximum time RFA tries to connect to a provider before moving on to the next provider in the list?

The connectRetryInterval configuration parameter is the time in milliseconds to wait before attempting to recover a failed RSSL connection. This controls how often RFA will attempt to recover a failed connection, after it has been established. This is only used after all servers in the serverList have been attempted.

Q: Is the connectionTimeout setting for the total time for the connection (login, directory...) or is it just for the login?

The connectionTimeout setting applies strictly to establishing a connection to the server. The login and directory are not included in this time.


Hi @Wasin

Could you share how you reproduced the issue of the locked Thread?

Regards,

Mangesh


Hi @mangesh

I cannot replicate the locked thread issue. I can replicate only the scenario in which the Provider (the RFA Java StarterProvider_Interactive example) does not send the DIRECTORY Refresh message to the consumer.

I set the RFA Java StarterConsumer example to connect to the StarterProvider_Interactive example, then modified StarterProvider_Interactive's ProviderClientSession.java so that it does not send the DIRECTORY Refresh message back to the StarterConsumer, as in the following code:

protected void processOMMSolicitedItemEvent(OMMSolicitedItemEvent event) {
    ...
    case RDMMsgTypes.DIRECTORY:
        // Reuters defined domain message model - DIRECTORY
        // For the AHS_13227 test: do not send the DIRECTORY response to the consumer
        // processDirectoryRequest(event);

When the StarterConsumer connects to the StarterProvider_Interactive, it keeps waiting for the DIRECTORY Refresh message forever, so I assume that there is no directory request timeout. The API also sends the STATUS_RESP "Waiting for service <service>. Item recovery in progress..." to the application because it has not received any service information from a DIRECTORY Refresh message.

Please note that the API will not fail over to the next Provider because the API and the StarterProvider_Interactive are still "connected".
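
Since the API does not fail over while the connection stays up, one possible workaround is an application-level watchdog that notices when the directory refresh has not arrived in time. The sketch below is only an assumption about how an application might do this. It uses plain `java.util.concurrent` and no RFA calls, the names `DirectoryWatchdog`, `onDirectoryRefresh` and `awaitRefresh` are hypothetical, and what the application does on timeout is left open.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical application-side watchdog; it is not part of RFA.
public class DirectoryWatchdog {

    private final CountDownLatch refreshSeen = new CountDownLatch(1);

    // The application's own event handler would call this when it
    // receives the directory refresh.
    public void onDirectoryRefresh() {
        refreshSeen.countDown();
    }

    // Waits up to timeoutMs; returns true if the refresh arrived in time.
    public boolean awaitRefresh(long timeoutMs) {
        try {
            return refreshSeen.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    public static void main(String[] args) {
        DirectoryWatchdog watchdog = new DirectoryWatchdog();
        // Nothing calls onDirectoryRefresh() here, so the wait times out.
        if (!watchdog.awaitRefresh(200)) {
            System.out.println("No directory refresh within 200 ms; "
                    + "application-level recovery would start here");
        }
    }
}
```

The wait should run on an application thread rather than on the thread that dispatches RFA events, so that the watchdog does not block the delivery of the refresh it is waiting for.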


Hi @Wasin Waeosri,

Can you tell if RFA ever resends the directory request after a certain period of time? How long does the STATUS_RESP come after the initial request? Just want to confirm if the behavior is similar to that of an item request and associated with the itemRequestTimeout configuration parameter.


Hi @mangesh

Sorry for my mistake. If the application manually sends a directory request to the ADS, RFA Java times out at 45 seconds. RFA will then close the request and send a status message to the application.

The StarterConsumer example does not make a Directory request, so that application does not receive any status message.
