
Spurious Channel "UP" event received in the middle of Elektron Edge device reboot disrupting ETA reactor "Reconnect" logic

We start with a client connected to an Elektron Edge Device using ETA, with a subscription to a BDS.

If the Elektron device is rebooted, the client almost immediately receives a channel event specifying RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING, indicating that the channel was reset by the peer.

Since the connect options have reconnectAttemptLimit set to -1, the client will attempt to reconnect to the server using the specified delay range of 100 to 60000 msec.
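
For reference, the relevant connect options look roughly like this (a sketch against the ETA C Reactor API; the host and port shown are placeholders):

#include "rtr/rsslReactor.h"

/* Sketch only: configure the reconnect behavior described above.
 * The host and port are placeholders, not the actual setup. */
static void setupConnectOptions(RsslReactorConnectOptions *pOpts)
{
    rsslClearReactorConnectOptions(pOpts);

    pOpts->rsslConnectOptions.connectionInfo.unified.address = (char*)"edge-host";
    pOpts->rsslConnectOptions.connectionInfo.unified.serviceName = (char*)"14002";

    pOpts->reconnectAttemptLimit = -1; /* retry indefinitely */
    pOpts->reconnectMinDelay = 100;    /* first retry after 100 msec */
    pOpts->reconnectMaxDelay = 60000;  /* back off up to 60000 msec */
}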

After some additional channel events specifying RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING, indicating that the client connect failed, another channel event arrives, this time specifying RSSL_RC_CET_CHANNEL_UP.

Immediately following this, the loginCallback set in the ConsumerRole specified when the rsslReactorConnect API call was made is invoked, with a status indicating: Closed/Suspect/Not entitled - text: "DACS DDS connection is down"

At this point the client assumes that the credentials specified in the original login message are no longer valid and aborts the connection.

How is it possible to determine that the Elektron device reporting a lack of entitlement for the specified username is due to an incomplete reboot process, and that the client should retry?

Further, how can the client detect that the entitlement portion of the Edge device is back up and ready to process a login -- especially given that the login in this circumstance happens automatically due to the reconnectAttemptLimit setting?

Should the client just use a login-failure retry setting indicating that login failures are not to be trusted and should be retried a set number of times before giving up?

Or is there a way to detect that the error was ephemeral?

elektron, refinitiv-realtime, elektron-sdk, rrt, eta-api, elektron-transport-api, DACS, connection

Accepted

So yes, you hit a window where entitlement was going to fail. The CHANNEL_UP event simply indicates that the socket connection is set up; the full login has not completed yet. If you have the Reactor role set to log in automatically, this event is purely informational; if not, it is the indication to start the login process.

The DOWN_RECONNECTING event in this case tells you that your infrastructure is down, but the Reactor will still retry based on your reconnect settings (reconnect*Delay, etc.), so you do not need to immediately abort the connection here.

Depending on what is trying to log in (a trusted app or a casual user), it may be appropriate to let the Reactor retry at least a few times, presuming the condition is temporary. If the condition is not temporary you will want to alarm for manual intervention in any case; use the retry count to decide when to stop retrying.
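
As an illustration of that advice, a minimal sketch of a channel event callback (ETA C Reactor API; the function name is ours) that treats DOWN_RECONNECTING as informational rather than a reason to abort:

#include "rtr/rsslReactor.h"

static RsslReactorCallbackRet channelEventCallback(RsslReactor *pReactor,
        RsslReactorChannel *pReactorChannel, RsslReactorChannelEvent *pEvent)
{
    switch (pEvent->channelEventType)
    {
    case RSSL_RC_CET_CHANNEL_UP:
        /* Socket is connected; the Reactor has not yet completed the
         * Login/Directory exchange, so do not treat this as "ready". */
        break;
    case RSSL_RC_CET_CHANNEL_READY:
        /* Login and Directory streams are open; safe to (re)request items. */
        break;
    case RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING:
        /* Infrastructure is down; the Reactor keeps retrying per
         * reconnectAttemptLimit/reconnectMinDelay. Do not abort here. */
        break;
    case RSSL_RC_CET_CHANNEL_DOWN:
        /* The Reactor has given up; clean up and alarm for manual
         * intervention. */
        break;
    default:
        break;
    }
    return RSSL_RC_CRET_SUCCESS;
}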



I've been looking into this a bit more, trying to determine how to work around what appears to be a bug in the ETA SDK. I've gathered some additional information, documented here.

Looking at the console of the Elektron Edge Device using TSRC provides the following information:

Last server start time: 30 May 19:15:12
Last server stop time:  30 May 19:11:14

Prior to restarting the Elektron Edge Server, a client was connected to it, and subscribed to 9780 symbols from a BDS (symbol list) with 85056 symbols in it.

The following appears in the Elektron Edge Device log:

2017-05-30; 19:11:13; Harmless; The SMF Client ?????????? TRSC 5444:99 127.0.0.1 has dropped its connection to SMF.; smf; 2020; 2952; 00000e0b; 
2017-05-30; 19:11:14; Harmless; SMF is shutting down; smf; 2020; 2024; 00000e0c; 
2017-05-30; 19:11:14; Harmless; CBRS is Shutting down...; CBRS; 760; 936; 00000e0d; 

This agrees with the information on the TSRC console.

From the client's log:

May 30 15:11:18.771 [error] Channel down, RECONNECTING: reason: errcode: SUCCESS, errtext: <Impl/ripcsrvr.c:6654> Error:1002 ripcRead() failure. Connection reset by peer (ret: -1, sys: 0), location: Reactor/Impl/rsslReactor.c:3132

This log message is printed in the client's channel event callback when the channel event type is RSSL_RC_CET_CHANNEL_DOWN_RECONNECTING. This is to be expected given that the Elektron Edge Device was shut down to reboot the box so that we could test the client application's recovery logic.

A second later another channel event callback with the same event type occurs:

May 30 15:11:19.804 [error] Channel down, RECONNECTING: reason: errcode: SUCCESS, errtext: <Impl/ripcsrvr.c:5160> ripcConnecting() Error: 1002 client connect() failed.  System errno: (107) (ret: -1, sys: 107), location: Reactor/Impl/rsslReactorWorker.c:437

Again, this is expected since the reconnectMinDelay is set to 1000 msec.

The next event in the client application's log is unexpected, occurring 4 seconds later:

May 30 15:11:22.031 [info] Channel UP

This log message is printed when the channel event type in the channel event callback is RSSL_RC_CET_CHANNEL_UP. The Elektron Edge Device is not back up yet according to its log. It won't be back up until 15:15:12 (or 19:15:12 UTC).

The next messages appear almost immediately in the application log (printed from the login msg callback routine this time):

May 30 15:11:22.288 [notice] Received login status msg
May 30 15:11:22.288 [error] login status: State: Closed/Suspect/Not entitled - text: "DACS DDS connection is down"

According to the documentation for the RSSL_RC_CET_CHANNEL_UP event:

  • ... This will drive the process of setting up the connection by exchanging the Login, Directory, and Dictionary messages.

It is not surprising that the login failed at this point, given that the Channel UP event indicates the channel is up and the login message should be sent. However, since the Elektron Edge device is in the process of rebooting, the login cannot succeed.

The real question here is why the client application received a channel event callback indicating that the channel was now UP when, according to the Elektron Edge Device log, it was still clearly rebooting.

This spurious "UP" notification is the cause of the problem. It appears there may be some bug in the logic of the reactor.


@joe.ferraro

The CHANNEL_UP event means the Reactor has successfully opened a connection, but has not yet sent the Login or Directory requests specified on the ConsumerRole. These will be sent after the UP event. When the Reactor gets responses from the server that the requested Login and Directory streams are open, it sends the CHANNEL_READY event to the application.
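
A role configured along these lines will have the Reactor drive that exchange automatically (a sketch against the ETA C Reactor API; the callback names are placeholders assumed to be declared elsewhere):

#include "rtr/rsslReactor.h"
#include "rtr/rsslRDMLoginMsg.h"
#include "rtr/rsslRDMDirectoryMsg.h"

/* Kept at file scope so they outlive the setup call. */
static RsslRDMLoginRequest loginRequest;
static RsslRDMDirectoryRequest directoryRequest;

static void setupConsumerRole(RsslReactorOMMConsumerRole *pRole)
{
    rsslClearOMMConsumerRole(pRole);

    /* Reactor sends this Login Request itself after CHANNEL_UP. */
    rsslInitDefaultRDMLoginRequest(&loginRequest, 1);   /* stream id 1 */
    pRole->pLoginRequest = &loginRequest;
    pRole->loginMsgCallback = loginMsgCallback;         /* sees the Status */

    /* Then the Directory Request; CHANNEL_READY follows the responses. */
    rsslInitDefaultRDMDirectoryRequest(&directoryRequest, 2);
    pRole->pDirectoryRequest = &directoryRequest;

    pRole->base.channelEventCallback = channelEventCallback;
    pRole->base.defaultMsgCallback = defaultMsgCallback;
}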

In this case, it appears the Reactor was able to reconnect and send a Login Request; the Login Status message in the log was the server's response. Is it possible the Edge also rebooted before the machine shut down completely?


Based on additional investigations, I've updated the title of this question.

The original title (based on what appeared to be the problem) was:

How to detect that a login callback specifying "Not entitled" during Elektron reboot is due to partial readiness of the Elektron Server and not an actual entitlement error.


@jim.carroll

I understand the sequence of events -- I've spent too much time reading the source code for the reactor trying to debug problems like this.

It is possible that the Elektron Edge responded to a connect request while it was coming down rather than the reactor erroneously generating an "UP" event callback.

Here is the Elektron Edge Device log from around the time of the event. See the first answer I posted to my question for the correlation between these events and those of the client application. I can send both log files if necessary.

2017-05-30; 19:07:37; Harmless; Reading SQL term, termName={Q_NA_MAJOR}; CBRS; 760; 936; 00000e07;
2017-05-30; 19:07:37; Harmless; Reading SQL term completed, termName={Q_NA_MAJOR}, count={85192}, itemsFilteredOut={0}, itemFilteredOutDueToSingleDomain={137}; CBRS; 760; 936; 00000e08;
2017-05-30; 19:07:37; Harmless; Criteria <BDS_XX_XXX_XXX> completed with 85056 records; CBRS; 760; 936; 00000e09;
2017-05-30; 19:08:13; Harmless; The SMF Client ???????? TRSC 5444:99 127.0.0.1 has connected to SMF; smf; 2020; 2952; 00000e0a;
2017-05-30; 19:11:13; Harmless; The SMF Client ???????? TRSC 5444:99 127.0.0.1 has dropped its connection to SMF.; smf; 2020; 2952; 00000e0b;
2017-05-30; 19:11:14; Harmless; SMF is shutting down; smf; 2020; 2024; 00000e0c;
2017-05-30; 19:11:14; Harmless; CBRS is Shutting down...; CBRS; 760; 936; 00000e0d;
#????????; Elektron Edge v2.5.4; SMF Version: 1.8
#Date; Time; Severity; Message; Application; PID; TID; SequenceNo; AppTime
#-------------------------------------------------------------------------------
2017-05-30; 19:15:21; Harmless; The SMF Client ?????????? TRSC 11048:1 144.14.146.244 has connected to SMF; smf; 2024; 2940; 00000001;
2017-05-30; 19:15:22; Harmless; The SMF Client ???????? smfd 1792:2 127.0.0.1 has connected to SMF; smf; 2024; 2940; 00000002;
2017-05-30; 19:15:22; Harmless; The SMF Client ???????? HardwarePlatformStats 2956:3 127.0.0.1 has connected to SMF; smf; 2024; 2940; 00000003;
2017-05-30; 19:15:22; Harmless; The SMF Client ???????? snmp 1824:4 127.0.0.1 has connected to SMF; smf; 2024; 2940; 00000004;


"DACS DDS connection is down" indicates that the DACS connection or service on the EED is down and in the stage of recovering. This could occur in the small window where the EED is still accepting connections during the shutdown process but has brought the DACS service down already, or the small window on startup where the EED has already begun accepting connections but the DACS connection has not been established. I think the real root of the problem here is the closed login status. In theory, if the EED was sending a closed recover login status in this case (when DACS is down, which is how TREP behaves), I assume all would be happy as I assume the reactor would continue to retry the login.

As far as the reactor sending you a channel up event when you believe the EED is not running, the only comment I can make is that if your application is receiving a login reply (refresh or status), the EED is running. Again, it may have been the small window while the shutdown or startup process was in progress.
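
To illustrate the Closed versus Closed/Recover distinction above, a sketch of how a login message callback might branch on the stream state (ETA C types; the function name is ours):

#include "rtr/rsslReactor.h"

static RsslReactorCallbackRet loginMsgCallback(RsslReactor *pReactor,
        RsslReactorChannel *pChannel, RsslRDMLoginMsgEvent *pLoginMsgEvent)
{
    RsslRDMLoginMsg *pLoginMsg = pLoginMsgEvent->pRDMLoginMsg;

    if (pLoginMsg && pLoginMsg->rdmMsgBase.rdmMsgType == RDM_LG_MT_STATUS
            && (pLoginMsg->status.flags & RDM_LG_STF_HAS_STATE))
    {
        RsslState *pState = &pLoginMsg->status.state;

        if (pState->streamState == RSSL_STREAM_CLOSED_RECOVER)
        {
            /* "Retry later": leave the channel alone and let the
             * Reactor's reconnect settings drive recovery. */
        }
        else if (pState->streamState == RSSL_STREAM_CLOSED)
        {
            /* Hard close: this is where the EED's "Not entitled" status
             * lands during reboot, so inspect the state code before
             * deciding to abort. */
        }
    }
    return RSSL_RC_CRET_SUCCESS;
}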



@Joe.Ferraro, we have attempted in a couple of ways to recreate this behavior, so far unsuccessfully, but we are continuing to try. We have a theory that there is a very brief window during bootup where the EED will accept a channel connection but not a login. We have, however, verified that the error response in a genuine case of a login failure is different from the one you observe during bootup. That error response is:

Received Login StatusMsg State: Closed/Suspect/None - text: "Error! User baduser is not registered"

So you may assume that a "Not entitled" means the server is not currently in a state to accept logins: try again.

Regardless, all of the above conditions merit the same action, i.e. trying again. It would be extremely unlikely for a production application directly connected to an Edge to encounter a genuine entitlements error, and even then, from a coding perspective, you would want to keep trying while we rectify what would probably be an admin problem.
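
A sketch of that classification, assuming the "Not entitled" text corresponds to the RSSL_SC_NOT_ENTITLED state code (worth confirming against your own logs; the function name is ours):

/* Decide whether a login failure looks transient (reboot window) or
 * genuine (e.g. an unregistered user). */
static RsslBool isRetryableLoginFailure(const RsslState *pState)
{
    if (pState->streamState == RSSL_STREAM_CLOSED_RECOVER)
        return RSSL_TRUE;  /* server explicitly allows recovery */

    if (pState->streamState == RSSL_STREAM_CLOSED
            && pState->code == RSSL_SC_NOT_ENTITLED)
        return RSSL_TRUE;  /* likely the EED's DACS shutdown/startup window */

    /* e.g. Closed/Suspect/None with "User baduser is not registered" */
    return RSSL_FALSE;
}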

We will continue to diagnose and hopefully identify the issue here but hope this helps in the interim in that your own course of action is likely not dependent on our findings.

Thanks.



@jonathan.haruni
@joseph.miller

It looks like the reactor tried to reconnect to the EED in that "small window" and received a response that caused the reactor to generate the "UP" callback to the application, triggering the cascade of events that caused the login to fail and the client application to give up.

How is it possible to distinguish this type of login failure from an actual login failure -- one in which the specified user either doesn't exist or isn't entitled to connect to the EED?

According to the documentation, login success is indicated by a "REFRESH" message and login failure by a "STATUS" message. The application treats the information from the EED as authoritative. As such, upon being notified that the user is not entitled, it closes the channel.

It is certainly possible to add logic to require multiple notifications of login failure before the application gives up, but what happens to the reactor's reconnect logic?

The application has set the reconnectAttemptLimit to -1 in the connectOptions specified in the rsslReactorConnect call. As a result, the bulk of the retry logic is handled internally by the reactor, with the client application being notified via callback when a reconnect attempt fails or the connection finally succeeds.

Once the reactor receives an "UP" channel event and as a result attempts a login, only to have the login fail, will the reactor continue to try to reconnect?

Or will the reactor cease its attempts to reconnect, leaving the application to process the login failure, close the channel, and attempt to reconnect again?
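
For completeness, the "multiple notifications of login failure" logic mentioned above might look something like this sketch (the threshold, counter, and function name are all ours, not from the SDK):

#include "rtr/rsslReactor.h"

#define MAX_LOGIN_FAILURES 5      /* threshold is ours; tune as needed */
static int loginFailureCount = 0;

/* Call from the login message callback on a retryable-looking failure.
 * With reconnectAttemptLimit = -1 the Reactor keeps reconnecting and
 * re-issuing the login; the application only decides when to stop
 * trusting the failures. */
static void handleSuspectLoginFailure(RsslReactor *pReactor,
        RsslReactorChannel *pChannel)
{
    RsslErrorInfo errorInfo;

    if (++loginFailureCount >= MAX_LOGIN_FAILURES)
    {
        /* Persistent failure: stop the reconnect cycle and alarm for
         * manual intervention. */
        rsslReactorCloseChannel(pReactor, pChannel, &errorInfo);
    }
}

/* Reset loginFailureCount to zero on a login refresh (RDM_LG_MT_REFRESH). */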

