Outage between 22:06 and 22:27 BST on the 23/10/2015

rgluk
Posts: 4
Joined: Mon Apr 13, 2015 4:21 am


Postby rgluk » Fri Oct 23, 2015 6:23 pm

We had a fairly catastrophic problem with the WURFL service, which appears to have suffered an outage between the times above.

We had a spate of this error in our logs (around 44k):

Code:

com.scientiamobile.wurflcloud.exc.UnreachableServerException: There is no possibility to answer. Probably the Cloud Service is unreachable, or you don't have required authorization to access it
        at com.scientiamobile.wurflcloud.ThrowExceptionRecoveryManager.getRecoveryID(ThrowExceptionRecoveryManager.java:37)
        at com.scientiamobile.wurflcloud.CloudClient.tryRecovery(CloudClient.java:487)
        at com.scientiamobile.wurflcloud.CloudClient.detectDevice(CloudClient.java:395)
        at com.scientiamobile.wurflcloud.CloudClientManager.getDeviceFromRequest(CloudClientManager.java:143)
        at co.uk.realistic.regal.decivedetection.api.DeviceDetectionWurflimpl.detect(xxxxx)

That in itself is fine: the service was unavailable for a short time. The problem is that the Java client took around 5 minutes to fail on each call. That brought all of our servers to a grinding halt, used up every available thread in the pools, caused a massive spike in the number of open transactions on our database, and led to other general nastiness.

The question for me is: how do I reduce the timeout on the Java client so that WURFL fails fast, is kind to our servers, and does not bring our entire operation down?

Regards,
Jason
Last edited by rgluk on Mon Oct 26, 2015 4:52 am, edited 1 time in total.

rgluk
Posts: 4
Joined: Mon Apr 13, 2015 4:21 am

Re: Outage between 22:06 and 22:27 BST on the 23/10/2015

Postby rgluk » Mon Oct 26, 2015 4:51 am

It does not appear to be possible. The code below is responsible for setting up the URLConnection used by the Java Cloud Client:

Code:

    /**
     * Create an URLConnection object using a URL string
     *
     * @param request
     * @return
     */
    private URLConnection setupUrlConnection(String request) throws IOException {
        URLConnection connection = null;

        if (proxy != null) {
            connection = new URL(request).openConnection(proxy);
        } else {
            connection = new URL(request).openConnection();
        }

        int timeout = 10000;
        logger.debug("Setting connection timeout: " + timeout);
        connection.setConnectTimeout(timeout);

        if (Constants.API_TYPE.equals(Constants.API_HTTP) && connection instanceof HttpURLConnection) {
            logger.info("Explicitly setting connection method to GET");
            ((HttpURLConnection) connection).setRequestMethod("GET");
        }

        logger.info(connection.toString());
        logger.info("Incoming connection headers count: " + reqHeaders.size());
        for (Map.Entry<String, String> entry : reqHeaders.entrySet()) {
            if (FILTERED_HEADERS.contains(entry.getKey().toLowerCase())) {
                logger.info("filtering entry: " + entry);
            } else {
                logger.info("   adding entry: " + entry);
                connection.setRequestProperty(entry.getKey(), entry.getValue());
            }
        }

        Map<String, List<String>> headers = connection.getRequestProperties();
        logger.info("Outgouing connection headers count: " + headers.size());
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            logger.info("Outgoing Header: " + entry.getKey() + " -> " + entry.getValue());
        }

        return connection;
    }

It sets a hard-coded connection timeout of 10 seconds but does not set a read timeout. The default connect and read timeouts in our version of Java are both 0, which means no timeout is applied, so once the connection is made the client can block forever waiting for a response from the other end.
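
For illustration only, this is roughly what I would expect a connection with both timeouts to look like. It is a minimal sketch using the standard java.net.URLConnection API, not the actual client code; the class name, method name and the two timeout values are my own assumptions.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class TimeoutSketch {
    // Hypothetical values chosen for the example, not taken from the WURFL client.
    static final int CONNECT_TIMEOUT_MS = 2000;
    static final int READ_TIMEOUT_MS = 3000;

    // Opens a connection object with both a connect and a read timeout configured.
    // For HTTP URLs, openConnection() does not contact the server yet.
    static URLConnection openWithTimeouts(String request) throws IOException {
        URLConnection connection = new URL(request).openConnection();
        // Limits how long establishing the TCP connection may take.
        connection.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // Limits how long a single read may block once connected;
        // a value of 0 (the default) means wait indefinitely.
        connection.setReadTimeout(READ_TIMEOUT_MS);
        return connection;
    }
}
```

With a read timeout set, a stalled response should surface as a java.net.SocketTimeoutException instead of blocking the calling thread indefinitely.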

I think that during the period indicated, the WURFL servers were accepting connections but responding very slowly (taking more than 5 minutes), hence the issue we observed.
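
Since the timeout cannot be configured in the client, one workaround I am considering is to bound the call from the outside. The sketch below is only an assumption about how that might look: it runs any blocking call on a small dedicated pool and gives up after a deadline. FailFastSketch, callWithTimeout, the pool size and the fallback value are all hypothetical names and choices of mine, and nothing here is part of the WURFL client.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class FailFastSketch {
    // Small dedicated pool (size is an arbitrary assumption) so that slow lookups
    // cannot consume the request-handling threads. Daemon threads do not block JVM exit.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "detect-worker");
        t.setDaemon(true);
        return t;
    });

    // Runs the call on the pool and returns the fallback if it fails
    // or does not finish within timeoutMs.
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future = POOL.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker if it is running; a thread blocked in a
            // socket read may not react to the interrupt.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The call itself threw, e.g. the service was unreachable.
            return fallback;
        }
    }
}
```

The detection call would be passed in as the Callable, with a default device or null as the fallback. The caveat is that a worker stuck in a socket read can stay blocked, so the pool may fill up during an outage; callers still return after the deadline, because a queued task that times out is cancelled before it runs.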

Elliotfehr

Re: Outage between 22:06 and 22:27 BST on the 23/10/2015

Postby Elliotfehr » Mon Oct 26, 2015 9:40 am

Jason,

On Friday the 23rd between 22:06 and 22:28 BST our UK region did experience a disruption in service caused by a network failure with one of our hosting providers, which caused all load balancers in this region to become unresponsive during this time.

With regard to the open connections, you are correct that the connection timeout is hard-coded to 10 seconds in the Java cloud client. I also see that you may be on an older version of the WURFL Cloud client, as the error message you were seeing has since been deprecated. In any case, our engineering team is looking into this and I will let you know as soon as I have any updates. You might have already noticed that we have also recently open-sourced our cloud clients, which can now be found on GitHub.

Please accept our sincere apologies for this outage. Maintaining availability for the WURFL Cloud service is always a major priority for us, and we will redouble our efforts to ensure the fail-over systems we have in place are working properly at all times. Thank you for using the WURFL Cloud service. Your patience and understanding are much appreciated.

Thank you,

Elliot

