NorPhone v1.2/1.3 Server Outage

On December 6, 2016, NorPhones earlier than v1.3 were permanently blocked from the network. This was due to a catastrophic change in the way Second Life's server handles communications after a region restart, which caused all existing phones to get stuck in a loop after the rolling restarts on that date. Every phone was registering itself repeatedly, which overloaded our server. Linden Lab did not seriously investigate the issue and instead attributed it to a non-reproducible glitch, but it broke practically all phones and made it impossible to register new ones.

On December 8, an update to v1.3 was issued. It fixed this bug but introduced a memory leak that caused scripts to run out of memory. On January 6, 2017, an update to v1.4 was issued, which should resolve both problems. Unfortunately, because the Second Life glitch caused the phones to become unresponsive, it is not possible to recover any configuration data from existing phones. We deeply apologize for the inconvenience this will cause our customers. We are also working on a better cloud-based solution for storing configurations in future updates and products.

Again, we sincerely apologize to our customers for the inconvenience of having to replace their phones and redo their configurations. Updates are available now: rez your existing phone to receive one. We will also be sending phones manually to all purchasers.

Below is a more detailed explanation of the bug, written earlier and sent through our group.

The access logs made it clear that phones were registering themselves repeatedly. The server would register a phone and return the correct response, but the phone would keep trying every 5 seconds. This suggested to me that phones were no longer correctly obtaining HTTP-IN URLs from the sim server. The NBS server uses these URLs to send messages to individual phones. Normally, a phone that fails to obtain a URL sends a single registration and waits until a URL can be generated again, but in this case the phones were repeatedly registering with no URL.

Every NorPhone contains three scripts: the "comm" script, the "registration" script, and the "timer" script. (The latter two have other functions as well, but those are not important here.) Every 10 seconds, the "timer" script checks whether the phone's URL has changed or otherwise become invalid. This usually happens because of a region restart (which invalidates all URLs), but it can also occasionally happen for no apparent reason, hence the check. If the "timer" script detects that the URL has changed, it signals the "registration" script to send a registration to the server with the new URL. If the phone no longer has a URL, it still sends a registration so that the server knows the phone is no longer reachable. The "registration" script sends a request to the NBS server, and the server responds to the "comm" script, which parses the response and signals to the other two scripts whether the registration succeeded.
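
The flow between the three scripts can be sketched as a small simulation. This is only an illustrative model written in Python, not the phones' actual LSL code: the class names, the "OK" response body, and the example URLs are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Server:
    """Stand-in for the NBS server: records every registration it receives."""
    registrations: List[Tuple[str, Optional[str]]] = field(default_factory=list)

    def register(self, phone_id: str, url: Optional[str]) -> str:
        self.registrations.append((phone_id, url))
        return "OK"  # hypothetical response body


@dataclass
class Phone:
    phone_id: str
    server: Server
    current_url: Optional[str] = None     # URL the region currently grants
    registered_url: Optional[str] = None  # URL the server last confirmed
    confirmed: bool = False               # has any registration been confirmed?

    def timer_tick(self) -> None:
        """'timer' script: runs on each check and looks for a URL change."""
        if not self.confirmed or self.current_url != self.registered_url:
            self.registration_send(self.current_url)

    def registration_send(self, url: Optional[str]) -> None:
        """'registration' script: sends the request to the server."""
        response = self.server.register(self.phone_id, url)
        self.comm_receive(response, url)

    def comm_receive(self, response: str, url: Optional[str]) -> None:
        """'comm' script: parses the response and signals the other scripts."""
        if response == "OK":
            self.registered_url = url
            self.confirmed = True


server = Server()
phone = Phone("phone-1", server, current_url="http://sim.example/cap/1")
phone.timer_tick()                 # first registration
phone.timer_tick()                 # URL unchanged: nothing is sent
phone.current_url = None           # region restart invalidates the URL
phone.timer_tick()                 # registers once with no URL
phone.timer_tick()                 # still no URL: nothing is sent
phone.current_url = "http://sim.example/cap/2"
phone.timer_tick()                 # registers the new URL
print(len(server.registrations))   # 3
```

In this model a healthy phone sends exactly one registration per URL change, including the single "no URL" registration after a restart.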

Usually, this works fine. The phone needs three separate scripts for several reasons (memory usage, multiple timers, etc.), so this design is the best available solution. However, after inspecting several broken phones today, it appears that the "comm" script in every phone crashed for no apparent reason immediately after the rolling restart, specifically when calling llRequestURL to obtain a new URL. Because the "comm" script has crashed, it neither receives nor parses the server's response when the "registration" script sends a registration request. So the "timer" script is pinging away every 10 seconds, expecting the "comm" script to handle the response, which it never does. Because the "timer" script never receives confirmation from the "comm" script that a response was received, it is stuck in a loop.

Hundreds if not thousands of phones exist in-world, and every one of them is sending a registration request every 10 seconds. Each request requires a good deal of processing power to check the database for existing records and match them. As more and more phones got stuck in the loop, they drew more and more bandwidth and processing power from the server. Ultimately, the server crashed entirely.
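
To make the failure mode concrete, here is a rough Python sketch of the loop described above. It is a simplified model rather than the real scripts, and the figure of 1,000 phones is an assumed example (the actual count is only known to be in the hundreds or thousands).

```python
TIMER_INTERVAL_S = 10  # the "timer" script checks every 10 seconds


def requests_sent(ticks: int, comm_crashed: bool) -> int:
    """Count the registration requests one phone sends over `ticks` checks."""
    confirmed = False
    sent = 0
    for _ in range(ticks):
        if confirmed:
            continue           # registration confirmed: nothing more to send
        sent += 1              # "registration" script sends a request
        # The server always registers the phone and replies correctly,
        # but only a running "comm" script can parse that reply.
        if not comm_crashed:
            confirmed = True   # "comm" signals success to the other scripts
    return sent


def fleet_request_rate(phones: int, interval_s: int = TIMER_INTERVAL_S) -> float:
    """Requests per second reaching the server when every phone is stuck."""
    return phones / interval_s


ticks_per_hour = 3600 // TIMER_INTERVAL_S          # 360 checks in one hour
healthy = requests_sent(ticks_per_hour, comm_crashed=False)
stuck = requests_sent(ticks_per_hour, comm_crashed=True)
print(healthy, stuck)                              # 1 360
print(fleet_request_rate(1000))                    # 100.0 (assumed fleet size)
```

Under these assumptions a healthy phone registers once, while a phone with a crashed "comm" script sends 360 requests in an hour, and a fleet of 1,000 stuck phones produces a sustained 100 registration requests per second.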

Worse still for you, all phone configurations are lost. There is no way to recover them without going through the "comm" script, and there is no way to reset the "comm" script manually or to add another script to the phone to pull the configuration out.

Addition on January 6, 2017: v1.3 phones appear to also be crashing due to a memory leak caused by the hotfix. This should be resolved with v1.4.