Hello, I’m Sam. I’ve been part of the Greenline team from the start and formally became CTO in mid-2019. The entire team and I have learned a great deal over the past two years, much of which is rarely discussed publicly. We have always been an honest and transparent company, and that will not change.
In that spirit of transparency and honesty, I wanted to share the story of a recent failure, some of the lessons we took from it, and the improvements we’ve made. We will keep improving in pursuit of building the best cannabis point of sale and inventory management system possible.
On Friday, October 25th we had a system failure: an almost three-hour period during which our system was unable to process requests efficiently.
We were alerted to the issue at around 4:40 pm Friday and immediately began investigating its scale and cause. Our worst fears were confirmed when we found we were unable to make sales. The errors pointed to a blockage in our heaviest table, Inventory Logs. Every inventory movement in Greenline is tracked to the millisecond, answering the biggest questions retailers have: who, what, when and where. Within minutes we diagnosed that the table had run out of available ID numbers for its column type, something it took two years and millions of dollars in processed sales for us to hit.
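Schema details aside, this failure mode is easy to reproduce in miniature. As a hedged illustration (the column types and the `next_id` helper here are assumptions for the sketch, not our actual schema; our ~8.3 million figure happens to match a 3-byte signed integer column), here is what "running out of available ID numbers" looks like:

```python
# Hypothetical sketch of integer ID exhaustion. The types and helper
# below are illustrative, not Greenline's actual schema.

def max_signed(bytes_wide: int) -> int:
    """Largest value a signed integer of the given byte width can hold."""
    return 2 ** (8 * bytes_wide - 1) - 1

MEDIUMINT_MAX = max_signed(3)  # 8,388,607 (~8.3 million)
INT_MAX = max_signed(4)        # 2,147,483,647 (~2.1 billion)

def next_id(current: int, limit: int) -> int:
    """Simulate an auto-incrementing ID column handing out the next value."""
    if current >= limit:
        raise OverflowError("ID column exhausted: every new insert now fails")
    return current + 1

# Day to day this limit is invisible; once hit, every insert errors at once.
assert next_id(100, MEDIUMINT_MAX) == 101
try:
    next_id(MEDIUMINT_MAX, MEDIUMINT_MAX)
except OverflowError as exc:
    print(exc)
```

The painful property is that nothing degrades gradually: the table behaves perfectly right up to the last available ID, and then every write fails.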
A rock and a hard place
Option A: Migrate the existing table to allow more logs, which would take an unknown amount of time.
Option B: Create a new table with the changes we wanted, relieve the Friday rush hour for our customers, and merge the data back later.
We went with Option B: we didn’t know how long the migration would take, and we wanted our customers back to making sales as soon as possible. We created the new table, which raised capacity from ~8.3 million entries to ~2.4 billion, renamed the old one, and swapped them in, knowing we’d merge the data later. The fix was live within about 20 minutes of the initial issue.
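The swap-then-merge pattern can be sketched in a few statements. This uses SQLite purely as a stand-in for the production database, and the table name and columns are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Stand-in for the exhausted table (name and columns are illustrative).
cur.execute("CREATE TABLE inventory_logs (id INTEGER PRIMARY KEY, note TEXT)")
cur.execute("INSERT INTO inventory_logs (note) VALUES ('pre-outage row')")

# Step 1: build the replacement table with the changes we wanted.
cur.execute("CREATE TABLE inventory_logs_new (id INTEGER PRIMARY KEY, note TEXT)")

# Step 2: swap names so live writes immediately hit the healthy table.
cur.execute("ALTER TABLE inventory_logs RENAME TO inventory_logs_old")
cur.execute("ALTER TABLE inventory_logs_new RENAME TO inventory_logs")

# Live traffic resumes against the new table right away.
cur.execute("INSERT INTO inventory_logs (note) VALUES ('post-swap row')")

# Step 3 (later, off-peak): merge the old rows back in.
cur.execute("INSERT INTO inventory_logs (note) SELECT note FROM inventory_logs_old")
print(cur.execute("SELECT COUNT(*) FROM inventory_logs").fetchone()[0])
```

The appeal of the swap is that the expensive work (copying old rows) is deferred to a quiet window, while the rename itself is near-instant, which is what got sales moving again in minutes rather than hours.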
Until… a new problem. We were able to take some payments, and then payments began crawling and hanging. Logging was working, yet our service was slowing down. Once we were back online, all of the offline payments that had built up during the outage came in en masse alongside the live sales.
Was this a server load problem? We went from running 4 instances of our service to 16 to rule out server capacity, but things were still slow.
The database was running at almost 100% CPU. Scaling database power isn’t as simple as scaling application servers and, worse, requires more downtime. The DB seemed to be stuck holding some old connections, so we restarted it (a restart takes about 20 seconds, then it starts slowly and warms up).
Ahh, payments could go through again!
Until, after around 7 minutes, they started slowing down again. What was causing this? The database was back at 100% CPU (we normally run around 15%). We restarted the database again, reasoning that we’d eventually get through the backlog of payments, and kept investigating, with half the team monitoring the service whilst the others tried to figure out what was going on with the payments table.
We knew where; then we found what. It was something we had been about to work on the coming Monday: the way we inserted payments into the database was slightly unoptimized. In day-to-day use it had never caused real problems, but it was on our performance target list. As much as we push the envelope of features our system can offer, we also keep an eye on all of our current metrics in an endless pursuit of better performance from every element of the system.
What was normally a minor inefficiency became a real problem under the sheer volume of payments arriving at the exact same time. We knew how to fix it, quickly put together an implementation, tested it, and shipped it.
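We haven’t detailed the exact change above, but the classic shape of this kind of insert fix is batching: committing each payment individually pays transaction overhead per row, whereas writing the backlog in one batched transaction pays it once. A sketch of both patterns, again using SQLite and invented names in place of our production setup:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount_cents INTEGER)")

rows = [(i,) for i in range(20_000)]  # stand-in for the payment backlog

# Unoptimized pattern: one statement and one commit per payment.
start = time.perf_counter()
for (amount,) in rows:
    conn.execute("INSERT INTO payments (amount_cents) VALUES (?)", (amount,))
    conn.commit()
per_row = time.perf_counter() - start

conn.execute("DELETE FROM payments")
conn.commit()

# Optimized pattern: batch every insert into a single transaction.
start = time.perf_counter()
with conn:  # the context manager commits once on success
    conn.executemany("INSERT INTO payments (amount_cents) VALUES (?)", rows)
batched = time.perf_counter() - start

print(f"per-row: {per_row:.3f}s  batched: {batched:.3f}s")
```

A per-row inefficiency like this stays invisible at normal volume and only bites when a flood of writes, such as a queued offline backlog, lands all at once.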
We tested, and payments went through. Not only did they go through, they went through incredibly fast: around 500% faster than before!
The database processed the entire backlog of payments and settled at lower CPU usage than usual. The service returned to normal whilst we kept a close eye on it and began working out the repair steps and possible fallout from the outage.
Over the weekend we merged our inventory logs back together and restored any log lines that had gone missing during the outage.
We now have much greater insight into our system and have put our findings to use to ensure these problems never happen again.
The easier it is to look inside your application and all of its many pieces, the quicker you can diagnose issues and recover from them. We already had some visibility, but we wanted more depth and more breadth across the system. This is an endless pursuit of ours: finding the best tools and gaining the strongest knowledge to provide the best possible service for our customers.
We implemented offline payments over a year ago, and they have served our customers incredibly well through many incidents, including internet outages, power outages (serving by candlelight, though romantic, isn’t ideal) and the minor downtime we’ve had during maintenance windows.
What we hadn’t seen was the system running in a state that was neither fully offline nor fully online, nor the user behaviour that comes with it. People had closed the app (probably due to a slow sale) and were then unable to get back to making sales.
We’ve addressed these issues and continue to make improvements that will benefit our customers not just in those troublesome times, but at all other times too.
There’s nothing quite like a fire, a challenge that requires more than one person. Seeing our team come together and support each other at the most testing of times reveals the kind of team we’ve been building at Greenline over the past two years. We’re really proud of the culture and the people we work with every day to make the best experience for our customers.
I am a firm believer that we grow through discomfort, and nothing is more uncomfortable than your system failing and your customers being left in a difficult position.
We never want incidents to be our only source of learning, as we’re always trying to grow, but we do take huge value away from an event like this.