Azure Doesn’t Care About Your Small Business
Microsoft is huge — everyone knows that. So, naturally, it’s easy for them to lose sight of the importance of customer service, especially when the customer in question represents a tiny fraction of a fraction of their revenue. Enter a small business I work for — our average bill is around $450, with the expectation that it’ll rise to around $1,000-$2,000 as we port old on-prem processes to the cloud. This is the story of our near-half-year (and still ongoing!) ordeal.
I hope you’ll see the unfortunate facts I’ve found in the past couple of years of using Azure: Their two biggest problems are huge amounts of missing or incomplete documentation and an entirely-inaccessible, practically-useless support structure for any product that doesn’t have a GitHub repo.
Before we get into the details, I want to impress upon you two set-in-stone facts:
- Azure services cannot bill CPU if they do not exist. (This seems obvious, but you’ll see why it’s important later.)
- Serverless Azure SQL databases should not charge CPU time while paused — the only charges should be storage charges. (Reference: “When the database is paused, the compute cost is zero and only storage costs are incurred.”)
Let’s get started.
Background
My company ingests data from a service provider. This provider does not have any sort of API we can use to access the data — instead, they push a .bacpac (a portable database backup) into a blob container on our end. We then have to restore that file to a database, copy tables out of it, and use those tables in our ETL. To keep costs down, we figured we’d do something like the following:
- Drop the previous day’s imported database
- Start an import for the current day’s database using the REST API
- On completion, copy the tables we need into our production database
- Downgrade the imported database to a serverless tier so that it’ll pause and not charge us CPU time (but still remain for auditability in case something fails over the course of the day)
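For the curious, here is a minimal sketch of that daily plan as data. The database naming scheme (`vendor_import_YYYYMMDD`) and the step names are purely hypothetical, invented for illustration; our real job differs in the details.

```python
from datetime import date, timedelta

def daily_plan(today, prefix="vendor_import"):
    """Return the ordered steps for one day's import as (step, database) pairs.

    The prefix and the step names are hypothetical placeholders.
    """
    yesterday = today - timedelta(days=1)
    current_db = f"{prefix}_{today:%Y%m%d}"
    previous_db = f"{prefix}_{yesterday:%Y%m%d}"
    return [
        ("drop_database", previous_db),      # drop the previous day's import
        ("start_import", current_db),        # restore today's .bacpac
        ("copy_tables", current_db),         # copy needed tables to production
        ("move_to_serverless", current_db),  # let it autopause for auditability
    ]

plan = daily_plan(date(2021, 4, 9))
```

Keeping the plan as plain data makes it easy to log which step a failed run stopped on.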
Sounds simple, right? My friends, it should’ve been…
Problem #1: An Unworkable API
When I got to experimenting with the REST API, I quickly found an interesting issue. What’s supposed to happen when you start an import is that the API should return a 202 Accepted code along with the Operation ID of the import so that you can poll it for status. What actually happens is that the REST API returns a 500 error code… but works — after about 30 minutes, the import operation will have completed successfully on the database, though from the calling process’ point of view, it could never know that, since the API reports failure.
After much investigation with the Azure team, it appears the API fails on its initial attempt and reports this failure, but that the Azure SQL service automatically retries immediately, successfully starting the import. Please understand — it took three months to figure this out. When I asked how I could submit a bug asking for the issue to be fixed, I was told to post an issue on UserVoice. As in, essentially, “Make a feature request for the service-breaking bug to be fixed, and it might get fixed… someday.” The Azure SQL tech working with me didn’t even know what team owned that portion of the REST API.
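If you are stuck calling the endpoint anyway, a caller might defensively treat the response roughly like this. This is a sketch, not how our code actually looked: the function name and the returned action labels are made up, and the header names (`Azure-AsyncOperation`, `Location`) are my assumption about where the operation URL would arrive on a well-behaved 202.

```python
def next_step(status_code, headers):
    """Decide what to do after asking the API to start an import.

    Returns an (action, url) pair. The action labels are hypothetical.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    if status_code == 202:
        # Expected path: poll the operation URL until the import finishes.
        url = lowered.get("azure-asyncoperation") or lowered.get("location")
        return ("poll_operation", url)
    if status_code >= 500:
        # Observed path: the API reports failure, but the import may still
        # have started. Do not retry blindly; check the database later.
        return ("verify_database_later", None)
    # Anything else (e.g. 4xx) is treated as a genuine failure.
    return ("fail", None)
```

The point of the middle branch is that a 500 here does not mean "nothing happened", so an automatic retry could kick off a second import against the same database.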
No biggie. There are ways around the REST API (turns out PowerShell/az is also broken, as it seems to call the REST API on the backend, but you can still use sqlpackage installed on your own VM, which ended up being our solution).
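A rough sketch of what the sqlpackage route can look like when driven from Python. The server, database, file, and user names below are placeholders, and I'm assuming SQL authentication and that `sqlpackage` is on the PATH of the VM; adjust for your own setup.

```python
import subprocess  # only needed if you actually run the command

def build_import_command(bacpac_path, server, database, user, password):
    """Build the argument list for a sqlpackage .bacpac import.

    All values passed in are placeholders supplied by the caller.
    """
    return [
        "sqlpackage",
        "/Action:Import",
        f"/SourceFile:{bacpac_path}",
        f"/TargetServerName:{server}",
        f"/TargetDatabaseName:{database}",
        f"/TargetUser:{user}",
        f"/TargetPassword:{password}",
    ]

cmd = build_import_command(
    "daily.bacpac", "example.database.windows.net", "Mephisto", "etl_user", "secret"
)
# To actually run it (not executed here):
# subprocess.run(cmd, check=True)
```

Unlike the REST call, the process exit code tells you whether the import really succeeded, which is all we wanted in the first place.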
No biggie, right? Frustrating that there’s no way to report a bug with this particular piece of the REST API, but at least there was a way around it. But oh ho ho, there is more. Early on in the testing process for the REST API, we ran into a significantly more severe bug, one that’s caused a nonexistent process to charge us CPU continually for almost half a year with no end in sight: Thousands of dollars in incorrect billings. Strap in; it gets weird.
Problem #2: Serverless Azure SQL Can Break… Badly
The day is April 9, a Friday. I’ve just finished a multi-hour session of infuriating debugging on this process: Attempt the REST call, receive an error resulting in a successful import (what an oxymoron), delete the imported database, try again with different parameters. It’s just what you have to do when there are strangely-missing pieces of documentation for the endpoint. At the end of this session, I deleted the database — let’s call it “Mephisto”. I needed the weekend to clear my head and think about what might be going wrong — at this point, I didn’t know the cause of the error stated in the previous section.
That weekend, on Sunday, I got a budget alert. Logging in, I found that Mephisto had charged us for CPU continually since Friday afternoon, something he shouldn’t be able to do, since not only was he serverless, he was deleted. I recreated him, completely new, on Monday. What was really weird is that, upon recreating him, there was an import process running and stuck at 34%. Essentially, somehow there was a hung import running against a nonexistent database and charging us for database CPU. I tried cancelling this operation through the Azure Portal. No dice. I tried with the CLI/PowerShell. No dice. I tried with the REST API. No dice. It took through April 14 to get the operation cancelled; not even Azure Support could help.
At this point, I was ready to write off the couple hundred dollars he had charged us… until I got another budget alert two days later. I checked back in on Mephisto. He was paused (being serverless, he autopauses after an hour of inactivity), but he was still charging us CPU. As I stated at the beginning of the article, this should be impossible. I filed a billing dispute and left Mephisto alive as evidence of his misbehavior. See below for a screenshot of the CPU billing, right next to the tile showing Mephisto’s status.
The first response from billing was hilarious (direct quote):
Thank you very much for your time while we investigated this.
I would like to let you know that after investigating I have been able to confirm that your database is a Paas service and these services are never stopped, this is the reason you keep getting charges even though the database is stopped https://azure.microsoft.com/en-us/overview/what-is-paas/
Keep in mind that the amount of compute billed is the maximum of CPU used and memory used each second. If the amount of CPU used and memory used is less than the minimum amount provisioned for each, then the provisioned amount is billed. In order to compare CPU with memory for billing purposes, memory is normalized into units of vCores by rescaling the amount of memory in GB by 3 GB per vCore.
In this case I am happy to provide the following link where you can undertand more properly how this service works Serverless compute tier — Azure SQL Database | Microsoft Docs
Now, anyone who knows how Serverless services work should see how stupid this is, but let me impress upon you exactly how stupid this is. The link he provided at the end contains this quote: “When the database is paused, the compute cost is zero and only storage costs are incurred.” It was at this point that my soul deflated. I knew it was going to be hard from here on out.
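To make the contradiction concrete, here is my reading of the billing rule from the rep's own message, combined with the quoted line from the docs. This is a sketch under my interpretation, not Microsoft's billing code; the function and parameter names are mine, and the sample numbers are made up.

```python
def billed_vcores(cpu_vcores_used, memory_gb_used, min_vcores, min_memory_gb,
                  paused=False):
    """Approximate the vCores billed for one second of serverless compute."""
    if paused:
        # "When the database is paused, the compute cost is zero."
        return 0.0
    # Usage below the provisioned minimum is billed at the minimum.
    cpu = max(cpu_vcores_used, min_vcores)
    memory_gb = max(memory_gb_used, min_memory_gb)
    # Memory is normalized to vCores at 3 GB per vCore, then the larger wins.
    return max(cpu, memory_gb / 3.0)

busy = billed_vcores(0.2, 6.0, min_vcores=0.5, min_memory_gb=2.1)
idle_but_paused = billed_vcores(1.0, 1.0, min_vcores=0.5, min_memory_gb=2.1,
                                paused=True)
```

Everything the rep quoted applies only to the branch where the database is running. A paused database never reaches it.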
Over the next month or so, we made almost no progress. Finally, I got tired of waiting: I had the logs, screenshots, and ticket documentation needed to prove what was happening, so I deleted Mephisto in the hopes that he’d stop charging us. At least that way we were only dealing with a billing dispute. The day of deletion was May 8. On May 15, Mephisto was still charging us for CPU usage, dashing my hopes for resolution.
It is now September 1, 2021. We are still being charged for CPU every day by this database. The only resolution any rep has come up with is to delete and recreate the entire Subscription, which, for obvious reasons, isn’t possible. I’ve been handed off to three different Support reps now, with probably a dozen other people “consulting” every once in a while. Every week, I get a “next week” resolution time. We’re now thousands of dollars out on this with no refund in sight. The ticket has been “escalated” more than a dozen times, “higher than I’ve ever seen a ticket escalated”, a support rep claims. Sure, buddy.
Conclusion
If you’re a small business scoping out cloud providers, take this as a dire warning: Azure services work great when they work, but God help you if you need support — they don’t seem to give a rat’s ass about their small business support cases.
If you work in the Azure organization and this story shocks you, please get in touch. I’d love to resolve the issue without consulting our legal staff. I know that the individual humans over at Azure don’t want to offer experiences like this, but something about the corporate structure seems to be preventing any sort of progress.
Note: I’ll keep this article up-to-date with any more hilariously-dumb responses to the support requests or FAQs from the comments.
Edit 2022-01-25:
This was finally resolved back in November. As I figured, it was hilariously easy to fix — it’s even a known issue for the software engineers working in the Azure SQL area. Unfortunately, none of my support reps knew where that department was, and a random GitHub bug I filed regarding this issue was what finally got the attention of the right person.
In retrospect, this has reinforced one of my biggest complaints about the Azure SQL/SQL Server stack: There is little to no access to people who actually know how to fix and address bugs within the space. It’s the polar opposite of open source projects, where getting a problem looked at is as easy as filing a bug report. Here, you don’t know where to file a bug report, and if there is a place to do so, it’s not monitored.