Zipi Improves DevOps to Fix Performance Bottlenecks and Optimize Google Cloud Application
How we guided Zipi in improving their DevOps practices to remove bugs and improve latency/scaling performance on their Google Cloud App.
February 25, 2021 | Cloud-Based App Dev
Zipi is a real estate back-office and accounting platform that simplifies the entire transaction lifecycle. Zipi’s mission is to improve the lives of people through real estate experiences. By building a comprehensive platform for real estate professionals and their customers, Zipi now serves thousands of these users every day. The talented team at Zipi continually strives to identify additional experience opportunities and add value globally.
Google Cloud is the core of Zipi’s architecture and is responsible for a wide range of functions. However, Zipi’s team was facing numerous issues, including severe latency, autoscaling errors, and gaps in monitoring.
To ensure an uninterrupted and smooth experience for their users, Zipi wanted a Google Cloud DevOps expert to help them resolve their issues. They partnered with D3V Tech to identify and understand their software issues, develop long-term solutions, and introduce mechanisms to prevent similar problems from arising in the future.
“D3V had a genuine interest in communicating and helping our team identify challenges. Moreover, they truly went above and beyond our expectations - they were passionate and committed to the project.”
Zipi had chosen Google Cloud as the core of their architecture to build scalable, resilient, and agile microservices. In order to ensure a smooth development pipeline, Zipi engineers had also set up Cloud Monitoring, Cloud Logging, Sentry, Cloud Source Repos, Cloud Trace, CI/CD pipelines, and more.
However, their engineers began seeing error reports, and end users were experiencing a number of issues, such as:
- High latency and 502 timeout errors
- Timeout errors when users attempted to download certain PDFs
- Lack of Trace Details and other setup problems with Cloud Trace
- Issues with long requests
- Parsing error when autoscaling starts
“We were convinced D3V was the right candidate to deliver this project. Unlike other options, they had clearly explained to us about the capabilities of Google Cloud Operations Suite and its rich collection of APM tools that we were not aware of ourselves.”
Zipi not only wanted to fix these issues and get the user experience back to normal as quickly as possible, but also needed to ensure that similar issues would not arise in the future.
To achieve this, D3V DevOps experts set two goals. The first was to identify and resolve Zipi’s issues, and the second was to establish a mechanism to improve monitoring and recovery options.
Before doing any work on their infrastructure, the D3V team had several detailed conversations and meetings with Zipi in order to help understand some of their problems as well as answer questions their IT team had.
After that, the D3V team began with an infrastructure analysis to find the cause of the scaling and latency issues, then set up monitoring dashboards to further narrow down the causes of the performance bottlenecks before optimizing with new parameters.
The dashboards D3V engineers set up visualized a large amount of data, which we used to narrow down the cause of the problems.
The dashboards contained the following information:
- Chart showing the number of loading requests each day
- Existing response latency charts
- Existing memory/CPU utilization charts
- New chart to plot the average response time for each app service
- New chart to plot the average response time per endpoint URL per service over a 6 hour time window
These dashboards helped our engineers narrow down the problem to a few URLs that were promptly rectified.
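The per-endpoint chart described above comes down to a grouped average over a time window. The following is a minimal sketch of that aggregation in plain Python, not the Cloud Monitoring configuration D3V actually used; the record fields (`service`, `url`, `latency_ms`, `timestamp`) and both function names are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def average_latency_by_endpoint(records, now, window=timedelta(hours=6)):
    """Return {(service, url): mean latency in ms} for records inside the window.

    `records` is an iterable of dicts with hypothetical keys:
    "service", "url", "latency_ms", and a timezone-aware "timestamp".
    """
    cutoff = now - window
    totals = defaultdict(lambda: [0.0, 0])  # (service, url) -> [sum_ms, count]
    for rec in records:
        # Keep only requests that fall inside the 6-hour (by default) window.
        if cutoff <= rec["timestamp"] <= now:
            bucket = totals[(rec["service"], rec["url"])]
            bucket[0] += rec["latency_ms"]
            bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def slowest_endpoints(averages, top_n=3):
    """Return the top_n (key, average) pairs, slowest first."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Sorting the averages this way is what surfaces a short list of slow URLs, which mirrors how the dashboards narrowed the problem down to a few endpoints.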
D3V also applied whitebox monitoring techniques to the services and set up distributed tracing with Cloud Trace, the native APM tool in GCP, to measure the latency of HTTP requests.
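Whitebox monitoring means the application reports its own internals, in this case how long each HTTP request takes. The sketch below illustrates the idea with a standard-library WSGI middleware; it is a stand-in for illustration only and is not the Cloud Trace setup used in the project. The class name `LatencyMiddleware` and the in-memory `samples` store are assumptions; a real deployment would export these timings as trace spans instead.

```python
import time


class LatencyMiddleware:
    """WSGI middleware that records how long each request takes, per path.

    Hypothetical illustration: timings are kept in memory rather than
    exported to a tracing backend.
    """

    def __init__(self, app, clock=time.perf_counter):
        self.app = app
        self.clock = clock  # injectable so the timing can be tested
        self.samples = {}   # path -> list of durations in seconds

    def __call__(self, environ, start_response):
        start = self.clock()
        try:
            # Note: this times the call into the app, not the streaming
            # of the response body afterwards.
            return self.app(environ, start_response)
        finally:
            duration = self.clock() - start
            path = environ.get("PATH_INFO", "")
            self.samples.setdefault(path, []).append(duration)
```

Wrapping an existing WSGI application as `app = LatencyMiddleware(app)` is enough to start collecting per-path durations without touching the request handlers.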
Furthermore, to ensure autoscaling capabilities worked as intended, D3V’s engineers optimized the application to handle cold starts by following Site Reliability Engineering (SRE) practices such as:
- Lazy loading of app dependencies
- Use of global variables to reuse objects across requests
- Fine-tuning parameters
- Adding a warm up handler
- Setting up distributed traces in Cloud Trace using OpenTelemetry auto-instrumentation and Zipkin
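Three of the cold-start practices in this list (lazy loading, global variables, and a warm-up handler) fit together naturally. The following is a minimal sketch, assuming a Python service; the function names are hypothetical, and the standard-library `decimal` module stands in for whatever heavy dependency a real service would defer. How the warm-up handler is routed depends on the platform and framework, so that wiring is left out.

```python
import importlib

# Module-level global: stays populated for as long as the instance is warm,
# so later requests skip the import cost entirely.
_heavy_module = None


def get_heavy_module():
    """Lazily import the dependency on first use and cache it in a global."""
    global _heavy_module
    if _heavy_module is None:
        # Stand-in for an expensive dependency (assumption for illustration).
        _heavy_module = importlib.import_module("decimal")
    return _heavy_module


def warmup_handler():
    """Warm-up endpoint: pay the loading cost before real traffic arrives.

    Returns a (body, status) pair; the return shape is an assumption.
    """
    get_heavy_module()
    return ("", 200)
```

With this arrangement, a new instance that receives a warm-up request first has the dependency loaded before the first user request, while instances that skip warm-up still load it only when a request actually needs it.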
“In the end, there were no latency issues after the project was completed and the team had several dashboards to help monitor attributes associated with our applications.”
By the end of the project, Zipi engineers had access to a number of powerful Cloud Monitoring dashboards they can use to monitor application health and performance and identify problems before they affect customer experience.
More importantly, Zipi engineers were able to better utilize the services available on the Google Cloud ecosystem and had a stronger understanding of best practices that will undoubtedly come in handy as their applications continue to grow and scale.