Unveiling AsyncIO’s Performance Secrets: Solving Latency Issues in Rights Manager’s Migration Journey https://www.orfium.com/engineering/unveiling-asyncios-performance-secrets-solving-latency-issues-in-rights-managers-migration-journey/ Fri, 01 Mar 2024 12:08:37 +0000

đŸ§ŸIntroduction

One of the latest initiatives undertaken by the Rights Management team, was the migration of a legacy Django application to a more maintainable and scalable platform. The legacy application was reaching its limits and there was a real risk of missing revenue because the system could not handle any more clients.

To address this challenge, we developed a new platform that could follow the “deep pocket strategy.”

What does that mean for our team? Essentially, when the number of transactions we process spikes, we deploy as many resources needed in order to keep our SLAs.

With the end goal of building a platform capable of scaling, we decided to go with native cloud architecture and AsyncIO technologies. The first stress tests surprised (and baffled!) us,  because the system didn’t perform as expected even for a relatively small number of transactions per hour.

This article covers our journey towards understanding the nuances of our changing architecture and the valuable lessons learned throughout the process. In the end (spoiler alert!) the solution was surprisingly simple, even though it took us on a profound learning curve. We hope our story will provide valuable insights for newcomers venturing into the AsyncIO world!

📖Background

Orfium’s Catalog Synchronization service uses our in-house application called Rights Cloud, which is tasked with managing music catalogs. Additionally, there’s another system, called Rights Manager, which handles synchronizing many individual catalogs with user-generated content (UGC) platforms such as YouTube.

Every day, the Rights Manager system ingests Rights Cloud messages, which detail changes to a specific client’s composition metadata and ownership shares. For this, we implemented the Fan-Out Pattern: Rights Cloud sends each message to a dedicated topic (SNS), and Rights Manager copies the message to its own queue (SQS). For every message that Rights Manager ingests successfully, two things happen:

  1. The Rights Cloud composition is imported into the Rights Manager system. The responsible API is the Import Rights Cloud Composition API.
  2. After a successful Rights Cloud ingestion, a new action is created (delivery, update, relinquish). The API responsible for handling these actions is the Identify Actions API.

Based on this ingestion, the Rights Manager system then creates synchronization actions, such as new deliveries, ownership-share updates, or relinquishments.
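For illustration only, here is a minimal boto3 sketch of that fan-out flow; the topic ARN, queue name, and payload are made up, and it assumes raw message delivery is enabled on the SNS-to-SQS subscription.

import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Rights Cloud side: publish a composition-change message to the shared topic.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:rights-cloud-compositions",
    Message=json.dumps({"composition_id": "C-42", "change": "ownership_update"}),
)

# Rights Manager side: drain its own subscribed queue and feed the two APIs.
queue_url = sqs.get_queue_url(QueueName="rights-manager-compositions")["QueueUrl"]
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)
for message in response.get("Messages", []):
    payload = json.loads(message["Body"])
    # ... call the Import Rights Cloud Composition API, then the Identify Actions API ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])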

Simplified Cloud Diagram of Rights Manager

đŸ˜”The Problem

During our summer sprint, we rolled out a major release for our Rights Manager platform. After the release, we noticed some delays for the aforementioned APIs.

We noticed (see the diagrams below) latency spikes of 10 to 30 seconds for the aforementioned APIs during the last two weeks of August. Even worse, some requests were dropped because of the 30-second web server timeout. Overall, the P90 latency was over 2 seconds. This was alarming, especially considering that the team had designed the system for a maximum latency of under 1 second.

We immediately started investigating this behavior because scalability is one of the critical aspects of the new platform.

First, we examined the overall requests for the second half of August.

At first glance, two metrics caught our attention:

  1. Alarming metric #1: The system could not handle more than ~78K requests in two weeks, although it was designed to serve up to 1 million requests per day.
  2. Alarming metric #2: The actions/identify API was called 40K more times than the import/rights_cloud API. From a business perspective, actions/identify should have been called at most as many times as import/rights_cloud. This indicated that actions/identify was failing, leading Rights Manager to constantly retry the requests.

Total number of requests between 16-31 of August.

Based on the diagrams, the system was so slow to respond that we had the impression it was under siege!

The following sections describe how we solved the performance issue.

Spoiler alert: not a single line of code changed!

📐Understand the bottlenecks – Dive into metrics

We started digging into the Datadog log traces. We noticed that there were two patterns associated with the slow requests.

Dissecting Request Pattern #1

First, we examined the slowest request (32.1 seconds). We discovered that sending a message to an SQS queue took 31.7 seconds, a duration far too long for our liking. For context, SQS is managed by AWS, and it was hard to believe that a seemingly straightforward service would need 30 seconds to reply under any load.

Examining slow request #1: Sending a message to an AWS SQS took 31.7 seconds

Dissecting Request Pattern #2

We examined another slow request (15.9 seconds) and the results were completely different. This time, we discovered a slow response from the database. The API call needed ~3 seconds to connect to the database, and a SELECT statement on the Compositions table needed ~4 seconds. This troubled us because the SELECT query uses indexes and cannot be optimized further. Additionally, 3 seconds just to obtain a database connection is a long time.

Examining slow request #2: The database connection took 2.98 seconds and an optimized SQL SELECT statement took 7.13 seconds.

Examining slow request #2: Dissecting the request, we found that the INSERT statement was far more efficient than the SELECT statement. Also, the database connection and SELECT took around 64% of the total request time.

Dive further into the database metrics

Based on the previous results, we started digging further into the infrastructure components.

Unfortunately, the Amazon metrics for SQS didn’t provide the insight needed to understand why it was taking 30 seconds to publish a message to the queue.

So, we shifted our focus to the database metrics. Below is the diagram from AWS Database monitor.

The diagram showed us that no load or latency existed on the database. The maximum load of the database was around 30%, which is relatively small. So the database connection should have been instant.

Our next move was to see if there were any slow queries. The below diagram shows the most “expensive” queries.

Once again, the result surprised us – No database load or slow query was detected from the AWS Performance Insights.

The most resource-intensive query was the autovacuum, at 0.18 of the total load, which is perfectly normal. The maximum average latency was 254 ms, for an INSERT statement, once again reflecting perfectly normal behavior.

The most expensive queries that are contributing to DB load

According to the AWS documentation, by default the Top SQL tab shows the 25 queries that contribute the most to database load. To help tune queries, developers can analyze information such as the query text and SQL statistics.

So at that moment, we realized that the database metrics from Datadog and AWS Performance Insights were different.

The metric that solved the mystery

We suspected that something was amiss with the metrics, so we dug deeper into the system’s status when the delays cropped up. Eventually, we pinpointed a pattern: the delays consistently occurred at the start of a request batch. But here’s the twist – as time went on, the system seemed to bounce back and the delays started to taper off.

The below diagram shows that when the Rights Cloud started to send a new batch of messages around 10:07 am, the Rights Manager APIs needed more than 10 seconds to process the message in some cases.

After a while, at around 10:10 am,  there was a drop in the P90 from 10 seconds to 5 seconds. Then, by 10:15 am, the P90 plummeted further to just 1 second.

Something peculiar was afoot – instead of the usual expectation of system performance degrading over time due to heavy loads, our system was actually recovering for the same message load. 

At this point, we decided to take a snapshot of the system load. And there it was – we finally made the connection!

Eureka! The delays vanish when the number of ECS (FastAPI) instances increases.

We noticed that there was a direct connection between the number of API requests and the number of ECS instances. Once the one and only ECS instance could no longer serve the requests, auto-scaling kicked in and new ECS instances were spawned. Every ECS instance needs around 3 minutes to go live. Once the new instances were live, the delays decreased dramatically.

We backed up our conclusion by creating a new Datadog heatmap. The diagram below shows the aggregated duration of each Import Rights Cloud Composition request. It is clear that when 2 new FastAPI instances were spawned at 10:10 am, the delays decreased from 10 seconds to 3 seconds. At 10:15 am there were 5 FastAPI instances and the responses dropped to 2 seconds. Around 10:30 the system had spawned 10 instances and all response durations were around 500 ms.

At the same time, the database load was stable and between 20-30%.

That was the lightbulb moment when we realized that the delays weren’t actually database or SQS related. It was the async architecture that caused the significant delays! When the ECS instance was operating at 100% CPU and a function was getting data from the database, the SELECT query itself took milliseconds, but the function was (a)waiting for the async scheduler to resume it. In reality, the function had nothing left to do but return the results.

This explains why some functions took so much time and the metrics didn’t make any sense. Despite the SQS responding in milliseconds, the functions were taking 30 seconds because there simply wasn’t enough CPU capacity to resume their execution.

The spawning of new instances completely resolved the problem, because there was always enough CPU for the async operations to resume.

đŸȘ Root cause

Async web servers in Python excel in performance thanks to their non-blocking request handling. This approach lets them seamlessly accept incoming requests, accommodating as many as the host’s resources allow. However, unlike their synchronous counterparts, async servers don’t reject new requests by default. Consequently, under sustained incoming connections, the system may deplete all available resources. Although it’s possible to set a limit on the maximum number of concurrent requests, determining the most efficient threshold often requires trial and error.

To broaden our understanding, we delved into the async literature and came across the Cooperative Multitasking in CircuitPython with asyncio.

Cooperative multitasking is a programming style in which multiple tasks take turns running. Each task runs until it either encounters a waiting condition or decides it has run long enough, at which point it lets another task take its turn.

In cooperative multitasking, each task has to decide when to let other tasks take over, which is why it’s called “cooperative.” So, if a task isn’t managed well, it could hog all the resources. This is different from preemptive multitasking, where tasks get interrupted without asking to let other tasks run. Threads and processes are examples of preemptive multitasking.

Cooperative multitasking doesn’t mean that two tasks run at the same time in parallel. Instead, they run concurrently, meaning their executions are interleaved. This means that more than one task can be active at any given time.

In cooperative multitasking, tasks are managed by a scheduler. Only one task runs at a time. When a task decides to wait and gives up control, the scheduler picks another task that’s ready to go. It’s fair in its selection, giving every ready task a shot at running. The scheduler basically runs an event loop, repeating this process again and again for all the tasks assigned to it.
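A toy sketch (our own illustration, not code from Rights Manager) of how a coroutine that is ready to resume can still look “slow” when the event loop has no spare CPU:

import asyncio
import time

async def fast_io():
    # Simulates a quick "database" call: the awaited operation finishes in ~5 ms.
    start = time.perf_counter()
    await asyncio.sleep(0.005)
    print(f"fast_io resumed after {time.perf_counter() - start:.2f}s")

async def cpu_hog():
    # A task that never yields back to the event loop while it works.
    deadline = time.time() + 2
    while time.time() < deadline:
        pass  # busy loop standing in for CPU-bound work

async def main():
    # fast_io is ready to resume after 5 ms, but it reports ~2 s,
    # because the scheduler cannot run it until cpu_hog finally yields.
    await asyncio.gather(fast_io(), cpu_hog())

asyncio.run(main())

This is exactly the pattern in our traces: the SQS and SELECT calls were fast, but the coroutine waiting on them resumed late.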

đŸȘ›Solution Approach #1

Now that we’ve identified the root cause, we’ve taken the necessary steps to address it effectively.

  1. Implement a more aggressive scale-out. Originally set at 50%, the CPU scale-out threshold is now 30%, giving a more responsive approach to scaling. Since Amazon needs about 3 minutes to spawn a new instance and CPU previously hit 100% roughly a minute after the old threshold was crossed, that left a 2-minute window of strain; triggering at 30% starts the spawn earlier and shrinks that window.
  2. Implement a more defensive scale-in. The rule for terminating ECS instances is CPU-based: with a threshold of 40% over a 2-minute interval, instances operating below 20% for the same duration are terminated by the auto-scaler.
  3. Change the load balancer algorithm. With the initial round-robin strategy, CPU usage varied significantly across instances, with some reaching 100% while others stayed at 60%. We therefore switched to the “least_outstanding_requests” algorithm, which routes each request to the instance with the fewest in-flight requests, spreading load (and therefore CPU usage) more evenly across the system.

By aggressively scaling out additional ECS instances, we now successfully maintain a P99 latency of under 1 second, except for the first minute of an API request flood.

đŸȘ›đŸȘ›Solution Approach #2

While the initial approach yielded results, we recognized opportunities for improvement. We proceeded with the following strategy:

  1. Maintain two instances at all times, allowing the system to accommodate sudden surges in API calls.
  2. Cap the maximum number of ECS instances at 50 (both limits appear in the sketch after this list).
  3. Implement fine-grained logging in Datadog to differentiate between database call durations and scheduler resumption times.
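A minimal boto3 sketch of what the combined auto-scaling setup might look like, expressed here as a target-tracking policy. The cluster and service names are hypothetical; the 30% CPU target, floor of 2 instances, and cap of 50 mirror the settings described above, while our real configuration lives in infrastructure code rather than application Python.

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/rights-manager-cluster/rights-manager-api"  # hypothetical names

# Keep at least 2 tasks warm and never exceed 50.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=50,
)

# Scale out well before CPU saturates, and scale in more cautiously.
autoscaling.put_scaling_policy(
    PolicyName="rights-manager-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 30.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)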

đŸ«Lessons learned

  1. Understanding the implications of async architecture is crucial when monitoring system performance. During times of heavy load, processes can pause without consuming resources, which is a key benefit to note.
  2. In contrast to a non-preemptive scheduler, which switches tasks only when certain conditions are met, preemptive scheduling allows a task to be interrupted mid-execution so other tasks can be prioritized. Processes and threads are typically managed by the operating system with a preemptive scheduler. Leveraging preemptive scheduling is a promising way to address our current challenges.
  3. We realized that while vCPU instances are cheap, their processing power is relatively low.
  4. ECS processing power may differ between instances, while EC2 instances provide stable processing power.
  5. A FastAPI server hosted on a 1 vCPU instance can handle around 10K requests before maxing out CPU consumption.
  6. Completing a task in 10 hours with 1 vCPU costs about the same as completing it in 1 hour with 10 vCPUs; from a business perspective, though, the substantial difference lies in finishing the job 90% faster.
  7. A r6g.large instance typically supports approximately 1,000 database connections.

đŸ§ŸConclusion

Prior familiarity with async architecture would have streamlined our approach from the start but this investigation and the results proved to be more than rewarding!

AsyncIO web servers demonstrate exceptional performance, efficiently handling concurrent requests. However, when the system is under heavy load, async workloads can deplete CPU and memory resources in the blink of an eye, a common challenge readily addressed by serverless architecture. With processing costs as low as €30/month for a 1 vCPU and 2 GB memory Fargate task, implementing a proactive scaling-out strategy aligns perfectly with business objectives.

Dominate YouTube Traffic: Conquer 10x More Traffic with YouTube’s CMS https://www.orfium.com/engineering/dominate-youtube-traffic-conquer-10x-more-traffic-with-youtubes-cms/ Thu, 15 Jun 2023 07:06:13 +0000

At Orfium, our mission is to unlock the full potential of content. That’s why we created Sync-Tracker, a groundbreaking solution for managing licensed and unlicensed music on YouTube. Whether you’re a music production company, a content creator, or simply someone uploading content on YouTube, Sync-Tracker is designed to cater to your needs. With Sync-Tracker, you can easily handle licensed and unlicensed music, while providing music production companies a platform to sell licenses to content creators. Additionally, users can enforce their licenses through YouTube’s Content ID, making Sync-Tracker a powerful tool for protecting your rights and ensuring compliance.

SyncTracker is a powerful licensing software that keeps a close eye on YouTube claims generated after the upload of a video, regardless of its privacy settings. Leveraging advanced audio and melody-matching algorithms, SyncTracker analyzes the content owner’s extensive catalog of assets to identify any potential matches. For more information on YouTube’s claims and policies, please refer to the detailed documentation available here

Cracking the Code: Insights into Efficient Claim Tracking đŸ§‘â€đŸ’»

To track claims efficiently, our technical approach involves leveraging the ClaimSearch.list API call. This API call allows us to retrieve claims based on specific statuses in a paginated manner. For instance, if there are 100 claims on YouTube, the initial request may only fetch 20-30 claims, accompanied by a nextPageToken. To access claims from subsequent pages, additional requests with the nextPageToken are necessary. This iterative process of making sequential ClaimSearch.list API calls enables us to gather all the claims seamlessly. Below, you’ll find a high-level diagram outlining the flow of tracking and processing claims, reflecting our recent advancements in this area.
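As a rough illustration of the pagination loop itself, here is a sketch of following nextPageToken until the last page; fetch_claim_page is a hypothetical wrapper around the ClaimSearch.list call, not a real client method.

from typing import Callable, Iterator, Optional

def iter_claims(fetch_claim_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every claim by requesting page after page until nextPageToken disappears."""
    page_token = None
    while True:
        page = fetch_claim_page(page_token)   # one ClaimSearch.list request
        for claim in page.get("items", []):
            yield claim
        page_token = page.get("nextPageToken")
        if not page_token:                    # no token means this was the last page
            break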

What’s the daily maximum number of claims we can track from YouTube content ID? 

Determining the maximum number of claims we can track from YouTube’s Content ID per day involves calculating the theoretical upper limit based on response time and claim quantity. Assuming an average of 0.5 seconds for each request/response and approximately 20 claims obtained per call, we can estimate a capacity of around 3.5 million claims per day. Previously, we faced limitations in approaching this limit due to the sequential processing of each claim batch. 
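The back-of-the-envelope arithmetic behind that estimate, under the stated assumptions:

seconds_per_day = 24 * 60 * 60              # 86,400
requests_per_day = seconds_per_day / 0.5    # one sequential request every ~0.5 s
claims_per_day = requests_per_day * 20      # ~20 claims returned per page
print(f"{claims_per_day:,.0f}")             # 3,456,000 -> roughly 3.5 million claims/day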

The Chronicle of the Ages: Unveiling the History of Traffic Increase⏳

The solution we implemented until the end of 2022 served us flawlessly, catering to the manageable volume of claims we handled at the time. By the close of 2022, Orfium’s YouTube snapshot encompassed approximately 250,000 assets, with YouTube generating an average of around 30,000 claims per day for these assets.

However, the landscape changed in early 2023, when Orfium secured significant contracts with large Music Production Companies (MPCs). This development projected a substantial 5.5x increase in Content ID (CiD) size, to a total of 1.4 million assets. With this expansion, claim traffic was also expected to grow. Estimating the precise increase proved challenging, as it does not have a linear relationship with the number of assets: the quantity of claims generated depends on the popularity of each song, not just the sheer number of songs. We encountered scenarios where a vast catalog generated only modest traffic because its songs weren’t popular, and vice versa. Consequently, we anticipated claim traffic growth of at least 3-4x. However, our old system could only handle a 2x increase in traffic 💣, prompting the need for upgrades and improvements.

Success Blueprint: Strategizing for Achievement and Risk Mitigation 🚧

As our development team identified the impending performance bottleneck, we immediately recognized the need for proactive measures to prevent system strain and safeguard our new contracts. The urgency of the situation was acknowledged by the product Business Unit (BU), prompting swift action planning.

We initiated a series of brainstorming meetings to explore potential short-term and long-term solutions. With the adage “prevention is better than cure” guiding our approach, we devised a backlog of mitigation strategies to address the increased risk until the customer deals were officially signed. We were given a demanding timeframe of 4-6 weeks to complete all activities, including brainstorming meetings, analysis, design, implementation, testing, production delivery, and monitoring. The challenge lay in the uncertainty surrounding the effectiveness of the chosen solution.

After several productive brainstorming sessions, we compiled a comprehensive list of potential alternatives. These ranged from pure engineering decisions to YouTube-side configuration ideas, ensuring that every possible avenue was explored to ensure the safety and security of the new contracts. 

Unleashing the Power of Solutions: Overcoming Traffic Challenges and Streamlining Claims Tracking đŸ§©

With the primary goal of reducing the time spent in claims tracking, we embarked on a journey to introduce innovative solutions. One such solution involved the implementation of a new metric in Datadog, designed to detect any missing claims. This introduced a proactive approach to monitor if any claims were not being tracked by our application and enabled us to identify and address related issues promptly. By leveraging this powerful tool, we gained valuable insights into potential issues such as increased traffic and timeouts, empowering us to take swift action and maintain optimal performance.
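A minimal sketch of how such a metric could be emitted with the Datadog Python client; the metric name, tags, and the way “missing” is computed are illustrative assumptions, not our production code.

from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

def report_missing_claims(expected_claim_ids: set, tracked_claim_ids: set) -> None:
    # Claims YouTube reported that never showed up in our own records count as "missing".
    missing = len(expected_claim_ids - tracked_claim_ids)
    statsd.gauge("synctracker.claims.missing", missing, tags=["service:synctracker"])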

The solutions we discussed to enhance our claims tracking process included:

  • Exclude claims from processing: Through our analysis, we discovered that approximately 70% of the claims tracked from YouTube were generated for private videos that were unlikely to be made public. Recognizing this, we proposed a solution to exclude these videos from our main processing unit and manage them separately. By doing so, we could reduce the volume of claims processed within the same timeframe by approximately 70%. This exclusion resulted in a significant 20% reduction in the end-to-end (E2E) processing time.
  • Asynchronous claim process flow: To address the long-term scalability of our system, we aimed to redesign SyncTracker’s main flow to support asynchronous operation. This change showed promising potential, as it would allow us to process each claim batch from YouTube concurrently, enabling horizontal scaling through multiple workers. To ensure the sufficiency of this solution, we decided to implement a feature flag. This flag gives us the flexibility to switch dynamically between asynchronous and synchronous execution, preserving the system’s existing operation and letting us quickly address any potential issues that may arise.

Solving the Puzzle: Addressing Production System Issues through Troubleshooting 📚

In the world of production systems, Murphy’s Law always lurks, ready to present unexpected challenges. After deploying the async implementation, everything seemed to be working smoothly for approximately 8 hours. However, we soon encountered a significant increase in the processing time of each request, disrupting the expected flow. Fortunately, thanks to our vigilant monitoring through Datadog metrics, we were able to detect the issue before it caused any service disruptions.

With the async implementation in place and queries executing concurrently, our workflow hit a significant setback: the time-intensive database queries run for each monitored claim resulted in multiple query locks and idle worker processes. To tackle this, we identified and eliminated unnecessary indexes from the license table and meticulously benchmarked the relevant license-matching queries. Remarkably, we discovered that the queries filtered by video_id were the slowest.

To address this performance bottleneck, we created a brand-new index specifically designed for the license-matching queries. This optimization reduced the median end-to-end processing time from 4-5 minutes to roughly 1 minute, a four- to five-fold improvement that greatly enhanced efficiency and productivity.
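In Django terms, assuming a Django-style model backs the license table (an assumption on our part; field names are made up), the fix amounts to declaring a dedicated index:

from django.db import models

class License(models.Model):
    video_id = models.CharField(max_length=64)
    # ... other license fields ...

    class Meta:
        indexes = [
            # Dedicated index for the license-matching queries that filter by video_id.
            models.Index(fields=["video_id"], name="license_video_id_idx"),
        ]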

Closing the Chapter: Onboarding Outcomes Explored 🏁

Following the successful onboarding of new customers, we witnessed a remarkable surge in claim traffic, experiencing an eightfold increase from 50,000 to nearly 400,000. 

Presently, the SyncTracker application remains exceptionally stable and seamlessly manages the influx of claim traffic. Based on our careful projections, our system is fully capable of accommodating claim traffic up to the maximum threshold allowed by the YouTube API, approximately 3 million claims per day. Furthermore, the total number of missing claims has dropped from roughly 1,500 to fewer than 100, a reduction of more than 90%, underscoring the effectiveness and efficiency of our platform.

Considering all factors, there are additional minor optimizations that are currently being developed and are expected to yield an average time-saving of 15%-20%. These enhancements are still in the pipeline and have not yet been implemented in the production environment. It is important to note that despite our current success, this milestone does not signify the end of our journey. We remain committed to continuously improving our processes and striving for further advancements.

Abbreviations đŸ€

MPC – Music Production Companies 

CiD – YT Content ID

CMS – Content Management System (used interchangeably with CiD)

YT – YouTube

ST – SyncTracker

BU – Business Unit

E2E – End to End

Christos Chatzis

Engineering Manager

LinkedIn

Anargyros Papaefstathiou

Senior Software Backend Engineer

LinkedIn

Kostis Bourlas

Software Engineer

LinkedIn

Konstantinos Chaidos

Staff Engineer

LinkedIn

An Epic Refinement – The force awakens https://www.orfium.com/engineering/an-epic-refinement-the-force-awakens/ Fri, 05 Aug 2022 10:51:32 +0000

A long time ago, in a galaxy far, far away, there was a development team with the mission to liberate the true value of content in the entertainment industry by empowering artists and composers with solutions that optimized the monetization of their assets. The battle against the complexity of the music industry was fierce. The team was constantly getting hit through the refinement process of its development initiatives.

The refinement meeting seemed like a standardized procedure, but the team unconsciously followed many antipatterns. The meeting had become a requirements announcement event instead of a working session. This caused an increased knowledge gap, lack of challenge, and a reduction in ownership. The meeting was not tailored to the purpose it was supposed to serve. This resulted in unidimensional focus (either business or tech), lack of participation, demotivation, and low morale.

To address the aforementioned issues, the team decided to strike back at these tendencies. They created a new epic refinement structure by:

  1. Introducing the Epic Ambassadors meeting
  2. Splitting the refinement meeting into 3 time-boxed parts, each with a specific agenda and outcome
  3. Leveraging offline, asynchronous technical analysis of a story by self-opted-in team members

Epic ambassadors

Target

The objective of this meeting, which is inspired by the 3 amigos perspective check, is to reduce feelings of uncertainty and disconnection towards an unknown work item and save the team’s time later, by proactively answering high-level questions and doing a feasibility check.

What

This is a pre-refinement slot in which participants opt in to act as ambassadors of the epic for the rest of the team. Ambassadors are the point of reference when the epic’s refinement discussion turns to technical requirements or spontaneous peer questions. The ambassadors help the team focus on the business-oriented discussion by instantly answering any technical question or addressing the team’s uncertainty.

High Council Chamber

Target

The objective of this part is to present a new work item to the team, cater to the value-based debate, and enhance active participation in later stages of the refinement.

What

During this phase, the entire team faces a new work item. It is the mechanism that ensures the business-oriented challenge of the proposed solution’s value. Finally, the team members are called to voluntarily form pairs to further analyze the assigned stories asynchronously and offline.

Jedi Apprenticeship

Target

The objective of this activity is to increase ownership and improve the quality of work by providing an already ready-to-be-reviewed proposed solution.

What

Now the team, in pairs, dives into the details of a ticket without the anti-patterns of large group discussions, like peer/time pressure, authority, and anchoring bias. The offline and asynchronous activity improves the quality of the proposed solution and its value. This allows the team to design a high-level tech approach, identify dependencies, define spikes on unknown items and proactively resolve possible roadblocks in the process.

The mirror war

Target

The objective of this part of the journey is to validate the quality of the refinement outcome while it reduces the knowledge gap on new items and increases participation in the process.

What

During this slot, the team splits into parallel working groups. Each group consists of a member that defined the initial solution during the Jedi apprenticeship and one reviewer, who is not a member of the initial design pair. The outcome of this ritual is a fully completed user story, for which the team can confidently estimate effort.

The rise of an epic

Target

The objective of this part is to increase awareness across the team and validate that the working items have all the necessary information to be developed.

What

During this slot, the working groups are merged to share the outcome of the process, validate each item’s estimation, and declare the working items “Ready for Development”.

Future endeavors

Every era brings its challenges, but the one thing that never changes is people. As in the refinement process, the team’s traits will be further analyzed to design new working habits in other dimensions that apply to human-centered needs and principles. Otherwise, people will ignore mechanisms that ignore people.

“Much to learn you still have
my old padawan. This is just the beginning!”

Yoda

Dimitris Tagkalos

Software Engineer @ ORFIUM

https://www.linkedin.com/in/dtagkalos/ https://github.com/dtagkalos93

Yiannis Mavraganis

Director of Organizational Effectiveness @ ORFIUM

Alexandros Synadinos

Product Manager

Idempotent APIs — Tips to build resilient apps https://www.orfium.com/engineering/idempotent-apis-tips-to-build-resilient-apps/ Thu, 07 Jul 2022 08:10:46 +0000

Photo by Tim Mossholder on Unsplash

There is a question to which typically we, as Developers, kind of turn a blind eye.

When was the last time your product had downtime? How did it affect your customers?

And why wouldn’t we?

We follow all the Agile processes, we go through multiple debates during the code review phase, we have automated and unit tests, we have QA teams that catch nasty bugs early, we write types on both Frontend and Backend, we use state-of-the-art tools, we do everything by the book. We trust that our code represents our best collective work. We expect it to simply work!

Until it doesn’t.

Photo by Elisa Ventur on Unsplash

API resiliency refers to the idea that we build APIs which are able to recover from failure. Failures can be caused by either our own or third-party service problems, server outages, DDoS attacks, network issues, and so much more. Frankly, there are innumerable reasons for which these failures occur. What’s more important is how you recover from them and ensuring they do no lasting damage. Let’s talk about an example.

Case: An artist wants to upload their song to our App

We have created an App that has a Client and an API Server.

In our case, let’s say that our user is an artist who is trying to submit their song to our platform, so we can collect royalties on their behalf.

We wrote a POST HTTP method that creates the song in the Server’s Database. This endpoint has many ways to respond. It could return 200 for success, 401 for unauthorized access, 500 for server errors, or, even worse, it might never answer at all.

In cases of failure, the Client either deals with it through error-handling methods that give our artist some information about what went wrong or, if we don’t want to interrupt the artist’s journey, schedules a retry. But beware: we want to be careful not to overload the system with requests, which is why it is recommended to implement exponential backoff.

Exponential backoff is a standard error-handling strategy for network applications. In this approach, a client periodically retries a failed request with increasing delays. Clients should use exponential backoff for all requests that return HTTP 5xx and 429 response codes, as well as for disconnections from the server. Eventually, the client should reach either a limit of maximum retries or time and stop attempting to communicate with the server. The great thing about exponential backoff is that it ensures that, when the Server is amidst an incident, it is not flooded with requests.
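A minimal sketch of client-side exponential backoff with jitter, assuming the requests library; the retry budget and delays are illustrative, not prescriptive.

import random
import time

import requests

RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def post_with_backoff(url: str, payload: dict, max_retries: int = 5) -> requests.Response:
    """POST and retry transient failures with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            response = requests.post(url, json=payload, timeout=10)
            if response.status_code not in RETRYABLE_STATUSES:
                return response
        except requests.ConnectionError:
            pass  # treat a dropped connection like any other retryable failure
        time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, 8s... plus jitter
    raise RuntimeError("gave up after max retries")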

But, how do we know if the first request actually failed?

As we said before, there can be many reasons why an API could not respond back to the client. So, how are we sure that the server’s database hasn’t already saved the song? If one of our retries succeeds, we could end up having submitted the same song twice. This is a big problem, because what we actually wanted to do is to make sure our whole operation is idempotent.

Idempotence means that, if an identical request has been made once or several times in a row, it results in the same effect while leaving the server in the same state.

A great everyday example of idempotence is a dual button ON/OFF setup. Pressing ON once or multiple times results in only 1 result: the system is on. Same goes for the OFF button. 

When talking about idempotence in the context of HTTP, another term that pops up is data safety. In that case, safety means that the request doesn’t mutate data on invocation. The table below shows commonly used HTTP methods, their safety and idempotence.

+-------------+--------+-------------+
| HTTP Method | Safety | Idempotency |
+-------------+--------+-------------+
| GET         | Yes    | Yes         |
| PUT         | No     | Yes         |
| POST        | No     | No          |
| DELETE      | No     | Yes         |
| PATCH       | No     | No          |
+-------------+--------+-------------+

So as we can see, our POST method fails at both. Great.

Solution

Let’s go back to our whole operation. What do we have?

We have a client that tells the Server that it needs to save the song. A possible scenario of that JSON request could look like this:

{
  "userID": "UID-123",
  "songTitle": "Comfortably Numb",
  "songArtist": "Pink Floyd",
  "songReleaseDate": "1979"
}

We could ask the client to perform this request as an idempotent request, by providing an additional Idempotency-Key: <key> header to the request.

An idempotency key is a unique value generated by the Client, which the server uses to recognize subsequent attempts of the same request. How you create unique keys is up to you, but it’s suggested to use V4 UUIDs, or another random string with enough entropy to avoid collisions.

When the Server receives this Idempotency-Key, it should save in the database the body of the first request made for any given idempotency key, along with the resulting status code, regardless of whether the request succeeded or failed.

Now, for every request that comes in, the Server can verify whether it has already mutated the data, just by checking the status in its DB. The idempotency layer should compare incoming parameters to those of the original request and return an error unless they’re the same, to prevent accidental misuse.

With this solution, we no longer have to worry about duplicated data or conflicts. It ensures that, no matter how many times we repeat this process, our operation is idempotent and the artist will receive the success message once the system is able to save their song.

These Idempotency-Keys should be eligible for removal automatically after they’re at least 24 hours old.

P.S. For the simplicity of the example, we saved our songs and the keys in the same database. Ideally, you would save these keys on a cache server (e.g. Redis) with a TTL of 24 hours, so old keys are removed by default.
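A minimal sketch of the server-side check, assuming a Redis cache and a hypothetical save_song helper; a production version would also reserve the key atomically before mutating, to avoid races between concurrent retries.

import json

import redis

cache = redis.Redis()
TTL_SECONDS = 24 * 60 * 60  # keys expire on their own after 24 hours

def handle_create_song(idempotency_key: str, body: dict) -> dict:
    cached = cache.get(f"idempotency:{idempotency_key}")
    if cached:
        previous = json.loads(cached)
        if previous["body"] != body:
            # Same key, different payload: almost certainly accidental misuse.
            return {"status": 422, "error": "idempotency key reused with a different body"}
        return previous["result"]          # replay the stored outcome, no second write

    result = save_song(body)               # the actual mutation (hypothetical helper)
    cache.setex(
        f"idempotency:{idempotency_key}",
        TTL_SECONDS,
        json.dumps({"body": body, "result": result}),
    )
    return result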

In conclusion

An application that uses an API which implements idempotence can follow the steps below to ensure proper usage:

  1. Create idempotency keys and attach them to the request header.
  2. When a request is unsuccessful, follow a retry policy such as exponential backoff.
  3. Save the request body and idempotency key in a cache server.
  4. Mutate the data.
  5. Update the idempotency key with the result of the mutation.
  6. After a failure (a 5xx or 429 response, or no response from the server), the client retries the request.
  7. The server checks whether the idempotency key exists in the cache server.
  8. The server checks whether the body of the request matches the one in the cache server.
  9. If everything matches, the server does not mutate the data again but returns the previously saved result of the mutation.

Java developers chase a memory leak in python https://www.orfium.com/engineering/java-developers-chase-a-memory-leak-in-python/ Wed, 04 May 2022 08:23:18 +0000
Photo by Luis Tosta on Unsplash

You take some things for granted when you work in a stack for a long time. For example, you assume that having the proper tools to do whatever it is you need to do is standard, nowadays. The truth is that sometimes you have to appreciate what you have, because you’ll miss it when you find out you don’t have it at your disposal. Like memory usage monitoring.

We’ll use a real-life example of how we troubleshot a memory issue in Python, while our background and developers were in Java. Our first reaction was to start monitoring memory with a tool like VisualVM. Surprisingly, a directly comparable tool for monitoring memory usage doesn’t exist for Python. That seemed a bit unbelievable, but we cross-checked it with experienced Python developers.

Source: https://visualvm.github.io/

So, with the help of our staff engineer, we searched for the right tool to locate the memory leak, whatever the root cause was.

We tried 3 different approaches and used a variety of tools to locate the issue.

Method #1: A simple decorator for execution time

Method #2: Using memory_profiler

Method #3: Calculating objects sizes

Method #1: A simple decorator for execution time

Decorator functions are used in python widely and they are easy, fast and convenient.

How it works: You can write a simple wrapper that calculates the execution time of a code block. Just take the time at the beginning and the end and calculate the difference. This will result in the duration of the execution in seconds.

Then, you can use the decorator in various functions that you think might be the bottleneck.

First, you create the function:

from functools import wraps
from time import time


def measure_time(f):
   @wraps(f)
   def wrap(*args, **kw):
       ts = time()
       result = f(*args, **kw)
       te = time()
       print(f.__name__ + " %s" % (te - ts))
       return result
   return wrap

Then, the usage is really simple (like a java annotation):

from foo import measure_time

@measure_time
def test(input):
   # some logic
   # a few lines more
   return output

Console (function name and execution time):

test 500.9122574329376

This would be the junior’s approach to debug memory issues, kind of like using print to debug. Nonetheless, it proved to be an effective and fast way to get an idea of what’s going on with the code.

Pros:

  • Easy to implement
  • Easy to understand

Cons:

  • Adds useless code that needs to be deleted afterwards
  • If your app is big, you add too many measurements
  • Works better if you have a hunch about where the issue is located
  • It doesn’t actually measure memory
  • The method needs to be executed successfully in order to see results (if it crashes, you won’t have anything)

Method #2: Using memory_profiler

A more “memory-focused” approach is memory_profiler. It’s an app that can be installed via pip and used as a decorator.

This app will track down memory changes and create a report with measures for each line.

Installation:

pip install -U memory_profiler

Then, the usage is really simple (like java annotation):

@profile
def my_func():
   a = [1] * (10 ** 6)
   b = [2] * (2 * 10 ** 7)
   del b
   return a

Source: https://pypi.org/project/memory-profiler/

The profiler output will be something like this:

 Line #    Mem usage    Increment  Occurrences   Line Contents
============================================================
     3   38.816 MiB   38.816 MiB           1   @profile
     4                                         def my_func():
     5   46.492 MiB    7.676 MiB           1       a = [1] * (10 ** 6)
     6  199.117 MiB  152.625 MiB           1       b = [2] * (2 * 10 ** 7)
     7   46.629 MiB -152.488 MiB           1       del b
     8   46.629 MiB    0.000 MiB           1       return a

Source: https://pypi.org/project/memory-profiler/

A more realistic example:

from memory_profiler import profile

def get_c():
  return (10 ** 6)

def get_d():
  return (2 * 10 ** 7)

def calculate_a():
  return [1] * get_c()

def calculate_b():
  return [2] * get_d()

@profile
def test():
  a = calculate_a()
  b = calculate_b()
  output = a + b
  return output

The output:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    38     20.2 MiB     20.2 MiB           1   @profile
    39                                         def test():
    40     27.7 MiB      7.5 MiB           1      a = calculate_a()
    41    180.4 MiB    152.6 MiB           1      b = calculate_b()
    42    340.7 MiB    160.4 MiB           1      output = a + b
    43    340.7 MiB      0.0 MiB           1      return output

As you can see, there is no info about methods get_c() and get_d(). That’s an issue when you have a big app with many methods calling each other; you have to add the decorator to each method you think might be causing the problem (just like method #1 for time measurement).

Keep in mind that this method, depending on your app size, may make your code significantly slower. In our case, when we needed to measure the parsing of a big file that was causing the memory leak, memory_profiler didn’t finish at all.

Pros:

  • Easy to implement
  • Understandable

Cons:

  • Adds useless code that needs to be deleted afterwards
  • If your app is big, you add too many measurements
  • Works better if you have a hunch about where the issue is located
  • Depending on your app size, it can be very slow or unable to measure
  • If it crashes, you learn nothing about the issue

Method #3: Calculating objects sizes

If you are storing data into objects, a good approach is to check the objects’ sizes. That way, you can narrow down the suspicious code snippets that might be causing the problem.

One way of doing this is by using sys.getsizeof(obj), which returns the size of an object in bytes.

How to implement:

import sys

def test():
  a = calculate_a()
  print(f"Size of object a:\t\t{sys.getsizeof(a)}")
  b = calculate_b()
  print(f"Size of object b:\t\t{sys.getsizeof(b)}")
  output = a + b
  print(f"Size of object output:\t{sys.getsizeof(output)}")
  return output

The output:

Size of object a: 8000056
Size of object b: 160000056
Size of object output: 168000056

While working on that approach, we bumped into an article about a similar case of troubleshooting. Check it out: Unexpected Size of Python Objects in Memory.

Based on that, since our objects were really complex with many children and a lot of collections and attributes, we decided to take a look at the actual size calculation. This could help us narrow down the culprit. 

import sys
import gc

def actualsize(input_obj):
    """Sum the size of input_obj plus everything reachable from it, counting each object once."""
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:  # skip objects we have already counted
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)  # descend into the objects referenced by this level
    return memory_size

Source: https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f

Now we have to alter the code to get the actual size like this:

def test():
  a = calculate_a()
  print(f"Size of object a:\t\t{actualsize(a)}")
  b = calculate_b()
  print(f"Size of object b:\t\t{actualsize(b)}")
  output = a + b
  print(f"Size of object output:\t{actualsize(output)}")
  return output

This is producing a slightly different output:

Size of object a: 8000084
Size of object b: 160000084
Size of object output: 168000112

As you can see, there is a difference in the output object (168000112 instead of 168000056, 56 bytes more). As I mentioned earlier, our object structure is a bit complicated, with multiple references to the same instances. As a result, the actual size calculation was way off and not that helpful. However, this method really does have potential.

Pros:

  • Easy to implement
  • Easy to understand

Cons:

  • Adds useless code that needs to be deleted afterwards
  • If your app is big, you add too many measurements
  • Works better if you have a hunch about where the issue is located

Conclusion

We spoke about the methods, but how did we solve the problem? Which one of these roads led us to the dreaded memory hog?

Let’s talk about what the actual problem was. There was a piece of code calling copy(obj) to create clones of data. When volume was low, it wasn’t an issue; only a few objects were being reused. But when we scaled up, we ended up with an enormous number of cloned objects, and none of them shared references, even though many held identical values.
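A toy illustration of that effect (not our real data model), reusing the actualsize() helper from Method #3 above:

import copy

template = {"territory": "US", "share": 50.0, "rights_type": "sync"}

cloned = [copy.deepcopy(template) for _ in range(100_000)]  # 100,000 distinct dicts
shared = [template] * 100_000                               # 100,000 references to one dict

print(actualsize(cloned))  # roughly 100,000 times the size of a single dict
print(actualsize(shared))  # close to the size of one dict plus the list itself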

Did any of these methods solve the problem? Not really. In our case, they didn’t provide enough data or indicators to locate the issue. The value of everything above was that, by studying more and more, we came to better understand how Python manages memory, objects, pointers, and so on. This gave us enough knowledge and insight to locate the issue. After all, we were the ones who implemented it. We had a hunch about where the issue might be located, but we needed proof. We didn’t get any during the troubleshooting, but after fixing the issue, we used all three methods for further performance improvements. They were really helpful in the optimization process.

The biggest mistake we made was that, in the early stages of implementation, we didn’t run a benchmark with huge files to check if and how we could scale up. We did it after a couple of months of implementation. Obviously, it was a bit late to declare the design a failure and start over; we had already invested too much time to just scrap it. Thankfully, it wasn’t a design flaw, but a simple bug causing a memory leak. Rookie mistake.

Lesson learned: Even if you are really confident in your design and implementation, perform stress tests and check the scalability of the app as early as possible. It’s the best way to avoid redesign and refactoring.

Dimitris Nikolouzos

Engineering Manager | Music Standards @ ORFIUM

https://www.linkedin.com/in/dnikolouzos/

https://github.com/jnikolouzos

Contributors

Panagiotis Badredin, Backend Software Engineer @ ORFIUM

https://www.linkedin.com/in/panagiotis-badredin-01b0754b/

Antonis Markoulis, Staff Engineer @ ORFIUM

https://www.linkedin.com/in/anmarkoulis/

Alexis Monte-Santo, Software Engineer @ ORFIUM

https://www.linkedin.com/in/alexis-monte-santo-0277b8116/

ORFIUM’s Design System — The story from first step to right now https://www.orfium.com/engineering/orfiums-design-system-the-story-from-first-step-to-right-now/ Tue, 15 Mar 2022 14:11:17 +0000

Our Design System with React Components, Figma, and Storybook.

Reasoning

When we started building our Design System at ORFIUM, I was hard-pressed to gather information from other companies that had gone through the process or colleagues with this kind of experience. I hope that you will find answers in our story to some of the questions you’ll face when you go about creating a design system of your own.

Start of everything and the big Q

Big Bang — The start of everything

Over the past couple of years, we’ve seen more and more companies adopt the idea of a Design System into their core business. Creating and using a design system can help solve a number of problems for collaborating design and frontend teams. As the design system handbook defines, it helps teams to:

  • iterate quickly
  • manage debt 
  • prototype faster
  • design consistency
  • improve usability
  • focus on accessibility

But what happens when you want to sync those two departments together and have that system stand as middleware? This is the point in our research where we had a hard time finding information. Turns out, it’s pretty challenging to find a one-solution-fits-all kind of answer.

At ORFIUM, the need for a DS came up as our products grew and we had to work on different projects focused on different user flows. Furthermore, the nice problem of two rapidly growing departments (frontend and design) created the need for centralized information about our user journey and a faster, more solid way of building components. This is where the Design System comes in.

To do that, we decided to implement two different tools:

  • The first one would be a collection of design components, processes, style guides, and rules maintained by the design team. These would be the one source of truth for any product design. So if we had a search page on a product, every component used on that page would come from the design system’s components.
  • The second tool is for the frontend team, which has to be in total sync with the design team. The components, dev standards, and rules applied here come mainly from the first tool and ensure that everything is in place. All components served by the design team to products will be available from the frontend team.

Our Design System 

How we did it — Front-End 

Initial analysis

The first step was to identify what we were about to build. One of our primary considerations was not to waste resources, ensuring that no extra work would be put in for unnecessary outcomes. So the first question on the team’s mind was: “Do we really need to build everything from scratch?”

There are many well-known, well-structured, open-source design systems out there, such as Material Design or Ant Design. So, why didn’t we use one of those? We spent a decent amount of time considering the v1 designs of all our products. This exercise led us to the conclusion that we couldn’t use any other system. Here’s why:

  • As our products grew, we weren’t sure we wanted to be bound by another system’s rules. Adopting a system and constantly trying to work around its fundamentals wouldn’t be a great fit.
  • We didn’t want to kill the design team’s creativity.
  • We wanted to keep our system lean. And carrying a package from a library that contained mostly elements we’d never use didn’t make any sense for Orfium.

Furthermore, it was time to establish how the library would be used within the team. As we don’t currently use any sort of monorepo, we wanted to make sure this library had its own roadmap. We chose npm and made it a public package. This way, the team consumes it as a public package, and we have to stick to many good processes. This move allowed us to introduce pull request labeling and procedures, automate our changelog and release system, and, importantly, show our work to the world and be proud of what we deliver.

First Baby Steps

Our implementation was broken up into 3 steps. 

  1. We started by defining all style guides and what we’ll need for the future with our design team. Margins, paddings, typography, colors, and grid definitions, all of these are crucial inner parts of the system that are going to be used in many places and have to be constant.
  2. We focused on the atomic design system for building all of our components, based on the style guides defined earlier. These parts, which we call components, were buttons, text fields, icons, chips, avatars, list items, etc. As we finished with the atom parts we moved to the molecules, which is what we call combinations of the previous atoms. Examples of those components are icon buttons, select, lists, etc.
    Pro Tip: I suggest that you do that early and define it with the team.
  3. We moved into mostly visual testing. We started by using snapshots. With that, we knew every time we changed something, that visual part had changed. It was crucial, as the library will grow and everything is linked under the hood, that a small change doesn’t affect components negatively.
    Pro Tip: A visual comparison of any external tool doesn’t hurt.

Team Arrangement

We defined several ways of communication between the Product, Design, and Frontend teams to provide a solution that is built once and helps all products along the way.

It is important to note here that we see the design system team as a non-product team and our two teams (design and frontend) maintain it.

Use of a Federated team

Each product has its team. This team consists of people from all departments: Frontend, Backend, Design, Product, and QA. The need usually comes from the Product team and grows there. 

The flow of requests, from request to implementation and release.

So if there is a new feature, let’s say a select input that only does filtering, it is the responsibility of the frontend engineers and designers to raise the need for a new component with the DS moderators for discussion.

You can learn more about team structures and dynamics in a great article here.

Federated team flow based on Nathan Curtis article.


Furthermore, because there is no dedicated team for the design system, the request is created as a task and can be handled by either the first developer available or the first product team member who needs to serve it on their roadmap.

This way we built it once and use it when there is a need. We always have a “Design System First” rule.

Roles

We have two roles defined to smooth out our communication.

Moderators make and document decisions, provide guidelines, do the technical review, and communicate updates. They also create space for doers to get the job done.

As our DS consists of two departments, Frontend and Design, there are two moderators: a technical mod and a designer.

Moderators communication between teams

Doers are the system contributors  and builders. They execute on proposals or requests, and set priorities based on the design’s needs.

Planning

Yeah, it’s not what you think it is. Or maybe it is. The DS is not a product, so we don’t have scheduled events, velocity, or retros. But we do have an agreed meeting every two weeks, driven mainly by the moderators to discuss new requests. They are the main members who need to attend, but both teams can join optionally, as their input is important. This meeting is also where we vote on story points. Remember when we said that the Product Managers who serve requests first are going to have them built? Well, story points help here, so the team knows what it can realistically build for the Design System.

Problems we encountered

Of course, all of the above didn't work out of the box from day one! The first version was in beta mode, and the communication between us was still unstructured. Let's take a look at the issues we faced while setting up our Design System and its processes.

Versioning

As we have many products and different roadmaps, we need to keep different versions for breaking changes. Having different versions allowed us to run at different speeds. So if we had 3 products that had 3 different roadmaps, we didn’t need to force anyone to always be up to date with the latest design system changes.

Representation of versioning

At first, we were manually pinning versions and keeping track of every product. This took a lot of time and energy and, since we try to keep our DevOps mentality, we had to automate it. And if we version our library, the design team has to do the same.

Now the system has automated (semantic) releases that we track. We introduce branches for the next release (major, i.e. breaking-change release) that we have on the roadmap, and we also keep matching versions in Figma: one for what we have now and one for what is coming in the next version. The Design and Frontend teams must be fully aligned on these versions, otherwise QA will fail.

Wrong tools for the job

We also needed to find the perfect tool for the job. We tried a couple until we found Figma ❀

We started out with Zeplin, where we had all of our design screens. But it didn't help our design team keep track of their changes as the team grew. So we moved to Abstract. That tool solved the problem of version tracking, kinda. But that was right when our design system took off, and the point where we needed to split all of our product designs into smaller components. So the key that changed everything was this word, "components". That's when Figma came in. With Figma, we now had the power of different versions and component-based designs. Our teams need to understand each other, and Figma gave us that in a simple way.

Pro Tip: we went through 3 tools to end up at Figma because we only discovered the growing needs as we used the Design System. Learn from our experience and use Figma for Design/Frontend collaboration. 

Lack of communication 

Oh boy, two different worlds under the same roof, you can imagine. We had (and we still have) miscommunication. It's one of the main ingredients of teams with different worldviews working together, we know it. The first and main problem was where the responsibility lay on each end. As the teams were working on the Design System, we were all asking questions like:

  • what should we work on next?
  • why is that there? 
  • who will say this is approved to continue to develop?

We discovered that we were tripping over our shoelaces! Thus, we assigned what we call "the Gatekeeper", a.k.a. the Moderator, on each team. The Moderator is the go-to person for all of these questions. That person is not responsible for running all the tickets; they are a member of the team, but they have to align with both teams to solve any problems and communicate with the other moderator about issues and features.

These two people now held the single source of truth, but we also had to make our decision-making more transparent. The solution was pretty simple: we started a Slack group where we shared our decision-making while also writing it down in our documentation.

As these things started to work well and we gained good speed, we had to plan ahead. Our tasks, according to our federated setup, must be visible so anyone can pick them up. We also reached a point where we wanted a rough calculation of how all this affects the other products. We know we are doing well and all, but wouldn't it be awesome if we could say that we spent 10 hours building one thing that we then used in 5 places? That's 40 hours saved!

For this, we agreed to create a "planning" event that the moderators attend and any other member can optionally join. At this event, we vote on each task and create a rough plan for the upcoming weeks. On completion of each task, we also record the hours it took. This way we know what we planned and how long it actually took. Sounds familiar, agile people?

Lastly, as our process of doing things was

planning -> coding -> code review -> QA -> completion

We decided we needed to change things a bit to avoid doing code reviews on the wrong things. So for this case, and this case only, we switched to

planning -> coding -> Design QA -> code review -> QA -> completion

This way we give developers the sanity of reviewing a finished solution.

So, the solutions that are in place:

  • Moderators for each team
  • Specific chat to let both teams in on the decision-making 
  • Planning and time tracking
  • Design QA before code QA

What’s next

Enough with the past. What does the Orfium Design System, Ictinus, have in its future?

Every decision we make is based on a need. Likewise, any change to a previous decision is made for the same reason: a new need. Right now there are two needs we have to address.

Linked documentation

Lately, we see that our two sets of documentation (technical + design) need to be easy to compare, or visually linked. This will help us develop, as we could compare the implementation with the design itself in real time. It can also help any QA engineer or viewer compare the actual outcome, which adds extra value.

The planned solution is to link Storybook with Figma with two-way communication.

Enhanced QA

We have a lot of room to expand here, as we have only scratched the surface. The plan is to add more QA for UX and UI. Also, true to our DevOps nature, we can automate a check on every release that detects whether something changed compared to the documentation on either side, and shows exactly what changed.

Conclusion

Building a design system is challenging, literally. It will challenge the way the teams work, communicate, and collaborate with each other, and how the org as a whole thinks about more abstract solutions. It's a long road that requires constant fixes in both tech and processes. But all the hard work pays off when you see it in action, when one small UI or UX change suddenly affects 5 products and their users.

Of course, your design system doesn't have to match our story; every company's DS is different, and it may be bigger, or even a standalone product.

If you want to have a glance at our project, check it out on npm or GitHub.

*Credit for all of this hard work must go to both teams, Frontend and Design. Plus, thanks for all the assets used in this article!

References

Team Models for Scaling a Design System – Evolving Past Overlords to Centralize or Federate Design Decision-Making Across Platforms, Nathan Curtis (medium.com)

Defining Design System Contributions – Time to Separate the Small-and-Quick from Larger Things, Nathan Curtis (medium.com)

Panagiotis Vourtsis

Head of Front End Engineering @ ORFIUM

https://www.linkedin.com/in/panvourtsis

https://github.com/panvourtsis

]]>
API Test Automation in ORFIUM https://www.orfium.com/engineering/api-test-automation-in-orfium/ Fri, 21 Jan 2022 08:22:18 +0000 http://52.91.248.125/api-test-automation-in-orfium/

Strap in, we’re about to get technical. But that’s the kind of people we are, and we’re guessing you are too. Today we’re talking about API Test Automation, a topic which causes a frankly unexpected degree of passionate discussions here at ORFIUM.

When you start talking about Automated Tests you should always have in mind the Test Pyramid. Before implementing any test, you should first consider the Testing Layer that this test refers to. Keep in mind this amazing quote from The Practical Test Pyramid – Martin Fowler:

If a higher-level test spots an error and there’s no lower-level test failing, you need to write a lower-level test

Push your tests as far down the test pyramid as you can

In our QA team at ORFIUM, we work on implementing automated tests for the topmost layers of the Pyramid: API Tests → End to End Tests → UI tests and also perform Manual & Exploratory Testing. We mainly aim to have more API tests and fewer UI tests.

In this article, we will focus on API Test Automation, the framework we have chosen, our projects’ structure, and the test reporting. Plus a few tips & hints from our findings so far with this framework.

About Tavern API Testing

Tavern is a pytest plugin, command-line tool, and Python library for automated testing of APIs, with a simple, concise, and flexible YAML-based syntax. It’s very simple to get started, and highly customizable for complex tests. Tavern supports testing RESTful APIs, as well as MQTT-based APIs.

You can learn more about it on the Tavern official documentation and deep dive into Tavern.

The best way to use Tavern is with pytest. It comes with a pytest plugin, so all you have to do is install pytest and tavern, write your tests in test_tavern.yaml files, and run pytest. This means you get access to all of the pytest testing framework advantages, such as being able to:

  • execute multiple tests simultaneously
  • automatically discover your test files
  • provide various CLI options to run your tests with different configurations
  • generate handy and pretty test reports
  • organize your tests using markers to group them based on the system under test functionality
  • use hooks and fixtures to trigger setup/teardown methods or test data ingestion (a short sketch follows this list)
  • use plenty of extremely helpful pytest plugins, such as pytest-xdist, pytest-allure, and more
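
To make the fixtures point above more concrete, here is a minimal, hypothetical sketch of a setup/teardown fixture that seeds data through an API before a test and cleans it up afterwards; the endpoint, payload and fixture name are made up for illustration and are not part of our actual projects:

import pytest
import requests

BASE_URL = "https://api.example.com"  # hypothetical endpoint, purely for illustration


@pytest.fixture
def temporary_user():
    # setup: create a throwaway user through the API before the test runs
    created = requests.post(f"{BASE_URL}/users", json={"email": "qa@example.com"}).json()
    yield created  # the test body runs here and receives the created user
    # teardown: remove the test data once the test has finished
    requests.delete(f"{BASE_URL}/users/{created['id']}")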

Why Tavern?

Our engineering teams mainly use Python, so it stands to reason that we should also build our test automation projects in Python. For the UI tests, we were already familiar with Python Selenium and BDD frameworks, such as behave & pytest-bdd. For the API tests, we researched Python API testing frameworks and we were impressed by Tavern because:

  • It is lightweight and compatible with pytest.
  • The YAML format of test files offers a higher level of abstraction in the API tests and makes it easier to read and maintain.
  • We can reuse testing stages within a test file so we can create an end-to-end test in a single yaml file.
  • We can manipulate requests and responses using Python requests to do the required validations.
  • It is integrated with allure-pytest, providing very helpful test reports with thorough details of requests and responses.
  • It has a growing and active community.

CI pipelines using Jenkins & Docker

Jenkins project structure

In collaboration with our DevOps team, we have set up a virtual server (Amazon EC2 instance – Linux OS) with a Jenkins server installed. The Jenkins server resides on a custom host, accessed only by our engineering team. On this server, we have created multiple Jenkins projects, one for each QA automation project. Every Jenkins project contains two CI pipelines – one for the API tests and one for the UI tests. The pipelines use the Declarative Pipeline syntax, and the build script path is read directly from source code management (Pipeline script from SCM → Github), i.e. the respective Jenkinsfile of our QA automation projects.

This means that each QA automation project includes a Jenkinsfile for each type of test – one for the UI tests CI & one for the API tests CI.

A typical Jenkinsfile for our API tests includes the following stages:

  • Stage 1: Check out the Github repository for the respective QA automation project
  • Stage 2: Check out the application’s (backend) Github repository.
  • Stage 3: Run the backend application by building the available docker-compose.yml along with a compose-network.yml that exists in the QA automation project. The latter compose file works as a network bridge between the two Docker containers. When these two compose files are built, a host:port is deployed within the Jenkins server and the application's container is available on the respective port. (i)
  • Stage 4: Build the automation project's docker-compose.yml. When this stage starts, a Docker container is built with the following command arguments: pytest -n 8 --env ${APP_ENV_NAME} -m ${TAG} (ii). The ${APP_ENV_NAME} is passed as a Jenkins parameter which defaults to the local env, hence the tests run against the host:port that was previously deployed. (iii)
  • Stage 5: Copy the generated test reports from the QA Docker container to the Jenkins server using the docker cp command and publish the test results to the respective Google chat channel.
  • Stage 6: Teardown both containers and clean up the Jenkins workspace.

Trigger Jenkins pipelines

Our Jenkins jobs can be triggered either manually, via the Jenkins GUI, or automatically, whenever a pull request is opened on our backend application's Github repository. This is feasible via the Jenkins feature that allows triggering jobs remotely via a curl command, so we have added a script to our application's deployment process which runs this curl command. As a result, whenever a pull request or a push event happens, our Jenkins CI job is triggered and runs our tests against the specific pull request environment. When this job is finished, our test reports are published to a test reporting channel. Finally, we have added a periodic test run that runs once a day against our staging environment.


i) A bridge between the two containers’ networks is required so that the automation project container can access the application’s container port. This is required because we also build the QA automation project using a docker-compose file so both the application and QA project are built locally and need access to the same network.

ii) -n 8 uses the pytest-xdist plugin, which enables test run parallelization: if you have multiple CPUs or hosts you can use them for a parallel test run, significantly reducing the test execution time.

iii) wait-for-it was used as an entry point in the docker-compose.yml, since we have to wait a few seconds until the application’s container is up and running. For that purpose, a shell script was implemented doing the following:

#!/bin/bash
set -o errexit #Exit when the command returns a non-zero exit status (failed).
if [ $APP_ENV_NAME == 'local' ];then
wait-for-it --service $HOST --timeout 20
fi
exec "$@" # make the entrypoint a pass through that then runs the docker command

https://github.com/Orfium/qa-tutorial/blob/main/wait_for_local_server.sh


Why Docker?

Having Docker containers for both application and automation projects we ensure that:

  • The tests are isolated and run smoothly anywhere, independent of OS, increasing mobility and portability
  • We do not have to maintain a test environment, since containers are spawned and destroyed during the CI test run on Jenkins
  • There is no need to clean up any test data in the test environment, since it is destroyed when the test run is completed
  • We keep improving our skills in the Docker containers and CI world

Building API Test Automation projects with Tavern

Our test automation projects at Orfium follow a common architecture and structure. We started our journey in API test automation with a proof-of-concept (PoC) project for one of our products, trying Tavern for the first time. We continued working, sprint by sprint, adding new API tests and troubleshooting any issues we encountered. We managed to scale this PoC and reach an adequate number of API tests. Then, since the PoC was successful, we decided to apply this architecture in new test automation projects for other products too. Here’s the process we decided to use after we adopted our API test automation framework.

Structure

A new python project is created along with the respective python virtual environment. In the project root, we add the following:

  • a requirements.txt file which includes the pip packages that we need to install in the virtual environment
  • a Dockerfile that builds an image for our project
  • a docker-compose.yml file that builds, creates, and starts a container for our tests, passing the pytest command with the desired arguments, and this way we run our tests “containerized”
  • a compose-network.yml that is used as a bridge network between the test Docker container and the application’s container that we will see next.
  • a Jenkinsfile that includes the declarative Jenkins script used for running our tests in the Jenkins server
  • a data folder that contains any test data we may want to ingest via our API tests, e.g. csv, json, xlsx, png, jpg files
  • a new directory named api_tests. Within this directory, we add the following sub-directories and files:
    • tests: contains all the tavern.yaml files, e.g. test_login_tavern.yaml
    • verifications: a python package that contains all the Python verification methods
    • hooks: contains Tavern stages in yaml file format that are reused in multiple test_whatever.tavern.yaml files
    • conftest.py: includes all the pytest fixtures and hooks we use in the api_tests directory scope
    • pytest.ini: all the flags and markers we want to run along with the pytest command
    • config: this folder contains the following configuration files:
      • config.py file setting the environment handling classes, endpoints, and other useful stuff that are used across our API tests
      • routes.yaml file which includes all the endpoints-paths of the application that we are going to test as key-value pairs.
      • log_spec.yaml defining the Tavern logging which is very useful for debugging

Prerequisites

Tavern only supports Python 3.4 and up. You will also need to have virtualenv and virtualenvwrapper installed on your machine.

Install Tavern in your virtual environment. Assuming that virtualenvwrapper has already been installed and configured on your machine:

# on macOS / Linux with virtualenvwrapper
mkvirtualenv qa-<project-name>
cd ~/path-to-virtual-env/
# activate virtual env
workon qa-<project-name>
#install Tavern pip package
pip install tavern
 
# on Windows - with virtualenv
pip install virtualenv
virtualenv <virtual-env-name>
cd ~/path-to-virtual-env/
# activate virtual env
<virtual-env-name>\Scripts\activate
#install Tavern pip package
pip install tavern

Set up PYTHONPATH

To make sure that Tavern can find external functions, you need to make sure that it is in the Python path. Check out the Tavern docs section: Calling external functions

So, the paths api_tests/verifications & api_tests/tests (which contain the Python verification methods and the Tavern tests) have to be added to the PYTHONPATH. This PYTHONPATH setup can be handled either manually or automated as described below:

→ manually: In your .rc file (zshrc, bashrc, activate, etc.) add the following line:

# on macOS/Linux
export PYTHONPATH="$PYTHONPATH:/qa-<project-name>/tests/api_tests/tests"
# on Windows
set PYTHONPATH=%PYTHONPATH%;C:\<path-to-qa-project>\tests\api_tests\tests # or set it via the Advanced System Settings -> Environment Variables

→ via conftest file: Anything included in a conftest.py file is accessible across multiple files, since pytest will load the module and make it available for your tests. This is how you ensure that even if you did not set up the PYTHONPATH manually, this will be configured within your test as expected. You have to add the following in your conftest.py file: 

import sys
from pathlib import Path, WindowsPath

path = Path(__file__)
if sys.platform == "win32":
    path = WindowsPath(__file__)
INCLUDE_TESTS_IN_PYTHONPATH = path.parent / "tests"
INCLUDE_VERIFICATIONS_IN_PYTHONPATH = path.parent / "verifications"
try:
    # if the paths are already on sys.path, index() succeeds and nothing is appended
    sys.path.index(INCLUDE_TESTS_IN_PYTHONPATH.as_posix())
    sys.path.index(INCLUDE_VERIFICATIONS_IN_PYTHONPATH.as_posix())
except ValueError:
    sys.path.append(INCLUDE_TESTS_IN_PYTHONPATH.as_posix())
    sys.path.append(INCLUDE_VERIFICATIONS_IN_PYTHONPATH.as_posix())

Create your first test

Within the api_tests/tests folder, create your first test file, naming it test_<test_name>.tavern.yaml.

Each test within this yaml file can have the following keys:

  • test_name: Add the test name with a descriptive title
  • includes: Specify other yaml files that this test will require in order to run
  • marks: Add a tag name to the test that can be used upon test execution as a pytest marker
  • stages: The key stages is a list of the stages that make up the test
    • name: A sub-key of stages that describes the stage.
  • request: A key of each stage that includes the request-related keys:
    • url – a string, including the protocol, of the address of the server that will be queried
    • json – a mapping of (possibly nested) key: value pairs/lists that will be converted to JSON and sent as the request body.
    • params – a mapping of key: value pairs that will go into the query parameters.
    • data – Either a mapping of key-value pairs that will go into the body as application/x-www-url-form-encoded data or a string that will be sent by itself (with no content-type).
    • headers – a mapping of key: value pairs that will go into the headers. Defaults to adding a content-type: application/json header.
    • method – one of GET, POST, PUT, DELETE, PATCH, OPTIONS, or HEAD. Defaults to GET if not defined
  • response: A key of each stage that includes the expected response keys:
    • status_code – an integer corresponding to the status code that we expect, or a list of status codes, if you are expecting one of a few status codes. Defaults to 200 if not defined.
    • json – Assuming the response is json, check the body against the values given. Expects a mapping (possibly nested) key: value pairs/lists. This can also use an external check function, described further down.
    • headers – a mapping of key: value pairs that will be checked against the headers.
    • redirect_query_params – Checks the query parameters of a redirect url passed in the location header (if one is returned). Expects a mapping of key: value pairs. This can be useful for testing the implementation of an OpenID connect provider, where information about the request may be returned in redirect query parameters.
    • verify_response_with – use an external function to verify the response (a short sketch follows this list)
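
For illustration, here is a minimal sketch of what such an external verification function could look like; the module path, function name and response fields are hypothetical, and the yaml reference shown in the docstring is only a rough indication of how it would be wired up:

# api_tests/verifications/checks.py  (hypothetical module, function and field names)
def check_login_response(response):
    """Custom assertions that go beyond plain key/value matching.

    Referenced from a stage's response block roughly like:
        verify_response_with:
          function: checks:check_login_response
    """
    body = response.json()
    assert response.status_code == 200
    assert "token" in body and len(body["token"]) > 20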

Set up the test environment

In our SDLC we run the tests against multiple environments: local environment → pull request environments → staging environment. For that purpose, we have created a Python class named AppEnvSettings that handles the different environments dynamically, based on the application's environment name passed as a CLI option to the pytest command. To be more specific, we have added a pytest fixture within the api_tests/conftest.py file along with the respective pytest_addoption method.

This fixture is used whenever a pytest session starts and invokes the aforementioned class, passing the environment as an argument via a CLI option. So, if we want to trigger our tests against a specific environment we run pytest --env staging.
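
A minimal sketch of what that conftest.py setup could look like is shown below; the class body, the environment names and the URLs are illustrative assumptions rather than our actual configuration:

# api_tests/conftest.py (sketch)
import pytest


class AppEnvSettings:
    """Builds environment-specific settings, e.g. the base API url."""

    URLS = {
        "local": "http://localhost:8000",          # assumed local host:port
        "staging": "https://staging.example.com",  # hypothetical staging host
    }

    def __init__(self, env="local"):
        self.env = env
        self.api_url = self.URLS.get(env, self.URLS["local"])


def pytest_addoption(parser):
    # exposes `pytest --env <name>` on the command line
    parser.addoption("--env", action="store", default="local",
                     help="Environment the API tests run against")


@pytest.fixture(scope="session", autouse=True)
def app_env_settings(request):
    # constructed once per session and available to the Tavern yaml files
    return AppEnvSettings(env=request.config.getoption("--env"))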

Since Tavern works perfectly with pytest, the app_env_settings fixture is automatically used from test_tavern.yaml files and you can use the constructed api_url within your tavern tests this way:

stages:
  - name: User is able to login with valid registration details
    request:
      url: "{app_env_settings.api_url}{login_route}"

AppEnvSettings class code -> config.py example

Run the tests

cd api_tests/tests
pytest # or pytest <test_name>.tavern.yaml to run a specific test file

Generate Test reports

To generate test reports when a test run is finished, we have chosen the allure-pytest plugin. The generated reports are quite handy and help us identify failures easily. It is also well integrated with Tavern, so we see the test reports in a detailed view per Tavern stage. This way, we are able to monitor any failures while also being able to check the requests and responses for each test. In order to use this reporter, you have to install the allure-pytest package in your virtual environment and configure your pytest.ini file:

addopts=   
    # path to store the allure reports
    --alluredir="../reports/allure-api-results"

pytest.ini example 

Bottom line

Choosing Tavern as our API testing framework has been a great pick for us so far! We have not encountered any blockers at all as we scale our automation projects. We also utilize all the benefits of pytest during test execution, such as pytest fixtures, parallelization with pytest-xdist, and the many other features described above. Tavern also gives us a higher level of abstraction, since the tavern.yaml files increase readability in our automation projects and provide very detailed test reports. We are able to easily identify which stage of the end-to-end test failed and what caused the failure. Which is the whole point.

Moreover, having different stages within the .tavern.yaml file allows us to create end-to-end tests that describe a business process and are not just API tests that validate the responses of each request. We can simulate the customer’s behavior using API calls and validate that our application will work as expected in business critical scenarios.

Finally, triggering our CI jobs automatically upon new pull & merge requests provides very helpful feedback during the Software Development Lifecycle (SDLC). We ensure that our regression test suites run successfully in staging and QA environments, and this way we increase our confidence that, when a new feature is released to production, it will not affect the existing functionality of our applications.

Resources

Orfium QA tutorial project

Tavern API test example

You can find more examples and details here → Tavern test examples

Tips & Hints

]]>
⏱ Speeding up your Python & Django test suite https://www.orfium.com/engineering/speeding-up-your-python-django-test-suite/ Thu, 04 Nov 2021 11:51:59 +0000 http://52.91.248.125/speeding-up-your-python-django-test-suite/
Artwork by @vitzi_art

Less time waiting, more time hacking!

Yes yes, we all know. Writing tests and thoroughly running them on our code is important. None of us enjoy doing it but we almost all see the benefits of this process. But what isn’t as great about testing is the waiting, the context shifting and the loss of focus. At least for me, this distraction is a real drag, especially when I have to run a full test suite.

This is why I find it crucial to have a fine-tuned test suite that runs as fast as possible, and why I always put some effort into speeding up my test runs, both locally and in the CI. While working on different Python / Django projects I've discovered some tips & tricks that can make your life easier. Plenty of them are included in various documentations, like the almighty Django docs, but I think there's some value in collecting them all in a single place.

As a bonus, I’ll be sharing some examples / tips for enhancing your test runs when using Github Actions, as well as a case study to showcase the benefit of all these suggestions.

The quick wins

Running a part of the test suite

This first one is kind of obvious, but you don’t have to run the whole test suite every single time. You can run tests in a single package, module, class, or even function by using the path on the test command.

> python manage.py test package.module.class.function
System check identified no issues (0 silenced).
..
----------------------------------------------------------------------
Ran 2 tests in 6.570s

OK

Keeping the database between test runs

By default, Django creates a test database for each test run, which is destroyed at the end. This is a rather slow process, especially if you want to run just a few tests! The --keepdb option will not destroy and recreate the database locally on every run. This gives a huge speedup when running tests locally and is a pretty safe option to use in general.

> python manage.py test <path or nothing> --keepdb
Using existing test database for alias 'default'...          <--- Reused!
System check identified no issues (0 silenced).
..
----------------------------------------------------------------------
Ran 2 tests in 6.570s

OK
Preserving test database for alias 'default'...              <--- Not destroyed!

This is not as error-prone as it sounds, since every test usually takes care of restoring the state of the database, either by rolling back transactions or truncating tables. We’ll talk more about this later on.

If you see errors that may be related with the database not being recreated at the start of the test (like IntegrityError, etc), you can always remove the flag on the next run. This will destroy the database and recreate it.

Running tests in parallel

By default, Django runs tests sequentially. However, whether you’re running tests locally or in your CI (Github Actions, Jenkins CI, etc) more often than not you’ll have multiple cores. To leverage them, you can use the --parallel flag. Django will create additional processes to run your tests and additional databases to run them against.

You will see something like this:

> python3 manage.py test --parallel --keepdb
Using existing test database for alias 'default'...
Using existing clone for alias 'default'...        --
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |    => 12 processes!
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...        -- 

< running tests > 

Preserving test database for alias 'default'...
... x10 ...
Preserving test database for alias 'default'...

In Github runners, it usually spawns 2-3 processes.

When running with --parallel, Django will try to pickle tracebacks from errors to display them in the end. You’ll have to add tblib as a dependency to make this work.

Caching your Python environment (CI)

Usually when running tests in CI/CD environments a step of the process is building the Python environment (creating a virtualenv, installing dependencies etc). A common practice to speed things up here is to cache this environment and keep it between builds since it doesn’t change often. Keep in mind, you’ll need to invalidate this whenever your requirements change.

An example for Github Actions could be adding something like this to your workflow’s yaml file:

- name: Cache pip
  uses: actions/cache@v2
  with:
    # This path is specific to Ubuntu
    path: ${{ env.pythonLocation }}
    # Look to see if there is a cache hit for the corresponding requirements file
    key: ${{ env.pythonLocation }}-${{ hashFiles('requirements.txt','test_requirements.txt') }}

Make sure to include all your requirements files in hashFiles.

If you search online you’ll find various guides, including the official Github guide, advising to cache just the retrieved packages (the pip cache). I prefer the above method which caches the built packages since I saw no speedup with the suggestions there.

The slow but powerful

Prefer TestCase instead of TransactionTestCase

Django offers 2 different base classes for test cases: TestCase and TransactionTestCase. Actually it offers more, but for this case we only care about those two.

But what’s the difference? Quoting the docs:

Django’s TestCase class is a more commonly used subclass of TransactionTestCase that makes use of database transaction facilities to speed up the process of resetting the database to a known state at the beginning of each test. A consequence of this, however, is that some database behaviors cannot be tested within a Django TestCase class. For instance, you cannot test that a block of code is executing within a transaction, as is required when using select_for_update(). In those cases, you should use TransactionTestCase.

TransactionTestCase and TestCase are identical except for the manner in which the database is reset to a known state and the ability for test code to test the effects of commit and rollback:

  • A TransactionTestCase resets the database after the test runs by truncating all tables. A TransactionTestCase may call commit and rollback and observe the effects of these calls on the database.
  • A TestCase, on the other hand, does not truncate tables after a test. Instead, it encloses the test code in a database transaction that is rolled back at the end of the test. This guarantees that the rollback at the end of the test restores the database to its initial state.

In each project, there’s often this one test that breaks with TestCase but works with TransactionTestCase. When engineers see this, they consider it more reliable and switch their base test classes to TransactionTestCase, without considering the performance impact. We’ll see later on that this is not negligible at all.

TL;DR:

You probably only need TestCase for most of your tests. Use TransactionTestCase wisely!

Some additional indicators that TransactionTestCase might be useful are (see the sketch right after this list):

  • emulating transaction errors
  • using on_commit hooks
  • firing async tasks (which will run outside the transaction by definition)
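
As a rough illustration of the trade-off, here is a minimal sketch (the Order model and its field are hypothetical): the first class stays fast by relying on rollback, while the second needs a real commit so that on_commit callbacks actually fire.

from django.db import transaction
from django.test import TestCase, TransactionTestCase

from myapp.models import Order  # hypothetical model


class OrderTests(TestCase):
    # Each test runs inside a transaction that is rolled back afterwards: fast.
    def test_create_order(self):
        Order.objects.create(total=10)
        self.assertEqual(Order.objects.count(), 1)


class OrderOnCommitTests(TransactionTestCase):
    # Needed here because on_commit callbacks only run on a real commit,
    # which never happens inside TestCase's wrapping transaction.
    def test_on_commit_callback_runs(self):
        calls = []
        with transaction.atomic():
            transaction.on_commit(lambda: calls.append("fired"))
        self.assertEqual(calls, ["fired"])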

Try to use setUpTestData instead of setUp

Whenever you want to set up data for your tests, you usually override & use setUp. This runs before every test and creates the data you need.

However, if you don't change the data in each test case, you can also use setUpTestData (docs). This runs once for the whole test class, instead of before every test, and creates the necessary data. This is definitely faster, but if your tests alter the test data you can end up with weird cases. Use with caution.
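
A minimal sketch of the difference (the Author model is hypothetical): the data below is created once per class instead of once per test, and the class-level transaction still restores the database afterwards.

from django.test import TestCase

from myapp.models import Author  # hypothetical model


class AuthorTests(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Runs once for the whole class, not before every single test method.
        cls.author = Author.objects.create(name="Jane Doe")

    def test_name(self):
        self.assertEqual(self.author.name, "Jane Doe")

    def test_author_is_listed(self):
        self.assertIn(self.author, Author.objects.all())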

Finding slow tests with nose

Last but not least, remember that tests are code. So there’s always the chance that some tests are really slow because you didn’t develop them with performance in mind. If this is the case, the best thing you can do is rewrite them. But figuring out a single slow test is not that easy.

Luckily, you can use django-nose (docs) and nose-timer to find the slowest tests in your suite.

To do that:

  • add django-nose, nose-timer to your requirements
  • In your settings.py, change the test runner and use some nose-specific arguments

Example settings.py:

TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'
NOSE_ARGS = [
    '--nocapture',
    '--verbosity=2',
    '--with-timer',
    '--timer-top-n=10',
    '--with-id'
]

The above arguments will make nose output:

  • The name of each test
  • The time each test takes
  • The top 10 slowest tests in the end

Example output:

....
#656 test_function_name (path.to.test.module.TestCase) ... ok (1.3138s)
#657 test_function_name (path.to.test.module.TestCase) ... ok (3.0827s)
#658 test_function_name (path.to.test.module.TestCase) ... ok (5.0743s)
#659 test_function_name (path.to.test.module.TestCase) ... ok (5.3729s)
....
#665 test_function_name (path.to.test.module.TestCase) ... ok (3.1782s)
#666 test_function_name (path.to.test.module.TestCase) ... ok (0.7577s)
#667 test_function_name (path.to.test.module.TestCase) ... ok (0.7488s)

[success] 6.67% path.to.slow.test.TestCase.function: 5.3729s    ----
[success] 6.30% path.to.slow.test.TestCase.function: 5.0743s       |
[success] 5.61% path.to.slow.test.TestCase.function: 4.5148s       |
[success] 5.50% path.to.slow.test.TestCase.function: 4.4254s       |
[success] 5.09% path.to.slow.test.TestCase.function: 4.0960s       | 10 slowest
[success] 4.32% path.to.slow.test.TestCase.function: 3.4779s       |    tests
[success] 3.95% path.to.slow.test.TestCase.function: 3.1782s       |
[success] 3.83% path.to.slow.test.TestCase.function: 3.0827s       |
[success] 3.47% path.to.slow.test.TestCase.function: 2.7970s       |
[success] 3.20% path.to.slow.test.TestCase.function: 2.5786s    ---- 
----------------------------------------------------------------------
Ran 72 tests in 80.877s

OK

Now it's easier to find the slow tests and debug why they take so much time to run.

Case study

To showcase the value of each of these suggestions, we’ll be running a series of scenarios and measuring how much time we save with each improvement.

The scenarios

  • Locally
    • Run a single test with / without --keepdb, to measure the overhead of recreating the database
    • Run a whole test suite locally with / without --parallel, to see how much faster this is
  • On Github Actions
    • Run a whole test suite with no improvements
    • Add --parallel and re-run
    • Cache the python environment and re-run
    • Change the base test case to TestCase from TransactionTestCase

Locally

Performance of --keepdb

Let’s run a single test without --keepdb:

> time python3 manage.py test package.module.TestCase.test 
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
.
----------------------------------------------------------------------
Ran 1 test in 4.297s

OK
Destroying test database for alias 'default'...

real    0m50.299s
user    1m0.945s
sys     0m1.922s

Now the same test with --keepdb:

> time python3 manage.py test package.module.TestCase.test --keepdb
Using existing test database for alias 'default'...
.
----------------------------------------------------------------------
Ran 1 test in 4.148s

OK
Preserving test database for alias 'default'...

real    0m6.899s
user    0m20.640s
sys     0m1.845s

Difference: 50 sec vs 7 sec or 7 times faster

Performance of --parallel

Without --parallel:

> python3 manage.py test --keepdb
...
----------------------------------------------------------------------
Ran 591 tests in 670.560s

With --parallel (concurrency: 6):

> python3 manage.py test --keepdb --parallel 6
...
----------------------------------------------------------------------
Ran 591 tests in 305.394s

Difference: 670 sec vs 305 sec or > 2x faster

On Github Actions

Without any improvements, the build took ~25 mins to run. The breakdown for each action can be seen below.

Note that running the test suite took 20 mins

When running with --parallel, the whole build took ~17 mins to run (~30% less). Running the tests took 13 mins (vs 20 mins without --parallel, an improvement of ~40%).

By caching the python environment, we can see that the Install dependencies step takes a few seconds to run instead of ~4 mins, reducing the build time to 14 mins

Finally, by changing the base test case from TransactionTestCase to TestCase and fixing the 3 tests that required it, the time dropped again:

Neat, isn’t it?

Key takeaway: We managed to reduce the build time from ~25 mins to less than 10 mins, which is less than half of the original time.

Bonus: Using coverage.py in parallel mode

If you are using coverage.py, setting the --parallel flag is not enough for your tests to run in parallel with coverage measured correctly.

First, you will need to set parallel = True and concurrency = multiprocessing to your .coveragerc. For example:

# .coveragerc
[run]
branch = True
omit = */__init__*
       */test*.py
       */migrations/*
       */urls.py
       */admin.py
       */apps.py

# Required for parallel
parallel = true
# Required for parallel
concurrency = multiprocessing

[report]
precision = 1
show_missing = True
ignore_errors = True
exclude_lines =
    pragma: no cover
    raise NotImplementedError
    except ImportError
    def __repr__
    if self.logger.debug
    if __name__ == .__main__.:

Then, add a sitecustomize.py to your project’s root directory (where you’ll be running your tests from).

# sitecustomize.py
import coverage

coverage.process_startup()

Finally, you’ll need to do some extra steps to run with coverage and create a report.

# change the command to something like this
COVERAGE_PROCESS_START=./.coveragerc coverage run --parallel-mode --concurrency=multiprocessing --rcfile=./.coveragerc manage.py test --parallel
# combine individual coverage files 
coverage combine --rcfile=./.coveragerc
# and then create the coverage report
coverage report -m --rcfile=./.coveragerc

Enjoy your way faster test suite!

Sergios Aftsidis

Senior Backend Software Engineer @ ORFIUM

https://www.linkedin.com/in/saftsidis/

https://iamsafts.com/

https://github.com/safts

]]>
Handling Redux Side-Effects — the RxJS way https://www.orfium.com/engineering/handling-redux-side-effects-the-rxjs-way/ Tue, 12 Mar 2019 08:02:00 +0000 http://52.91.248.125/handling-redux-side-effects-the-rxjs-way/


Hello stranger! If you are working with Redux you have probably walked into the same issue we had at the Orfium Frontend team with Redux side effects (maybe that's why you are reading this as well). If not, then wait for it, it will happen soon. Let me guide you through our story and findings.

Side-Effects and why we need them

Regular actions are dispatched and, as a result, some state is changed.

Imagine now that you have a list and you need to fetch information from a server to fill it. In that case you would need to dispatch an action, make a request to the server, and on response (success/error) dispatch another action to update the state (loading, results, etc.). That, right there, is a side effect.

Right now though, you have an action that you can't actually control. The same action can be triggered again if the user clicks that button, giving you multiple requests and, most importantly, multiple state updates that can ruin your perfect application. What if you could cancel, wait, debounce and generally handle those actions? Spoiler alert, you can!

There are two major options for handling side-effect actions with the above advantages (thunks are not included):

  • redux-observable – based on RxJS; it uses observables to listen for actions
  • redux-saga – it uses JavaScript generators to handle side effects

RX side effects

Using redux-observable you can change the way you operate. I will not go into details on how to install it, as you can easily read about that here. I will go a bit deeper into the more complicated parts of it, like:

  • how to form an epic
  • redux forms (how to handle them)
  • testing with state
  • testing actions with ease

Let's imagine we have a simple scenario: a request to the server where, on success or on failure, we dispatch some actions to update our reducer.

Let's see what such a case would look like in code.

Let me break it down a bit.

In the above example we define a new epic. Then, using ofType, we wait for the action with type ACTION_REQUEST to get fired. When that is triggered we go into switchMap, which is where we limit the requests we make: switchMap runs one request at a time, so any other request will be ignored, and this way you avoid multiple requests. Then with from we transform the promise of the request into an observable. Lastly, we use mergeMap to handle the response (imagine a .then on a promise) and return the set of actions, and we do the same for the error.

What if you want to first show a loading indicator while the data is being fetched?

Now you can easily add an action before requesting data.

We are using concat, which will run the two actions inside it (loadingList and then the request) in the order defined. Getting easier?

Now you have all that ready and you are using redux-form to handle your data. How will you know, when you press submit on that beautiful form of yours, that it is in a submitting state so you can show that loading animation? Well, the idea remains the same. Let's see how something like this can be accomplished.

Here we change the logic a bit. Again, we wait for the same action in the ofType section. In this example, we want the formName to be passed to the epic in order to perform the actions shown above. In the switchMap we expect two things, a payload and a meta! Oh yes, a meta. Meta holds information that the payload doesn't have to know about. You can see more details here on why and when to use it. So, again using concat and the redux-form action startSubmit, we can declare that our form has started submitting. Don't forget to stopSubmit on both success and error. That's it!

I would suggest always dispatching request actions with names like the examples ACTION_REQUEST or FETCH_BLAH_LIST_REQUEST. This way you always know that this is an epic-based action for side effects.

You can now use takeUntil and stop listening for the success event after another event occurs. This helps with the classic Netflix problem: you navigate to a details page and it starts fetching, then you go back and into another details page, and the first page's request starts resolving and messes up your current page.

This problem is well explained here by Jay Phelps and I'd recommend taking a look.

Now we have the same example as before, but we only add the takeUntil operator. If it listens for the PAGE_CHANGED action, it will not cancel the request itself, but it will ignore any resolution of the current request. Yay!! Now, if you want, you can implement a cancelable request with axios, fetch or anything else.

Testing on RxJS with ease

I found the redux-observable tutorials a bit advanced and confusing when it comes to testing. I can see why, but in most cases you will not need anything that advanced, so I will propose a simple approach to testing.

Redux-observable exposes two things from its modules: ActionsObservable and StateObservable. You can use them to create observables for testing. We use them because, inside the library, the actions that pass through the epics are created with these types, and since we will be comparing the two they need to be created in exactly the same way.

Let’s try to test our first example.

Here we define two things: an action$, which is an observable of the action that triggers the epic, and the expected variable, which is the array of actions that we expect the epic to return. Furthermore, we mock the axios resolve (fetch, axios, whatever you use for HTTP requests) with a mocked response. Lastly, and this is where it gets interesting, we pass that action$ to the epic we want to test, we transform the results of that epic into an array with .pipe(toArray()), and then we turn it into a promise with .toPromise() so we can wait for it to finish with async/await. Now we can easily compare what the epic returned to what we are expecting!

Here is what a test on the error of that request should look like:

With this technique, you can easily test any epic and fix the expectations of each.

I hope you liked my examples and that redux observables make a bit more sense for day-to-day use.

If you have more examples on how you use it please share on comments below.


Panagiotis Vourtsis

Head of Front End Engineering @ ORFIUM

https://www.linkedin.com/in/panvourtsis

https://github.com/panvourtsis

]]>
Introduction to Reactive Programming https://www.orfium.com/engineering/introduction-to-reactive-programming/ Sat, 09 Feb 2019 08:35:00 +0000 http://52.91.248.125/introduction-to-reactive-programming/


As programmers, many of us have to deal with handling response data that comes from an asynchronous callback. At first, that seems to work just fine, but as the codebase gets bigger and more complicated we start wondering if there is a better way to handle that response data, and surely a better way to unit test it.

Such a better way exists and it’s called Reactive Programming.

In this post, we will present a brief introduction to reactive programming and examine how it can help us build more robust applications.

What is Reactive Programming

According to Wikipedia, reactive programming is a declarative programming paradigm concerned with data streams and the propagation of change.

Well, that may sound a bit complicated and doesn’t really help us understand what Reactive Programming really is.

So in order to keep it sharp and clear, we can describe reactive programming as programming with asynchronous data streams that a consumer observes and reacts to them.

For sure, this is a much easier way to perceive it. However, we should elaborate on a small but crucial detail that differentiates this type of programming; the data streams and their meaning.

In a few words, streams are a sequence of events over time. That’s it, so simple.

For example, a stream could be a user’s action to fill a form on a website. Each keypress is an event that provides us with information on what the user is typing.  A submit button click is an event that notifies us when the user submitted his form and consequently when the stream completed.

Why Reactive Programming

Reactive Programming has gained a lot of popularity in recent years. You have probably read articles or attended lectures about it at conferences or local meetups. You may, though, still wonder why you should use it and how it would help you.

The basic concept of Reactive Programming is that everything can be a stream of data. An API response, a list of rows from a local database, an Enum state class, even a button click can be represented as a stream of data able to be observed by a consumer. This stream can be filtered, transformed, combined with another stream or even shared with multiple consumers. Every stream runs on its own non-blocking thread, allowing us to execute multiple code blocks simultaneously and turning parts of our application asynchronous.

How to use Reactive Programming

Reactive Extensions or Rx is an implementation of Reactive Programming principles following Observer pattern, Iterator pattern and Functional Programming. In this way, Rx abstracts away any concerns about threading, synchronization, thread safety or concurrent data structures.

The two key components of Rx are Observables and Subscribers, or Observers. An observable is a class that emits the stream of data. A subscriber is a consumer that subscribes, receives and reacts to the data emitted by an Observable.

We can think of observables as event emitters that emit three events: onNext, onComplete, onError. The common flow of an observable is to emit the data, using the onNext method, and then call the onComplete method, or the onError method if an error has occurred. When an observable calls onComplete or onError, it gets terminated. Observables are grouped into hot and cold observables; you can read more about this here.

A subscriber is a class that can observe these three events, by implementing these methods and then calling the subscribe method on the observable.

Rx Operators

By this point, Rx seems quite simple and may not sound useful enough to start using it. However, this is where the interesting part begins: the operators. Operators are one of the most important features of Rx. An operator is a function that takes one observable as its first argument and returns another observable. An operator allows us to take an observable, transform it and change its behavior according to our needs. Using operators we can filter, map or cache the data emitted by an observable, combine two or more observables, chain different observables, specify the scheduler on which an observer will observe an observable, create, delay, repeat or retry an observable, and much more. As we can see, the true power of Rx comes from its operators. Here you can view the whole list of Rx's operators.

Show me some code

Let’s dive into some code to discover, through a real-world example, the true power of Rx.

In our example, we will borrow the paradigm of a shopping application. When the user enters their shopping cart, the flow described below must be executed:

  • Request to API to get user’s cart items
  • Show loading indicator
  • Filter the items that have zero availability
  • Show items to the user
  • Request to API to calculate total price from filtered items
  • Show total price to the user
  • Hide loading indicator

In the code below, we can see how we can execute that flow, with few lines of code, using Rx.

apiManager.getCartItems()
  .doOnSubscribe { showIndicator() }
  .flatMap { items -> Observable.fromIterable((items)) }
  .filter { item -> item.availability > 0 }
  .toList()
  .flatMap { items ->
      showItems(items)
      apiManager.calculateTotal(items)
  }
  .doAfterTerminate { hideIndicator() }
  .subscribe({ price -> showPrice(price) }, { error -> showError(error) })

Looks pretty cool, doesn't it? Well, now imagine creating this with the use of old-school callbacks. What a nightmare it would be if any debugging were needed!

Overview

Rx is a very powerful tool that helps us write clean, robust and scalable applications, plus it tends to be a must for mobile and frontend engineers. Some may argue that familiarizing yourself with this technology takes some time. It is, though, undeniable that once you reach that point, there is no way back.

How we use it

At ORFIUM we use Rx all over our new mobile applications. Rx is a big part of our new music player, leveraging the abilities of hot observables. Using subjects we can emit the latest state of the media player, allowing our views and classes to observe it and react according to its state, without needing a reference to the media player itself. Excited by how helpful Rx has proved to be for us and the products we develop, we decided to share our experience with the community and open-source a part of our music player on Android and iOS. Stay tuned for our next posts, where we'll elaborate further on it!


The article was written by Dimitris Konomis.

]]>