Orfium partners with PRS to expand Music Licensing in Africa https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/orfium-partners-with-prs-to-expand-music-licensing-in-africa/ Thu, 20 Apr 2023 13:59:31 +0000 https://orfium.com/?p=4677

This week, we announced our partnership with PRS for Music, the leading music rights organization. This collaboration is set to revolutionize the music industry in Africa as PRS expands its licensing coverage to music users based in Africa supported by Orfium’s licensing and technology infrastructure. This will provide access to tens of millions of works, including many of the most successful songs and compositions of today and the last century.  The thriving African music industry is set to experience a boost in innovation, efficiency, and growth through this collaboration, which will ultimately benefit the talented songwriters, composers, and publishers across the continent as well as the music users.

Orfium will be licensing the PRS repertoire and providing the technology infrastructure needed to efficiently serve the African music market. The partnership marks a significant milestone, paving the way towards equitable remuneration for talented songwriters, composers, and publishers across Africa for their creative work.

This partnership is expected to bring about significant benefits, including increased speed to market for music creators, improved music discovery, and cost efficiencies.

An exciting aspect of this partnership is the expansion of PRS for Music’s Major Live Concert Service, the royalty collection service for large concerts, which will now be available for events held across Africa. This is a significant step in expanding the global reach of PRS for Music’s services and supporting music creators in Africa. 

In addition to PRS’s existing agreement with SAMRO, the collecting society based in South Africa, the partnership with Orfium will provide a framework for PRS members to be paid when their works are used in some of the world’s fastest-growing music markets. This is a crucial development that ensures songwriters, composers, and publishers are fairly compensated for their music across the African continent.

“We’re incredibly excited to partner with PRS for Music. Orfium exists to support and improve the global entertainment ecosystem so that creators everywhere can be paid fairly for their work. Over the last three years, we have invested heavily in building a state-of-the-art rights management platform to support our partners in the licensing and remuneration of music rights in the entertainment industry. Orfium looks forward to working with PRS as their trusted partner to support this incredible region and contributing to Africa’s future as a high-growth music market.” Rob Wells, CEO, Orfium

Stay tuned to the latest news and updates from Orfium. Subscribe to our newsletter below.

ORFIUM’s BI Toolkit and Skillset https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/orfiums-bi-toolkit-and-skillset/ Wed, 19 Apr 2023 15:58:30 +0000 https://orfium.com/?p=4681

We’re back

As described in a previous blog post from ORFIUM’s Business Intelligence team, the set of tools and software we use has varied over time. When the team was two people strong, the list of software was short; the list currently in use is much longer and more sophisticated.

2018-2020

As described previously, the two people on the BI team were handling multiple types of requests, both from internal stakeholders within the company and from ORFIUM’s customers. For the most part they were dealing with data visualization, along with a smaller share of data engineering and some data analysis.

Since the team was just the two of them, tasks were more or less divided into engineering and analysis versus visualization. As you can guess, combining data from Amazon Athena with Google Spreadsheets or ad-hoc CSVs involved a lot of Python scripting. Data were retrieved from these various sources and, after some (more often than not complex) transformations and calculations, the final deliverables were CSVs to either send to customers or load into a Google Sheet. In the latter case a simple pivot table was also bundled with the deliverable, in order to jump-start any further analysis by the (usually internal) customer.
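
To give a flavor of what those scripts looked like, here is a rough, simplified sketch. It is purely illustrative: the query, table names and file paths are hypothetical, and pyathena is just one common way to pull Athena results into pandas, not necessarily the exact libraries we used back then.

import pandas as pd
from pyathena import connect  # pip install pyathena

# Pull a result set from Athena into a DataFrame.
conn = connect(
    s3_staging_dir="s3://example-athena-results/",  # hypothetical staging bucket
    region_name="us-east-1",
)
revenue = pd.read_sql("SELECT asset_id, month, revenue FROM example_db.revenue", conn)

# Combine with an ad-hoc CSV, e.g. an export of a Google Sheet.
mapping = pd.read_csv("client_asset_mapping.csv")  # hypothetical export
merged = revenue.merge(mapping, on="asset_id", how="left")

# The "complex transformations" step, reduced here to a single pivot.
summary = merged.pivot_table(index="client", columns="month", values="revenue", aggfunc="sum")

# Final deliverable: a CSV to send to the customer or load into a Google Sheet.
summary.to_csv("monthly_revenue_summary.csv")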

In other cases, where the customer requested graphs or a whole dashboard, the BI team used Amazon Athena’s SQL editor to run the exploratory analysis and, once the proper dataset for analysis was eventually discovered, saved the results to a separate schema in Athena itself. The idea behind that approach was to take advantage of the integration between Amazon’s own tools, so we delivered our solutions through Amazon QuickSight. At the time this seemed like the decision that would produce deliverables fastest, though not the most polished or the most scalable ones.

QuickSight offers very good integration with Athena, as both sit under the Amazon umbrella. To be completely honest, though, the BI Analyst’s working experience at that point was not optimal. From the consumer side, the visuals were functional but not particularly attractive, and from ORFIUM’s perspective a number of full AWS accounts was needed to share our dashboards externally, which created additional cost.

This process changed when we decided to evaluate Tableau as our go-to solution for data visualization. One of the two BI members at the time leaned strongly towards Tableau and decided to pitch it. An adoption proposal, eventually approved by ORFIUM’s finance department, brought Tableau into our quiver, and it soon became our main tool of choice for data visualization. It allows management to make better, more informed decisions, and it showcases the value our company can offer to current and potential future clients.

This part of BI’s evolution led to the deprecation of both QuickSight and Python usage: pure SQL queries and DML were developed to create tables within Athena, and some custom SQL queries were embedded in the Tableau connection to the data warehouse. We focused on uploading ad-hoc CSVs or data from Google Sheets to Athena, and from there the almighty SQL took over.

2021- 2022

The team eventually grew larger and more structured, and the company’s data vision shifted towards Data Mesh. Inevitably, we needed a new and extended set of software.

A huge initiative to migrate our whole data warehouse from Amazon Athena to Snowflake started, with BI’s main data sources playing the role of early adopters. The YouTube reports were the first to be migrated, and shortly after, the Billing reports were created in Snowflake. That was it: the road was open and well paved for the Business Intelligence team to start using the vast resources of Snowflake and begin building the BI layer.

A small project to migrate our code so that it used the proper source and created the same tables Tableau expected from us turned into a large project of fully restructuring the way we worked. In the past, the Python code used for data manipulation and the SQL queries for creating the datasets to visualize were stored, respectively, in local Jupyter notebooks and either within view definitions in Athena or in Tableau data source connections. There was no real version control; there was a GitHub repo, but it was mainly used as code storage for ad-hoc requests, with limited focus on keeping it up to date or explaining the reasoning behind updates. There were no feature branches, and almost all new commits on the main branch added new ad-hoc files to the root folder using the default commit message. This situation, despite being a clear pain point for the team’s efficiency, turned into a huge opportunity to scrap everything and start working properly.

We set up a working guide for our Analysts: training on the usage of git and GitHub, working with branches, pull request templates, commit message guidelines, and SQL formatting standards, all deriving from the concept of having an internal Staff Engineer. We started calling the role Staff BI Analyst, and we currently have one person in it setting the team’s technical direction. We’ll discuss this role further in a future blog post.

At the same time, we were exploring how to combine tools so that the BI Analysts could focus on writing proper, efficient SQL queries without either being fully dependent on Data Engineers to build the infrastructure for data flows, or needing Python knowledge to create complex DAGs. dbt and Airflow surfaced from our research and, frankly, the overall hype, so we decided to go with the combination of the two.

Initially the idea was to just use Airflow, with an elegant loop that would scan the dags folder and, using folder structures and naming conventions for the SQL files, turn each subfolder of the dags folder into a DAG in the Airflow UI. Only a SnowflakeOperator would be needed: each file in a folder would become a SnowflakeOperator task, and the dependencies would be handled by the naming convention of the files. So, practically, a given folder structure would automatically produce a dynamic DAG.
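
The actual loop is not reproduced in this post, but a minimal sketch of the idea looks roughly like the following (folder layout, connection id and schedule are made up for illustration):

import os
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

SQL_ROOT = "/opt/airflow/dags/sql"  # hypothetical: each subfolder becomes one DAG

for dag_name in sorted(os.listdir(SQL_ROOT)):
    folder = os.path.join(SQL_ROOT, dag_name)
    if not os.path.isdir(folder):
        continue

    dag = DAG(
        dag_id=f"analytics_{dag_name}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    )

    previous = None
    # Naming convention (e.g. 01_stage.sql, 02_join.sql) defines task ordering.
    for sql_file in sorted(f for f in os.listdir(folder) if f.endswith(".sql")):
        with open(os.path.join(folder, sql_file)) as fh:
            task = SnowflakeOperator(
                task_id=sql_file[:-4],  # strip the ".sql" suffix
                sql=fh.read(),
                snowflake_conn_id="snowflake_default",
                dag=dag,
            )
        if previous is not None:
            previous >> task  # chain tasks in filename order
        previous = task

    # Registering the DAG object in the module globals lets Airflow discover it.
    globals()[dag.dag_id] = dag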

No extra Python knowledge needed, no Data Engineers needed: just store the proper files with the proper names. We also briefly experimented with DAGfactory, but we soon realized that Airflow should just be the orchestrator of the analytics tasks, and the analytics logic itself should be handled by something else. All of this was abandoned soon after, when dbt was fully onboarded to our stack.

Anyone who works in the data and analytics field must have heard of dbt, and if they haven’t already, they should. That is why there is nothing particularly innovative to describe about our dbt usage. We started using dbt early in its development, first installing v0.19.1, and after an initial setup period with our Data Engineers we combined Airflow with dbt Cloud for our production data flows and the core dbt CLI for local development. Soon after that, in some of our repos, we started using GitHub Actions to schedule and automate runs of our data products.

All of the BI Analysts on our team are now expected to attend the courses in the Learn Analytics Engineering with dbt program offered at dbt Learn. The dbt Analytics Engineering Certification Exam remains optional. Either way, we are all fluent with the documentation and the Slack community. Generic tests dynamically created through yml, alerts in our instant messaging app whenever a DAG fails, and snapshots are just some of the features we have built to help the team. As mentioned above, our Staff BI Analyst plays a leading role in creating this culture of excellence.
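
Our exact alerting setup is not shown here, but the “ping the team when a DAG task fails” idea boils down to a pattern like the following: an Airflow on_failure_callback that posts to an incoming webhook of the messaging app (the URL below is a placeholder):

import requests

WEBHOOK_URL = "https://chat.example.com/hooks/bi-alerts"  # placeholder, not a real endpoint

def notify_failure(context):
    """Called by Airflow with the task context when a task instance fails."""
    ti = context["task_instance"]
    message = (
        f"DAG {ti.dag_id}, task {ti.task_id} failed "
        f"(run {context['run_id']}). Logs: {ti.log_url}"
    )
    requests.post(WEBHOOK_URL, json={"text": message}, timeout=10)

# Typically attached through default_args so every task inherits it, e.g.:
# default_args = {"on_failure_callback": notify_failure}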

There it was. We embraced the analytics engineering mindset, reversed the ETL and implemented ELT, finally removing the absolute dependency on Data Engineers. It was time to enjoy the fruits of Data Mesh: Data Discovery and Self-Service Analytics.

2023-beyond

Having implemented more or less all of ORFIUM’s Data Products on Snowflake with proper documentation, we just needed to proceed to the long-awaited data democratization. Two key pillars of democratizing data are making it discoverable and making it available for analysis by non-BI Analysts too.

Data Discovery

As Data Mesh principles dictate, each data product should be discoverable by its potential consumers, so we needed to find a technical way to make that possible.

We first needed to ensure that data were discoverable, so we started testing tools for data discovery. Among the ones tested was Select Star, which turned out to be our final choice. At the time we were evaluating tools, Select Star was still early in its evolution, so after realizing our sincere interest they invested in building a strong relationship with us, consulting us closely when building their roadmaps and communicating very frequently to get our feedback as early as possible. The CEO herself, Shinji Kim, attended our weekly call, helping us make not just our data discoverable to our users, but the tool itself easy to use, in order to increase adoption.

Select Star offered most of the features we knew we wanted at that time, and it offered a quite attractive pricing plan which went in line with our ROI expectations.

Now, more than a year after our first implementation, we have almost 100 active users on Select Star, which is a large part of the internal data consumer base within ORFIUM, given that we have quite a large operations department whose people do not need to access data or metadata.

We are looking to make it the primary gateway to our data. Every analysis, even an early idea, should start with Select Star to explore whether the data exists.

Now, data discovery is one thing, and documentation coverage is another. There is little point in making it easy for everyone to search table and column names if those names carry no context. We need to add metadata to our tables and columns so that Select Star’s search results parse that content too and provide all available information to seekers. Working in this direction, we have established in the Definition of Done for any new table in the production environment a clause that the table and its columns must be documented. Documentation for the table should include not only technical details such as primary and foreign keys, level of granularity, and expected update frequency, but also business information such as the source of the dataset, as this varies between internal and external producers. Column documentation is expected to include expected values, data types and formats, but also business logic and insight.

The Business Intelligence team uses pre-commit hooks to ensure that every table we produce contains descriptions for the table itself and all of its columns, but we cannot always be sure of what is going on in other data products. As data culture ambassadors (more on that in a separate post too), BI has set up a documentation coverage monitoring dashboard to quantify the docs coverage of tables produced by other products, raising alerts when the coverage percentage falls below the pre-agreed threshold.
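
The hook itself is not reproduced here, but the kind of check it performs can be sketched as follows: scan the dbt schema yml files and fail if any model or column is missing a description (the models/ layout below is an assumption):

import sys
from pathlib import Path

import yaml  # pip install pyyaml

def find_missing_descriptions(models_dir="models"):
    """Return a list of models/columns in dbt schema files that lack a description."""
    problems = []
    for schema_file in Path(models_dir).rglob("*.yml"):
        content = yaml.safe_load(schema_file.read_text()) or {}
        for model in content.get("models", []):
            if not model.get("description"):
                problems.append(f"{schema_file}: model {model.get('name')} has no description")
            for column in model.get("columns", []):
                if not column.get("description"):
                    problems.append(
                        f"{schema_file}: column {model.get('name')}.{column.get('name')} has no description"
                    )
    return problems

if __name__ == "__main__":
    issues = find_missing_descriptions()
    print("\n".join(issues))
    sys.exit(1 if issues else 0)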

Tags and Business and Technical owners are also implemented through Select Star, making it seamless for data seekers to ask questions and start discussions on the tables with the most relevant people available to help.

Self-Service Analytics

The whole Self-Service Analytics initiative at ORFIUM, as well as Data Governance, will get their very own blog posts. For now, let’s focus on the tools used.

With all ORFIUM Data Products accessible on Snowflake and discoverable through Select Star, we were in a position to launch the Self-Service Analytics project. Decentralizing data requests away from BI was necessary in order to scale, but we could not just tell our non-analysts “the data is there, knock yourself out”.

We had to decide if we wanted Self-Service Analysts to work on Tableau or if we could find a better solution for them. It is interesting to tell the story of how we evaluated the candidate BI tools, as there were quite a few on our list. We do not claim this is the only correct way to do this, but it’s our take, and we must admit that we’re proud of it.

We decided to create a BI tool evaluation framework. First, we outlined the main pillars on which we would evaluate the candidate tools. We then anonymously voted on the importance of those pillars, averaging the weights and normalizing them, and ended up with a total of 9 pillars and 9 respective weights (summing to 100%). The list of pillars includes connectivity effectiveness, sharing effectiveness, graphing, and exporting, among other factors.

These pillars were then broken down into small test cases with which we would assess performance on each pillar, not forgetting to assign weights to these cases too, so that they sum to 100% within each pillar. Long story short, we ended up with 80 points on which to assess each BI tool.

We needed to be as impartial as possible on this, so we assigned two people from the BI team to evaluate all 5 tools involved. Each BI tool was also evaluated by 5 other people from within ORFIUM but outside BI, all of them potential Self-Service Analysts.

Collecting 3 evaluations for each tool, averaging the scores, and then weighting them with the agreed weights led us to an amazing radar graph.
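
The arithmetic behind that graph is simple enough to sketch. The pillar names, weights and scores below are invented purely to show the calculation, not our real evaluation data:

def tool_score(pillar_weights, case_weights, evaluations):
    """evaluations: {pillar: {case: [score per evaluator]}}, scored on a 0-10 scale."""
    total = 0.0
    for pillar, pillar_weight in pillar_weights.items():
        # Average each case across evaluators and weight it within the pillar...
        pillar_score = sum(
            case_weights[pillar][case] * (sum(scores) / len(scores))
            for case, scores in evaluations[pillar].items()
        )
        # ...then weight the pillar itself.
        total += pillar_weight * pillar_score
    return total

pillar_weights = {"connectivity": 0.6, "sharing": 0.4}  # normalized to sum to 1
case_weights = {
    "connectivity": {"snowflake_native": 0.7, "csv_upload": 0.3},  # sums to 1 per pillar
    "sharing": {"public_links": 1.0},
}
evaluations = {
    "connectivity": {"snowflake_native": [8, 9, 7], "csv_upload": [6, 6, 7]},
    "sharing": {"public_links": [9, 8, 8]},
}
print(round(tool_score(pillar_weights, case_weights, evaluations), 2))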

Though one tool was the clear winner in almost all pillars, it performed very poorly in the last pillar, which covered cost per user and ease of use/learning curve.

We decided to go with the blue line on that graph, which was Metabase. We found that it would serve more than 80% of the current needs of Self-Service Analysts, at very low cost and with almost no code at all. In fact, we decided (Data Governance had a say in this too) not to allow users to write SQL queries in Metabase to create graphs. We wanted the people who write SQL to do so in the Snowflake UI; they were few and SQL-experienced, usually backend engineers.

We wanted Self-Service Analysts to use the query builder, which simulates an adequate subset of SQL features, in order to avoid coding altogether. If they got accustomed to the query builder, they could cover 80% of their needs with no SQL, and the rest of the Self-Service Analysts (the even less tech-savvy) would be inspired to try it out too.

After ~10 months of usage (on the self-hosted open-source version costing zero dollars per user per month, which translates to *calculator clicking* zero dollars total) we have almost 100 Monthly Active Users and over 80 Weekly Active Users, and a vibrant community of Self-Service Analysts looking to get more value from the data. The best news is that the Self-Service Analysts are becoming more and more sophisticated in their questions. This is solid proof that, over the course of 10 months, they have greatly improved their own data analysis skills, and with them the effectiveness of their day-to-day work.

Within those (on average) 80 WAUs, the majority are Product Owners, Business Analysts, Operations Analysts, and so on, but there are also around five high-level executives, including a member of the BoD.

Conclusion

The BI team, and ORFIUM itself, have evolved over the past few years. We started with Amazon Athena and QuickSight and, after a stretch of the journey with Python by our side, we have established Snowflake, Airflow, dbt and Tableau as the BI stack, while adding Select Star for Data Discovery and Metabase for Self-Service Analytics to ORFIUM’s stack.

More on these in upcoming posts: we have plenty more insights to share about the Self-Service initiative, the Staff BI role, and the Data Culture at ORFIUM.

We are eager to find out what the future holds for us, but at the moment we feel future-proof.

Thomas Antonakis

Senior Staff BI Analyst

LinkedIn

Orfium’s BI journey: From 0 to Hero https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/orfiums-bi-journey-from-0-to-hero/ Wed, 19 Apr 2023 15:35:15 +0000 https://orfium.com/?p=4685

Foreword

This is the story of Business Intelligence and Analytics at Orfium, from before there was a single team member or a team within the company, to today, when we have a BI organization that scales, data goals, and exciting plans for future projects.

Our story is a long one, and one we’re enthusiastic about telling. We’ll go through the timeline of our journey up to the present day, and we fully plan to elaborate on the main points discussed here in their own articles.

We hope you’ll enjoy the ride. Buckle up, here we go!

Where we started

Before we formally introduced Business Intelligence to Orfium, there were a few BI-adjacent functions at the company. A number of Data Engineers, Operations Managers, and Finance execs created some initial insights with manual data pipelines.

These employees primarily gathered insights on two main parts of the business.

1. Operations Insights – Department and Employee Performance

To get base-level information on the performance of departments and specific employees, a crew of Data Engineers and Operations Managers with basic scripting skills came together and put together a Python script, which was not without bugs. The script pulled data from CSV exports from our internal software, as well as exports from AWS S3 provided by the DE teams, and joined them to produce a final table. All the transformations were performed within the script. No automation was initially required, as our needs were mostly for data on a monthly basis. The final table would then be loaded into Excel and analyzed through pivot tables and graphs.
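
Purely for illustration, the shape of that workflow was roughly the following (file names, columns and the S3 path are made up; reading straight from S3 with pandas needs the s3fs package, and writing Excel needs openpyxl):

import pandas as pd

internal = pd.read_csv("internal_tool_export.csv")  # CSV export from our internal software
s3_export = pd.read_csv("s3://example-bucket/ops/claims_export.csv")  # export provided by the DE teams

monthly = (
    internal.merge(s3_export, on="employee_id", how="left")
    .assign(month=lambda df: pd.to_datetime(df["date"]).dt.strftime("%Y-%m"))
    .groupby(["department", "employee_id", "month"], as_index=False)
    .agg(items_processed=("claim_id", "count"))
)

# The final table was then loaded into Excel and analyzed with pivot tables and graphs.
monthly.to_excel("monthly_performance.xlsx", index=False)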

This solution provided some useful insights. However, it certainly couldn’t scale along with the organization, and a few problems came up along the way that it could not solve. Not least, our need for more frequent updates of daily data, for visibility into historical performance, and for joining with important data from other sources were all reasons why Orfium sought a bigger, smarter and more scalable approach to BI.

2. Clients Insights

Data Engineers put together a simple dashboard on Amazon QuickSight to give clients insights into the revenues we were generating for them. The data flowed from AWS S3 tables they had created, and the dashboard displayed bar charts of revenues over time, with some basic filtering. It was maintained for a couple of years but was ultimately replaced in March 2022 with a more comprehensive solution provided by the BI team (spoiler alert: we created a BI team).

A new BI era 

In light of some of the issues mentioned above, a small team of BI Analysts was assembled to help with the increasing needs of the Operations team.

The first decision the BI Analysts made was which tools to use for visualization and for the ETL process. Nikos Kastrinakis, Director of Business Intelligence, had worked with Tableau previously, so he ran a demo and trial with Tableau and ultimately convinced the team to use it as our visualization tool, with Tableau Prep as our ETL tool. The company was by then storing all relevant data in AWS S3, and the Data Engineers used AWS Athena to create views that transformed the data into usable tables that BI could join in Tableau Prep.

During the Tableau trial, the BI team started working on the first dashboard, set to be used by the Operations department, replacing the aforementioned buggy script. We created one Dashboard to rule them all, with information on overall department performance, employee performance, and client performance. This gave users their first taste of the power of a BI team. Our goal was to answer many of their questions in one concise dashboard, complete with historical breakdowns of different types of data, and bring insights that users hadn’t seen before. The Tableau trial ended right before the Dashboard was set to launch. So of course, we purchased our first Tableau licenses for Orfium and onboarded the initial users with the launch of this Dashboard. 

It was a huge success! The Operations team was able to phase out their use of the script, stop wasting time monthly to generate reports for themselves, and were exposed to a new way of gaining insights.

Our work with the Operations team didn’t stop there. Over the following months we continued to work with this data stack and created further automation to bring daily data to the Operations team so they could manage departments and employees in near real time. But this introduced some new challenges we had to face.

With the introduction of daily estimated data, the frequency of updates and the size of the views made the extracts unusable and obsolete, so we had to face the tradeoff of data freshness VS dashboard responsiveness. Most of the stakeholders were happy to wait 30 seconds more when they looked at their dashboards, knowing that they had the most up-to-date data possible. Operations needed to be more agile in their decisions and actions, so having fresh data was very important for them. To date, members of the Operations team remain the most active users of Tableau at Orfium and have been active participants in other data initiatives across the company.

The reception of these initial dashboards was amazing. The stakeholders could derive value and make smarter decisions faster, so the BI team gained confidence and trust. However, the BI team was still mainly serving the Operations department (with some requests completed for Clients and Corporate insights) but was starting to get many requests from Finance, Products, and other departments. We began to add additional BI Analysts to serve these needs. However, this was just the beginning of the creation of a larger team that could serve more customers more effectively, as we also began improving internal tech features and utilizing external solutions for ready-made software.

Where we are today

We had many questions to clear up: 

Where to store our data, how to transform them, who is responsible for these transformations, who is responsible for the ready and delivered data points, who has access and how do they get it, where do we make our analyses, how do data move around platforms and tools, how do our data customers discover our work?

Months and months of discussions between departments on all these questions led to a series of decisions and commitments about our strategic data plans.

Where we stand now is still a transition from the previous stage, as we decided to take a giant step forward by embracing the Data Mesh initiative. We’ll have the chance to talk about some of the terms and combinations of software we’re about to mention in future blog posts, but we can run through the basics right now.

Our company is growing very quickly and, given the fact that we prefer being Data-(insert cliche buzzword here), the needs and requests for BI and Data analysis are growing at double the speed.

The increase in the number of BI Analysts was inevitable, given the growing number of requests and the addition of new departments that needed answers to their data questions.

By hiring more BI Analysts, we split our workforce between our two main Data customers, and thus created two BI Squads.

One is focused on finance and external client requests. We named it the Corporate squad, and it consists of a BI Manager and 2 BI Analysts. This is the team that prepares the monthly Board meeting presentation materials (P&L, Balance Sheets), and the dashboards shared with our external customers so that we can use data to demonstrate the impact of our work on their revenue and so that they get a better understanding of their performance on YouTube. This squad also undertakes many urgent ad-hoc requests on a monthly basis. This squad has a zero tolerance policy for mistakes and usually works on a monthly revision/request cycle.

The second squad is more focused on analyzing and evaluating the performance and usage of our internal products, and connecting that information with the performance of our Operations teams, which generate the largest portion of our revenue. This squad, which happens to work from two different time zones, again consists of a BI Manager and 2 BI Analysts, and has more frequent deadlines, as new features come up very fast and need evaluation. The nature of the data and the continuous evolution of the data model result in less robust data.

In the meantime, we had already realized that we had to bulletproof our infrastructure and technical skills before scale caught up with us. We decided to have some team members focused on delivering value by creating useful analyses, as described above, but we also reserved time and people to focus on paving the way for the rest of the analysts to create more value, more efficiently.

We researched the community’s thinking on this and found the term Analytics Engineer, which seemed very close to what we were looking for. We thought this would be very important for the team and decided to go one step further and create a separate role, the equivalent of the Staff Engineer on software engineering teams. This role focuses on setting the technical direction of the department, researching new technologies, consulting on the way projects should be driven, and enforcing best practices within the Analytics Chapter of the Data Unit. Quality, performance, and repeatability are the three core values the code produced by this department should have.

We currently have a team of 8 people: the BI Director, two squads, each with one BI Manager and two BI Analysts, plus the Staff BI Analyst.

In terms of skillsets, we left Python behind. Instead, we focus on writing reliable and performant SQL and collaborating efficiently on git as a team. Our new toolkit was also more or less co-determined by our endorsement of the Data Mesh, which is currently hosted on Snowflake. Nothing is hosted or processed locally anymore: we develop data pipelines in SQL, apply them to our dev/staging/production environments through dbt, and orchestrate scheduling and data freshness using Airflow. We are the owners of our own data product, Orfium’s BI layer. It is a schema in Orfium’s production database where we store the fresh, quality, documented data resulting from our processes. This set of tables connects data from other teams’ data products (internal products, external reports, data science results) and creates interoperating tables. These tables are the base for all our Tableau dashboards and help other teams use curated data without having to reinvent the wheel of the Orfium data model on their own. Our data product and our Tableau sites, with all of our dashboards, are fully documented and enriched with metadata, so that our data discovery tool, Select Star, lets stakeholders search and find every aspect of our work.

The future

Data Mesh was a big bet for Orfium and we will continue to build on it. The principles are in place, and we are in the process of onboarding all departments onto the initiative so that we can take advantage of the outcomes to the fullest extent. When this hard process is complete, all teams will enjoy centralized data and the interoperability that derives from it, along with domain-driven data ownership ensuring the agreed levels of data quality, and it will help Orfium become more data-powered.

In addition to the obvious outcomes of applying the Data Mesh principles in a company, we believe that we need to follow up with two more major bets.

We decided to initiate, propose and promote Data Culture in Orfium. This is a very big project and is so deep that all employees need to get out of their comfort zones to achieve it. We need to change the way we work, to start planting data seeds very early in all our projects, products, initiatives, and working behavior so that we can eventually enjoy the results later on. This initiative will come with a Manifesto, which is being actively written and soon will be published. It will require commitment and follow-up on the principles proposed so that we achieve our vision.

Self-Service Analytics is also one of the principles Data Mesh is based on, and we decided to move forward emphatically with it too. Data will be generally accessible on Snowflake by everyone, but data analysis requires data literacy, SQL chops, and infrastructure that can host large amounts of data in an analysis. We decided to use Metabase as the proxy and facilitator for Self-Service Analytics. It provides the infrastructure by analyzing the data server-side rather than locally, and its query builder for creating questions is an excellent tool for no-code analyses. Certainly it is not as customizable as SQL, but it will cover 85% of business users’ needs with far better usability for non-technical users.

This leaves us with data literacy and consultancy. For this, we have set up a library of best practices, examples, tutorials, and courses explaining how to handle business questions, analyses, limitations of tools, and so on. At Orfium we always want to go a step further, though, and we have been working to formulate a new role that will provide in-depth consultancy on data issues. This role will act as a general family doctor, whom you know personally and trust, and who will handle all incoming requests about data problems. Even if the data doctor cannot directly help you, they can refer you to a more suitable “doctor”: a set of more specialized experts, each in their own data sector (infrastructure, SQL, data visualization, analysis requirements, you name it).

To infinite data and beyond

What a journey this has been over the last 3-4 years for BI in Orfium! We have gone through a lot, from not having official Business Intelligence to a BI team that has plans, adds value for the organization and inspires all teams to embrace the data-driven lifestyle. We’ve done a great job so far, and we have great plans for the future too.

It’s a long way to the top, if you want to rock and roll – AC/DC ⚡

Stephen Dorn

Senior Business Intelligence Analyst

LinkedIn

Thomas Antonakis

Senior Staff BI Analyst

LinkedIn

ORFIUM acquires Soundmouse to unlock even more value across the entertainment landscape https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/orfium-acquires-soundmouse-to-unlock-even-more-value-across-the-entertainment-landscape/ Mon, 16 Jan 2023 09:56:55 +0000 https://orfium.com/?p=4689

Today, we’re delighted to share that ORFIUM has acquired Soundmouse, bringing together the global market leaders in digital music and broadcast rights management and reporting.

Who is Soundmouse?

Soundmouse is a global leader in music cue sheet reporting and monitoring for the broadcast and entertainment production space. They share our vision to revolutionize digital music and broadcast rights management and will join ORFIUM to deliver even more benefits to creators, rights holders, broadcasters and collecting societies with cutting edge technology and industry expertise.

Soundmouse has set the global standard for cue sheet and music reporting around the world, connecting all stakeholders in the reporting process including broadcasters, producers, collecting societies, distributors, program makers and music creators themselves. It works for major broadcasters, media companies and streaming platforms.

Why is ORFIUM acquiring Soundmouse?

Bringing Soundmouse into the ORFIUM family, we’re moving to a place where we can serve the entire entertainment ecosystem across mainstream and digital media. By connecting creators, rights holders, and music users, we can deliver even more value to stakeholders across the board.

Combining Soundmouse’s leadership in cue sheet management and monitoring for the broadcast and entertainment production space and ORFIUM’s expertise in UGC tracking and claiming for publishers, labels and production music companies, we bring the worlds of digital and broadcast together in an integrated way. This will allow us to scale our product offering and expand deeper into the complex infrastructure of the entertainment industry, streamlining content creation and management for program makers, broadcasters and music rights holders.

What does the Soundmouse acquisition mean for ORFIUM?

Since 2015, ORFIUM has innovated to bring cutting edge technology to the music rights management space and has generated hundreds of millions of dollars in additional revenue for its partners, which includes top global record labels, music publishers, production music companies, and collecting societies. 

Making music easier to find, use, track and monetize across all channels is one of the core problems we’re helping to solve for the industry. Acquiring Soundmouse enables us to scale our product offering and expand deeper into the complex infrastructure of the entertainment industry, streamlining content creation and management for program makers, broadcasters and music rights holders.

What does the future look like for ORFIUM following the Soundmouse acquisition?

We’re committed to solving the entertainment industry’s most complex problems. We continue to develop technology solutions built on the latest in machine learning and AI, empowering rights owners, creators and key stakeholders to realize more value as new platforms for media consumption emerge and scale.

At a time when it has never been harder for creators, rights holders and media companies to track and monetize usage with the proliferation of new channels, platforms and the growth of the Metaverse and Web3, there is no other company in this space building and investing in technology like ORFIUM. ORFIUM is committed to delivering the technology needed to support the entertainment industry of today and the future.

Acquiring Soundmouse is a great start to the year for ORFIUM. We’re excited to welcome the Soundmouse team to join ours and integrate our combined technology, teams and expertise to bring even more value to the entertainment ecosystem.

Stay tuned, lots more still to come from ORFIUM in 2023!

The ORFIUM team

To learn more about how ORFIUM can support you in unlocking more value, contact our team today!

What’s in the box? https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/whats-in-the-box/ Wed, 28 Dec 2022 11:35:31 +0000 https://orfium.com/?p=4693

Black box testing for non-data engineers with DBT

Black box testing is a software testing method in which the functionalities of software applications are tested without having knowledge of internal code structure, implementation details, and internal paths. Let’s borrow that term and use the same analogy to test our black boxes, meaning our dbt models.

So, adapting the lexicon from software engineering, we have:

Black Box: the dbt model that plays the role of transformation

Input(s): the different tables that are used in the query. In dbt we call these sources or references

Output: the table formed after the dbt model has transformed the data

DBT and testing

DBT is a data build tool. We use it, due to its simplicity, to perform transformations in our Snowflake data cloud for analytical purposes. Not to overstate the matter, but we love it.

What, exactly, is dbt?

Building a Mature Analytics Workflow

DBT already offers dbt tests, and running them is a great way to test the data in your tables. By default, the available tests are unique, not_null, accepted_values, and relationships. We can even create custom tests, and there are a variety of extensions out there that stretch dbt functionality with additional tests, such as Great Expectations and dbt-utils. These kinds of tests examine the values in your tables, and they are a great way to identify critical data quality issues. In other words, dbt tests look at the output. What we want, however, is to test the black box: the transformation.

TDD and Data

Working with Large Tables

More often than not, the tables we need to build models upon are huge, and accessing billions of rows and performing transformations on them takes a long time. A 30-minute transformation might be acceptable as part of a production pipeline, but having to wait half an hour to develop and test the correctness of your transformation is, well, less than ideal.

Of course you are eventually going to run it against the real table, but minimizing the number of runs makes everyone happy. It also limits your Snowflake warehouse usage, which saves cost and makes the accountants happy as well.

Edge cases not covered in actual data

Another problem we often face is a dbt model that works, for all intents and purposes, for months, only for us to later discover cases we hadn’t thought of. Unsurprisingly, with billions of rows of data, covering every possible scenario is not at all easy. If only there were a way to test for those cases as well. The solution we use at Orfium is to generate mock data. The data may not be real, but it works well enough to cover our edge cases and future-proof our dbt models.

Good Tests VS Bad Tests

Writing tests for the sake of writing them is worse than not writing them at all. There, we said it. 

Let’s face it: how many times do we introduce tests on a piece of software, get excited and, thanks to the quick TDD loop, just beam with self-confidence? Before you know it, we’re writing tests that have no value at all and inventing a fantastic metric called coverage. Coverage is important, but not as a single metric; it is only a first indication and should not be a goal in itself. Good tests are the ones that provide value. Bad tests, on the other hand, only add to the debt and the maintenance burden. Remember, tests are a means to an end. To what end? Writing robust code.

Tests as a requirements gathering tool

How many times have we found ourselves sitting in a room with a stakeholder who describes a new report they need? We start formulating questions and, after some back and forth, sooner or later we reach the final requirements of the report. Happy enough after the meeting, we go to our favorite warehouse, only to discover some flaw in the original request that we didn’t think of during requirements gathering. Working in an agile environment, that’s no big issue: we schedule a follow-up meeting, reach a consensus on the edge cases, and final delivery is achieved. Wouldn’t it be better, though, if actual cases could be drafted in that first meeting? Business and engineering minds often don’t mesh well, so we can use all the help we can get.

Establishing actual scenarios of how a table could look and what the result should be helps a lot in the process of gathering requirements.

Consider the following imaginary scenario:

Stakeholder:

For our table that contains our daily revenue for all the videos, I would like a monthly summary of revenue per video for the advertisement category.

Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) revenue
from
    fct_videos_rev
where
    category = 'advertisement'
group by
    video_id,
    year(date_rev),
    month(date_rev)

Stakeholder

I would also like to see how many records the summation was comprised of.

Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) revenue,
    count(*) counts
from
    fct_videos_rev
where
    category = 'advertisement'
group by
    video_id,
    year(date_rev),
    month(date_rev)

Stakeholder

That can’t be right. Why so many counts?

Engineer

There are many rows with zero revenues, I see. You don’t want them to count towards your total count, is that right?

Stakeholder

Yes.

Engineer (gotcha):

select
    video_id,
    year(date_rev) as year,
    month(date_rev) as month,
    sum(revenue) revenue,
    count(*) counts
from
    fct_videos_rev
where
    category = 'advertisement'
    and revenue > 0
group by
    video_id,
    year(date_rev),
    month(date_rev)

Of course, this is an exaggerated example. However, imagine if the same dialog went a different way.

Stakeholder:

For our table that contains our daily revenue for all the videos, I would like a monthly summary of revenue per video for the advertisement category.

Engineer:

If the table has the form:

video_id | date_rev   | category      | revenue
video_a  | 2022-02-12 | advertisement | 10
video_a  | 2022-02-12 | advertisement | 0
video_a  | 2022-03-12 | subscription  | 15
video_a  | 2022-03-12 | advertisement | 1

Is the result you want like the following?

video_id | year | month | revenue
video_a  | 2022 | 02    | 10
video_a  | 2022 | 03    | 1

Stakeholder

I would also like to see how many records the summation was comprised of.

Engineer:

So the result you want would look like this?

video_id | year | month | revenue | counts
video_a  | 2022 | 02    | 10      | 2
video_a  | 2022 | 03    | 1       | 1

Stakeholder

Why does the first row have 2 counts?

Engineer

One of those rows has zero revenue, I see. You don’t want zero-revenue rows to count towards your total count, is that right?

Stakeholder

Yes.

Engineer (gotcha):

video_id | year | month | revenue | counts
video_a  | 2022 | 02    | 10      | 1
video_a  | 2022 | 03    | 1       | 1

And all that without writing a single line of code. Not that an engineer is afraid to write SQL queries. But really, a lot of time is lost translating business requirements into SQL queries; they are never that simple, and they are almost never correct on the first try either.

Tests so software engineers can get on board with SQL

Orfium is a company which, at the time of writing this post, has more than 150 engineers. Only 6 of those are data engineers. That might sound strange, given that we are a data-heavy company dealing with billions of rows of data on a monthly basis. That is why a new initiative called data mesh has emerged: a program we practice on a daily basis and are super proud of. One consequence of data mesh is that multiple teams handle their own instance of dbt. But this will be discussed in detail in another post. Stay tuned!

For the most part, software engineers are not familiar with writing complex SQL queries. That’s not their fault; the variety of ORM tools available means they rarely have to. However, something software engineers do know how to do very well is write tests.

In order to bridge that gap, practicing test-driven development when writing SQL is something that can help a lot of engineers get on board.

Let the fun begin

We designed a way to test dbt models (the black box). Our main drivers are:

  • Introduce as few changes as possible, so that new or mature projects can start using it without breaking existing behavior.
  • Find a way to define test scenarios and identify which of them failed.

We start by introducing the following macros:

{%- macro ref_t(table_name) -%}
    {%- if var('model_name','') == this.table -%}
        {%- if var('test_mode',false) -%}
            {%- if var('test_id','not_provided') == 'not_provided' -%}
                {%- do exceptions.warn("WARNING: test_mode is true but test_id is not provided, rolling back to normal behavior") -%}
                {{ ref(table_name) }}
            {%- else -%}
                {%- do log("stab ON, replace table: ["+table_name+"] --> ["+this.table+"_MOCK_"+table_name+"_"+var('test_id')+"]", info=True) -%}
                {{ ref(this.table+'_MOCK_'+table_name+'_'+var('test_id')) }}
            {%- endif -%}
        {%- else -%}
            {{ ref(table_name) }}
        {%- endif -%}
    {%- else -%}
        {{ ref(table_name) }}
    {%- endif -%}
{%- endmacro -%}


{%- macro source_t(schema, table_name) -%}
    {%- if var('model_name','') == this.table -%}
        {%- if var('test_mode',false) -%}
            {%- if var('test_id','not_provided') == 'not_provided' -%}
                {%- do exceptions.warn("WARNING: test_mode is true but test_id is not provided, rolling back to normal behavior") -%}
                {{ builtins.source(schema,table_name) }}
            {%- else -%}
                {%- do log("stab ON, replace table: ["+schema+"."+table_name+"] --> ["+this.table+"_MOCK_"+table_name+"_"+var('test_id')+"]", info=True) -%}
                {{ ref(this.table+'_MOCK_'+table_name+'_'+var('test_id')) }}
            {%- endif -%}
        {%- else -%}
            {{ builtins.source(schema,table_name) }}
        {%- endif -%}
    {%- else -%}
        {{ builtins.source(schema,table_name) }}
    {%- endif -%}
{%- endmacro -%}

These macros can optionally change the behavior of the built-in source and ref macros.

  • model_name: refers to the model actually being tested
  • test_mode: a flag indicating whether test mode is enabled
  • test_id: the test scenario that is going to be mocked
  • table_name (argument): the source table, which is either used as the true source or replaced (“stabbed”) with a mock of our own

Prefer multiple small test cases over few large test cases

Test cases should test something specific. Avoid generating mock data containing hundreds of records that test multiple business rules at once. Should a test case fail, it should be easy to identify the cause and its impact.

Suppose we would like to create a test_id named MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE for our model VIDEOS_INFO_SUMMARY, which uses a source VIDEOS_INFO.

We create a new folder under seeds, MOCK_VIDEOS_INFO_SUMMARY. Then:

  1. We create the input seed seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY_MOCK_VIDEOS_INFO_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE.csv, which plays the role of the input:

VIDEO_ID,DATE_REV,CATEGORY,REVENUE
video_a,2022-02-12,advertisement,10
video_a,2022-02-12,advertisement,0
video_a,2022-03-12,other,15
video_a,2022-03-12,advertisement,1
  2. We create the output seed seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE.csv, which plays the role of the output we would like to get once the model has run:

VIDEO_ID,YEAR,MONTH,REVENUE,COUNTS
video_a,2022,2,10,1
video_a,2022,3,1,1
  3. We also create a yml seeds/MOCK_VIDEOS_INFO_SUMMARY/VIDEOS_INFO_SUMMARY.yml as follows:
version: 2

seeds:
  - name: VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
    config:
      enabled: "{{ var('test_mode', false) }}"

  - name: VIDEOS_INFO_SUMMARY_MOCK_VIDEOS_INFO_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
    config:
      enabled: "{{ var('test_mode', false) }}"
Notice that the seeds are enabled only in test_mode. This allows us to skip creating them during normal runs.

  4. Now we define the test inside our yml model definition:
models:
  - name: VIDEOS_INFO_SUMMARY
    description: "Summary of VIDEOS_INFO"
    tests:
        - dbt_utils.equality:
            tags: ['test_VIDEOS_INFO_SUMMARY_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE']
            compare_model: ref('VIDEOS_INFO_SUMMARY_MOCK_RESULTS_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE')
            compare_columns:
              - VIDEO_ID
              - YEAR
              - MONTH
              - REVENUE
              - COUNTS
            enabled: "{{ var('test_mode', false) }}"
  5. Our model:
{{
    config
    (
        materialized = 'table'
    )
}}

SELECT
    VIDEO_ID,
    YEAR(DATE_REV) AS YEAR,
    MONTH(DATE_REV) AS MONTH,
    SUM(REVENUE) REVENUE,
    COUNT(*) COUNTS
FROM
    {{ source_t('MY_SCHEMA','VIDEOS_INFO') }}
WHERE
    CATEGORY = 'advertisement'
    AND REVENUE > 0
GROUP BY
    VIDEO_ID,
    YEAR(DATE_REV),
    MONTH(DATE_REV)

Notice the use of source_t instead of the default source macro.

Now, in order to run a test, we go through the following steps.

  1. Load up our seeds as:

dbt seed --full-refresh -m MOCK_VIDEOS_INFO_SUMMARY --vars '{"test_mode":true}'

  2. Then execute our model as:

dbt run -m VIDEOS_INFO_SUMMARY --vars '{"test_mode":true,"test_id":"MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE","model_name":"VIDEOS_INFO_SUMMARY"}'

  3. And then execute dbt test to check if our black box behaved as it should:

dbt test --select tag:test_VIDEOS_INFO_SUMMARY_MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE --vars '{"test_mode":true,"test_id":"MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE","model_name":"VIDEOS_INFO_SUMMARY"}'

Note: because the whole process is a bit tedious, with all those long commands to type, we wrote a bash script which automates all three steps.

The requirement is to create a file conf_test/tests_definitions.csv which has the format:

# MODEL_NAME,TEST_ID
VIDEOS_INFO_SUMMARY,MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE

  1. The script reads this file and executes all the tests defined in it, in order
  2. Executing the tests of only a specific model is supported by passing the -m flag: ./dbt_test.sh -m VIDEOS_INFO_SUMMARY
  3. Executing a specific test case is supported by passing the -t flag: ./dbt_test.sh -t MULTIPLE_VIDEOS_HAVE_ZERO_REVENUE
  4. Lines that start with # are skipped
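
The bash script itself is not included in this post; purely as a rough sketch of the behavior just described (and only that, not the actual script), an equivalent driver in Python would look like this:

import json
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

with open("conf_test/tests_definitions.csv") as fh:
    for line in fh:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are skipped
        model, test_id = line.split(",")
        test_vars = json.dumps({"test_mode": True, "test_id": test_id, "model_name": model})
        run(["dbt", "seed", "--full-refresh", "-m", f"MOCK_{model}", "--vars", json.dumps({"test_mode": True})])
        run(["dbt", "run", "-m", model, "--vars", test_vars])
        run(["dbt", "test", "--select", f"tag:test_{model}_{test_id}", "--vars", test_vars])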

In the whole setup described above there are some conventions that must be followed, otherwise the script and macros might not work:

  1. The seed folder must be named MOCK_{model_we_test}
  2. The seed which plays the role of input must be named {model_we_test}_MOCK_{model_we_stab}_{test_id}
  3. The result which plays the role of wanted result must be named {model_we_test}_MOCK_RESULTS_{test_id}

All the code exists in the following repo: https://github.com/vasilisgav/dbt_tdd_example

To see it in practice:

  • set up a tdd_example profile
  • make sure you run dbt deps to install dbt_utils
  • make the script executable chmod +x dbt_test.sh
  • and finally execute the script ./dbt_test.sh

Results

What we found by working with this approach was what you would expect from any TDD approach: it was a big win for how we release our dbt models.

Pros

  • models have grown to become quite clean, with their business logic clearly depicted
  • business rules, and especially changes to them, can easily be verified
  • business voids are identified faster
  • business requirements are gathered in a cleaner, more efficient way
  • quicker development: it may sound surprising, but since we deal with billions of rows, the fewer runs we perform on the full table, the faster development goes
  • regression tests are handled by our GitHub Actions, ensuring our models behave as expected (multiple puns here 😀 )
  • QA can happen independently of our dev work
  • Warehouse usage is limited

Cons

  • Source tables with many columns can be cumbersome to mock, although columns that are not selected in the model do not need to be defined in the mock CSVs
  • It’s somewhat difficult to get started

So, what are the key takeaways? Testing is important, but good, smart testing can truly free an organization from a lot of daily tedium and allow it, as it has for us, to focus on serving the business efficiently and with the least amount of friction.

Vasilis Gavriilidis

Senior Data Engineer @ ORFIUM

https://www.linkedin.com/in/vgavriilidis/

https://github.com/vasilisgav/

What is the shape of you, Ed Sheeran? An introduction to NER https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/what-is-the-shape-of-you-ed-sheeran-an-introduction-to-ner/ Wed, 16 Nov 2022 08:14:10 +0000 https://orfium.com/?p=4697

Introduction

He is, of course, a recording artist and a guest actor in Game of Thrones. His shape, without going into details, is pretty human. But you knew that already. You made the connection between these words and the entity they represent. However, this isn’t quite as easy and straightforward a task for a computer. Enter Named Entity Recognition (NER) to save the day. NER is essentially a way to teach a computer what words mean.

What is NER?

We can first look at the formal definition:

“NER is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.” (Wiki)

That wasn’t very helpful at first glance. Let’s try a simple example:

In 2025, John Doe traveled to Greece and visited the Acropolis, where the Parthenon is.

Given the context, we might be interested in different types of entities. If what we are after are semantics, then we simply need to understand which words signify persons, which signify places, etc.

On the other hand, in some cases we might need syntactic entities like nouns, verbs, etc.
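
To make this concrete, here is a minimal example using spaCy and one of its pretrained English models. We use spaCy here purely for illustration; the exact entity spans and labels you get depend on the model.

# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # small general-purpose English model

text = (
    "In 2025, John Doe traveled to Greece and visited the Acropolis, "
    "where the Parthenon is."
)

doc = nlp(text)
for ent in doc.ents:
    # Typically yields a DATE (2025), a PERSON (John Doe) and location-like
    # entities (Greece, Acropolis, Parthenon), though labels vary by model.
    print(f"{ent.text:15} -> {ent.label_}")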

Why NER?

Okay, now we know what NER is, but what does it do, in real life? Well, plenty of things:

  • Efficient searching: This applies to any service that uses a search engine and has to answer a large number of queries. By extracting the relevant entities from a document corpus, we can split it into smaller homogeneous segments. Then, at query time we can reduce the search space and time by only looking into the most relevant segments.
  • Recommendation systems: News publishers, streaming services and online shops are just a few examples of services that could benefit from NER. Clustering articles, shows or products by the entities they contain helps a recommendation engine deliver great suggestions to users, based on the content they prefer.
  • Research: Each year the already tremendous volume of papers, research journals and publications increases further. Automatically identifying entities such as research areas, topics, institutions, and authors can help researchers navigate through this vast interconnected network of publications and references.

Now we’re getting somewhere. We know what NER is and have a few good ideas about where it can be used. But why and where are we at ORFIUM using it?

NER applications at ORFIUM

Text matching

In some of our services we use Natural Language Processing (NLP) methodologies to match recording or composition catalogs with other catalogs, internally and externally. NER can aid this process by extracting the most relevant industry-related entities, like song titles and artists, which can then be used as features for our current algorithms and models.

Data cleaning

The great volume of data we ingest daily often contains irrelevant and superfluous information. An example of this is YouTube catalogs, where video titles usually contain more than just song titles or artist names, and might have no useful information at all. By extracting the entities most relevant to the music industry, we essentially remove the noise, which leads to better metadata as well as a more trustworthy knowledge base.

Approaches and Limitations

Depending on the context and the text structure, there are various approaches that can be employed, but they usually are grouped into two general categories, each with its own strengths and drawbacks: rule-based and machine learning approaches.

In rule-based approaches, a set of rules is derived based on standard NLP strategies or domain-specific knowledge and then used to recognize possible entities. For example, names and organizations are capitalized, and dates are written in formats like YYYY/MM/DD.

  • Pros:
    • Straightforward and easy to implement for well-structured text
    • Domain knowledge can be easily integrated
    • Usually computationally fast and efficient
  • Cons:
    • Rule sets can get very large, very fast for complicated text structures, requiring a lot of work
    • General purpose rule sets not easily adaptable to specific domains
    • Changes to the text structure further complicate rule additions and interactions.

In machine learning approaches, a model is trained using a dataset annotated specifically for the task at hand. The model learns all the different ways in which relevant entities appear in text and can then be used to identify them in the future.

  • Pros:
    • Training process is domain-agnostic with easily customizable entity tags
    • Well-suited for unstructured text and easily adaptable to structure changes
    • Pre-trained models can be customized and used to speed up the training process
  • Cons:
    • The process requires large amounts of annotated entries to create a robust model
    • May require annotators with specific domain expertise
    • Training process can be costly in terms of time and money depending on the use-case

Our Project

What we wanted to accomplish was to build a baseline entity extraction process which could potentially later be used to improve our matching and other services.

Dataset

A good starting point for that would be the YouTube catalogs we ingest. These are catalogs of unmatched sound recordings. As mentioned earlier, video title structures are usually a bit chaotic. Therefore, this use case is an excellent candidate to test the potential and limitations of NER.

In the video titles, the most relevant entities that are present, and that we would like to identify, are the recording TITLE, PERSON and VERSION (remix, official video, live, etc.).

We investigated both a rule-based and a machine learning approach. For their evaluation, however, we needed an annotated dataset tailored to our use case. For that reason we turned to Label Studio and our Operations Team. Label Studio is an open-source online data annotation tool with an intuitive UI, where we uploaded a catalog sample. The sample was split into sub-tasks, which were then handled by our Operations Team.

Label Studio – Open Source Data Labeling 

At this point, we would like to say a big thank you to the Operations Team for their help. Dataset annotations are almost always quite tedious and repetitive work, but an incredibly important first step in our testing.

Rule-based approach

For the construction of our rules, we first needed to investigate whether there was any kind of structure in the video title text. We found a few patterns.

Information inside parentheses

The first thing we noticed is that when parentheses ( (), [], {} ) were present, they mostly contained featured artists or version information, like live, acoustic, remix, etc. This information was rarely found outside parentheses.

For these reasons we wrote a few simple rules for attributes inside parentheses:

  • If they contained any version keywords (live, acoustic, etc.), tag them as VERSION
  • If “feat” was present, then tag the tokens after that as PERSON

Segmentation

One other thing we noticed was that some entries could be split into segments using certain delimiters ( -, |, / ). These entries could be generally split into 2-4 segments. Also, “|” and “/” have higher priority than “-”. When split by | or /, the first segment mostly contained recording titles and sometimes also artists. When split by -, the picture was not quite as clear, since titles and artists appeared both in the first segment as well as the rest. The most prevalent case, however, was the artist appearing in the first segment and the title in the second.

Based on the above we have the following rules for splittable entries:

  • When splitting by | or /, tokens in the first segment are tagged as TITLE and tokens in the second segment as PERSON
  • When splitting by –, tokens in the first segment are tagged as PERSON and tokens in the second segment as TITLE

Finally, tokens in entries that did not fall into any of the above categories were tagged as TITLE.
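
To make these rules concrete, here is a minimal Python sketch of such a tagger. The keyword list, the regular expressions and the output format are simplified assumptions for illustration, not our production rule set.

import re

# Assumed (and much shorter than the real) list of version keywords.
VERSION_KEYWORDS = {"live", "acoustic", "remix", "official video", "lyric video"}

PARENS = re.compile(r"[\(\[\{](.+?)[\)\]\}]")

def tag_video_title(title):
    """Apply the parenthesis and segmentation rules described above."""
    tags = []

    # Rule 1: parenthesized content is either a VERSION or a featured PERSON.
    for inner in PARENS.findall(title):
        lowered = inner.lower()
        if lowered.startswith("feat"):
            tags.append(("PERSON", inner))
        elif any(keyword in lowered for keyword in VERSION_KEYWORDS):
            tags.append(("VERSION", inner))

    # Rule 2: segmentation, where '|' and '/' take priority over '-'.
    remainder = PARENS.sub("", title)
    if "|" in remainder or "/" in remainder:
        segments = re.split(r"[|/]", remainder)
        first_tag, second_tag = "TITLE", "PERSON"
    elif "-" in remainder:
        segments = remainder.split("-")
        first_tag, second_tag = "PERSON", "TITLE"
    else:
        # Rule 3: entries that cannot be split default to TITLE.
        return tags + [("TITLE", remainder.strip())]

    tags.append((first_tag, segments[0].strip()))
    if len(segments) > 1:
        tags.append((second_tag, segments[1].strip()))
    return tags

print(tag_video_title("Ed Sheeran - Shape of You (Official Video)"))
# [('VERSION', 'Official Video'), ('PERSON', 'Ed Sheeran'), ('TITLE', 'Shape of You')]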

Machine learning approach

Our work for the machine learning approach was much more straightforward. We decided to go with transfer learning. This is a process where we take a state-of-the-art model, pre-trained (usually on public, general-purpose datasets), and partly extend the training with a custom dataset. This is very efficient, since we don't have to waste time training a model from scratch but still get to tailor it to our needs.

spaCy · Industrial-strength Natural Language Processing in Python 

For that purpose, we used spaCy, a well-established, open-source Python package for NLP. It supports multiple languages and NLP algorithms, including NER, and its models are easily retrained and integrated with a few lines of code. It's also convenient that some spaCy models are optimized for accuracy and others for speed. The spaCy models were retrained using the annotated dataset provided by our Operations Team.
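
For inference, a retrained spaCy pipeline can be used with just a few lines. The model path below is an assumption (the typical output directory of spaCy 3's config-based python -m spacy train); the entity labels shown are the custom ones from our annotation scheme.

import spacy

# Assumed path to the retrained pipeline produced by `python -m spacy train config.cfg`.
nlp = spacy.load("training/model-best")

doc = nlp("Ed Sheeran - Shape of You (Official Video)")
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Ed Sheeran PERSON", "Shape of You TITLE", ...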

Results

Both approaches performed very well and identified the majority of the TITLE entities. As far as PERSON and VERSION entities are concerned, the rule-based approach struggled a bit, while the machine learning one did a decent job. Below we have some examples of wrong predictions:

We also faced a few common issues with both approaches, which made their predictions less accurate.

Conclusion

Here is where today's journey comes to an end. We had the chance to briefly introduce the concept of Named Entity Recognition, describe a few of its general and more custom uses, and learn that, despite the variety of approaches, they all come with caveats, so we usually have to make compromises depending on our needs. Is our text well-structured? Are our entities generic or do they require specific domain knowledge? How do different approaches adapt to changes? Are we able to annotate our own datasets?

We also started this article with a question. Did NER help us to answer it? Our models certainly tried. Both our rule-based and machine learning approaches gave us the following result when asked to identify the entities in “Ed Sheeran – Shape of You”:

But what do we know? They seem to perform very well, so they might be right.

Theodoros Palamas

Machine Learning Researcher/Data Scientist @ ORFIUM

https://www.linkedin.com/in/theodoros-palamas-a755b623b/

]]>
Video Similarity with Self-supervised Transformer Network. https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/video-similarity-with-self-supervised-transformer-network/ Wed, 02 Nov 2022 08:49:29 +0000 https://orfium.com/?p=4701

Do you ever wonder how many times your favorite movie exists on digital platforms?

My favorite animated video when I was a child was Happy Hippo. I spent innumerable hours watching videos of the plump hippopotamus on YouTube. One thing that I remember clearly, however, is how many videos of the same clip were uploaded. Some were reaction vids, others had funny songs and photo themes in the background. Now that I am older, I am wondering: can we actually figure out how many versions of the same video exist? Or, more scientifically, can we extract a probability score for a video pair match?

Introduction

Content-based video retrieval is drawing more and more attention these days. Visual content is one of the most popular types of content on the internet, and there is an incredible degree of redundancy, as we observe a high number of near or exact duplicates of all types of videos. Even casual users come across the same videos in their daily use of the web, from YouTube to TikTok. Retrieval plays an even more important role in many video-related applications, including copyright protection, where movie trailers in particular are targets for re-uploads, general piracy or just reaction videos. Usually, techniques such as zooming, cropping or slight distortions are used to disguise the duplicate video so that it is not taken down.

Photo by Kasra Askari on Unsplash

Approaches

From Visual Spatio-Temporal Relation-Enhanced Network and Perceptual Hashing Algorithms to Fine-grained Spatio-Temporal Video Similarity Learning and Pose-Selective Max Pooling, video similarity is a popular task in the field of computer vision. While finding exact duplicates is a task which can be executed in a variety of different ways, near duplicates or videos with modifications still pose a challenge. Most video retrieval systems require a large amount of manually annotated data for training, making them costly and inefficient. To match the current rhythm of video production, an efficient self-supervised technique needs to emerge in order to tackle the space and calculation shortcomings.

What is Self-supervised Video Retrieval Transformer Network?

Based on the research presented in "Self-supervised Video Retrieval Transformer Network" (He, Xiangteng, Yulin Pan, Mingqian Tang and Yiliang Lv), we replicated the architecture with certain modifications.

To begin, we introduce the Self-Supervised Video Retrieval Transformer Network (SVRTN), which targets effective retrieval by decreasing the costs of manual annotation, storage space, and similarity search. As indicated in the previous image, it is primarily composed of two components: self-supervised video representation learning and a clip-level set transformer network. Initially, we use temporal and spatial adjustments to construct the video pairings automatically. Then, via contrastive learning, we use these video pairs as supervision to learn frame-level features. Finally, we use a self-attention technique to aggregate frame-level characteristics into clip-level features, using masked frame modeling to improve robustness. The approach leverages self-supervised learning to learn video representations from unlabeled data, and exploits the transformer structure to aggregate frame-level features into clip-level ones.

Self-supervised video representation learning is used to learn the representation from pairs of videos and their transformations, which are generated automatically via temporal and spatial transformations, eliminating the significant costs of manual annotation. The SVRTN technique can learn a better video representation from a huge number of unlabeled videos thanks to this self-generation of training data, resulting in improved generalization of the learned representation.

A clip-level set transformer network is presented for aggregating frame-level features into clip-level features, resulting in significant storage space and search complexity savings. It can learn complementary and variant information from clip-frame interactions via the self-attention mechanism, as well as invariance to frame permutation and missing frames, all of which improve the clip-level feature's discriminative power and resilience. Furthermore, it allows more flexible retrieval methods, including clip-to-clip and frame-to-clip retrieval.

Self-supervised – Self-generation

After collecting a large number of videos, temporal and spatial transformations are performed sequentially on the sampled clips to construct the training data.

Temporal Transformations: To create the anchor clip C, we evenly sample N frames with a set time interval r. Then, a frame I_m is chosen at random from the anchor clip as the identical material shared by the anchor clip C and the positive clip C+. We treat the chosen frame as C+'s median frame, and we sample (N−1)/2 frames forward and backward with a different sample time interval r+.

Spatial Transformations: We then apply spatial transformations on each frame. Three forms of spatial transformation are explored: (a) photometric transformations, covering brightness, contrast, hue, saturation and gamma adjustments, among others; (b) geometric transformations, offering horizontal flip, rotation, crop, resize and translation adjustments; and (c) editing transformations, which include effects such as a blurred background, a logo, a picture-in-picture, and so on. For the logos, we use the sample dataset of LLD-logo, which consists of 5,000 logos (32×32 resolution, PNG). During the training stage, we pick one transformation from each type at random and apply it to frames of the positive clips in order to create new positive clips.
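
As a rough illustration of these spatial transformations, the sketch below uses torchvision; the specific transforms and parameter ranges are assumptions, and the editing effects (logo, picture-in-picture) are only approximated with stand-ins.

from torchvision import transforms

photometric = transforms.RandomChoice([
    transforms.ColorJitter(brightness=0.4),
    transforms.ColorJitter(contrast=0.4),
    transforms.ColorJitter(saturation=0.4, hue=0.1),
])
geometric = transforms.RandomChoice([
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.6, 1.0)),
])
editing = transforms.RandomChoice([
    transforms.GaussianBlur(kernel_size=9),  # stand-in for the blurred-background effect
    transforms.Pad(padding=16, fill=0),      # stand-in for logo / picture-in-picture edits
])

# One transformation from each category is applied to every positive-clip frame.
positive_transform = transforms.Compose([photometric, geometric, editing])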

Triplet Loss

Triplet loss is a loss function where a reference (anchor) input is compared to a matching (positive) and a non-matching (negative) input. The distance from the anchor to the positive input is minimized, while the distance to the negative input is maximized. In our project, we use triplet loss instead of the contrastive loss at both the frame level and the clip level.
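
A minimal PyTorch sketch of the frame-level triplet loss; the margin, batch size and feature dimension are arbitrary values for illustration.

import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=0.5, p=2)

anchor   = torch.randn(32, 512, requires_grad=True)  # features of anchor-clip frames
positive = torch.randn(32, 512, requires_grad=True)  # features of the transformed (positive) frames
negative = torch.randn(32, 512, requires_grad=True)  # features of frames from unrelated clips

loss = triplet_loss(anchor, positive, negative)  # pulls positives closer, pushes negatives away
loss.backward()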

Video Representation Learning – Frame-level Model

Since the video pairs have already been generated, we employ them as supervision to train the video representation with a frame-level triplet loss. To acquire the frame-level feature, a pretrained ResNet50 is used as the feature encoder, followed by a convolutional layer that lowers the channel number of the feature map, and finally average pooling and L2 normalization.
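
A sketch of that frame-level encoder in PyTorch is shown below; the reduced channel size (512) is an assumption, and weight loading depends on your torchvision version.

import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FrameEncoder(nn.Module):
    def __init__(self, out_channels=512):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")  # torchvision >= 0.13
        # Keep everything up to the last conv block; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.reduce = nn.Conv2d(2048, out_channels, kernel_size=1)

    def forward(self, frames):                  # frames: (batch, 3, H, W)
        feature_map = self.backbone(frames)     # (batch, 2048, h, w)
        feature_map = self.reduce(feature_map)  # (batch, out_channels, h, w)
        pooled = feature_map.mean(dim=(2, 3))   # average pooling
        return F.normalize(pooled, p=2, dim=1)  # L2 normalization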

By minimizing the distance between features of the anchor clip frames and positive clip frames, as well as maximizing the distance between features of the anchor/positive clip frames and negative clip frames, video representation learning aims to capture spatial structure from individual frames while ignoring the effects of various transformations.

Clip-level Set Transformer Network

Because consecutive frames from the same clip contain comparable material, frame-level features are highly redundant and their complementary information is not fully exploited. Specifically, self-supervised video representation learning is used to extract a series of frame-level features from a clip, which are then aggregated into a single clip-level feature x.

We present a modified Transformer, the clip-level set transformer network, to encode the clip-level feature. Instead of utilizing a Transformer to encode the clip-level feature directly, we use the set retrieval concept in the clip-level encoding. Without position embedding, we just utilize one encoder layer with eight attention heads. It gives our SVRTN method the following capabilities:

  1. More robust: We increase the robustness of the learned clip-level features with the ability of frame permutation and missing invariant.
  2. More flexible: We support more retrieval manners, including clip-to-clip retrieval and frame-to-clip retrieval.

We treat the frames of one clip as a set and randomly mask some frames in clip-level encoding to improve the robustness of the learnt clip-level features. We drop some frames at random from a clip C to create a new clip C’. The purpose of this exercise is to eliminate the influence of frame blur or clip cut, and to enable the model to retrieve its corresponding clips using any combination of frames in the clip. Then we use them to calculate the triplet loss.
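
The sketch below illustrates the clip-level set transformer: one encoder layer, eight heads, no position embedding, with random frame masking during training. The masking ratio and the mean aggregation over frames are simplifying assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipSetTransformer(nn.Module):
    def __init__(self, dim=512, mask_ratio=0.25):
        super().__init__()
        # One encoder layer with eight attention heads; requires PyTorch >= 1.9 for batch_first.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.mask_ratio = mask_ratio

    def forward(self, frame_features):           # (batch, n_frames, dim), no position embedding
        mask = None
        if self.training:
            # Randomly drop frames; the clip is treated as a set, so order does not matter.
            # In practice at least one frame per clip should remain unmasked.
            mask = torch.rand(frame_features.shape[:2], device=frame_features.device) < self.mask_ratio
        encoded = self.encoder(frame_features, src_key_padding_mask=mask)
        clip_feature = encoded.mean(dim=1)        # aggregate frame features into one clip feature
        return F.normalize(clip_feature, p=2, dim=1)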

Video Similarity Calculation

We perform shot boundary recognition on each video to segment it into shots, and then divide the shots into clips at a set time interval, i.e. N seconds. Second, to generate the clip-level feature, the sequence of successive frames is passed through the clip-level set transformer network. Finally, IsoHash binarizes the clip-level feature to further reduce storage and search costs. We use Hamming distance to measure clip-to-clip similarity at retrieval time.

Shots are extracted with shot boundary/transition detection with the use of TransNetV2. The lift and projection version of IsoHash has been used for the binarization.
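
Once the clip-level features are binarized, the similarity check itself is cheap. A toy example with 8-bit codes (real codes from IsoHash are much longer):

import numpy as np

def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary clip codes."""
    return int(np.count_nonzero(code_a != code_b))

query  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
stored = np.array([1, 0, 0, 1, 0, 1, 1, 0])
print(hamming_distance(query, stored))  # 2 -> the smaller the distance, the more similar the clips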

Conclusions

We used a variety of modifications to evaluate our model with videos of sports, news, animation and movies.

  • 53 Transformations:
    • Size : Crop
    • Time : Fast Forward
    • Quality : Black & White
    • Others : Reaction
  • Most efficient categories: fast, intro-outro, watermark, contrast, slow, B&W effect
  • Less efficient categories: extras, black-white, color yellow, frame insertion, color blue, resize
  • The model performs well and is tolerant under zoom/crop. There is no direct relation between these attributes and similarity but it seems that medium levels are the most efficient.
  • There seems to be a relation between the number of shots and similarity.
  • Reduced space and calculation cost

Useful Links

Future work: Video Similarity and Alignment Learning on Partial Video Copy Detection

Possible extension: https://www.jstage.jst.go.jp/article/ipsjtcva/5/0/5_40/_article

SVD Dataset: SVD – Short Video Dataset

]]>
My Internship at ORFIUM https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/my-internship-at-orfium/ Tue, 16 Aug 2022 11:51:00 +0000 https://orfium.com/?p=4705

Why did I want to do another internship?

After interning last summer and another year of studying, I was looking forward to getting my hands on more practical matters in the orientation I wanted my career to take, which is AI Research.

So, after finishing my studies, I felt like I wanted to put all of the knowledge I had just acquired to the test against real-world problems. During my early professional steps, I feel it is important to handle a wide variety of issues and learn to work with different kinds of people. It's not enough just to do the job; I also want to be able to find a healthy work-life balance.

Okay, but why intern at ORFIUM?

In my previous internship, I worked for an already scaled company that dealt with generic software engineering issues. This gave me a solid understanding of the life of an engineer. I was ready for something different. I wanted to learn at a still rapidly scaling company and have a more specific role.

Before the internship, Pantelis Vikatos, head of the Research Team, and I discussed the possible projects I could help and learn from, to make them fit both my interests and the company’s goals.

As an intern, I wanted the chance to apply what I've learned from my studies and previous working experiences. At the same time, I wanted to actually contribute to a company such as ORFIUM. Seeing the passion of the people already working here motivated me further and allowed me to truly realize the value of the task at hand.

My AI and research background, in combination with the open-minded, free culture of the music industry and especially ORFIUM, was just the right match.

So, how was interning at ORFIUM? 

Not being a first-time intern, I had a realistic outlook on the whole process. This time I wanted to go a step forward and contribute even more however I could. I was ready to take on even more responsibilities and do the best I could to put my skills to the test, creating a win-win scenario for both the company and me.

At the end of the day, I understood that getting the job done was not the only goal. Being as efficient and working clean while also adding to a good working environment for my colleagues and me was exactly what I was expecting from myself and the company.

And was ORFIUM a good place to intern?

Being an intern at ORFIUM surely exceeded my expectations.

From the first moments and interactions, I realized that I was in a friendly and open environment. This made the whole process flow smoothly. Everyone was there to help or answer whatever questions I had.

The company provided whatever I needed to work at a professional level in terms of equipment and infrastructure. I was given my own laptop and peripherals, as well as instructions to get my job done easily. Virtual machines and online resources were managed and provided internally by experts, so that whatever was needed was available.

This way, the internship kicked off in the best way possible. I was trusted to lead my project my own way and at my own pace, without anyone doubting my ability to handle my responsibilities. This was enough to let me know that not only was I in an open-minded environment, but my voice was also heard.

What did I actually do at ORFIUM?

The project I was assigned to was to replicate the work done at the paper with the title “Self-supervised Video Retrieval Transformer Network”, creating a video matching mechanism.

The main objective was to answer the question: “Can we extract a probability score for video pair match?”.

The motivation behind this question was the observation of a high number of near-duplicate videos online. We would like to be able to find similar videos, which could be re-uploads, piracy content etc.

The workflow can be described as:

  1. Common state-of-the-art approaches
  2. Model Architectures
  3. Evaluation Methods
  4. Proposed Method
  5. Documentation of online sources on:
    1. State-of-the-art literature
    2. Public datasets
  6. Implementation of a Video – visual transformers deep learning model
  7. Training & Evaluation of the proposed model

After a few modifications and a lot of questions, the results were good enough to be able to deliver the trained model and the evaluations.

At that point, having completed the basic goal of my internship, I researched the extension possibilities. I also made a presentation to the rest of the team where I presented my work, explaining the process and demonstrating the results.

So, what did I learn?

After finishing my internship, I think back and reflect on the various experiences and lessons I had during these three months. There is no comparison between the practical applications on a company level and the experience in a university semester. 

First of all, I had the opportunity to see how a scaling company like ORFIUM operates. I learned about the different hierarchy of the teams and departments, their roles and responsibilities, and the processes and workflows. I experienced hands-on how a project is planned, how it is split into simple tasks and how different teams collaborate to accomplish these tasks.

Also, I had the opportunity to talk and collaborate with teams both internal and external to ORFIUM. I saw, first-hand, professional experts and the way they work. We established international communications in order to handle specific matters.

Industry-wise, I was gently introduced to the basic concepts of the music industry. I had the chance to see the variety of challenges it faces. To be honest, it was even richer and more interesting than what I had imagined.

Working on my project, I learned how to start and plan a research project and how to organize my work so that I do things faster and better as a professional. I practiced more on things that I was already familiar with, and I learned a lot by asking questions about everything I thought I knew. Wrapping up my project, I learned how to produce something well documented and reusable and how to present my work to my teammates in a structured way in order to achieve company-level awareness and leave my mark.

What was the best part of the internship?

If I had to choose something I liked the most out of my experience being an intern at ORFIUM, that would be the human-centric culture they bring to the table. The easy-going but focused way of working really allows people to feel free and do the best of their efforts to contribute.

Overall, I felt like I was accepted and trusted. The project was “tailored” to my interests, and my supervisors were there to help me with whatever I needed. The constant team urge to do activities in and out of the working environment made the whole experience feel like it was not just an internship.

What are the next steps, post-internship?

My thoughts now that I’m almost done with my internship are only positive. I am going to accumulate all the experiences, the lessons learned, the people that I met and collaborated with and all the good memories in order to finish whatever is pending at my university and be able to contribute even more in the future. Hopefully, after that, I can come back to help ORFIUM scale even more along my personal growth. 🙂

Giannis Prokopiou

Data scientist intern – Research Team @ ORFIUM

]]>
User Story Mapping https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/user-story-mapping/ Wed, 15 Dec 2021 12:03:34 +0000 https://orfium.com/?p=4709

Are you User Story Mapping yet?

It’s no secret that communication is one of the most important functions that teams need to do well in order to function. But how can teams with different areas of expertise be sure that they are talking about the same problem, the same solution and the same method to get there? Enter User Story Mapping.

It is a process that helps create shared communication among team members in order to “talk about the user’s journey through your product by building a simple model that tells your user’s story as you do” (Jeff Patton).

Shared Understanding

The primary goal of a user story mapping workshop is to create a shared understanding with your team. After a successful session you all will know what you need to build, to solve what problem for users, and be sure that you are talking about the same thing.

Shared understanding doesn’t come from writing perfect documents. A product document, even the most carefully written one, will help your team visualise problems, the users that have them and, in the best case, the solutions your product will introduce to these problems. But in order to make sure that all the team members understand the same problem, users and solution, you need to bring these people into a room (yes, virtual rooms count) and discuss.

Product/User Flows

After running a good user story mapping workshop, your output will be Product/User Flows. Having explored the business problems behind the current user flows, you are more ready to build new ones in order to solve your user’s problems.

You will not be able to explore every detail of your user flows, and that’s ok. You need to tackle the problems by priority, value, and impact.

How to run a User Story Mapping

Now you know what User Story Mapping is and why your team needs it. But how do you run a workshop effectively? The main artifact of your User Story Mapping workshop is the board where all the information should be depicted.

Step 1 of 6 – Preparation

As a PM, you need to prepare your User Story Mapping workshop. You need to build a lot of information from the teams and properly gather all the requirements ahead of time. Things you should consider gathering or writing down:

  • Business Context
  • Personas
  • Jobs to be done (for your personas)
  • Current user problems

Pro tip: You can find a lot of templates online to help you structure the above information!

Step 2 of 6 – Backbone

In the second step, you need to add all the main activities (backbone) on top of the map, in the order that users should perform them while using the product. The backbone should consist of main user activities that are clearly separated and are not part of the same solid user flow.

Each main activity should have its own internal story (as we will see in the next step) with an intro, a set of actions, and a specific result.

This can include activities like “Organize email”, “Manage email”, etc.

Story mapping flow might not be the same as the final user’s journey in the app. It can also include forks and loops.

Step 3 of 6 – User Steps

After we have established the main activities of our user, it's time to add the smaller activities, or steps, that take place within each main activity. These steps should describe the user's intro to the activity and the different screens they have to go through. Even though having screens this early is not optimal, it won't hurt to have a visual guide. But do not design wireframes here; it's too early and not necessary.

This can include steps like “Compose email”, “Read email”, “Delete email”, etc.

Step 4 of 6 – Activities (aka, Options)

Once you have added the steps, you can now start adding all the details that each activity contains, in regards to functionality. This is the step where you need to add all the user actions and interactions with the smaller activities/steps within each “screen”.

Each step must include a functionality of the product or a user action and not something that the development team should do, like manual work or research.

This can include actions like “Create and send basic email”, “Send RTF email”, etc.

Pro tip: Avoid the How

User Story Mapping is about mapping your user's journey. Shocking, I know 🙂. You are working at a high level now, in order to help you drill down later on and, of course, create a shared understanding of what your product is trying to achieve.

You should avoid diving into the technical implementation just yet. It’s out of the scope and it will only add obstacles in the process. You can solve a problem in many different ways, but this isn’t the time to think about the technical & design solutions, it’s all about figuring out and agreeing upon the problems themselves.

Step 5 of 6 – Annotate

Annotation is perhaps one of the most important steps in the entire process. Once you have mapped all the particular activities (or options) on the map, it’s vital to annotate on the post-its extra information like concerns or unknowns that will help you focus later on and prioritize. Annotations can include:

  • Hard to develop solutions
  • Uncertainties on the problem
  • UX research needed
  • Business obscurity

Pro Tip: You can define the possible annotations from the preparation step.

Step 6 of 6 – Prioritization

The final step of the process is to prioritize the activities and decide what we are going to do first. Prioritization should be based on the value and effort for each option.

Remember the MVP process from Henrik Kniberg? He suggests that we should work towards delivering value to the user in small iterations, re-designing the product as we go instead of trying to deliver the end result all at once. This is where agile comes in.

So, a gentle reminder: priority goes to iterations that make sense from both a technical and a business perspective, but most of all from the user's perspective.

When to run a User Story Mapping

This workshop can be adjusted to any product at any stage of their life:

  • MVP: You can run a user story mapping workshop to identify what you need to build first to maximize the value you are delivering to users
  • Live product: You should run a User Story Mapping workshop to a) create the map that you hadn't created in the beginning and b) see where your new feature will fit in the existing user journey.

Who should participate

User Story Mapping is an extension of the 3 Amigos workshop that each development team can run prior to building each feature. The important improvement is that it adds the business perspective into the mix.

What you should bring to the table

PM: Backbone and main activity builder. The PM is the decision-maker of the workshop

Business/Sales: Brings the business knowledge and the customer experience/feedback/opinion

Engineering: Makes sure that everything discussed is doable, at least with the existing knowledge (but remember to avoid the nitty-gritty of the how)

UX/Design: Becomes the glue between Business and Product, makes sure the steps and user activities are in the right place and make sense

QA: Puts the final touches on the entire process, making sure that user flows are well structured and loops are closed (where necessary)

How we run a User Story Mapping workshop at ORFIUM

For us, the User Story Mapping workshop takes 2 sessions. This gives us more time to identify the user flows for a single product where the users are hard to get and the logic is complex. This is how we can best make sure that the entire team has a common understanding of the problem we are trying to solve and the way we are going to do that.

Step 1 – Find a room (5 minutes)

The first step is to find a room with a whiteboard (or a clean wall would do), isolated from the rest of the rooms. This room will be dedicated to the team but will also allow breaks.

Step 2 – Order post-its (5 minutes)

Post-its are the main material for running the workshop as our main goal is to create the board we’ve been discussing.

Make sure you have enough post-its for everyone to write and add to the board, even if you might end up throwing away some of them.

We use 3 sizes of post-its:

  • Large for the main activities
  • Medium square for the smaller activities and for the backbone, in a different color
  • Small for the annotations

Step 3 – Add Additional Material (Recommended for a physical workshop – 1 day)

We add additional material around the room to guide the participants through the entire workshop but also keep all the references we need.

To make it easier to link the slides on the walls with existing information on an online document, we added QR codes on each slide with a link to the original page with the full description.

The additional material included:

  • Jobs to be done
  • Glossary / reference / terminology
  • Collaboration pattern

Anyone could also annotate or add post-its to the slides and ask questions, and the Business or the Product was there to answer them.

Step 4 – Prepare the Board (30 minutes)

Since it's the Product Manager's duty to set up this meeting, she should have at least a draft version of what the board will look like, at least with regard to the backbone. So, we add the initial post-its for the backbone, and perhaps for the main activities, to give the team some structure.

The board could look like the one on the left, by the end of the workshop, where all the main activities have steps and options, along with all the annotations.

Step 5 – Fishbowl Collaboration Pattern

To allow the team to collaborate better and avoid overcrowded boards and too many voices, we introduced the fishbowl collaboration style.

This collaboration style allowed us to have a focus area where every member of the workshop that was in that area was allowed to talk about the board and the post-its. Everyone outside the area was free to explore the room, be on their phone or even read more online.

This way we managed to have our focus on the board and allow each person to clearly write the post-its and explain what they were writing.

Step 6 – Workshop Time

Time: It depends on the size of the product, the preparation, and the team’s experience.

Once we explained all the “rules”, the additional material, and the collaboration style, the team was ready to jump in and start adding post-its.

Predictably, for the first 20-30 minutes we added no post-its. We were asking a huge number of questions about business specifics and backbone items, but this was absolutely necessary for the entire team to start building a shared understanding.

Prioritization & Release Versions

Keep in mind to leave some time at the end to prioritize your post-its into release versions. It's very important for the team, once they understand the big picture, to put the blocks in prioritized order and know what's next.


Ioannis Papikas

Senior Product Manager @ ORFIUM

https://www.linkedin.com/in/ioannis-papikas/

]]>
⏱️ Speeding up your Python & Django test suite https://www.orfium.com/ja/%e3%82%ab%e3%83%86%e3%82%b4%e3%83%aa%e3%83%bc%e3%81%aa%e3%81%97/%e2%8f%b1%ef%b8%8f-speeding-up-your-python-django-test-suite/ Thu, 04 Nov 2021 11:51:59 +0000 https://orfium.com/?p=4713
Artwork by @vitzi_art

Less time waiting, more time hacking!

Yes yes, we all know. Writing tests and thoroughly running them on our code is important. None of us enjoy doing it but we almost all see the benefits of this process. But what isn’t as great about testing is the waiting, the context shifting and the loss of focus. At least for me, this distraction is a real drag, especially when I have to run a full test suite.

This is why I find it crucial to have a fine-tuned test suite that runs as fast as possible, and why I always put some effort into speeding up my test runs, both locally and in the CI. While working on different  Python / Django projects I’ve discovered some tips & tricks that can make your life easier. Plenty of them are included in various documentations, like the almighty Django docs, but I think there’s some value in collecting them all in a single place.

As a bonus, I’ll be sharing some examples / tips for enhancing your test runs when using Github Actions, as well as a case study to showcase the benefit of all these suggestions.

The quick wins

Running a part of the test suite

This first one is kind of obvious, but you don't have to run the whole test suite every single time. You can run the tests in a single package, module, class, or even a single function, by passing its dotted path to the test command.

> python manage.py test package.module.class.function
System check identified no issues (0 silenced).
..
----------------------------------------------------------------------
Ran 2 tests in 6.570s

OK

Keeping the database between test runs

By default, Django creates a test database for each test run, which is destroyed at the end. This is a rather slow process, especially if you want to run just a few tests! The --keepdb option will not destroy and recreate the database locally on every run. This gives a huge speedup when running tests locally and is a pretty safe option to use in general.

> python manage.py test <path or nothing> --keepdb
Using existing test database for alias 'default'...          <--- Reused!
System check identified no issues (0 silenced).
..
----------------------------------------------------------------------
Ran 2 tests in 6.570s

OK
Preserving test database for alias 'default'...              <--- Not destroyed!

This is not as error-prone as it sounds, since every test usually takes care of restoring the state of the database, either by rolling back transactions or truncating tables. We’ll talk more about this later on.

If you see errors that may be related with the database not being recreated at the start of the test (like IntegrityError, etc), you can always remove the flag on the next run. This will destroy the database and recreate it.

Running tests in parallel

By default, Django runs tests sequentially. However, whether you’re running tests locally or in your CI (Github Actions, Jenkins CI, etc) more often than not you’ll have multiple cores. To leverage them, you can use the --parallel flag. Django will create additional processes to run your tests and additional databases to run them against.

You will see something like this:

> python3 manage.py test --parallel --keepdb
Using existing test database for alias 'default'...
Using existing clone for alias 'default'...        --
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |    => 12 processes!
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...         |
Using existing clone for alias 'default'...        -- 

< running tests > 

Preserving test database for alias 'default'...
... x10 ...
Preserving test database for alias 'default'...

On GitHub runners, this usually spawns 2-3 processes.

When running with --parallel, Django will try to pickle tracebacks from errors to display them in the end. You’ll have to add tblib as a dependency to make this work.

Caching your Python environment (CI)

Usually when running tests in CI/CD environments a step of the process is building the Python environment (creating a virtualenv, installing dependencies etc). A common practice to speed things up here is to cache this environment and keep it between builds since it doesn’t change often. Keep in mind, you’ll need to invalidate this whenever your requirements change.

An example for Github Actions could be adding something like this to your workflow’s yaml file:

- name: Cache pip
  uses: actions/cache@v2
  with:
    # This path is specific to Ubuntu
    path: ${{ env.pythonLocation }}
    # Look to see if there is a cache hit for the corresponding requirements file
    key: ${{ env.pythonLocation }}-${{ hashFiles('requirements.txt','test_requirements.txt') }}

Make sure to include all your requirements files in hashFiles.

If you search online you'll find various guides, including the official GitHub guide, advising you to cache just the retrieved packages (the pip cache). I prefer the above method, which caches the installed packages, since I saw no speedup with those suggestions.

The slow but powerful

Prefer TestCase instead of TransactionTestCase

Django offers 2 different base classes for test cases: TestCase and TransactionTestCase. Actually it offers more, but for this case we only care about those two.

But what’s the difference? Quoting the docs:

Django’s TestCase class is a more commonly used subclass of TransactionTestCase that makes use of database transaction facilities to speed up the process of resetting the database to a known state at the beginning of each test. A consequence of this, however, is that some database behaviors cannot be tested within a Django TestCase class. For instance, you cannot test that a block of code is executing within a transaction, as is required when using select_for_update(). In those cases, you should use TransactionTestCase.

TransactionTestCase and TestCase are identical except for the manner in which the database is reset to a known state and the ability for test code to test the effects of commit and rollback:

  • A TransactionTestCase resets the database after the test runs by truncating all tables. A TransactionTestCase may call commit and rollback and observe the effects of these calls on the database.
  • A TestCase, on the other hand, does not truncate tables after a test. Instead, it encloses the test code in a database transaction that is rolled back at the end of the test. This guarantees that the rollback at the end of the test restores the database to its initial state.

In each project, there’s often this one test that breaks with TestCase but works with TransactionTestCase. When engineers see this, they consider it more reliable and switch their base test classes to TransactionTestCase, without considering the performance impact. We’ll see later on that this is not negligible at all.

TL;DR:

You probably only need TestCase for most of your tests. Use TransactionTestCase wisely!

Some additional indicators where TransactionTestCase might be useful are (see the sketch after this list for a way to keep such tests on TestCase):

  • emulating transaction errors
  • using on_commit hooks
  • firing async tasks (which will run outside the transaction by definition)
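
One pattern that can help keep on_commit-related tests on TestCase instead of switching the base class is Django's captureOnCommitCallbacks (available since Django 3.2). A minimal sketch:

from django.db import transaction
from django.test import TestCase

class OnCommitTests(TestCase):
    def test_hook_runs_on_commit(self):
        fired = []

        # Execute the on_commit hooks registered inside the block,
        # even though TestCase never actually commits the transaction.
        with self.captureOnCommitCallbacks(execute=True):
            transaction.on_commit(lambda: fired.append("sent"))

        self.assertEqual(fired, ["sent"])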

Try to use setUpTestData instead of setUp

Whenever you want to set up data for your tests, you usually override & use setUp. This runs before every test and creates the data you need.

However, if you don't change the data in each test case, you can also use setUpTestData (docs). This runs once per test class and creates data that is shared by all the tests in that class. This is definitely faster, but if your tests alter the test data you can end up with weird cases. Use it with caution.
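
A minimal sketch of the difference, using Django's built-in User model for illustration:

from django.contrib.auth.models import User
from django.test import TestCase

class ProfileTests(TestCase):
    @classmethod
    def setUpTestData(cls):
        # Created once for the whole class instead of once per test (as setUp would).
        cls.user = User.objects.create_user(username="alice", password="s3cret")

    def test_username(self):
        self.assertEqual(self.user.username, "alice")

    def test_is_active_by_default(self):
        self.assertTrue(self.user.is_active)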

Finding slow tests with nose

Last but not least, remember that tests are code. So there’s always the chance that some tests are really slow because you didn’t develop them with performance in mind. If this is the case, the best thing you can do is rewrite them. But figuring out a single slow test is not that easy.

Luckily, you can use django-nose (docs) and nose-timer to find the slowest tests in your suite.

To do that:

  • add django-nose, nose-timer to your requirements
  • In your settings.py, change the test runner and use some nose-specific arguments

Example settings.py:

TEST_RUNNER = 'django_nose.NoseTestSuiteRunner'
NOSE_ARGS = [
    '--nocapture',
    '--verbosity=2',
    '--with-timer',
    '--timer-top-n=10',
    '--with-id'
]

The above arguments will make nose output:

  • The name of each test
  • The time each test takes
  • The top 10 slowest tests in the end

Example output:

....
#656 test_function_name (path.to.test.module.TestCase) ... ok (1.3138s)
#657 test_function_name (path.to.test.module.TestCase) ... ok (3.0827s)
#658 test_function_name (path.to.test.module.TestCase) ... ok (5.0743s)
#659 test_function_name (path.to.test.module.TestCase) ... ok (5.3729s)
....
#665 test_function_name (path.to.test.module.TestCase) ... ok (3.1782s)
#666 test_function_name (path.to.test.module.TestCase) ... ok (0.7577s)
#667 test_function_name (path.to.test.module.TestCase) ... ok (0.7488s)

[success] 6.67% path.to.slow.test.TestCase.function: 5.3729s    ----
[success] 6.30% path.to.slow.test.TestCase.function: 5.0743s       |
[success] 5.61% path.to.slow.test.TestCase.function: 4.5148s       |
[success] 5.50% path.to.slow.test.TestCase.function: 4.4254s       |
[success] 5.09% path.to.slow.test.TestCase.function: 4.0960s       | 10 slowest
[success] 4.32% path.to.slow.test.TestCase.function: 3.4779s       |    tests
[success] 3.95% path.to.slow.test.TestCase.function: 3.1782s       |
[success] 3.83% path.to.slow.test.TestCase.function: 3.0827s       |
[success] 3.47% path.to.slow.test.TestCase.function: 2.7970s       |
[success] 3.20% path.to.slow.test.TestCase.function: 2.5786s    ---- 
----------------------------------------------------------------------
Ran 72 tests in 80.877s

OK

Now it's easier to find the slow tests and debug why they take so much time to run.

Case study

To showcase the value of each of these suggestions, we’ll be running a series of scenarios and measuring how much time we save with each improvement.

The scenarios

  • Locally
    • Run a single test with / without --keepdb, to measure the overhead of recreating the database
    • Run a whole test suite locally with / without --parallel, to see how much faster this is
  • On Github Actions
    • Run a whole test suite with no improvements
    • Add --parallel and re-run
    • Cache the python environment and re-run
    • Change the base test case to TestCase from TransactionTestCase

Locally

Performance of --keepdb

Let’s run a single test without --keepdb:

> time python3 manage.py test package.module.TestCase.test 
Creating test database for alias 'default'...
System check identified no issues (0 silenced).
.
----------------------------------------------------------------------
Ran 1 test in 4.297s

OK
Destroying test database for alias 'default'...

real    0m50.299s
user    1m0.945s
sys     0m1.922s

Now the same test with --keepdb:

> time python3 manage.py test package.module.TestCase.test --keepdb
Using existing test database for alias 'default'...
.
----------------------------------------------------------------------
Ran 1 test in 4.148s

OK
Preserving test database for alias 'default'...

real    0m6.899s
user    0m20.640s
sys     0m1.845s

Difference: 50 sec vs 7 sec or 7 times faster

Performance of --parallel

Without --parallel:

> python3 manage.py test --keepdb
...
----------------------------------------------------------------------
Ran 591 tests in 670.560s

With --parallel (concurrency: 6):

> python3 manage.py test --keepdb --parallel 6
...
----------------------------------------------------------------------
Ran 591 tests in 305.394s

Difference: 670 sec vs 305 sec or > 2x faster

On Github Actions

Without any improvements, the build took ~25 mins to run. The breakdown for each action can be seen below.

Note that running the test suite took 20 mins

When running with --parallel, the whole build took ~17 mins to run (~30% less). Running the tests took 13 mins (vs 20 mins without --parallel, an improvement of ~40%).

By caching the python environment, we can see that the Install dependencies step takes a few seconds to run instead of ~4 mins, reducing the build time to 14 mins

Finally, by changing the base test case from TransactionTestCase to TestCase and fixing the 3 tests that required it, the time dropped again:

Neat, isn’t it?

Key takeaway: We managed to reduce the build time from ~25 mins to less than 10 mins, which is less than half of the original time.

Bonus: Using coverage.py in parallel mode

If you are using coverage.py, setting the --parallel flag is not enough for coverage to be collected correctly while your tests run in parallel.

First, you will need to set parallel = True and concurrency = multiprocessing in your .coveragerc. For example:

# .coveragerc
[run]
branch = True
omit = */__init__*
       */test*.py
       */migrations/*
       */urls.py
       */admin.py
       */apps.p

# Required for parallel
parallel = true
# Required for parallel
concurrency = multiprocessing

[report]
precision = 1
show_missing = True
ignore_errors = True
exclude_lines =
    pragma: no cover
    raise NotImplementedError
    except ImportError
    def __repr__
    if self.logger.debug
    if __name__ == .__main__.:

Then, add a sitecustomize.py to your project’s root directory (where you’ll be running your tests from).

# sitecustomize.py
import coverage

coverage.process_startup()

Finally, you’ll need to do some extra steps to run with coverage and create a report.

# change the command to something like this
COVERAGE_PROCESS_START=./.coveragerc coverage run --parallel-mode --concurrency=multiprocessing --rcfile=./.coveragerc manage.py test --parallel
# combine individual coverage files 
coverage combine --rcfile=./.coveragerc
# and then create the coverage report
coverage report -m --rcfile=./.coveragerc

Enjoy your way faster test suite!

Sergios Aftsidis

Senior Backend Software Engineer @ ORFIUM

https://www.linkedin.com/in/saftsidis/

https://iamsafts.com/

https://github.com/safts

]]>