
Tips for the painless migration of your data warehouse to the cloud

From humble beginnings only a few years ago, the cloud has arrived. Now widely embraced by the IT industry, it’s the strategy of choice for organizations pursuing competitive advantage through agility, cost efficiencies and ease of collaboration.

“The cloud is definitely a good place to start for any strategy that’s focused on digital transformation initiatives,” said Gary Davenport, independent management consultant, former CIO, and past president of the CIO Association of Canada. “Even the federal government has adopted a cloud-first strategy, a massive shift from its former position.”

Speaking at a networking event presented by the Sourced Group and Google, in association with IT World Canada, Davenport noted the complexity of the cloud landscape. “More than 80% of companies are using at least one kind of cloud, and the majority have multiple cloud environments,” he said. “The question is how to get the most out of these environments and pick the best service providers, applications and tools to make it all work seamlessly for your organization.”

Davenport highlighted some common themes in cloud migration, including the value of lift-and-shift security and of turning data into actionable intelligence. His primary focus, however, was how best to migrate from a legacy service to the cloud without disrupting the entire enterprise in the process.

Danil Zburivsky, senior consultant at Sourced Group, extended this focus with an example specific to the Google Cloud Platform (GCP), splitting his presentation into three sections: what drives the need for migration, the steps in achieving that migration, and the implications for security and privacy.

“Business has changed and business needs are increasing when it comes to data,” said Zburivsky. “Where before, data warehouses were accessed by a few analysts, and perhaps some executives, today an increasingly large community of users needs data and analytics to do their jobs and they don’t want to wait for it. And although data volumes are growing, some systems don’t scale economically – or even at all.”

With these constraints, he said, people start building solutions on the side to meet their business needs, creating a shadow IT problem. Other issues arise when coping with new data types. Warehouses designed and built to manage relational data are suddenly faced with JSON data with flexible schemas, binary data, video, and other data types. “So you need a solution to analyze a huge variety of data,” Zburivsky said. “And you can’t really build a special solution for each of those because tomorrow we’ll get something new.”

Faced with these challenges, businesses consider moving their data warehouse to the cloud. But what’s the approach? Can you just copy an existing warehouse to the cloud and be done with it?

“There are great cloud warehouses,” Zburivsky noted. “We’ll talk about Google Cloud Platform, and BigQuery, their fully managed offering for data warehousing. But my experience is, if you just take the data and pop it into a cloud warehouse, it doesn’t solve all your problems.” There are still issues with different types of data, and with the fact that more and different people want to consume data differently.
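
To make the “pop it into a cloud warehouse” step concrete, here is a minimal sketch, not from the presentation, of loading newline-delimited JSON with a flexible schema into BigQuery using the google-cloud-bigquery Python client; the project, bucket, and table names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema, absorbing new fields
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/events/*.json",  # hypothetical source files
    "my-project.raw_zone.events",         # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # block until the load completes
print(f"Loaded {load_job.output_rows} rows")
```

The load itself is the easy part; as Zburivsky notes, getting the data in does not by itself serve the variety of consumers waiting for it.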

Warehouses are monolithic, he explained, and don’t provide the necessary elasticity. Many only understand SQL. What we want instead is a modular design with loosely coupled modules; for real-time workloads, there can be a separate layer with a message bus and other real-time capabilities.
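
As one illustration of that real-time layer, the sketch below publishes an event to Cloud Pub/Sub acting as the message bus, so streaming consumers stay decoupled from the batch warehouse; the project and topic names are invented for the example.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream-events")

# Publish an event; downstream modules (stream processing, alerting)
# consume it independently, without touching the warehouse.
future = publisher.publish(topic_path, data=b'{"user": 42, "action": "view"}')
print(f"Published message {future.result()}")
```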

Typical data warehouse tasks include ingesting data, ETL, and consuming data. Each task needs to be broken into modules in a cloud warehouse, each of which does one thing well. The base data warehouse itself doesn’t go anywhere; it’s an essential part of the design, an access layer that’s great for the SQL-based access demanded by tools such as MicroStrategy.
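
A minimal sketch of that SQL access layer, assuming a hypothetical project and dataset: a BI-style aggregate query issued through the BigQuery Python client, the same standard SQL a tool like MicroStrategy would send.

```python
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT region, SUM(revenue) AS total_revenue
    FROM `my-project.analytics.sales`
    GROUP BY region
    ORDER BY total_revenue DESC
"""
# The access layer only answers queries; ingestion and ETL run as
# separate, loosely coupled modules.
for row in client.query(query).result():
    print(row.region, row.total_revenue)
```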

The migration journey is an iterative process consisting of three broad steps: Discover, Plan, and Execute, Zburivsky said. Discover means finding out what data warehouses exist, which use cases they serve, and what ETLs are in place. From there, Plan the migration. Rather than attempting a “big bang” conversion, he recommends migrating one use case at a time, picking business-critical functions that are currently operational pain points, and validating each before proceeding. The Execute step comprises the actual migrations: offload the data (it flows from the existing data warehouse, since other unmigrated use cases still need it there), validate it, run ETL, convert a use case, validate its results, then rinse and repeat for the next use case. And, he cautioned, don’t introduce too many changes at this stage.
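
As a sketch of the offload step, and assuming the legacy warehouse already exports nightly CSV extracts to Cloud Storage, the code below loads one such extract into a BigQuery staging table; every path, schema field, and table name here is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # the extract includes a header row
    schema=[
        bigquery.SchemaField("order_id", "INTEGER"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
    # Each nightly run replaces the staging table wholesale.
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/extracts/orders.csv",
    "my-project.staging.orders",
    job_config=job_config,
)
load_job.result()
```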

Because it’s virtually impossible to do line-by-line validation, Zburivsky suggested using statistical methods to validate the migrated use case. “This will be really important to prove to your stakeholders that the migration is actually working,” he said. “Because a big concern that we hear is ‘we migrated the data to BigQuery – is it accurate?’ The reason people ask this question is because they’ve been burned many times before with their existing warehouses, where they look at the reports and everything seems okay, but then they discover that somebody changed the ETL yesterday and now everything’s different. So you need to build trust with your users, and the data validation step is extremely important for that. Automation is critical here.”
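
In the spirit of that statistical validation, here is a minimal sketch: run the same aggregates (row counts, sums, distinct counts) against both systems and compare. The legacy-side function is a hardcoded stand-in for a query against the old warehouse, and the table name is hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

CHECK_SQL = """
    SELECT COUNT(*) AS row_count,
           SUM(amount) AS total_amount,
           COUNT(DISTINCT order_id) AS distinct_orders
    FROM `my-project.staging.orders`
"""

def legacy_metrics():
    # Placeholder: in practice, run the same aggregates against the
    # legacy warehouse via its own driver. Hardcoded for illustration.
    return {
        "row_count": 1_000_000,
        "total_amount": 12_345_678.90,
        "distinct_orders": 1_000_000,
    }

bq_row = next(iter(client.query(CHECK_SQL).result()))
legacy = legacy_metrics()

# Automate this comparison and run it on every migration iteration.
for metric in ("row_count", "total_amount", "distinct_orders"):
    assert legacy[metric] == bq_row[metric], f"mismatch on {metric}"
```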

Zburivsky suggested rewriting business logic in a general-purpose language such as Java or Python during migration. “The bulk of the time (in a migration) actually goes into the business logic conversion,” he noted. “So that’s where you should expect to spend most of the time. And that’s all manual labour.”
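
As an invented illustration of that conversion work, the sketch below turns a warehouse-style CASE expression into a plain, unit-testable Python function; the tiers and thresholds are hypothetical.

```python
def customer_tier(annual_spend: float) -> str:
    """Replaces: CASE WHEN spend >= 10000 THEN 'gold' ... END"""
    if annual_spend >= 10_000:
        return "gold"
    if annual_spend >= 1_000:
        return "silver"
    return "bronze"

# Once the logic lives in code, regressions are caught by tests
# instead of being discovered in reports.
assert customer_tier(15_000) == "gold"
assert customer_tier(500) == "bronze"
```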

He also touched on data governance, privacy and security, emphasizing the importance of proper access control, and of proper foundations such as resource hierarchies that map cleanly onto the organization and work at a global scale.

“When you’re building your project in the cloud, you don’t have one single system that does all the processing,” he said. “You can actually split the data, from a cloud resource hierarchy perspective, into different zones and build your controls and access management for each zone separately. Those zones can be controlled differently because different groups of people and users need access to each of them.”
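
One way to express those zones on GCP is as separate BigQuery datasets with their own access lists; the sketch below, with a hypothetical dataset and group, grants an analysts group read access to a curated zone only, using the google-cloud-bigquery client.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The curated zone is its own dataset; the raw zone would be another,
# with a different access list.
dataset = client.get_dataset("my-project.curated_zone")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="analysts@example.com",  # analysts read curated data only
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```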

Zburivsky concluded by noting that in his design it’s possible to have multiple instances of BigQuery for different teams. Since it’s fully managed and billed by consumption, it’s economical enough to do so, whereas this would have been prohibitively expensive with a traditional data warehouse.
