Google Cloud Platform: Reference architecture for Data Warehouse
It has been a great journey learning and understanding Google Cloud Platform (GCP). Among the top cloud providers, Google seems to have nailed cloud technology very well. As I explored the services, I got some proofs of concept working and defined some reference architectures. One of them is for a data warehouse. Our use case is an analytics dashboard and reporting platform for internal and external users. The solution requires three standard serverless services from Google Cloud Platform:
- Cloud Dataflow – a fully managed service for stream and batch processing that enriches data as it is ingested into the various storage options in Google Cloud
- BigQuery – a serverless, highly scalable cloud data warehouse with a built-in in-memory BI Engine and machine learning capabilities
- Data Studio – a serverless, highly scalable BI tool with a flexible suite of data analytics features
In this case, let us assume there are four sources: Sources 1, 2 and 3 reside within the US, and Source 4 resides outside the US and requires some data separation. Cloud Dataflow, which runs Apache Beam pipelines, can be used to stream or batch ingest the data from the sources. We can develop a pipeline for one source and reuse it for the others. Dataflow pipelines can be written in Java or Python.
Once we have the data, a common practice is to keep a data lake in BigQuery that stores the raw data for in-depth analytics or for running machine learning algorithms. From there, dimensional modeling may or may not be required, depending on the nature of the end output. For clarity, BigQuery is shown as both the data lake and the data warehouse, but the two structures may reside in the same place.
Both Google Data Studio (since rebranded as Looker Studio) and Tableau are visualization solutions, and either could be extended to support the goals of the organization. This provides a high-level overview of a data warehousing reference architecture on Google Cloud Platform.
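To make the idea of reusing one pipeline across the four sources a little more concrete, here is a minimal sketch in plain Python. It does not use the Apache Beam SDK; it only shows how a single pipeline definition might be parameterized per source, with the non-US source routed to its own dataset. The project name, dataset names, source names and the `pipeline_options` helper are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical raw-data dataset names, one per data-residency region.
# These are assumptions for illustration, not part of the architecture above.
RAW_DATASET_BY_REGION = {
    "US": "raw_lake_us",
    "NON_US": "raw_lake_non_us",
}


@dataclass(frozen=True)
class Source:
    name: str    # e.g. "source_1"
    in_us: bool  # True when the source system resides within the US


def pipeline_options(source: Source, project: str = "my-project") -> dict:
    """Build the per-source settings for one shared pipeline definition."""
    region = "US" if source.in_us else "NON_US"
    dataset = RAW_DATASET_BY_REGION[region]
    return {
        "job_name": f"ingest-{source.name.replace('_', '-')}",
        # Table reference written in project:dataset.table form.
        "output_table": f"{project}:{dataset}.{source.name}",
        "region": region,
    }


# The four sources from the scenario: three in the US, one outside.
SOURCES = [
    Source("source_1", in_us=True),
    Source("source_2", in_us=True),
    Source("source_3", in_us=True),
    Source("source_4", in_us=False),
]

# One set of options per source, all produced by the same function.
ALL_OPTIONS = [pipeline_options(s) for s in SOURCES]

if __name__ == "__main__":
    for opts in ALL_OPTIONS:
        print(opts["job_name"], "->", opts["output_table"])
```

In a real deployment, each options dictionary would presumably be handed to the same Beam pipeline code, so only the configuration differs between sources while the data from Source 4 stays in a separate dataset.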