What is Redshift? A solution for storing and processing big data

Tram Ho

Preface:

As information and communication technology develops, more and more people use e-commerce sites and online services, and the amount of data that has to be stored grows with them. Processing big data has therefore become essential. My company ran into the same problem, and after research and analysis the engineering team decided to use an AWS service: Amazon Redshift, a leading solution for storing and processing big data. Let's learn about it together.


What is Redshift?

  • Redshift is based on PostgreSQL, but it is not designed for OLTP (online transaction processing).
  • Redshift is an OLAP (online analytical processing) system, built for analytics and data warehousing.
  • It stores data by column rather than by row.
  • It provides a SQL interface for writing queries.
  • Amazon Redshift is a fully managed, cloud-hosted data warehouse that scales with your data and delivers fast, secure analytics.
  • Amazon Redshift supports near-real-time and predictive analytics across the data in your databases and data warehouses.
  • According to AWS, tens of thousands of customers use Amazon Redshift to run complex analytical queries and to store and process data ranging from terabytes to petabytes.
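To illustrate why storing data by column suits analytical queries, here is a minimal Python sketch. It is a toy model only, not Redshift's actual storage format; the sample rows, the column names, and the helper `to_columns` are made up for illustration.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The sample "orders" data and all names here are hypothetical.

rows = [
    {"order_id": 1, "customer": "an", "amount": 120},
    {"order_id": 2, "customer": "binh", "amount": 80},
    {"order_id": 3, "customer": "chi", "amount": 200},
]


def to_columns(row_list):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in row_list:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: summing "amount" has to walk through every full row.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: the same aggregate reads only the "amount" column.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both totals are the same; the difference is how much data has to be touched to compute them, which is what matters when a table has many columns and billions of rows.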

Current state of, and demand for, big data processing

  • Systems such as Google, Shopee, eBay, and Facebook see ever more traffic and usage, so the amount of data they hold keeps growing.
  • As a result, big data processing is an urgent need both for large systems and for smaller systems that hold large amounts of data. Amazon Redshift is one efficient solution for processing and storing petabytes of data in the cloud.

Redshift architecture


  • Cluster: the core component of Redshift's architecture. Each cluster consists of one or more nodes that perform the computation.
  • A cluster contains one or more databases.
  • Leader node: handles communication with client applications, plans query execution, distributes the work to the compute nodes, and aggregates their results.
  • Compute node: executes its part of the query and returns the results to the leader node.
  • Node slices: each compute node is further divided into slices. Each slice receives a share of that node's CPU, memory, and storage.
  • Supporting features: backup and restore, security (VPC, IAM, KMS), and monitoring.
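The division of work between the leader node and the slices can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Redshift's implementation: the function names are hypothetical, the cluster size (2 compute nodes with 2 slices each) is assumed, and the slices run one after another here, whereas a real cluster processes them in parallel.

```python
# Toy model of a cluster: a leader node fans an aggregate out to the
# slices of its compute nodes and merges the partial results.
# All function names and sizes are hypothetical.

def split_into_slices(values, slice_count):
    """Deal values round-robin into `slice_count` slices."""
    slices = [[] for _ in range(slice_count)]
    for index, value in enumerate(values):
        slices[index % slice_count].append(value)
    return slices


def slice_partial_sum(slice_values):
    """Work done on one slice: aggregate only its local data."""
    return sum(slice_values)


def leader_sum(slices):
    """Leader node: collect one partial result per slice, then combine."""
    partials = [slice_partial_sum(s) for s in slices]
    return sum(partials), partials


# Assumed layout: 2 compute nodes x 2 slices each = 4 slices in total.
slices = split_into_slices(list(range(1, 11)), slice_count=4)
total, partials = leader_sum(slices)
print(partials, total)  # [15, 18, 10, 12] 55
```

Each slice only ever sees its own share of the data, and the leader only sees one small partial result per slice, which is why adding slices spreads the work out.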

Redshift – Snapshot & DR (Disaster recovery)

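  • A Redshift cluster traditionally runs in a single Availability Zone, with no “Multi-AZ” mode (AWS has since added a Multi-AZ option for RA3 clusters).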
  • To back up a cluster, you take snapshots, which are stored in S3.
  • You can restore a snapshot to a new cluster.
  • Snapshots can be created automatically or manually:
    • Automated: snapshots are taken on a schedule, for example every 8 hours or every 5 GB of changed data…
    • Manual: you take the snapshot on demand.
  • You can configure Amazon Redshift to automatically copy a cluster's snapshots to another Region.
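The automated schedule mentioned in the list (every 8 hours or every 5 GB) amounts to a simple "whichever comes first" rule. The Python sketch below only illustrates that rule; the function `automated_snapshot_due` is hypothetical and is not part of any AWS API, and the two thresholds are taken from the text.

```python
# Sketch of the "every 8 hours or every 5 GB" trigger.
# The thresholds come from the article; the function is hypothetical.

SNAPSHOT_INTERVAL_HOURS = 8
SNAPSHOT_CHANGE_GB = 5


def automated_snapshot_due(hours_since_last, gb_changed_since_last):
    """Return True as soon as either threshold has been reached."""
    return (
        hours_since_last >= SNAPSHOT_INTERVAL_HOURS
        or gb_changed_since_last >= SNAPSHOT_CHANGE_GB
    )


print(automated_snapshot_due(3, 1.2))  # False: neither threshold reached
print(automated_snapshot_due(3, 6.0))  # True: enough data has changed
print(automated_snapshot_due(9, 0.1))  # True: enough time has passed
```

In practice Redshift applies this schedule itself; you would only write code for manual snapshots, for example through the Redshift client's `create_cluster_snapshot` call in the AWS SDK for Python.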

What is Redshift Spectrum?

Redshift Spectrum lets you query data directly in Amazon S3 without loading it into Redshift tables.
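As a rough sketch of what this looks like in practice, the Python helpers below build the SQL statements typically used with Spectrum: one to register an external schema, one to describe files in S3 as an external table, and one to query it. The schema name, catalog database, IAM role ARN, bucket path, and columns are all placeholders, and the helper functions themselves are hypothetical. The SQL is built with plain string formatting for readability only, so it should not be fed untrusted input.

```python
# Sketch: SQL text for querying S3 data through Redshift Spectrum.
# Schema, database, role ARN, bucket, and columns are all placeholders.

def external_schema_sql(schema, catalog_db, iam_role_arn):
    """Register an external schema backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema, table, s3_location):
    """Describe Parquet files in S3 as a table; no data is loaded."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    order_id BIGINT,\n"
        "    amount   DECIMAL(10, 2)\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


statements = [
    external_schema_sql(
        "spectrum_demo",
        "demo_catalog",
        "arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    ),
    external_table_sql("spectrum_demo", "orders", "s3://example-bucket/orders/"),
    # The external table is then queried like any other table.
    "SELECT SUM(amount) FROM spectrum_demo.orders;",
]

for statement in statements:
    print(statement, end="\n\n")
```

The statements would be run against the cluster with whichever SQL client or driver you already use; the data itself stays in S3 throughout.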

How does Redshift process and store data?

  • Amazon Redshift distributes and stores table data across its nodes. A user-defined distribution key determines which node each row is stored on.
  • When a query runs, Redshift uses the distribution key to locate the data on the nodes. The more nodes there are, the smaller the portion of data each one has to scan, so lookups are faster: it is easier to find something in a small space.
  • In addition, each table is sorted by a user-defined sort key, which helps optimize searches and queries.
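The idea behind the distribution key and the sort key can be modeled in a short Python sketch. This only mimics the concept: Redshift's real hashing and on-disk layout differ, the node count is assumed, and the table, column names, and helper functions are hypothetical.

```python
# Toy model of a distribution key and a sort key.
# This mimics the idea only; Redshift's real hashing and layout differ.
import zlib

NODE_COUNT = 4  # assumed cluster size


def node_for(dist_key_value, node_count=NODE_COUNT):
    """Map a distribution-key value to a node using a stable hash."""
    digest = zlib.crc32(str(dist_key_value).encode("utf-8"))
    return digest % node_count


def distribute(rows, dist_key, sort_key, node_count=NODE_COUNT):
    """Place each row on a node by dist key, then sort each node by sort key."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[node_for(row[dist_key], node_count)].append(row)
    for node_rows in nodes:
        node_rows.sort(key=lambda row: row[sort_key])
    return nodes


def lookup(nodes, dist_key, value):
    """An equality lookup on the dist key only has to scan one node."""
    target = nodes[node_for(value, len(nodes))]
    return [row for row in target if row[dist_key] == value]


rows = [
    {"customer_id": cid, "order_day": day}
    for cid, day in [(7, 3), (2, 1), (7, 1), (5, 2), (2, 4), (9, 2)]
]
nodes = distribute(rows, dist_key="customer_id", sort_key="order_day")
print(lookup(nodes, "customer_id", 7))
# [{'customer_id': 7, 'order_day': 1}, {'customer_id': 7, 'order_day': 3}]
```

All rows with the same `customer_id` land on the same node, so the lookup skips the other three nodes entirely, and the rows come back already ordered by the sort key.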

Benefits of using Amazon Redshift

  • Cost: AWS advertises Amazon Redshift as up to 3x cheaper than other cloud data warehouses and 10x cheaper than traditional databases (under $1,000 per terabyte per year). Its fast query processing also helps improve the experience of users and customers.
  • Security: data is stored entirely in the cloud and is protected by AWS security features such as VPC, IAM, and KMS.
  • Capacity: Redshift scales to petabytes of storage, so running out of storage space is no longer a concern.

Source: Viblo