Business-critical data can arrive in an AWS environment from many different sources. Some data can be processed in batches, while other data must be handled in real time. This post explains, in simplified form, how to process data from different sources using AWS tools.
How to process batch or queued data in AWS?
1. S3 Bucket:
The data source deposits data in an S3 bucket for processing. The data is then picked up by either an AWS Glue crawler or an AWS Glue ETL job.
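For example, a producer could drop a file into the landing bucket with boto3; the bucket and key names below are hypothetical placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file produced by the data source into the landing bucket.
# A prefix-based layout (raw/sales/...) keeps later crawls narrowly scoped.
s3.upload_file(
    Filename="sales-2024-01.csv",
    Bucket="my-raw-data-bucket",
    Key="raw/sales/sales-2024-01.csv",
)
```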
What is AWS Glue Crawler?
An AWS Glue crawler is a kind of job defined in AWS Glue that crawls a database or an S3 bucket to find raw data meeting certain criteria and populates the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. More information about crawlers can be found in the AWS documentation section “Defining Crawlers”.
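As a minimal sketch, a crawler can be defined and started with boto3; the crawler, role, database, and bucket names here are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes the discovered
# schemas into a Data Catalog database. The IAM role must allow Glue
# to read the S3 path.
glue.create_crawler(
    Name="raw-sales-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-raw-data-bucket/raw/sales/"}]},
)

# Run it once on demand (a Schedule can be set instead for periodic runs).
glue.start_crawler(Name="raw-sales-crawler")
```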
What is AWS Glue ETL?
AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.
ETL stands for extract, transform, and load. ETL jobs are used in conjunction with AWS Glue to build data warehouses and data lakes and to generate output streams.
For more information about AWS Glue, refer to the AWS documentation section “AWS Glue: How it works”.
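To make this concrete, here is a skeleton of a Glue ETL job script. It assumes it runs inside a Glue job (where the awsglue libraries are provided) and that a crawler has already registered a table; the database, table, and bucket names are hypothetical:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read the table the crawler created in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_sales_db", table_name="sales"
)

# Transform: keep only the columns needed downstream.
trimmed = source.select_fields(["order_id", "amount", "order_date"])

# Load: write the result back to S3 as Parquet for Athena or EMR to query.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://my-processed-data-bucket/sales/"},
    format="parquet",
)
job.commit()
```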
2. Prepare the AWS Glue Data Catalog
The next step, after the data has been scanned from the S3 bucket, is to prepare the AWS Glue Data Catalog.
What is AWS Glue Data Catalog?
The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location, schema, and runtime metrics of your data. You use the information in the Data Catalog to create and monitor your ETL jobs. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store. Typically, you run a crawler to take inventory of the data in your data stores, but there are other ways to add metadata tables into your Data Catalog.
More information about the AWS Glue Data Catalog can be found in the AWS documentation.
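Once the crawler has run, the catalog can be inspected programmatically. A small sketch with boto3 (the database name is hypothetical):

```python
import boto3

glue = boto3.client("glue")

# List the tables the crawler registered, with their S3 location and columns.
response = glue.get_tables(DatabaseName="raw_sales_db")
for table in response["TableList"]:
    descriptor = table["StorageDescriptor"]
    columns = [col["Name"] for col in descriptor["Columns"]]
    print(table["Name"], descriptor["Location"], columns)
```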
3. The AWS Glue Data Catalog can then be referenced by either Amazon EMR or Amazon Athena for further processing, and the data is ultimately made available in a human-readable format.
What is Amazon EMR?
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Amazon EMR makes it easy to set up, operate, and scale your big data environments by automating time-consuming tasks like provisioning capacity and tuning clusters. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark.
More information about Amazon EMR can be found in the AWS documentation.
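As an illustration, a PySpark job on an EMR cluster that is configured to use the AWS Glue Data Catalog as its metastore can query the crawler's table like a regular SQL table; the names below are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the EMR cluster was created with the Glue Data Catalog enabled
# as the Hive metastore, so catalog tables are visible to Spark SQL.
spark = (
    SparkSession.builder
    .appName("daily-sales-totals")
    .enableHiveSupport()
    .getOrCreate()
)

totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total "
    "FROM raw_sales_db.sales "
    "GROUP BY order_date"
)
totals.write.mode("overwrite").parquet("s3://my-processed-data-bucket/daily-totals/")
```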
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning.
More detailed information about Amazon Athena can be found in the AWS documentation.
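For example, the same catalog table can be queried from Athena with boto3; the database, table, and result-bucket names are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Athena queries run asynchronously; results land in the output bucket.
run = athena.start_query_execution(
    QueryString=(
        "SELECT order_date, SUM(amount) AS total "
        "FROM sales GROUP BY order_date"
    ),
    QueryExecutionContext={"Database": "raw_sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print("Started query:", run["QueryExecutionId"])

# Poll get_query_execution() until the state is SUCCEEDED, then fetch
# rows with get_query_results().
```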
How to process real-time data in AWS?
1. Kinesis Data Stream: The data source feeds data into a Kinesis data stream, and the data is then processed using AWS Lambda.
What is Kinesis Data Stream?
Amazon Kinesis Data Streams (KDS) is a massively scalable and durable real-time data streaming service. KDS can continuously capture gigabytes of data per second from hundreds of thousands of sources such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more.
More information about Amazon Kinesis Data Streams (KDS) and its benefits can be found in the AWS documentation.
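For example, a producer can push events into a stream with boto3; the stream name and event fields are hypothetical:

```python
import json

import boto3

kinesis = boto3.client("kinesis")

# The partition key determines which shard receives the record, so
# records for the same user stay ordered relative to each other.
event = {"user_id": "u-42", "action": "click", "page": "/pricing"}
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```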
What is AWS Lambda?
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. With Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code as a ZIP file or container image, and Lambda automatically and precisely allocates compute execution power and runs your code based on the incoming request or event, for any scale of traffic. You can set up your code to automatically trigger from over 200 AWS services and SaaS applications or call it directly from any web or mobile app. You can write Lambda functions in your favorite language (Node.js, Python, Go, Java, and more) and use both serverless and container tools, such as AWS SAM or Docker CLI, to build, test, and deploy your functions.
More about AWS Lambda can be found in the AWS documentation.
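A minimal sketch of a Lambda function consuming the stream above (via a Kinesis event source mapping); the processing step is a hypothetical placeholder:

```python
import base64
import json

def handler(event, context):
    """Invoked with a batch of Kinesis records; payloads arrive base64-encoded."""
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        # Hypothetical processing step: enrich, filter, or forward the event.
        print(f"processing {data.get('action')} from user {data.get('user_id')}")
    return {"processed": len(event["Records"])}
```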
2. The data is then streamed into Kinesis Data Firehose, which passes the data stream to the Kinesis Data Analytics service; the output is ultimately transformed into a human-readable form that is displayed to the user.
What is Kinesis Data Firehose?
Amazon Kinesis Data Firehose is the easiest way to reliably load streaming data into data lakes, data stores, and analytics services. It can capture, transform, and deliver streaming data to Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, generic HTTP endpoints, and service providers like Datadog, New Relic, MongoDB, and Splunk. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, transform, and encrypt your data streams before loading, minimizing the amount of storage used and increasing security.
More information about Amazon Kinesis Data Firehose can be found in the AWS documentation.
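For example, records can be sent to a delivery stream with boto3; the delivery stream name is hypothetical, and Firehose handles batching and delivery to the configured destination:

```python
import json

import boto3

firehose = boto3.client("firehose")

# Firehose buffers records and delivers them in batches; a trailing
# newline keeps records separable when they land in S3.
record = {"user_id": "u-42", "action": "click", "page": "/pricing"}
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```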
What is Kinesis Data Analytics?
Amazon Kinesis Data Analytics takes care of everything required to run streaming applications continuously, and scales automatically to match the volume and throughput of your incoming data. With Amazon Kinesis Data Analytics, there are no servers to manage, no minimum fee or setup cost, and you only pay for the resources your streaming applications consume.
More about Amazon Kinesis Data Analytics can be found in the AWS documentation.
That’s all folks … hope it helps. Watch out for other posts about AWS.