In this blog, my data stream is written to DynamoDB.
New data arriving on the stream from the producer triggers the Lambda function, which writes the records to DynamoDB.
How is the data produced? Kindly refer to my earlier blog post.
NoSQL vs Relational
A NoSQL database organizes data so that its shape in the database corresponds to how it will be queried.
An RDBMS reshapes data when a query is processed. For example, SQL joins use extra computation to denormalize the data and produce the end result.
In the relational model (look at the simple 1-to-N relationship above), entities are normalized and complex queries are needed to get results.
A 1-to-N relationship can be achieved in a single table to address the above query:
username (partition key) and order_id (sort key)
Partition Key (PK): used as input to an internal hash function; the output determines the partition in which the item will be stored.
Example: personid is unique for each item.
Sort Key (SK): items with the same partition key are stored together, in sorted order by the sort key.
Example: Alex (PK) has two different orders, us123 and us124 (SK).
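To make this concrete, here is a minimal boto3 sketch of that single-table 1-to-N pattern; the table name 'orders' and the region are placeholders of mine, not from the original design:

import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical orders table with username (PK) and order_id (SK)
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
orders = dynamodb.Table('orders')

# Same partition key, different sort keys: both items land in
# the same partition and are kept in sort-key order
orders.put_item(Item={'username': 'Alex', 'order_id': 'us123'})
orders.put_item(Item={'username': 'Alex', 'order_id': 'us124'})

# One query returns all of Alex's orders, already sorted
response = orders.query(KeyConditionExpression=Key('username').eq('Alex'))
print(response['Items'])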
NoSQL DBs
NoSQL datastores are key-value, document, graph, or column-based data stores. I would like to give a quick snapshot of various NoSQL stores.
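For example, Redis and DynamoDB are commonly classed as key-value stores, MongoDB as a document store, Neo4j as a graph database, and Cassandra as a column-family store.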
Why NoSQL?
My requirement is to store data as it arrives from the producer stream, eliminating the traditional relational design.
I have collected the access patterns of my end application, which are simple key lookups and sort patterns.
In the future, I will need a scalable data store.
Modeling my CSV dataset
I am using the credit card complaints dataset (https://data.world/dataquest/bank-and-credit-card-complaints).
My access patterns are:
get complaint by ID
get complaint IDs for the issue "billing"
get complaints by state
get complaints by state, submitted via web or post mail
I decided to use Complaint ID as the partition key, which helps distribute data evenly at scale. The sort key helps to sort the issues for each complaint.
Use the DynamoDB create-table GUI in the Management Console and provide the table name and key schema.
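If you prefer code to the console, here is a minimal boto3 sketch of the same step; the table name 'complaints' and the attribute names complaint_id and issue are my assumed mappings, not necessarily the exact ones used in the project:

import boto3

dynamodb = boto3.client('dynamodb', region_name='<region>')

# Sketch of the console steps: complaint_id as partition key,
# issue as sort key (assumed attribute names)
dynamodb.create_table(
    TableName='complaints',
    AttributeDefinitions=[
        {'AttributeName': 'complaint_id', 'AttributeType': 'S'},
        {'AttributeName': 'issue', 'AttributeType': 'S'}],
    KeySchema=[
        {'AttributeName': 'complaint_id', 'KeyType': 'HASH'},
        {'AttributeName': 'issue', 'KeyType': 'RANGE'}],
    ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1})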
To accommodate all access patterns, I need to create a global secondary index as below.
Once the table is created, browse to that table to find the option to create the index.
NOTE: make sure to adjust the read/write capacity units, which determine the cost. For my project, I need just one unit of each.
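For completeness, a boto3 sketch of adding that index in code; the index name and key attributes (state, submitted_via) are assumptions of mine based on the access patterns above:

import boto3

dynamodb = boto3.client('dynamodb', region_name='<region>')

# Add a GSI so the table can also be queried by state and channel
dynamodb.update_table(
    TableName='complaints',
    AttributeDefinitions=[
        {'AttributeName': 'state', 'AttributeType': 'S'},
        {'AttributeName': 'submitted_via', 'AttributeType': 'S'}],
    GlobalSecondaryIndexUpdates=[{
        'Create': {
            'IndexName': 'state-submitted_via-index',
            'KeySchema': [
                {'AttributeName': 'state', 'KeyType': 'HASH'},
                {'AttributeName': 'submitted_via', 'KeyType': 'RANGE'}],
            'Projection': {'ProjectionType': 'ALL'},
            'ProvisionedThroughput': {
                'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1}}}])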
Lambda Function
Here is the link to the Python code, which processes the stream data and puts the items into DynamoDB.
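For a quick idea of what that function does, here is a minimal sketch, assuming a Kinesis-triggered Lambda and the same assumed attribute names; see the linked code for the real implementation:

import base64
import json
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('complaints')

def lambda_handler(event, context):
    # Each Kinesis record arrives base64-encoded; decode it to JSON
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        item = json.loads(payload)
        # Write the decoded complaint straight into DynamoDB
        table.put_item(Item=item)
    return {'statusCode': 200}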
Query
Make a connection to DynamoDB:
import boto3
from boto3.dynamodb.conditions import Key

# Connect to DynamoDB (fill in your region and credentials)
dynamodb = boto3.resource(
    'dynamodb',
    region_name='<region>',
    aws_access_key_id='<aws_access_key_id>',
    aws_secret_access_key='<aws_secret_access_key>')

tablename = "<table_name>"
table = dynamodb.Table(tablename)
The queries below satisfy the access patterns.
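Using the connection above, here are hedged examples of those queries; the placeholder values and the assumed attribute/index names stand in for the real ones:

# Get complaint by ID (partition key lookup)
response = table.query(
    KeyConditionExpression=Key('complaint_id').eq('<complaint_id>'))
print(response['Items'])

# Narrow a complaint to the "billing" issue (partition key + sort key)
response = table.query(
    KeyConditionExpression=Key('complaint_id').eq('<complaint_id>')
                           & Key('issue').eq('Billing'))

# Complaints by state and channel, served by the global secondary index
response = table.query(
    IndexName='state-submitted_via-index',
    KeyConditionExpression=Key('state').eq('<state>')
                           & Key('submitted_via').eq('Web'))
print(response['Items'])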
Further Study
If you are interested in DynamoDB data modeling, the links below are good starting points.
Data modeling with Amazon DynamoDB (AWS re:Invent): https://www.youtube.com/watch?v=DIQVJqiSUkE
Conclusion
I have collated the series of activities I followed to write my CSV dataset to DynamoDB. Most of my time was spent on data conversion.
Understanding bytes, bytearrays, and JSON, and converting from one form to another, is essential. In addition, the Lambda test function comes in handy to simulate the functionality.
I recommend not going too deep into data modeling at first; solve your problem statement, and then you can explore other dimensions.
I would love to hear some feedback in the comment section below.
Follow me @ linkedin.com/in/kvbr