Data lakes are often built on top of object storage, such as AWS S3.
With object storage, data is stored and managed as objects. Each object consists of the data itself, any relevant metadata (such as when the object was last modified), and a unique identifier.
Object storage is particularly helpful for storing and retrieving growing amounts of data of any type, which makes it a good foundation for data lakes.
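The three parts of an object (data, metadata, unique identifier) can be sketched in a few lines of Python. This is a toy in-memory model, not the S3 API: the names StoredObject and ToyObjectStore, and the metadata keys, are assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid


@dataclass
class StoredObject:
    """Hypothetical model of one object: data, metadata, unique identifier."""
    data: bytes
    metadata: dict = field(default_factory=dict)
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyObjectStore:
    """In-memory stand-in for an object store (illustration only, not S3)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, **metadata) -> str:
        # Record when the object was last modified as part of its metadata.
        metadata["last_modified"] = datetime.now(timezone.utc).isoformat()
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = ToyObjectStore()
# Objects of any type sit side by side; the store does not care about format.
review_id = store.put(b"great product", content_type="text/plain")
image_id = store.put(b"\x89PNG...", content_type="image/png")
```

Because every object is addressed only by its identifier and described by its metadata, the store needs no schema up front, which is what lets a data lake hold mixed data types.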
AWS Data Wrangler
AWS Glue
Amazon Athena
Athena is serverless
Athena is based on Presto, an open source distributed SQL engine developed for this exact use case: running interactive queries against data sources of all sizes.
2.2 Data visualization
Week 2
1 Statistical bias
1.1 Statistical bias
1.2 Statistical bias causes
Societal: These biases can be introduced because of preconceived notions that exist in society. Data generated by humans can be biased because all of us have unconscious biases.
Data drift (data shift): Data drift happens when the data distribution varies significantly from the distribution of the training data that was used to initially train the model.
Covariate drift: The distribution of the independent variables or features that make up your dataset can change.
Prior probability drift: The distribution of your labels or target variables might change.
Concept drift: The relationship between the features and the labels can change.
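The drift types above can be made concrete by comparing distributions between the training data and newer data. The sketch below is one simple way to do this, assuming categorical values and using total variation distance as the measure. The notes do not prescribe a particular metric, and the sample data is invented for illustration.

```python
from collections import Counter


def distribution(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = len(values)
    return {key: count / total for key, count in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences; 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Prior probability drift: the label distribution changes.
train_labels = ["positive"] * 50 + ["negative"] * 50
live_labels = ["positive"] * 80 + ["negative"] * 20
label_shift = total_variation_distance(
    distribution(train_labels), distribution(live_labels)
)

# Covariate drift: the distribution of a feature changes.
train_feature = ["short"] * 70 + ["long"] * 30
live_feature = ["short"] * 40 + ["long"] * 60
feature_shift = total_variation_distance(
    distribution(train_feature), distribution(live_feature)
)

print(f"label shift: {label_shift:.2f}, feature shift: {feature_shift:.2f}")
```

Concept drift is not visible in either marginal distribution on its own. Detecting it requires comparing how labels relate to features, for example the label distribution within each feature value, across the two datasets.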
1.3 Measuring statistical bias
1.4 Detecting statistical bias
1.5 Detect statistical bias with Amazon SageMaker Clarify
1.6 Approaches to statistical bias detection
SageMaker Data Wrangler:
Connect to multiple data sources and explore data in a more visual format.
Only use a subset of your data to detect bias
SageMaker Clarify:
Large volumes of data
1.7 Feature importance: SHAP
Week 3
1 Automated Machine Learning
1.1 Automated Machine Learning (AutoML)
1.2 AutoML Workflow
Ingest & Analyze
1.3 Amazon SageMaker Autopilot
1.4 Running experiments with Amazon SageMaker Autopilot
1.5 Amazon SageMaker Autopilot: evaluating output
1.6 Model Hosting
Week 4
1 Built-in algorithms
1.1 Built-in algorithms
1.2 Use cases and algorithms
1.3 Text analysis
One challenge with Word2Vec: Out of vocabulary issues
Its vocabulary only contains three million words. The vocabulary is the set of words that the model learned in the training phase. Out-of-vocabulary words are words that were not present in the text dataset the model was initially trained on. If a word is not found in the vocabulary, the model architecture assigns a zero to that word, which effectively discards it.
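The out-of-vocabulary behavior can be sketched with a toy lookup table. This is a minimal illustration, not Word2Vec itself: the two-word vocabulary, the vector values, and the names lookup and embed are all made up, and the zero is represented here as an all-zero vector.

```python
EMBEDDING_DIM = 3

# Toy vocabulary with invented vectors (not real Word2Vec weights).
vocabulary = {
    "great": [0.9, 0.1, 0.3],
    "product": [0.2, 0.8, 0.5],
}


def lookup(word):
    """Return the word's vector, or a zero vector if it is out of vocabulary."""
    return vocabulary.get(word, [0.0] * EMBEDDING_DIM)


def embed(sentence):
    """Map each whitespace-separated, lowercased token to its vector."""
    return [lookup(token) for token in sentence.lower().split()]


# "prodct" is a misspelling, so it was never seen in training.
vectors = embed("Great prodct")
```

The misspelled token maps to all zeros, so it contributes nothing downstream even though a human reader would still understand it. That is the information loss the out-of-vocabulary issue refers to.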