Types of Data (according to structure)
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284653
| Type of data | Definition | Characteristics | Examples |
| --- | --- | --- | --- |
| Structured | Defined schema | Easily queryable; rows and columns; consistent structure | Relational DBs; spreadsheets or CSV files with consistent columns |
| Unstructured | No schema or predefined structure | Requires heavy preprocessing to query; may come in various formats | Raw text (e.g. emails, word processing docs…); raw media (videos, audio, images…) |
| Semi-structured | Some level of structure (tags, hierarchies…) | Elements might be tagged or categorized; more flexible than structured; not as chaotic as unstructured | XML & JSON files; emails (structured headers: date, subject… + unstructured: body…); log files with varied formats |
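
To make the three categories concrete, here is a minimal Python sketch using only the standard library. The sample data, field names and variable names are made up for illustration; they do not come from the course. It shows how much work is needed before each kind of data can be queried.

```python
import csv
import io
import json

# Structured: every row has the same columns (a fixed schema),
# so a value can be read directly by column name.
csv_text = "id,name,age\n1,Ana,34\n2,Luis,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(row["age"]) for row in rows]

# Semi-structured: keys give some structure, but records may differ,
# so the reader has to tolerate missing fields.
json_text = '[{"id": 1, "name": "Ana", "tags": ["vip"]}, {"id": 2, "name": "Luis"}]'
records = json.loads(json_text)
tags = [record.get("tags", []) for record in records]

# Unstructured: no schema at all; the text must be preprocessed
# (here, a naive tokenization) before anything can be queried.
raw_text = "Hi team, the invoice for March is attached. Thanks, Ana"
tokens = raw_text.lower().replace(",", " ").replace(".", " ").split()
mentions_invoice = "invoice" in tokens
```

The structured case needs no guessing, the semi-structured case needs defaults for optional fields, and the unstructured case needs a preprocessing step that a real pipeline would replace with something far more robust.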
Properties of Data (3 Vs)
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284671
- Volume
- Velocity
  - High velocity → real-time or near-real-time processing
- Variety
  - Structured? Mixed? Multiple sources? Multiple formats?
- 💡 …Veracity?
  - Some sources say there are 4 Vs ;)
Data Warehouses, Data Lakes & Data Lakehouses
Ref: https://www.udemy.com/course/aws-certified-machine-learning-engineer-associate-mla-c01/learn/lecture/45284683
- Data Warehouse (DWH) (e.g. Amazon Redshift)
  - Centralized repository optimized for analysis (read-heavy operations), where data from different sources is stored in a structured format
- Data Lake (e.g. Amazon S3 can be used as a data lake)
  - Storage repository that holds vast amounts of raw data in its native format (a predefined structure is not necessary)
  - Holds structured, semi-structured & unstructured data
| Property | Data Warehouse | Data Lake |
| --- | --- | --- |
| Schema | Schema-on-write | Schema-on-read |
| Data Types | Structured | Structured, Unstructured & Semi-structured |
| Agility | Less agile/flexible | More agile |
| Processing Pipelines | ETL | ELT |
| Cost | Optimizations for complex queries are expensive | Cost-effective storage, costs rise when processing lots of data |
| Use when… | Structured data sources; fast & complex query requirements; BI & analytics | Big variety of data sources with different structures; scalable, flexible & cost-effective storage solution; advanced analytics, ML & data discovery |
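
The schema-on-write vs. schema-on-read and ETL vs. ELT rows can be sketched in a few lines of Python. This is only a toy illustration: an in-memory SQLite table stands in for the warehouse and a plain list of raw JSON strings stands in for the lake. The event fields and the names `sales`, `customer` and `read_amounts` are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events arriving from different sources.
raw_events = [
    '{"customer": "ana", "amount": "12.50", "device": "mobile"}',
    '{"customer": "luis", "amount": "7"}',
    '{"customer": "eva", "note": "no amount recorded"}',
]

# Warehouse style (ETL, schema-on-write): the schema exists before loading,
# so each record is transformed and validated *before* it is stored.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
for line in raw_events:
    event = json.loads(line)
    if "amount" not in event:
        continue  # does not fit the schema, rejected at write time
    warehouse.execute(
        "INSERT INTO sales VALUES (?, ?)",
        (event["customer"], float(event["amount"])),
    )
warehouse_rows = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
warehouse_total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# Lake style (ELT, schema-on-read): everything is stored as-is,
# and a structure is only imposed when the data is read.
lake = list(raw_events)

def read_amounts(lines):
    """Apply a schema at read time, defaulting missing amounts to 0."""
    for line in lines:
        event = json.loads(line)
        yield float(event.get("amount", 0.0))

lake_total = sum(read_amounts(lake))
# Fields the warehouse schema never anticipated are still available later.
devices = [json.loads(line).get("device") for line in lake]
```

The warehouse path pays the transformation cost up front and gets a clean table that is cheap to query. The lake path keeps every record (including the one without an amount and the extra `device` field) and pays the cost each time the data is read.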