Join us for an exciting workshop on working with big data. This workshop will be led by GDG member Yaakov Bressler.
Bring your laptop!
Why is this course important?
- If your data is too big, don't throw in the towel: there are several ways to process it
- If you already process big data, there may be more efficient or cost-effective ways to do it
- Build the right solution for the right problem
If you take this course:
You will know how to process big data in multiple ways and which is the best choice for you.
Themes:
- Multiple ways to process a file
  - In memory
  - In chunks
  - Streaming (sometimes the same as chunking, sometimes not)
  - MapReduce
  - Massively parallel processing (MPP) [out of scope]
- Big data is I/O-bound (when downloading/uploading big files)
  - Compress when possible
  - Move compute closer to the data (private network, VPC, access point, or the actual data center)
- Don't do things twice
  - Caching (via disk): don't download a file twice
  - Incrementalism: use your data to determine offsets so you don't process data twice
- Orchestrate pipelines instead of executing straight code
  - Simplifies complex systems
  - Allows delegation to other machines
- Big, powerful tools can be expensive, but sometimes they are worth it
  - Possibly a demonstration of processing all of this in Snowflake or BigQuery
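To give a flavor of the first theme, here is a minimal sketch contrasting in-memory and chunked processing of a CSV file. It is only an illustration, not the workshop's actual material: the `amount` column, the sample data, and the helper names (`total_in_memory`, `iter_chunks`, `total_in_chunks`) are all hypothetical, and it uses only the Python standard library.

```python
import csv
import io
from itertools import islice


def total_in_memory(lines):
    """Load every row first, then aggregate. Simple, but memory grows with file size."""
    rows = list(csv.DictReader(lines))
    return sum(float(row["amount"]) for row in rows)


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def total_in_chunks(lines, chunk_size=2):
    """Aggregate one chunk at a time, so only chunk_size rows are held at once."""
    total = 0.0
    for chunk in iter_chunks(csv.DictReader(lines), chunk_size):
        total += sum(float(row["amount"]) for row in chunk)
    return total


# Hypothetical sample data standing in for a large file on disk.
SAMPLE_CSV = "id,amount\n1,10.5\n2,4.5\n3,20.0\n4,5.0\n5,2.0\n"

in_memory_total = total_in_memory(io.StringIO(SAMPLE_CSV))
chunked_total = total_in_chunks(io.StringIO(SAMPLE_CSV), chunk_size=2)
print(in_memory_total, chunked_total)
```

Both approaches produce the same total; the difference is how much data sits in memory at any one time. With a real file you would pass an open file handle instead of `io.StringIO`.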
Prerequisites: (Complete at least 1 day in advance)
- Familiarity with the Python programming language
- Familiarity with SQL
- Complete the installation of the necessary software (following this guide)
  - Python installed on your machine
  - Install poetry (dependency management)
  - Install pyenv (Python version manager)
  - Install duckdb
Resources:
NOTE:
Due to limited space, we have very few spots available for this workshop. (Priority will be given to PSU students or alumni.) Feel free to join the waitlist and we'll let you know when space opens up.
👉 Want more? See all upcoming events: gdg-portland.dev