Explore the core principles of data engineering in this tech talk, with a focus on the discovery process.
Overview:
In this session, we will cover the essential building blocks of data engineering, with a spotlight on the discovery process. We move from framing the problem statement through exploratory data analysis (EDA) and data modeling, working in Python with VSCode, Jupyter Notebooks, and GitHub, so you gain a solid understanding of the fundamentals that drive effective data engineering projects.
Agenda:
1. Introduction:
- Unveiling the importance of the discovery process in data engineering.
- Setting the stage with a real-world problem statement that will guide our exploration.
2. Data Loading and Preparation:
- Downloading a CSV file from a URL directly into memory using the `requests` library.
- Creating a pandas DataFrame from the downloaded content.
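As a rough illustration of this step, the sketch below downloads CSV content and parses it in memory. The helper names (`frame_from_csv_bytes`, `load_csv_from_url`), the timeout value, and the sample bytes are assumptions made for this example and are not taken from the talk.

```python
import io

import pandas as pd
import requests


def frame_from_csv_bytes(content: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes into a DataFrame without writing to disk."""
    return pd.read_csv(io.BytesIO(content))


def load_csv_from_url(url: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download a CSV over HTTP and parse the response body in memory."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return frame_from_csv_bytes(response.content)


# Offline demonstration: made-up bytes stand in for a downloaded file.
sample = b"STATION,ENTRIES,EXITS\nA,120,100\nB,200,150\n"
df = frame_from_csv_bytes(sample)
print(df.head())
```

Splitting the parsing from the download keeps the parsing step easy to exercise without network access.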
3. Exploratory Data Analysis (EDA):
- Using `DataFrame.describe()` to generate summary statistics for numerical columns.
- Interpreting the summary statistics to understand the distribution and spread of the data.
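A minimal sketch of the summary step, using a small made-up DataFrame (the column names and values are illustrative assumptions, not the talk's dataset):

```python
import pandas as pd

# Illustrative data only; the real dataset is not specified here.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B"],
    "entries": [100, 200, 300, 400],
    "exits": [50, 150, 250, 350],
})

# describe() summarizes the numeric columns by default:
# count, mean, std, min, quartiles, and max.
summary = df.describe()
print(summary)

# Two simple measures of spread derived from the summary table.
value_range = summary.loc["max"] - summary.loc["min"]
iqr = summary.loc["75%"] - summary.loc["25%"]
print(value_range)
print(iqr)
```

Comparing the mean with the median (the `50%` row) gives a quick hint about skew, and the range and interquartile range indicate how widely the values are spread.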
4. Data Cleaning and Transformation:
- Identifying and separating categorical and numerical columns.
- Renaming columns to follow a consistent naming convention (e.g., converting to lowercase, renaming specific columns).
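The cleaning step might look like the sketch below. The raw column names, including `C/A`, and the `booth` target name are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative raw frame; the column names are assumed, not from the talk.
df = pd.DataFrame({
    "STATION": ["A", "B"],
    "C/A": ["R001", "R010"],
    "ENTRIES": [120, 200],
    "EXITS": [100, 150],
})

# Separate columns by type: numeric ones versus everything else.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

# Normalize names to lowercase, then rename a specific column.
df.columns = df.columns.str.lower()
df = df.rename(columns={"c/a": "booth"})  # hypothetical rename

print(categorical_cols, numerical_cols)
print(df.columns.tolist())
```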
5. Data Modeling:
- Creating dimension and fact tables to organize the data for efficient querying and reporting.
- Using pandas to create `dim_station` and `dim_booth` tables and a `fact_turnstile` table.
- Demonstrating how to join these tables to display relevant information using SQL-like syntax with the `pandasql` library.
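One possible shape for this model is sketched below. The table names come from the agenda; the columns, surrogate keys, and sample rows are assumptions. To avoid an extra dependency, the join runs through the standard-library `sqlite3` module rather than `pandasql`, which is a substitution made for this sketch.

```python
import sqlite3

import pandas as pd

# Illustrative source rows; columns and values are assumed.
raw = pd.DataFrame({
    "station": ["A", "A", "B"],
    "booth": ["R001", "R002", "R010"],
    "entries": [120, 80, 200],
    "exits": [100, 60, 150],
})

# Dimension: one row per station with a surrogate key.
dim_station = raw[["station"]].drop_duplicates().reset_index(drop=True)
dim_station["station_id"] = range(1, len(dim_station) + 1)
dim_station = dim_station[["station_id", "station"]]

# Dimension: one row per booth, linked to its station.
dim_booth = raw[["booth", "station"]].drop_duplicates().reset_index(drop=True)
dim_booth["booth_id"] = range(1, len(dim_booth) + 1)
dim_booth = dim_booth.merge(dim_station, on="station")
dim_booth = dim_booth[["booth_id", "booth", "station_id"]]

# Fact: the measures, referencing the dimensions by key only.
fact_turnstile = raw.merge(dim_booth, on="booth")
fact_turnstile = fact_turnstile[["station_id", "booth_id", "entries", "exits"]]

# Load the frames into an in-memory SQLite database and join with SQL.
conn = sqlite3.connect(":memory:")
tables = {
    "dim_station": dim_station,
    "dim_booth": dim_booth,
    "fact_turnstile": fact_turnstile,
}
for name, table in tables.items():
    table.to_sql(name, conn, index=False)

query = """
    SELECT s.station, b.booth, f.entries, f.exits
    FROM fact_turnstile AS f
    JOIN dim_station AS s ON s.station_id = f.station_id
    JOIN dim_booth AS b ON b.booth_id = f.booth_id
    ORDER BY s.station, b.booth
"""
report = pd.read_sql_query(query, conn)
conn.close()
print(report)
```

Keeping descriptive attributes in the dimension tables means the fact table carries only keys and measures, which is what makes the join above straightforward.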
6. Visualization:
- Converting line plots to bar charts using `matplotlib` to visualize the data and gather requirements.
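The difference between the two chart types can be seen by plotting the same values both ways. The station labels and counts below are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; not needed inside a Jupyter notebook
import matplotlib.pyplot as plt

# Illustrative totals per station (assumed values).
stations = ["A", "B", "C"]
entries = [120, 200, 90]

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# A line plot implies continuity between categories.
ax_line.plot(stations, entries, marker="o")
ax_line.set_title("Line plot")

# A bar chart treats each station as a separate category.
ax_bar.bar(stations, entries)
ax_bar.set_title("Bar chart")

for ax in (ax_line, ax_bar):
    ax.set_xlabel("station")
    ax.set_ylabel("entries")

fig.tight_layout()
```

For categorical data such as stations, the bar chart is usually the easier one to read, since there is no meaningful trend between neighboring categories.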
7. Real-World Application:
- Applying insights gained from EDA to address the initial problem statement.
- Discussing practical solutions and strategies derived from the discovery process.
Key Features:
- In-Memory Data Handling: Efficiently downloading and processing data without writing to disk.
- Comprehensive EDA: Detailed analysis of data distributions and summary statistics.
- Data Modeling: Clear separation of descriptive attributes and measurable facts using dimension and fact tables.
- SQL Integration: Leveraging SQL syntax to manipulate and join pandas DataFrames using the `pandasql` library.
- Visualization: Creating informative visualizations to represent data insights.
Key Takeaways:
- Mastery of the foundational aspects of data engineering.
- Hands-on experience with EDA techniques, emphasizing the discovery phase.
- Appreciation for the value of a code-centric approach in the data engineering discovery process.
Upcoming Talks:
Join us for subsequent sessions in our Data Engineering Process Fundamentals series, where we will delve deeper into specific facets of data engineering, exploring topics such as data modeling, pipelines, and best practices in data governance.
ozkary.com
VP of product development
GDG Organizer