Google Summer of Code 2025 - Beam ML Vector DB/Feature Store integrations

What Will I Cover In This Blog Post?

I have three objectives in mind when writing this blog post:

  • Documenting the work I’ve been doing during this GSoC period in collaboration with the Apache Beam community
  • A thoughtful and cumulative thank you to my mentor and the Beam Community
  • Writing to an earlier version of myself, from before I made my first ever contribution to Beam, which may also help future contributors

What Was This GSoC Project About?

The goal of this project is to enhance Beam’s Python SDK by developing connectors for vector databases like Milvus and feature stores like Tecton. These integrations will improve support for ML use cases such as Retrieval-Augmented Generation (RAG) and feature engineering. By bridging Beam with these systems, this project will attract more users, particularly in the ML community.
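To make the idea of a vector database enrichment step concrete, here is a minimal, library-free sketch. It is not the Beam or Milvus API: `InMemoryVectorStore`, `Document`, and `enrich` are hypothetical names I made up for illustration, and the embeddings are tiny hand-written vectors. It only shows the shape of the operation that a RAG pipeline needs: take an element that carries a query embedding, look up its nearest stored documents, and attach them as context.

```python
import math
from dataclasses import dataclass


@dataclass
class Document:
    """A stored item: an id, its text, and a precomputed embedding."""
    doc_id: str
    text: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database such as Milvus."""

    def __init__(self, documents):
        self._documents = list(documents)

    def search(self, query_embedding, top_k=1):
        # Rank every stored document by similarity to the query (brute force).
        ranked = sorted(
            self._documents,
            key=lambda d: cosine_similarity(query_embedding, d.embedding),
            reverse=True,
        )
        return ranked[:top_k]


def enrich(element: dict, store: InMemoryVectorStore, top_k: int = 1) -> dict:
    """Return a copy of the element with its nearest documents attached."""
    matches = store.search(element["embedding"], top_k=top_k)
    return {**element, "context": [m.text for m in matches]}


# Toy data: two documents with 2-dimensional embeddings.
store = InMemoryVectorStore([
    Document("a", "Beam is a unified programming model", [1.0, 0.0]),
    Document("b", "Milvus stores vectors", [0.0, 1.0]),
])
enriched = enrich({"query": "what is beam", "embedding": [0.9, 0.1]}, store)
print(enriched["context"])
```

In a real pipeline the brute-force search would be replaced by a call to the vector database, and the lookup would run inside a Beam transform rather than a plain function.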

Why Was This Project Important?

While Beam’s Python SDK supports some vector databases, feature stores, and embedding generators, the current integrations are limited to a few systems, as listed in the table below. Expanding this ecosystem will make ML workflows more flexible, particularly in feature engineering and RAG applications, and could attract more users from the ML community.

Vector Database | Feature Store | Embedding Generator
BigQuery        | Vertex AI     | Vertex AI
AlloyDB         | Feast         | Hugging Face

Why Did I Choose Beam As Part of GSoC Among 180+ Orgs?

I chose to apply to Beam from among 180+ GSoC organizations because it aligns well with my passion for data processing systems that serve information retrieval systems and my core career values:

  • Freedom: Working on Beam supports open-source development, liberating developers from vendor lock-in through its unified programming model while enabling services like Project Shield to protect free speech globally

  • Innovation: Working on Beam allows engagement with cutting-edge data processing techniques and distributed computing paradigms

  • Accessibility: Working on Beam helps build open-source technology that makes powerful data processing capabilities available to all organizations regardless of size or resources. This accessibility enables projects like Project Shield to provide free protection to media, elections, and human rights websites worldwide

What Did I Work On During the GSoC Program?

During my GSoC program, I focused on developing connectors for vector databases, feature stores, and embedding generators to enhance Beam’s ML capabilities. Here are the artifacts I worked on and what remains to be done:

Type               | System    | Artifact
Enrichment Handler | Milvus    | PR #35216, PR #35577, PR #35467
Sink I/O           | Milvus    | PR #35708, PR #35944
Enrichment Handler | Tecton    | PR #36062
Sink I/O           | Tecton    | PR #36078
Embedding Gen      | OpenAI    | PR #36081
Embedding Gen      | Anthropic | To Be Added

Here are side-artifacts that are not directly linked to my project:

Type               | System             | Artifact
AI Code Review     | Gemini Code Assist | PR #35532
Enrichment Handler | CloudSQL           | PR #34398, PR #35473
Pytest Markers     | GitHub CI          | PR #35655, PR #35740, PR #35816

For more granular contributions, check out my ongoing Beam contributions.

How Did I Approach This Project?

My approach centered on community-driven design and iterative implementation, originally inspired by my mentor’s work. Here’s how it looked:

  1. Design Document: Created a comprehensive design document outlining the proposed ML connector architecture
  2. Community Feedback: Shared the design with the Beam developer community mailing list for review
  3. Iterative Implementation: Incorporated community feedback and applied learnings in subsequent pull requests
  4. Continuous Improvement: Refined the approach based on real-world usage patterns and maintainer guidance

Here are some samples of those design docs:

Component | Type                              | Design Document
Milvus    | Vector Enrichment Handler         | [Proposal][GSoC 2025] Milvus Vector Enrichment Handler for Beam
Milvus    | Vector Sink I/O Connector         | [Proposal][GSoC 2025] Milvus Vector Sink I/O Connector for Beam
Tecton    | Feature Store Enrichment Handler  | [Proposal][GSoC 2025] Tecton Feature Store Enrichment Handler for Beam
Tecton    | Feature Store Sink I/O Connector  | [Proposal][GSoC 2025] Tecton Feature Store Sink I/O Connector for Beam

Where Did Challenges Arise During The Project?

There were two places where challenges arose:

  • Running Docker TestContainers in Beam Self-Hosted CI Environment: Beam’s CI runs on Ubuntu 20.04, which caused compatibility and connectivity issues with Milvus TestContainers in the Docker-in-Docker environment, showing up as container startup failures and network connectivity errors. After several rounds of trial and error, I tested with the latest Ubuntu (Ubuntu 25.04 at the time of writing this blog post), and the issues no longer appeared

  • Triggering and Modifying the PostCommit Python Workflows: This challenge magnified the one above, because for every experimental change to a workflow I had to do a round trip with my mentor to get the change into the relevant workflow files and evaluate the results. Until near the middle of GSoC, I also wasn’t aware that contributors can trigger post-commit Python workflows by updating the trigger files in .github/trigger_files. I later discovered a workflows README in .github/workflows/README.md, which was not referenced in CONTRIBUTING.md at the time of writing this post
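
For future contributors, here is a small sketch of what "updating a trigger file" can look like in practice. It assumes a trigger file is a small JSON document with a numeric counter; the `modification` field name and the `example_postcommit.json` file name are my assumptions for illustration, so check the workflows README and the actual files under .github/trigger_files before relying on them. The helper `bump_trigger_file` is hypothetical, and the demo runs against a temporary directory rather than a real checkout.

```python
import json
import tempfile
from pathlib import Path


def bump_trigger_file(path: Path) -> int:
    """Increment the counter in a JSON trigger file and return the new value.

    Assumes the file holds a JSON object; a missing file is treated as empty.
    """
    data = json.loads(path.read_text()) if path.exists() else {}
    data["modification"] = int(data.get("modification", 0)) + 1
    path.write_text(json.dumps(data, indent=2) + "\n")
    return data["modification"]


# Demo against a throwaway directory, not a real Beam checkout.
with tempfile.TemporaryDirectory() as tmp:
    trigger = Path(tmp) / "example_postcommit.json"  # hypothetical file name
    trigger.write_text(json.dumps({"comment": "edit to trigger", "modification": 1}))
    new_value = bump_trigger_file(trigger)
    print(new_value)  # prints 2
```

Committing a change like this in a PR is what causes the matching workflow to run, so any trivial edit to the file should serve the same purpose.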

How Did This Project Start To Attract Users in the ML Community?

Even before the Milvus Enrichment Handler PR was merged, we started to see community-driven contributions like this one that adds Qdrant, a competitor to Milvus in the vector database space. This suggests that the project’s momentum and visibility in the ML community attracted contributors who wanted to expand the Beam ML ecosystem with additional vector database integrations.

How Did This GSoC Experience Working With Beam Community Shape Me?

If I have to boil it down across three dimensions, they would be:

  • Mindset: Before, I was probably working in solitude, opening PRs for new integrations with my fingers crossed, hoping there would be no divergence on the design. Now I engage the people I am working with through design docs, making sure my work aligns with their vision, which potentially leads to faster PR merges
  • Skillset: I had last written Python professionally a year before contributing to Beam, so it was a great opportunity to brush up on my Python skills and see how some design patterns are used in practice, like the query builder pattern in CloudSQL Vector Ingestion in the RAG package. I also learned about vector databases, feature stores, and some AI integrations. I think I also got better at root cause analysis and at filtering signal from noise in long log files, like those of the PostCommit Python workflows
  • Toolset: Learning about Beam Python SDK, Milvus, Tecton, Google CloudSQL, OpenAI and Anthropic text embedding generators, and lnav for effective log file navigation, including their capabilities and limitations
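
The query builder pattern mentioned under Skillset can be sketched in a few lines. This is not the code from Beam’s RAG package: `InsertQueryBuilder` and its methods are hypothetical names, and the generated SQL assumes PostgreSQL-style `ON CONFLICT` syntax with named parameters. The point is only to show the pattern: chained calls collect the pieces of a statement, and `build()` assembles them at the end.

```python
from dataclasses import dataclass, field


@dataclass
class InsertQueryBuilder:
    """Hypothetical builder for a parameterized upsert statement."""
    table: str
    columns: list[str] = field(default_factory=list)
    conflict_columns: list[str] = field(default_factory=list)

    def with_column(self, name: str) -> "InsertQueryBuilder":
        self.columns.append(name)
        return self  # returning self is what allows call chaining

    def on_conflict(self, *names: str) -> "InsertQueryBuilder":
        self.conflict_columns.extend(names)
        return self

    def build(self) -> str:
        if not self.columns:
            raise ValueError("at least one column is required")
        cols = ", ".join(self.columns)
        params = ", ".join(f":{c}" for c in self.columns)
        sql = f"INSERT INTO {self.table} ({cols}) VALUES ({params})"
        if self.conflict_columns:
            keys = ", ".join(self.conflict_columns)
            # Update every non-key column with the incoming value.
            updates = ", ".join(
                f"{c} = EXCLUDED.{c}"
                for c in self.columns
                if c not in self.conflict_columns
            )
            action = f"DO UPDATE SET {updates}" if updates else "DO NOTHING"
            sql += f" ON CONFLICT ({keys}) {action}"
        return sql


query = (
    InsertQueryBuilder("embeddings")
    .with_column("id")
    .with_column("content")
    .on_conflict("id")
    .build()
)
print(query)
```

Keeping the assembly in one `build()` method means callers never concatenate SQL strings themselves, which is the main thing I took away from seeing the pattern in practice.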

Tips for Future Contributors

If I have to boil them down to three, they would be:

  • Observing: Observing how experienced developers in the Beam dev team work—how their PRs look, how they write design docs, what kind of feedback they get on their design docs and PRs, and how you can apply it (if feasible) to avoid getting the same feedback again. What kind of follow-up PRs do they create after their initial ones? How do they document and illustrate their work? What kind of comments do they post when reviewing other people’s related work? Over time, you build your own mental model and knowledge base on how the ideal contribution looks in this area. There is a lot to learn and explore in an exciting, not intimidating way
  • Orienting: Understanding your place in the ecosystem and aligning your work with the project’s context. This means grasping how your contribution fits into Beam’s architecture and roadmap, identifying your role in addressing current gaps, and mapping stakeholders who will review, use, and maintain your work. Most importantly, align with both your mentor’s vision and the community’s vision to ensure your work serves the broader goals
  • Acting: Acting on feedback from code reviews, design document discussions, and community input. This means thoughtfully addressing suggested changes in a way that moves the discussion forward, addressing concerns raised by maintainers, and iterating on your work based on community guidance. It also means being responsive to feedback, asking clarifying questions when needed, and demonstrating that you’re incorporating the community’s input into your contributions, provided it is aligned with the project direction

Who Do I Want To Thank for Making This Journey Possible?

If I have to boil them down to three, they would be:

  • My Mentor, Danny McCormick: I wouldn’t hesitate to say that Danny is the best mentor I have worked with so far, and I have worked with several mentors. Here is what makes me say that:
    • Generosity: Danny is very generous with his time and feedback, and genuinely committed to reviewing my work on a regular basis. We had weekly 30-minute sync calls over almost 21 weeks (5 months) since the official community bonding period, where he shared his contextual expertise and addressed any questions I had. He was open to extending the time when needed and flexible about skipping calls when there was no agenda
    • Flexibility: When I got accepted to GSoC, after a few days I also got accepted to a part-time internship that I had applied to before GSoC, while also managing my last semester in my Bachelor of Computer Science, which was probably the hardest semester. During our discussion about working capacity, Danny was very flexible regarding that, with more emphasis on making progress, which encouraged me to make even more progress. I have also never felt there are very hard boundaries around my project scope—I felt there was an area to explore that motivated me to think of and add some side-artifacts to Beam, e.g., adding Gemini Code Assist for AI code review
    • Proactivity: Danny was very proactive in offering support and help without my having to ask, e.g., creating Beam Infra tickets to add API keys that unblocked my work
  • Beam Community: From my first ever contribution to Beam, adding FlattenWith and Tee examples to the playground, I was welcomed with open arms and felt encouraged to make more contributions. I am also grateful for their valuable comments on my design documents on the dev mailing list and on my PRs
  • Google: I would like to genuinely thank Google for introducing me to open source in GSoC 2023 and giving me a second chance to interact with Apache Beam through GSoC 2025. Without it, I probably wouldn’t be here writing this blog post, nor would I have this fruitful experience

What’s Next?

I am now focusing on helping move the remaining artifacts in this project’s scope from in progress to merged. After that, I would love to keep contributing to Beam, including its Python and Go SDKs. I would also love to connect with you all on my LinkedIn and GitHub.

References