Malvern, PA
Vanguard is not offering visa sponsorship for this position.
Provides subject matter expertise in the maintenance of a reliable site environment, to ensure the stability and security of multiple systems/platforms. Develops and implements improvements for all aspects of software reliability.
Core Responsibilities
1. Collaborates with internal teams to evaluate the health, stability and reliability of systems/platforms. Provides subject matter expertise on architecture and programming design decisions related to availability and resilience.
2. Leads analysis of localized failure modes when new features and architecture patterns are introduced. Facilitates post-incident reviews for any client-impacting events local to the product family.
3. Leads the planning and execution of chaos experiments to meet the development and maintenance requirements of systems/platforms for the product family. Coordinates performance tests for the product family.
4. Leads product teams in triage and troubleshooting during client-impacting incidents.
5. Ensures alignment between service level indicators and objectives within the product family.
6. Maintains product-level runbooks for incident response, in collaboration with SRE Practitioners on each product team, to document the step-by-step process to recover from failures of specific components within a system. Makes final decisions regarding usage of tools, libraries, and standards for SRE in situations where multiple options have been provided by SRE.
7. Participates in special projects and performs other duties as assigned.
Additional Details
This position is the initial role in a new Site Reliability team that will define and implement best practices for observability, establish and maintain service level indicators (SLIs) and service level objectives (SLOs), track and address toil, conduct blameless root cause post-mortems, and incorporate preventative and proactive SRE practices. This will include working with Architects, Data Engineers, and Data Analysts to identify root causes, resolve issues, optimize existing systems, enhance infrastructure, and promote automation to reduce effort and increase reliability.
Additional Responsibilities
- Works closely with leaders to establish and iteratively implement the SRE practice.
- Gains insights into PI CDAO operations, demonstrates and champions site reliability culture and practices, builds relationships, and influences SRE ways of working.
- Exhibits deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices with the ability to implement these practices within an application or platform.
- Communicates progress, issues, and solutions to management and business clients to obtain their input or buy-in as appropriate. Provides written and verbal communication to multiple organizations and audiences within Vanguard on the status of assigned projects and issues.
- Elevates more complex problems and client issues/concerns when necessary and follows up to ensure resolution.
- Maintains proactive knowledge and understanding of pending elevations, enhancements, and infrastructure changes. Proactively identifies potential failure points and designs strategies to ensure that failures remain localized, preventing widespread disruption and contagion.
Qualifications
- Minimum of eight years of related experience, with at least two years of development experience.
- Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
- 3-5+ years of Site Reliability Engineering experience.
- 3-5+ years of DevOps experience.
- Strong analytic and problem-solving skills.
- Self-motivated individual with the ability to prioritize and manage changing priorities.
- Extensive knowledge and understanding of working in AWS and with Python and SQL.
- Proficiency and experience in observability and telemetry tools such as Splunk, CloudWatch, Grafana, Datadog, etc.
About Vanguard
We are Vanguard. Together, we're changing the way the world invests.
For us, investing doesn't just end in value. It starts with values. Because when you invest with courage, when you invest with clarity, and when you invest with care, you can
We want to make success accessible to everyone. This is our opportunity. Let's make it count.
Inclusion Statement
Vanguard's continued commitment to diversity and inclusion is firmly rooted in our culture. Every decision we make to best serve our clients, crew (internally, employees are referred to as crew), and communities is guided by one simple statement: "Do the right thing."
We believe that a critical aspect of doing the right thing requires building diverse, inclusive, and highly effective teams of individuals who are as unique as the clients they serve. We empower our crew to contribute their distinct strengths to achieving Vanguard's core purpose through our values.
When all crew members feel valued and included, our ability to collaborate and innovate is amplified, and we are united in delivering on Vanguard's core purpose.
Our core purpose: To take a stand for all investors, to treat them fairly, and to give them the best chance for investment success.
How We Work
Vanguard has implemented a hybrid working model for the majority of our crew members, designed to capture the benefits of enhanced flexibility while enabling in-person learning, collaboration, and connection. We believe our mission-driven and highly collaborative culture is a critical enabler to support long-term client outcomes and enrich the employee experience.