PySpark jobs
...standing up a series of production data pipelines and need an IT professional who can move comfortably between Python scripting, SQL optimisation, PySpark transformations and Airflow orchestration. The immediate focus is end-to-end pipeline build-out: designing clean ingestion logic, transforming data in Spark, writing efficient queries and scheduling everything through well-structured Airflow DAGs. If you can demonstrate hands-on experience across all four technologies (Python, SQL, PySpark and Airflow) and enjoy owning a pipeline from raw source to curated output, I’d like to work together. Deliverables I’m expecting: • A working set of PySpark jobs that handle ingestion, transformation and output staging • Airflow DAGs that s...
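For flavour, a minimal sketch of the DAG shape I have in mind (the dag_id, schedule and script paths below are illustrative, not from an existing repo):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily pipeline: ingest, then transform, then stage output.
with DAG(
    dag_id="daily_pipeline",            # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # Airflow 2.4+; older releases use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="spark-submit jobs/ingest.py",    # placeholder path
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="spark-submit jobs/transform.py", # placeholder path
    )
    ingest >> transform  # transformation only runs after ingestion succeeds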
...platform built on Spark and Python, and I need an experienced engineer who can sit with me in Pune three to four times a week (about 2 hours a day) to keep it running smoothly. Most of the immediate work revolves around tracking down bugs in existing PySpark jobs, but the role naturally extends to writing fresh code when gaps appear, tightening our data-pipeline orchestration, and mapping end-to-end data lineage so every downstream consumer stays confident in their numbers. Typical day-to-day work you will tackle: • Debug production PySpark jobs and accompanying Python utilities • Refactor or rewrite modules where quick fixes will not suffice • Optimise and monitor pipeline schedules (Palantir Foundry) • Document lineage and hand off clear, r...
I need a reusable ETL framework built inside Databricks notebooks, version-controlled in Bitbucket and promoted automatically through a Bitbucket Pipeli...attached to any cluster. Acceptance criteria • Parameter-driven notebooks organised by layer. • Reusable GraphQL connector packaged as a .whl. • Bitbucket Pipelines yaml that runs unit tests, uses the Databricks CLI to deploy notebooks, and executes an integration test on commit. • Clear README detailing how to add a new API endpoint and where to place cleaning logic. Leverage native tools—PySpark, SQL, Delta Lake, dbutils—while keeping external libraries to a minimum and fully documented. Please share a brief outline of your approach and any relevant Databricks + Bitbucket CI experience s...
I’m a beginner looking for a 1-on-1 Databricks instructor for a very hands-on, fast-paced 2-week program. Requirements: - Strong real-world Databricks experience - Hands-on Apache Spark (PySpark), SQL, Delta Lake - Real use case / mini project (end-to-end pipeline) - Live screen sharing, coding together - Beginner-friendly but practical (no theory-only) Goal: By the end of 2 weeks, I want to confidently build and understand a real Databricks data pipeline. Availability: 5–6 sessions per week, 1–1.5 hours per session Please share: - Your Databricks experience - How you would structure these 2 weeks - Your hourly rate Thanks!
...across multiple source systems. Build and optimize Foundry pipelines using Code Workbooks (PySpark, SQL, Scala) and Quiver. Support data integration, feature engineering, and pipeline debugging for production AI workloads. Implement security and permissions architecture aligned with enterprise governance. Help develop Foundry applications using Workshop, Contour, and Slate for analytics and decision-making. Guide on best practices for CI/CD, testing, and deployment within Foundry. Provide mentorship and troubleshooting support during live client engagements. Required Skills: Strong hands-on experience with Palantir Foundry (Ontology, Code Workbooks, Quiver, Workshop). Proficiency in Python, PySpark, and SQL. Experience with data modeling, transformation logic, and pipelin...
Need someone with strong streaming experience to design, develop and deploy a PySpark publishing and upserting job on EMR. Environment experience needed: Spark, MongoDB (DocumentDB) connector, AWS EMR, Step Functions, CloudWatch, Docker, Kafka cluster architecture, Airflow DAGs, GitLab, PyCharm, Cursor AI IDE, etc.
...patterns that Databricks loves to test. • Fresh practice questions (or a curated question bank) with detailed explanations so I understand not just the right answer but the thinking process. • At least one full-length mock exam under timed conditions followed by a debrief on weak areas and strategies to avoid common pitfalls. I work mainly in the Databricks notebook environment with Python, PySpark, and SQL, so please weave real-world examples into the prep. I’m flexible on session times and frequency; we can agree milestones and refine the plan as we go. If you’ve already helped others pass this exam—or you hold the certification yourself—tell me how you’d tackle my study roadmap and what materials you’d bring to the table. I...
...actual medicines and would map once the inconsistencies are ironed out, so I want the process to be fully automated, driven by a robust auto-correct algorithm rather than manual review. Remaining 0.1% could be non medical entries, and need to be deleted. I am open to proven techniques—fuzzy matching, phonetic hashing, Levenshtein, word embeddings, or a hybrid—as long as they scale. Python, pandas, PySpark, or any other big-data friendly stack is fine, provided the final solution is reproducible and well documented. Deliverables • Clean, executable scripts (Jupyter notebook or .py) that ingest both files, normalise product names, detect duplicates, and output a one-to-one mapping table. • A brief README explaining dependencies, algorithm logic, and how ...
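To illustrate the flavour of solution I’m open to, a minimal sketch using stdlib edit-distance matching (file and column names are hypothetical; at this scale you would add blocking or switch to PySpark):

import difflib

import pandas as pd

def normalise(name: str) -> str:
    # Lower-case, trim and collapse whitespace before comparing.
    return " ".join(str(name).lower().split())

def best_match(name, reference, cutoff=0.9):
    # Closest reference product name above the similarity cutoff, else None.
    hits = difflib.get_close_matches(normalise(name), reference, n=1, cutoff=cutoff)
    return hits[0] if hits else None

raw = pd.read_csv("raw_products.csv")            # hypothetical input file
clean = pd.read_csv("reference_products.csv")    # hypothetical reference file
reference = [normalise(n) for n in clean["product_name"]]
raw["mapped_name"] = raw["product_name"].map(lambda n: best_match(n, reference))
raw.dropna(subset=["mapped_name"]).to_csv("mapping_table.csv", index=False)

Anything that fails to match above the cutoff surfaces as a null mapping, which is where the non-medical ~0.1% would be dropped automatically.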
...Infrastructure: Microsoft Azure (Functions, Logic Apps, Service Bus, Blob Storage, Data Factory, Azure DevOps), AWS Cloud, Docker, Kubernetes, RabbitMQ • CRM, ERP & Enterprise Platforms: Microsoft Dynamics CRM 365, Dynamics Business Central, Sage CRM, NopCommerce, Sitefinity v12.2, Umbraco v8.0, DotNetNuke v4.0 • Python, AI & Advanced Solutions: Python, Django, Flask, Pyramid, REST APIs, WebSockets, PySpark, AI Email & Chatbot Solutions, Data Science & Analytics • CMS, E-Commerce & Web Platforms: WordPress, Joomla, Drupal, Prestashop, PHP-based systems • BI, Finance & Business Support: Power BI, Advanced Excel, Accounting, Finance & Bookkeeping, Data Entry & Business Reporting, MS Office Suite • Tools & Delivery Methodology: Git (Version Control), N...
...short interpretive notes that fold easily into manuscripts. What matters most is hands-on mastery of data extraction, table linking, and general database management within MIMIC. Solid grounding in observational study design, epidemiology, and EHR quirks is essential; a background in medicine or public health will make communication smoother. Working code in SQL plus either tidyverse/R or pandas/PySpark is expected. The immediate deliverable is a fully cleaned analytic dataset with the accompanying scripts and an outline of the statistical approach. After that, I plan to keep the collaboration open for additional projects and sensitivity analyses as new questions arise....
I’m looking for a Data Engineer with strong AWS native services experience to help build and support an event-driven data platform. This project focuses on automated batch data pipelines, data governance, and making data available in a secure and scalable way. This is not ad-hoc ETL — it’s a platform-style setup. Tech stack involved: • AWS: S3, SQS, Lambda, MWAA (Airflow), EMR Serverless • Data Processing: PySpark, Apache Spark • Data Lake: Apache Iceberg, AWS Glue Catalog • Governance & Security: Lake Formatio...
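As a sketch of what the Iceberg landing step could look like on this stack (the bucket, catalog and table names are placeholders; assumes the Iceberg Spark runtime and Glue catalog are configured on the EMR Serverless job):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch_load").getOrCreate()

# Batch file arrival triggered via S3 event -> SQS -> Airflow (MWAA) in this design.
df = spark.read.parquet("s3://landing-bucket/events/")   # placeholder path

# DataFrameWriterV2 append into an Iceberg table registered in the Glue catalog.
df.writeTo("glue_catalog.analytics.events").append()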
...Remote Working Time: evening Budget: 22-24k monthly Duration: 2 hours per day Demo Required: Today Job Description We are seeking an experienced Senior Data Engineer with strong expertise in the Healthcare Payer domain to design, build, and maintain scalable data pipelines and reporting solutions. The ideal candidate will have hands-on experience across AWS and Microsoft Azure, strong Python/PySpark skills, and the ability to support integrated reporting and analytics using Power BI. Key Responsibilities Design, develop, and maintain end-to-end data pipelines for healthcare payer data Build and optimize ETL/ELT workflows using AWS Glue, Step Functions, and Python Work with Azure and AWS cloud services for data ingestion, processing, and storage Implement and manage Data ...
My current résumé sells me as a data engineer, yet my next move is a Data Analyst role. I need the Work Experience and Skills sections re-worked so recruiters immediately see me as a strong analytical hire. Here’s what you’ll be working with • Hands-on background in Hadoop administration, PySpark development, Databricks workflows and day-to-day data analysis. • A solid foundation in SQL and reporting tools, though these strengths are not highlighted well in the document. What I’m after • Rewrite both sections to spotlight analytical impact, business-friendly storytelling and in-demand keywords (think SQL, dashboards, data visualization, statistical insight, KPI tracking, etc.). • Re-order bullet points around results, not...
The core of my remote-sensing crop-yield project is in place, but the code will not run from start to finish. I need a fresh set of eyes to hunt down and eliminate the blockers so that the pipeline executes smoothly on Databricks and locally. Current state • Repository already contains: – Spark-based preprocessing notebooks (PySpark) – Trained ML model scripts and saved artefacts – A handful of Databricks experiment notebooks for exploration What I need most Debugging is the priority. I am not after a full rewrite—I want the existing pieces to work together. You are free to suggest refactors where they remove obvious bottlenecks, but the first milestone is simply getting the code to run cleanly. Focus areas • Spark preprocessi...
We are seeking a freelancer proxy for a Data Engineer role to support a remote healthcare data platform. The work will be 5 to 6 hours per day. You will be required to sit alongside the engineer during work hours, explain work...operational runbooks for knowledge sharing • Support and guide production-grade pipelines built on Dagster, DBT, Airflow, AWS Glue, and SSIS Required Skills & Tech Stack: • Python (Strong) • SQL (Advanced) • Dagster, DBT, Airflow, AWS Glue • AWS: Athena, Glue, SQS, SNS, IAM, CloudWatch • Databases: PostgreSQL, AWS RDS, Oracle, Microsoft SQL Server • Data Modeling & Query Optimization • Pandas, PySpark, PyCharm • Terraform, Docker, DataGrip, VS Code • Git/GitHub and CI/CD pipelines • Experience wi...
I have an existing SAS program that handles end-to-end data processing for a single SQL Database source. The code cleans raw tables, applies a series of transformations, then produces several aggregated outputs that feed downstream reports. I now need the entire workflow re-implemented in PySpark running on Azure Databricks so I can retire the SAS environment and take advantage of Databricks’ scalability. You will receive: • The original .sas files with inline comments that explain each step • A data-dictionary of the SQL tables involved • Sample input/output datasets to verify parity What I’m expecting from you: 1. A well-structured Databricks notebook (or .py files) that reproduces the SAS logic for data cleaning, transformation, and aggregat...
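To show the kind of translation involved, a hedged one-step example (the SAS step and all names here are invented for illustration, not taken from the actual .sas files):

# SAS:  proc means data=clean.loans sum;
#         class branch; var amount;
#       run;
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
loans = spark.read.table("clean.loans")  # stands in for the SAS input dataset

# The class/var aggregation becomes a PySpark groupBy.
summary = loans.groupBy("branch").agg(F.sum("amount").alias("amount_sum"))
summary.show()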
...AWS and Databricks. This role is focused on hands-on execution, optimization, and support within a clearly defined scope. Key Responsibilities Enhance and maintain existing Databricks (PySpark) data pipelines Work with AWS services such as S3, Glue, Lambda, Redshift/Athena Optimize data workflows for performance and reliability Implement data transformations, validations, and incremental loads Troubleshoot and resolve pipeline and data issues Maintain documentation for assigned components Required Experience & Skills 6–8 years of experience in Data Engineering Strong hands-on expertise in Python & PySpark Proven experience with Databricks Good knowledge of AWS data services Strong SQL and data modeling skills Ability to work independently in a remote setu...
...running the usual checks for duplicates, missing values, and outliers. Once it is clean, I expect you to apply the appropriate statistical and machine-learning techniques—time-series decomposition, clustering, cohort or basket analysis, whichever combination best surfaces trend signals. Python or R is fine (Pandas, NumPy, scikit-learn, tidyverse, etc.), and if you prefer a big-data stack such as PySpark, that works too; the volume will justify it. Please package the outcome as: • A concise written report (PDF or Markdown) that explains the key trends and how you arrived at them. • Visualisations (static or interactive) that make the findings easy to consume for non-technical stakeholders—Matplotlib, Seaborn, Plotly, or Tableau Public dashboards are all a...
I need an experienced engineer who can sit with me in Pune (MH, India) and provide hands-on, offline technical support for daily data engineering tasks. The focus is strictly on Python and PySpark: reviewing code, untangling bugs, optimising Spark jobs, and guiding me through best practices as we build and maintain data-processing pipelines. This is not a remote, on-call role; I’m looking for someone who can be physically present in Kharadi/Viman Nagar/Magarpatta Area—pair programming, white-boarding solutions, and helping me push features all the way to a clean commit. If you have solid production experience with Python, strong command of PySpark’s RDD/DataFrame APIs, and the confidence to troubleshoot performance issues on the spot, let’s talk about a regular...
...Databricks. The idea is for the user to fill out the form, have the data stored directly in a Databricks table and, with one click, generate an executive-summary style report focused on key performance indicators (KPIs). I’m looking for someone who is strong on both the front end (HTML, CSS, JavaScript) and the back-end integration with Databricks: notebooks, Delta Lake, Databricks SQL or PySpark. The flow should look like this: • The form is served as a web component embedded in the Databricks interface (or as a Job/Notebook with widgets). • On submission, the information persists to a Delta table. • A triggered process queries those records and produces the executive report with the most relevant KPIs...
Bank Loan ETL & Visualization Project Report
1. Abstract: This project builds a complete ETL (Extract, Transform, Load) pipeline for bank loan analytics using PySpark and Python. It cleans, validates, and integrates branch, customer, and loan datasets into a unified master table. The pipeline standardizes financial data, generates analytical insights, and prepares the output for reporting and automated financial analysis.
2. Technologies Used: Python, PySpark, Pandas, Matplotlib, CSV files, Java JDK (required by Spark).
3. Dataset Description: This project uses three CSV datasets: branch details (branch_id, branch_name, branch_state), customer demographic information, and loan records linked to customers and branches.
4. ETL Workflow: The
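To make the integration step concrete, a minimal sketch of the two joins that build the master table (file names and the customer_id key are assumed; branch_id comes from the dataset description above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bank_loan_etl").getOrCreate()

branches = spark.read.csv("branch.csv", header=True, inferSchema=True)
customers = spark.read.csv("customers.csv", header=True, inferSchema=True)
loans = spark.read.csv("loans.csv", header=True, inferSchema=True)

# Loan records link to both customers and branches, so the unified master
# table is two left joins away from the loan data.
master = (
    loans
    .join(customers, on="customer_id", how="left")
    .join(branches, on="branch_id", how="left")
    .dropDuplicates()
)
master.write.mode("overwrite").csv("master_table", header=True)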
I need an experienced engineer who can sit with me in Pune and provide hands-on, offline technical support for daily software-development tasks (for a full month, scheduled at our mutual convenience). The focus is strictly on Python and PySpark: reviewing code, optimizing Spark jobs, and guiding me through best practices as we build and maintain data-processing pipelines. This is not a remote, on-call role; I’m looking for someone who can be physically present—pair programming, white-boarding solutions, and helping me push features all the way to a clean commit. If you have solid production experience with Python, strong command of PySpark’s RDD/DataFrame APIs, and the confidence to troubleshoot performance issues on the spot, let’s talk about a regular schedule ...
Hi, thanks for the opportunity. I can support your Databricks and AI Agents project with strong skills in PySpark, SQL, Delta Lake, and data automation. I will handle ETL pipelines, data processing, AI agent integration, and workflow optimization. My rate is 8 USD per hour, and I can work 40 hours per week (320 USD weekly). I can start immediately and will work closely with your team for smooth delivery.
Need someone with strong streaming experience to help me write, design, develop and deploy a PySpark broker publishing job on EMR, using the MongoDB connector and DocumentDB streaming (strong Kafka/Mongo skills required), plus AWS Step Functions, EMR, Docker, Kafka architecture, CloudWatch, and Airflow DAGs.
I need an experienced engineer who can sit with me in Pune and provide hands-on, offline technical support for daily software-development tasks. The focus is strictly on Python and PySpark: reviewing code, untangling bugs, optimising Spark jobs, and guiding me through best practices as we build and maintain data-processing pipelines. This is not a remote, on-call role; I’m looking for someone who can be physically present—pair programming, white-boarding solutions, and helping me push features all the way to a clean commit. If you have solid production experience with Python, strong command of PySpark’s RDD/DataFrame APIs, and the confidence to troubleshoot performance issues on the spot, let’s talk about a regular schedule that works for both of us.
...audit submissions. 13. Able to communicate, plan and execute BI platform audit with internal audit team. Competencies for the job 1) Proven experience with big data solution design and development in Databricks, notebook and schema design, development best practices, and Azure DevOps / CI-CD pipelines 2) Hands-on in Python, PySpark, Spark SQL, Delta Live Tables + Kafka; Azure SQL DB, Azure Data Factory, Azure Databricks, Azure Synapse, Azure Data Lake, Delta, PySpark, Python, Logic Apps, Azure DevOps, CI/CD implementation, Power BI / QlikSense, Blob Storage, ADLS, Azure Key Vault, ETL, SSIS 3) Experience in query development, performance tuning and loading data to Databricks SQL DW 4) Experience in data ingestion into ADLS, Azure Blob Storage, Azure Logic Apps 5) Prac...
Description: Need an experienced Databricks engineer to guide me through adding logging tasks to 2 workflows in Azure Databricks. What needs to be done: Add log_success and log_failure notebook tasks to 2 existing Databricks workflows Config...CRITICAL REQUIREMENT: All work must be done via Zoom screen sharing on MY machine You will guide/instruct me while I make the changes or you can do it I need to learn the process, not just get it done Must Have: Strong Azure Databricks workflows/jobs experience Experience with pipeline logging/monitoring patterns Patient teaching approach Tech Stack: Azure Databricks Unity Catalog Python/PySpark Azure DevOps (YAML configs) Timeline: Start ASAP To Apply: Share your Databricks experience and availability for Zoom sessions (mention your ...
Project Title: Build End-to-End Data Cleaning, ETL Pipeline & SQL Analytics (PySpark) I need a skilled Data Engineer / Data Analyst to build a complete end-to-end data pipeline using the raw CSV files provided. The project involves data cleaning, transformation, building a star schema, implementing ETL logic in PySpark, writing analytical SQL queries, and performing data quality checks. The files included are: (dirty user data – nulls, duplicates, inconsistent casing) (messy categories and SKU formatting) (20k+ orders with mixed date formats, invalid numeric fields) (dirty SKUs, wrong quantities, duplicates) Scope of Work: Data Cleaning & Standardization Fix inconsistent casing, extra spaces, special characters Convert fields into
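As a sketch of the cleaning layer described above (file and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleaning").getOrCreate()

users = spark.read.csv("users.csv", header=True)  # hypothetical dirty user file
users_clean = (
    users
    .withColumn("email", F.lower(F.trim(F.col("email"))))              # casing + spaces
    .withColumn("name", F.regexp_replace("name", r"[^A-Za-z ']", ""))  # special characters
    .dropDuplicates(["user_id"])                                       # duplicate rows
    .na.drop(subset=["user_id"])                                       # null keys
)

# Mixed date formats can be coalesced across the candidate patterns;
# unparseable values fall through to null for the data quality checks.
orders = spark.read.csv("orders.csv", header=True)  # hypothetical orders file
orders_clean = orders.withColumn(
    "order_date",
    F.coalesce(
        F.to_date("order_date", "yyyy-MM-dd"),
        F.to_date("order_date", "dd/MM/yyyy"),
    ),
)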
PySpark EDA & Datasets Conversion - Must make use of PySpark for the Exploratory Data Analysis. Do not have to train model using PySpark - Add comments / describe the EDA - Need to convert from PySpark to Pandas for the test train split. - And load into PySpark - Dataset Hyperparameters: * Forced images to be 48x48 * Using PyTorch not IntensiveLock I can send you the notebook so far. You just need to get it to the point where PySpark data frame converts to pandas and then the train test split before training. If you use online resources then just copy and paste the links in the relevant notebook cells W.r.t. the convolutional neural network it is basic. Pooling layer The convolutional layer Nothing hardset, flexible at the moment and nothing...
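To make the requested handoff concrete, a small sketch using a toy frame (in the real notebook, the EDA DataFrame takes the place of spark_df):

from pyspark.sql import SparkSession
from sklearn.model_selection import train_test_split

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(
    [(i, i % 2) for i in range(100)], ["feature", "label"]
)  # toy stand-in for the EDA DataFrame

# PySpark -> pandas, then the train/test split before PyTorch training.
pdf = spark_df.toPandas()
train_df, test_df = train_test_split(
    pdf, test_size=0.2, random_state=42, stratify=pdf["label"]
)

# And back into PySpark where the brief asks for it.
train_sdf = spark.createDataFrame(train_df)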
I have eleven existing Databricks jobs that need to be packaged and shipped through the new Databricks Asset Bundles workflow. All code for the jobs is already written in PySpark; what’s missing is a clean, reusable deployment script that will: • Collect the Python scripts for each job into a single asset bundle • Resolve internal dependencies and set the correct task-level libraries • Push the bundle to my Databricks workspace (Repos or DBFS) • Programmatically create/update the eleven jobs with their respective schedules and cluster definitions The script must rely on PySpark for any data-processing logic that has to run during deployment, and should use the Databricks CLI or REST API (whichever you’re most comfortable with) to handle workspace inte...
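For reference, a hedged sketch of the update step via the REST API (host, token and the settings payload are placeholders; the Jobs API 2.1 reset call overwrites a job’s settings in place):

import requests

HOST = "https://example.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "dapi-placeholder"                     # placeholder personal access token

def update_job(job_id: int, settings: dict) -> None:
    # Overwrite an existing job's tasks, schedule and cluster definition.
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/reset",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": job_id, "new_settings": settings},
        timeout=30,
    )
    resp.raise_for_status()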
I need a SQL stored procedure converted to PySpark code. The stored procedure currently interacts with a PostgreSQL database and primarily requires DataFrame operations in PySpark. Requirements: - Convert SQL stored procedure to equivalent PySpark DataFrame operations - Ensure the logic and functionality remain consistent with the original SQL - Optimize the PySpark code for performance Ideal Skills & Experience: - Proficiency in PySpark and DataFrame operations - Strong knowledge of PostgreSQL and SQL - Experience with data transformation and migration tasks - Ability to write clean, maintainable, and efficient code Please provide examples of similar work done and any relevant certifications.
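For context, the flavour of translation involved, with an invented example (table, columns and the JDBC URL are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# SQL in the procedure:
#   SELECT customer_id, SUM(amount) AS total
#   FROM orders WHERE status = 'PAID'
#   GROUP BY customer_id HAVING SUM(amount) > 1000;
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")    # placeholder connection
    .option("dbtable", "orders")
    .option("user", "user").option("password", "pass")  # placeholders
    .load()
)

result = (
    orders
    .filter(F.col("status") == "PAID")                  # WHERE
    .groupBy("customer_id")                             # GROUP BY
    .agg(F.sum("amount").alias("total"))                # SUM(...)
    .filter(F.col("total") > 1000)                      # HAVING
)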
...of expert hands for about 2 ½–3 hours each day. We haven’t begun the migration yet, so you’ll step in right at the planning stage and guide it all the way through execution. The work centres on three pillars: • Data migration of relational databases into Snowflake • Building and hardening ETL pipelines in Python / PySpark • Creating and maintaining a clean CI/CD path for everything we deploy You’ll work with a stack that includes AWS, Snowflake, DBT, Python, PySpark and standard DevOps tooling for CI/CD. Along the way we’ll refine data models, set up automated tests, and make sure every job is production-ready before it moves through the pipeline. I’m based in Marathahalli, Bengaluru and strongly prefer so...
I’m ready to replace ...Integration support – provide clear docs, sample calls and, where necessary, helper SDK snippets so my team can wire the API into both the React and Flutter clients without blocking on you. 4. Evaluation – an offline notebook illustrating precision/recall or NDCG on a held-out set, and an online A/B framework outline so we can monitor lift after launch. Nice-to-haves include feature engineering in PySpark, use of TensorFlow Recommenders, and deployment via AWS SageMaker, but I’m open to your preferred stack as long as latency stays low and the pipeline is maintainable. If you have shipped a recommendation system for services before, especially across web and mobile, I’d love to see it. Let’s make our users feel like the p...
...Monitor, troubleshoot, and optimize cloud-based data workflows. Participate in code reviews and follow best practices for maintainable and scalable data solutions. Required Qualifications: Bachelor’s or Master’s degree in Computer Science, Engineering, or related field. 7+ years of hands-on experience in data engineering, with strong focus on AWS services. Proficiency in Python, SQL, and PySpark. Expertise in AWS data services: S3, Redshift, Glue, EMR, Athena, Lambda, Kinesis. Experience designing ETL pipelines, data lakes, and cloud-based data warehouses. Knowledge of CI/CD processes, version control (Git), and agile methodologies. Strong analytical, troubleshooting, and problem-solving skills. AWS certifications like AWS Certified Data Analytics – Sp...
...preferred). - Experience with containerization (Docker) and deployments. - Knowledge of observability and monitoring tools (Grafana, OpenTelemetry, Application Insights, AI Instrumentation). - Solid understanding of clean coding practices and modular design. - Strong problem-solving skills, communication, and ability to work in a collaborative environment. Preferred Skills: - Experience with PySpark for big data processing and analytics. - Exposure to Kubernetes (ArgoCD). - Experience with distributed task orchestration (Celery, Airflow) or messaging (Kafka, RabbitMQ). - Familiarity with advanced logging and monitoring best practices. - Familiarity with Bash, PowerShell scripting, Excel VBA. - Strong SQL knowledge and practical experience with relational databases. What we offe...
...data, process it in real-time with machine learning algorithms, and store analysis results for visualization. Key components proposed include Apache Kafka for data streaming, Apache Spark (Streaming) for real-time processing, Apache Hive (or Hadoop HDFS) for data warehousing, and MongoDB for storing processed results. All code will likely be written in a high-level language (such as Python via PySpark) to integrate these components. Below, we break down the project requirements and plan into specific sections. Big Data Tools and Frameworks: The pipeline will leverage the following technologies: • Apache Kafka: Kafka is a distributed publish-subscribe messaging system ideal for ingesting and transporting real-time data streams. It is highly scalable, fault-tolerant, and low-l...
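As an indication of the integration code style, a minimal Structured Streaming sketch for the Kafka leg (broker and topic names are placeholders; the job needs the spark-sql-kafka package on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("realtime").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "sensor-events")              # placeholder topic
    .load()
    .select(F.col("value").cast("string").alias("raw"))
)

# ML scoring would slot in here; results then go to the MongoDB / Hive sinks.
# A console sink keeps this sketch self-contained and runnable.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()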
I need a clear, reproducible workflow that lets GitHub Copilot sit seamlessly inside Databricks notebooks and VS Code so we can automate code generation and analysis across our data platform. The focus is on the day-to-day tasks where Copilot will save us the most time: writing and debugging Python/PySpark, generating robust SQL, transforming data with PySpark, and generating test scripts from requirements. Our team is comfortable with Python, PySpark, SQL and Git and already works in Databricks at an intermediate level, but we’ve never wired Copilot into this environment. I’m looking for someone who can: • Configure Copilot for both VS Code and the Databricks notebook experience while keeping Git history clean. • Demonstrate automated...
...Assist in resolving technical issues related to data pipelines, ETL processes, and data modeling. Optimize and maintain large-scale data systems and workflows. Support in performance tuning, debugging, and data migration activities. Offer guidance on data architecture, best practices, and real-time project execution. Technical Skills Required (any of the following): Programming: Python, SQL, PySpark ETL Tools: Apache Airflow, Talend, Informatica, or similar Cloud Platforms: AWS (Glue, Redshift, S3), Azure (Data Factory, Synapse), or GCP (BigQuery) Databases: PostgreSQL, Snowflake, MySQL, MongoDB Big Data Technologies: Hadoop, Spark, Databricks (preferred) Version Control / CI-CD: Git, Jenkins Ideal Candidate: Has 10+ years of experience in data engineering and related...
...of our payments platform and need another set of hands to move faster. The stack is PySpark running on Azure (Databricks and ADLS), with development in PyCharm. What I want to accomplish • Cleanly ingest daily files and streaming feeds into our bronze layer • Transform and enrich them for the silver/gold layers, applying business-level funding rules • Build rock-solid validation checks so exceptions are surfaced early Where I could use you most I already have the high-level pipeline design; I need focused, code-level help writing and testing each stage inside PyCharm. Environment setup is mostly there—just a few tweaks or plug-ins might be required so we stay productive. Deliverables 1. PySpark notebooks or .py modules for ingestion, p...
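To show the sort of validation I mean, a small sketch assuming illustrative column names for the funding data:

from pyspark.sql import functions as F

def validate_silver(df):
    # Surface exceptions early: null transaction keys or negative amounts.
    bad = df.filter(F.col("txn_id").isNull() | (F.col("amount") < 0))
    n_bad = bad.count()
    if n_bad > 0:
        # Quarantine rather than fail silently, so exceptions stay visible.
        bad.write.mode("append").saveAsTable("quarantine.funding_exceptions")
        raise ValueError(f"{n_bad} rows failed validation; see quarantine table")
    return df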
Work Description: Review Informatica mapping/design documents. Understand the business logic (source → transformation → target). Convert logic into clean, optimized Databricks SQL or PySpark code. Help debug existing notebooks where logic or performance issues occur. All work is performed directly inside Databricks (no external tools). Requirements: Strong hands-on experience with Databricks (SQL & PySpark). Familiarity with Informatica ETL concepts (mappings, expressions, joins, lookups, etc.). Ability to write readable, well-commented code. Available for short-term or on-demand screen-sharing support.
...Description: We are looking for a skilled Data Engineer with 6+ years of experience for a long-term project. You will work closely with our team in a screen-sharing environment, taking control when needed to deliver high-quality data engineering solutions. Key Responsibilities: Design, build, and manage data pipelines using Azure Data Factory. Develop and optimize data workflows in Databricks (PySpark or Scala). Write and maintain complex queries and stored procedures in MS SQL Server. Connect and integrate data from multiple sources (cloud and on-prem). Debug and resolve data quality and performance issues. Collaborate via daily screen-sharing sessions to ensure alignment and progress. Primary Skills Required: Strong experience with MS SQL Server (T-SQL, performance tuning, ...
...production-ize an end-to-end ETL flow that pulls data from several API-based sources, transforms it with PySpark inside AWS Glue, and lands the clean data in Snowflake. Today I already have a handful of working APIs as well as new endpoints that will be released over the coming weeks, so the job covers both consuming existing feeds and designing any additional lightweight APIs required to fill gaps. All code should live in a version-controlled repository (Git) and follow clear, modular patterns so future developers can extend it with minimal friction. Please structure the work so that: • Glue jobs are parameter-driven and reusable across sources. • Transformations are written in PySpark, taking advantage of Glue’s dynamic frames where helpful. • ...
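To anchor the parameter-driven requirement, a hedged Glue job skeleton (the job parameters, bucket and table names are invented; the Snowflake write would go through the Snowflake Spark connector, elided here):

import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Per-source parameters keep one job reusable across all the API feeds.
args = getResolvedOptions(sys.argv, ["source_name", "target_table"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Each API extract lands as JSON in a raw bucket keyed by source name.
df = spark.read.json(f"s3://raw-bucket/{args['source_name']}/")  # placeholder bucket

# ...PySpark transformations (dynamic frames where helpful)...
df.write.mode("append").saveAsTable(args["target_table"])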
...in cloud databases, data pipelines, and performance tuning. This role initially involves supporting for one month, and if the collaboration works well, it can be extended for up to six months. Key Responsibilities Provide expert-level support in Snowflake database development and optimization. Assist in designing and implementing data pipelines using Snowpipe and Snowpark. Convert existing PySpark programs to Snowpark manually or via Snowflake Migration Accelerator tools. Enable and manage data governance and security (row/column-level security, dynamic masking). Implement log analysis, history capture, and email alerts for system monitoring. Support in data warehousing, migration, and integration activities. Conduct performance tuning and real-time monitoring of Snowfla...
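For a sense of the conversion work, a hedged before/after fragment (tables, columns and connection parameters are invented):

# PySpark original:
#   df = (spark.table("sales").filter(F.col("region") == "EU")
#         .groupBy("product").agg(F.sum("amount").alias("total")))

# Snowpark equivalent; the DataFrame API is deliberately similar.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

connection_parameters = {"account": "...", "user": "...", "password": "..."}  # placeholders
session = Session.builder.configs(connection_parameters).create()

df = (
    session.table("sales")
    .filter(col("region") == "EU")
    .group_by("product")                    # note: snake_case in Snowpark
    .agg(sum_("amount").alias("total"))
)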