An efficient workflow should do just that -- flow -- directing us seamlessly from each phase of a project to the next, optimizing task management, and ultimately guiding us from business problem to solution to value. As the data deluge continues to rain down, businesses are drowning in data but starving for insight. This makes the hiring of a data science team a vital investment. But what makes up a data science team? What are the best practices for data science workflows? And what do data scientists need to execute their data science workflow to the best of their ability?
While there is no single template for solving data science problems, the OSEMN (Obtain, Scrub, Explore, Model, Interpret) data science pipeline, a popular framework introduced by data scientists Hilary Mason and Chris Wiggins in 2010, is a good place to start. Most data science workflows are variations of the OSEMN sequence of steps, built on the same established principles and sharing the common goal of enabling the rest of the organization to make better, data-driven decisions. The specific shape of a data science workflow depends entirely on the business goals and the task at hand.
The most important step in improving your data science workflow is developing best practices suited to your team’s particular needs. In doing so, you’ll want to consider the following best practices.
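To make the sequence concrete, here is a minimal sketch of the OSEMN stages chained into one pipeline. The function bodies, the hypothetical sales.csv file, and its columns are placeholders for illustration, not a prescribed implementation.

```python
# A minimal, illustrative sketch of the OSEMN stages chained into one pipeline.
# "sales.csv" and its columns (date, units, revenue) are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

def obtain() -> pd.DataFrame:
    # Obtain: pull raw data from a file, database, or API.
    return pd.read_csv("sales.csv")

def scrub(df: pd.DataFrame) -> pd.DataFrame:
    # Scrub: drop missing values and fix column types.
    return df.dropna().assign(date=lambda d: pd.to_datetime(d["date"]))

def explore(df: pd.DataFrame) -> None:
    # Explore: summary statistics and quick sanity checks.
    print(df.describe())

def model(df: pd.DataFrame) -> LinearRegression:
    # Model: fit a simple baseline model.
    return LinearRegression().fit(df[["units"]], df["revenue"])

def interpret(fitted: LinearRegression) -> None:
    # Interpret: translate model output back into business terms.
    print("Estimated revenue per unit:", fitted.coef_[0])

if __name__ == "__main__":
    data = scrub(obtain())
    explore(data)
    interpret(model(data))
```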
Data Science as a Team Sport
The initial perception of the data scientist was of one person who could magically do everything. For obvious reasons, that expectation doesn't hold up. Data science encompasses a wide variety of disciplines and roles, including software engineers, machine learning engineers, system architects, database administrators, business intelligence analysts, IT engineers, and more. Building a data science team means bringing together individuals who specialize in different areas. An effective team workflow starts with determining the kind of expertise you need and clearly defining the roles within your team.
If you’re building a prototype, you might not need a systems architect. A database administrator might not be necessary for a smaller project. A production engineer is best suited to customer-facing services. And some team members with academic backgrounds will mainly perform research that is not necessarily intended to result in a product for sale. The roles on your data science team are determined by your business goals and tasks. The data scientist is not a one-man band, and the lone do-it-all data scientist is often overvalued. Having these specialists work together toward a common goal will get you farther than having a few individuals trying to do everything themselves.
Identifying Your Business Questions
What question are you answering, and what are the business goals? A major component of data scientists’ productivity is the ability to break big problems down into smaller pieces and to focus on the business outcome you’re trying to achieve, as opposed to doing research for research’s sake. Ultimately, data science teams exist to improve a business process, increase revenue, and lower costs. Your success is determined by the ability to ask the right questions and actually solve real business problems. Identifying those questions up front sets the agenda for what you want your team to accomplish. Who is your end user? What is their problem? What are you prioritizing -- accuracy, speed, or explainability?
Embracing Open Source and the Cloud
The cost-prohibitive aspects of early data science workflows have effectively been eliminated thanks to open source data analysis solutions and cloud computing. Open source has become the predominant source of tools for data scientists. You are no longer required to build your own data center to access serious computing power. If you want to use a variety of different tools, you can now test them out and subscribe on an as-needed basis, and cloud computing provides large amounts of hardware that can be rented by the hour.
There's also generally no explicit cost for using open source libraries, which provide incredible resources and flexibility. Unlike proprietary software, an open source project can be modified to suit your needs, and building on an existing project eliminates the need to start from scratch, saving an enormous amount of time and money. Switching costs are lower as well, since there are no licensing fees tying you to a particular tool. With open source combined with cloud computing, you can evaluate what you want to use, create a prototype, test it for a period of time, determine what doesn't work, and then try something else, all at a much lower cost.
Building the Right Data Science Workflow Toolkit
The bulk of a data scientist’s time is spent understanding the business problem and communicating the results. Documenting and communicating your findings in a clear and efficient way can be one of the most challenging steps in the scientific process. Automating this process is crucial for good data science workflows and for your sanity. Some useful data science workflow tools include:
Data Science Workflows with Jupyter
Jupyter Notebook is an open source data science front end used to capture the data preparation process, consisting of notebooks that contain live code, equations, visualizations, and explanatory text. Jupyter Notebook works whether you're on a laptop, a server, or a cloud provider. The "notebook" refers to the fact that your code and its results live in the same window. As a means of communication and interactive exploration, Jupyter Notebooks have a very desirable set of properties: you can add small pieces of code at a time, see the results immediately, write notes to yourself about your data sources and conclusions alongside the code, and then send those files to other people. For these notebooks to be reproducible, however, you also need the data and all the dependencies used to produce the results, which is where Docker containers come in.
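The cell below is a sketch of the kind of code you might run in a notebook; the survey.csv file and its columns are hypothetical, and the point is that the output renders directly beneath the code, next to your written notes.

```python
# A typical Jupyter notebook cell: code, results, and notes live side by side.
# "survey.csv" and its "age" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("survey.csv")
df.describe()       # as the last expression in a cell, this summary table renders inline

# In a following cell, a quick plot appears directly under the code,
# with markdown notes on data sources and conclusions written in between:
# df["age"].hist()
```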
Data Science Workflows Using Docker Containers
With Docker, you can package all your code, and everything needed to run it, into standardized, isolated software containers that can be shared and run consistently in any environment.
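As a rough sketch of that idea, the snippet below uses the Docker SDK for Python (installed with pip install docker) to build and run a containerized analysis. It assumes a Dockerfile already exists in the project directory that installs your code and dependencies; the image tag and script name are hypothetical.

```python
# A minimal sketch using the Docker SDK for Python.
# Assumes a Dockerfile in the current directory; the tag and script are hypothetical.
import docker

client = docker.from_env()

# Build an image that packages the code and everything needed to run it.
image, build_logs = client.images.build(path=".", tag="my-analysis:latest")

# Run the same containerized analysis on a laptop, a server, or in the cloud.
output = client.containers.run("my-analysis:latest", command="python train.py", remove=True)
print(output.decode())
```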
Data Science Workflows with RAPIDS
RAPIDS is an open source suite of GPU-accelerated machine learning and data analytics libraries deployed on NVIDIA GPU platforms. RAPIDS is ideal for teams that are solving larger-scale problems, need millisecond response times, or execute large volumes of repeated computation.
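The sketch below shows the pandas-like and scikit-learn-like feel of the RAPIDS libraries cuDF and cuML; it requires an NVIDIA GPU with RAPIDS installed, and the file name and columns are hypothetical.

```python
# A minimal RAPIDS sketch; requires an NVIDIA GPU with the RAPIDS libraries installed.
# "transactions.csv" and its columns (day, units, amount) are hypothetical placeholders.
import cudf
from cuml.linear_model import LinearRegression

# cuDF mirrors the pandas API but executes on the GPU.
gdf = cudf.read_csv("transactions.csv").dropna()

# Group-by aggregation runs on the GPU as well.
daily = gdf.groupby("day").agg({"amount": "sum"})

# cuML mirrors scikit-learn's estimator API.
model = LinearRegression()
model.fit(gdf[["units"]], gdf["amount"])
```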
Data Science Workflows with Amazon Web Services
Amazon Web Services offers a suite of data science tools well suited to machine learning workflows, letting you orchestrate and automate sequences of machine learning tasks from data collection and transformation through deployment. Use Amazon Athena to perform queries, aggregate and prepare data in AWS Glue, train models on Amazon SageMaker, and deploy them to the production environment. These data science workflows can be shared between data engineers and data scientists.
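As a hedged sketch of that flow (the AWS Glue preparation step is omitted), the snippet below queries data with Athena via boto3, then trains and deploys a model with the SageMaker Python SDK. The database name, S3 paths, IAM role, and training script are hypothetical placeholders.

```python
# A sketch of the Athena-to-SageMaker flow using boto3 and the SageMaker Python SDK.
# All names, paths, and the IAM role below are hypothetical placeholders.
import boto3
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Query the raw data with Amazon Athena, writing results to S3.
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM events WHERE event_date >= date '2024-01-01'",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Train a scikit-learn model on Amazon SageMaker with a user-supplied train.py script.
estimator = SKLearn(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.2-1",
    sagemaker_session=sagemaker.Session(),
)
estimator.fit({"train": "s3://my-bucket/prepared-data/"})

# Deploy the trained model to a real-time production endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```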