We create cutting-edge machine learning, artificial intelligence and data visualization tools. We then use the insights they generate to markedly increase your bottom line, and give you an edge over your competition.
We work with clients ranging from early-stage start-ups to large organizations.
“Any sufficiently advanced technology is indistinguishable from magic.” - Arthur C. Clarke
There appears to be a great deal of wizardry involved in artificial intelligence, especially when building applications. But the algorithms are just new tools, combining large amounts of data and computational power. The true challenge is that the development process is closer to applied research than to software engineering. You need workflows that enable fast iteration and experimentation, smart risk mitigation, and tight feedback loops that include business experts. You need to monitor your models in production, and update them when new data becomes available and as concepts drift over time. Otherwise, you are just building a cargo cult.
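To make that last point concrete, here is a minimal sketch of one way to watch for drift in production: compare a recent window of a numeric feature against its training-time distribution. The data, the test, and the alert threshold are illustrative assumptions, not a universal recipe.

```python
# A minimal drift check: compare a recent production window of one numeric
# feature against its training-time reference distribution with a
# two-sample Kolmogorov-Smirnov test. Data and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, current: np.ndarray, alpha: float = 0.01) -> bool:
    """Return True if the current window looks significantly different."""
    _statistic, p_value = ks_2samp(reference, current)
    return p_value < alpha

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)  # snapshot taken at training time
current = rng.normal(loc=0.4, scale=1.0, size=2_000)     # recent production data

if drifted(reference, current):
    print("Feature distribution has drifted - consider investigating or retraining.")
```

In practice you would run a check like this per feature and per prediction window, and treat alerts as a prompt for investigation rather than an automatic retraining trigger.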
Not-knowing is an integral part of the journey. In "classical" software engineering you likely have a good idea of what will work in principle, and what won't. In data science you start with a hypothesis, but feasibility is not guaranteed. More often than not, you will revise your plans based on what you learned from the initial data exploration.
We are experienced researchers, rigorously trained to check all explicit and implicit assumptions. We translate your business problem into a set of technical problems, and define how to measure progress towards a solution. We design small experiments the right way, and rapidly iterate based on the results. A clear understanding of what to predict and which metrics to meet allows us to break down any large project into sensible milestones, define clear decision points, and create a realistic timeline.
Machine learning algorithms perform tasks by learning from patterns in data, rather than using hand-coded rules. Accordingly, your models are only as good as the data you trained them with. You need a sufficient amount of data for those algorithms to pick up enough signal. At the same time, you might only need a "large enough" sample to train a model that works. You also have to understand the processes generating your data extremely well. There is a twist to "measure twice and cut once": you should carefully evaluate what was measured in the first place, and how.
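One way to check whether your sample is already "large enough" is to look at a learning curve for a simple baseline: if validation performance has flattened, more of the same data buys little. A minimal sketch, assuming scikit-learn and one of its bundled datasets as a stand-in for your own data:

```python
# Sketch: estimate whether more data would still help by computing a learning
# curve for a simple baseline. The bundled dataset stands in for your own.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))

train_sizes, _train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy"
)

for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size:4d} samples -> cross-validated accuracy {score:.3f}")
# If the curve has flattened, collecting more of the same data buys little.
```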
Many data science efforts go astray when it becomes obvious that the available data or metadata is insufficient, or that assumptions about the data-generating process were never checked. This is why data science competition platforms are of limited value for learning how to build real-world solutions. All the data is provided, sometimes even cleaned and ready to use, along with some sensible business metric to optimize. Exactly none of this happens for the tricky problems you encounter in the wild.
Can you collect the right data? Will labeling enough ground-truth data for training be time-consuming or expensive? If you buy data from a vendor, how do you check its usefulness and reliability? Can you use open-source data, say census or geospatial? How much data do you really need for a task, say natural language generation? Does simply collecting more data give an advantage over using a smarter algorithm? We can help you answer those questions, and show you how to create valuable data assets in a fraction of the time and cost. Getting this right avoids running into dead ends, and allows you to quickly iterate on your ideas. It also saves hundreds of hours of expensive subject-matter experts' time annotating data.
Machine learning and artificial intelligence are just starting to be used by non-specialized companies. At the same time, while storage and compute infrastructure are largely commoditized, data science tooling is still in its infancy. The vendor landscape is rapidly evolving when it comes to all aspects of the data science workflow, be it governance, documenting training metrics, or serving models in production. There are no agreed-upon design patterns for machine learning tools. Best practices that actually work are usually far from universal.
This state of affairs introduces risk. What happens if a fad does not last, that great API or service suddenly disappears, or that SaaS tool goes down? Do you still have a working platform to run your business on? You should avoid duct-taping third-party services together - or you will end up with fragile, hard-to-maintain systems. In addition, there is often a mismatch between your problem and the one solved by off-the-shelf software. Successful software vendors, in the end, found use cases that are either common or lucrative enough. Large tech companies also often solve problems at a much larger scale, introducing needless complexity into their infrastructure. There is great value in building your own processes and tools, and having this edge over your competition.
Complexity is also an issue from a computational perspective. While we enjoy applying the latest algorithms and testing the newest frameworks, if a linear regression or a tree-based model is good enough, then that's what we will use. We habitually build simple models as both benchmark and sanity check anyway. Hardware aspects matter too. Large amounts of memory and hard drive space are cheap compared to even a decade ago. If all of your data fits on a fast drive, or if it's possible to take a "large enough" sample to work in memory, there might be no need for distributed computation frameworks and clusters. And minimal systems mean shorter feedback loops and quicker iterations.
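A minimal sketch of what that benchmarking habit looks like in practice, assuming scikit-learn and a bundled dataset as a placeholder for your own problem; the models and metric are illustrative choices:

```python
# Sketch: simple baselines first. A regularized linear model and a tree-based
# model, compared with cross-validation; dataset, models, and metric are
# illustrative placeholders.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)

baselines = {
    "ridge regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "gradient boosted trees": GradientBoostingRegressor(random_state=0),
}

for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name:24s} R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
# Anything more complex has to clearly beat these numbers to earn its keep.
```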
We clearly communicate what works, and which aspects are still uncertain. We want you to know where opportunities lie in the near future, so we create a log of ideas and results as we experiment. We want you to understand when and why models will begin to fail, so you recognize it when it happens in production.
Businesses need interpretable models. For some industries, like finance, this is a necessity due to regulations. In any case, it helps to validate the model and build trust. While there are approaches to uncover the black-box behavior of neural nets, for high-stakes decisions you should try not to rely on such models of models. We design everything from the ground up to be interpretable, wherever possible. And once we have a good model, we can simulate data from it - which is often more informative than test statistics with implicit assumptions.
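A small sketch of what we mean by simulating data from a model: fit a simple model, draw synthetic outcomes from it, and check whether the observed data looks like something the model could have produced. The linear data-generating process and the summary statistic below are illustrative assumptions.

```python
# Sketch: simulate outcomes from a fitted model and compare them with what was
# actually observed. The linear data-generating process and the chosen summary
# statistic (the outcome's spread) are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.8, size=500)

model = LinearRegression().fit(X, y)
residual_scale = np.std(y - model.predict(X))

# Draw 200 synthetic datasets under the fitted model.
simulated = model.predict(X) + rng.normal(scale=residual_scale, size=(200, 500))
simulated_stds = simulated.std(axis=1)

print(f"observed outcome std : {y.std():.2f}")
print(f"simulated outcome std: {np.percentile(simulated_stds, 5):.2f} "
      f"to {np.percentile(simulated_stds, 95):.2f} (middle 90%)")
# If observed statistics fall far outside the simulated range, the model is
# missing something a single test-set metric would not reveal.
```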
Finally, interface design and the visual presentation of data are crucial, not just a side effect. We often build custom user interfaces for ourselves during data exploration, interactively visualizing millions of data points. This is immensely useful for spotting anomalies, and for understanding how a model works. We can help you design tools that allow interactive experimentation, and reports that update themselves as new data becomes available.
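As a rough illustration of plotting millions of points without ending up with an unreadable blob, a density view such as matplotlib's hexbin is one simple, static option; the synthetic data and the small anomalous cluster below are made up for the example.

```python
# Sketch: a density view scales to millions of points where a raw scatter plot
# turns into a blob. matplotlib's hexbin is one simple, static option; the
# synthetic data and the small anomalous cluster are illustrative.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.7, size=n)
# A small off-distribution cluster that a downsampled scatter plot could miss.
x[:2_000] += 4.0
y[:2_000] -= 3.0

fig, ax = plt.subplots(figsize=(6, 5))
hb = ax.hexbin(x, y, gridsize=200, bins="log", cmap="viridis")
fig.colorbar(hb, ax=ax, label="log10(count)")
ax.set_xlabel("feature x")
ax.set_ylabel("feature y")
plt.show()
```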