• Under an open-source license, it is accessible to data scientists and beginners alike.
• It notably accelerates the feature engineering process, which is essential for machine learning.
In the early 2000s, Orange Research Engineer Marc Boullé launched a project to design a machine learning solution dedicated to mining and utilizing large multi-table databases. Since then, this tool—called Khiops—has continued to evolve and increase its potential to better fulfill its purpose: to offer a fully automated, intuitive machine learning experience that emphasizes the model’s interpretability.
Automation, Simplicity, Speed
Recently, Khiops reached a strategic milestone with the move to an open-source license, giving it wider accessibility and a larger audience.
Before exploring this major transition, it is worth revisiting the tool’s strengths. For a machine learning model to perform properly, it needs a data preparation process that is often lengthy, laborious and expensive. This is where Khiops comes in, “by automating and accelerating operations that were previously done manually, such as feature engineering, in other words, the process of transforming raw data into usable variables,” explained Alexis Bondu, AI Researcher at Orange. “Where data experts could spend long days on preparation, cleansing, aggregation and more for masses of data, Khiops removes these steps so they can focus their ideas on the business issue at hand.”
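To make the manual work described above concrete, here is a minimal sketch of hand-written feature engineering on multi-table data: a secondary table with several rows per customer is flattened into one row of variables per customer. This is not Khiops code and does not show how Khiops works internally. The table contents, the column names and the `aggregate_purchases` helper are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical secondary table: one row per purchase, several rows per customer.
purchases = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c1", "amount": 35.0},
    {"customer_id": "c2", "amount": 5.0},
]


def aggregate_purchases(rows):
    """Flatten purchase rows into one dict of features per customer."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row["customer_id"]].append(row["amount"])
    # Each aggregate below is a feature a data scientist would write by hand.
    return {
        customer_id: {
            "purchase_count": len(values),
            "total_amount": sum(values),
            "mean_amount": mean(values),
            "max_amount": max(values),
        }
        for customer_id, values in amounts.items()
    }


features = aggregate_purchases(purchases)
print(features["c1"])
```

Every additional aggregate, time window or secondary table multiplies the number of candidate variables to write and evaluate by hand, which is the kind of effort the article says Khiops automates.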
A Unique Approach
Diving into the details, Khiops is based on an original mathematical formalism and an approach free from hyperparameters. This offers a very tangible advantage: protection against overfitting (where the model learns the training data by rote), which is detrimental to the model’s performance. It also avoids highly time-consuming repeated cycles of trial and error, which saves processing time.
Another aspect that sets the tool apart is how it enhances interpretability: Each decision and result produced can be explained transparently, allowing Khiops to forestall any black box effect.
Expanded Reach with Open Source
Until now, a community within the Orange group had near-exclusive access to the first versions of Khiops. Around 1,000 users were won over by the solution’s performance, which was tested before work on accessibility, standardization and documentation was completed. At that time, getting to grips with the solution required a certain level of expertise and prior training time. The tool’s migration to the open-source ecosystem, which was launched at the end of 2022, has removed the barriers to its wider usage. Khiops has been standardized by adopting recognized conventions from the open-source world, such as Python virtual environments, scikit-learn (sklearn) syntax and program installation via conda. In tandem, emphasis was placed on making the tool easier to adopt and learn, with accessible technical documentation, guidelines, explanatory notebooks and more made freely available on the website at khiops.org.
Constant Evolution
Luc-Aurélien Gauthier, Data Scientist and Khiops Project Coordinator, explained that, “for a young user, the prospect of working on an open, documented and scalable solution is more attractive than grappling with a cumbersome proprietary tool.”
The dedicated Khiops website boasts many informational resources and will be expanded in the coming months, for example, with the addition of materials relating to new advanced use cases. The Khiops tool itself is also continuing to evolve — v11 is expected to arrive at some point during 2024 and will introduce major changes such as support for dealing with text data and new visualization tools.
Open-source license: a license that allows anyone to access, modify and redistribute software code.
Feature engineering: the pre-processing of raw data so that it can be used as machine learning data.
Hyperparameters: parameters that are manually defined to control the model’s learning process, as opposed to other parameters whose value is derived from this learning.