AI Evaluation, Done Right.
Ensure compliance, accuracy, and reproducibility with ailusive tools.
Designed for people and backed by research.
Industry recognition
Top 5
Selected startups
70+
Total applicants
We were selected as one of the top 5 startups among 70+ applicants to join the ZOLLHOF Tech Incubator, a recognition of our innovative approach to AI evaluation and compliance solutions.


Expertise you can trust
Our team of seven brings 50+ years of combined expertise in human-centered evaluation of NLP systems.
We contribute this expertise to the development of AI Act-mandated standards for accurate and transparent evaluation.
Moreover, we have extensive industry experience in production coding and in building scalable, deployable software solutions.
What we offer
We believe AI evaluation should be accurate, transparent, and effortless.
The problem we solve
AI compliance is complex and evolving.
Companies struggle to assess AI systems correctly.
Human-centered evaluation is crucial, but current solutions fail to integrate it effectively.
How we solve it
Effortless Evaluation: No NLP expertise needed.
Human-Centric + Automated: Best of both worlds.
AI Act & ISO/IEC ready: Compliance made easy.

That’s why we built LET: The LLM Evaluation Tool that integrates the best practices of human evaluation and NLP compliance—so you don’t have to.
Latest Publications
We actively contribute as experts to the research community.
Which Method(s) to Pick when Evaluating Large Language Models with Humans? – A comparison of 6 methods
Human evaluations are considered the gold standard for assessing the quality of NLP systems, including large language models (LLMs), yet there is little research on how different evaluation methods impact results. This study compares six commonly used evaluation methods – four quantitative (Direct Quality Estimation, Best-Worst Scaling, AB Testing, Agreement with Quality Criterion) and two qualitative (spoken and written feedback) – to examine their influence on ranking texts generated by four LLMs…

Designing Usable Interfaces for Human Evaluation of LLM-Generated Texts: UX Challenges and Solutions
Human evaluations remain important for assessing large language models (LLMs) due to the limitations of automated metrics. However, flawed methodologies and poor user interface (UI) design can compromise the validity and reliability of such evaluations. This study investigates usability challenges and proposes solutions for UI design in evaluating LLM-generated texts. By comparing common evaluation methods, insights were gained into UX challenges, including inefficient information transfer and poor visibility of evaluation materials…
Pre-Launch Initiative
We are developing an innovative AI evaluation tool that integrates human expertise and international standards into a seamless, automated framework.
Designed for companies that want to get AI evaluation right—without needing deep NLP expertise.