keinplankarriere

Technical deep-dive

How the matching engine works

Fuzzy deduplication, skill & experience extraction, and hybrid scoring — under the hood.

01 · Deduplication
02 · Skill extraction
03 · Experience extraction
04 · Hybrid scoring

~5 minute walkthrough

Where everything fits

The pipeline

Ingestion (per scraped job)

Scrape (4 boards, with full descriptions) → Extract skills (taxonomy + regex) → Deduplicate (fuzzy match across sources) → Store (SQLite, one row per job)

Matching (scoring pass)

Candidate (experience base → skills & preferences) + Job (title · description · extracted skills) → Rule score (0–100, instant & free) → LLM refine (top-N only, + explanation)

1 · Fuzzy deduplication

Same job, four boards

The problem. One role gets posted on LinkedIn, StepStone, Xing and Arbeitsagentur, which produces four near-identical rows.

Normalize first. Lower-case; strip gender markers (m/w/d), company suffixes (GmbH, AG, SE) and punctuation.
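The normalization step could look roughly like the sketch below. The regular expressions and the function names (`normalize_title`, `normalize_company`) are assumptions for illustration; the real marker and suffix lists are likely longer.

```python
import re

# Assumed patterns: the real marker and suffix lists are probably longer.
GENDER_MARKER = re.compile(r"\(\s*[mwfdx]\s*(?:/\s*[mwfdx]\s*)+\)", re.IGNORECASE)
COMPANY_SUFFIX = re.compile(r"\b(?:gmbh|ag|se)\b", re.IGNORECASE)
PUNCTUATION = re.compile(r"[^\w\s]")


def _clean(text: str) -> str:
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(PUNCTUATION.sub(" ", text).split())


def normalize_title(title: str) -> str:
    """Lower-case a job title and drop gender markers such as (m/w/d)."""
    return _clean(GENDER_MARKER.sub(" ", title.lower()))


def normalize_company(company: str) -> str:
    """Lower-case a company name and drop legal-form suffixes."""
    return _clean(COMPANY_SUFFIX.sub(" ", company.lower()))
```

With this sketch, "Senior Python Developer (m/w/d)" and "senior python developer" normalize to the same string, so board-specific formatting no longer lowers the similarity ratio.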

Compare with difflib. SequenceMatcher ratio on title + company — Python standard library, zero dependencies, fully deterministic.

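from difflib import SequenceMatcher

def ratio(x, y):
    # similarity in [0, 1], standard library only
    return SequenceMatcher(None, x, y).ratio()

# on normalized title + company
title_sim   = ratio(a.title,   b.title)
company_sim = ratio(a.company, b.company)

is_dup = title_sim >= 0.82 and company_sim >= 0.80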

Confidence = 0.6 · title + 0.4 · company. Only cross-source pairs merge into one row (recorded in also_on); a board updating its own posting is handled by the job-id upsert.
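A minimal sketch of the confidence formula and the cross-source rule, assuming a hypothetical `Job` record with `source`, `title`, `company` and `also_on` fields. Only the thresholds, the 0.6 / 0.4 weights and the `also_on` name come from the slide; everything else is illustrative.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional

TITLE_THRESHOLD = 0.82
COMPANY_THRESHOLD = 0.80


@dataclass
class Job:
    """Hypothetical record; title and company are assumed already normalized."""
    source: str
    title: str
    company: str
    also_on: List[str] = field(default_factory=list)


def ratio(x: str, y: str) -> float:
    return SequenceMatcher(None, x, y).ratio()


def duplicate_confidence(a: Job, b: Job) -> Optional[float]:
    """Blended confidence if a and b are cross-source duplicates, else None."""
    if a.source == b.source:
        return None  # same board: left to the job-id upsert
    title_sim = ratio(a.title, b.title)
    company_sim = ratio(a.company, b.company)
    if title_sim < TITLE_THRESHOLD or company_sim < COMPANY_THRESHOLD:
        return None
    return 0.6 * title_sim + 0.4 * company_sim


def merge(kept: Job, duplicate: Job) -> None:
    """Keep one row and record the other board in also_on."""
    if duplicate.source not in kept.also_on:
        kept.also_on.append(duplicate.source)
```

Because both thresholds must pass before the weights are applied, the confidence of any merged pair is always at least 0.6 · 0.82 + 0.4 · 0.80 = 0.812.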

2 · Skill extraction

From description to tags

Curated taxonomy. ~70 canonical skills, each with aliases — React / ReactJS / React.js all map to “React”. English + German.

Word-boundary regex. One pre-compiled pattern per skill, so “Java” ≠ “JavaScript” and “Go” ≠ “Google”.

Runs at ingestion. Scans the title + the full scraped description, then normalizes the result onto the job.

“…REST APIs in Python and Django, deployed on AWS with Docker…”

↓ extract

Python · Django · REST · AWS · Docker
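A small sketch of the taxonomy plus word-boundary matching, using a handful of illustrative entries instead of the ~70 real ones. The `TAXONOMY` contents and the function names are assumptions.

```python
import re

# Illustrative subset: canonical skill -> aliases (assumed, not the real taxonomy).
TAXONOMY = {
    "React": ["react", "reactjs", "react.js"],
    "Python": ["python"],
    "Django": ["django"],
    "REST": ["rest", "restful"],
    "AWS": ["aws", "amazon web services"],
    "Docker": ["docker"],
    "Java": ["java"],
    "JavaScript": ["javascript"],
    "Go": ["go", "golang"],
}


def compile_skill(aliases):
    """One pattern per skill: any alias, bounded by word boundaries."""
    # Longest alias first so "react.js" is tried before "react".
    ordered = sorted(aliases, key=len, reverse=True)
    alternatives = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


# Compiled once at import time, reused for every job.
PATTERNS = {skill: compile_skill(aliases) for skill, aliases in TAXONOMY.items()}


def extract_skills(text):
    """Return canonical skill names found in text, in taxonomy order."""
    return [skill for skill, pattern in PATTERNS.items() if pattern.search(text)]


print(extract_skills("…REST APIs in Python and Django, deployed on AWS with Docker…"))
```

The `\b` anchors are what keep "go" from matching inside "Django" or "Google". Case-insensitive matching of short names such as "Go" or "REST" can still hit ordinary words, so a real taxonomy may need stricter rules for those entries.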

3 · Experience extraction

Your CV → structured experience

01 · Input: CV PDF (pypdf) or pasted LaTeX / text
02 · LLM parse: → JSON: title, org, stack, dates, summary, tags
03 · Review: human-in-the-loop popup before save
04 · Store: experience base drives matching

Two ways in

Upload a CV → AI parses every role & project
Add manually → AI infers the type, tags & stack
You review and edit before anything is saved

Made robust

Non-reasoning model — emits JSON, not “thinking”
Tolerant JSON extraction + schema validation
Self-healing: auto-picks an available model
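"Tolerant JSON extraction + schema validation" could be sketched as follows. The field names come from the parse step; the expected types in `REQUIRED_FIELDS` and both function names are assumptions.

```python
import json

# Field names from the parse step; the types are assumed for illustration.
REQUIRED_FIELDS = {
    "title": str,
    "org": str,
    "stack": list,
    "dates": str,
    "summary": str,
    "tags": list,
}


def extract_json(reply: str) -> dict:
    """Return the first decodable JSON object in reply, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(reply, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace, try the next one
        start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in model reply")


def validate(entry: dict) -> dict:
    """Reject entries with missing or wrongly typed fields."""
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(entry.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return entry
```

Scanning from each opening brace lets the parser survive a model that wraps its JSON in a greeting or a trailing remark, while validation stops malformed entries before they reach the review popup.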

4 · Hybrid scoring

Deterministic first, LLM where it counts

Layer 1 — rule score · every job · instant · free

Skills 45 · Role 20 · Location 10 · Remote 10 · Seniority 10 · Salary 5 (weights sum to 100)

Skills = the share of the job’s requirements the candidate has → more experience only ever helps.
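A sketch of how the weights might combine, assuming each criterion is first reduced to a value between 0 and 1. Only the weights and the definition of the skills share come from the slide; the other component calculations, the handling of jobs with no extracted skills, and the function names are assumptions.

```python
WEIGHTS = {
    "skills": 45,
    "role": 20,
    "location": 10,
    "remote": 10,
    "seniority": 10,
    "salary": 5,
}


def skill_coverage(job_skills, candidate_skills):
    """Share of the job's required skills that the candidate has (0..1)."""
    required = set(job_skills)
    if not required:
        return 0.0  # assumption: no extracted requirements earns no skill points
    return len(required & set(candidate_skills)) / len(required)


def rule_score(components):
    """Weighted sum of per-criterion values in 0..1, rounded to 0-100."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(max(components.get(name, 0.0), 0.0), 1.0)  # clamp to 0..1
        total += weight * value
    return round(total)


coverage = skill_coverage(
    ["Python", "Django", "AWS", "Docker"],
    ["Python", "Django", "AWS", "React"],
)
print(rule_score({"skills": coverage, "role": 1.0, "location": 1.0, "seniority": 1.0}))
```

Since the denominator is the job's requirement count, extra candidate skills (here "React") leave the share unchanged and missing ones lower it, so adding experience can never reduce the score.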

Layer 2 — LLM refinement

only the top-N rule candidates
re-scored 0–100, grounded in the real experiences
returns a written explanation
rate-limited (1.5 s pause between calls, plus back-off on HTTP 429)
falls back to the rule score on any failure
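The Layer 2 behaviour listed above might be sketched like this. The LLM call is passed in as a plain function (`llm_score`, hypothetical), `RateLimited` stands in for an HTTP 429 response, and the default `top_n`, retry count and doubling back-off are assumptions; only the 1.5 s pause and the fall-back rule come from the slide.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the LLM API."""


def refine_top_n(scored_jobs, llm_score, top_n=10, pause=1.5,
                 max_retries=3, sleep=time.sleep):
    """Re-score the best rule candidates with an LLM.

    scored_jobs: list of (job_id, rule_score) pairs.
    llm_score:   callable job_id -> (score, explanation); may raise.
    Returns {job_id: (score, explanation_or_None)}.
    """
    ranked = sorted(scored_jobs, key=lambda item: item[1], reverse=True)
    # Start from the rule score, so any failure leaves it in place.
    results = {job_id: (score, None) for job_id, score in ranked}

    for job_id, _rule in ranked[:top_n]:
        for attempt in range(max_retries):
            try:
                score, explanation = llm_score(job_id)
                results[job_id] = (max(0, min(100, int(score))), explanation)
                break
            except RateLimited:
                sleep(pause * 2 ** (attempt + 1))  # assumed doubling back-off
            except Exception:
                break  # any other failure: keep the rule score
        sleep(pause)  # fixed pause between jobs
    return results
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.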

Plus a separate call that ranks which experiences to emphasize per job.

Design principles

What ties it together

Deterministic first

Dedup, skills and the base score need no LLM — fast, free, reproducible.

LLM where it judges

Reserved for parsing CVs, refining the top matches and explaining them.

Always grounded

Scores and CVs come from real, reviewed experience — no fabrication.

Resilient by design

Standard-library core, fallbacks everywhere, self-healing model choice.
