SkillNet: npm for agent skills? Reading a paper that scratches a real itch
I have been spending a lot of time lately building agents that do things. Not chatbots, actual agents: ones that browse, code, search, analyze, plan. And the thing I keep running into is that every agent I build is basically artisanal. Every capability gets wired in by hand. Every new skill is a bespoke integration. There is no registry, no dependency resolution, no way to say "this agent can do X, and X composes well with Y."
So when I came across this paper from Zhejiang University (with folks from Alibaba, Tencent, and like 16 other institutions), it caught my attention. They call it SkillNet, and the core idea is treating AI agent skills as first-class, composable units. Think npm or pip, but for agent capabilities. They have curated over 150,000 skills and claim 40% higher average rewards and 30% fewer execution steps across three benchmarks.
I do not know if those numbers will hold up broadly. But the framing resonated with something I keep bumping into in my own work, so I wanted to think through what they are proposing and where I think the real opportunity is.
Fair warning: nothing here is a rigorous critique. I read the paper once or twice, played around with the ideas, and wrote down what I thought. No scientific methodology, no systematic comparison with other approaches. Just one person's reaction after sitting with it for a bit.
The problem they are trying to solve
Here is the thing that hit me when reading the paper. Consider how software engineering solved the dependency problem. Back in 2010, if you wanted to build a web application, you manually downloaded libraries, tracked versions in a spreadsheet, and hoped nothing broke when you updated one piece. Then npm, pip, and cargo showed up. Not because they made code better, but because they made code composable. You could declare dependencies, resolve conflicts, and share verified packages with a community.
AI agents today are roughly where software was before package managers. Every framework (LangChain, CrewAI, AutoGen, OpenClaw) defines its own ad hoc approach to skills. Some call them "tools," others "actions" or "plugins." There is no shared vocabulary for what a skill is, no standard way to evaluate whether a skill actually works, and no graph structure that tells you which skills compose together and which ones conflict.
I feel this in my own agent work constantly. I built a web search capability for one agent. Then I needed similar search for another agent in a different context. Did I reuse it? No, I rebuilt it because the first one was tangled up with that specific agent's context handling. Every team building agents is doing this same dance.
How SkillNet works (what I understood)
SkillNet proposes a three-layer ontology that looks a lot like how package ecosystems work in traditional software. I want to be honest about which parts I fully grasped and which ones I am still sitting with.
Skill taxonomy
The first layer is a hierarchical classification across ten domains: Development, AIGC, Research, Science, Business, Testing, Productivity, Security, Lifestyle, and Other. Each domain has fine-grained semantic tags. This is the equivalent of npm categories or PyPI classifiers. Straightforward stuff, I get why it is needed.
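To build intuition, the taxonomy layer maps onto classifier-style strings. The ten domains below are the ones the paper lists; the fine-grained tag format and the helper are my own invention, just a sketch of how I imagine it working:

```python
# The ten domains come from the paper; the "::" tag convention is my
# invention, borrowed from PyPI trove classifiers.
DOMAINS = {
    "Development", "AIGC", "Research", "Science", "Business",
    "Testing", "Productivity", "Security", "Lifestyle", "Other",
}

def domain_of(tag: str) -> str:
    """Pull the top-level domain out of a hierarchical classifier tag."""
    domain = tag.split("::")[0].strip()
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain}")
    return domain

print(domain_of("Research :: Literature :: PDF Extraction"))  # Research
```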
Skill relation graph
This is the layer that got me excited. Each skill has typed relationships to other skills:
- similar_to: functionally equivalent skills, for redundancy detection
- belong_to: hierarchical sub-component relationships
- compose_with: skills frequently co-invoked in workflows
- depend_on: prerequisite dependencies that must be satisfied first
The compose_with and depend_on relations are where I think the real value is. If you know that data_cleaning composes well with statistical_analysis and both depend on csv_parser, you can automatically assemble multi-step workflows instead of manually wiring them. This is dependency resolution for agent capabilities.
I keep thinking about this in terms of my own agents. Right now, when I want my research agent to do something new, I have to manually figure out which tools it needs and in what order. What if I could just say "add research_paper_analysis capability" and the system figured out that it also needs pdf_extraction and citation_parsing?
Skill package library
The third layer bundles skills into task-oriented packages for deployment. Think meta-packages in Linux distributions. Curated collections that solve a specific class of problems.
Honestly, I am less clear on how this layer works in practice. The paper describes it at a conceptual level, but the boundary between "a skill" and "a package of skills" feels fuzzy to me. When does a collection of composed skills become a package versus just a workflow? I need to think about this more.
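To make the fuzziness concrete for myself: one way to draw the line is that a package pins its member skills and carries its own version and metadata, while a workflow is just an execution order. This is purely my speculation, and every name in it is made up, not from the paper:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a "package" as a curated, versioned bundle of
# skills, as opposed to a workflow, which is just an ordered list.
# The names and structure are my invention, not SkillNet's.
@dataclass
class SkillPackage:
    name: str
    version: str
    description: str
    skills: dict[str, str] = field(default_factory=dict)  # skill name -> pinned version

    def manifest(self) -> list[str]:
        """Flat 'name@version' listing, npm-style."""
        return [f"{n}@{v}" for n, v in sorted(self.skills.items())]

research_pkg = SkillPackage(
    name="research-toolkit",
    version="1.0.0",
    description="Skills for literature analysis tasks",
    skills={"pdf_extraction": "2.1.0", "citation_parsing": "1.4.2"},
)
print(research_pkg.manifest())  # ['citation_parsing@1.4.2', 'pdf_extraction@2.1.0']
```

Under that framing a package is what you evaluate and distribute, and a workflow is what you get after running dependency resolution over it. Which may or may not be how the authors see it.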
What I do not fully understand yet
I want to flag a few things where my understanding is still forming.
The paper talks about a skill creation pipeline that ingests execution trajectories, GitHub repos, documents, and natural language prompts to generate skills. The execution trajectory part makes sense to me: watch an agent do something, extract the pattern, formalize it as a skill. But the "create a skill from a natural language prompt" part feels hand-wavy. How do you go from "I want a skill that analyzes sentiment" to a tested, evaluated, dependency-aware skill artifact? The paper does not go deep enough on this for me to understand the mechanism.
The evaluation pipeline uses LLM-based scoring across five dimensions: Safety, Completeness, Executability, Maintainability, and Cost-awareness. I get the dimensions, they make sense. But using an LLM to evaluate the output of another LLM feels circular to me. How do you validate the validator? The paper acknowledges adversarial robustness as a limitation, but I think the problem goes deeper than adversarial inputs. Even well-intentioned skills might get evaluated poorly by an LLM that does not understand the domain context. Or evaluated well when they should not be. I do not have a solution here, just a nagging feeling that this needs more work.
The Discovery, Activation, Execution phases are described at a high level, and I need to spend more time with the actual CLI toolkit to understand how these translate to concrete operations. The paper gives the conceptual model, but I learn better from code.
Sketching out what this might look like in code
So the paper describes the architecture conceptually and provides a Python CLI toolkit, but I found myself wanting to understand the primitives by trying to build them. None of this is from the paper. It is me thinking through the problem by sketching code.
A skill, at minimum, needs to declare its interface, dependencies, and metadata:
```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable


class SkillRelation(Enum):
    SIMILAR_TO = "similar_to"
    BELONG_TO = "belong_to"
    COMPOSE_WITH = "compose_with"
    DEPEND_ON = "depend_on"


@dataclass
class SkillSpec:
    name: str
    version: str
    domain: str
    description: str
    entrypoint: Callable[..., Any]
    input_schema: dict
    output_schema: dict
    relations: dict[SkillRelation, list[str]] = field(default_factory=dict)
    eval_scores: dict[str, float] = field(default_factory=dict)
```
Then a registry that resolves dependencies and discovers compositions. This is the part that maps most directly to what SkillNet is doing with the relation graph:
```python
class SkillRegistry:
    def __init__(self):
        self._skills: dict[str, SkillSpec] = {}
        self._relation_index: dict[SkillRelation, dict[str, set[str]]] = {
            r: {} for r in SkillRelation
        }

    def register(self, skill: SkillSpec) -> None:
        self._skills[skill.name] = skill
        for rel_type, targets in skill.relations.items():
            idx = self._relation_index[rel_type]
            idx.setdefault(skill.name, set()).update(targets)

    def resolve_dependencies(self, skill_name: str) -> list[str]:
        """Topological sort of depend_on relations."""
        visited, order = set(), []

        def _visit(name: str):
            if name in visited:
                return
            visited.add(name)
            deps = self._relation_index[SkillRelation.DEPEND_ON].get(name, set())
            for dep in deps:
                _visit(dep)
            order.append(name)

        _visit(skill_name)
        return order

    def discover_workflow(self, skill_name: str) -> list[str]:
        """Find skills that compose well, with deps resolved."""
        composable = self._relation_index[SkillRelation.COMPOSE_WITH].get(
            skill_name, set()
        )
        all_skills = {skill_name} | composable
        full_chain = []
        for s in all_skills:
            for dep in self.resolve_dependencies(s):
                if dep not in full_chain:
                    full_chain.append(dep)
        return full_chain
```
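As a standalone sanity check, the depend_on resolution is just a post-order walk over a graph, which you can see with the made-up skill names from earlier in the post (csv_parser, data_cleaning, statistical_analysis):

```python
# Same post-order walk as resolve_dependencies above, over a plain dict
# graph so it runs standalone. Skill names are my made-up examples.
DEPEND_ON = {
    "statistical_analysis": {"csv_parser"},
    "data_cleaning": {"csv_parser"},
    "csv_parser": set(),
}

def resolve(name: str, visited=None, order=None) -> list[str]:
    visited = set() if visited is None else visited
    order = [] if order is None else order
    if name in visited:
        return order
    visited.add(name)
    for dep in sorted(DEPEND_ON.get(name, set())):
        resolve(dep, visited, order)
    order.append(name)  # deps land before the skill that needs them
    return order

print(resolve("statistical_analysis"))  # ['csv_parser', 'statistical_analysis']
```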
And here is where it gets interesting for me. The discover_workflow method is basically what SkillNet's relation graph enables at scale. You start with one skill, and the graph tells you what else you need. I keep thinking about how this maps to my own agent orchestration. Right now I manually compose capabilities. This kind of registry would let the agent figure out its own toolchain.
For the evaluation piece, I sketched something that mirrors SkillNet's five dimensions:
```python
class EvalDimension(Enum):
    SAFETY = "safety"
    COMPLETENESS = "completeness"
    EXECUTABILITY = "executability"
    MAINTAINABILITY = "maintainability"
    COST_AWARENESS = "cost_awareness"


async def evaluate_skill(
    skill: SkillSpec,
    evaluator_llm: Any,
    sandbox: Any,
) -> dict[str, float]:
    scores = {}

    # Safety: check for prompt injection, dangerous operations
    safety_result = await evaluator_llm.assess(
        prompt=f"Evaluate this skill for safety risks:\n{skill.description}",
        rubric="Score 1-5: prompt injection resistance, no destructive ops",
    )
    scores[EvalDimension.SAFETY.value] = safety_result.score

    # Executability: actually run it in a sandbox
    try:
        test_input = await evaluator_llm.generate_test_input(skill.input_schema)
        result = await sandbox.execute(skill.entrypoint, test_input, timeout=30)
        scores[EvalDimension.EXECUTABILITY.value] = 5.0 if result.success else 2.0
    except Exception:
        scores[EvalDimension.EXECUTABILITY.value] = 1.0

    # ... completeness, maintainability, cost_awareness follow similar patterns
    return scores
```
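One piece this sketch leaves out is how five per-dimension scores become an accept/reject decision at ingestion time. A weighted aggregate with a hard floor on safety is one plausible policy; the weights and thresholds here are my guesses, not anything from the paper:

```python
# Hypothetical gating policy: weighted mean across the five dimensions,
# with a hard minimum on safety. Weights and thresholds are my own
# guesses, not SkillNet's.
WEIGHTS = {
    "safety": 0.30,
    "completeness": 0.20,
    "executability": 0.25,
    "maintainability": 0.15,
    "cost_awareness": 0.10,
}

def accept_skill(scores: dict[str, float], min_safety: float = 3.0,
                 min_overall: float = 3.5) -> bool:
    if scores["safety"] < min_safety:  # safety failures are non-negotiable
        return False
    overall = sum(scores[d] * w for d, w in WEIGHTS.items())
    return overall >= min_overall

scores = {"safety": 4.0, "completeness": 4.0, "executability": 5.0,
          "maintainability": 3.0, "cost_awareness": 3.0}
print(accept_skill(scores))  # True
```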
These are rough sketches. Not production code. But writing them out helped me understand something: the primitives are not complicated. What is hard is curating the graph at scale and keeping evaluations honest. And that is where SkillNet's contribution of 150,000+ evaluated skills becomes meaningful. Anyone can build a registry. Not many can fill it.
The benchmark numbers
SkillNet was evaluated across three agent environments with three LLM backbones. Here is what caught my eye.
ALFWorld (embodied household tasks): Gemini 2.5 Pro hit 91.43 average reward with SkillNet skills. DeepSeek V3.2 reached 80.60, o4 Mini scored 68.57.
WebShop (online shopping): Gemini 2.5 Pro at 53.02, DeepSeek V3.2 at 46.18, o4 Mini at 36.21.
ScienceWorld (virtual lab experiments): The strongest results. Gemini 2.5 Pro at 86.26, DeepSeek V3.2 at 81.31, o4 Mini at 71.06.
Execution steps dropped from 18.94 to 12.35 at the extremes, which works out to roughly a 35% cut in the best case against the claimed 30% average. That reduction in steps matters more to me than the reward improvement, because fewer steps means lower latency, lower cost, and fewer chances for the agent to go off the rails.
What I find genuinely interesting is that the gains hold across seen and unseen task splits. The skills generalize rather than just memorizing solutions. If the relation graph were just a lookup table, you would expect the unseen tasks to show much weaker results. They do not. That suggests the ontology and composition logic are doing real work.
I am less sure about how these numbers translate to real-world agent systems that are messier than ALFWorld or WebShop. These are structured environments with well-defined action spaces. My agents operate in open-ended contexts where the action space is basically "anything you can do with a computer." Whether SkillNet's approach scales to that is an open question that the paper does not address.
The one-person lab thing
The paper makes a passing reference to supporting "one-person company/lab" paradigms, and this is the part that I keep coming back to.
The idea is that an individual with a well-curated skill repository could achieve what previously required a team. And this maps directly to what I have been building toward with my own agent orchestration work on the Life Command Center. The economics work out. If a skill repository lets you compose pre-evaluated, dependency-resolved agent workflows, the marginal cost of adding a new capability shrinks toward nothing. Instead of building a "research agent" from scratch, you compose existing skills: paper_search, abstract_extraction, methodology_analysis, synthesis_report. Each one has been evaluated. The composition graph tells you which skills to include and in what order.
SkillNet's integration with OpenClaw (a personal AI agent framework) does exactly this: skills are lazy-loaded on demand, quality-audited automatically, and folded into the agent's growing repertoire.
Here is where I am going with this in my own thinking: we are probably moving from "prompt engineering" to "skill engineering." The competitive advantage shifts from who writes the best prompts to who curates the best skill libraries. And unlike prompts, skills are composable, testable, and versionable. That feels like a meaningful shift.
What I think is missing
A few things the paper does not address that I think matter a lot in practice.
Skills that learn from failure. SkillNet skills are static artifacts. Once created and evaluated, they do not learn from their own execution. If a web_search skill fails repeatedly on a specific class of queries, there is no feedback loop to improve it. The evaluation happens once, at ingestion time. A production system would need skills that adapt, or at least a monitoring layer that flags degrading skills. I have seen this in my own agents: a capability that works great for a month and then starts failing because the underlying API changed or the model drifted. Static skills will rot.
Versioning. The paper does not discuss how skills evolve over time. What happens when an API a skill depends on changes? When a newer approach to the same task emerges? Package managers solved this with semantic versioning and lockfiles. SkillNet does not appear to have an equivalent mechanism, and I genuinely do not know how you would retrofit one onto an LLM-based evaluation system.
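For what it is worth, here is the shape of the mechanism I would want, borrowed straight from package managers: pin the exact skill versions that were evaluated together at composition time, so a workflow keeps working even as its skills evolve. The lockfile format below is entirely my speculation, not anything SkillNet defines:

```python
import json

# Speculative lockfile for a composed workflow, npm-lockfile style:
# record the exact skill versions that passed evaluation together.
# The format is my invention, not SkillNet's.
def write_lockfile(workflow: list[tuple[str, str]]) -> str:
    lock = {
        "lockfile_version": 1,
        "skills": {name: {"version": version} for name, version in workflow},
    }
    return json.dumps(lock, indent=2, sort_keys=True)

print(write_lockfile([("csv_parser", "2.1.0"), ("data_cleaning", "0.9.3")]))
```

The hard part, and the reason I do not know how to retrofit this, is that a version bump in one skill would in principle invalidate the cached LLM evaluations of every workflow that pins it.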
The cross-framework problem. SkillNet skills are tied to a specific execution model. There is no standard interface that lets a SkillNet skill run in LangChain, CrewAI, or any other framework. The "npm for agents" vision requires something like a universal skill interface, and the ecosystem is nowhere near agreeing on one. This might be the hardest problem here, and I am not sure the paper treats it seriously enough.
Cost modeling at the workflow level. Individual skills are evaluated for cost-awareness, which is good. But composed workflows can have non-linear cost characteristics. A workflow that chains five individually cheap skills might hit rate limits, trigger expensive model calls at composition boundaries, or produce intermediate outputs that balloon context windows. I have seen this in my own agent workflows: each step seems cheap, but the aggregate cost surprises me. The paper does not address this.
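The context-ballooning effect is easy to illustrate. If each step appends its output to the shared context and every subsequent call pays for the full context, cumulative input tokens grow roughly quadratically with chain length even when each skill's own output is constant. Toy numbers, obviously:

```python
# Toy model of context growth in a chained workflow: each step's output
# joins the context, and every call pays for the whole context so far.
# All numbers are illustrative only.
def cumulative_input_tokens(steps: int, base_prompt: int = 500,
                            output_per_step: int = 800) -> int:
    total, context = 0, base_prompt
    for _ in range(steps):
        total += context            # this call reads everything so far
        context += output_per_step  # and its output joins the context
    return total

print(cumulative_input_tokens(1))  # 500
print(cumulative_input_tokens(5))  # 10500
```

So a five-step chain costs 21x the first call's input, not 5x, which is exactly the kind of non-linearity that per-skill cost scores cannot see.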
Where my head is at
I keep coming back to one thought. The AI agent community has spent enormous energy on making agents smarter: better reasoning, better planning, better tool use. What this paper argues, at least implicitly, is that we might be underinvesting in the infrastructure layer. Making agents smarter does not help much if every team has to rebuild the same capabilities from scratch with no way to share, compose, or evaluate them.
npm did not make JavaScript better as a language. What it did was make the JavaScript ecosystem composable, and that composability is what let a single developer ship things that previously needed a team. SkillNet is trying to do something similar for agent capabilities. Not making any individual skill better, but making the ecosystem of skills navigable and composable.
I am not betting that SkillNet specifically becomes the standard. But I am betting that the primitives it proposes (typed relations, multi-dimensional evaluation, compositional dependency resolution) will become expected infrastructure for agent development within the next year or two. The question I keep asking myself is not "how smart can we make agents?" but "how composable can we make what agents know how to do?"
I want to dig into the CLI toolkit next and see how the actual implementation compares to my sketches. And I want to think more about how this connects to the Context Platform, because if skills are composable packages and context is the terrain agents navigate, then the intersection of SkillNet-style composition with structured context retrieval is where things get really interesting for the one-person lab model.
More on that as I figure it out.