Allow even-parentless workflow spans to always be created #817

Merged
merged 3 commits into from
Apr 10, 2025
39 changes: 27 additions & 12 deletions temporalio/contrib/opentelemetry.py
@@ -74,12 +74,22 @@ class should return the workflow interceptor subclass from
     def __init__(
         self,
         tracer: Optional[opentelemetry.trace.Tracer] = None,
+        *,
+        always_create_workflow_spans: bool = False,
     ) -> None:
         """Initialize a OpenTelemetry tracing interceptor.

         Args:
             tracer: The tracer to use. Defaults to
                 :py:func:`opentelemetry.trace.get_tracer`.
+            always_create_workflow_spans: When false, the default, spans are
+                only created in workflows when an overarching span from the
+                client is present. In cases of starting a workflow elsewhere,
+                e.g. CLI or schedules, a client-created span is not present and
+                workflow spans will not be created. Setting this to true will
+                create spans in workflows no matter what, but there is a risk of
+                them being orphans since they may not have a parent span after
+                replaying.
Contributor

@dandavison dandavison Apr 9, 2025

I'm not swapped into this work.

  1. Why don't we default to always creating the parent-less spans? Wouldn't that be more useful to users than dropping them?
  2. The docstring here uses the term "orphan" but couldn't they equally be viewed as roots, originating in the workflow?
  3. [Just a question, not blocking this PR] Could it make sense to allow tracing to be enabled in the CLI (and maybe even Schedule starter one day) when starting workflows?

Member Author

@cretz cretz Apr 9, 2025

> Why don't we default to always creating the parent-less spans? Wouldn't that be more useful to users than dropping them?

Why we don't now - because it was chosen not to originally and we can't just change on people. Why we didn't originally - because if you create spans without a parent you have orphans. So if it was cached (i.e. never replayed), it'd just be under RunWorkflow which is created on non-replay start, but when it is replayed, everything after has no parent so it is on its own.

> The docstring here uses the term "orphan" but couldn't they equally be viewed as roots, originating in the workflow?

It means spans like StartActivity may or may not have a parent, depending on whether the workflow is running in a different instance than the one that first created the RunWorkflow span. In my experience, people do not expect spans from inside a workflow to lack a parent.

> Could it make sense to allow tracing to be enabled in the CLI (and maybe even Schedule starter one day) when starting workflows?

Yes it can, though OTel usually expects people to programmatically configure tracers, not outside of code. But CLI could definitely accept everything it needs to build https://pkg.go.dev/go.temporal.io/sdk/contrib/opentelemetry#NewTracingInterceptor (basically it'd be whatever was required to build a Go tracer).

To clarify what's happening here: client-side start workflow creates StartWorkflow, then the first non-replay start creates RunWorkflow and sets that on the context (if there's a StartWorkflow parent), then execute activity creates StartActivity (implicitly parented to RunWorkflow if it's in this instance, StartWorkflow otherwise). So StartWorkflow is the only stable span available. There has been talk of temporalio/features#394 to help the situation where a span was not created by the starter, but in the meantime Python (unlike some other SDKs) chose not to potentially create orphans by default. This option allows orphans to happen. I hope that's clear.
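
For illustration, the parenting rules in the comment above can be sketched as a toy model. This is not SDK code: `activity_span_parent` and its two parameters are hypothetical names, and the model only covers the StartWorkflow, RunWorkflow and StartActivity spans named in the comment, under the assumption that spans are always created.

```python
from typing import Optional


def activity_span_parent(
    client_span_present: bool, run_workflow_on_context: bool
) -> Optional[str]:
    """Return the name of the span a StartActivity span would hang under.

    Toy model only (hypothetical helper, not part of temporalio).
    """
    if run_workflow_on_context:
        # Same cached instance that created RunWorkflow: it is the implicit parent.
        return "RunWorkflow"
    if client_span_present:
        # After replay elsewhere, only the propagated StartWorkflow span is left.
        return "StartWorkflow"
    # No client span and RunWorkflow is gone: the span has no parent.
    return None


# Enumerate the four combinations described in the thread.
for client_span_present in (True, False):
    for run_workflow_on_context in (True, False):
        parent = activity_span_parent(client_span_present, run_workflow_on_context)
        print(client_span_present, run_workflow_on_context, parent)
```

The last combination (no client span, RunWorkflow no longer on the context) is the case the thread calls an orphan.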

Contributor

Thanks, yes that's very helpful.

> > Why don't we default to always creating the parent-less spans? Wouldn't that be more useful to users than dropping them?
>
> Why we don't now - because it was chosen not to originally and we can't just change on people.

Would that really be a (bad) breaking change? Wouldn't it just mean some new traces show up in their observability platform that didn't before?

Member Author

> Would that really be a (bad) breaking change?

Yes, I think it'd be a bad breaking change. I also think the default that exists is valuable even if we were ok with breaking changes. Orphaned spans not under a parent can cause those looking at traces for a workflow to not see a span.

> Wouldn't it just mean some new traces show up in their observability platform that didn't before?

Yes, which can clutter a tracing platform. Today people can trust that they're not just going to have some StartActivity top-level span flood the top-level of their Jaeger list.

"""
         self.tracer = tracer or opentelemetry.trace.get_tracer(__name__)
         # To customize any of this, users must subclass. We intentionally don't
@@ -90,6 +100,7 @@ def __init__(
         self.text_map_propagator: opentelemetry.propagators.textmap.TextMapPropagator = default_text_map_propagator
         # TODO(cretz): Should I be using the configured one at the client and activity level?
         self.payload_converter = temporalio.converter.PayloadConverter.default
+        self._always_create_workflow_spans = always_create_workflow_spans

     def intercept_client(
         self, next: temporalio.client.OutboundInterceptor
@@ -165,10 +176,15 @@ def _start_as_current_span(

     def _completed_workflow_span(
         self, params: _CompletedWorkflowSpanParams
-    ) -> _CarrierDict:
+    ) -> Optional[_CarrierDict]:
         # Carrier to context, start span, set span as current on context,
         # context back to carrier

+        # If the parent is missing and user hasn't said to always create, do not
+        # create
+        if params.parent_missing and not self._always_create_workflow_spans:
+            return None
Contributor

This line looks like a breaking change, am I getting that wrong?

Member Author

@cretz cretz Apr 9, 2025

This logic was inside the workflow's form of _completed_span before as:

        # If there is no span on the context, we do not create a span
        if opentelemetry.trace.get_current_span() is opentelemetry.trace.INVALID_SPAN:
            return None

but now that I have to check a parameter from outside the sandbox, I moved the logic to the outside-of-sandbox part instead of the inside-of-sandbox part.
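
A minimal sketch of the gate described above, using a simplified stand-in for the interceptor. `completed_workflow_span`, `merge_into_headers` and the placeholder `traceparent` value are hypothetical; only the names `parent_missing` and `always_create_workflow_spans` come from the diff.

```python
from typing import Dict, Optional

CarrierDict = Dict[str, str]


def completed_workflow_span(
    carrier: CarrierDict,
    parent_missing: bool,
    always_create_workflow_spans: bool = False,
) -> Optional[CarrierDict]:
    """Stand-in for the outside-of-sandbox span creation (hypothetical)."""
    # Parent missing and the user has not opted in: create nothing.
    if parent_missing and not always_create_workflow_spans:
        return None
    # Pretend a span was created and written back to a new carrier.
    updated = dict(carrier)
    updated["traceparent"] = "placeholder-span-context"
    return updated


def merge_into_headers(
    headers: Dict[str, str], updated_carrier: Optional[CarrierDict]
) -> Dict[str, str]:
    """Only touch outbound headers when a span was actually created."""
    if not updated_carrier:
        return headers
    return {**headers, **updated_carrier}


# Default behaviour: no parent means no span and untouched headers.
skipped = completed_workflow_span({}, parent_missing=True)
# Opt-in behaviour: a span is created even without a parent.
created = completed_workflow_span(
    {}, parent_missing=True, always_create_workflow_spans=True
)
```

Because the function can now return `None`, the caller has to guard the header update, which is what the `if add_to_outbound and updated_context_carrier` change in the diff does.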


         # Extract the context
         context = self.text_map_propagator.extract(params.context)
         # Create link if there is a span present
@@ -286,7 +302,7 @@ class _InputWithHeaders(Protocol):

 class _WorkflowExternFunctions(TypedDict):
     __temporal_opentelemetry_completed_span: Callable[
-        [_CompletedWorkflowSpanParams], _CarrierDict
+        [_CompletedWorkflowSpanParams], Optional[_CarrierDict]
     ]


@@ -299,6 +315,7 @@ class _CompletedWorkflowSpanParams:
     link_context: Optional[_CarrierDict]
     exception: Optional[Exception]
     kind: opentelemetry.trace.SpanKind
+    parent_missing: bool


 _interceptor_context_key = opentelemetry.context.create_key(
@@ -529,17 +546,13 @@ def _completed_span(
         exception: Optional[Exception] = None,
         kind: opentelemetry.trace.SpanKind = opentelemetry.trace.SpanKind.INTERNAL,
     ) -> None:
-        # If there is no span on the context, we do not create a span
-        if opentelemetry.trace.get_current_span() is opentelemetry.trace.INVALID_SPAN:
-            return None
-
         # If we are replaying and they don't want a span on replay, no span
         if temporalio.workflow.unsafe.is_replaying() and not new_span_even_on_replay:
             return None

         # Create the span. First serialize current context to carrier.
-        context_carrier: _CarrierDict = {}
-        self.text_map_propagator.inject(context_carrier)
+        new_context_carrier: _CarrierDict = {}
+        self.text_map_propagator.inject(new_context_carrier)
         # Invoke
         info = temporalio.workflow.info()
         attributes: Dict[str, opentelemetry.util.types.AttributeValue] = {
@@ -548,25 +561,27 @@ def _completed_span(
         }
         if additional_attributes:
             attributes.update(additional_attributes)
-        context_carrier = self._extern_functions[
+        updated_context_carrier = self._extern_functions[
             "__temporal_opentelemetry_completed_span"
         ](
             _CompletedWorkflowSpanParams(
-                context=context_carrier,
+                context=new_context_carrier,
                 name=span_name,
                 # Always set span attributes as workflow ID and run ID
                 attributes=attributes,
                 time_ns=temporalio.workflow.time_ns(),
                 link_context=link_context_carrier,
                 exception=exception,
                 kind=kind,
+                parent_missing=opentelemetry.trace.get_current_span()
+                is opentelemetry.trace.INVALID_SPAN,
Contributor

nit: I'm not familiar enough with python style, should the 2nd line here be indented to indicate these 2 lines are actually 1?

Member Author

This is what the auto formatter did

Contributor

We should place parens around expressions in this situation

Comment on lines +576 to +577
Contributor

Suggested change
-                parent_missing=opentelemetry.trace.get_current_span()
-                is opentelemetry.trace.INVALID_SPAN,
+                parent_missing=(
+                    opentelemetry.trace.get_current_span()
+                    is opentelemetry.trace.INVALID_SPAN
+                ),
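
As a side check on the suggestion above, the parentheses appear to be purely stylistic. The sketch below uses only the standard-library `ast` module; `f`, `get_current_span` and `INVALID_SPAN` are placeholder names inside the parsed strings, not the real call site.

```python
import ast

# The keyword argument as the auto formatter left it (no extra parentheses).
unparenthesized = "f(parent_missing=get_current_span()\n    is INVALID_SPAN)"

# The same keyword argument with the suggested parentheses.
parenthesized = (
    "f(parent_missing=(\n"
    "    get_current_span()\n"
    "    is INVALID_SPAN\n"
    "))"
)

# ast.dump omits line/column attributes by default, so only structure is compared.
same_tree = ast.dump(ast.parse(unparenthesized)) == ast.dump(ast.parse(parenthesized))
print(same_tree)
```

Both spellings produce a single `Compare` node with an `Is` operator as the keyword value, so the change affects readability only.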

             )
         )

         # Add to outbound if needed
-        if add_to_outbound:
+        if add_to_outbound and updated_context_carrier:
             add_to_outbound.headers = self._context_carrier_to_headers(
-                context_carrier, add_to_outbound.headers
+                updated_context_carrier, add_to_outbound.headers
             )

     def _set_on_context(
50 changes: 50 additions & 0 deletions tests/contrib/test_opentelemetry.py
@@ -332,6 +332,56 @@ def dump_spans(
     return ret


+@workflow.defn
+class SimpleWorkflow:
+    @workflow.run
+    async def run(self) -> str:
+        return "done"
+
+
+async def test_opentelemetry_always_create_workflow_spans(client: Client):
+    # Create a tracer that has an in-memory exporter
+    exporter = InMemorySpanExporter()
+    provider = TracerProvider()
+    provider.add_span_processor(SimpleSpanProcessor(exporter))
+    tracer = get_tracer(__name__, tracer_provider=provider)
+
+    # Create a worker with an interceptor without always create
+    async with Worker(
+        client,
+        task_queue=f"task_queue_{uuid.uuid4()}",
+        workflows=[SimpleWorkflow],
+        interceptors=[TracingInterceptor(tracer)],
+    ) as worker:
+        assert "done" == await client.execute_workflow(
+            SimpleWorkflow.run,
+            id=f"workflow_{uuid.uuid4()}",
+            task_queue=worker.task_queue,
+        )
+    # Confirm the spans are not there
+    spans = exporter.get_finished_spans()
+    logging.debug("Spans:\n%s", "\n".join(dump_spans(spans, with_attributes=False)))
+    assert len(spans) == 0
+
+    # Now create a worker with an interceptor with always create
+    async with Worker(
+        client,
+        task_queue=f"task_queue_{uuid.uuid4()}",
+        workflows=[SimpleWorkflow],
+        interceptors=[TracingInterceptor(tracer, always_create_workflow_spans=True)],
+    ) as worker:
+        assert "done" == await client.execute_workflow(
+            SimpleWorkflow.run,
+            id=f"workflow_{uuid.uuid4()}",
+            task_queue=worker.task_queue,
+        )
+    # Confirm the spans are not there
Contributor

Suggested change
-    # Confirm the spans are not there
+    # Confirm the spans are there

+    spans = exporter.get_finished_spans()
+    logging.debug("Spans:\n%s", "\n".join(dump_spans(spans, with_attributes=False)))
+    assert len(spans) > 0
+    assert spans[0].name == "RunWorkflow:SimpleWorkflow"
+
+
 # TODO(cretz): Additional tests to write
 # * query without interceptor (no headers)
 # * workflow without interceptor (no headers) but query with interceptor (headers)