@Zaba505 Zaba505 commented Nov 15, 2018

This extends PR #245 by adding context.Context support, as discussed in Issue #240. This contribution comes after discussion and example implementations from both myself and @vintik100.

Zaba505 and others added 10 commits October 24, 2018 19:25
See master branch commits for reason
See master branch for reason
Dropped support for Go 1.8
Added Go 1.11 support
Implementing my comment from PR gocolly#242
Took the best components from @vintik100's implementation and my own and merged them into a single implementation.
@Zaba505 Zaba505 mentioned this pull request Nov 15, 2018
Used best practices for context.Context args and added documentation to new exported components
Updated the `antchfx/{htmlquery,xmlquery}` deps to take advantage of the new `FindEachWithBreak` API in both. This allows breaking out of the `Find` loop if the context is cancelled.

fooofei commented Dec 9, 2019

Here is a smaller example of using context.Context with colly that does not require changing anything in colly itself.

import (
	"context"
	"net/http"

	"github.com/gocolly/colly"
)

// contextTransport wraps an http.Transport and attaches a context.Context
// to every request so that in-flight requests can be cancelled.
type contextTransport struct {
	ctx   context.Context
	trans *http.Transport
}

func (t *contextTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	req = req.WithContext(t.ctx)
	return t.trans.RoundTrip(req)
}

func collectorWithContext(c *colly.Collector, ctx context.Context) {
	// Abort requests in the OnRequest callback, before they are handed
	// to the HTTP client.
	c.OnRequest(func(req *colly.Request) {
		select {
		case <-ctx.Done():
			req.Abort()
		default:
		}
	})

	// Use a custom transport to cancel requests that are already pending in
	// the HTTP client and can no longer be stopped in OnRequest.
	trans := &contextTransport{
		ctx:   ctx,
		trans: &http.Transport{},
	}
	c.WithTransport(trans)
}

When the context.Context is cancelled, the channel returned by ctx.Done() is closed and all pending requests are cancelled immediately.

Collaborator

WGH- commented Oct 7, 2020

Hey! Any reason why this became stuck?

Is it because this PR changes lots of function signatures to have context argument, which is a pretty radical change?

If so, what about making it much less drastic by only adding Collector.WithContext method, which will be used for the http.Request? This way API compatibility would be kept, and it still would be possible to cancel HTTP requests in a clean way.

And since the common way to use colly is to have OnHTML etc. handlers be closures, you can simply use your own context from the closure. E.g.

func crawl(ctx context.Context, url string) {
	c := colly.NewCollector(colly.WithContext(ctx))
	c.OnResponse(func(res *colly.Response) {
		storeIntoSomeDatabase(ctx, res)
	})
	// skip
}

Member

asciimoo commented Oct 8, 2020

> If so, what about making it much less drastic by only adding Collector.WithContext

I like this approach a lot! @WGH- could you work on this?

Collaborator

WGH- commented Oct 8, 2020

@asciimoo yes

FWIW, I said it in another issue, but the solution outlined by @fooofei is not perfect since it loses the original request's context, which might already have a deadline set. See net/http/client.go, where setRequestCancel(req, rt, deadline) is called just before RoundTrip.

Member

asciimoo commented Oct 9, 2020

> FWIW, I said it in another issue, but the solution outlined by @fooofei is not perfect since it loses the original request's context, which might already have a deadline set

I think it isn't a deal breaker. We can document this behavior or perhaps we can throw a warning if the user sets a custom timeout and also uses custom ctx. What do you think?

Collaborator

WGH- commented Oct 9, 2020

> > FWIW, I said it in another issue, but the solution outlined by @fooofei is not perfect since it loses the original request's context, which might already have a deadline set
>
> I think it isn't a deal breaker. We can document this behavior or perhaps we can throw a warning if the user sets a custom timeout and also uses custom ctx. What do you think?

What I said earlier applies only to the hack with setting context inside RoundTrip. Setting context before passing the http.Request to Client.Do does not have this problem.
